[
https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950953#comment-17950953
]
Stefan Miklosovic edited comment on CASSANDRA-19776 at 5/12/25 12:53 PM:
-------------------------------------------------------------------------
I was looking into TWCS: its getNextBackgroundTasks calls
getNextBackgroundSSTables, which calls getFullyExpiredSSTables and then adds
the expired sstables to "compactionCandidates" when that set is not empty. It
is done similarly for UCS. So it is indeed true that UCS / TWCS goes into
compaction with expired sstables as well.
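To make that flow concrete, here is a minimal sketch of the candidate selection as I read it; the SSTable type and the pickByStrategy helper below are stand-ins of mine, not the actual TWCS / UCS code:
{code}
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: SSTable and pickByStrategy are hypothetical stand-ins.
public class CandidateSelectionSketch
{
    static class SSTable {}

    // stands in for getFullyExpiredSSTables: the real code checks max deletion
    // times against overlapping sstables and gc_grace
    static Set<SSTable> getFullyExpiredSSTables(Set<SSTable> live)
    {
        return new HashSet<>();
    }

    static Set<SSTable> getNextBackgroundSSTables(Set<SSTable> live)
    {
        Set<SSTable> expired = getFullyExpiredSSTables(live);
        Set<SSTable> candidates = pickByStrategy(live); // TWCS/UCS bucketing would go here
        // the point of interest: expired sstables are added to the candidates,
        // so the resulting compaction also holds references to them
        if (!expired.isEmpty())
            candidates.addAll(expired);
        return candidates;
    }

    static Set<SSTable> pickByStrategy(Set<SSTable> live)
    {
        return new HashSet<>(live); // placeholder for the real strategy logic
    }
}
{code}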
Based on what [~blambov] wrote, we should investigate what, if anything, is
releasing references before the sstable is compacted away. SSTableReader has
two implementations of Tidy, which I believe are the ones responsible for the
eventual removal of an SSTable from disk.
The actual removal of an SSTable is done via GlobalTidy#tidy(), which runs the
so-called "obsoletion" (a Runnable). The obsoletion is set in
SSTableReader#markObsolete, and that method is called only from
Helpers#markObsolete, which iterates over the list of all LogTransaction.Obsoletions.
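Roughly sketched, the wiring looks like this (a simplification of mine, only the shape matches the real GlobalTidy):
{code}
// Simplified sketch of the obsoletion wiring; not the real GlobalTidy class.
class GlobalTidySketch
{
    private volatile Runnable obsoletion; // null until markObsolete is called

    // corresponds to SSTableReader#markObsolete: the log transaction hands us
    // the Runnable that deletes the sstable's on-disk components
    void markObsolete(Runnable obsoletion)
    {
        this.obsoletion = obsoletion;
    }

    // corresponds to GlobalTidy#tidy(): runs when the last reference is released
    void tidy()
    {
        if (obsoletion != null)
            obsoletion.run(); // the actual removal of components happens here
    }
}
{code}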
GlobalTidy#tidy() is called only when GlobalState#release decrements its
counts to -1. GlobalState#release is called in two places:
1) Ref#ensureRelease
2) Ref#release
Ref#release is particularly interesting: it also calls release on the "state"
in Ref. That is also invoked via Ref#reapOneReference(), a method run by EXEC,
an executor looping infinitely over referenceQueue, the queue which the GC
itself populates when an object is determined to be garbage collectible.
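The reaping itself is plain java.lang.ref machinery; here is a self-contained sketch of that pattern (not the Cassandra code) showing how the GC, the queue and the reaping loop interact:
{code}
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class ReapSketch
{
    static final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();

    public static void main(String[] args) throws InterruptedException
    {
        Object referent = new Object();
        // the GC enqueues this reference once the referent becomes unreachable
        PhantomReference<Object> ref = new PhantomReference<>(referent, referenceQueue);

        referent = null; // drop the last strong reference

        // analogous to Ref#reapOneReference() in the EXEC loop: block until the
        // GC hands over the reference, then run the release/cleanup for it
        Reference<?> reaped = null;
        while (reaped == null)
        {
            System.gc();                         // nudge the collector; enqueueing is not guaranteed promptly
            reaped = referenceQueue.remove(100); // poll with a timeout
        }
        System.out.println("reaped: " + (reaped == ref)); // release would run here
    }
}
{code}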
An idea I have, maybe totally wrong, is this: when we go to compact, we
include expired sstables as well; then somewhere in that logic we remove that
SSTableReader, or we stop referencing it, so it becomes eligible for GC, ends
up on that queue, is taken by EXEC and released. That invokes GlobalTidy,
which eventually starts removing components from disk.
So, when we call that metric method, we try to select and reference
SSTableReaders, which might include expired ones as well, but we cannot
reference an expired one there, because it was already released, probably
while the compaction was still in progress and had not fully stopped yet, at
which point it would eventually go through GC.
{code}
public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
{
    long failingSince = -1L;
    while (true)
    {
        ViewFragment view = select(filter);
        Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
{code}
then "Refs.tryRef" will do
{code}
public static <T extends RefCounted<T>> Refs<T> tryRef(Iterable<T> reference)
{
    HashMap<T, Ref<T>> refs = new HashMap<>();
    for (T rc : reference)
    {
        Ref<T> ref = rc.tryRef();
        if (ref == null)
        {
            release(refs.values());
            return null;
        }
        refs.put(rc, ref);
    }
    return new Refs<T>(refs);
}
{code}
Here we see that it iterates over all CANONICAL sstables and calls "Ref<T> ref =
rc.tryRef();" for each. When any of these returns null, it releases everything
it has referenced so far and returns null.
SSTableReader#tryRef is:
{code}
public Ref<SSTableReader> tryRef()
{
    return selfRef.tryRef();
}
{code}
but that will return null when its global state is released.
{code}
public Ref<T> tryRef()
{
    return state.globalState.ref() ? new Ref<>(referent, state.globalState) : null;
}
{code}
and ref() returns false only when the current count, counts.get(), is below zero:
{code}
boolean ref()
{
    while (true)
    {
        int cur = counts.get();
        if (cur < 0)
            return false;
        if (counts.compareAndSet(cur, cur + 1))
            return true;
    }
}
{code}
But if the count has reached -1, then the release method, which invokes the
global tidy (a.k.a. obsoletion, a.k.a. removal of components), should have
been called as well.
{code}
// release a single reference, and cleanup if no more are extant
Throwable release(Ref.State ref, Throwable accumulate)
{
    locallyExtant.remove(ref);
    if (-1 == counts.decrementAndGet())
    {
        globallyExtant.remove(this);
        try
        {
            if (tidy != null)
                tidy.tidy();
        }
        catch (Throwable t)
        {
            accumulate = merge(accumulate, t);
        }
    }
    return accumulate;
}
{code}
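Putting the convention together, a compact self-contained sketch of the lifecycle (my simplification, not the actual Ref / GlobalState code): counts == 0 means one live reference, tryRef CASes it upward and fails once the count is negative, and the decrement that lands on -1 is the one that tidies.
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained sketch of the counting convention seen in the excerpts above;
// a simplification of mine, not the actual Ref/GlobalState implementation.
class RefCountSketch
{
    private final AtomicInteger counts = new AtomicInteger(0); // 0 == one live reference
    private final Runnable tidy;

    RefCountSketch(Runnable tidy)
    {
        this.tidy = tidy;
    }

    boolean tryRef()
    {
        while (true)
        {
            int cur = counts.get();
            if (cur < 0)
                return false; // already released: this is the spin-loop failure case
            if (counts.compareAndSet(cur, cur + 1))
                return true;
        }
    }

    void release()
    {
        if (-1 == counts.decrementAndGet())
            tidy.run(); // last release: obsoletion / removal of components runs here
    }

    public static void main(String[] args)
    {
        RefCountSketch ref = new RefCountSketch(() -> System.out.println("tidied"));
        ref.tryRef();                     // counts: 0 -> 1
        ref.release();                    // counts: 1 -> 0
        ref.release();                    // counts: 0 -> -1, prints "tidied"
        System.out.println(ref.tryRef()); // false: a released ref stays released
    }
}
{code}
Once a reader's count has gone negative, every subsequent Refs.tryRef over a view containing it must fail, which matches the spinning seen on the metric path.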
> Spinning trying to capture readers
> ----------------------------------
>
> Key: CASSANDRA-19776
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Reporter: Cameron Zemek
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: extract.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> On a handful of clusters we are noticing spin locks occurring. I traced back
> all the calls to the EstimatedPartitionCount metric (e.g.
> org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount)
> Using the following patched function:
> {code:java}
>     public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
>     {
>         long failingSince = -1L;
>         boolean first = true;
>         while (true)
>         {
>             ViewFragment view = select(filter);
>             Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>             if (refs != null)
>                 return new RefViewFragment(view.sstables, view.memtables, refs);
>             if (failingSince <= 0)
>             {
>                 failingSince = System.nanoTime();
>             }
>             else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>             {
>                 List<SSTableReader> released = new ArrayList<>();
>                 for (SSTableReader reader : view.sstables)
>                     if (reader.selfRef().globalCount() == 0)
>                         released.add(reader);
>                 NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                                  "Spinning trying to capture readers {}, released: {}, ", view.sstables, released);
>                 if (first)
>                 {
>                     first = false;
>                     try {
>                         throw new RuntimeException("Spinning trying to capture readers");
>                     } catch (Exception e) {
>                         logger.warn("Spin lock stacktrace", e);
>                     }
>                 }
>                 failingSince = System.nanoTime();
>             }
>         }
>     }
> {code}
> Digging into this code I found it will fail if any of the sstables are in a
> released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin lock
> occurrences. Sometimes these spin locks last over 5 minutes. Across the worst
> cluster with this issue, I ran a log processing script: every time the
> 'Spinning trying to capture readers' message differed from the previous one,
> it would output whether the released tables were in a Compacting state. Every
> single occurrence has it spin locking with "released" listing an sstable that
> is compacting.
> In the extract.log example it is spin locking saying that nb-320533-big-Data.db
> has been released. But you can see that prior to the spinning this sstable is
> involved in a compaction. The compaction completes at 01:03:36 and the
> spinning stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the
> other 9 sstables involved in the compaction.
>