[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955907#comment-17955907 ]

Stefan Miklosovic commented on CASSANDRA-19776:
-----------------------------------------------

This evaluates to false (1), so it never gets to {{checkpoint}} in that {{if}}.

However, I see that it calls {{checkpoint}} from {{switchWriter}}. Sometimes it does not, and instead returns from {{switchWriter}} at its third {{if}}; but even then it will eventually call {{checkpoint}} in {{SSTableRewriter.doPrepare}}.

I have added logging (2) where appropriate and run the test; the resulting log (3) should give you more insight into what is going on.

(1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java#L161
(2) https://github.com/apache/cassandra/pull/4191/commits
(3) https://gist.githubusercontent.com/smiklosovic/5eeb977d0de342e4c41c9600b60267e1/raw/88c5d9b038f1501f850942bdbcb4e8e38f0cf8fe/gistfile1.txt
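
To make the path above easier to follow, here is a minimal, self-contained sketch of that control flow. It is a toy stand-in, not the actual {{SSTableRewriter}} code; the boolean parameters stand in for the real conditions:

{code:java}
// Toy model (NOT the real SSTableRewriter) of the checkpoint() call paths described above.
public class RewriterFlowSketch
{
    private int checkpoints = 0;

    // Stand-in for the if at (1): the condition is false in the failing run, so no checkpoint here.
    void maybeCheckpointEarly(boolean condition)
    {
        if (condition)
            checkpoint();
    }

    // Stand-in for switchWriter: its "third if" sometimes returns before checkpointing.
    void switchWriter(boolean earlyReturn)
    {
        if (earlyReturn)
            return;
        checkpoint();
    }

    // Stand-in for SSTableRewriter.doPrepare, which eventually calls checkpoint() regardless.
    void doPrepare()
    {
        checkpoint();
    }

    void checkpoint()
    {
        checkpoints++;
    }

    public static void main(String[] args)
    {
        RewriterFlowSketch sketch = new RewriterFlowSketch();
        sketch.maybeCheckpointEarly(false); // (1) evaluates to false -> no checkpoint here
        sketch.switchWriter(true);          // returns at the "third if" -> no checkpoint here either
        sketch.doPrepare();                 // checkpoint is finally reached here
        System.out.println("checkpoints reached: " + sketch.checkpoints); // prints 1
    }
}
{code}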



> Spinning trying to capture readers
> ----------------------------------
>
>                 Key: CASSANDRA-19776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Cameron Zemek
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: extract.log
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> On a handful of clusters we are noticing spin locks occurring. I traced back 
> all the calls to the EstimatedPartitionCount metric (e.g. 
> org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount; 
> a JMX sketch for reading this metric follows the code below), using the 
> following patched function:
> {code:java}
>     public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
>     {
>         long failingSince = -1L;
>         boolean first = true;
>         while (true)
>         {
>             ViewFragment view = select(filter);
>             Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>             if (refs != null)
>                 return new RefViewFragment(view.sstables, view.memtables, refs);
>             if (failingSince <= 0)
>             {
>                 failingSince = System.nanoTime();
>             }
>             else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>             {
>                 List<SSTableReader> released = new ArrayList<>();
>                 for (SSTableReader reader : view.sstables)
>                     if (reader.selfRef().globalCount() == 0)
>                         released.add(reader);
>                 NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                                  "Spinning trying to capture readers {}, released: {}, ", view.sstables, released);
>                 if (first)
>                 {
>                     first = false;
>                     try {
>                         throw new RuntimeException("Spinning trying to capture readers");
>                     } catch (Exception e) {
>                         logger.warn("Spin lock stacktrace", e);
>                     }
>                 }
>                 failingSince = System.nanoTime();
>             }
>         }
>     }
> {code}
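> For reference, this metric can be read over JMX with the standard 
> javax.management client API; a minimal sketch follows (host, port, keyspace and 
> table names are placeholders, not necessarily how we collect it):
> {code:java}
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class ReadEstimatedPartitionCount
> {
>     public static void main(String[] args) throws Exception
>     {
>         // 7199 is Cassandra's default JMX port; adjust host/port as needed.
>         JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>         try (JMXConnector connector = JMXConnectorFactory.connect(url))
>         {
>             MBeanServerConnection mbs = connector.getMBeanServerConnection();
>             // Table-level metric MBean; keyspace/scope below are the example names from this ticket.
>             ObjectName name = new ObjectName(
>                 "org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount");
>             // Gauge metrics expose their current reading through the "Value" attribute.
>             Object value = mbs.getAttribute(name, "Value");
>             System.out.println("EstimatedPartitionCount = " + value);
>         }
>     }
> }
> {code}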
> Digging into this code I found it will fail if any of the sstables are in a 
> released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin lock 
> occurrences. Sometimes these spin locks last over 5 minutes. Across the worst 
> cluster with this issue, I ran a log-processing script that, every time the 
> 'Spinning trying to capture readers' message differed from the previous one, 
> output whether the released tables were in a Compacting state. Every single 
> occurrence has it spin-locking with the released list naming an sstable that is 
> compacting.
> In the extract.log example it is spin-locking, saying that nb-320533-big-Data.db 
> has been released. But you can see that, prior to the spinning, that sstable is 
> involved in a compaction. The compaction completes at 01:03:36 and the spinning 
> stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the other 9 
> sstables involved in the compaction.
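> The failure mode can be modelled with a small self-contained sketch (a toy 
> stand-in, not the real Ref/Refs classes): once any reader's count has dropped 
> to zero it can never be re-referenced, so trying to reference the whole set 
> keeps failing, which is exactly what selectAndReference spins on.
> {code:java}
> import java.util.List;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Toy stand-in for SSTableReader.selfRef(): a counted reference that cannot be
> // re-acquired once its count has reached zero (i.e. the reader was released).
> class ToyRef
> {
>     final AtomicInteger count = new AtomicInteger(1);
>
>     boolean tryRef()
>     {
>         while (true)
>         {
>             int c = count.get();
>             if (c == 0)
>                 return false;                   // released: can never be re-referenced
>             if (count.compareAndSet(c, c + 1))
>                 return true;
>         }
>     }
>
>     void release()
>     {
>         count.decrementAndGet();
>     }
> }
>
> public class SpinSketch
> {
>     // Mirrors the shape of Refs.tryRef(view.sstables): fails if ANY reader is released.
>     // (The real Refs.tryRef also releases the refs it already acquired; omitted here.)
>     static boolean tryRefAll(List<ToyRef> readers)
>     {
>         for (ToyRef r : readers)
>             if (!r.tryRef())
>                 return false;                   // this is what makes selectAndReference loop
>         return true;
>     }
>
>     public static void main(String[] args)
>     {
>         ToyRef live = new ToyRef();
>         ToyRef releasedByCompaction = new ToyRef();
>         releasedByCompaction.release();         // count == 0, like the compacted sstable
>         System.out.println(tryRefAll(List.of(live, releasedByCompaction))); // prints false
>     }
> }
> {code}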
>  


