[ 
https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7704:
--------------------------------

    Attachment: 7704.txt

Attaching a patch that I think addresses this. There are a number of 
concurrency bugs here, and whilst we could fix them with more advanced 
lock-free techniques, there is no compelling reason for this class not to use 
synchronized everywhere, which would probably have avoided this problem in the 
first place. There is only one place where execution is not guaranteed to be 
prompt, and I have left it outside the synchronization. At the same time I have 
simplified the logic, fixed the cancelling of timeouts, and made the scheduled 
executor for timeouts globally shared (there's no good reason to spin up a new 
executor for each set of transfers).
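
As a rough illustration of the shared-executor and timeout-cancellation idea 
(not the code in 7704.txt), the sketch below keeps one static scheduled 
executor for all instances and guards scheduling and cancelling with 
synchronized. The class name TimeoutSketch, its methods, and the generation 
counter are all hypothetical and chosen only for this example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: every name here is illustrative, not taken from the patch.
final class TimeoutSketch {
    // One executor shared by all instances, instead of one per set of transfers.
    private static final ScheduledExecutorService TIMEOUTS =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "transfer-timeouts");
            t.setDaemon(true); // do not keep the JVM alive just for timeouts
            return t;
        });

    private ScheduledFuture<?> timeout; // guarded by this
    private long generation = 0;        // guarded by this; bumped on every cancel
    private boolean timedOut = false;   // guarded by this

    // Replaces any pending timeout with a new one.
    synchronized void scheduleTimeout(long delay, TimeUnit unit) {
        cancelTimeout();
        final long scheduledGeneration = generation;
        timeout = TIMEOUTS.schedule(() -> onTimeout(scheduledGeneration), delay, unit);
    }

    // Cancels the pending timeout, if any.
    synchronized void cancelTimeout() {
        generation++; // invalidates a task that has started but not yet taken the lock
        if (timeout != null) {
            timeout.cancel(false);
            timeout = null;
        }
    }

    private synchronized void onTimeout(long scheduledGeneration) {
        if (scheduledGeneration != generation)
            return; // this timeout was cancelled or replaced after it fired
        timeout = null;
        timedOut = true;
    }

    synchronized boolean timedOut() { return timedOut; }
    synchronized boolean hasPendingTimeout() { return timeout != null; }
}
```

The generation check is one way to make a cancel that races with a firing 
timeout harmless; the actual patch may handle this differently.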

In this particular instance the issue seems to have been a lack of atomicity 
between abort() and complete(); an ACK arrived at the same time as abort() was 
cancelling all transfers, causing a reference to be released twice. This could 
also occur with the timeouts, but since they occur only every 12 hours, the 
risk is low.
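
A minimal sketch of how making abort() and complete() mutually exclusive 
prevents the double release described above. This is an assumption-laden 
illustration, not the patched class: RefCounted, TransferTaskSketch, and all of 
their methods are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted resource held during a transfer.
final class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1);

    void release() {
        if (refs.decrementAndGet() < 0)
            throw new IllegalStateException("reference released twice");
    }

    int refs() { return refs.get(); }
}

// Hypothetical transfer task; the names are illustrative only.
final class TransferTaskSketch {
    private final Map<Integer, RefCounted> files = new HashMap<>(); // guarded by this
    private boolean aborted = false;                                // guarded by this

    synchronized void addFile(int sequenceNumber, RefCounted ref) {
        files.put(sequenceNumber, ref);
    }

    // Called when an ACK arrives for one file.
    synchronized void complete(int sequenceNumber) {
        // remove() hands the reference to exactly one caller; a late ACK
        // that arrives after abort() finds nothing and releases nothing.
        RefCounted ref = files.remove(sequenceNumber);
        if (ref != null)
            ref.release();
    }

    // Called when the whole set of transfers is cancelled.
    synchronized void abort() {
        if (aborted)
            return;
        aborted = true;
        for (RefCounted ref : files.values())
            ref.release();
        files.clear();
    }

    synchronized int pending() { return files.size(); }
}
```

Because both methods hold the same monitor, an ACK either removes its file 
before abort() iterates or finds the map already empty, so each reference is 
released once whichever call wins.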

> FileNotFoundException during STREAM-OUT triggers 100% CPU usage
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-7704
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Rick Branson
>         Attachments: 7704.txt, backtrace.txt
>
>
> See attached backtrace which was what triggered this. This stream failed and 
> then ~12 seconds later it emitted that exception. At that point, all CPUs 
> went to 100%. A thread dump shows all the ReadStage threads stuck inside 
> IntervalTree.searchInternal inside of CFS.markReferenced().



--
This message was sent by Atlassian JIRA
(v6.2#6252)
