[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedict updated CASSANDRA-7704:
--------------------------------
Attachment: 7704.txt
Attaching a patch that I think addresses this. There are a number of
concurrency bugs here, and whilst we could fix them with more advanced
lock-freedom, there is no compelling reason this class doesn't use synchronized
everywhere, which would probably have avoided this problem in the first place.
There is only one place where execution is not guaranteed to be prompt, and
I have left that out of the synchronization. At the same time I have simplified
the logic, fixed the cancelling of timeouts, and made the scheduled executor
for timeouts globally shared (there's no good reason to spin up a new executor
for each set of transfers).
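
To illustrate the shared-executor point, here is a minimal sketch, not the code from 7704.txt. The class name TransferTimeouts and all of its methods are hypothetical; the only assumptions are that timeouts are keyed by a sequence number and that every method touching the pending map is synchronized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical class for illustration only; names do not come from the patch.
class TransferTimeouts {
    // One daemon-thread executor shared by every instance, rather than a new
    // executor per set of transfers.
    private static final ScheduledExecutorService SHARED_TIMEOUTS =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "transfer-timeouts");
            t.setDaemon(true);
            return t;
        });

    // Guarded by the instance monitor (all accessors are synchronized).
    private final Map<Integer, ScheduledFuture<?>> pending = new HashMap<>();

    synchronized void schedule(int sequence, Runnable onTimeout, long delay, TimeUnit unit) {
        ScheduledFuture<?> previous =
            pending.put(sequence, SHARED_TIMEOUTS.schedule(onTimeout, delay, unit));
        // Replacing an existing timeout cancels the old one so it cannot fire later.
        if (previous != null)
            previous.cancel(false);
    }

    // Returns true only if a timeout was still pending and has now been cancelled.
    synchronized boolean cancel(int sequence) {
        ScheduledFuture<?> future = pending.remove(sequence);
        return future != null && future.cancel(false);
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

In this sketch a timeout that fires does not remove its own entry from the map; a real implementation would have the timeout task do that under the same lock.
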
In this particular instance the issue seems to have been a lack of atomicity
between abort() and complete(); an ACK arrived at the same time as abort() was
cancelling all transfers, causing a reference to be released twice. This could
also occur with the timeouts, but since they fire only every 12 hours, the risk
is low.
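
The race described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patched class: RefCounted and Transfers are hypothetical names, and the reference count stands in for whatever resource the real transfer holds. Because complete() and abort() share one monitor, whichever runs first removes the entry from the map, so the other finds nothing left to release.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the resource held by an in-flight transfer.
class RefCounted {
    final AtomicInteger refs = new AtomicInteger(1);

    void release() {
        // A second release drives the count negative, which is the bug being modelled.
        if (refs.decrementAndGet() < 0)
            throw new IllegalStateException("reference released twice");
    }
}

// Hypothetical class for illustration only; names do not come from the patch.
class Transfers {
    // Guarded by the instance monitor.
    private final Map<Integer, RefCounted> inFlight = new HashMap<>();

    synchronized void add(int sequence, RefCounted ref) {
        inFlight.put(sequence, ref);
    }

    // Called when an ACK arrives for one transfer.
    synchronized void complete(int sequence) {
        RefCounted ref = inFlight.remove(sequence);
        if (ref != null)
            ref.release();
    }

    // Cancels every remaining transfer.
    synchronized void abort() {
        for (RefCounted ref : inFlight.values())
            ref.release();
        inFlight.clear();
    }
}
```

Without the synchronized keyword, an ACK handler could read the entry in complete() while abort() is iterating over the same map, and both would call release() on the same reference.
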
> FileNotFoundException during STREAM-OUT triggers 100% CPU usage
> ---------------------------------------------------------------
>
> Key: CASSANDRA-7704
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7704
> Project: Cassandra
> Issue Type: Bug
> Reporter: Rick Branson
> Attachments: 7704.txt, backtrace.txt
>
>
> See attached backtrace which was what triggered this. This stream failed and
> then ~12 seconds later it emitted that exception. At that point, all CPUs
> went to 100%. A thread dump shows all the ReadStage threads stuck inside
> IntervalTree.searchInternal inside of CFS.markReferenced().