sankalp kohli created CASSANDRA-8815:
----------------------------------------
Summary: Race in sstable ref counting during streaming failures
Key: CASSANDRA-8815
URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: sankalp kohli
We have seen a machine in production on which all read threads are blocked
(spinning) while trying to acquire the reference lock on sstables. Some stream
sessions are stuck doing the same.
In the heap dump, we could see that a live sstable which is part of the View
has a ref count of 0. This sstable is neither compacting nor part of any
failed compaction.
Looking through the code, we could see that if the ref count goes to zero while
the sstable is still part of the View, all reader threads will spin forever.
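To illustrate why readers can never make progress in that state, here is a
minimal sketch. It is a simplified model, not the actual Cassandra code: the
class and method names (RefCountSpinSketch, Table, attemptsUntilAcquired) are
hypothetical, and the retry loop is bounded only so that the example
terminates. The assumption is that acquiring a reference refuses to increment
a count that is already zero, and that a reader simply retries on failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of a ref-counted sstable; not Cassandra's classes.
class RefCountSpinSketch {
    static final class Table {
        final AtomicInteger references;

        Table(int initial) {
            references = new AtomicInteger(initial);
        }

        // Assumed behavior: a count of zero means "already released", so never revive it.
        boolean acquireReference() {
            while (true) {
                int n = references.get();
                if (n <= 0)
                    return false;
                if (references.compareAndSet(n, n + 1))
                    return true;
            }
        }
    }

    // Models a reader that retries until it gets a reference.
    // Bounded here so the sketch terminates; an unbounded loop would spin forever.
    static int attemptsUntilAcquired(Table table, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (table.acquireReference())
                return attempt;
        }
        return -1; // never acquired
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilAcquired(new Table(1), 1000)); // prints 1
        System.out.println(attemptsUntilAcquired(new Table(0), 1000)); // prints -1
    }
}
```

In this model, nothing ever raises the count above zero again, so every retry
fails as long as the sstable stays in the set the reader is iterating over.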
Looking further through the streaming code, we could see that if
StreamTransferTask.complete is called after closeSession has been called due to
an error in OutgoingMessageHandler, it will decrement the ref count of an
sstable twice.
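The double decrement can be sketched with a small model. This is an
illustration under assumptions, not the real StreamTransferTask: the class
names are hypothetical, an sstable is reduced to a bare AtomicInteger ref
count, and abort is assumed to release references while leaving the entries in
"files", which is what lets a later complete release the same reference again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the suspected bug; names are illustrative only.
class DoubleReleaseSketch {
    static final class TransferTask {
        // sequence number -> ref count of the sstable being streamed
        final Map<Integer, AtomicInteger> files = new ConcurrentHashMap<>();

        // Streaming takes its own reference on the sstable.
        void addFile(int sequenceNumber, AtomicInteger refs) {
            refs.incrementAndGet();
            files.put(sequenceNumber, refs);
        }

        // Removes the file and releases the streaming reference.
        void complete(int sequenceNumber) {
            AtomicInteger refs = files.remove(sequenceNumber);
            if (refs != null)
                refs.decrementAndGet();
        }

        // Assumed buggy shape: releases every reference but leaves the entries in place.
        void abort() {
            for (AtomicInteger refs : files.values())
                refs.decrementAndGet();
        }
    }

    static int refsAfterAbortThenComplete() {
        AtomicInteger refs = new AtomicInteger(1); // reference held by the View
        TransferTask task = new TransferTask();
        task.addFile(0, refs);  // 2: View + streaming
        task.abort();           // 1: streaming reference released
        task.complete(0);       // 0: released a second time, View's reference is gone
        return refs.get();
    }

    public static void main(String[] args) {
        System.out.println(refsAfterAbortThenComplete()); // prints 0
    }
}
```

The sstable is still live in the View, yet its count has reached zero, which
matches what we saw in the heap dump.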
This race can happen, and the exceptions in our logs show that closeSession was
triggered by OutgoingMessageHandler.
The fix for this is very simple, I think. In StreamTransferTask.abort, we can
remove a file from "files" before decrementing the ref count. This will avoid
this race.
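A rough sketch of the proposed fix, using a simplified model rather than the
real StreamTransferTask (class names are hypothetical, and an sstable is
reduced to an AtomicInteger ref count). The idea is that whichever of abort or
complete removes the entry from "files" first is the only one allowed to
decrement, so the streaming reference is released exactly once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the proposed fix; names are illustrative only.
class RemoveBeforeReleaseSketch {
    static final class TransferTask {
        final Map<Integer, AtomicInteger> files = new ConcurrentHashMap<>();

        void addFile(int sequenceNumber, AtomicInteger refs) {
            refs.incrementAndGet();
            files.put(sequenceNumber, refs);
        }

        void complete(int sequenceNumber) {
            AtomicInteger refs = files.remove(sequenceNumber);
            if (refs != null)
                refs.decrementAndGet();
        }

        // Fixed shape: remove each entry first; only the caller that wins the remove releases.
        void abort() {
            for (Integer sequenceNumber : files.keySet()) {
                AtomicInteger refs = files.remove(sequenceNumber);
                if (refs != null)
                    refs.decrementAndGet();
            }
        }
    }

    static int refsAfterAbortThenComplete() {
        AtomicInteger refs = new AtomicInteger(1); // reference held by the View
        TransferTask task = new TransferTask();
        task.addFile(0, refs);  // 2
        task.abort();           // 1: entry removed, then released
        task.complete(0);       // 1: nothing left to remove, no second release
        return refs.get();
    }

    static int refsAfterConcurrentAbortAndComplete() {
        AtomicInteger refs = new AtomicInteger(1);
        TransferTask task = new TransferTask();
        task.addFile(0, refs);
        Thread aborter = new Thread(task::abort);
        Thread completer = new Thread(() -> task.complete(0));
        aborter.start();
        completer.start();
        try {
            aborter.join();
            completer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return refs.get();
    }

    public static void main(String[] args) {
        System.out.println(refsAfterAbortThenComplete());          // prints 1
        System.out.println(refsAfterConcurrentAbortAndComplete()); // prints 1
    }
}
```

Because ConcurrentHashMap.remove is atomic, only one of the two callers gets a
non-null value back for a given sequence number, regardless of interleaving.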
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)