[ 
https://issues.apache.org/jira/browse/CASSANDRA-14554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677739#comment-16677739
 ] 

Stefania commented on CASSANDRA-14554:
--------------------------------------

We had a related issue where one of our customers ended up with a corrupt txn 
log file during streaming, with an ADD record following an ABORT record. We 
couldn't look at the logs as they weren't available any longer, since the 
customer only noticed the problem when the node would not restart 22 days 
later. However, in my opinion it is fairly clear that one thread aborted the 
streaming session whilst the receiving thread was adding a new sstable. So this 
seems to have the same root cause as reported in this ticket, which is that 
streaming is using the txn in a thread-unsafe way. In my opinion, the problem 
has existed since 3.0, but it becomes significantly more likely with the Netty 
streaming refactoring. Our customer was on a branch based on 3.11.

We took a very conservative approach with the fix, in that we didn't want to 
fully synchronize {{AbstractTransactional}} and {{LifecycleTransaction}} on 
released branches. We could, however, consider synchronizing these classes for 
4.0, or reworking streaming.

Here are the 3.11 changes; if there is interest in this approach, I can create 
patches for 3.0 and trunk as well:

[https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11]

We simply extracted a new interface, the 
[sstable tracker|https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11#diff-9d71c7ad9ad16368bd0429d3b34e2b21R15], 
which is also 
[implemented|https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11#diff-1a464da4a62ac4a734c725059cbc918bR144] 
by {{StreamReceiveTask}}, which synchronizes access to the txn just as it 
does for all its other accesses to the txn. Whilst it's not ideal to have an 
additional interface, the change should be quite safe for released branches.
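
To illustrate the idea only (this is not the patch itself), here is a minimal 
Java sketch. All names in it are hypothetical stand-ins: {{SSTableTracker}} for 
the extracted interface, {{Txn}} for a transaction that is not thread safe, and 
{{ReceiveTask}} for the receive task. Rejecting {{trackNew}} after an abort with 
an exception is an assumption of the sketch, not necessarily what the linked 
branch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are Cassandra's real classes.
class SynchronizedTrackerSketch {

    /** Narrow interface handed to the threads that write incoming sstables. */
    interface SSTableTracker {
        void trackNew(String sstable);
    }

    /** Stand-in for a transaction whose log is not thread safe. */
    static final class Txn {
        private final List<String> log = new ArrayList<>();

        void trackNew(String sstable) { log.add("ADD:" + sstable); }
        void abort() { log.add("ABORT"); }
        List<String> log() { return log; }
    }

    /** Stand-in for the receive task: every txn access holds the task's monitor. */
    static final class ReceiveTask implements SSTableTracker {
        private final Txn txn = new Txn();
        private boolean done; // guarded by "this"

        @Override
        public synchronized void trackNew(String sstable) {
            // Assumption of this sketch: late additions are rejected, not logged.
            if (done)
                throw new IllegalStateException("task finished, not tracking " + sstable);
            txn.trackNew(sstable);
        }

        public synchronized void abort() {
            if (done)
                return;
            done = true;
            txn.abort();
        }

        /** Returns a copy so callers never iterate the live log. */
        public synchronized List<String> log() { return new ArrayList<>(txn.log()); }
    }

    /** True if any ADD record appears after an ABORT record. */
    static boolean hasAddAfterAbort(List<String> log) {
        int abortAt = log.indexOf("ABORT");
        if (abortAt < 0)
            return false;
        for (String record : log.subList(abortAt + 1, log.size()))
            if (record.startsWith("ADD:"))
                return true;
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ReceiveTask task = new ReceiveTask();
        SSTableTracker tracker = task; // writers only see the narrow interface
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            final int id = i;
            writers[i] = new Thread(() -> {
                try {
                    for (int n = 0; n < 1000; n++)
                        tracker.trackNew("sstable-" + id + "-" + n);
                } catch (IllegalStateException e) {
                    // The session was aborted; this writer stops.
                }
            });
            writers[i].start();
        }
        task.abort(); // races with the writers, but under the same monitor
        for (Thread writer : writers)
            writer.join();
        System.out.println("ADD after ABORT: " + hasAddAfterAbort(task.log()));
    }
}
```

Because {{trackNew}} and {{abort}} share one monitor, the abort either happens 
before a given addition (which is then rejected) or after it, so the sketch's 
log can never contain an ADD record following an ABORT record.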

> LifecycleTransaction encounters ConcurrentModificationException when used in 
> multi-threaded context
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14554
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14554
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Dinesh Joshi
>            Assignee: Dinesh Joshi
>            Priority: Major
>
> When LifecycleTransaction is used in a multi-threaded context, we encounter 
> this exception -
> {quote}java.util.ConcurrentModificationException: null
>  at 
> java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
>  at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742)
>  at java.lang.Iterable.forEach(Iterable.java:74)
>  at 
> org.apache.cassandra.db.lifecycle.LogReplicaSet.maybeCreateReplica(LogReplicaSet.java:78)
>  at org.apache.cassandra.db.lifecycle.LogFile.makeRecord(LogFile.java:320)
>  at org.apache.cassandra.db.lifecycle.LogFile.add(LogFile.java:285)
>  at 
> org.apache.cassandra.db.lifecycle.LogTransaction.trackNew(LogTransaction.java:136)
>  at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.trackNew(LifecycleTransaction.java:529)
> {quote}
> During streaming we create a reference to a {{LifecycleTransaction}} and 
> share it between threads -
> [https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/db/streaming/CassandraStreamReader.java#L156]
> This is used in a multi-threaded context inside {{CassandraIncomingFile}} 
> which is an {{IncomingStreamMessage}}. This is being deserialized in parallel.
> {{LifecycleTransaction}} is not meant to be used in a multi-threaded context 
> and this leads to streaming failures due to object sharing. On trunk, this 
> object is shared across all threads that transfer sstables in parallel for 
> the given {{TableId}} in a {{StreamSession}}. There are two options to solve 
> this - make {{LifecycleTransaction}} and the associated objects thread safe, 
> or scope the transaction to a single {{CassandraIncomingFile}}. The consequence 
> of the latter option is that if we experience a streaming failure we may have 
> redundant SSTables on disk. This is ok as compaction should clean them up. A 
> third option is to synchronize access in the streaming infrastructure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
