[ https://issues.apache.org/jira/browse/CASSANDRA-14554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677739#comment-16677739 ]
Stefania commented on CASSANDRA-14554:
--------------------------------------

We had a related issue where one of our customers ended up with a corrupt txn log file during streaming, with an ADD record following an ABORT record. We couldn't look at the logs because they were no longer available: the customer only noticed the problem when the node would not restart 22 days later. However, in my opinion it is fairly clear that one thread aborted the streaming session whilst the receiving thread was adding a new sstable. So this appears to have the same root cause as reported in this ticket, which is that streaming uses the txn in a thread-unsafe way.

In my opinion, the problem has existed since 3.0. However, it becomes significantly more likely with the Netty streaming refactoring. Our customer was on a branch based on 3.11.

We took a very conservative approach with the fix, in that we didn't want to fully synchronize {{AbstractTransactional}} and {{LifecycleTransaction}} on released branches. We could consider synchronizing these classes for 4.0, however, or reworking streaming.

Here are the 3.11 changes. If there is interest in this approach, I can create patches for 3.0 and trunk as well:

[https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11]

We simply extracted a new interface, the [sstable tracker|https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11#diff-9d71c7ad9ad16368bd0429d3b34e2b21R15], which is also [implemented|https://github.com/apache/cassandra/compare/cassandra-3.11...stef1927:db-2633-3.11#diff-1a464da4a62ac4a734c725059cbc918bR144] by {{StreamReceiveTask}}. It synchronizes access to the txn, just as it does for all its other accesses to the txn. Whilst it's not ideal to have an additional interface, the change should be quite safe for released branches.
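To illustrate the approach described above, here is a minimal, self-contained sketch. It is not the code from the linked branch: {{ToyTransaction}}, {{ReceiveTaskSketch}}, {{TrackerSketch}} and the {{boolean trackNew(String)}} signature are hypothetical stand-ins, and the real interface and {{StreamReceiveTask}} may look quite different. The idea shown is only that readers call a narrow tracker interface, and the task that owns the txn serializes every access to it, so an add can never interleave with (or follow) an abort.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Entry point first so the file can be launched directly as a single source file.
class TrackerSketch {
    /** Starts `threads` writers that each track `perThread` unique names; returns how many were recorded. */
    static int runWriters(int threads, int perThread) {
        ReceiveTaskSketch task = new ReceiveTaskSketch(new ToyTransaction());
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    task.trackNew("sstable-" + id + "-" + i);
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return task.trackedCount();
    }

    public static void main(String[] args) {
        System.out.println("tracked: " + runWriters(4, 1000));
    }
}

/** Hypothetical narrow interface handed to the threads that receive sstables. */
interface SSTableTracker {
    /** Returns false if the sstable was rejected because the session was already aborted. */
    boolean trackNew(String sstableName);
}

/** Hypothetical stand-in for a transaction that is NOT thread safe on its own. */
class ToyTransaction {
    private final Set<String> tracked = new LinkedHashSet<>();
    private boolean aborted;

    void trackNew(String sstableName) {
        if (aborted)
            throw new IllegalStateException("ADD after ABORT");
        tracked.add(sstableName);
    }

    void abort() { aborted = true; }
    boolean isAborted() { return aborted; }
    int size() { return tracked.size(); }
}

/** Hypothetical owner of the txn: every access goes through the same monitor. */
class ReceiveTaskSketch implements SSTableTracker {
    private final ToyTransaction txn;

    ReceiveTaskSketch(ToyTransaction txn) { this.txn = txn; }

    @Override
    public synchronized boolean trackNew(String sstableName) {
        if (txn.isAborted())
            return false; // the abort won the race; do not touch the txn again
        txn.trackNew(sstableName);
        return true;
    }

    public synchronized void abort() { txn.abort(); }

    public synchronized int trackedCount() { return txn.size(); }
}
```

Because the receiving threads only see {{SSTableTracker}}, they cannot reach the unsynchronized txn directly. That is the property that makes this kind of change low risk compared with synchronizing the transaction classes themselves.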
> LifecycleTransaction encounters ConcurrentModificationException when used in multi-threaded context
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14554
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14554
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Dinesh Joshi
>            Assignee: Dinesh Joshi
>            Priority: Major
>
> When LifecycleTransaction is used in a multi-threaded context, we encounter this exception -
> {quote}java.util.ConcurrentModificationException: null
>  at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
>  at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742)
>  at java.lang.Iterable.forEach(Iterable.java:74)
>  at org.apache.cassandra.db.lifecycle.LogReplicaSet.maybeCreateReplica(LogReplicaSet.java:78)
>  at org.apache.cassandra.db.lifecycle.LogFile.makeRecord(LogFile.java:320)
>  at org.apache.cassandra.db.lifecycle.LogFile.add(LogFile.java:285)
>  at org.apache.cassandra.db.lifecycle.LogTransaction.trackNew(LogTransaction.java:136)
>  at org.apache.cassandra.db.lifecycle.LifecycleTransaction.trackNew(LifecycleTransaction.java:529)
> {quote}
> During streaming we create a reference to a {{LifecycleTransaction}} and share it between threads -
> [https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/db/streaming/CassandraStreamReader.java#L156]
> This is used in a multi-threaded context inside {{CassandraIncomingFile}}, which is an {{IncomingStreamMessage}}. This is being deserialized in parallel.
> {{LifecycleTransaction}} is not meant to be used in a multi-threaded context, and this leads to streaming failures due to object sharing. On trunk, this object is shared across all threads that transfer sstables in parallel for the given {{TableId}} in a {{StreamSession}}. There are two options to solve this - make {{LifecycleTransaction}} and the associated objects thread safe, or scope the transaction to a single {{CassandraIncomingFile}}. The consequence of the latter option is that if we experience a streaming failure we may have redundant SSTables on disk. This is ok as compaction should clean this up. A third option is to synchronize access in the streaming infrastructure.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)