[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925605#comment-13925605 ]
Benedict edited comment on CASSANDRA-6285 at 3/10/14 9:52 AM:
--------------------------------------------------------------

Hmm. Just taking a look at Viktor's patch, I realised that my initial conclusions were actually quite plausible and probably (one of) the causes of the problem. When I dismissed them, I didn't realise we were using a custom TBinaryProtocol implementation. In particular, (1) is definitely possible, and probably the cause of the issue, although the attack jar source would be helpful to figure out if there are any other potential causes. We should be able to force the problem to occur by artificially delaying the commit log write to prove this.

Either way, I don't think Viktor's patch is the best way to deal with this problem, as it leaves cleaning up the direct buffers to GC. Since we could be creating a lot of these, we could create an awful lot of artificial memory pressure. Honestly, I think the best solution is to simply avoid using direct buffers with thrift, at least until 2.1, which should fix this problem by ensuring the CL _write_ (if not commit) has happened before performing the memtable insertion.
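The race described above — a direct buffer recycled for the next request before a delayed commit-log write has read it — can be sketched minimally as follows. This is an illustrative toy, not Cassandra code: the class, method names, and the single-thread "commit log" executor are all hypothetical; only the timing pattern (artificially delaying the write, as suggested above) matters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a direct buffer is handed to an asynchronous
// commit-log append, then recycled for the next request before the
// write happens, so the log sees the later request's bytes.
public class BufferReuseRace {

    static String loggedPayload() throws Exception {
        ExecutorService commitLog = Executors.newSingleThreadExecutor();
        ByteBuffer frame = ByteBuffer.allocateDirect(5);
        frame.put("ROW-A".getBytes(StandardCharsets.US_ASCII)).flip();

        // Artificially delayed commit-log write: the bytes are only
        // read out of the buffer after the delay.
        Future<byte[]> written = commitLog.submit(() -> {
            Thread.sleep(200);                    // write has not happened yet
            byte[] copy = new byte[frame.remaining()];
            frame.duplicate().get(copy);          // what actually hits the log
            return copy;
        });

        // Meanwhile the server recycles the same buffer for the next frame.
        frame.clear();
        frame.put("ROW-B".getBytes(StandardCharsets.US_ASCII)).flip();

        String logged = new String(written.get(), StandardCharsets.US_ASCII);
        commitLog.shutdown();
        return logged;
    }

    public static void main(String[] args) throws Exception {
        // The caller intended to log ROW-A, but by the time the delayed
        // write reads the recycled buffer it holds ROW-B.
        System.out.println("intended ROW-A, commit log saw " + loggedPayload());
    }
}
```

Copying the bytes out of the buffer before handing control back (or, as proposed above, not using direct buffers with thrift at all) removes the window in which the recycled buffer and the pending write alias the same memory.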
> 2.0 HSHA server introduces corrupt data
> ---------------------------------------
>
>                 Key: CASSANDRA-6285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6285
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2
>            Reporter: David Sauer
>            Assignee: Pavel Yaskevich
>            Priority: Critical
>             Fix For: 2.0.6
>
>         Attachments: 6285_testnotes1.txt, CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, compaction_test.py, disruptor-high-cpu.patch, disruptor-memory-corruption.patch
>
>
> After altering everything to LCS, the table OpsCenter.rollups60 and one other non-OpsCenter table got stuck with everything hanging around in L0.
> The compaction started and ran until the logs showed this:
> ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime]
> java.lang.RuntimeException: Last written key DecoratedKey(1326283851463420237, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) >= current key DecoratedKey(954210699457429663, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) writing into /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db
> 	at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141)
> 	at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164)
> 	at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160)
> 	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
> 	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
> 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
> Moving back to STC worked to keep the compactions running.
> Especially my own table I would like to move to LCS.
> After a major compaction with STC, the move to LCS fails with the same exception.

-- This message was sent by Atlassian JIRA (v6.2#6252)