[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925605#comment-13925605 ]
Benedict edited comment on CASSANDRA-6285 at 3/10/14 9:52 AM:
--------------------------------------------------------------

Hmm. Just taking a look at Viktor's patch, I realised that my initial conclusions were actually quite plausible and probably (one of) the causes of the problem. When I dismissed them, I didn't realise we were using a custom TBinaryProtocol implementation. In particular, (1) is definitely possible, and probably the cause of the issue, although the attack jar source would be helpful to figure out if there are any other potential causes. We should be able to force the problem to occur by artificially delaying the commit log write to prove this.

Either way, I don't think Viktor's patch is the best way to deal with this problem, as it leaves cleaning up the direct buffers to GC. Since we could be creating a lot of these, we could create an awful lot of artificial memory pressure. Honestly, I think the best solution is to simply avoid using direct buffers with thrift, at least until 2.1, which should fix this problem by ensuring the CL _write_ (if not commit) has happened before performing the memtable insertion.
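The race described above — a direct buffer recycled for the next request before a delayed commit-log write has read it — can be sketched minimally as follows. This is an illustrative toy, not Cassandra code: the class, method names, and the single-thread "commit log" executor are all hypothetical; only the timing pattern (artificially delaying the write, as suggested above) matters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a direct buffer is handed to an asynchronous
// commit-log append, then recycled for the next request before the
// write happens, so the log sees the later request's bytes.
public class BufferReuseRace {

    static String loggedPayload() throws Exception {
        ExecutorService commitLog = Executors.newSingleThreadExecutor();
        ByteBuffer frame = ByteBuffer.allocateDirect(5);
        frame.put("ROW-A".getBytes(StandardCharsets.US_ASCII)).flip();

        // Artificially delayed commit-log write: the bytes are only
        // read out of the buffer after the delay.
        Future<byte[]> written = commitLog.submit(() -> {
            Thread.sleep(200);                    // write has not happened yet
            byte[] copy = new byte[frame.remaining()];
            frame.duplicate().get(copy);          // what actually hits the log
            return copy;
        });

        // Meanwhile the server recycles the same buffer for the next frame.
        frame.clear();
        frame.put("ROW-B".getBytes(StandardCharsets.US_ASCII)).flip();

        String logged = new String(written.get(), StandardCharsets.US_ASCII);
        commitLog.shutdown();
        return logged;
    }

    public static void main(String[] args) throws Exception {
        // The caller intended to log ROW-A, but by the time the delayed
        // write reads the recycled buffer it holds ROW-B.
        System.out.println("intended ROW-A, commit log saw " + loggedPayload());
    }
}
```

Copying the bytes out of the buffer before handing control back (or, as proposed above, not using direct buffers with thrift at all) removes the window in which the recycled buffer and the pending write alias the same memory.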
> 2.0 HSHA server introduces corrupt data
> ---------------------------------------
>
>                 Key: CASSANDRA-6285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6285
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2
>            Reporter: David Sauer
>            Assignee: Pavel Yaskevich
>            Priority: Critical
>             Fix For: 2.0.6
>
>         Attachments: 6285_testnotes1.txt, CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, compaction_test.py, disruptor-high-cpu.patch, disruptor-memory-corruption.patch
>
>
> After altering everything to LCS, the table OpsCenter.rollups60 and one other non-OpsCenter table got stuck with everything hanging around in L0.
> The compaction started and ran until the logs showed this:
> ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime]
> java.lang.RuntimeException: Last written key DecoratedKey(1326283851463420237, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) >= current key DecoratedKey(954210699457429663, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) writing into /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db
> 	at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141)
> 	at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164)
> 	at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160)
> 	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
> 	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
> 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
> Moving back to STC worked to keep the compactions running.
> Especially my own table I would like to move to LCS.
> After a major compaction with STC, the move to LCS fails with the same exception.

-- This message was sent by Atlassian JIRA (v6.2#6252)