[ https://issues.apache.org/jira/browse/CASSANDRA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1316.
---------------------------------------

         Assignee: Brandon Williams
    Fix Version/s: 0.6.4
                       (was: 0.6.5)
       Resolution: Fixed

Brandon's first patch, fixing reads at CL.ALL, turns out to address the only 
actual bug.  The rest is obscure-but-valid behavior that occurs when expired 
tombstones haven't been replicated across the cluster (i.e., the tombstones 
exist on some nodes, but not all).  Let me give an example:

Say node A has columns x and y, where x is an expired tombstone with timestamp 
T1, and node B has a live column x with timestamp T2, where T2 < T1.

If you read at ALL, you will see x from B and y from A.  You will _not_ see x 
from A -- since it is expired, it is no longer relevant off-node.  Thus, the 
ALL read will send a repair of column x to A, since it appeared to be 
"missing" there.

But the next time you read from A, the tombstone will still suppress the 
newly-written copy of x-from-B, because its timestamp is higher.  So the 
replicas won't converge.
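
To make the cycle concrete, here is a minimal, self-contained sketch of the 
reconciliation logic (the Column record and reconcile method are hypothetical 
stand-ins, not Cassandra's actual classes): highest timestamp wins, and 
expired tombstones are never shipped off-node.

    public class TombstoneConvergence {
        // A column with a timestamp; deleted == true marks a tombstone.
        record Column(String name, long timestamp, boolean deleted) {}

        // Reconcile two versions of the same column: highest timestamp wins.
        static Column reconcile(Column a, Column b) {
            return a.timestamp() >= b.timestamp() ? a : b;
        }

        public static void main(String[] args) {
            long t1 = 200, t2 = 100;                        // T2 < T1, as above
            Column xTombstoneA = new Column("x", t1, true); // expired tombstone on A
            Column xLiveB      = new Column("x", t2, false);

            // An ALL read: A's expired tombstone is not sent off-node, so the
            // coordinator sees x only from B, concludes A is missing it, and
            // read-repairs x-from-B onto A.
            Column repairedXonA = xLiveB;

            // But A still holds its local tombstone, and T1 > T2, so the
            // tombstone wins reconciliation and suppresses x again.
            Column winner = reconcile(xTombstoneA, repairedXonA);
            System.out.println(winner.deleted()
                ? "x is still deleted on A; the next ALL read repairs it again"
                : "x is live on A");
        }
    }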

This is not a bug, because the design explicitly allows this behavior when 
tombstones expire before being propagated to all nodes; see 
http://wiki.apache.org/cassandra/DistributedDeletes.  The best way to avoid 
it, of course, is to run repair frequently enough that tombstones are 
propagated within GCGraceSeconds of being written.

But if you do find yourself in this situation, you have two options to get 
things to converge again (see the command sketch after this list):

1) The simplest option is to perform a major compaction on each node, which 
will eliminate all expired tombstones.

2) If you want to propagate as many of the tombstones as possible first, 
increase your GCGraceSeconds setting everywhere (this requires a rolling 
restart), then perform a full repair as described in 
http://wiki.apache.org/cassandra/Operations.  After the repair is complete you 
can put GCGraceSeconds back to what it was.
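
For reference, the two options look roughly like this (the host name is an 
example, and 0.6-era nodetool and storage-conf.xml syntax is assumed):

    # Option 1: a major compaction on each node purges expired tombstones.
    bin/nodetool -host node1 compact

    # Option 2: first raise GCGraceSeconds in storage-conf.xml on every node,
    # e.g. <GCGraceSeconds>1728000</GCGraceSeconds>, and do a rolling restart.
    # Then run a full repair on each node:
    bin/nodetool -host node1 repair
    # Once repair completes, restore GCGraceSeconds and restart again.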


> Read repair does not always work correctly
> ------------------------------------------
>
>                 Key: CASSANDRA-1316
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1316
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.4
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 0.6.4
>
>         Attachments: 001_correct_responsecount_in_RRR.txt, 1316-RRM.txt, 
> cassandra-1.json, cassandra-2.json, cassandra-3.json, RRR-v2.txt
>
>
> Read repair does not always work.  At the least, we allow violation of the 
> CL.ALL contract.  To reproduce, create a three node cluster with RF=3, and 
> json2sstable one of the attached json files on each node.  This creates a row 
> whose key is 'test' with 9 columns, but only 3 columns are on each machine.  
> If you get_count this row in quick succession at CL.ALL, sometimes you will 
> receive a count of 6, sometimes 9.  After the ReadRepairManager has sent the 
> repairs, you will always get 9, which is the desired behavior.
> I have another data set obtained in the wild which never fully repairs for 
> some reason, but it's a bit large to attach (600ish columns per machine).  
> I'm still trying to figure out why RR isn't working on this set, but I always 
> get different results when reading at any CL including ALL, no matter how 
> long I wait or how many reads I do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
