Sounds like an older issue that I tried to address two years ago:
https://issues.apache.org/jira/browse/CASSANDRA-11427

As you can see, the result wasn't as expected and the patch caused some
unintended side effects. I'm not sure I'd be willing to give this another
try, considering that the behaviour we'd like to fix in the first place is
rather harmless and the read repairs shouldn't happen at all for users who
regularly run repairs within gc_grace.
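
Just as a point of reference for the "regularly run repairs within
gc_grace" part: with the default gc_grace_seconds of 10 days, a scheduled
full repair per node along these lines should be enough to avoid the
situation described below (the keyspace name and schedule are of course
just placeholders):

  # weekly full repair of the foo keyspace, well within the
  # default gc_grace_seconds of 864000s (10 days)
  0 2 * * 0  nodetool repair --full foo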

What I'd suggest is to think more in the direction of a
post-full-repair world and to fully embrace incremental repairs, as fixed
by Blake for 4.0. In that case, we should stop doing read repairs
altogether for repaired data, as described in
https://issues.apache.org/jira/browse/CASSANDRA-13912. Read repairs are
certainly useful, but can be very risky if not implemented very carefully.
So I'm wondering if we shouldn't disable them for everything but
unrepaired data. I'd btw also be interested to hear any opinions on this
in the context of transient replicas.


On 20.06.2018 03:07, Jay Zhuang wrote:
> Hi,
> 
> We know that deleted data may re-appear if repair is not run within
> gc_grace_seconds: when the tombstone is not propagated to all nodes, the
> data will re-appear. But it also causes the following 2 issues before the
> tombstone is compacted away:
> 
> a. inconsistent query results
> 
> With consistency level ONE or QUORUM, a query may or may not return the value.
> 
> b. lots of read repairs, but nothing gets repaired
> 
> With consistency level ALL, every read triggers a read repair.
> With consistency level QUORUM, a read is also very likely (2/3 of the time,
> since 2 of the 3 possible replica pairs include the node missing the
> tombstone) to trigger a read repair. But the read repair doesn't actually
> fix the data, so it happens again on every such read.
> 
> 
> Here are the reproducing steps:
> 
> 1. Create a 3 nodes cluster
> 2. Create a table (with small gc_grace_seconds):
> 
> CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 3};
> CREATE TABLE foo.bar (
>     id int PRIMARY KEY,
>     name text
> ) WITH gc_grace_seconds=30;
> 
> 3. Insert data with consistency all:
> 
> INSERT INTO foo.bar (id, name) VALUES(1, 'cstar');
> 
> 4. Stop one node:
> 
> $ ccm node2 stop
> 
> 5. Delete the data with consistency quorum:
> 
> DELETE FROM foo.bar WHERE id=1;
> 
> 6. Wait 30 seconds and then start node2:
> 
> $ ccm node2 start
> 
> Now the tombstone is on node1 and node3 but not on node2.
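> 
> One way to confirm that (just a sketch: assuming the memtables have been
> flushed, and the data directory path is only a placeholder) is to dump the
> sstables on each node:
> 
>     $ nodetool flush foo bar
>     $ sstabledump /var/lib/cassandra/data/foo/bar-*/*-Data.db
>     # node1 and node3 show the deletion_info (the tombstone),
>     # node2 only shows the live row (1, 'cstar')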
> 
> With a quorum read, it may or may not return the value, and read repair will
> send the data from node2 to node1 and node3, but it doesn't repair anything.
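> 
> Roughly what this looks like from cqlsh (the exact outcome depends on
> which two replicas the coordinator happens to pick for each read):
> 
>     CONSISTENCY QUORUM;
>     SELECT * FROM foo.bar WHERE id=1;
>     -- sometimes returns the row (1, 'cstar'), sometimes 0 rows, and each
>     -- digest mismatch triggers another read repair that changes nothing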
> 
> I'd like to discuss a few potential solutions and workarounds:
> 
> 1. Can hints replay send GCed tombstones?
> 
> 2. Can we have a "deep repair" which detects such an issue and repairs the
> GCed tombstone? Or temporarily increase gc_grace_seconds for the repair?
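> 
> A rough sketch of that second idea, assuming the tombstones are still on
> disk on node1/node3 (i.e. not yet compacted away):
> 
>     -- temporarily raise gc_grace_seconds so the old tombstone is
>     -- propagated by repair again instead of being treated as purgeable
>     ALTER TABLE foo.bar WITH gc_grace_seconds = 864000;
>     -- then run "nodetool repair --full foo bar" on each node,
>     -- and once repair has finished everywhere restore the old value:
>     ALTER TABLE foo.bar WITH gc_grace_seconds = 30;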
> 
> What other suggestions do you have if a user is running into this issue?
> 
> 
> Thanks,
> 
> Jay
> 
