[ 
https://issues.apache.org/jira/browse/CASSANDRA-14532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519023#comment-16519023
 ] 

Sylvain Lebresne commented on CASSANDRA-14532:
----------------------------------------------

That's not really a bug, in that this is working as designed. The very reason 
for gc grace is to be a long enough time that we can guarantee any data 
(including tombstones) has been propagated to all replicas, and that's why you 
must run repair within the gc grace window (otherwise the other mechanisms 
don't truly guarantee that). So we should not have to propagate anything past 
gcgs, and doing so is at best an inefficiency.

Or another way to think about it is, post-gcgs, a tombstone can be purged at 
any time, including immediately, solely based on local compaction conditions. 
So if not propagating post-gcgs tombstones was _a bug_, we'd basically be 
saying the whole concept of ever purging tombstones is bugged.
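
To make the purge condition concrete, here is a minimal sketch, not Cassandra's 
actual code. It assumes the simplified rule that a tombstone becomes purgeable 
once its local deletion time is older than "now minus gc_grace_seconds", and it 
ignores the extra overlap checks compaction performs. The names `Tombstone`, 
`gc_before` and `is_purgeable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tombstone:
    # Wall-clock second at which the deletion was written (illustrative field).
    local_deletion_time: int


def gc_before(now: int, gc_grace_seconds: int) -> int:
    """Cutoff below which a tombstone is considered past gcgs."""
    return now - gc_grace_seconds


def is_purgeable(tombstone: Tombstone, now: int, gc_grace_seconds: int) -> bool:
    """True once the tombstone is older than the gc grace window.

    Simplified assumption: real compaction also checks that the tombstone
    does not shadow data in sstables outside the compaction.
    """
    return tombstone.local_deletion_time < gc_before(now, gc_grace_seconds)


# Mirrors the reproduction in the ticket: gc_grace_seconds = 10, later raised
# to 100000, which makes the same tombstone "live" again.
deleted = Tombstone(local_deletion_time=1000)
past_gcgs_with_10s = is_purgeable(deleted, now=1011, gc_grace_seconds=10)
past_gcgs_with_100000s = is_purgeable(deleted, now=1011, gc_grace_seconds=100000)
```

Under these assumptions the same tombstone is purgeable with gc_grace_seconds = 
10 and not purgeable after the ALTER to 100000, which matches the last SELECT 
in the reproduction returning 0 rows again.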

And sending tombstones past gcgs is actually somewhat bad, exactly because 
such tombstones will be purged by different nodes at (possibly very) different 
times. Every time we read while one node has purged a tombstone and others 
haven't, we'd get a digest mismatch, increasing the operation latency and 
incurring more work on the system. And that for no benefit whatsoever to any 
user that actually configures gcgs properly and runs repairs properly. Even for 
those that don't, whether sending post-gcgs tombstones saves their asses or not 
will be at best totally random (and so largely useless imo).
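
The digest mismatch argument can be illustrated with a small toy model. This is 
only a sketch under stated assumptions: a read response is reduced to a tuple 
of live rows plus an optional partition deletion time, and the digest is a 
plain MD5 of its repr. `build_response`, `digest` and `digests_match` are 
hypothetical helpers, not Cassandra APIs.

```python
import hashlib
from typing import Optional, Sequence, Tuple

Response = Tuple[Tuple[str, ...], Optional[int]]


def build_response(rows: Sequence[str], deletion_time: Optional[int],
                   now: int, gc_grace_seconds: int,
                   send_expired: bool) -> Response:
    """Build a replica's read response, optionally dropping post-gcgs tombstones."""
    expired = deletion_time is not None and deletion_time < now - gc_grace_seconds
    if expired and not send_expired:
        deletion_time = None  # treat the tombstone as already purged
    return (tuple(rows), deletion_time)


def digest(response: Response) -> str:
    """Toy digest of a response (illustrative, not the real wire format)."""
    return hashlib.md5(repr(response).encode("utf-8")).hexdigest()


def digests_match(send_expired: bool) -> bool:
    now, gcgs = 2000, 10
    # Replica A still holds the tombstone locally; replica B has compacted
    # it away. Neither has live rows for the partition.
    a = build_response([], deletion_time=1000, now=now,
                       gc_grace_seconds=gcgs, send_expired=send_expired)
    b = build_response([], deletion_time=None, now=now,
                       gc_grace_seconds=gcgs, send_expired=send_expired)
    return digest(a) == digest(b)
```

In this model, `digests_match(send_expired=True)` is False because only replica 
A includes the tombstone, while `digests_match(send_expired=False)` is True 
because both replicas answer as if it were already purged.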

Overall, I'd be kind of -1 (at least unless someone disproves my reasoning 
above) on a patch that simply starts sending post-gcgs tombstones on reads, as 
it creates inefficiencies without buying any additional concrete guarantee.

> Partition level deletions past GCGS are not propagated/merged on read
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-14532
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14532
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Kurt Greaves
>            Assignee: Kurt Greaves
>            Priority: Major
>
> So as [~jay.zhuang] mentioned on the mailing list 
> [here|http://mail-archives.us.apache.org/mod_mbox/cassandra-dev/201806.mbox/<CAAXszS0%3DmCu5ptDccki_coxRwwF0ZFrTYs_EJLpMTDjNT3tFSA%40mail.gmail.com>],
>  it appears that partition deletions that have passed GCGS are not 
> propagated/merged properly on read, and also not repaired via read repair.
> Steps to reproduce:
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 3};
> create table test.test (id int PRIMARY KEY , data text) WITH gc_grace_seconds 
> = 10;
> CONSISTENCY ALL;
> INSERT INTO test.test (id, data) values (1, 'test');
> ccm node2 stop
> CONSISTENCY QUORUM;
> DELETE from test.test where id = 1; // wait 10 seconds so HH doesn't propagate tombstone when starting node2
> select * from test.test where id = 1 ;
>  id | data
> ----+------
> (0 rows)
> ccm node2 start
> CONSISTENCY ALL;
> select * from test.test where id = 1 ;
>  id | data
> ----+------
>   1 | test
> alter table test.test WITH gc_grace_seconds = 100000; // GC
> select * from test.test where id = 1 ;
>  id | data
> ----+------
> (0 rows)
> {code}
> We've also found a seemingly related issue in compaction: when compacting an 
> SSTable that contains a partition deletion past GCGS, the partition deletion 
> is not removed by the compaction. Likely the same code is causing both bugs.



