[jira] [Commented] (CASSANDRA-20014) Discard hints based on write time, not timeout time

Matt Byrd (Jira) Fri, 08 Nov 2024 21:45:37 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896848#comment-17896848
 ]


Matt Byrd commented on CASSANDRA-20014:
---------------------------------------

So just to elaborate a bit on an example of the particular case we're 
protecting against in the patch:

1. t=0, client1 -> co-ordinator1: write key1. 2 of 3 replicas receive the
data and a success response is sent back to the client.
2. t=1, client1 -> co-ordinator1: delete key1. Assume 3 of 3 replicas receive
the delete and persist it.
3. t=2, on co-ordinator1 the write to the third replica times out and a hint
is submitted for replica3.
4. Hint delivery is delayed for some reason (target down/instability/too many
hints to deliver fast enough).
5. t = 0 + gc_grace_seconds, compaction occurs on replica3 and the tombstone
is eligible for purging, as the data has either already been compacted into
the tombstone or is involved in the current compaction; data + tombstone are
both removed.
6. t = 0 + gc_grace_seconds + epsilon (< gc_grace_seconds + 2), the hint is
still eligible for delivery since its TTL was created with a 2 second offset;
the hint arrives on replica3, where the data and tombstone are gone. (see the
related min of current/past gcgs/max hint ttl etc:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hints/Hint.java#L133)
7. t=later, tombstones + data are removed on replica1 and replica2 once
eligible for collection, and the data is now visible (it has re-appeared) if
read.
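
As a rough sketch of the window in the steps above (a simplified model, not
the actual Hint.java logic): the function name resurrection_window and the
gcgs value of 100 are made up for illustration, and the model assumes the hint
TTL equals gc_grace_seconds and that a tombstone becomes purgeable at its
deletion time plus gc_grace_seconds.

```python
from typing import Optional, Tuple


def resurrection_window(hint_creation_time: int,
                        tombstone_deletion_time: int,
                        gc_grace_seconds: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the interval in which a hint can still be
    delivered although the tombstone is already purgeable, or None.

    Simplified model (assumption): hint TTL == gc_grace_seconds, and the
    tombstone is purgeable from deletion time + gc_grace_seconds onwards.
    """
    tombstone_purgeable_at = tombstone_deletion_time + gc_grace_seconds
    hint_expires_at = hint_creation_time + gc_grace_seconds
    if hint_expires_at > tombstone_purgeable_at:
        return (tombstone_purgeable_at, hint_expires_at)
    return None


GCGS = 100  # illustrative value only

# Hint stamped with the timeout time (t=2), delete at t=1: a window exists.
before_fix = resurrection_window(hint_creation_time=2,
                                 tombstone_deletion_time=1,
                                 gc_grace_seconds=GCGS)

# Hint stamped with the write time (t=0): the hint expires before the
# tombstone can be purged, so there is no window.
after_fix = resurrection_window(hint_creation_time=0,
                                tombstone_deletion_time=1,
                                gc_grace_seconds=GCGS)

print(before_fix, after_fix)
```

Under these assumptions, stamping the hint with the write time closes the
window, because the hint can never outlive a tombstone written after the
original write.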

The confusing thing is that you wrote something, deleted it later, and then
much later it suddenly re-appears.

Another mitigation that a user can take is to ensure that 
CASSANDRA_MAX_HINT_TTL is suitably lower than gc_grace_seconds for any table 
where this poses a problem.
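
A minimal sketch of how one might check for that, assuming the max hint TTL is
expressed in seconds like gc_grace_seconds. The helper tables_at_risk, the
table names, and the margin are all hypothetical; nothing here reads real
Cassandra configuration.

```python
from typing import Dict, List


def tables_at_risk(max_hint_ttl_seconds: int,
                   gc_grace_by_table: Dict[str, int],
                   margin_seconds: int = 0) -> List[str]:
    """Tables whose gc_grace_seconds is not more than margin_seconds above
    the maximum hint TTL (hypothetical helper, for illustration)."""
    return sorted(
        table for table, gcgs in gc_grace_by_table.items()
        if max_hint_ttl_seconds + margin_seconds >= gcgs
    )


# Hypothetical tables and values, for illustration only.
gc_grace = {"ks.events": 864000, "ks.sessions": 3600, "ks.audit": 86400}
risky = tables_at_risk(max_hint_ttl_seconds=86400,
                       gc_grace_by_table=gc_grace,
                       margin_seconds=600)
print(risky)
```

Any table listed would need either a larger gc_grace_seconds or a smaller max
hint TTL for the mitigation to apply.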

Coming back to the alternate problem you're describing, I'm wondering if it's 
an altogether different sequence of events?

Since the hint is happening on the write, not the delete, having a GC pause
will only increase the time to get a response from the delete and the time
that the delete tombstone will live, possibly shrinking the window and making
the problem less likely.
With the existing code (prior to the fix), GC pauses before hint submission
elongate the window.

To cause a problem you want to send the deletion time backwards as compared
to the timestamp used for the hint TTL.
It is possible to set the deletion time (in addition to the cell timestamp for
both the write and the delete):
https://github.com/apache/cassandra/blob/fd4113d5fef09f6361bea88a35655db2ecb46427/doc/native_protocol_v5.spec#L572

So something like the following:
Assume gcgs of 60.
1. t = 100 write data with timestamp t = 50 to 2/3 replicas
2. t = 101 submit a hint to replica3 (could even be at t=100 with just the 
replica being down rather than on timeout)
3. t = 102 write delete at t = 51 with local deletion time set to t = 51
4. t = 111 tombstone/data collected
5. t = 120 hint delivered to replica3


This would not be solved by the above patch.
Setting the cell timestamp in addition may not even be necessary.
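
To make the arithmetic of this second example concrete, here is a small sketch
under the same simplified assumptions (hint TTL equal to gcgs, tombstone
purged as soon as its local deletion time plus gcgs has passed). The function
outcome and its return strings are made up for illustration.

```python
GCGS = 60  # gcgs from the example above


def outcome(hint_creation_time: int,
            local_deletion_time: int,
            delivery_time: int,
            gcgs: int = GCGS) -> str:
    """Classify what happens when the hint is delivered at delivery_time.

    Assumptions: the hint is live while delivery_time < creation + gcgs, and
    compaction purges the tombstone as soon as it becomes eligible.
    """
    hint_live = delivery_time < hint_creation_time + gcgs
    tombstone_purged = delivery_time >= local_deletion_time + gcgs
    if not hint_live:
        return "hint expired"
    if tombstone_purged:
        return "data resurrected"
    return "tombstone shadows write"


# Hint stamped with the timeout time (t = 101), local deletion time 51.
print(outcome(hint_creation_time=101, local_deletion_time=51, delivery_time=120))

# Hint stamped with the write time (t = 100): same result, since the local
# deletion time (51) is still far behind the hint creation time.
print(outcome(hint_creation_time=100, local_deletion_time=51, delivery_time=120))

# For comparison: local deletion time left at the wall-clock time t = 102.
print(outcome(hint_creation_time=100, local_deletion_time=102, delivery_time=120))
```

In this model the first two calls both end in resurrection, which is the
sense in which moving the hint creation time to the write time does not help
here.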
 
However, in the above example the information we need to decide what to set
the hint TTL timestamp to is only available out of band, after we've hinted
(the deletion time from the subsequent delete).
So I'm not sure there is an elegant general solution to this problem.
I suppose it may be mostly theoretical, since the spec does mention that
setting "now_in_seconds" is intended for testing purposes.
Introduced here:
https://issues.apache.org/jira/browse/CASSANDRA-14664

> Discard hints based on write time, not timeout time
> ---------------------------------------------------
>
>                 Key: CASSANDRA-20014
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20014
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Hints
>            Reporter: Blake Eggleston
>            Assignee: Matt Byrd
>            Priority: Normal
>
> Hints created after a write timeout are created with the timeout time as
> the hint creation time. In the case of slow hint delivery, this can create a 
> window of time where a write is applied after gcgs would have elapsed for 
> tombstones written after the original write, and the tombstone has been 
> purged, causing data resurrection. We should use the time the client request 
> thread started working on the request as the hint creation time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
