[jira] [Comment Edited] (CASSANDRA-16721) Repaired data tracking on a read coordinator is susceptible to races between local and remote requests

Caleb Rackliffe (Jira) Mon, 16 Aug 2021 12:50:06 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399972#comment-17399972
 ]


Caleb Rackliffe edited comment on CASSANDRA-16721 at 8/16/21, 7:49 PM:
-----------------------------------------------------------------------

I've made a first pass at the patch, and I think it does solve the problem 
laid out in the issue description. However, there are a few questions I'm 
struggling with:
  

1.) Why do we share any aspect of {{RepairedDataInfo}} across threads at all? 
It seems like both the problem above and a class of other possible problems 
(read on) would be sidestepped completely. More specifically, perhaps we could 
do something like just indicating to the {{ReadExecutionController}} whether we 
should track repaired status?

2.) If we follow the scenario above, and two remote reads return and indicate a 
mismatch while the local read is still executing, is it possible that both the 
local read (likely on a Native Transport thread, but possibly on a ReadStage 
thread) and the local read started in {{startRepair()}} (and now on a ReadStage 
thread) use the same {{RepairedDataInfo}} instance as they serialize their 
local data responses? (In other words, is there ever a reason for an initial 
local data read to use a {{RepairedDataInfo}}?)

 
Even if the second item above isn't possible, it still seems like our 
implementation would be less brittle if we could find a minimally invasive 
way to make the change in the first item. I'm open to making a pass at it, but 
I want to make sure my starting assumptions are correct.
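
As a rough illustration of the first question, here is a minimal sketch in which the controller is only told whether to track repaired status and then owns its own tracking state, so nothing mutable is shared between the initial local read and the read started for repair. All class and method names here ({{PerExecutionTracking}}, {{TrackingState}}, {{Controller}}) are hypothetical stand-ins, not the actual Cassandra classes, and the real change may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT Cassandra's actual classes.
public class PerExecutionTracking {

    /** Simplified stand-in for per-read repaired data tracking state. */
    static final class TrackingState {
        // Shared no-op instance: safe to share because it never records anything.
        static final TrackingState NONE = new TrackingState(false);

        private final boolean tracking;
        private final List<String> seen = new ArrayList<>();

        private TrackingState(boolean tracking) {
            this.tracking = tracking;
        }

        static TrackingState create(boolean tracking) {
            return tracking ? new TrackingState(true) : NONE;
        }

        /** Records a partition key only when tracking is enabled. */
        void extend(String partitionKey) {
            if (tracking)
                seen.add(partitionKey);
        }

        int size() {
            return seen.size();
        }
    }

    /** Stand-in for an execution controller that owns its tracking state. */
    static final class Controller {
        private final TrackingState state;

        Controller(boolean trackRepairedStatus) {
            // The decision is captured once, at construction. Later changes
            // made elsewhere cannot affect an execution already in flight.
            this.state = TrackingState.create(trackRepairedStatus);
        }

        TrackingState state() {
            return state;
        }
    }
}
```

With this shape, an initial local read would build a {{Controller(false)}} and the read started for repair would build a separate {{Controller(true)}}, so the two never observe each other's state.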


was (Author: maedhroz):
I've made a first pass at the patch, and I think it does solve the problem 
described in the description above. However, there are a few questions I'm 
struggling with:
 

1.) Why do we share any aspect of {{RepairedDataInfo}} across threads at all? 
It seems like both the problem above and a class of other possible problems 
(read on) would be sidestepped completely. More specifically, perhaps we could 
do something like just indicating to the {{ReadExecutionController}} whether we 
should track repaired status?

2.) If we follow the scenario above, and two remote reads return and indicate a 
mismatch while the local read is still executing, is it possible that both the 
local read (likely on a Native Transport thread, but possibly on a ReadStage 
thread) and the local read started in {{startRepair()}} (and now on a ReadStage 
thread) use the same {{RepairedDataInfo}} instance as they serialize their 
local data responses?

 
Even if the second item above isn't possible, it still seems like our 
implementation would be less brittle if we could find a minimally invasive 
way to make the change in the first item. I'm open to making a pass at it, but 
I want to make sure my starting assumptions are correct.

> Repaired data tracking on a read coordinator is susceptible to races between 
> local and remote requests
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16721
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16721
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> At read time on a coordinator which is also a replica, the local and remote 
> reads can race such that the remote responses are received while the local 
> read is executing. If the remote responses are mismatching, triggering a 
> {{DigestMismatchException}} and subsequent round of full data reads and read 
> repair, the local runnable may find the {{isTrackingRepairedStatus}} flag 
> flipped mid-execution. If this happens after a certain point in execution, 
> the {{RepairedDataInfo}} instance in use is the singleton null object 
> {{RepairedDataInfo.NULL_REPAIRED_DATA_INFO}}, which can lead to an NPE in 
> {{RepairedDataInfo::extend}} when the local results are iterated.
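
The general shape of the hazard described above can be modeled as a flag that is consulted at two different points while another thread is free to flip it in between. The sketch below is a simplified model under that assumption, not the actual Cassandra code path: {{FlagSnapshot}}, {{Command}}, {{RacyRead}} and {{SnapshotRead}} are hypothetical names, and a plain {{null}} stands in for the "no tracking" instance.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of a check-then-act race; NOT the actual Cassandra code.
public class FlagSnapshot {

    /** Stand-in for a command whose tracking flag another thread may flip. */
    static final class Command {
        private final AtomicBoolean tracking = new AtomicBoolean(false);

        boolean isTracking() {
            return tracking.get();
        }

        void enableTracking() {
            tracking.set(true);
        }
    }

    /** Racy shape: the flag is read at setup and again during iteration. */
    static final class RacyRead {
        private final Command command;
        private final StringBuilder info; // null models "no tracking info"

        RacyRead(Command command) {
            this.command = command;
            this.info = command.isTracking() ? new StringBuilder() : null;
        }

        void onPartition(String key) {
            // Second read of the flag: if it flipped after construction,
            // info is still null and this dereference throws an NPE.
            if (command.isTracking())
                info.append(key);
        }
    }

    /** Safer shape: the flag is read once and the result is kept locally. */
    static final class SnapshotRead {
        private final StringBuilder info;

        SnapshotRead(Command command) {
            this.info = command.isTracking() ? new StringBuilder() : null;
        }

        void onPartition(String key) {
            if (info != null)
                info.append(key);
        }
    }
}
```

In this model, calling {{enableTracking()}} after both readers are constructed makes {{RacyRead.onPartition}} throw, while {{SnapshotRead}} keeps behaving according to the value it saw at construction.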



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

