[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

David Capwell (Jira) Mon, 09 Mar 2020 08:12:59 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055063#comment-17055063
 ]


David Capwell commented on CASSANDRA-15566:
-------------------------------------------

bq. C* 4.0 code is quite new to me...

Me too :)

One of the best ways to start is testing; we need more tests to show where 
repair needs improvement. When I joined this project I asked operators for 
their top pain points with repair (all were from 2.1), and as I write tests I 
see 4.0 has the same issues. More tests which show new areas would be great!

I think your 5 classifications are good, though 1 and 2 can merge; our 
networking is lossy (not a bad thing; under load it’s crash or drop). I would 
love a smoke test which constantly runs user/operator tasks under “load” (we 
should be able to artificially lower resources). This test would also help 
show whether the different subsystems work well or need improvement.

About a participant crashing, I added a jvm dtest which shows this is handled, 
assuming the failure detector detects it (restarting the node also fails the 
repair).

About detection and abort, I agree it should be external for now. Anything the 
external tools need must be identified and tested to show it works (for 
example, does aborting a repair work?).

> Repair coordinator can hang under some cases
> --------------------------------------------
>
>                 Key: CASSANDRA-15566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> Repair coordination makes a few assumptions about message delivery which 
> cause it to hang forever when those assumptions don’t hold true: that a 
> fire-and-forget message will not get rejected (the participant has an issue 
> and rejects the message), and that a very delayed message will one day be 
> seen (messages can be dropped under load or when the failure detector thinks 
> a node is bad but it is just GCing).
> Given this and the desire to have better observability with repair (see 
> CASSANDRA-15399), coordination should be changed into a request/response 
> pattern (with retries) and polling (validation status and MerkleTree 
> sending).  This would allow the coordinator to detect changes in state (it 
> was known that the participant was working on validation, but it no longer 
> knows about the validation task), and to be able to recover from ephemeral 
> issues.
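
The request/response-with-retries and polling idea in the description above 
can be sketched roughly as follows. This is only an illustration and is not 
taken from the Cassandra code base: the class, the enums, and both method 
names are hypothetical, an empty Optional stands in for a dropped or timed-out 
message, and real code would add delays or backoff between attempts.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Cassandra.
class RepairPollingSketch {
    /** States a participant might report for a validation task (assumed). */
    public enum ValidationState { RUNNING, DONE, UNKNOWN }

    /** What the coordinator concludes after polling (assumed). */
    public enum Outcome { COMPLETED, PARTICIPANT_LOST_TASK, NO_RESPONSE, STILL_RUNNING }

    /**
     * Request/response with retries: re-send until a response arrives or
     * maxAttempts is exhausted. An empty Optional models a dropped message.
     */
    public static <T> Optional<T> requestWithRetries(Supplier<Optional<T>> request,
                                                     int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<T> response = request.get();
            if (response.isPresent()) {
                return response;
            }
        }
        return Optional.empty();
    }

    /**
     * Polls the participant for validation status instead of waiting forever
     * for a completion message that may never arrive.
     */
    public static Outcome pollValidation(Supplier<Optional<ValidationState>> statusRequest,
                                         int maxPolls,
                                         int maxAttemptsPerPoll) {
        for (int poll = 0; poll < maxPolls; poll++) {
            Optional<ValidationState> state =
                requestWithRetries(statusRequest, maxAttemptsPerPoll);
            if (!state.isPresent()) {
                return Outcome.NO_RESPONSE;            // every retry was dropped
            }
            switch (state.get()) {
                case DONE:
                    return Outcome.COMPLETED;
                case UNKNOWN:
                    // The participant no longer knows the task (e.g. it restarted).
                    return Outcome.PARTICIPANT_LOST_TASK;
                default:
                    break;                             // RUNNING: poll again
            }
        }
        return Outcome.STILL_RUNNING;                  // polling budget used up
    }
}
```

With this shape the coordinator always reaches a decision after a bounded 
number of messages, which is the property the hang described above lacks.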



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
