[
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055063#comment-17055063
]
David Capwell commented on CASSANDRA-15566:
-------------------------------------------
bq. C* 4.0 code is quite new to me...
Me too :)
One of the best ways to start is testing; we need more tests to show where
repair needs improvement. When I joined this project I asked operators for
their top pain points with repair (all were from 2.1), and as I write tests I
see 4.0 has the same issues. More tests which show new areas would be great!
I think your 5 classifications are good, though 1/2 can merge; our networking
is lossy (not a bad thing, under load it’s crash or drop). I would love a smoke
test which constantly runs user/operator tasks under “load” (we should be able
to artificially lower resources). This test would also help show whether the
different subsystems work well or need improvement.
About a participant crashing, I added a jvm dtest which shows this is handled,
assuming the failure detector detects it (restarting the node also fails the
repair).
About detection and abort, I agree it should be external for now. Any/all
things the external tools need must be identified and tested to show they work
(for example does aborting repair work?).
> Repair coordinator can hang under some cases
> --------------------------------------------
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0-beta
>
>
> Repair coordination makes a few assumptions about message delivery which
> cause it to hang forever when those assumptions don’t hold true: fire and
> forget will not get rejected (participant has an issue and rejects the
> message), and a very delayed message will one day be seen (messaging can be
> dropped under load or when failure detector thinks a node is bad but is just
> GCing).
> Given this and the desire to have better observability with repair (see
> CASSANDRA-15399), coordination should be changed into a request/response
> pattern (with retries) and polling (validation status and MerkleTree
> sending). This would allow the coordinator to detect changes in state (it
> was known the participant was working on validation, but it no longer knows about
> the validation task), and to be able to recover from ephemeral issues.
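
To make the request/response (with retries) and polling idea from the
description concrete, here is a minimal sketch in plain Java. It is not
Cassandra code: the class RepairPollingSketch, the Status and Outcome enums,
and both method names are hypothetical, and an empty Optional stands in for a
dropped or timed-out message. A real coordinator would also wait or back off
between polls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Cassandra.
final class RepairPollingSketch {
    /** States a participant could report for a validation task (assumed). */
    enum Status { RUNNING, COMPLETED, UNKNOWN }

    /** What the coordinator concludes after polling (assumed). */
    enum Outcome { SUCCESS, PARTICIPANT_LOST_TASK, UNREACHABLE, TIMED_OUT }

    /**
     * Sends a request up to maxAttempts times. An empty Optional models a
     * message that was dropped or never answered.
     */
    static <T> Optional<T> requestWithRetries(Supplier<Optional<T>> request, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> response = request.get();
            if (response.isPresent()) {
                return response;
            }
        }
        return Optional.empty();
    }

    /**
     * Polls the participant for validation status instead of waiting forever
     * for a fire-and-forget reply, so a lost task or dead node is detected.
     */
    static Outcome awaitValidation(Supplier<Optional<Status>> statusRequest,
                                   int maxPolls, int maxAttemptsPerPoll) {
        for (int poll = 0; poll < maxPolls; poll++) {
            Optional<Status> status = requestWithRetries(statusRequest, maxAttemptsPerPoll);
            if (!status.isPresent()) {
                return Outcome.UNREACHABLE; // every retry of this poll went unanswered
            }
            switch (status.get()) {
                case COMPLETED:
                    return Outcome.SUCCESS;
                case UNKNOWN:
                    // Participant answered but no longer knows the task.
                    return Outcome.PARTICIPANT_LOST_TASK;
                case RUNNING:
                    break; // still working; poll again
            }
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        // Scripted replies: one dropped message, then RUNNING, then COMPLETED.
        Iterator<Optional<Status>> replies = Arrays.asList(
                Optional.<Status>empty(),
                Optional.of(Status.RUNNING),
                Optional.of(Status.COMPLETED)).iterator();
        System.out.println(awaitValidation(replies::next, 5, 2)); // prints SUCCESS
    }
}
```

The point of the sketch is that every path returns: a dropped message is
retried, a participant that forgot the task is reported as such, and a
participant that never finishes hits the poll limit rather than hanging the
coordinator.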