David Capwell created CASSANDRA-15566:
-----------------------------------------
Summary: Repair coordinator can hang under some cases
Key: CASSANDRA-15566
URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
Project: Cassandra
Issue Type: Improvement
Components: Consistency/Repair
Reporter: David Capwell
Assignee: David Capwell
Repair coordination makes a few assumptions about message delivery that cause
it to hang forever when they do not hold: that a fire-and-forget message will
never be rejected (a participant can hit an issue and reject the message), and
that a very delayed message will eventually be seen (messages can be dropped
under load, or when the failure detector thinks a node is down but it is just
in a GC pause).
Given this and the desire for better observability of repair (see
CASSANDRA-15399), coordination should be changed to a request/response
pattern (with retries) plus polling (for validation status and MerkleTree
sending). This would allow the coordinator to detect changes in state (for
example, a participant was known to be working on validation but no longer
knows about the validation task) and to recover from ephemeral issues.
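
As a rough illustration of the proposed pattern, here is a minimal sketch in
Java. It is not Cassandra's actual messaging or repair API: the class name,
the ValidationStatus enum, and both helper methods are hypothetical, and the
participant is simulated with plain suppliers. It only shows the two ideas
above: a request that is retried until it is acknowledged, and a status poll
that lets the coordinator notice when a participant no longer knows about a
validation task.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names exist in Cassandra.
final class RepairCoordinationSketch {
    /** Assumed states a participant could report for a validation task. */
    enum ValidationStatus { RUNNING, COMPLETED, UNKNOWN }

    /**
     * Request/response with retries: the request counts as delivered only
     * when the participant answers true. An exception stands in for a
     * timeout or a dropped message and triggers another attempt.
     */
    static boolean sendWithRetries(Supplier<Boolean> request, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (request.get()) {
                    return true; // participant acknowledged the request
                }
            } catch (RuntimeException e) {
                // simulated timeout or drop; fall through and retry
            }
        }
        return false; // caller can fail the repair instead of hanging
    }

    /**
     * Polling: ask for the validation status until it is no longer RUNNING
     * or the poll budget is used up. UNKNOWN means the participant has no
     * record of the task, which the coordinator can treat as a failure.
     */
    static ValidationStatus pollValidation(Supplier<ValidationStatus> statusRequest,
                                           int maxPolls, long intervalMillis) {
        for (int i = 0; i < maxPolls; i++) {
            ValidationStatus status = statusRequest.get();
            if (status != ValidationStatus.RUNNING) {
                return status;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
        return ValidationStatus.RUNNING; // still running when the budget ran out
    }

    public static void main(String[] args) {
        // Simulated participant that drops the first two requests.
        int[] attempts = {0};
        boolean accepted = sendWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("simulated dropped message");
            }
            return true;
        }, 5);
        System.out.println("accepted=" + accepted + " after " + attempts[0] + " attempts");

        // Simulated participant that loses track of the validation task.
        Iterator<ValidationStatus> replies = Arrays.asList(
                ValidationStatus.RUNNING,
                ValidationStatus.RUNNING,
                ValidationStatus.UNKNOWN).iterator();
        ValidationStatus last = pollValidation(replies::next, 10, 0L);
        System.out.println("validation status=" + last);
    }
}
```

In this simulation the request is accepted on the third attempt, and the
poll ends with UNKNOWN, which is the point at which a coordinator could fail
the repair session rather than wait for a message that will never arrive.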