David Capwell created CASSANDRA-15566:
-----------------------------------------

             Summary: Repair coordinator can hang under some cases
                 Key: CASSANDRA-15566
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
             Project: Cassandra
          Issue Type: Improvement
          Components: Consistency/Repair
            Reporter: David Capwell
            Assignee: David Capwell


Repair coordination makes a few assumptions about message delivery, and it 
hangs forever when they do not hold: that a fire-and-forget message will not 
be rejected (a participant may hit a problem and reject the message), and that 
a very delayed message will eventually be seen (messages can be dropped under 
load, or when the failure detector thinks a node is down but it is only GCing).

Given this and the desire for better observability of repair (see 
CASSANDRA-15399), coordination should be changed to a request/response 
pattern (with retries) plus polling (for validation status and MerkleTree 
sending). This would allow the coordinator to detect changes in state (for 
example, a participant was known to be working on validation but no longer 
knows about the validation task) and to recover from ephemeral issues.
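
The proposed flow can be illustrated with a minimal sketch. It is not 
Cassandra code: the names ValidationState, MessageDropped, send_with_retries 
and coordinate_validation are all hypothetical, and the transport is assumed 
to be a pair of callables that either return a response or raise when a 
message is dropped. It only shows how retries plus polling give the 
coordinator a bounded outcome instead of an indefinite wait.

```python
import enum
from typing import Callable, TypeVar

T = TypeVar("T")


class ValidationState(enum.Enum):
    """Hypothetical states a participant could report when polled."""
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"  # participant has no record of the validation task


class MessageDropped(Exception):
    """Raised by the assumed transport when no response arrives in time."""


def send_with_retries(send: Callable[[], T], attempts: int = 3) -> T:
    """Request/response with retries: call send() until it returns a value."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except MessageDropped as error:
            last_error = error  # ephemeral failure, try again
    raise last_error  # every attempt was dropped


def coordinate_validation(
    start: Callable[[], object],
    poll_status: Callable[[], ValidationState],
    max_polls: int = 10,
    attempts: int = 3,
    wait: Callable[[], None] = lambda: None,
) -> str:
    """Start validation, then poll until it finishes, is lost, or times out."""
    send_with_retries(start, attempts)  # the request must be acknowledged
    for _ in range(max_polls):
        state = send_with_retries(poll_status, attempts)
        if state is ValidationState.DONE:
            return "success"
        if state is ValidationState.UNKNOWN:
            # The participant was working on validation but no longer knows
            # about the task, so fail instead of waiting forever.
            return "failed: participant lost validation task"
        wait()  # assumed back-off between polls; a no-op by default
    return "failed: timed out"
```

In this sketch a dropped message costs one retry rather than the whole repair, 
and a participant that has forgotten the task is reported as a failure on the 
next poll. The retry and poll limits are arbitrary illustrative values.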



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

