Paulo Motta created CASSANDRA-11190:
---------------------------------------

             Summary: Fail fast repairs
                 Key: CASSANDRA-11190
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
             Project: Cassandra
          Issue Type: Bug
          Components: Streaming and Messaging
            Reporter: Paulo Motta
            Assignee: Paulo Motta
            Priority: Minor


Currently, if one node fails any phase of the repair (validation, streaming), 
the repair session is aborted, but the other nodes are not notified and keep 
doing either validation or syncing with other nodes.

With CASSANDRA-10070 automatically scheduling repairs and potentially 
scheduling retries, it would be nice to make sure all nodes abort failed 
repairs, in order to be able to start other repairs safely on the same nodes.

From CASSANDRA-10070:

bq. As far as I understood, if there are nodes A, B, C running repair, A is the 
coordinator. If validation or streaming fails on node B, the coordinator (A) is 
notified and fails the repair session, but node C will remain doing validation 
and/or streaming, what could cause problems (or increased load) if we start 
another repair session on the same range.

bq. We will probably need to extend the repair protocol to perform this 
cleanup/abort step on failure. We already have a legacy cleanup message that 
doesn't seem to be used in the current protocol that we could maybe reuse to 
cleanup repair state after a failure. This repair abortion will probably have 
intersection with CASSANDRA-3486. In any case, this is a separate (but related) 
issue and we should address it in an independent ticket, and make this ticket 
dependent on that.

On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected 
conditions/hangs:

bq. I wonder if maybe we should have more of a fail-fast policy when there is 
errors. For instance, if one node fail it's validation phase, maybe it might be 
worth failing right away and let the user re-trigger a repair once he has fixed 
whatever was the source of the error, rather than still differencing/syncing 
the other nodes.

bq. Going a bit further, I think we should add 2 messages to interrupt the 
validation and sync phase. If only because that could be useful to users if 
they need to stop a repair for some reason, but also, if we get an error during 
validation from one node, we could use that to interrupt the other nodes and 
thus fail fast while minimizing the amount of work done uselessly.
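
To make the requested behaviour concrete, here is a rough sketch of the 
A/B/C scenario quoted above. It is a toy model, not Cassandra code: the 
Participant and Coordinator classes, the abort() call standing in for an 
interrupt message, and the fail_fast flag are all hypothetical names made up 
for this illustration. With fail_fast=False it mimics what happens today 
(node C keeps running after B fails); with fail_fast=True the coordinator 
tells every remaining participant to stop.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class State(Enum):
    RUNNING = "running"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class Participant:
    """A node taking part in a repair session (hypothetical model)."""
    name: str
    state: State = State.RUNNING

    def abort(self) -> None:
        # Stand-in for receiving an interrupt message: a node that is still
        # validating or syncing stops; a node that already failed is left as is.
        if self.state is State.RUNNING:
            self.state = State.ABORTED


@dataclass
class Coordinator:
    """Repair coordinator (node A in the example); names are assumptions."""
    participants: Dict[str, Participant]
    fail_fast: bool = True
    session_failed: bool = False

    def on_failure(self, failed_node: str) -> None:
        # The coordinator learns of the failure and fails the session.
        self.participants[failed_node].state = State.FAILED
        self.session_failed = True
        if self.fail_fast:
            # Proposed extra step: notify everyone else so they stop working.
            for participant in self.participants.values():
                participant.abort()


def run(fail_fast: bool) -> Dict[str, str]:
    """Fail node B and report the final state of every node."""
    nodes = {name: Participant(name) for name in "ABC"}
    coordinator = Coordinator(nodes, fail_fast=fail_fast)
    coordinator.on_failure("B")
    return {name: node.state.value for name, node in nodes.items()}


if __name__ == "__main__":
    print("current behaviour:", run(fail_fast=False))
    print("fail-fast:        ", run(fail_fast=True))
```

In the fail_fast=False run, A and C are still in the running state after the 
session has failed, which is the situation that makes it unsafe to start 
another repair on the same range. A real implementation would also need to 
handle lost abort messages and nodes that are unreachable, which this sketch 
ignores.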



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)