Paulo Motta created CASSANDRA-11190:
---------------------------------------
Summary: Fail fast repairs
Key: CASSANDRA-11190
URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
Project: Cassandra
Issue Type: Bug
Components: Streaming and Messaging
Reporter: Paulo Motta
Assignee: Paulo Motta
Priority: Minor
Currently, if one node fails any phase of the repair (validation, streaming),
the repair session is aborted, but the other nodes are not notified and keep
doing either validation or syncing with other nodes.
With CASSANDRA-10070 automatically scheduling repairs and potentially
scheduling retries, it would be nice to make sure all nodes abort failed repairs
in order to be able to start other repairs safely on the same nodes.
From CASSANDRA-10070:
bq. As far as I understood, if there are nodes A, B, C running repair, A is the
coordinator. If validation or streaming fails on node B, the coordinator (A) is
notified and fails the repair session, but node C will remain doing validation
and/or streaming, which could cause problems (or increased load) if we start
another repair session on the same range.
bq. We will probably need to extend the repair protocol to perform this
cleanup/abort step on failure. We already have a legacy cleanup message that
doesn't seem to be used in the current protocol that we could maybe reuse to
cleanup repair state after a failure. This repair abortion will probably have
intersection with CASSANDRA-3486. In any case, this is a separate (but related)
issue and we should address it in an independent ticket, and make this ticket
dependent on that.
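
To make the A/B/C scenario above concrete, here is a minimal sketch of what the coordinator-side cleanup/abort step could look like. Every name in it (FailFastCoordinator, AbortSender, sendAbort) is hypothetical and invented for illustration; none of it is the existing repair protocol or the legacy cleanup message, and the transport is reduced to a callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: these are not Cassandra's actual classes or messages.
public class FailFastCoordinator {

    /** Hypothetical transport hook; a real version would go through the messaging layer. */
    public interface AbortSender {
        void sendAbort(String sessionId, String node);
    }

    private final String sessionId;
    private final Set<String> participants;
    private final AbortSender sender;
    private boolean failed = false;

    public FailFastCoordinator(String sessionId, Set<String> participants, AbortSender sender) {
        this.sessionId = sessionId;
        this.participants = participants;
        this.sender = sender;
    }

    /**
     * Called when a participant reports a validation or streaming failure.
     * Marks the session failed and asks every other participant to abort.
     * Returns the nodes that were notified (empty if the session had already failed).
     */
    public synchronized List<String> onParticipantFailure(String failedNode) {
        List<String> notified = new ArrayList<>();
        if (failed) {
            return notified; // abort was already broadcast once; do not send it again
        }
        failed = true;
        for (String node : participants) {
            if (!node.equals(failedNode)) {
                sender.sendAbort(sessionId, node);
                notified.add(node);
            }
        }
        return notified;
    }

    public synchronized boolean isFailed() {
        return failed;
    }
}
```

With participants B and C, a failure reported by B results in a single abort being sent to C, and later failure reports for the same session send nothing further.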
On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected
conditions/hangs:
bq. I wonder if maybe we should have more of a fail-fast policy when there are
errors. For instance, if one node fails its validation phase, maybe it might be
worth failing right away and let the user re-trigger a repair once he has fixed
whatever was the source of the error, rather than still differencing/syncing
the other nodes.
bq. Going a bit further, I think we should add 2 messages to interrupt the
validation and sync phase. If only because that could be useful to users if
they need to stop a repair for some reason, but also, if we get an error during
validation from one node, we could use that to interrupt the other nodes and
thus fail fast while minimizing the amount of work done uselessly.
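
On the participant side, an interrupt message for the validation phase could be handled roughly as in the sketch below. This is an assumption-laden illustration, not existing code: InterruptibleValidation, abort() and validate() are hypothetical names, and "validating a partition" is reduced to a callback so the example stays self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

// Hypothetical sketch only: not Cassandra's actual validation code.
public class InterruptibleValidation {

    private final AtomicBoolean aborted = new AtomicBoolean(false);
    private int processed = 0;

    /** Would be invoked when a (hypothetical) interrupt message arrives from the coordinator. */
    public void abort() {
        aborted.set(true);
    }

    /**
     * Validates partitions one at a time, checking the abort flag before each one.
     * Returns true if every partition was processed, false if the run was interrupted.
     */
    public boolean validate(int partitions, IntConsumer onPartition) {
        for (int i = 0; i < partitions; i++) {
            if (aborted.get()) {
                return false; // stop early instead of finishing useless work
            }
            onPartition.accept(i);
            processed++;
        }
        return true;
    }

    /** Number of partitions fully processed so far. */
    public int processedCount() {
        return processed;
    }
}
```

Checking the flag between units of work keeps the abort cooperative: the partition in progress finishes, and the remaining ones are skipped. The same pattern could apply to the sync phase, which is why the quote above suggests one interrupt message per phase.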