[jira] [Updated] (CASSANDRA-11190) Fail fast repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Semb Wever updated CASSANDRA-11190:
-------------------------------------------
    Reviewers: Michael Semb Wever, Yuki Morishita  (was: Yuki Morishita)

> Fail fast repairs
> -----------------
>
>                 Key: CASSANDRA-11190
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Streaming and Messaging
>            Reporter: Paulo Motta (Deprecated)
>            Priority: Low
>
> Currently, if one node fails any phase of the repair (validation, streaming), the repair session is aborted, but the other nodes are not notified and keep doing either validation or syncing with other nodes.
> With CASSANDRA-10070 automatically scheduling repairs and potentially scheduling retries, it would be nice to make sure all nodes abort failed repairs, in order to be able to start other repairs safely on the same nodes.
> From CASSANDRA-10070:
> bq. As far as I understood, if there are nodes A, B, C running repair, A is the coordinator. If validation or streaming fails on node B, the coordinator (A) is notified and fails the repair session, but node C will remain doing validation and/or streaming, which could cause problems (or increased load) if we start another repair session on the same range.
> bq. We will probably need to extend the repair protocol to perform this cleanup/abort step on failure. We already have a legacy cleanup message that doesn't seem to be used in the current protocol that we could maybe reuse to clean up repair state after a failure. This repair abortion will probably intersect with CASSANDRA-3486. In any case, this is a separate (but related) issue and we should address it in an independent ticket, and make this ticket dependent on that.
> On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected conditions/hangs:
> bq. I wonder if maybe we should have more of a fail-fast policy when there are errors. For instance, if one node fails its validation phase, it might be worth failing right away and letting the user re-trigger a repair once he has fixed whatever was the source of the error, rather than still differencing/syncing the other nodes.
> bq. Going a bit further, I think we should add 2 messages to interrupt the validation and sync phases. If only because that could be useful to users if they need to stop a repair for some reason, but also, if we get an error during validation from one node, we could use that to interrupt the other nodes and thus fail fast while minimizing the amount of work done uselessly.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
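The coordination pattern discussed in the ticket (coordinator A learns that participant B failed, then tells C to stop) can be sketched as follows. This is a hypothetical illustration only, under the assumption of a simple broadcast-on-first-failure rule; the class and method names (RepairCoordinator, onFailure) are invented for the sketch and are not Cassandra's actual repair classes or messaging protocol.

```java
import java.util.*;

// Illustrative sketch of the proposed fail-fast behavior: when the
// coordinator learns that one participant failed validation or streaming,
// it notifies every OTHER participant so they can abort instead of
// continuing to validate/sync uselessly. Names are hypothetical.
public class RepairCoordinator {
    private final Set<String> participants = new LinkedHashSet<>();
    private final List<String> sentMessages = new ArrayList<>();
    private boolean failed = false;

    public RepairCoordinator(Collection<String> nodes) {
        participants.addAll(nodes);
    }

    // Called when a participant reports a validation/streaming failure.
    public void onFailure(String failedNode, String reason) {
        if (failed) return;          // only the first failure triggers cleanup
        failed = true;
        for (String node : participants) {
            if (!node.equals(failedNode)) {
                // A real implementation would send a cleanup/abort message
                // over the repair messaging protocol; here we just record it.
                sentMessages.add("ABORT " + node + " (cause: " + reason + ")");
            }
        }
    }

    public boolean isFailed() { return failed; }
    public List<String> abortMessages() { return sentMessages; }

    public static void main(String[] args) {
        // Nodes A, B, C run a repair; B fails validation, so A and C
        // (everyone except B) receive an abort notification.
        RepairCoordinator session = new RepairCoordinator(List.of("A", "B", "C"));
        session.onFailure("B", "validation failed");
        session.abortMessages().forEach(System.out::println);
    }
}
```

Note the guard on the first failure: subsequent failure reports are ignored, so each participant is told to abort at most once, which matches the ticket's goal of leaving every node in a clean state for the next scheduled repair.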
[jira] [Updated] (CASSANDRA-11190) Fail fast repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua McKenzie updated CASSANDRA-11190:
----------------------------------------
    Reviewer: Yuki Morishita

> Fail fast repairs
> -----------------
>
>                 Key: CASSANDRA-11190
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Minor

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (CASSANDRA-11190) Fail fast repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko updated CASSANDRA-11190:
------------------------------------------
    Issue Type: Improvement  (was: Bug)

> Fail fast repairs
> -----------------
>
>                 Key: CASSANDRA-11190
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Minor