[jira] [Updated] (CASSANDRA-11190) Fail fast repairs

2021-03-14 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-11190:
---
Reviewers: Michael Semb Wever, Yuki Morishita  (was: Yuki Morishita)

> Fail fast repairs
> -
>
> Key: CASSANDRA-11190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
> Project: Cassandra
> Issue Type: Improvement
> Components: Legacy/Streaming and Messaging
> Reporter: Paulo Motta (Deprecated)
> Priority: Low
>
> Currently, if one node fails any phase of the repair (validation, streaming), 
> the repair session is aborted, but the other nodes are not notified and keep 
> doing either validation or syncing with other nodes.
> With CASSANDRA-10070 automatically scheduling repairs and potentially 
> scheduling retries, it would be nice to make sure all nodes abort failed 
> repairs in order to be able to start other repairs safely on the same nodes.
> From CASSANDRA-10070:
> bq. As far as I understood, if there are nodes A, B, C running repair, A is 
> the coordinator. If validation or streaming fails on node B, the coordinator 
> (A) is notified and fails the repair session, but node C will remain doing 
> validation and/or streaming, which could cause problems (or increased load) if 
> we start another repair session on the same range.
> bq. We will probably need to extend the repair protocol to perform this 
> cleanup/abort step on failure. We already have a legacy cleanup message that 
> doesn't seem to be used in the current protocol that we could maybe reuse to 
> clean up repair state after a failure. This repair abortion will probably have 
> intersection with CASSANDRA-3486. In any case, this is a separate (but 
> related) issue and we should address it in an independent ticket, and make 
> this ticket dependent on that.
> On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected 
> conditions/hangs:
> bq. I wonder if maybe we should have more of a fail-fast policy when there are 
> errors. For instance, if one node fails its validation phase, maybe it might 
> be worth failing right away and let the user re-trigger a repair once he has 
> fixed whatever was the source of the error, rather than still 
> differencing/syncing the other nodes.
> bq. Going a bit further, I think we should add 2 messages to interrupt the 
> validation and sync phase. If only because that could be useful to users if 
> they need to stop a repair for some reason, but also, if we get an error 
> during validation from one node, we could use that to interrupt the other 
> nodes and thus fail fast while minimizing the amount of work done uselessly.
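
The A/B/C scenario quoted from CASSANDRA-10070 can be sketched as a small model. This is an illustration only, not Cassandra's repair protocol or API: the Coordinator and Participant classes, the State enum, and the abort and on_failure methods are hypothetical names introduced here. It assumes the coordinator is told which node failed and then asks every other participant to abort its local work for that session.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class Participant:
    """Hypothetical stand-in for a node doing validation or streaming."""
    name: str
    state: State = State.RUNNING

    def abort(self, session_id: str) -> None:
        # Stop local work for this session; nodes that already stopped
        # (failed or aborted) are left unchanged.
        if self.state is State.RUNNING:
            self.state = State.ABORTED


@dataclass
class Coordinator:
    """Hypothetical coordinator that fails fast on the first error."""
    session_id: str
    participants: dict = field(default_factory=dict)
    failed: bool = False

    def add(self, node: Participant) -> None:
        self.participants[node.name] = node

    def on_failure(self, failed_node: str) -> list:
        # Mark the session failed, then notify every other participant
        # so none of them keeps validating or streaming.
        self.failed = True
        self.participants[failed_node].state = State.FAILED
        notified = []
        for name, node in self.participants.items():
            if name != failed_node:
                node.abort(self.session_id)
                notified.append(name)
        return notified


# Nodes A, B and C take part in one session; B reports a failure.
coordinator = Coordinator(session_id="session-1")
for node_name in ("A", "B", "C"):
    coordinator.add(Participant(node_name))

notified = coordinator.on_failure("B")
```

In this model, once B fails, no participant is left in the RUNNING state, which is the precondition the ticket wants before another repair is started on the same range.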



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (CASSANDRA-11190) Fail fast repairs

2016-04-19 Thread Joshua McKenzie (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua McKenzie updated CASSANDRA-11190:

Reviewer: Yuki Morishita

> Fail fast repairs
> -
>
> Key: CASSANDRA-11190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
> Project: Cassandra
> Issue Type: Improvement
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Minor
>





[jira] [Updated] (CASSANDRA-11190) Fail fast repairs

2016-02-19 Thread Aleksey Yeschenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-11190:
--
Issue Type: Improvement  (was: Bug)

> Fail fast repairs
> -
>
> Key: CASSANDRA-11190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
> Project: Cassandra
> Issue Type: Improvement
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Minor
>


