[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results
[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775437#comment-16775437 ] Stefan Podkowinski commented on CASSANDRA-15027: LGTM +1 > Handle IR prepare phase failures less race prone by waiting for all results > --- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results
[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775361#comment-16775361 ] Blake Eggleston commented on CASSANDRA-15027: - Nice. Your follow on changes look good to me, I have 2 nits, but those can just be fixed on commit. * We should log the session id in compaction manager when an anti-compaction is cancelled (and probably when there's an error as well) * Some error handling should be added to the commit fixing the race between proposeFuture and hasFailure so nodetool doesn't hang if there's an error in the callback > Handle IR prepare phase failures less race prone by waiting for all results > --- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results
[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775178#comment-16775178 ] Stefan Podkowinski commented on CASSANDRA-15027: Your updates look like valuable improvement over the initial patch. I'm +1 in general as for the changes, but also fixed some additional minor issues and added a new tests: * [CASSANDRA-15027|https://github.com/spodkowinski/cassandra/commits/CASSANDRA-15027] * [circleci|https://circleci.com/gh/spodkowinski/cassandra/653] Please see comments with each commit in branch above for details. Also happy to discuss any of the changes (most likely the last commit) in another jira, if you feel it's out of scope for this ticket. > Handle IR prepare phase failures less race prone by waiting for all results > --- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results
[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774515#comment-16774515 ] Blake Eggleston commented on CASSANDRA-15027: - Thanks [~spo...@gmail.com]. I’ve extended your code so that in addition to waiting for other anti-compactions to complete, the coordinator also pro-actively cancels ongoing anti-compactions on the other participants. This avoids wasting time waiting for anti-compactions on other machines. The code does 3 things: * Adds a session state check to the {{isStopRequested}} method in the anti-compaction iterator. * The coordinator now sends failure messages to all participants when it receives a failure message from one of them in the prepare phase. It does not mark these participants as having failed internally though, since that would cause the nodetool session to immediately complete. Instead, it waits until it’s received messages from all the other nodes. * The participants will now respond with a failed prepare message if the anti-compaction completes, but the session was failed in the mean time. This prevents a dead lock on the coordinator in the case where the participant received a failure message between the time the anti-compaction completes and the callback fires. Let me know what you think. If everything looks ok to you, I’m +1 on committing. [trunk|https://github.com/bdeggleston/cassandra/tree/15027-trunk] [circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/15027-trunk] > Handle IR prepare phase failures less race prone by waiting for all results > --- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results
[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770869#comment-16770869 ] Stefan Podkowinski commented on CASSANDRA-15027: * [ [trunk|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-15027] ][ [circleci|https://circleci.com/workflow-run/2b027f87-cf45-48ee-8eae-45a563701bc6] ] > Handle IR prepare phase failures less race prone by waiting for all results > --- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org