[ https://issues.apache.org/jira/browse/CASSANDRA-12901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671127#comment-15671127 ]
Paulo Motta commented on CASSANDRA-12901:
-----------------------------------------

Thanks for the review, Yuki!

bq. Looks like we need to mark failed node and eliminate from anti-compacting nodes rather than relying on FD alive check in AntiCompactionTask.

This would work, but it would be a bit wasteful, since any incremental repair work done by any non-dead participant of a failed session would be lost, even if it participated in non-failed sessions. Furthermore, a node can still die in the middle of anti-compaction, and right now the session would wait for a timeout of 1 day until it unblocks.

So the approach I took was to register {{AntiCompactionTask}} on the FD, and if a node fails during anti-compaction the task fails immediately, preventing it from hanging.

I added some byteman dtests to validate that repair completes (with error) if a node fails during anti-compaction, validation or sync ([PR|https://github.com/riptano/cassandra-dtest/pull/1390]).

Updated patch and CI results available below:

||2.2||3.0||3.X||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.X...pauloricardomg:3.X-12901]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-12901]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:12901]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-dtest/lastCompletedBuild/testReport/]|

> Repair may hang if node dies during sync
> ----------------------------------------
>
> Key: CASSANDRA-12901
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12901
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
>
> Since the repair coordinator unregisters from the FD after validation
> (CASSANDRA-3569), if the initiator of a RemoteSyncTask fails, the coordinator
> will never know the sync task failed and hang.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
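
For illustration only, the idea described in the comment above (a task registers on the failure detector and fails immediately when a participant dies, instead of waiting for a timeout) could be sketched roughly as follows. This is not the code from the patch: {{SimpleFailureDetector}}, {{FailureListener}} and {{RemoteTask}} are hypothetical stand-ins invented for this sketch and do not correspond to Cassandra's actual classes or interfaces.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class FailureAwareTaskSketch {

    /** Hypothetical listener interface; not Cassandra's real failure detector API. */
    interface FailureListener {
        void onNodeFailure(String endpoint);
    }

    /** Hypothetical failure detector that only fans out failure notifications. */
    static class SimpleFailureDetector {
        private final List<FailureListener> listeners = new CopyOnWriteArrayList<>();

        void register(FailureListener listener) { listeners.add(listener); }

        void unregister(FailureListener listener) { listeners.remove(listener); }

        void reportFailure(String endpoint) {
            // CopyOnWriteArrayList iterates over a snapshot, so listeners may
            // safely unregister themselves while being notified.
            for (FailureListener listener : listeners)
                listener.onNodeFailure(endpoint);
        }

        int listenerCount() { return listeners.size(); }
    }

    /** Hypothetical task that waits on one remote participant and fails fast if it dies. */
    static class RemoteTask implements FailureListener {
        private final String participant;
        private final SimpleFailureDetector fd;
        private final CompletableFuture<Void> result = new CompletableFuture<>();

        RemoteTask(String participant, SimpleFailureDetector fd) {
            this.participant = participant;
            this.fd = fd;
        }

        CompletableFuture<Void> start() {
            fd.register(this);
            // Unregister once the task settles, whether it succeeded or failed.
            result.whenComplete((value, error) -> fd.unregister(this));
            return result;
        }

        /** Called when the remote participant reports that it finished its work. */
        void onRemoteCompletion() { result.complete(null); }

        @Override
        public void onNodeFailure(String endpoint) {
            // Only the failure of our own participant fails this task.
            if (participant.equals(endpoint))
                result.completeExceptionally(
                    new RuntimeException("Endpoint " + endpoint + " died"));
        }
    }

    public static void main(String[] args) {
        SimpleFailureDetector fd = new SimpleFailureDetector();
        CompletableFuture<Void> future = new RemoteTask("10.0.0.2", fd).start();
        fd.reportFailure("10.0.0.2");
        System.out.println("failed fast: " + future.isCompletedExceptionally());
    }
}
```

In this sketch, a caller waiting on the returned future is unblocked as soon as the failure is reported, and the listener is removed once the task settles so that completed tasks do not accumulate on the detector.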