[
https://issues.apache.org/jira/browse/CASSANDRA-12901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671127#comment-15671127
]
Paulo Motta commented on CASSANDRA-12901:
-----------------------------------------
Thanks for the review, Yuki!
bq. Looks like we need to mark failed node and eliminate from anti-compacting
nodes rather than relying on FD alive check in AntiCompactionTask.
This would work, but it would be a bit wasteful, since any incremental repair
work done by a non-dead participant of a failed session would be lost, even
if it also participated in non-failed sessions. Furthermore, a node can still
die in the middle of anti-compaction, and right now the session would wait for
a timeout of 1 day before it unblocks. So the approach I took was to register
{{AntiCompactionTask}} on the FD, so that if a node fails during
anti-compaction the task fails immediately instead of hanging.
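As a rough illustration of the idea only (this is not the code from the patch): the sketch below uses hypothetical stand-in classes, {{SimpleFailureDetector}}, {{FailureListener}} and {{AntiCompactionTaskSketch}}, and plain strings as endpoints, rather than Cassandra's real failure detector API. It shows a task that registers itself as a listener and fails its future as soon as its participant is convicted, instead of waiting for a timeout.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a failure-detection listener (not Cassandra's interface).
interface FailureListener {
    void convict(String endpoint);
}

// Hypothetical minimal failure detector: notifies every listener when a node is marked dead.
final class SimpleFailureDetector {
    private final List<FailureListener> listeners = new CopyOnWriteArrayList<>();

    void register(FailureListener listener) { listeners.add(listener); }

    void unregister(FailureListener listener) { listeners.remove(listener); }

    int listenerCount() { return listeners.size(); }

    void markDead(String endpoint) {
        // CopyOnWriteArrayList iterates over a snapshot, so listeners may
        // safely unregister themselves from inside convict().
        for (FailureListener listener : listeners) {
            listener.convict(endpoint);
        }
    }
}

// Sketch of a task waiting for one participant to finish anti-compaction.
final class AntiCompactionTaskSketch implements FailureListener {
    private final String participant;
    private final SimpleFailureDetector fd;
    private final CompletableFuture<Void> result = new CompletableFuture<>();

    AntiCompactionTaskSketch(String participant, SimpleFailureDetector fd) {
        this.participant = participant;
        this.fd = fd;
    }

    CompletableFuture<Void> start() {
        fd.register(this);
        // Unregister once the task settles, whether it succeeded or failed.
        result.whenComplete((value, error) -> fd.unregister(this));
        return result;
    }

    // Called when the participant reports that anti-compaction finished.
    void onRemoteSuccess() {
        result.complete(null);
    }

    @Override
    public void convict(String endpoint) {
        // Only the death of this task's own participant fails the task.
        if (endpoint.equals(participant)) {
            result.completeExceptionally(
                new RuntimeException("Node " + endpoint + " died during anti-compaction"));
        }
    }
}
```

In this sketch, convicting an unrelated node leaves the task pending, while convicting the participant completes the future exceptionally right away, so a caller waiting on it unblocks with an error.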
I added some byteman dtests to validate that repair completes (with an error)
if a node fails during anti-compaction, validation or sync
([PR|https://github.com/riptano/cassandra-dtest/pull/1390]).
The updated patch and CI results are available below:
||2.2||3.0||3.X||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.X...pauloricardomg:3.X-12901]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-12901]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:12901]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-dtest/lastCompletedBuild/testReport/]|
> Repair may hang if node dies during sync
> ----------------------------------------
>
> Key: CASSANDRA-12901
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12901
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Paulo Motta
> Assignee: Paulo Motta
>
> Since the repair coordinator unregisters from the FD after validation
> (CASSANDRA-3569), if the initiator of a RemoteSyncTask fails, the coordinator
> will never know the sync task failed and will hang.
