[ 
https://issues.apache.org/jira/browse/CASSANDRA-12901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671127#comment-15671127
 ] 

Paulo Motta commented on CASSANDRA-12901:
-----------------------------------------

Thanks for the review Yuki!

bq. Looks like we need to mark failed node and eliminate from anti-compacting 
nodes rather than relying on FD alive check in AntiCompactionTask.

This would work, but it would be a bit wasteful: any incremental repair work 
done by a live participant of a failed session would be lost, even if that 
node also participated in sessions that did not fail. Furthermore, a node can 
still die in the middle of anti-compaction, and right now the session would 
wait for a timeout of 1 day before it unblocks. So the approach I took was to 
register {{AntiCompactionTask}} on the FD, so that if a node fails during 
anti-compaction the task fails immediately instead of hanging.
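
To illustrate the idea of registering a task on the FD, here is a minimal, 
self-contained Java sketch. It is not the code from the patch: 
{{NodeFailureListener}}, {{SimpleFailureDetector}} and 
{{AntiCompactionTaskSketch}} are hypothetical stand-ins for the real failure 
detector and task classes, and the endpoint strings are made up.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not Cassandra's real failure detector or AntiCompactionTask.
class AntiCompactionFailureSketch {

    /** Hypothetical callback invoked when the detector declares a node dead. */
    interface NodeFailureListener {
        void onNodeFailure(String endpoint);
    }

    /** Hypothetical failure detector that simply notifies its listeners. */
    static class SimpleFailureDetector {
        private final List<NodeFailureListener> listeners = new CopyOnWriteArrayList<>();

        void register(NodeFailureListener l) { listeners.add(l); }
        void unregister(NodeFailureListener l) { listeners.remove(l); }
        int listenerCount() { return listeners.size(); }

        void markDead(String endpoint) {
            // CopyOnWriteArrayList iterates over a snapshot, so listeners
            // may unregister themselves from inside the callback.
            for (NodeFailureListener l : listeners) l.onNodeFailure(endpoint);
        }
    }

    /** Task that waits on one participant and fails fast if that node dies. */
    static class AntiCompactionTaskSketch implements NodeFailureListener {
        private final String participant;
        private final CompletableFuture<Void> result = new CompletableFuture<>();

        private AntiCompactionTaskSketch(String participant) {
            this.participant = participant;
        }

        static AntiCompactionTaskSketch start(String participant, SimpleFailureDetector fd) {
            AntiCompactionTaskSketch task = new AntiCompactionTaskSketch(participant);
            fd.register(task);
            // Unregister once the task completes, successfully or not.
            task.result.whenComplete((ok, err) -> fd.unregister(task));
            return task;
        }

        /** Called when the participant reports that anti-compaction finished. */
        void onParticipantDone() { result.complete(null); }

        @Override
        public void onNodeFailure(String endpoint) {
            // Only the death of this task's own participant fails the task.
            if (endpoint.equals(participant)) {
                result.completeExceptionally(
                    new RuntimeException("Node " + endpoint + " died during anti-compaction"));
            }
        }

        CompletableFuture<Void> future() { return result; }
    }

    public static void main(String[] args) {
        SimpleFailureDetector fd = new SimpleFailureDetector();
        AntiCompactionTaskSketch task = AntiCompactionTaskSketch.start("10.0.0.2", fd);

        fd.markDead("10.0.0.3"); // unrelated node: task keeps waiting
        System.out.println(task.future().isDone()); // prints false

        fd.markDead("10.0.0.2"); // participant died: task fails right away
        System.out.println(task.future().isCompletedExceptionally()); // prints true
    }
}
```

In this sketch the caller never depends on a long timeout: the future 
completes exceptionally as soon as the detector reports the participant dead, 
and the failure of an unrelated node leaves the task untouched.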

I added some byteman dtests to validate that repair completes (with an error) 
if a node fails during anti-compaction, validation or sync 
([PR|https://github.com/riptano/cassandra-dtest/pull/1390]).

Updated patch and CI results available below:

||2.2||3.0||3.X||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-12901]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.X...pauloricardomg:3.X-12901]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-12901]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:12901]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.X-12901-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12901-dtest/lastCompletedBuild/testReport/]|


> Repair may hang if node dies during sync
> ----------------------------------------
>
>                 Key: CASSANDRA-12901
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12901
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> Since the repair coordinator unregisters from the FD after validation 
> (CASSANDRA-3569), if the initiator of a RemoteSyncTask fails, the coordinator 
> will never know the sync task failed and will hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
