[
https://issues.apache.org/jira/browse/CASSANDRA-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163669#comment-13163669
]
Sylvain Lebresne commented on CASSANDRA-3112:
---------------------------------------------
bq. 1) Stream session or the stream doesn't have any progress (Read Timeout/rpc
timeout - Socket timeout might help)
But do you know why it makes no progress? Unless we know what can cause it,
I'm not sure what there is to fix.
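For reference, the socket timeout suggested in the quoted point could look roughly like the sketch below. It is only an illustration of a bounded blocking read using the standard java.net API, not Cassandra's streaming code; the class name, the method name and the 200 ms value are made up for the example. It detects a stall, but it does not tell us why the stream stalled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch: none of these names come from the Cassandra code base.
class StreamReadTimeoutSketch {
    /**
     * Connects to a local peer that never sends data and tries to read one byte.
     * Returns true if the read gave up with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 asks the OS for any free port. The listener completes the TCP
        // handshake through its backlog but never writes anything, which stands
        // in for a stream that makes no progress.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            socket.setSoTimeout(timeoutMillis); // bound every blocking read on this socket
            InputStream in = socket.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true; // no progress within the timeout: the caller could fail the session
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With something like this, a stalled stream surfaces as an exception that can be reported back to the repair session instead of blocking forever.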
{quote}
2) Validation compaction completed but the result tree is sent but not received.
3) Repair request is sent but the receiving node didn't receive it.
{quote}
How can we "lose" messages? Isn't TCP supposed to prevent this?
bq. 4) When we have a big repair which runs for hours it will be better to retry
the failed part rather than full retry.
Streaming is supposed to have some built-in retry, though I'm not sure there is
a situation where it is actually useful. But if we are talking about having a
repair fail because a node dies and then continuing it once the node is back
up, that would be nice, but I'm pretty sure it would be mightily complicated. In
particular, and to name only one difficulty, whether this is for the validation
compaction or the streaming itself, we would likely have a hard time making sure
that sstables haven't been compacted between the initial try and the retry (or
we'd risk hanging on to obsolete sstables forever). But in principle, that would
be nice. It is clearly not in the scope of this ticket in any case.
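To make the sstable concern concrete, here is a minimal sketch of the condition a retry would have to check. It assumes sstables can be identified by name and that the set involved in the initial try was recorded. The class, the method and the file names are hypothetical and do not correspond to anything in the code base.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: illustrates the condition only, not an actual repair API.
class RetryGuardSketch {
    /**
     * A partial retry is only meaningful if every sstable recorded at the
     * initial try is still live, i.e. none has been compacted away since.
     */
    static boolean safeToRetry(Set<String> recordedAtFirstTry, Set<String> liveNow) {
        return liveNow.containsAll(recordedAtFirstTry);
    }

    public static void main(String[] args) {
        Set<String> recorded = new HashSet<>(Arrays.asList("ks-cf-1", "ks-cf-2"));
        // "ks-cf-2" was compacted into "ks-cf-3" between the try and the retry.
        Set<String> live = new HashSet<>(Arrays.asList("ks-cf-1", "ks-cf-3"));
        System.out.println("safe to retry: " + safeToRetry(recorded, live));
    }
}
```

The check itself is trivial. The hard part is what the sketch leaves out: either compaction has to be kept away from those sstables until the retry happens (which is how we could end up hanging on to obsolete sstables), or the retry has to cope with the check failing.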
> Make repair fail when an unexpected error occurs
> ------------------------------------------------
>
> Key: CASSANDRA-3112
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3112
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Priority: Minor
> Labels: repair
> Fix For: 1.0.6
>
> Attachments: 0003-Report-streaming-errors-back-to-repair-v4.patch,
> 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch
>
>
> CASSANDRA-2433 makes it so that nodetool repair will fail if a node
> participating in the repair dies before completing its part of the repair. This
> handles most of the situations where repair was previously hanging, but repair
> can still hang if an unexpected error occurs during either the merkle tree
> creation (say, an on-disk corruption triggers an IOError) or during streaming
> (though I'm not sure what could make streaming fail other than 'one of the
> nodes died' (besides a bug)).