[ 
https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 2433.patch

Attaching a rebase of the first two previous patches as '2433.patch'. That is, 
this patch registers repair with gossip so that repair fails, and reports the 
failure to the user, when a node participating in the repair dies. Compared to 
the previous version, it fails fast, because that is now the easier thing to do 
and, imho, the better option.

I should mention that while it is lame that repair gets stuck when a node dies, 
and we should fix that, this patch means that if a node is wrongly marked down, 
we will fail repair for no reason (but I suppose that's a failure detector 
problem).
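
To illustrate the fail-fast behaviour described above, here is a minimal, 
self-contained sketch. It is not the code from 2433.patch: the names 
ToyFailureDetector, NodeFailureListener and ToyRepairSession are hypothetical 
stand-ins, and it assumes a modern JDK rather than the one 0.8 targets. The 
idea is only that a repair session registers for node-down notifications and 
fails as soon as one of its participants is reported dead.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical callback interface; NOT Cassandra's actual gossip API.
interface NodeFailureListener {
    void onNodeDown(String endpoint);
}

// Hypothetical stand-in for gossip / the failure detector.
class ToyFailureDetector {
    private final List<NodeFailureListener> listeners = new CopyOnWriteArrayList<>();

    void register(NodeFailureListener listener) { listeners.add(listener); }

    void unregister(NodeFailureListener listener) { listeners.remove(listener); }

    // Called when an endpoint is convicted, rightly or wrongly.
    void markDown(String endpoint) {
        for (NodeFailureListener listener : listeners) {
            listener.onNodeDown(endpoint);
        }
    }
}

// Hypothetical repair session that fails fast when a participant dies.
class ToyRepairSession implements NodeFailureListener {
    private final Set<String> participants;
    private final CompletableFuture<Void> result = new CompletableFuture<>();

    ToyRepairSession(Set<String> participants, ToyFailureDetector detector) {
        this.participants = participants;
        detector.register(this);
        // Stop listening once the session has finished, successfully or not.
        result.whenComplete((value, error) -> detector.unregister(this));
    }

    @Override
    public void onNodeDown(String endpoint) {
        // Only nodes taking part in this repair matter; others are ignored.
        if (participants.contains(endpoint)) {
            result.completeExceptionally(
                new IllegalStateException("Repair failed: endpoint " + endpoint + " died"));
        }
    }

    // Marks the session as successfully finished.
    void complete() { result.complete(null); }

    boolean isDone() { return result.isDone(); }

    boolean hasFailed() { return result.isCompletedExceptionally(); }
}
```

Note that in this sketch any markDown call for a participant fails the session, 
whether or not the node is really dead, which is exactly the false-positive 
trade-off mentioned above.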

The attached patch is against 0.8. It has no upgrade consequences of any sort 
and is a reasonably simple patch, so I think it could be worth committing in 
0.8. The rest of what was in previous patches 0003 and 0004 cannot go into 0.8 
because it changes the wire protocol, so I will rebase that against trunk 
directly, maybe in another ticket. Having this first patch committed would help 
with that though :)

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.4
>
>         Attachments: 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 
> 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 
> 0003-Report-streaming-errors-back-to-repair-v4.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch
>
>
> When running repair in cases where a stream fails, we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to 
> clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up 
> the data partition.
> These issues together are making repair very difficult for nearly everyone 
> running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was 
> moved to 0.8 for a few reasons. This ticket is to fix the immediate issues 
> that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
