[ 
https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair.patch
                0003-Report-streaming-errors-back-to-repair.patch
                0002-Register-in-gossip-to-handle-node-failures.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch

Attached patches are against 0.8.

This tries to catch what can go wrong with repair and report it back to the 
user by making the full repair throw an exception. More precisely:
  * patch 0001: adds a method to repair for reporting failure and propagates 
that up to the repair session. This puts the repair session on a specific stage 
(instead of having RepairSession be a Thread) and uses a future to allow 
waiting on completion. This allows a cleaner API to deal with errors 
(Future.get() simply throws an ExecutionException) and adds the advantage of 
stage management to repair sessions.
  * patch 0002: makes the repair session register with gossip so that it is 
informed when a node dies, and fails the session when that happens.
  * patch 0003: reports errors during streaming to the repair session. This 
actually introduces a generic way to handle streaming failures, and after that 
we should probably update the other users of streaming to deal correctly with 
failures too.
  * patch 0004: catches errors during validation compaction and pushes them up 
to repair (whether those happen on the coordinator of the repair or not).
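
To illustrate the idea behind patch 0001, here is a minimal sketch of running a 
session on an executor and surfacing its failure through Future.get(). This is 
not the code from the patches: RepairSessionSketch and runAndReport are 
hypothetical names, and a plain single-thread ExecutorService stands in for a 
Cassandra stage.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a repair session; not Cassandra's actual class.
class RepairSessionSketch implements Callable<Void> {
    private final boolean shouldFail;

    RepairSessionSketch(boolean shouldFail) {
        this.shouldFail = shouldFail;
    }

    @Override
    public Void call() throws Exception {
        // Simulate an error raised somewhere inside the session.
        if (shouldFail)
            throw new IOException("streaming failed");
        return null;
    }

    // Runs the session on an executor (assumed here to play the role of a
    // stage) and returns the failure message, or null on success.
    static String runAndReport(boolean shouldFail) {
        ExecutorService stage = Executors.newSingleThreadExecutor();
        try {
            Future<Void> future = stage.submit(new RepairSessionSketch(shouldFail));
            future.get(); // blocks until the session completes or fails
            return null;
        } catch (ExecutionException e) {
            // The exception thrown by call() is available as the cause.
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            stage.shutdown();
        }
    }
}
```

The caller only has to wait on the future: any exception thrown inside the 
session comes back wrapped in an ExecutionException, so no separate error 
channel is needed.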

Note that this includes streaming failures and thus includes material from 
Aaron Morton's patch attached to CASSANDRA-2088 but, contrary to that patch, 
it takes the approach of failing fast. This means that if streaming fails on a 
file, it fails the streaming altogether (same for repair). I think this is 
simpler code-wise and more useful from the user's point of view, since a 
failure means the user will have to retry anyway.
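
As a rough sketch of what failing fast means here (again, not the code from 
the patches; FailFastStreamSketch, FileSender, streamAll and attempted are all 
hypothetical names), the first file that fails aborts the whole stream and the 
remaining files are never attempted:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of fail-fast streaming; not Cassandra's API.
class FailFastStreamSketch {
    interface FileSender {
        void send(String file) throws IOException;
    }

    // Sends files in order. The first IOException propagates to the caller,
    // so the files after the failing one are never attempted.
    static void streamAll(List<String> files, FileSender sender) throws IOException {
        for (String file : files)
            sender.send(file);
    }

    // Demo helper: returns the files that were attempted when sending
    // `badFile` fails.
    static List<String> attempted(List<String> files, String badFile) {
        List<String> attempts = new ArrayList<>();
        try {
            streamAll(files, file -> {
                attempts.add(file);
                if (file.equals(badFile))
                    throw new IOException("failed to stream " + file);
            });
        } catch (IOException e) {
            // The stream as a whole has failed; the caller is expected to retry.
        }
        return attempts;
    }
}
```

With files a, b and c and a failure on b, only a and b are attempted; there is 
no per-file retry state to track.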

Last but not least, this makes some modifications to messages. So either this 
goes into 0.8.0 (which I think it should, because this really is a bug fix and 
fixes something that is a pain for users), or we should add a new messaging 
version for 0.8.0 and modify this to take it into account (we should probably 
add a 0.8.0 version to the messaging service anyway).


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 
> 0002-Register-in-gossip-to-handle-node-failures.patch, 
> 0003-Report-streaming-errors-back-to-repair.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to 
> clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up 
> the data partition.
> These issues together are making repair very difficult for nearly everyone 
> running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was 
> moved to 0.8 for a few reasons. This ticket is to fix the immediate issues 
> that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
