[jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair

Sylvain Lebresne (JIRA) Tue, 30 Aug 2011 09:38:02 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093874#comment-13093874 ]


Sylvain Lebresne commented on CASSANDRA-2433:
---------------------------------------------

bq. Why do we need the new AE_SESSIONS stage?

If you mean "why AE_SESSIONS when we already have the AE stage?", then it is 
because repair pushes tasks onto the AE stage and then waits for them, so we 
would deadlock. If you mean "why a stage?", it felt cleaner than just a Thread 
now that we want to check for exceptions at the end of the execution. If you 
mean "why a stage rather than a simple ThreadExecutor?", it is a good question. 
I guess it was just some reflex of mine to get a JMXEnabledThreadPool, but it's 
probably not worth a stage, maybe not even the JMX enabledness.
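
To make the deadlock argument concrete, here is a minimal sketch using plain 
java.util.concurrent executors rather than Cassandra's stages. It assumes the 
stage is single-threaded; the class and method names are made up for the 
illustration and are not part of the patch. An outer "session" task submits an 
inner task and waits for it: on a shared single-threaded pool the inner task 
can never start, on a separate pool it completes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not Cassandra code.
final class StageDeadlockSketch {

    /**
     * Runs an outer task on sessionPool that submits an inner task to workPool
     * and blocks until it finishes. Returns true if the outer task completed
     * within the timeout, false if it was still stuck.
     */
    static boolean completes(ExecutorService sessionPool, ExecutorService workPool, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<?> outer = sessionPool.submit(() -> {
            Future<?> inner = workPool.submit(() -> { /* stand-in for validation work */ });
            try {
                inner.get(); // waits for work that may be queued behind this very task
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        });
        try {
            outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    /** Session and work share one single-threaded pool: the inner task never runs. */
    static boolean sameStage() throws Exception {
        ExecutorService ae = Executors.newSingleThreadExecutor();
        try {
            return completes(ae, ae, 500);
        } finally {
            ae.shutdownNow(); // interrupts the stuck outer task so the JVM can exit
        }
    }

    /** Session runs on its own pool, so the work pool is free to run the inner task. */
    static boolean separateStages() throws Exception {
        ExecutorService sessions = Executors.newSingleThreadExecutor();
        ExecutorService ae = Executors.newSingleThreadExecutor();
        try {
            return completes(sessions, ae, 5000);
        } finally {
            sessions.shutdownNow();
            ae.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("same stage completes: " + sameStage());
        System.out.println("separate stages complete: " + separateStages());
    }
}
```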

bq. I prefer using WrappedRunnable to a Callable when you want to allow 
exceptions but don't care about a return value

Agreed. I'll update the patch.
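
For readers unfamiliar with the idiom, the sketch below shows the general 
shape of a WrappedRunnable-style helper: a Runnable whose body may throw 
checked exceptions, which are rethrown unchecked so that whoever waits on the 
Future still sees the failure. This is a simplified stand-in written for this 
comment, not the class from the Cassandra tree, and the names ending in Sketch 
or Demo are assumptions.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified stand-in for a WrappedRunnable-style helper (illustrative only).
abstract class WrappedRunnableSketch implements Runnable {
    @Override
    public final void run() {
        try {
            runMayThrow();
        } catch (Exception e) {
            // Rethrow unchecked so the executor records the failure in the Future.
            throw new RuntimeException(e);
        }
    }

    protected abstract void runMayThrow() throws Exception;
}

final class WrappedRunnableDemo {

    /** Submits a task that fails with a checked exception and returns the message the waiter sees. */
    static String failureMessage() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(new WrappedRunnableSketch() {
                @Override
                protected void runMayThrow() throws Exception {
                    throw new IOException("stream failed");
                }
            });
            try {
                future.get();
                return null; // no failure observed
            } catch (ExecutionException e) {
                // e.getCause() is the RuntimeException wrapper; its cause is the original exception.
                return e.getCause().getCause().getMessage();
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter saw: " + failureMessage());
    }
}
```

Compared with a Callable, there is no dummy return value to invent, and the 
caller can still check for the exception through Future.get().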

bq. I think we can avoid a bunch of no-op onConvicts if RepairSession were to 
subscribe to FD directly instead of going through Gossip

Yeah, I kind of started with that, but the problem is that we must deal with 
the case of a node restarting before it has been convicted (especially if the 
conviction threshold is higher), which the FD won't see. We could deal with 
that last situation separately and have Gossip call some trigger into 
AntiEntropy on a gossip generation change to indicate that every started 
session involving the given endpoint should stop, but creating a dependency 
from gossip to anti-entropy didn't feel like a good idea a priori.
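
A rough sketch of the restart case, with entirely hypothetical types (these 
are not the Gossiper, FailureDetector or RepairSession classes, and the 
callback names are assumptions): the session subscribes to gossip-style 
notifications, so a generation bump for a participant fails the session even 
though no conviction was ever delivered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical repair session that reacts to both convictions and restarts.
final class RepairSessionSketch {
    private final Set<String> participants;
    private String failure; // null while the session is healthy

    RepairSessionSketch(Set<String> participants) {
        this.participants = new HashSet<>(participants);
    }

    /** Failure-detector style event: the endpoint was convicted. */
    synchronized void onConvict(String endpoint) {
        failIfParticipant(endpoint, "convicted");
    }

    /** Gossip style event: the endpoint announced a new generation, i.e. it restarted. */
    synchronized void onRestart(String endpoint) {
        failIfParticipant(endpoint, "restarted");
    }

    private void failIfParticipant(String endpoint, String reason) {
        // Events for endpoints outside this session are no-ops.
        if (failure == null && participants.contains(endpoint))
            failure = endpoint + " " + reason;
    }

    synchronized String failure() {
        return failure;
    }
}

// Hypothetical gossip-like component that tracks one generation number per endpoint.
final class GossipSketch {
    private final Map<String, Integer> generations = new HashMap<>();
    private final List<RepairSessionSketch> subscribers = new ArrayList<>();

    void register(RepairSessionSketch session) {
        subscribers.add(session);
    }

    /** Records a generation; a higher value for a known endpoint is treated as a restart. */
    void observe(String endpoint, int generation) {
        Integer previous = generations.put(endpoint, generation);
        if (previous != null && generation > previous)
            for (RepairSessionSketch session : subscribers)
                session.onRestart(endpoint);
    }
}

final class RestartDemo {
    public static void main(String[] args) {
        RepairSessionSketch session =
            new RepairSessionSketch(new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2")));
        GossipSketch gossip = new GossipSketch();
        gossip.register(session);

        gossip.observe("10.0.0.2", 1); // first sighting, nothing to report
        gossip.observe("10.0.0.2", 2); // restarted before any conviction happened
        System.out.println("session failure: " + session.failure());
    }
}
```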

> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>
>         Attachments: 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch, 
> 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 
> 0003-Report-streaming-errors-back-to-repair-v4.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 
> 2433.patch, 2433_v2.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to 
> clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up 
> the data partition.
> These issues together are making repair very difficult for nearly everyone 
> running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was 
> moved to 0.8 for a few reasons. This ticket is to fix the immediate issues 
> that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
