[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059459#comment-13059459 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. So... 12 node cluster, this is maybe ugly, I know, but start repair on all 
of them.

Was it started on all of them? If so, this is "kind of" expected, in the sense 
that the patch assumes that each node does not run more than 2 repairs (for any 
column family) at the same time (this is configurable through the new 
concurrent_validators option, but it's probably better to stick to 2 and 
stagger the repairs). If you do more than that (that is, if you ran repair on 
all nodes at the same time and RF > 2), then we're back to our old demons.

bq. I have really no idea if this is the case, but I am getting the hunch that 
this node has ended up streaming out some of the data it is getting in. Would 
this be possible?

Not really. That is, it could be that you create a merkle tree on some data 
and, once you start streaming, you pick up data that was just streamed to you 
and wasn't there when the tree was computed. This patch is supposed to fix this 
in part, but it can still happen if you run repairs in parallel on neighboring 
nodes. However, you shouldn't get into a situation where 2 nodes stream forever 
because they keep picking up what was just streamed to them, because what is 
streamed is determined at the very beginning of the streaming session.
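
To make that last point concrete, here is a minimal sketch (illustrative 
Python, not the actual Java implementation) of why a session cannot chase its 
own tail:

{code:python}
# Illustrative only: the set of things to stream is captured once, when
# the session is created, so data streamed *to* a node while the session
# runs can never be added to that same session's outgoing set.

class StreamingSession:
    def __init__(self, sstables_to_stream):
        # The outgoing set is frozen at session creation time.
        self.outgoing = list(sstables_to_stream)

    def receive(self, incoming_sstable):
        # Incoming data lands in new sstables; it is never appended to
        # self.outgoing, so the session always terminates.
        pass

    def run(self, send):
        for sstable in self.outgoing:
            send(sstable)

# This session streams exactly these two sstables, no matter what
# arrives while it runs.
StreamingSession(["sstable-1", "sstable-2"]).run(send=print)
{code}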

So my first question would be: were all those repairs started in parallel? If 
so, you shall not do this :). CASSANDRA-2606 and CASSANDRA-2610 are here to 
help make repairing a full cluster much easier (and more efficient), but right 
now it's more about getting patches in one at a time.
If the repairs were started one at a time, in a rolling fashion, then we do 
have an unknown problem somewhere.
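
By "rolling" I mean nothing fancier than the following sketch (the host names 
are assumptions for the example; nodetool repair is the real command):

{code:python}
# Hypothetical driver for a rolling repair: strictly one node at a time.
import subprocess

HOSTS = ["node1", "node2", "node3"]  # ...and so on through the cluster

for host in HOSTS:
    # run() blocks until this node's repair completes, so no two
    # repairs ever overlap.
    subprocess.run(["nodetool", "-h", host, "repair"], check=True)
{code}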

> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and 
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it sends a number of merkle tree requests to its 
> neighbors as well as to itself, and assumes for correctness that the building 
> of those trees starts on every node at roughly the same time (if not, we end 
> up comparing data snapshots taken at different times and will thus mistakenly 
> repair a lot of data needlessly). This is bogus for many reasons:
> * Because validation compaction runs on the same executor as other 
> compactions, the start of the validation on the different nodes is subject to 
> other compactions. 0.8 mitigates this somewhat by being multi-threaded (and 
> thus there is less chance of being blocked a long time by a long-running 
> compaction), but the compaction executor being bounded, it's still a problem.
> * If you run a nodetool repair without arguments, it will repair every CF. 
> As a consequence it will generate lots of merkle tree requests, and all of 
> those requests will be issued at the same time. Because even in 0.8 the 
> compaction executor is bounded, some of those validations will end up being 
> queued behind the first ones. Even assuming that the different validations 
> are submitted in the same order on each node (which isn't guaranteed either), 
> there is no guarantee that the first validation will take the same time on 
> every node, hence desynchronizing the queued ones.
> Overall, it is important for the precision of repair that for a given CF and 
> range (which is the unit at which trees are computed), we make sure that all 
> nodes start the validation at the same time (or, since we can't do magic, as 
> close to it as possible).
> One (reasonably simple) proposition to fix this would be to have repair 
> schedule validation compactions across nodes one by one (i.e., one CF/range 
> at a time), waiting for all nodes to return their tree before submitting the 
> next request (see the sketch below). Then on each node, we should make sure 
> that the node starts the validation compaction as soon as it is requested. 
> For that, we probably want a specific executor for validation compaction, 
> and then:
> * either we fail the whole repair whenever one node is not able to execute 
> the validation compaction right away (because no threads are available right 
> away),
> * or we simply tell the user that if they start too many repairs in 
> parallel, they may see some of those repairs repairing more data than they 
> should.
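
In other words, the proposed scheduling boils down to something like the 
following sketch (illustrative Python, not the attached patch; request_tree 
and compare_and_stream are hypothetical stand-ins for the tree request and 
the differencing/streaming step):

{code:python}
def repair(cf_ranges, nodes, request_tree, compare_and_stream):
    for cf, token_range in cf_ranges:
        # Ask every node (neighbors and self) for its tree at once, so
        # all validations for this CF/range start together...
        futures = [request_tree(node, cf, token_range) for node in nodes]
        # ...then block until every tree is back before submitting the
        # next CF/range, so queued validations cannot drift apart.
        trees = [future.result() for future in futures]
        compare_and_stream(cf, token_range, trees)
{code}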

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira