[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2816:
----------------------------------------

    Attachment: 2816_0.8_v3.patch

bq. rebased

You rebased it against trunk while the fix version is still 0.8.2. I agree 
that this feels like a bigger change than what we would usually want for a 
minor release, but repair is really a major pain point for users. And to tell 
the truth, it's worse in 0.8 than it is in 0.7: even though the problem this 
patch solves exists in 0.7 as well, the splitting of repair into ranges done 
for 0.8 exacerbates it. Moreover, this patch is fairly well delimited in what 
it changes, and it doesn't make any change to the wire protocol or anything 
else that would make upgrades a problem. So I'm actually in favor of taking 
the small risk of pushing this into 0.8.2 (and being very vigilant about 
testing repair extensively before the release). So for now, attaching a 
rebase against 0.8 with the tests fixed.

bq. and switched to unbounded executor for validations.

Ok, I realize that I'm not sure I understand what you mean by unbounded 
executor. In your rebased version, the ValidationExecutor apparently uses the 
first constructor of DebuggableThreadPoolExecutor, which constructs a 
single-threaded executor, and that is not what we want. Sure, the queue of 
that executor is unbounded, but if that was the concern, there was a 
misunderstanding: the queue has always been unbounded, even in my initial 
patch. What concurrent_validators was controlling is the number of core 
threads, and we need multiple core threads if we want multiple concurrent 
repairs to work correctly.
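
To make the distinction concrete, here is a minimal sketch in plain 
java.util.concurrent terms (this only illustrates the point; the class and 
method names are made up for the sketch and are not the actual 
DebuggableThreadPoolExecutor constructors):

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ValidationExecutorSketch
{
    // Roughly what the rebased version ends up with: a single core thread in
    // front of an unbounded queue. A second concurrent repair's validation
    // just waits in the queue behind the first one.
    static ThreadPoolExecutor singleThreaded()
    {
        return new ThreadPoolExecutor(1, 1, 60, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<Runnable>());
    }

    // What concurrent_validators was meant to control: several core threads
    // (the queue is still unbounded), so validations for concurrent repairs
    // can actually run in parallel.
    static ThreadPoolExecutor multiThreaded(int concurrentValidators)
    {
        return new ThreadPoolExecutor(concurrentValidators, concurrentValidators,
                                      60, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<Runnable>());
    }
}
{code}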

Now, the idea behind a default of 2 core threads was that I see a reason to 
want 2 concurrent repairs, but I don't see a very good reason to want more 
(and it's configurable if someone really needs more). I'll gladly admit it's 
not a marvelous default, and the patch I've just attached uses the same 
default as concurrent_compactors, which is maybe less "random". We could 
instead have an executor with an unbounded number of threads that spawns a 
new thread whenever needed, making sure we never queue a validation 
compaction, but that seems a little bit dangerous to me. It seems more 
reasonable to have a (configurable) reasonable number of threads and let 
validations queue up if the user starts more than that number of concurrent 
repairs. That will impact the precision of those repairs, but it would be the 
user's fault, and it's a better way to deal with such a fault than starting 
an unreasonable number of validation compactions that would starve memory 
(likely on more than one node, btw). But if you still think it's better to 
have an unbounded number of threads, I won't fight over this.
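
For comparison, the "unbounded threads" alternative would essentially be a 
cached thread pool: never queue, always spawn. A quick sketch (again plain 
java.util.concurrent, purely illustrative, not something from the patch):

{code}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedValidationExecutorSketch
{
    // Never queues a task: each submitted validation either reuses an idle
    // thread or spawns a new one (same construction as
    // Executors.newCachedThreadPool). With many concurrent repairs this means
    // that many validation compactions run at once, each building a merkle
    // tree, which is the memory pressure I'd rather avoid.
    static ThreadPoolExecutor unbounded()
    {
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE, 60, TimeUnit.SECONDS,
                                      new SynchronousQueue<Runnable>());
    }
}
{code}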

bq. tests do not compile but I believe that was already the case w/ v1

Yes, I completely forgot to update the unit tests, sorry. The attached patch 
fixes them.


> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch, 
> 2816-v2.txt, 2816_0.8_v3.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and 
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it will send a number of merkle tree requests to 
> its neighbors as well as to itself, and assumes for correctness that the 
> building of those trees will start on every node at roughly the same time 
> (if not, we end up comparing data snapshots taken at different times and 
> will thus mistakenly repair a lot of useless data). This is bogus for many 
> reasons:
> * Because validation compactions run on the same executor as other 
> compactions, the start of the validation on the different nodes is subject 
> to other compactions. 0.8 mitigates this somewhat by being multi-threaded 
> (and thus there is less chance of being blocked for a long time by a 
> long-running compaction), but since the compaction executor is bounded, it's 
> still a problem.
> * If you run a nodetool repair without arguments, it will repair every CF. 
> As a consequence, it will generate lots of merkle tree requests, and all of 
> those requests will be issued at the same time. Because the compaction 
> executor is bounded even in 0.8, some of those validations will end up being 
> queued behind the first ones. Even assuming that the different validations 
> are submitted in the same order on each node (which isn't guaranteed 
> either), there is no guarantee that the first validation will take the same 
> time on all nodes, hence desynchronizing the queued ones.
> Overall, it is important for the precision of repair that for a given CF and 
> range (which is the unit at which trees are computed), we make sure that all 
> nodes will start the validation at the same time (or, since we can't do 
> magic, as close to it as possible).
> One (reasonably simple) proposition to fix this would be to have repair 
> schedule validation compactions across nodes one by one (i.e., one CF/range 
> at a time), waiting for all nodes to return their tree before submitting the 
> next request. Then on each node, we should make sure that the validation 
> compaction starts as soon as it is requested. For that, we probably want a 
> specific executor for validation compactions and:
> * either we fail the whole repair whenever one node is not able to execute 
> the validation compaction right away (because no threads are available right 
> away),
> * or we simply tell the user that if he starts too many repairs in parallel, 
> he may start seeing some of those repair more data than they should.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

