[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055421#comment-13055421
 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. We have also spotted very noticeable issues with full GCs when the merkle 
trees are passed around. Hopefully this could fix that too.

This does make sure that we don't run multiple validations at the same time 
and that we keep only a small number of merkle trees in memory at once. So I 
suppose this could help on the GC side. But overall I don't know how 
optimistic to be about that, in part because I'm not sure what causes your 
issues. It can't hurt on that front, at least.
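
For illustration, here is a minimal sketch of the scheduling idea (request 
trees for a single CF/range, block until every node answers, then move on). 
All names in it (MerkleTree, Range, requestTrees, compareAndStream) are 
hypothetical stand-ins, not the classes the patch actually touches:

{code:java}
// Sketch of "one validation at a time" scheduling: the next tree request is
// only issued once every node has answered the previous one, so at most one
// set of merkle trees per repair session is alive on the heap.
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SequentialRepairSketch {
    static class MerkleTree {}  // placeholder for the per-range hash tree
    static class Range {}       // placeholder for a token range

    // Stand-in for sending a TreeRequest to each neighbor plus this node.
    static CompletableFuture<List<MerkleTree>> requestTrees(String cf, Range r) {
        return CompletableFuture.completedFuture(List.of(new MerkleTree()));
    }

    static void compareAndStream(String cf, Range r, List<MerkleTree> trees) {
        // diff the trees and stream the mismatching ranges (elided)
    }

    public static void main(String[] args) throws Exception {
        Range range = new Range();
        for (String cf : List.of("Standard1", "Standard2")) {
            // Blocking here serializes the validations: the trees for this
            // CF/range become garbage before the next request is sent.
            List<MerkleTree> trees = requestTrees(cf, range).get();
            compareAndStream(cf, range, trees);
        }
    }
}
{code}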

bq. I will see if I can get this patch tested somewhere if it is ready for that.

I believe it should be ready for that.

bq. would it be a potentially interesting idea to separate tombstones in 
different sstables.

The thing is that some tombstones may be irrelevant because some update 
supersedes them (this is especially true of row tombstones). Hence basing a 
repair on tombstones only may transfer irrelevant data. Depending on the use 
case this will be more or less of a big deal. Also, reads will be impacted in 
that we will often have to hit twice as many sstables. Given that it's not a 
crazy idea either to want to repair data regularly (if only for the 
durability guarantee), I don't know if it is worth the trouble (we would have 
to separate tombstones from data at flush time, maintain the two separate 
sets of data/tombstone sstables, etc...).
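
To make the superseded-tombstone point concrete, here is a toy illustration 
of plain last-write-wins reconciliation. It is not Cassandra's actual code; 
the Cell record and reconcile helper are invented for the example:

{code:java}
// Toy illustration of a superseded tombstone: reconciliation keeps whichever
// cell has the higher timestamp, so a row tombstone written at t=10 is
// shadowed by an insert at t=20, and shipping that tombstone during repair
// would transfer irrelevant data.
public class SupersededTombstoneSketch {
    record Cell(String value, long timestamp, boolean isTombstone) {}

    // last-write-wins: the higher timestamp shadows the other cell
    static Cell reconcile(Cell a, Cell b) {
        return a.timestamp() >= b.timestamp() ? a : b;
    }

    public static void main(String[] args) {
        Cell rowDelete = new Cell(null, 10, true);  // row tombstone at t=10
        Cell update = new Cell("v2", 20, false);    // later update at t=20
        Cell winner = reconcile(rowDelete, update);
        // The update wins: the tombstone no longer affects reads, so a
        // tombstone-only repair would have moved it for nothing.
        System.out.println(winner.isTombstone() ? "tombstone wins" : "update wins");
    }
}
{code}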

bq. make compaction deterministic or synchronized by a master across nodes

Pretty sure we want to avoid going to a master architecture for everything if 
we can. Having a master means that failure handling is more difficult (think 
network partitions, for instance) and requires leader election and such, and 
the whole point of Cassandra being fully distributed is to avoid those. Even 
without considering that, synchronizing compaction means synchronizing flush 
somehow, and you would have to be precise if you're going to use 
whole-sstable md5s, which will be hard and quite probably inefficient.

> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and 
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it will send a number of merkle tree requests to 
> its neighbors as well as to itself, and it assumes for correctness that the 
> building of those trees will start on every node at roughly the same time 
> (if not, we end up comparing data snapshots taken at different times and 
> will thus mistakenly repair a lot of useless data). This is bogus for many 
> reasons:
> * Because validation compactions run on the same executor as other 
> compactions, the start of the validation on the different nodes is subject 
> to other compactions. 0.8 mitigates this somewhat by being multi-threaded 
> (so there is less chance of being blocked a long time by a long-running 
> compaction), but since the compaction executor is bounded, it's still a 
> problem.
> * If you run a nodetool repair without arguments, it will repair every CF. 
> As a consequence it will generate lots of merkle tree requests, and all of 
> those requests will be issued at the same time. Because even in 0.8 the 
> compaction executor is bounded, some of those validations will end up being 
> queued behind the first ones. Even assuming that the different validations 
> are submitted in the same order on each node (which isn't guaranteed 
> either), there is no guarantee that on all nodes the first validation will 
> take the same time, hence desynchronizing the queued ones.
> Overall, it is important for the precision of repair that for a given CF 
> and range (which is the unit at which trees are computed), we make sure 
> that all nodes will start the validation at the same time (or, since we 
> can't do magic, as close as possible).
> One (reasonably simple) proposition to fix this would be to have repair 
> schedule validation compactions across nodes one by one (i.e., one CF/range 
> at a time), waiting for all nodes to return their tree before submitting 
> the next request. Then on each node, we should make sure that the node will 
> start the validation compaction as soon as requested. For that, we probably 
> want a specific executor for validation compactions (sketched after this 
> description) and either:
> * we fail the whole repair whenever one node is not able to execute the 
> validation compaction right away (because no thread is available right 
> away), or
> * we simply tell the user that if he starts too many repairs in parallel, 
> he may start seeing some of those repairing more data than they should.
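
A hedged sketch of the dedicated-executor idea above: a pool backed by a 
SynchronousQueue either hands a validation to an idle thread immediately or 
rejects it outright, matching the "fail right away" option. The pool size (1) 
and all names are assumptions for illustration, not values an actual patch 
would necessarily use:

{code:java}
// Sketch of a dedicated validation executor that either starts a validation
// immediately or rejects it, so a repair can fail fast instead of silently
// queuing behind other validations.
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ValidationExecutorSketch {
    public static void main(String[] args) {
        // A SynchronousQueue holds no tasks: each submission is either handed
        // to an idle thread right away or refused by the AbortPolicy.
        ThreadPoolExecutor validationExecutor = new ThreadPoolExecutor(
                1, 1, 60, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.AbortPolicy());
        try {
            validationExecutor.execute(() -> sleep(200)); // a long tree build
            validationExecutor.execute(() -> sleep(200)); // a concurrent repair
        } catch (RejectedExecutionException e) {
            // No thread was free: fail this repair right away (or just warn
            // that parallel repairs may repair more data than they should).
            System.out.println("validation rejected, failing repair");
        } finally {
            validationExecutor.shutdown();
        }
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}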

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
