[
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054237#comment-13054237
]
Jonathan Ellis edited comment on CASSANDRA-2816 at 6/24/11 4:48 AM:
--------------------------------------------------------------------
Supporting actual live-reading of snapshotted sstables is a little more than
"polishing." It would be cool, but I wouldn't want it to block fixing repair.
was (Author: jbellis):
Supporting actual live-reading of snapshotted sstables is a little more
than "polishing."
> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
> Key: CASSANDRA-2816
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Labels: repair
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it sends a number of merkle tree requests to its
> neighbors as well as to itself, and it assumes, for correctness, that the
> building of those trees starts on every node at roughly the same time (if
> not, we end up comparing data snapshotted at different times and will thus
> mistakenly repair a lot of data that is not actually out of sync). This is
> bogus for many reasons:
> * Because validation compaction runs on the same executor as other
> compactions, when the validation actually starts on a given node depends on
> whatever other compactions are running there. 0.8 mitigates this somewhat by
> being multi-threaded (there is less chance of being blocked for a long time
> by a long-running compaction), but since the compaction executor is bounded,
> it is still a problem; the toy example after this list illustrates the
> effect.
> * If you run nodetool repair without arguments, it repairs every CF. As a
> consequence it generates lots of merkle tree requests, all issued at the
> same time. Because even in 0.8 the compaction executor is bounded, some of
> those validations end up queued behind the first ones. Even assuming the
> different validations are submitted in the same order on each node (which
> isn't guaranteed either), there is no guarantee that the first validation
> takes the same time on every node, hence desynchronizing the queued ones.
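> The following toy example (plain Java, not Cassandra code; all names are
> made up) shows the queueing effect described above: when validations share
> one bounded executor with regular compactions, a validation submitted "now"
> may only start after an unrelated long-running compaction finishes, so two
> nodes that received the same tree request can take their snapshots at very
> different times.
> {code}
> import java.util.concurrent.*;
>
> public class SharedExecutorDemo
> {
>     public static void main(String[] args) throws Exception
>     {
>         // A single compaction thread, shared by compactions and validations.
>         ExecutorService compactionExecutor = Executors.newFixedThreadPool(1);
>
>         // An unrelated, long-running compaction grabs the only thread.
>         compactionExecutor.submit(() -> {
>             Thread.sleep(5000); // pretend this is a big compaction
>             return null;
>         });
>
>         // The validation requested by repair has to wait for it to finish,
>         // even though the coordinator thinks it started immediately.
>         long requested = System.currentTimeMillis();
>         compactionExecutor.submit(() -> {
>             System.out.printf("validation started %d ms after it was requested%n",
>                               System.currentTimeMillis() - requested);
>             return null;
>         }).get();
>
>         compactionExecutor.shutdown();
>     }
> }
> {code}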
> Overall, it is important for the precision of repair that for a given CF and
> range (which is the unit at which trees are computed), we make sure that all
> nodes start the validation at the same time (or, since we can't do magic, as
> close to it as possible).
> One (reasonably simple) proposition to fix this would be to have repair
> schedule validation compactions across nodes one by one (i.e., one CF/range
> at a time), waiting for all nodes to return their tree before submitting the
> next request. Then on each node, we should make sure that the validation
> compaction starts as soon as it is requested. For that, we probably want a
> dedicated executor for validation compactions and then either:
> * fail the whole repair whenever one node is not able to execute the
> validation compaction right away (because no thread is available right
> away), or
> * simply tell the user that if they start too many repairs in parallel, they
> may see some of those repairs repairing more data than they should.
> A rough sketch of this scheme follows.
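> The sketch below is a hypothetical illustration of that proposition (the
> names VALIDATION_EXECUTOR, TreeRequester, repairSequentially and so on are
> made up, not the actual Cassandra API): a validation-only executor that
> either starts the work immediately or rejects it, and a coordinator loop
> that handles one CF/range at a time, waiting for every node's tree before
> moving on.
> {code}
> import java.util.*;
> import java.util.concurrent.*;
>
> public class SequentialRepairSketch
> {
>     static class MerkleTree {} // stand-in for the real merkle tree
>
>     // "Ask this endpoint to run a validation compaction for cf/range and
>     // send back its merkle tree" -- abstracted away for the sketch.
>     interface TreeRequester
>     {
>         Future<MerkleTree> requestTree(String endpoint, String cf, String range);
>     }
>
>     // Validation-only executor: with a SynchronousQueue, submit() either
>     // hands the task to a free thread right away or throws
>     // RejectedExecutionException, so a validation never sits queued behind
>     // other work.
>     static final ExecutorService VALIDATION_EXECUTOR =
>         new ThreadPoolExecutor(0, 2, 60, TimeUnit.SECONDS,
>                                new SynchronousQueue<Runnable>());
>
>     // Replica side: start the validation now, or make the repair fail loudly.
>     static Future<MerkleTree> submitValidation(Callable<MerkleTree> validation)
>     {
>         try
>         {
>             return VALIDATION_EXECUTOR.submit(validation);
>         }
>         catch (RejectedExecutionException e)
>         {
>             throw new RuntimeException("no validation thread available, aborting repair", e);
>         }
>     }
>
>     // Coordinator side: one CF/range at a time; all trees must come back
>     // before the next request goes out, so the snapshots for a given unit
>     // of work are taken as close together in time as possible.
>     static void repairSequentially(List<String> cfs, List<String> ranges,
>                                    List<String> endpoints, TreeRequester requester)
>         throws Exception
>     {
>         for (String cf : cfs)
>         {
>             for (String range : ranges)
>             {
>                 List<Future<MerkleTree>> pending = new ArrayList<>();
>                 for (String endpoint : endpoints)
>                     pending.add(requester.requestTree(endpoint, cf, range));
>
>                 for (Future<MerkleTree> f : pending)
>                     f.get(); // blocks; propagates failure if a node rejected
>
>                 // ... compare the trees and stream the differences here ...
>             }
>         }
>     }
> }
> {code}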
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira