[
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059643#comment-13059643
]
Jonathan Ellis commented on CASSANDRA-2816:
-------------------------------------------
bq. May I change to
Sure.
bq. The system should protect the user from that
I'm not sure that in a p2p design we can posit an omniscient "the system."
> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
> Key: CASSANDRA-2816
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Labels: repair
> Fix For: 0.8.2
>
> Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it will send a number of merkle tree requests to
> its neighbors as well as itself and assume for correctness that the building
> of those trees will be started on every node at roughly the same time (if not,
> we end up comparing data snapshots taken at different times and will thus
> mistakenly repair a lot of useless data). This is bogus for many reasons:
> * Because validation compaction runs on the same executor as other
> compactions, the start of the validation on the different nodes is subject to
> other compactions. 0.8 mitigates this in a way by being multi-threaded (and
> thus there is less chance of being blocked for a long time by a long-running
> compaction), but the compaction executor being bounded, it's still a problem.
> * If you run a nodetool repair without arguments, it will repair every CF.
> As a consequence it will generate lots of merkle tree requests and all of
> those requests will be issued at the same time. Because even in 0.8 the
> compaction executor is bounded, some of those validations will end up being
> queued behind the first ones. Even assuming that the different validations are
> submitted in the same order on each node (which isn't guaranteed either),
> there is no guarantee that on all nodes, the first validation will take the
> same time, hence desynchronizing the queued ones.
> Overall, it is important for the precision of repair that for a given CF and
> range (which is the unit at which trees are computed), we make sure that all
> nodes start the validation at the same time (or, since we can't do magic,
> as close as possible).
> One (reasonably simple) proposition to fix this would be to have repair
> schedule validation compactions across nodes one by one (i.e., one CF/range at
> a time), waiting for all nodes to return their tree before submitting the
> next request. Then on each node, we should make sure that the node will start
> the validation compaction as soon as requested. For that, we probably want to
> have a specific executor for validation compaction and:
> * either we fail the whole repair whenever one node is not able to execute
> the validation compaction right away (because no thread is available right
> away),
> * or we simply tell the user that if he starts too many repairs in parallel,
> he may start seeing some of those repairing more data than they should.
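
The proposal in the description could be sketched roughly as follows. This is an illustration only, not the attached patch and not Cassandra's code: `Node`, `build_tree`, `repair` and `ValidationBusy` are hypothetical names, the "tree" is a placeholder tuple, and the sketch assumes the first option (fail the repair when a node cannot start validating immediately).

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class ValidationBusy(Exception):
    """Raised when a node cannot start a validation right away."""


class Node:
    """Hypothetical stand-in for a replica (not a Cassandra class)."""

    def __init__(self, name, validation_threads=1):
        self.name = name
        # Dedicated validation slots, separate from regular compaction.
        self._slots = threading.BoundedSemaphore(validation_threads)

    def build_tree(self, cf, token_range):
        # Refuse instead of queueing, so a validation never starts late.
        if not self._slots.acquire(blocking=False):
            raise ValidationBusy(f"{self.name} has no free validation thread")
        try:
            # Placeholder for the real merkle tree computation.
            return (self.name, cf, token_range)
        finally:
            self._slots.release()


def repair(nodes, jobs):
    """Request trees for one (CF, range) at a time across all nodes."""
    results = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for cf, token_range in jobs:
            futures = [pool.submit(n.build_tree, cf, token_range) for n in nodes]
            # Wait for every node's tree before issuing the next request.
            # result() re-raises ValidationBusy, failing the whole repair.
            results.append([f.result() for f in futures])
    return results
```

Because each CF/range is only requested once the previous one has completed on every node, no validation is queued behind another one from the same repair, which is the source of the desynchronization described above.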