[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059991#comment-13059991
 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

bq. Can you point out where this happens in AES?

Mostly in AES.rendezvous and AES.RepairSession. Basically, RepairSession 
creates a queue of jobs, a job representing the repair of a given column family 
(for a given range, but that comes from the session itself). AES.rendezvous is 
then called for each received merkle tree. It waits until it has all the merkle 
trees for the first job in the queue. When that is done, it dequeues the job 
(computing the merkle tree differences and scheduling streaming accordingly) 
and sends the tree requests for the next job in the queue.
Moreover, in StorageService.forceTableRepair(), when scheduling the repair for 
all the ranges of the node, we actually start the session for the first range 
and wait for all the "jobs" for this range to be done before starting the next 
session.
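
To make that flow concrete, here is a rough sketch of the queue-based 
rendezvous described above. The class and method names (RepairJob, 
treeReceived, requestTrees, ...) are simplified placeholders for illustration, 
not the actual AntiEntropyService code:

{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class RepairSessionSketch
{
    static class RepairJob
    {
        final String columnFamily;
        final Set<String> awaitingTrees; // endpoints we still expect a tree from

        RepairJob(String columnFamily, Set<String> endpoints)
        {
            this.columnFamily = columnFamily;
            this.awaitingTrees = new HashSet<>(endpoints);
        }
    }

    private final Queue<RepairJob> jobs = new ArrayDeque<>();

    // Called for each received merkle tree (the role AES.rendezvous plays).
    synchronized void treeReceived(String columnFamily, String endpoint)
    {
        RepairJob head = jobs.peek();
        if (head == null || !head.columnFamily.equals(columnFamily))
            return; // not the job we are waiting on; the real code checks this more carefully

        head.awaitingTrees.remove(endpoint);
        if (head.awaitingTrees.isEmpty())
        {
            jobs.poll();                   // all trees received for this CF/range
            computeDifferencesAndStream(head);
            if (!jobs.isEmpty())
                requestTrees(jobs.peek()); // only now start validation for the next CF
        }
    }

    void computeDifferencesAndStream(RepairJob job) { /* compare trees, schedule streaming */ }
    void requestTrees(RepairJob job) { /* send a tree request to self and each neighbor */ }
}
{code}

The point is that the tree requests for job N+1 are only sent once every tree 
for job N has arrived, which is what keeps the validations for a given CF and 
range roughly in sync across nodes.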

bq. That feels like the wrong default to me. I think you can make a case for 
one (minimal interference with the rest of the system) or unlimited (no weird 
"cliff" to catch the unwary repair operator). But two is weird.

Well, the rationale was the following: if you set it to one, then you're saying 
that as soon as you start 2 repairs in parallel, they will start being 
inaccurate. But as Peter was suggesting (maybe in another ticket, but anyway), 
if you have a huge CF and tiny ones, it's nice to be able to repair the tiny 
ones while a repair on the huge one(s) is running. Now, making it unlimited 
feels dangerous, because it means that if the user starts a lot of repairs, all 
the validation compactions will start right away. This would kill the cluster 
(or at least a few nodes, if all those repairs were started on the same node). 
It sounded better to have degraded precision for repair in those cases rather 
than basically killing the nodes. Maybe 3 or 4 would be a better default than 
2, but 1 is a bit limited and unlimited is clearly much too dangerous.
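
For what it's worth, a minimal sketch of what such a bounded, dedicated 
validation executor could look like follows; the class, the property name and 
the default of 2 are assumptions for illustration, not the actual Cassandra 
configuration or code:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ValidationExecutorSketch
{
    // Hypothetical knob: how many validation compactions may run at once.
    private static final int CONCURRENT_VALIDATIONS =
        Integer.getInteger("cassandra.concurrent_validations", 2);

    private static final ExecutorService VALIDATION_EXECUTOR =
        Executors.newFixedThreadPool(CONCURRENT_VALIDATIONS);

    public static void submitValidation(Runnable validationCompaction)
    {
        // Validations beyond the limit are queued; their data snapshots are
        // taken later, which is the "degraded precision" trade-off described
        // above, rather than starting every validation compaction at once.
        VALIDATION_EXECUTOR.submit(validationCompaction);
    }
}
{code}

With a pool of size 2, a repair of a tiny CF can proceed while a huge one is 
validating, while an operator who starts many repairs at once only degrades 
repair precision instead of launching every validation compaction immediately.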

> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and 
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it will send a number of merkle tree requests to 
> its neighbors as well as to itself, and it assumes for correctness that the 
> building of those trees will be started on every node at roughly the same 
> time (if not, we end up comparing data snapshots taken at different times 
> and will thus mistakenly repair a lot of data that doesn't need it). This is 
> bogus for many reasons:
> * Because validation compaction runs on the same executor as other 
> compactions, the start of the validation on the different nodes is subject 
> to whatever other compactions are running. 0.8 mitigates this somewhat by 
> being multi-threaded (so there is less chance of being blocked for a long 
> time by a long-running compaction), but since the compaction executor is 
> bounded, it is still a problem.
> * If you run nodetool repair without arguments, it will repair every CF. As 
> a consequence it will generate lots of merkle tree requests, all of them 
> issued at the same time. Because the compaction executor is bounded even in 
> 0.8, some of those validations will end up queued behind the first ones. 
> Even assuming the different validations are submitted in the same order on 
> each node (which isn't guaranteed either), there is no guarantee that the 
> first validation will take the same time on all nodes, hence desynchronizing 
> the queued ones.
> Overall, it is important for the precision of repair that for a given CF and 
> range (which is the unit at which trees are computed), we make sure that all 
> nodes will start the validation at the same time (or, since we can't do 
> magic, as close to it as possible).
> One (reasonably simple) proposition to fix this would be to have repair 
> schedule validation compactions across nodes one by one (i.e., one CF/range 
> at a time), waiting for all nodes to return their tree before submitting the 
> next request. Then on each node, we should make sure that the node will 
> start the validation compaction as soon as it is requested. For that, we 
> probably want to have a specific executor for validation compaction and:
> * either we fail the whole repair whenever one node is not able to execute 
> the validation compaction right away because no thread is available (a rough 
> sketch of this option follows below);
> * or we simply tell the user that if they start too many repairs in 
> parallel, they may see some of those repairing more data than they should.
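
As a sketch of the first option above, a dedicated validation executor with no 
queue could simply reject a validation that cannot start immediately, letting 
the caller fail the whole repair. All names and the limit of 2 here are 
hypothetical and only meant to illustrate the idea:

{code:java}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FailFastValidationExecutorSketch
{
    private static final int MAX_CONCURRENT_VALIDATIONS = 2; // assumed limit

    // No queue: a task either gets a thread right away or is rejected.
    private static final ThreadPoolExecutor EXECUTOR = new ThreadPoolExecutor(
        MAX_CONCURRENT_VALIDATIONS, MAX_CONCURRENT_VALIDATIONS,
        60, TimeUnit.SECONDS,
        new SynchronousQueue<>(),
        new ThreadPoolExecutor.AbortPolicy());

    /** @return true if the validation compaction started right away. */
    public static boolean tryStartValidation(Runnable validationCompaction)
    {
        try
        {
            EXECUTOR.execute(validationCompaction);
            return true;
        }
        catch (RejectedExecutionException e)
        {
            // No validation thread available: the caller can fail the repair
            // instead of silently desynchronizing the tree snapshots.
            return false;
        }
    }
}
{code}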
