[ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055455#comment-13055455 ]

Terje Marthinussen commented on CASSANDRA-2816:
-----------------------------------------------

I don't know what causes GC during repairs either, but fire off repair on a 
few nodes with 100 million docs/node and there is a reasonable chance that a 
node here and there will log messages about reducing cache sizes due to memory 
pressure, or will do a full GC. (I am not really sure reducing caches is a 
good idea at all; shrinking caches under stress rarely improves anything.)

The idea of master-controlled compaction would not really be affected by 
network splits etc.

Reconciliation after a network split is just as complex with or without a 
master: we need to get back to a state where all the nodes hold the same data, 
which is a complex task either way.

This is more an observation that we do not necessarily need to live in a 
quorum-based world during compaction; we are free to use alternative 
approaches in the compaction without changing the read/write path or affecting 
availability. Master selection is not really a problem here: start compaction, 
talk to the other nodes holding the same token ranges, select a leader. 
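The leader selection sketched above could be as simple as a deterministic pick among the replicas of a token range. The following is an illustrative sketch only (the function name and the rotation-by-round idea are my assumptions, not Cassandra code):

```python
# Hypothetical sketch: deterministic leader selection among the replicas
# sharing a token range, for one compaction round. Not actual Cassandra code.

def select_compaction_leader(replicas, round_seed=0):
    """Pick a leader for this compaction round.

    The choice only needs to be consistent across the replicas involved;
    rotating by a per-round seed spreads the work, so it need not be the
    same master every time.
    """
    ordered = sorted(replicas)                 # every node agrees on this ordering
    return ordered[round_seed % len(ordered)]  # rotate leadership per round

replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
leader_round0 = select_compaction_leader(replicas, round_seed=0)  # "10.0.0.1"
leader_round1 = select_compaction_leader(replicas, round_seed=1)  # "10.0.0.2"
```

Because the pick is a pure function of the replica set and the round, no extra coordination messages are needed to agree on the leader.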

It does not even have to be the same master every time, and we could consider 
making compaction part of a background read repair to reduce the number of 
times we need to read/write the data. 

For instance, suppose we can verify that the oldest/biggest sstables are 100% 
in sync with the data on other replicas when they are compacted (why not do it 
during compaction, when we go through the data anyway, rather than later?). 
Could we then use that information to optimize the scans done during repairs, 
using only the sstables holding data received after some checkpoint in time as 
the starting point for the consistency check?

> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.2
>
>         Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch
>
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and 
> CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it sends a number of merkle tree requests to its 
> neighbors as well as to itself, and assumes for correctness that the building 
> of those trees will start on every node at roughly the same time (if not, we 
> end up comparing data snapshots taken at different times and will thus 
> mistakenly repair a lot of data needlessly). This is bogus for many reasons:
> * Because validation compaction runs on the same executor as other 
> compactions, the start of the validation on the different nodes is subject to 
> other compactions. 0.8 mitigates this somewhat by being multi-threaded (and 
> thus there is less chance of being blocked a long time by a long-running 
> compaction), but since the compaction executor is bounded, it is still a 
> problem.
> * If you run nodetool repair without arguments, it will repair every CF. As 
> a consequence it will generate lots of merkle tree requests, all issued at 
> the same time. Because even in 0.8 the compaction executor is bounded, some 
> of those validations will end up queued behind the first ones. Even assuming 
> that the different validations are submitted in the same order on each node 
> (which isn't guaranteed either), there is no guarantee that the first 
> validation will take the same time on all nodes, hence desynchronizing the 
> queued ones.
> Overall, it is important for the precision of repair that, for a given CF 
> and range (which is the unit at which trees are computed), we make sure all 
> nodes start the validation at the same time (or, since we can't do magic, as 
> close to it as possible).
> One (reasonably simple) proposal to fix this would be to have repair 
> schedule validation compactions across nodes one by one (i.e., one CF/range 
> at a time), waiting for all nodes to return their tree before submitting the 
> next request. Then on each node, we should make sure the node starts the 
> validation compaction as soon as it is requested. For that, we probably want 
> a dedicated executor for validation compaction, and either:
> * we fail the whole repair whenever one node is not able to execute the 
> validation compaction right away (because no thread is available right 
> away), or
> * we simply tell the user that if they start too many repairs in parallel, 
> they may see some of those repair more data than they should.
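The one-by-one scheduling proposed in the quoted description can be sketched as follows. This is a toy model under my own assumptions (the `request_tree` stand-in and the coordinator loop are illustrative, not Cassandra's actual repair code):

```python
# Minimal sketch of the proposed fix: the repair coordinator requests the
# merkle tree for one (CF, range) pair from all replicas at once, and blocks
# for every tree before moving on to the next pair. Names are hypothetical.

from concurrent.futures import ThreadPoolExecutor

def request_tree(node, cf, rng):
    # Stand-in for sending a merkle tree request and awaiting the reply.
    return (node, cf, rng)

def repair(cf_ranges, nodes):
    all_trees = []
    for cf, rng in cf_ranges:                  # one CF/range at a time
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            # All nodes start this validation at (roughly) the same time...
            futures = [pool.submit(request_tree, n, cf, rng) for n in nodes]
            # ...and we wait for every tree before the next request goes out.
            all_trees.append([f.result() for f in futures])
    return all_trees
```

Serializing the (CF, range) pairs is what keeps the per-node validation queues short, so the snapshots being compared are taken close together in time.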

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
