[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661478#comment-13661478
 ] 

Charlie Groves commented on CASSANDRA-5351:
-------------------------------------------

I've been looking at implementing this, and either I'm not understanding how it 
works or it needs an extra wrinkle to keep from streaming a lot of data around. 
Given nodes 1, 2 and 3, each of which is a replica for the same range of keys, 
my understanding is that this style of repair would play out like this:

# Run repair on 1 and it's just like current repair: 1 streams the sections of 
sstables for its divergent ranges to 2 and 3 and they stream their versions of 
the divergent ranges back to 1
# 1 marks its initial sstables and the ones it received as repaired
# Run repair on 2: it streams back and forth with 3 in the same fashion as in 
step 1. Node 1 doesn't include the sstables it repaired, so the merkle trees 
of 1 and 2 are mostly different, and 1 and 2 stream the majority of their 
unrepaired sstables to each other
# 2 marks its initial sstables and the ones it received as repaired
# Run repair on 3, and neither 1 nor 2 sends its repaired sstables. All the 
trees are quite divergent, so both 1 and 2 send their unrepaired sstables to 3 
and 3 sends its own to 1 and 2.

If you add more replicas, you stream the majority of the sstables for each 
repaired node until you move to a node that isn't replicating the same range. 
Am I missing something? It seems like the amount of data streamed would knock 
out much of the benefit of not reading the repaired data.

If the above is the case, I was thinking it could be fixed by adding a 
"generation" to repairs. You supply a generation number to the repair command 
and all sstables repaired in that run are marked as repaired in that 
generation. The generation is sent to all the neighbor nodes requesting repairs 
from them, and they build their merkle trees using any unrepaired ranges and 
repaired ranges at that generation or higher. Compaction would create new 
sstables at the highest generation number of its source sstables. Once you've 
repaired all the way around the ring, you'd increment the generation number.
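
A minimal sketch of the selection rule, assuming each sstable carries an 
optional repaired generation. The SSTable class and the function names are 
hypothetical and not part of Cassandra; the compaction rule also assumes 
repaired and unrepaired sstables are never compacted together, as the issue 
description suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SSTable:
    """Hypothetical sstable descriptor; None means never repaired."""
    name: str
    repaired_generation: Optional[int] = None


def sstables_for_tree(sstables: List[SSTable], generation: int) -> List[SSTable]:
    """Pick the sstables a neighbor would feed into its merkle tree:
    anything unrepaired, plus anything repaired at `generation` or higher."""
    return [
        s for s in sstables
        if s.repaired_generation is None or s.repaired_generation >= generation
    ]


def compact(sources: List[SSTable], name: str) -> SSTable:
    """Create the output sstable at the highest generation of its sources."""
    gens = [s.repaired_generation for s in sources]
    if all(g is None for g in gens):
        return SSTable(name)  # unrepaired in, unrepaired out
    if any(g is None for g in gens):
        # Assumption: compaction keeps repaired and unrepaired data apart.
        raise ValueError("cannot mix repaired and unrepaired sstables")
    return SSTable(name, max(gens))


tables = [SSTable("a"), SSTable("b", 1), SSTable("c", 2)]
current_round = [s.name for s in sstables_for_tree(tables, 2)]
merged = compact([SSTable("b", 1), SSTable("c", 2)], "d")
print(current_round, merged)
```

With generation 2 as the current round, the neighbor's tree covers the 
unrepaired sstable and the one already repaired in this round, and skips the 
one repaired in generation 1.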

We could even use the repairedAt timestamp as the generation: don't give the 
first node repaired in the ring for this round a generation, and it returns its 
timestamp when it's done. Pass that timestamp as the generation around the 
ring, and they're all on the same generation afterwards. 
"--include-previously-repaired" could be implemented as repairing with 
generation 0.
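
Roughly, the timestamp variant could look like the sketch below. The node 
dicts, repair_node, repair_ring and included are invented for illustration; 
the actual streaming is left out, and I'm assuming that a repair run without 
a generation only looks at unrepaired data.

```python
import time


def included(repaired_at, generation):
    """Would an sstable with this repairedAt value go into the merkle tree?"""
    if repaired_at is None:
        return True   # unrepaired data is always included
    if generation is None:
        return False  # assumption: no generation means unrepaired data only
    return repaired_at >= generation


def repair_node(node, generation=None, clock=time.time):
    """Toy repair of one node; returns the generation to pass to the next node."""
    # ... tree building and streaming would happen here ...
    if generation is None:
        # First node of the round: its completion timestamp (in milliseconds)
        # becomes the generation for everyone else.
        generation = int(clock() * 1000)
    node["repaired_at"] = generation
    return generation


def repair_ring(nodes, clock=time.time):
    """Walk the ring once, handing the first node's timestamp to the rest."""
    generation = None
    for node in nodes:
        generation = repair_node(node, generation, clock)
    return generation


ring = [{"name": "1"}, {"name": "2"}, {"name": "3"}]
round_generation = repair_ring(ring, clock=lambda: 1369000000.0)
print(round_generation, [n["repaired_at"] for n in ring])
```

Since every repairedAt timestamp is at least 0, included(repaired_at, 0) is 
always true, which is why generation 0 behaves like 
"--include-previously-repaired".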

                
> Avoid repairing already-repaired data by default
> ------------------------------------------------
>
>                 Key: CASSANDRA-5351
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
>             Project: Cassandra
>          Issue Type: Task
>          Components: Core
>            Reporter: Jonathan Ellis
>              Labels: repair
>             Fix For: 2.0
>
>
> Repair has always built its merkle tree from all the data in a columnfamily, 
> which is guaranteed to work but is inefficient.
> We can improve this by remembering which sstables have already been 
> successfully repaired, and only repairing sstables new since the last repair. 
>  (This automatically makes CASSANDRA-3362 much less of a problem too.)
> The tricky part is, compaction will (if not taught otherwise) mix repaired 
> data together with non-repaired.  So we should segregate unrepaired sstables 
> from the repaired ones.

