[
https://issues.apache.org/jira/browse/CASSANDRA-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012034#comment-13012034
]
Sylvain Lebresne commented on CASSANDRA-2324:
---------------------------------------------
The problem is that the ranges over which repair computes hashes are not the
actual node ranges.
Let's consider the following ring (RF=2), where I assume tokens are in
[0..12] to simplify, and where everything is consistent:
{noformat}
               _.-""""-._
C (token: 11).'          `.
[11,3],[3,7]/              \
           |                |
           |                |  A (token: 3)
           |                |  [3,7],[7,11]
            \              /
             `._        _.'
B (token: 7)    `-....-'
[7,11],[11,3]
{noformat}
Now say I run a repair on node A. The problem is that the Merkle tree ranges
are built by dividing the full range by 2 recursively. This means that in this
example, the ranges in the tree will for instance be [0,2], [2,4], [4,6],
[6,8], [8,10] and [10,12].
If you compare the hashes for A and B on those ranges, chances are you'll find
mismatches for [6,8] and [10,12] (because A doesn't have anyone on [11,12]
while B does, and B doesn't have anyone on [6,7] while A does). As a
consequence, the ranges [7,8] and [10,11] will be repaired, even though there
are no inconsistencies.
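The effect can be sketched with a toy model of the ring above (hypothetical key placement, and MD5 standing in for the real Merkle tree hashing; this is an illustration, not Cassandra's actual code):

```python
import hashlib

# Token space [0, 12], RF=2. Node A replicates (3,7] and (7,11];
# node B replicates (7,11] and (11,3]. The cluster is fully consistent:
# every key sits on exactly the replicas that should own it.
keys_by_token = {1: "v1", 4: "v4", 7: "v7", 9: "v9", 12: "v12"}

def in_range(token, lo, hi):
    # Ring ranges (lo, hi] can wrap around zero, e.g. (11, 3] covers 12, 0..3.
    return lo < token <= hi if lo < hi else token > lo or token <= hi

def replica_data(ranges):
    return {t: v for t, v in keys_by_token.items()
            if any(in_range(t, lo, hi) for lo, hi in ranges)}

node_a = replica_data([(3, 7), (7, 11)])
node_b = replica_data([(7, 11), (11, 3)])

def range_hash(data, lo, hi):
    # Hash only the keys a node holds inside the given tree range.
    payload = "".join(f"{t}={data[t]}" for t in sorted(data)
                      if in_range(t, lo, hi))
    return hashlib.md5(payload.encode()).hexdigest()

# Repairing the shared range (7,11] compares the tree ranges overlapping it.
# Those tree ranges spill past the token boundaries at 7 and 11:
for lo, hi in [(6, 8), (8, 10), (10, 12)]:
    same = range_hash(node_a, lo, hi) == range_hash(node_b, lo, hi)
    print((lo, hi), "match" if same else "MISMATCH")
# (6, 8) MISMATCH   -- A holds token 7, B doesn't
# (8, 10) match     -- both hold token 9
# (10, 12) MISMATCH -- B holds token 12, A doesn't
```

Even with perfectly consistent replicas, two of the three compared ranges hash differently, purely because the tree ranges don't line up with the token boundaries.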
What that means in practice is that it will be very rare for anti-entropy to
actually consider the nodes in sync; it will almost surely "repair" something,
even if the nodes are perfectly consistent. It's very easy to check, btw: with
a cluster like the one above (3 nodes, RF=2), with as few as 5 keys for the
whole cluster I'm able to have repair do repairs over and over again.
Now the good question is: how bad is it? I'm not sure; it depends a bit.
On a 3 node cluster (RF=2), I tried inserting 1M keys with stress (stress -l
2) and triggered a repair afterwards. The amount of (unnecessarily) repaired
keys was around 150 keys for a given node (it varies slightly from run to run
because there is some randomness in the creation of the Merkle tree),
corresponding to ~44KB streamed (that is the amount transferred to the node
where repair was run, so for the total operation it's twice this, since we
stream in both ways). That's ~0.02% of the keys (a given node has ~666,666
keys). It's bad to do useless work, but not a really big deal.
However, the fewer keys we have, the worse it gets (and the bigger our rows
are, the more useless transfer we do). With the same experiment inserting only
10K keys, there are 190 keys uselessly repaired. That's now close to 3% of the
load. It also gets worse with increasing replication factor.
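The percentages above can be checked with back-of-the-envelope arithmetic (assuming an even token distribution across the 3 nodes):

```python
def per_node_load(total_keys, rf, nodes):
    # With RF replicas per key spread over the cluster, each node holds
    # roughly total_keys * rf / nodes rows.
    return total_keys * rf // nodes

large = per_node_load(1_000_000, 2, 3)  # ~666,666 rows per node
small = per_node_load(10_000, 2, 3)     # ~6,666 rows per node

print(f"{150 / large:.4%}")  # ~0.02% of a node's rows repaired needlessly
print(f"{190 / small:.2%}")  # ~2.85%: close to 3% with only 10K keys
```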
To fix this, we would need the ranges in the Merkle tree to "share" the node
range boundaries. An interesting way to do this would be to have the
coordinating node give a list of ranges for which to calculate Merkle trees,
and each node would compute one tree per range (for the coordinating node,
that would be #RF trees). A nice thing with this is that it would leave room
for optimizing repair, since a node would need to do a validation compaction
only on the ranges asked for, which means that only the coordinator node would
validate all its data. The neighbors would do less work.
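A hypothetical sketch of that scheme (names are mine, not Cassandra's): the coordinator hands each neighbor the replica ranges to hash, one tree per range, so tree leaves can never straddle a token boundary, and each neighbor only validates the ranges it shares with the coordinator:

```python
def trees_to_build(coordinator_ranges, neighbour_ranges):
    """Ranges a neighbour must validate: the coordinator's replica ranges
    that the neighbour also stores. Exact-match intersection suffices in
    this sketch, since whole replica ranges are shared under simple
    placement."""
    return [r for r in coordinator_ranges if r in neighbour_ranges]

# From the ring above: coordinator A validates all #RF of its ranges,
# while neighbour B only needs a tree over the one range it shares with A.
a_ranges = [(3, 7), (7, 11)]
b_ranges = [(7, 11), (11, 3)]
print(trees_to_build(a_ranges, b_ranges))  # [(7, 11)]
```

Since the tree boundaries now coincide with the token boundaries, the spurious mismatches on ranges like [6,8] and [10,12] disappear, and the neighbors do strictly less validation work than the coordinator.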
> Repair transfers more data than necessary
> -----------------------------------------
>
> Key: CASSANDRA-2324
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2324
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.7.0
> Reporter: Brandon Williams
> Assignee: Sylvain Lebresne
> Fix For: 0.7.5
>
>
> To repro: 3 node cluster, stress.java 1M rows with -x KEYS and -l 2. The
> index is enough to make some mutations drop (about 20-30k total in my tests).
> Repair afterwards will repair a large amount of ranges the first time.
> However, each subsequent run will repair the same set of small ranges every
> time. INDEXED_RANGE_SLICE in stress never fully works. Counting rows with
> sstablekeys shows there are 2M rows total as expected, however when trying to
> count the indexed keys, I get exceptions like:
> {noformat}
> Exception in thread "main" java.io.IOException: Key out of order!
> DecoratedKey(101571366040797913119296586470838356016,
> 0707ab782c5b5029d28a5e6d508ef72f0222528b5e28da3b7787492679dc51b96f868e0746073e54bc173be927049d0f51e25a6a95b3268213b8969abf40cea7d7)
> > DecoratedKey(12639574763031545147067490818595764132,
> 0bc414be3093348a2ad389ed28f18f0cc9a044b2e98587848a0d289dae13ed0ad479c74654900eeffc6236)
> at
> org.apache.cassandra.tools.SSTableExport.enumeratekeys(SSTableExport.java:206)
> at
> org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:388)
> {noformat}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira