[
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795120#comment-13795120
]
Jason Brown commented on CASSANDRA-5351:
----------------------------------------
Interesting ideas here. However, here are some problems off the top of my head
that need to be addressed (in no particular order):
* Nodes that are fully replaced (see: Netflix running in the cloud). When a
node is replaced, we bootstrap the node by streaming data from the closest
peers (usually) in the local DC. The new node would not have anti-compacted
sstables, as it has never had a chance to repair. I'm not sure if bootstrapped
data can be considered anti-compacted through commutativity; it might be true,
but I'd need to think about it more. Assuming not, when this new node is
involved in any repair, it would generate a different merkle tree than its
already-repaired peers, and thus all hell would break loose streaming
already-repaired data to every node involved in the repair, worse than today's
repair (think streaming TBs of data across multiple Amazon datacenters). If we
can prove that the new node's data is commutatively repaired just by bootstrap,
then this is not a problem as such. Note this also affects move (to a lesser
degree) and rebuild.
* Consider nodes A, B, and C. If nodes A and B successfully repair, but C
fails to repair with them (due to partitioning, an app crash, etc.) during the
repair, C is forced to do an -ipr repair, as A and B have already
anti-compacted and that is the only way C will be able to repair against them.
* If the operator chooses to cancel the repair, we are left in an indeterminate
state wrt which nodes have successfully completed repairs with one another
(similar to the last point).
* Local-DC repair vs. global repair is largely incompatible with this. It
looks like you get one shot at repairing each sstable's range, so if you
choose to do a local-DC repair with an sstable, you are forced to do -ipr if
you later want to repair globally.
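The "one shot" problem in the last bullet can be sketched with a hypothetical model (the field names and functions are illustrative, not Cassandra's actual API): once an sstable is marked repaired, even by a local-DC-only repair, an incremental repair skips it, so its data is never validated against remote-DC replicas again; only a full repair covers it.

```python
# Hypothetical sstable state after a local-DC-only repair marked sst-1.
sstables = [
    {"name": "sst-1", "repaired": True,  "scope": "local-dc"},
    {"name": "sst-2", "repaired": False, "scope": None},
]

def incremental_repair_candidates(tables):
    # Incremental repair only looks at unrepaired sstables, regardless of
    # how narrow the repair that marked the others actually was.
    return [t["name"] for t in tables if not t["repaired"]]

def full_repair_candidates(tables):
    # A full (-ipr) repair hashes everything.
    return [t["name"] for t in tables]

print(incremental_repair_candidates(sstables))  # ['sst-2']
print(full_repair_candidates(sstables))         # ['sst-1', 'sst-2']
```

Under this model, sst-1 can only ever be re-validated globally by falling back to -ipr, which is exactly the operator behavior worried about below.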
Note that these problems are magnified immensely when you run in multiple
datacenters, especially datacenters separated by great distances.
While none of these situations is unresolvable, it seems that there are many
non-obvious ways in which we can get into a non-deterministic state, where
operators will either see tons of data being streamed because nodes'
anti-compaction points differ, or will be forced to run -ipr without an easily
understood reason. I already see operators terminate repair jobs because "they
hang" or "take too long", for better or worse (mostly worse). At that point,
the operator is pretty much required to do an -ipr repair, which gets us back
into the same situation we are in today, but with more confusion and possibly
with -ipr as the default.
It would probably be good to run -ipr every n days/weeks/months as a best
practice anyway, but I worry about the very non-obvious edge cases this
introduces and the possibility that operators will simply fall back to using
-ipr whenever something goes bump or doesn't make sense.
Thanks for listening.
> Avoid repairing already-repaired data by default
> ------------------------------------------------
>
> Key: CASSANDRA-5351
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
> Project: Cassandra
> Issue Type: Task
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Lyuben Todorov
> Labels: repair
> Fix For: 2.1
>
>
> Repair has always built its merkle tree from all the data in a columnfamily,
> which is guaranteed to work but is inefficient.
> We can improve this by remembering which sstables have already been
> successfully repaired, and only repairing sstables new since the last repair.
> (This automatically makes CASSANDRA-3362 much less of a problem too.)
> The tricky part is, compaction will (if not taught otherwise) mix repaired
> data together with non-repaired. So we should segregate unrepaired sstables
> from the repaired ones.
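The segregation described in the quoted issue can be sketched as follows (a minimal model; the `(name, repaired_at)` tuples and bucketing function are illustrative assumptions, not Cassandra's actual compaction code): compaction must only ever merge sstables of the same repaired status, so repaired and unrepaired data never mix.

```python
# Hypothetical sstable descriptors: (name, repaired_at), where repaired_at
# is None for sstables that have never been through a successful repair.
sstables = [
    ("sstable-1", 1700000000),  # repaired
    ("sstable-2", None),        # unrepaired (freshly flushed or streamed)
    ("sstable-3", 1700000000),  # repaired
    ("sstable-4", None),        # unrepaired
]

def compaction_buckets(tables):
    """Bucket sstables so repaired and unrepaired data never compact together."""
    return {
        "repaired": [t for t in tables if t[1] is not None],
        "unrepaired": [t for t in tables if t[1] is None],
    }

buckets = compaction_buckets(sstables)
print(buckets["repaired"])    # sstable-1 and sstable-3 only
print(buckets["unrepaired"])  # sstable-2 and sstable-4 only
```

With this split, the next incremental repair can build its merkle tree from the unrepaired bucket alone, which is the efficiency win the ticket is after.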
--
This message was sent by Atlassian JIRA
(v6.1#6144)