[
https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722054#action_12722054
]
Stu Hood edited comment on CASSANDRA-193 at 6/20/09 9:16 PM:
-------------------------------------------------------------
The current plan is for an AntiEntropyService per table to maintain a Merkle
Tree per column family.
The tree will be implemented as a full, randomized binary tree in memory (a
''Treap'': http://en.wikipedia.org/wiki/Treap ), where every item in the tree
represents a range bounded by the dht.Tokens of its left and right neighbors.
By placing a bound on the total number of nodes in the tree, we can limit the
memory usage. We can compact or split ranges in the tree by removing or adding
Tokens. The algorithm for deciding which ranges to compact/split will be
described below.
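As a rough illustration (dht.Tokens simplified to longs, and every name below hypothetical rather than final), the structure might look like:

{code:java}
import java.util.Random;

// Hypothetical sketch: the real tree would key on dht.Token, and the
// node bound would come from configuration.
class MerkleTree {
    static final Random RANDOM = new Random();
    final int maxNodes; // bound on total nodes, to cap memory usage
    int size = 0;
    Node root;

    MerkleTree(int maxNodes) { this.maxNodes = maxNodes; }

    static class Node {
        final long token;  // right bound: covers (predecessor's token, token]
        long priority;     // random treap priority; smaller sits nearer the root
        byte[] hash;       // null == range is 'invalid' and needs re-hashing
        Node left, right;

        Node(long token) {
            this.token = token;
            this.priority = RANDOM.nextLong();
        }
    }
}
{code}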
When a write comes in for a given table, we will place 'invalidation'
operations in a queue for all affected column families. The ExecutorService for
the table will read from the queue and perform the 'invalidations' as fast as
it can. For a given Key/Token, if the covering range in any column family's
tree is marked 'invalid', the entire row needs to be read from disk and
repaired (at some point in the future).
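To sketch the write path (building on the MerkleTree sketch above; recordWrite and the single-threaded executor are illustrative, and invalidate() is sketched in the next block):

{code:java}
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative write-path hook: enqueueing is cheap, and the executor
// drains invalidations asynchronously so writes never block on the tree.
class AntiEntropyService {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Map<String, MerkleTree> cfTrees; // one tree per column family

    AntiEntropyService(Map<String, MerkleTree> cfTrees) {
        this.cfTrees = cfTrees;
    }

    void recordWrite(String columnFamily, long token) {
        MerkleTree tree = cfTrees.get(columnFamily);
        if (tree != null)
            executor.submit(() -> tree.invalidate(token));
    }
}
{code}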
An 'invalidation' operation does a binary search in the Merkle Tree and marks
the matching range as 'invalid', deleting its hash. We will also take advantage
of this step to optimize the tree: A ''Treap'' stores a random priority (P) on
each node, and by generating a random P' and replacing P for a node iff P' < P
as we invalidate it, more frequently invalidated ranges will shift to the top
of the tree.
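Something like the following, as a method on the MerkleTree sketch above (treap rotations elided for brevity):

{code:java}
// Binary-search for the node whose range covers the written token, drop
// its hash, and do the priority replacement described above (min-heap:
// a smaller P sits nearer the root).
void invalidate(long writeToken) {
    if (root == null) return;
    Node node = root, covering = null;
    while (node != null) {
        if (writeToken <= node.token) { covering = node; node = node.left; }
        else node = node.right;
    }
    if (covering == null) covering = leftmost(root); // token wrapped the ring
    covering.hash = null;                            // mark the range 'invalid'
    long pPrime = RANDOM.nextLong();
    if (pPrime < covering.priority) {
        covering.priority = pPrime;
        // ...then rotate the node upward until the heap property holds,
        // so frequently invalidated ranges accumulate near the root.
    }
}

static Node leftmost(Node n) {
    while (n.left != null) n = n.left;
    return n;
}
{code}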
The AEService maintaining the trees for a table will occasionally need to
exchange portions of the tree with other nodes. In order to do this, subtrees
that both nodes are interested in from all CF trees will have to be locked long
enough to recalculate all 'invalid' children, and then the locks can flow down
the tree as progressively smaller ranges are exchanged. Doing this locking
efficiently is going to be interesting (aka: I haven't thought about it).
Implementing the exchange between nodes is blocked by CASSANDRA-242 because in
order to align the subtrees on different nodes, we need to be able to
deterministically split two ranges.
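For illustration, the property CASSANDRA-242 needs to provide (tokens again simplified to longs): both nodes must be able to compute the same split point for a range locally, with no negotiation.

{code:java}
// Illustration only: a deterministic midpoint for a non-wrapping range.
// A real dht.Token implementation must also handle ranges that wrap past
// the minimum token on the ring.
static long midpoint(long left, long right) {
    return left + (right - left) / 2; // avoids overflow of (left + right)
}
{code}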
In order to fill in 'invalid' ranges in the tree, the MerkleTree will provide
an operation that builds a list of invalid ranges to be fetched from disk.
During this step, we can also compact/split ranges. Because of our Treap
maintenance, frequently invalidated ranges will be nearer to the top of the
tree, and stable ranges will be closer to the bottom. By compacting the deepest
N leaves and expanding the shallowest N, we can minimize the size of the ranges
that are affected by invalidations in the future.
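The collection pass might look like this (an in-order walk over the MerkleTree sketch above, recording depths so the caller can pick the deepest and shallowest N):

{code:java}
// Gathers 'invalid' nodes along with their depths. The caller compacts
// the deepest N and splits the shallowest N before re-hashing.
void collectInvalid(Node node, int depth, java.util.List<Node> invalid,
                    java.util.List<Integer> depths) {
    if (node == null) return;
    collectInvalid(node.left, depth + 1, invalid, depths);
    if (node.hash == null) {  // hash was deleted by an earlier invalidation
        invalid.add(node);
        depths.add(depth);
    }
    collectInvalid(node.right, depth + 1, invalid, depths);
}
{code}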
Given the list of 'invalid' ranges (and pointers directly to the tree nodes),
the AEService will fetch the ranges from the current MemTable and SSTables for
the CF, hash them, and store the hashes into the relevant nodes. After this
operation, we can recursively calculate hashes for inner nodes.
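As a sketch of that last step (hashRangeFromDisk is a purely hypothetical stand-in for the MemTable/SSTable scan, and MD5 is an arbitrary choice here):

{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Methods for the MerkleTree sketch above: re-hash any 'invalid' range,
// then combine child aggregates with each node's own range hash bottom-up.
byte[] fillAndAggregate(Node node) throws NoSuchAlgorithmException {
    if (node == null) return new byte[0];
    byte[] left = fillAndAggregate(node.left);
    byte[] right = fillAndAggregate(node.right);
    if (node.hash == null)
        node.hash = hashRangeFromDisk(node); // hypothetical range scan + digest
    MessageDigest md = MessageDigest.getInstance("MD5");
    md.update(left);
    md.update(node.hash);
    md.update(right);
    return md.digest(); // aggregate hash for this subtree, used in the exchange
}

// Stand-in only: the real version would scan the range in the current
// MemTable and SSTables for the CF and digest the rows it finds.
byte[] hashRangeFromDisk(Node node) {
    throw new UnsupportedOperationException("illustrative placeholder");
}
{code}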
> Proactive repair
> ----------------
>
> Key: CASSANDRA-193
> URL: https://issues.apache.org/jira/browse/CASSANDRA-193
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Jonathan Ellis
> Assignee: Stu Hood
> Fix For: 0.5
>
>
> Currently Cassandra supports "read repair," i.e., lazy repair when a read is
> done. This is better than nothing but is not sufficient for some cases (e.g.
> catastrophic node failure where you need to rebuild all of a node's data on a
> new machine).
> Dynamo uses Merkle trees here. This is harder for Cassandra given the CF
> data model but I suppose we could just hash the serialized CF value.