[ https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722054#action_12722054 ]

Stu Hood edited comment on CASSANDRA-193 at 6/20/09 9:16 PM:
-------------------------------------------------------------

The current plan is for an AntiEntropyService per table to maintain a Merkle 
Tree per column family.

The tree will be implemented as a full, randomized binary tree in memory (a 
''Treap'': http://en.wikipedia.org/wiki/Treap), where every item in the tree 
represents a range bounded by the dht.Tokens of its left and right neighbors. 
By placing a bound on the total number of nodes in the tree, we can limit the 
memory usage. We can compact or split ranges in the tree by removing or adding 
Tokens. The algorithm for deciding which ranges to compact/split will be 
described below.
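
To make this concrete, here is a rough sketch of what a node in such a tree might look like. The names are hypothetical rather than a proposed API, and a plain long stands in for dht.Token:

{code}
import java.util.Random;

// Sketch only: one node per range. A node's token is the right bound of its
// range; the left bound is the token of its in-order predecessor.
public class TreapNode
{
    private static final Random RANDOM = new Random();

    public final long token; // stand-in for dht.Token
    public int priority;     // treap heap priority; smaller sits nearer the root
    public byte[] hash;      // hash over the range; null means 'invalid'
    public TreapNode left, right;

    public TreapNode(long token)
    {
        this.token = token;
        this.priority = RANDOM.nextInt();
        this.hash = null; // a freshly created range starts out invalid
    }
}
{code}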

When a write comes in for a given table, we will place 'invalidation' 
operations in a queue for all affected column families. The ExecutorService for 
the table will read from the queue and perform the 'invalidations' as fast as 
it can. For a given Key/Token, if the matching range in any column family's 
tree is marked 'invalid', the entire row needs to be read from disk and 
repaired (at some point in the future).
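
A minimal sketch of that write path, assuming the hypothetical TreapNode above and the MerkleTree sketched in the next block; the single-threaded executor stands in for the per-table ExecutorService:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TableAntiEntropyService
{
    // one Merkle tree per column family (MerkleTree is the hypothetical
    // treap wrapper sketched below)
    private final ConcurrentMap<String, MerkleTree> trees =
        new ConcurrentHashMap<String, MerkleTree>();

    // the per-table executor that drains 'invalidation' operations
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Called from the write path: stays cheap by only enqueueing work. */
    public void recordWrite(final String columnFamily, final long token)
    {
        executor.submit(new Runnable()
        {
            public void run()
            {
                MerkleTree tree = trees.get(columnFamily);
                if (tree != null)
                    tree.invalidate(token); // marks the covering range 'invalid'
            }
        });
    }
}
{code}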

An 'invalidation' operation does a binary search in the Merkle Tree and marks 
the matching range as 'invalid', deleting its hash. We will also take advantage 
of this step to optimize the tree: a ''Treap'' stores a random priority (P) on 
each node, and by generating a random P' and replacing a node's P with P' iff 
P' < P as we invalidate it, more frequently invalidated ranges will shift 
toward the top of the tree.
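
A hedged sketch of that operation over the TreapNode layout above, assuming a min-heap on priority (smaller P sits nearer the root) and ignoring the wrap-around of the token ring:

{code}
import java.util.Random;

public class MerkleTree
{
    private static final Random RANDOM = new Random();
    public TreapNode root;

    /** Find the range covering the token, drop its hash, and possibly lower
     *  its priority so frequently invalidated ranges float toward the root. */
    public void invalidate(long token)
    {
        root = invalidate(root, token);
    }

    private static TreapNode invalidate(TreapNode node, long token)
    {
        if (node == null)
            return null; // past the last range; a real ring would wrap
        if (token <= node.token && (node.left == null || maxToken(node.left) < token))
        {
            // node's range (predecessor.token, node.token] covers the token
            node.hash = null;              // mark the range 'invalid'
            int p = RANDOM.nextInt();
            if (p < node.priority)         // replace P with P' iff P' < P
                node.priority = p;
        }
        else if (token <= node.token)
        {
            node.left = invalidate(node.left, token);
            if (node.left.priority < node.priority)
                node = rotateRight(node);  // restore the heap property
        }
        else
        {
            node.right = invalidate(node.right, token);
            if (node.right != null && node.right.priority < node.priority)
                node = rotateLeft(node);
        }
        return node;
    }

    private static long maxToken(TreapNode node)
    {
        while (node.right != null)
            node = node.right;
        return node.token;
    }

    private static TreapNode rotateRight(TreapNode node)
    {
        TreapNode pivot = node.left;
        node.left = pivot.right;
        pivot.right = node;
        return pivot;
    }

    private static TreapNode rotateLeft(TreapNode node)
    {
        TreapNode pivot = node.right;
        node.right = pivot.left;
        pivot.left = node;
        return pivot;
    }
}
{code}

Since only the matched node's priority can shrink, one rotation per level on the way back up the recursion is enough to restore the heap property.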

The AEService maintaining the trees for a table will occasionally need to 
exchange portions of the trees with other nodes. In order to do this, the 
subtrees from all CF trees that both nodes are interested in will have to be 
locked long enough to recalculate all 'invalid' children, and then the locks 
can flow down the tree as progressively smaller ranges are exchanged. Doing 
this locking efficiently is going to be interesting (aka: I haven't thought 
about it).

Implementing the exchange between nodes is blocked by CASSANDRA-242 because in 
order to align the subtrees on different nodes, we need to be able to 
deterministically split two ranges.
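
For illustration only: with numeric tokens, the natural deterministic split is the midpoint of the bounds, so two nodes computing it independently for the same range will agree on the resulting subtrees (ring wrap-around again ignored):

{code}
import java.math.BigInteger;

public class RangeMath
{
    /** Deterministic split point for the range (left, right]: any two nodes
     *  computing this for the same bounds derive the same subtree boundary. */
    public static BigInteger midpoint(BigInteger left, BigInteger right)
    {
        return left.add(right).shiftRight(1); // (left + right) / 2, rounded down
    }
}
{code}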

In order to fill in 'invalid' ranges in the tree, the MerkleTree will provide 
an operation that builds a list of invalid ranges to be fetched from disk. 
During this step, we can also compact/split ranges. Because of our Treap 
maintenance, frequently invalidated ranges will be nearer to the top of the 
tree, and stable ranges will be closer to the bottom. By compacting the deepest 
N leaves and splitting the shallowest N, we can minimize the size of the ranges 
that future invalidations will affect.
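
A rough sketch of that compact/split pass over the hypothetical types above; split(node) and compact(node) are assumed helpers on MerkleTree that insert or remove a Token, and depth stands in for the invalidation frequency that the treap priorities encode:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TreeRebalancer
{
    private static class Entry
    {
        final TreapNode node;
        final int depth;
        Entry(TreapNode node, int depth) { this.node = node; this.depth = depth; }
    }

    /** In-order walk recording the depth of every range node. */
    private static void collect(TreapNode node, int depth, List<Entry> out)
    {
        if (node == null)
            return;
        collect(node.left, depth + 1, out);
        out.add(new Entry(node, depth));
        collect(node.right, depth + 1, out);
    }

    /** Split the n shallowest (hot) ranges and compact the n deepest (cold). */
    public static void rebalance(MerkleTree tree, int n)
    {
        List<Entry> entries = new ArrayList<Entry>();
        collect(tree.root, 0, entries);
        Collections.sort(entries, new Comparator<Entry>()
        {
            public int compare(Entry a, Entry b) { return a.depth - b.depth; }
        });
        for (int i = 0; i < n && i < entries.size(); i++)
            tree.split(entries.get(i).node);   // hypothetical: adds a midpoint Token
        for (int i = 0; i < n && i < entries.size(); i++)
            tree.compact(entries.get(entries.size() - 1 - i).node); // hypothetical: removes a Token
    }
}
{code}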

Given the list of 'invalid' ranges (and pointers directly to the tree nodes), 
the AEService will fetch the ranges from the current MemTable and SSTables for 
the CF, hash them, and store the hashes into the relevant nodes. After this 
operation, we can recursively calculate hashes for inner nodes.
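
A sketch of that final recursion under the same assumptions; MD5 and the left/own/right combination rule are illustrative choices, not decisions:

{code}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MerkleHashing
{
    private static final byte[] EMPTY = new byte[0];

    private static byte[] digest(byte[]... parts)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] part : parts)
                md.update(part == null ? EMPTY : part);
            return md.digest();
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e); // MD5 is always present in the JRE
        }
    }

    /** Once the AEService has stored fresh range hashes in the formerly
     *  'invalid' nodes, derive every inner (subtree) hash bottom-up. */
    public static byte[] subtreeHash(TreapNode node)
    {
        if (node == null)
            return EMPTY;
        // combine: left subtree, this node's own range hash, right subtree
        return digest(subtreeHash(node.left), node.hash, subtreeHash(node.right));
    }
}
{code}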

> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>            Assignee: Stu Hood
>             Fix For: 0.5
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is 
> done.  This is better than nothing but is not sufficient for some cases (e.g. 
> catastrophic node failure where you need to rebuild all of a node's data on a 
> new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF 
> data model but I suppose we could just hash the serialized CF value.
