Okay, this is going to be a pretty long post, but I think its an interesting 
data model, and hopefully someone will find it worth going through.

First, I think it will be easier to understand the modeling choices I made if 
you see the end product. Go to 
http://www.fold3.com/browse.php#249|hzUkLqDmI<http://www.fold3.com/browse.php#249%7ChzUkLqDmI>.
 What you see looks like one big tree, but actually is a combination of trees 
spliced together. There is one tree in a relational database that forms what I 
call the top-level browse. The top-level browse is used to navigate through 
categories until you arrive at a publication. When you drill down into a 
publication, you are then viewing data stored in Cassandra. The link provided 
above points to the root of a publication (in this case, maps from the Civil 
War), so to the left is top-level browse coming from MySQL, and to the right is 
the Cassandra browse. Each publication has an independent tree in Cassandra, 
with all trees stored in the same set of tables (I do not dynamically create 
tables for each publication — I personally think that’s a bad practice). We 
currently have 458 publications, and collectively they have about half a 
billion nodes and consume about 400 GB (RF=3).

My trees are immutable. When there are changes to a publication (e.g. adding 
new documents), it is very difficult to know what changes need to be made to 
the tree to edit it in-place. Also, it would be impossible to maintain 
navigational consistency while a tree is in process of being restructured. So, 
when a publication changes, I create a completely new tree. Once the new tree 
is built, I change a reference to point to the new tree. I have a process that 
nightly pages through the tables and deletes records that belong to obsolete 
trees. This process takes about five hours. If it were much longer than that, I 
would probably run it weekly. My database has quite a bit of churn, which is 
fairly unusual for a Cassandra-based application. Most nights build two or 
three trees, generally resulting in a few tens of millions of new records and a 
slightly fewer number of deletions. Size-tiered compaction is a bad choice for 
churn, so I use leveled compaction. Most publications are at most a few million 
nodes, and generally build in less than 20 minutes.

Since any modeling exercise requires knowing the queries, I should describe 
that before getting into the model. Here are the features I need to support. 
For browsing the tree, I need to be able to get the children of a node 
(paginated), the siblings of a node (also paginated), and the ancestors of a 
node. The leaves of each tree are images and form a filmstrip. You can use the 
filmstrip to navigate through all the images in a publication in the tree’s 
natural order. If you go to my browse page and keep drilling down, you’ll 
eventually get to an image. The filmstrip appears at the bottom of the image 
viewer.

Before I discuss the schema, I should discuss a couple of other non-obvious 
things that are relevant to the data model. One very common operation is to 
retrieve a node and all of its ancestors in order to display a path. 
Denormalization would suggest that I store the data for each node, along with 
that of all of its ancestors. That would mean that in my biggest tree, I would 
store the root node 180 million times. I didn’t consider that kind of bloat to 
be acceptable, so I do not denormalize ancestors. I also wanted to retrieve a 
node and its ancestors in constant time, rather than O(n) as would be typical 
for tree traversal. In order to accomplish this, I use a pretty unique idea for 
a node's primary key. I create a hash from information in the node, and then 
append it to the hash of its parent. So, the primary key is really a path. When 
I need to retrieve a node and its ancestors, I tokenize the path and issue 
queries in parallel to get all the nodes in the ancestry at the same time. In 
keeping with this pattern of not denormalizing, my auxiliary tables do not have 
node data in them, but instead provide a means of getting hash paths, which I 
then tokenize and make parallel requests with. Most requests that use an 
auxiliary table can generally just make a query to the auxiliary table to get 
the hash path, and then retrieve the node and its ancestors from the node 
table. Three or fewer trips to Cassandra are sufficient for all my API’s.

Without further ado, here’s my schema (with commentary):

CREATE TABLE tree (
tree INT,
pub INT,
rhpath VARCHAR,
atime TIMESTAMP,
ccount INT,
ncount INT,
PRIMARY KEY (tree)
) WITH gc_grace_seconds = 864000;

This table maintains the references to the root nodes for each tree. pub is the 
primary key for the publication table in my relational database. There is 
usually just one record for each publication. When a tree is being built (and 
until the old one is retired), a publication may have more than one tree. This 
table is small (458 records), and I cache it in memory and maintain inverted 
indexes. Not caching this table would result in one additional trip to 
Cassandra for all API’s. Each tree is assigned a random tree ID (which is an 
INT instead of UUID for reasons beyond this discussion). All my tables have a 
tree ID in them so I can know which tree each node belongs to, since the hash 
path is not unique.

CREATE TABLE node (
hpath VARCHAR,
tree INT,
node VARCHAR,
ccount INT,
PRIMARY KEY (hpath, tree)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 
'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This is the main table that holds all the text and ordinal information for all 
my nodes. This table contains the lion’s share of the data in my cluster. The 
node column is a JSON document. ccount is the number of child nodes.

CREATE TABLE path_by_page (
page BIGINT,
tree INT,
hpath VARCHAR,
pord INT,
ord INT,
PRIMARY KEY (page, tree)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 
'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to get the hash path for any image. page is the primary 
key of the page from my relational database. pord is the image’s ordinal in the 
publication filmstrip. ord is the page’s ordinal amongst its siblings.

CREATE TABLE path_by_pub (
tree INT,
bucket INT,
pord INT,
ord INT,
hpath VARCHAR,
page BIGINT,
PRIMARY KEY ((tree, bucket), pord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 
'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to do the filmstrip. pord is what I key off of to paginate.

CREATE TABLE path_by_parent (
phpath VARCHAR,
bucket INT,
tree INT,
ord INT,
hpath VARCHAR,
PRIMARY KEY ((phpath, bucket, tree), ord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 
'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to get the children for a node. ord is the node's ordinal 
within its siblings. It is what I key off of to paginate. The link I provided 
above is to a publication that has a fanout of 1 for the leaf node’s parent 
nodes, so isn’t very interesting (although the content is very interesting). 
Here’s a more interesting node that has bigger fanout: 
http://www.fold3.com/browse.php#246|hrRUXqv6Rj6GFxKAo<http://www.fold3.com/browse.php#246%7ChrRUXqv6Rj6GFxKAo>.
 And finally, here’s a node with a fanout of 378622: 
http://www.fold3.com/browse.php#1|hhqJwp03TQBwCAyoD<http://www.fold3.com/browse.php#1%7ChhqJwp03TQBwCAyoD>.

As long as this post is, it probably wasn’t enough to fully understand 
everything I do with my schema. I have dozens of queries. If anyone would like 
me to dig a little deeper, I’d be happy to. Just email me.

Robert

On Mar 27, 2015, at 5:35 PM, Ben Bromhead 
<b...@instaclustr.com<mailto:b...@instaclustr.com>> wrote:

+1 would love to see how you do it

On 27 March 2015 at 07:18, Jonathan Haddad 
<j...@jonhaddad.com<mailto:j...@jonhaddad.com>> wrote:
I'd be interested to see that data model. I think the entire list would benefit!

On Thu, Mar 26, 2015 at 8:16 PM Robert Wille 
<rwi...@fold3.com<mailto:rwi...@fold3.com>> wrote:
I have a cluster which stores tree structures. I keep several hundred unrelated 
trees. The largest has about 180 million nodes, and the smallest has 1 node. 
The largest fanout is almost 400K. Depth is arbitrary, but in practice is 
probably less than 10. I am able to page through children and siblings. It 
works really well.

Doesn’t sound like its exactly like what you’re looking for, but if you want 
any pointers on how I went about implementing mine, I’d be happy to share.

On Mar 26, 2015, at 3:05 PM, List 
<l...@airstreamcomm.net<mailto:l...@airstreamcomm.net>> wrote:

> Not sure if this is the right place to ask, but we are trying to model a 
> user-generated tree hierarchy in which they create child objects of a root 
> node, and can create an arbitrary number of children (and children of 
> children, and on and on).  So far we have looked at storing each tree 
> structure as a single document in JSON format and reading/writing it out in 
> it's entirety, doing materialized paths where we store the root id with every 
> child and the tree structure above the child as a map, and some form of an 
> adjacency list (which does not appear to be very viable as looking up the 
> entire tree would be ridiculous).
>
> The hope is to end up with a data model that allows us to display the entire 
> tree quickly, as well as see the entire path to a leaf when selecting that 
> leaf.  If anyone has some suggestions/experience on how to model such a tree 
> heirarchy we would greatly appreciate your input.
>




--

Ben Bromhead

Instaclustr | www.instaclustr.com<https://www.instaclustr.com/> | 
@instaclustr<http://twitter.com/instaclustr> | (650) 284 9692

Reply via email to