that's neat, thanks for sharing. sounds like the solution is partly inspired by merkle tree to make lookup fast and easy.
peter On Fri, Mar 27, 2015 at 10:07 PM, Robert Wille <[email protected]> wrote: > Okay, this is going to be a pretty long post, but I think its an > interesting data model, and hopefully someone will find it worth going > through. > > First, I think it will be easier to understand the modeling choices I > made if you see the end product. Go to > http://www.fold3.com/browse.php#249|hzUkLqDmI. What you see looks like > one big tree, but actually is a combination of trees spliced together. > There is one tree in a relational database that forms what I call the > top-level browse. The top-level browse is used to navigate through > categories until you arrive at a publication. When you drill down into a > publication, you are then viewing data stored in Cassandra. The link > provided above points to the root of a publication (in this case, maps from > the Civil War), so to the left is top-level browse coming from MySQL, and > to the right is the Cassandra browse. Each publication has an independent > tree in Cassandra, with all trees stored in the same set of tables (I do > not dynamically create tables for each publication — I personally think > that’s a bad practice). We currently have 458 publications, and > collectively they have about half a billion nodes and consume about 400 GB > (RF=3). > > My trees are immutable. When there are changes to a publication (e.g. > adding new documents), it is very difficult to know what changes need to be > made to the tree to edit it in-place. Also, it would be impossible to > maintain navigational consistency while a tree is in process of being > restructured. So, when a publication changes, I create a completely new > tree. Once the new tree is built, I change a reference to point to the new > tree. I have a process that nightly pages through the tables and deletes > records that belong to obsolete trees. This process takes about five hours. > If it were much longer than that, I would probably run it weekly. My > database has quite a bit of churn, which is fairly unusual for a > Cassandra-based application. Most nights build two or three trees, > generally resulting in a few tens of millions of new records and a slightly > fewer number of deletions. Size-tiered compaction is a bad choice for > churn, so I use leveled compaction. Most publications are at most a few > million nodes, and generally build in less than 20 minutes. > > Since any modeling exercise requires knowing the queries, I should > describe that before getting into the model. Here are the features I need > to support. For browsing the tree, I need to be able to get the children of > a node (paginated), the siblings of a node (also paginated), and the > ancestors of a node. The leaves of each tree are images and form a > filmstrip. You can use the filmstrip to navigate through all the images in > a publication in the tree’s natural order. If you go to my browse page and > keep drilling down, you’ll eventually get to an image. The filmstrip > appears at the bottom of the image viewer. > > Before I discuss the schema, I should discuss a couple of other > non-obvious things that are relevant to the data model. One very common > operation is to retrieve a node and all of its ancestors in order to > display a path. Denormalization would suggest that I store the data for > each node, along with that of all of its ancestors. That would mean that in > my biggest tree, I would store the root node 180 million times. I didn’t > consider that kind of bloat to be acceptable, so I do not denormalize > ancestors. I also wanted to retrieve a node and its ancestors in constant > time, rather than O(n) as would be typical for tree traversal. In order to > accomplish this, I use a pretty unique idea for a node's primary key. I > create a hash from information in the node, and then append it to the hash > of its parent. So, the primary key is really a path. When I need to > retrieve a node and its ancestors, I tokenize the path and issue queries in > parallel to get all the nodes in the ancestry at the same time. In keeping > with this pattern of not denormalizing, my auxiliary tables do not have > node data in them, but instead provide a means of getting hash paths, which > I then tokenize and make parallel requests with. Most requests that use an > auxiliary table can generally just make a query to the auxiliary table to > get the hash path, and then retrieve the node and its ancestors from the > node table. Three or fewer trips to Cassandra are sufficient for all my > API’s. > > Without further ado, here’s my schema (with commentary): > > CREATE TABLE tree ( > tree INT, > pub INT, > rhpath VARCHAR, > atime TIMESTAMP, > ccount INT, > ncount INT, > PRIMARY KEY (tree) > ) WITH gc_grace_seconds = 864000; > > This table maintains the references to the root nodes for each tree. pub > is the primary key for the publication table in my relational database. > There is usually just one record for each publication. When a tree is being > built (and until the old one is retired), a publication may have more > than one tree. This table is small (458 records), and I cache it in memory > and maintain inverted indexes. Not caching this table would result in one > additional trip to Cassandra for all API’s. Each tree is assigned a random > tree ID (which is an INT instead of UUID for reasons beyond this > discussion). All my tables have a tree ID in them so I can know which tree > each node belongs to, since the hash path is not unique. > > CREATE TABLE node ( > hpath VARCHAR, > tree INT, > node VARCHAR, > ccount INT, > PRIMARY KEY (hpath, tree) > ) WITH gc_grace_seconds = 0 AND compaction = { 'class' : > 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 }; > > This is the main table that holds all the text and ordinal information > for all my nodes. This table contains the lion’s share of the data in my > cluster. The node column is a JSON document. ccount is the number of child > nodes. > > CREATE TABLE path_by_page ( > page BIGINT, > tree INT, > hpath VARCHAR, > pord INT, > ord INT, > PRIMARY KEY (page, tree) > ) WITH gc_grace_seconds = 0 AND compaction = { 'class' : > 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 }; > > This table allows me to get the hash path for any image. page is the > primary key of the page from my relational database. pord is the image’s > ordinal in the publication filmstrip. ord is the page’s ordinal amongst its > siblings. > > CREATE TABLE path_by_pub ( > tree INT, > bucket INT, > pord INT, > ord INT, > hpath VARCHAR, > page BIGINT, > PRIMARY KEY ((tree, bucket), pord) > ) WITH gc_grace_seconds = 0 AND compaction = { 'class' : > 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 }; > > This table allows me to do the filmstrip. pord is what I key off of to > paginate. > > CREATE TABLE path_by_parent ( > phpath VARCHAR, > bucket INT, > tree INT, > ord INT, > hpath VARCHAR, > PRIMARY KEY ((phpath, bucket, tree), ord) > ) WITH gc_grace_seconds = 0 AND compaction = { 'class' : > 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 }; > > This table allows me to get the children for a node. ord is the node's > ordinal within its siblings. It is what I key off of to paginate. The link > I provided above is to a publication that has a fanout of 1 for the leaf > node’s parent nodes, so isn’t very interesting (although the content is > very interesting). Here’s a more interesting node that has bigger fanout: > http://www.fold3.com/browse.php#246|hrRUXqv6Rj6GFxKAo. And finally, > here’s a node with a fanout of 378622: > http://www.fold3.com/browse.php#1|hhqJwp03TQBwCAyoD. > > As long as this post is, it probably wasn’t enough to fully understand > everything I do with my schema. I have dozens of queries. If anyone would > like me to dig a little deeper, I’d be happy to. Just email me. > > Robert > > On Mar 27, 2015, at 5:35 PM, Ben Bromhead <[email protected]> wrote: > > +1 would love to see how you do it > > On 27 March 2015 at 07:18, Jonathan Haddad <[email protected]> wrote: > >> I'd be interested to see that data model. I think the entire list would >> benefit! >> >> On Thu, Mar 26, 2015 at 8:16 PM Robert Wille <[email protected]> wrote: >> >>> I have a cluster which stores tree structures. I keep several hundred >>> unrelated trees. The largest has about 180 million nodes, and the smallest >>> has 1 node. The largest fanout is almost 400K. Depth is arbitrary, but in >>> practice is probably less than 10. I am able to page through children and >>> siblings. It works really well. >>> >>> Doesn’t sound like its exactly like what you’re looking for, but if you >>> want any pointers on how I went about implementing mine, I’d be happy to >>> share. >>> >>> On Mar 26, 2015, at 3:05 PM, List <[email protected]> wrote: >>> >>> > Not sure if this is the right place to ask, but we are trying to model >>> a user-generated tree hierarchy in which they create child objects of a >>> root node, and can create an arbitrary number of children (and children of >>> children, and on and on). So far we have looked at storing each tree >>> structure as a single document in JSON format and reading/writing it out in >>> it's entirety, doing materialized paths where we store the root id with >>> every child and the tree structure above the child as a map, and some form >>> of an adjacency list (which does not appear to be very viable as looking up >>> the entire tree would be ridiculous). >>> > >>> > The hope is to end up with a data model that allows us to display the >>> entire tree quickly, as well as see the entire path to a leaf when >>> selecting that leaf. If anyone has some suggestions/experience on how to >>> model such a tree heirarchy we would greatly appreciate your input. >>> > >>> >>> > > > -- > > Ben Bromhead > > Instaclustr | www.instaclustr.com | @instaclustr > <http://twitter.com/instaclustr> | (650) 284 9692 > > >
