That's neat, thanks for sharing.

Sounds like the solution is partly inspired by Merkle trees, to make lookups
fast and easy.

peter

On Fri, Mar 27, 2015 at 10:07 PM, Robert Wille <[email protected]> wrote:

>  Okay, this is going to be a pretty long post, but I think it's an
> interesting data model, and hopefully someone will find it worth going
> through.
>
>  First, I think it will be easier to understand the modeling choices I
> made if you see the end product. Go to
> http://www.fold3.com/browse.php#249|hzUkLqDmI. What you see looks like
> one big tree, but actually is a combination of trees spliced together.
> There is one tree in a relational database that forms what I call the
> top-level browse. The top-level browse is used to navigate through
> categories until you arrive at a publication. When you drill down into a
> publication, you are then viewing data stored in Cassandra. The link
> provided above points to the root of a publication (in this case, maps from
> the Civil War), so to the left is top-level browse coming from MySQL, and
> to the right is the Cassandra browse. Each publication has an independent
> tree in Cassandra, with all trees stored in the same set of tables (I do
> not dynamically create tables for each publication — I personally think
> that’s a bad practice). We currently have 458 publications, and
> collectively they have about half a billion nodes and consume about 400 GB
> (RF=3).
>
>  My trees are immutable. When there are changes to a publication (e.g.
> adding new documents), it is very difficult to know what changes need to be
> made to the tree to edit it in-place. Also, it would be impossible to
> maintain navigational consistency while a tree is in the process of being
> restructured. So, when a publication changes, I create a completely new
> tree. Once the new tree is built, I change a reference to point to the new
> tree. I have a process that nightly pages through the tables and deletes
> records that belong to obsolete trees. This process takes about five hours.
> If it were much longer than that, I would probably run it weekly. My
> database has quite a bit of churn, which is fairly unusual for a
> Cassandra-based application. Most nights build two or three trees,
> generally resulting in a few tens of millions of new records and slightly
> fewer deletions. Size-tiered compaction is a bad choice for
> churn, so I use leveled compaction. Most publications are at most a few
> million nodes, and generally build in less than 20 minutes.
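>
>  As a rough sketch of what the nightly cleanup does (the paging scheme and
> all values here are illustrative, not the exact production code): scan each
> table in token ranges, and delete any row whose tree ID is no longer in the
> live set from the tree table.
>
>  -- Page through one token range of the node table (bounds are made up):
>  SELECT hpath, tree FROM node
>  WHERE token(hpath) > -9223372036854775808 AND token(hpath) <= -9000000000000000000;
>  -- Each row belonging to an obsolete tree is deleted by full primary key:
>  DELETE FROM node WHERE hpath = 'hAAAhBBB' AND tree = 12345;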
>
>  Since any modeling exercise requires knowing the queries, I should
> describe that before getting into the model. Here are the features I need
> to support. For browsing the tree, I need to be able to get the children of
> a node (paginated), the siblings of a node (also paginated), and the
> ancestors of a node. The leaves of each tree are images and form a
> filmstrip. You can use the filmstrip to navigate through all the images in
> a publication in the tree’s natural order. If you go to my browse page and
> keep drilling down, you’ll eventually get to an image. The filmstrip
> appears at the bottom of the image viewer.
>
>  Before I discuss the schema, I should discuss a couple of other
> non-obvious things that are relevant to the data model. One very common
> operation is to retrieve a node and all of its ancestors in order to
> display a path. Denormalization would suggest that I store the data for
> each node, along with that of all of its ancestors. That would mean that in
> my biggest tree, I would store the root node 180 million times. I didn’t
> consider that kind of bloat to be acceptable, so I do not denormalize
> ancestors. I also wanted to retrieve a node and its ancestors in a constant
> number of round trips, rather than one lookup per level as would be typical
> for tree traversal. To accomplish this, I use a somewhat unusual idea for a
> node's primary key. I
> create a hash from information in the node, and then append it to the hash
> of its parent. So, the primary key is really a path. When I need to
> retrieve a node and its ancestors, I tokenize the path and issue queries in
> parallel to get all the nodes in the ancestry at the same time. In keeping
> with this pattern of not denormalizing, my auxiliary tables do not have
> node data in them, but instead provide a means of getting hash paths, which
> I then tokenize and make parallel requests with. Requests that use an
> auxiliary table generally make one query to the auxiliary table to
> get the hash path, and then retrieve the node and its ancestors from the
> node table. Three or fewer trips to Cassandra are sufficient for all my
> APIs.
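>
>  To make that concrete, here is a rough sketch (the segment format, hash
> values and tree ID are all made up): suppose a node's hash path is
> 'hAAAhBBBhCCC', where 'hAAA' is the root's hash and each level appends one
> more segment. Tokenizing it gives the hash path of the node and of each
> ancestor, and the lookups against the node table (defined below) are issued
> in parallel:
>
>  SELECT node, ccount FROM node WHERE hpath = 'hAAA' AND tree = 12345;
>  SELECT node, ccount FROM node WHERE hpath = 'hAAAhBBB' AND tree = 12345;
>  SELECT node, ccount FROM node WHERE hpath = 'hAAAhBBBhCCC' AND tree = 12345;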
>
>  Without further ado, here’s my schema (with commentary):
>
>  CREATE TABLE tree (
> tree INT,
> pub INT,
> rhpath VARCHAR,
> atime TIMESTAMP,
> ccount INT,
> ncount INT,
> PRIMARY KEY (tree)
> ) WITH gc_grace_seconds = 864000;
>
>  This table maintains the references to the root nodes for each tree. pub
> is the primary key for the publication table in my relational database.
> There is usually just one record for each publication. When a tree is being
> built (and until the old one is retired), a publication may have more
> than one tree. This table is small (458 records), and I cache it in memory
> and maintain inverted indexes. Not caching this table would result in one
> additional trip to Cassandra for all APIs. Each tree is assigned a random
> tree ID (which is an INT instead of UUID for reasons beyond this
> discussion). All my tables have a tree ID in them so I can know which tree
> each node belongs to, since the hash path is not unique.
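>
>  For illustration, the cache load is just a full read of this small table
> at startup (the inverted indexes are built in application code):
>
>  SELECT tree, pub, rhpath, atime, ccount, ncount FROM tree;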
>
>  CREATE TABLE node (
> hpath VARCHAR,
> tree INT,
> node VARCHAR,
> ccount INT,
> PRIMARY KEY (hpath, tree)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
>  This is the main table that holds all the text and ordinal information
> for all my nodes. This table contains the lion’s share of the data in my
> cluster. The node column is a JSON document. ccount is the number of child
> nodes.
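>
>  For illustration, a build writes one row per node, keyed by its full hash
> path (the values and JSON here are made up):
>
>  INSERT INTO node (hpath, tree, node, ccount)
>  VALUES ('hAAAhBBB', 12345, '{"title": "Example node"}', 42);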
>
>  CREATE TABLE path_by_page (
> page BIGINT,
> tree INT,
> hpath VARCHAR,
> pord INT,
> ord INT,
> PRIMARY KEY (page, tree)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
>  This table allows me to get the hash path for any image. page is the
> primary key of the page from my relational database. pord is the image’s
> ordinal in the publication filmstrip. ord is the page’s ordinal amongst its
> siblings.
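>
>  An illustrative lookup (IDs are made up): given a page ID from the
> relational database, this returns the hash path, which I then tokenize and
> fetch as described above:
>
>  SELECT hpath, pord, ord FROM path_by_page WHERE page = 987654321 AND tree = 12345;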
>
>  CREATE TABLE path_by_pub (
> tree INT,
> bucket INT,
> pord INT,
> ord INT,
> hpath VARCHAR,
> page BIGINT,
> PRIMARY KEY ((tree, bucket), pord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
>  This table allows me to do the filmstrip. pord is what I key off of to
> paginate.
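>
>  A sketch of the filmstrip paging (the bucket value, starting position and
> page size are illustrative, and I'm not going into how buckets are assigned
> here):
>
>  -- Next 20 images after filmstrip position 1000 within one bucket:
>  SELECT pord, ord, hpath, page FROM path_by_pub
>  WHERE tree = 12345 AND bucket = 3 AND pord > 1000
>  LIMIT 20;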
>
>  CREATE TABLE path_by_parent (
> phpath VARCHAR,
> bucket INT,
> tree INT,
> ord INT,
> hpath VARCHAR,
> PRIMARY KEY ((phpath, bucket, tree), ord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>
>  This table allows me to get the children for a node. ord is the node's
> ordinal within its siblings. It is what I key off of to paginate. The link
> I provided above is to a publication that has a fanout of 1 for the leaf
> node’s parent nodes, so it isn’t very interesting (although the content is
> very interesting). Here’s a more interesting node that has bigger fanout:
> http://www.fold3.com/browse.php#246|hrRUXqv6Rj6GFxKAo. And finally,
> here’s a node with a fanout of 378622:
> http://www.fold3.com/browse.php#1|hhqJwp03TQBwCAyoD.
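>
>  A sketch of paging through a node's children (again, the parent hash path,
> bucket and page size are illustrative):
>
>  -- First page of children under a parent node:
>  SELECT ord, hpath FROM path_by_parent
>  WHERE phpath = 'hAAAhBBB' AND bucket = 0 AND tree = 12345
>  LIMIT 50;
>  -- The next page continues after the last ord returned (49 here):
>  SELECT ord, hpath FROM path_by_parent
>  WHERE phpath = 'hAAAhBBB' AND bucket = 0 AND tree = 12345 AND ord > 49
>  LIMIT 50;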
>
>  As long as this post is, it probably wasn’t enough to fully understand
> everything I do with my schema. I have dozens of queries. If anyone would
> like me to dig a little deeper, I’d be happy to. Just email me.
>
>  Robert
>
>  On Mar 27, 2015, at 5:35 PM, Ben Bromhead <[email protected]> wrote:
>
>  +1 would love to see how you do it
>
> On 27 March 2015 at 07:18, Jonathan Haddad <[email protected]> wrote:
>
>> I'd be interested to see that data model. I think the entire list would
>> benefit!
>>
>> On Thu, Mar 26, 2015 at 8:16 PM Robert Wille <[email protected]> wrote:
>>
>>> I have a cluster which stores tree structures. I keep several hundred
>>> unrelated trees. The largest has about 180 million nodes, and the smallest
>>> has 1 node. The largest fanout is almost 400K. Depth is arbitrary, but in
>>> practice is probably less than 10. I am able to page through children and
>>> siblings. It works really well.
>>>
>>> Doesn’t sound like it’s exactly what you’re looking for, but if you
>>> want any pointers on how I went about implementing mine, I’d be happy to
>>> share.
>>>
>>> On Mar 26, 2015, at 3:05 PM, List <[email protected]> wrote:
>>>
>>> > Not sure if this is the right place to ask, but we are trying to model
>>> a user-generated tree hierarchy in which they create child objects of a
>>> root node, and can create an arbitrary number of children (and children of
>>> children, and on and on).  So far we have looked at storing each tree
>>> structure as a single document in JSON format and reading/writing it out in
>>> its entirety, doing materialized paths where we store the root id with
>>> every child and the tree structure above the child as a map, and some form
>>> of an adjacency list (which does not appear to be very viable as looking up
>>> the entire tree would be ridiculous).
>>> >
>>> > The hope is to end up with a data model that allows us to display the
>>> entire tree quickly, as well as see the entire path to a leaf when
>>> selecting that leaf.  If anyone has some suggestions/experience on how to
>>> model such a tree hierarchy, we would greatly appreciate your input.
>>> >
>>>
>>>
>
>
>  --
>
> Ben Bromhead
>
> Instaclustr | www.instaclustr.com | @instaclustr
> <http://twitter.com/instaclustr> | (650) 284 9692
>
>
>
