Hi,

On Mon, Jun 23, 2014 at 3:46 AM, Thomas Mueller <[email protected]> wrote:
> What if a node contains millions of (direct) child nodes, how would one do
> an efficient lookup?

The index structure contains the names of the matching subtrees, which
allows us to avoid iterating over all the child nodes in such cases.

For example, a part of the index structure for a node with 10k child
nodes named "aNNNN" might look something like this:

    { "foo": { "a2135": 238, "a5382": 13 },
      "bar": { "a2135": 172, "a9821": 86 } }

When looking up nodes with "foo", you'd only need to consider the
"a2135" and "a5382" subtrees, and can ignore the remaining 9998
children.

> We have quite many (property) indexes, what would be the storage overhead?
> (I think it would be quite significant with about 100 property indexes.)

This would be a separate index, with nothing in common with the
property indexes.

Given the index structure and the way SegmentMK avoids storing
duplicate copies of repeating names, I expect the storage overhead to
be pretty low, < 1% of the repository size. Big-O calculation of the
storage overhead is a bit complicated, so I plan to do a simple
prototype to better estimate the actual overhead on a few real-world
repositories.

> How could this index structure speed up node name queries? Would the root
> node know all the names of all nodes in the repository?

The root node would know all the *distinct* names in the repository,
along with the subtrees that contain at least a single instance of
those names.

> What would be very, very, very valuable (to better estimate the cost of an
> index or traversal), is to know the number of descendent nodes of a node,
> see also OAK-894.

Checking this index for the count of "jcr:primaryType" would give a
pretty accurate estimate of the size of a subtree.

Alternatively the index could be easily be extended to also maintain
an exact subtree size count, for example by indexing a virtual ":node"
name for each node in the repository. Similarly a property count could
be implemented by indexing ":property" along with the normal name of
each property. Even further, we could pretty efficiently keep track of
the size of each subtree by indexing ":size" for the length() of each
value. With such extensions the index data for an nt:resource leaf
node could look like this:

    { "jcr:primaryType": 1,
      "jcr:uuid", 1,
      "jcr:data": 1,
      "jcr:lastModified": 1,
      "jcr:mimeType": 1,
      ":node": 1,
      ":property", 5,
      ":size", 15827 }

(Of course, to save storage space, such single-node index data would
not actually be stored in the repository, only aggregated to the index
entries higher up the tree.)

BR,

Jukka Zitting

Reply via email to