Hi, What if a node contains millions of (direct) child nodes, how would one do an efficient lookup?
We have quite many (property) indexes, what would be the storage overhead? (I think it would be quite significant with about 100 property indexes.) > for speeding up node name and property existence queries. How could this index structure speed up node name queries? Would the root node know all the names of all nodes in the repository? What would be very, very, very valuable (to better estimate the cost of an index or traversal), is to know the number of descendent nodes of a node, see also OAK-894. Regards, Thomas On 20/06/14 18:06, "Jukka Zitting" <[email protected]> wrote: >Hi, > >Here's an idea for an index structure (for now somewhat specific to >SegmentMK) for speeding up node name and property existence queries. >The index could also be used to avoid full repository traversals when >processing queries with unindexed property constraints. In some cases >it might also work pretty well for queries with multi-property >constraints for which we currently don't have very good support. > >INDEX STRUCTURE > >The basic idea is to keep track of how many times each node or >property name occurs within each subtree. The index structure would >look something like this: > > { "foo": { "a": 238, "b": 13 }, > "bar": { "a": 172, "c": 86 } } > >Such an index would for example tell that there are 238 nodes named >"foo" or with a property named "foo" within the "/a" subtree, and 86 >nodes with "bar" within the "/c" subtree. > >The index structure would be repeated for each subtree that contains >more than two nodes. For example the index structure for "/a" could >be: > > { "foo": { "d": 137, "e": 100 }, > "bar": { "e": 100, "f": 72 } } > >So we'd have 137 nodes with "foo" in the "/a/d" subtree and 100 in the >"/a/e" subtree. Based on the index we can also tell that there are no >"bar" nodes or properties within the "/a/d" subtree. > >COST CALCULATION > >To calculate the cost of a query like "//[@foo='...']", you'd look at >the index structure and add together the subtree counts for that name. >In this case the cost estimate would be: > > 238 + 13 = 251 > >If multiple names are referenced in a query, like "//[@foo='...' and >@bar='...']", then you'd take the counts for both names and add >together the minimum value of each subtree. In this case the cost >estimate would be: > > min(238, 172) + min(13, 0) + min(0, 86) = 172 > >If a query has a path restriction, like in >"/jcr:root/a//[@foo='...']", we can use the index structure within >that path for a more specific cost estimate. > >INDEX LOOKUP > >Index lookup would be implemented as a tree traversal that's guided by >the subtree counts stored in the index. > >When looking up a query like "//[@foo='...']", you'd first look a the >top-level index structure to find that the only nodes with "foo" items >are within the "/a" and "/b" subtrees, and would then descend to those >subtrees. In "/a" you'd find that there are more "foo" items within >"/a/d" and "/a/e" so that's where the traversal would continue. Also, >since 137 + 100 < 238, you'd report the "/a" path as a potential match >for the query. Finally, when encountering a leaf node for which no >explicit index data is present, you'd do a hasProperty("foo") check to >determine whether that path should be reported as a potential match. > >And if multiple names are referenced in a query, like "//[@foo='...' >and @bar='...']", you'd use the minimum of the matching subtree counts >to direct the tree traversal. Since by looking at the index we can >tell that neither the "/b" nor the "/c" subtree contains both "foo" >and "bar" items, the traversal can be limited to just the "/a" >subtree, and in there to the "/a/e" subtree that is the only place >where both "foo" and "bar" can be found. > >If a query has a path restriction, the guided traversal can start >directly at that path. > >INDEX UPDATES > >The index would be maintained by a normal index editor that for each >added/removed items would update the respective name counts at each >level of the index up to the root. These updates can be consolidated >at each level of the index to keep the number of writes down to a >minimum. The cost of updating the index would be O(d * k) where d is >the depth of the changes in a commit and k the number of added/removed >items. The cost is similar to that of the property index where k is >the number of added/removed values. Changing the value of an existing >property would trigger no updates to this index. > >Since the index updates work their way recursively up to the root of >the tree, there is an inherent synchronization bottleneck in updating >the count values higher up the tree. This (and the way storage is >optimized) makes this index structure well suited for the SegmentMK, >but presents a problem for the DocMK. In there a better way to >implement a similar index might be to leverage the underlying indexing >capabilities of MongoDB or whichever database is used under the hood. >Alternatively it might be possible to adjust the index structure to >avoid this problem, though for now I don't see any good way of doing >so. > >BR, > >Jukka Zitting
