[ 
https://issues.apache.org/jira/browse/OAK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138769#comment-14138769
 ] 

Thomas Mueller commented on OAK-1907:
-------------------------------------

[~mreutegg], [~amitjain], [~chetanm], and me discussed a few options to 
maintain the approximate descendant node count at different levels of the 
repository. All of the options have advantages and disadvantages:

* (A) ApproxCount above. No state, relatively little data per counter, no 
problem with conflicts. Main disadvantage: Not accurate after many entries have 
been removed (and possibly re-added).

* (B) Provide a "cluster node id" and possibly "session id" in the Oak API, 
which is guaranteed to be unique within the cluster. Then each cluster node 
(and possibly session) would keep it's own private counter property. When 
reading, the counters of all cluster node is summed up. Old (now unused) 
cluster nodes ids might be a problem (some kind of garbage collection might be 
needed).

* (C) Hidden ":count" property in Oak, special-cased in the NodeStore 
(DocumentNodeStore and SegmentNodeStore separately). The NodeStore (at low 
level) would "special case" that property and automatically merge concurrent 
updates based on a rule (which is: sum up the changes). Would need to be 
implemented in each NodeStore separately. Main disadvantage: we feel this is 
kind of a "hack" because it's a "magic" special case in the NodeStore. Merging 
changes is probably not accurate in all cases (for example for branch commits), 
but for this use case that's not important.

* (D) The DocumentNodeStore could maintain a relatively accurate counter as an 
internal property, similar to the _modCount and _lastRev. Merging changes is 
probably not accurate in all cases, as above. The problem is, the 
SegmentNodeStore would need a separate algorithm.

* (E) Maintain a node "count" (possibly hidden, depending on where we want to 
use it). When adding / removing a node in the given tree, with probability 
1:1000 a randomly named property or child node is added. A background thread 
per cluster (similar to the AsyncIndex updater) would periodically clean up 
(consolidate) those nodes.

The current plan is that I will try out (E) for property indexes, in 
combination with (A).

> Better cost estimates for traversal, property, and ordered indexes
> ------------------------------------------------------------------
>
>                 Key: OAK-1907
>                 URL: https://issues.apache.org/jira/browse/OAK-1907
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.0, 1.0.1, 1.0.2
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>             Fix For: 1.1
>
>         Attachments: ApproxCount.java, OAK-1907.diff
>
>
> Currently, cost estimates of traversal, property index, and ordered index 
> don't take the number of nodes into account, if there are more than about 100 
> nodes. This is problematic because in many cases, the wrong index is used 
> (because of incorrect cost estimate).
> To get a better estimate, a very rough estimate on the number of child nodes 
> below a given path is needed. 
> One idea is: when adding a node, if Math.random() < 0.00001, add a hidden, 
> randomly named property (for example called ":count-xyz" where xyz is a uuid, 
> value 100'000) to the parents of that node, so that we know there are 
> probably more than 100'000 nodes below a given path. When removing a node, 
> with the same algorithm add a hidden property (":count-xyz", value -100'000). 
> That should result in a slowdown of less than 0.01%, but should allow us much 
> better cost estimates. Those properties could be consolidated asynchronously 
> if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to