[ 
https://issues.apache.org/jira/browse/OAK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-1907:
--------------------------------

    Attachment: ApproxCount.java

Sample, standalone program (proof of concept). The program runs 10 million 
random operations, one operation is either "add a node" or "remove a node". It 
uses the "approximate counting algorithm" 
(http://en.wikipedia.org/wiki/Approximate_counting_algorithm) to keep track of 
the added and removed nodes (separately). The estimated node count is probably 
somewhere in the range \[(added - removed) ... (added / 2)\]. It uses randomly 
named properties to ensure nodes can be added and removed concurrently.

With this algorithm, after many operations, the counter would  have to be reset 
or replaced with the real count. That could be done as part of garbage 
collection or so.


> Better cost estimates for traversal, property, and ordered indexes
> ------------------------------------------------------------------
>
>                 Key: OAK-1907
>                 URL: https://issues.apache.org/jira/browse/OAK-1907
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.0, 1.0.1, 1.0.2
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>             Fix For: 1.1
>
>         Attachments: ApproxCount.java
>
>
> Currently, cost estimates of traversal, property index, and ordered index 
> don't take the number of nodes into account, if there are more than about 100 
> nodes. This is problematic because in many cases, the wrong index is used 
> (because of incorrect cost estimate).
> To get a better estimate, a very rough estimate on the number of child nodes 
> below a given path is needed. 
> One idea is: when adding a node, if Math.random() < 0.00001, add a hidden, 
> randomly named property (for example called ":count-xyz" where xyz is a uuid, 
> value 100'000) to the parents of that node, so that we know there are 
> probably more than 100'000 nodes below a given path. When removing a node, 
> with the same algorithm add a hidden property (":count-xyz", value -100'000). 
> That should result in a slowdown of less than 0.01%, but should allow us much 
> better cost estimates. Those properties could be consolidated asynchronously 
> if needed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to