Thomas Mueller created OAK-894:
----------------------------------

             Summary: Query: better cost estimates for indexes
                 Key: OAK-894
                 URL: https://issues.apache.org/jira/browse/OAK-894
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: query
            Reporter: Thomas Mueller
            Assignee: Thomas Mueller


Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
that the estimated cost for the index on "jcr:primaryType", with filter 
"jcr:primarType is not null" is relatively low. However, there are many nodes 
with this property (well, _most_ nodes have a primary type). Because of that, 
this index is used for the following query, which tests the existence of a 
child node:

{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a 
where [a/b/c/jcr:primaryType] is not null 
and isdescendantnode(a, '/home/users')
{code}

With SQL-2, the query could be written as an inner join, but I guess the query 
would end up rather complex (4 selectors as far as I see).

It would be nice if the index would result a better cost estimate. There are 
multiple ways to do that:

* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)

To automatically update the statistics, the following algorithms could be used: 
whenever adding an entry to the index, increment a counter; when removing an 
entry, decrement. This would result in many write operations and have a high 
risk of conflicts. Slightly improved: when adding/removing an entry, and if 
Math.random()<0.001, increment/decrement the counter by 1000 (which is a good 
enough accuracy for our case). This would improve performance and reduce the 
risk of conflicts; but there would still be a risk.

Updating the statistics in a background thread would require a background 
thread, but it could be done rather efficiently. We wouldn't need to read and 
count all child node entries; instead, a random walk could be used to estimate 
the number of entries. I guess it would be somewhat less accurate than the 
automatic update.

Hardcoding the statistics should be relatively easy to do. I guess this is what 
I will try for now.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to