[jira] [Commented] (OAK-894) Query: better cost estimates for indexes

Thomas Mueller (JIRA) Wed, 17 Jul 2013 01:01:16 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710836#comment-13710836
 ]


Thomas Mueller commented on OAK-894:
------------------------------------

The number of entries in the index is now stored in the index definition node 
in the property "entryCount".

In addition to that, a solution might be if one of the cluster nodes updates 
the entry count using observation. However we need to ensure this doesn't 
affect scalability (the one cluster node that is responsible for this might 
become overloaded). Also, it might be a bit complicated to ensure one of the 
cluster nodes is actually doing it (even thought it's not terrible if it's not 
done in this case). It would be nice if we come up with a solution that can not 
result in a conflict, but also doesn't require such heavy machinery.
                
> Query: better cost estimates for indexes
> ----------------------------------------
>
>                 Key: OAK-894
>                 URL: https://issues.apache.org/jira/browse/OAK-894
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
> that the estimated cost for the index on "jcr:primaryType", with filter 
> "jcr:primarType is not null" is relatively low. However, there are many nodes 
> with this property (well, _most_ nodes have a primary type). Because of that, 
> this index is used for the following query, which tests the existence of a 
> child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a 
> where [a/b/c/jcr:primaryType] is not null 
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the 
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are 
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and 
> selectivity (or number of distinct index keys), and potentially number of 
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms 
> could be used: whenever adding an entry to the index, increment a counter; 
> when removing an entry, decrement. This would result in many write operations 
> and have a high risk of conflicts. Slightly improved: when adding/removing an 
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000 
> (which is a good enough accuracy for our case). This would improve 
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background 
> thread, but it could be done rather efficiently. We wouldn't need to read and 
> count _all_ child node entries; instead, a random walk could be used to 
> estimate the number of entries. I guess it would be somewhat less accurate 
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is 
> what I will try for now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (OAK-894) Query: better cost estimates for indexes

Reply via email to