[jira] [Updated] (OAK-894) Query: better cost estimates for indexes

Thomas Mueller (JIRA) Thu, 28 Nov 2013 01:38:18 -0800

     [ 
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Mueller updated OAK-894:
-------------------------------

    Fix Version/s:     (was: 0.13)
                   0.14

> Query: better cost estimates for indexes
> ----------------------------------------
>
>                 Key: OAK-894
>                 URL: https://issues.apache.org/jira/browse/OAK-894
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>             Fix For: 0.14
>
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
> that the estimated cost for the index on "jcr:primaryType", with filter 
> "jcr:primaryType is not null" is relatively low. However, there are many 
> nodes with this property (well, _most_ nodes have a primary type). Because of 
> that, this index is used for the following query, which tests the existence 
> of a child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a 
> where [a/b/c/jcr:primaryType] is not null 
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the 
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are 
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and 
> selectivity (or number of distinct index keys), and potentially number of 
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms 
> could be used: whenever adding an entry to the index, increment a counter; 
> when removing an entry, decrement. This would result in many write operations 
> and have a high risk of conflicts. Slightly improved: when adding/removing an 
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000 
> (which is a good enough accuracy for our case). This would improve 
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background 
> thread, but it could be done rather efficiently. We wouldn't need to read and 
> count _all_ child node entries; instead, a random walk could be used to 
> estimate the number of entries. I guess it would be somewhat less accurate 
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is 
> what I will try for now.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Updated] (OAK-894) Query: better cost estimates for indexes

Reply via email to