[
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Mueller updated OAK-894:
-------------------------------
Fix Version/s: (was: 0.13)
0.14
> Query: better cost estimates for indexes
> ----------------------------------------
>
> Key: OAK-894
> URL: https://issues.apache.org/jira/browse/OAK-894
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: query
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Fix For: 0.14
>
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so
> that the estimated cost for the index on "jcr:primaryType", with filter
> "jcr:primaryType is not null" is relatively low. However, there are many
> nodes with this property (well, _most_ nodes have a primary type). Because of
> that, this index is used for the following query, which tests the existence
> of a child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a
> where [a/b/c/jcr:primaryType] is not null
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and
> selectivity (or number of distinct index keys), and potentially number of
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms
> could be used: whenever adding an entry to the index, increment a counter;
> when removing an entry, decrement. This would result in many write operations
> and have a high risk of conflicts. Slightly improved: when adding/removing an
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000
> (which is a good enough accuracy for our case). This would improve
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background
> thread, but it could be done rather efficiently. We wouldn't need to read and
> count _all_ child node entries; instead, a random walk could be used to
> estimate the number of entries. I guess it would be somewhat less accurate
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is
> what I will try for now.
--
This message was sent by Atlassian JIRA
(v6.1#6144)