[
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Mueller updated OAK-894:
-------------------------------
Description:
Currently, the cost estimation for the PropertyIndex is somewhat limited, so
that the estimated cost for the index on "jcr:primaryType", with filter
"jcr:primarType is not null" is relatively low. However, there are many nodes
with this property (well, _most_ nodes have a primary type). Because of that,
this index is used for the following query, which tests the existence of a
child node:
{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a
where [a/b/c/jcr:primaryType] is not null
and isdescendantnode(a, '/home/users')
{code}
With SQL-2, the query could be written as an inner join, but I guess the query
would end up rather complex (4 selectors as far as I see).
It would be nice if the index would return a better cost estimate. There are
multiple ways to do that:
* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)
With statistics I mean: number of entries in total within the index, and
selectivity (or number of distinct index keys), and potentially number of
entries for each index key.
To automatically update the statistics at runtime, the following algorithms
could be used: whenever adding an entry to the index, increment a counter; when
removing an entry, decrement. This would result in many write operations and
have a high risk of conflicts. Slightly improved: when adding/removing an
entry, and if Math.random()<0.001, increment/decrement the counter by 1000
(which is a good enough accuracy for our case). This would improve performance
and reduce the risk of conflicts; but there would still be a risk.
Updating the statistics in a background thread would require a background
thread, but it could be done rather efficiently. We wouldn't need to read and
count _all_ child node entries; instead, a random walk could be used to
estimate the number of entries. I guess it would be somewhat less accurate than
the automatic update.
Hardcoding the statistics should be relatively easy to do. I guess this is what
I will try for now.
was:
Currently, the cost estimation for the PropertyIndex is somewhat limited, so
that the estimated cost for the index on "jcr:primaryType", with filter
"jcr:primarType is not null" is relatively low. However, there are many nodes
with this property (well, _most_ nodes have a primary type). Because of that,
this index is used for the following query, which tests the existence of a
child node:
{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a
where [a/b/c/jcr:primaryType] is not null
and isdescendantnode(a, '/home/users')
{code}
With SQL-2, the query could be written as an inner join, but I guess the query
would end up rather complex (4 selectors as far as I see).
It would be nice if the index would result a better cost estimate. There are
multiple ways to do that:
* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)
To automatically update the statistics, the following algorithms could be used:
whenever adding an entry to the index, increment a counter; when removing an
entry, decrement. This would result in many write operations and have a high
risk of conflicts. Slightly improved: when adding/removing an entry, and if
Math.random()<0.001, increment/decrement the counter by 1000 (which is a good
enough accuracy for our case). This would improve performance and reduce the
risk of conflicts; but there would still be a risk.
Updating the statistics in a background thread would require a background
thread, but it could be done rather efficiently. We wouldn't need to read and
count all child node entries; instead, a random walk could be used to estimate
the number of entries. I guess it would be somewhat less accurate than the
automatic update.
Hardcoding the statistics should be relatively easy to do. I guess this is what
I will try for now.
> Query: better cost estimates for indexes
> ----------------------------------------
>
> Key: OAK-894
> URL: https://issues.apache.org/jira/browse/OAK-894
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: query
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so
> that the estimated cost for the index on "jcr:primaryType", with filter
> "jcr:primarType is not null" is relatively low. However, there are many nodes
> with this property (well, _most_ nodes have a primary type). Because of that,
> this index is used for the following query, which tests the existence of a
> child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a
> where [a/b/c/jcr:primaryType] is not null
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and
> selectivity (or number of distinct index keys), and potentially number of
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms
> could be used: whenever adding an entry to the index, increment a counter;
> when removing an entry, decrement. This would result in many write operations
> and have a high risk of conflicts. Slightly improved: when adding/removing an
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000
> (which is a good enough accuracy for our case). This would improve
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background
> thread, but it could be done rather efficiently. We wouldn't need to read and
> count _all_ child node entries; instead, a random walk could be used to
> estimate the number of entries. I guess it would be somewhat less accurate
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is
> what I will try for now.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira