[
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Mueller updated OAK-894:
-------------------------------
Description:
Currently, the cost estimation for the PropertyIndex is somewhat limited, so
that the estimated cost for the index on "jcr:primaryType", with filter
"jcr:primaryType is not null" is relatively low. However, there are many nodes
with this property (well, _most_ nodes have a primary type). Because of that,
this index is used for the following query, which tests the existence of a
child node:
{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a
where [a/b/c/jcr:primaryType] is not null
and isdescendantnode(a, '/home/users')
{code}
With SQL-2, the query could be written as an inner join, but I guess the query
would end up rather complex (4 selectors as far as I see).
It would be nice if the index would return a better cost estimate. There are
multiple ways to do that:
* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)
With statistics I mean: number of entries in total within the index, and
selectivity (or number of distinct index keys), and potentially number of
entries for each index key.
To automatically update the statistics at runtime, the following algorithms
could be used: whenever adding an entry to the index, increment a counter; when
removing an entry, decrement. This would result in many write operations and
have a high risk of conflicts. Slightly improved: when adding/removing an
entry, and if Math.random()<0.001, increment/decrement the counter by 1000
(which is a good enough accuracy for our case). This would improve performance
and reduce the risk of conflicts; but there would still be a risk.
Updating the statistics in a background thread would require a background
thread, but it could be done rather efficiently. We wouldn't need to read and
count _all_ child node entries; instead, a random walk could be used to
estimate the number of entries. I guess it would be somewhat less accurate than
the automatic update.
Hardcoding the statistics should be relatively easy to do. I guess this is what
I will try for now.
was:
Currently, the cost estimation for the PropertyIndex is somewhat limited, so
that the estimated cost for the index on "jcr:primaryType", with filter
"jcr:primarType is not null" is relatively low. However, there are many nodes
with this property (well, _most_ nodes have a primary type). Because of that,
this index is used for the following query, which tests the existence of a
child node:
{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a
where [a/b/c/jcr:primaryType] is not null
and isdescendantnode(a, '/home/users')
{code}
With SQL-2, the query could be written as an inner join, but I guess the query
would end up rather complex (4 selectors as far as I see).
It would be nice if the index would return a better cost estimate. There are
multiple ways to do that:
* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)
With statistics I mean: number of entries in total within the index, and
selectivity (or number of distinct index keys), and potentially number of
entries for each index key.
To automatically update the statistics at runtime, the following algorithms
could be used: whenever adding an entry to the index, increment a counter; when
removing an entry, decrement. This would result in many write operations and
have a high risk of conflicts. Slightly improved: when adding/removing an
entry, and if Math.random()<0.001, increment/decrement the counter by 1000
(which is a good enough accuracy for our case). This would improve performance
and reduce the risk of conflicts; but there would still be a risk.
Updating the statistics in a background thread would require a background
thread, but it could be done rather efficiently. We wouldn't need to read and
count _all_ child node entries; instead, a random walk could be used to
estimate the number of entries. I guess it would be somewhat less accurate than
the automatic update.
Hardcoding the statistics should be relatively easy to do. I guess this is what
I will try for now.
> Query: better cost estimates for indexes
> ----------------------------------------
>
> Key: OAK-894
> URL: https://issues.apache.org/jira/browse/OAK-894
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: query
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Fix For: 0.13
>
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so
> that the estimated cost for the index on "jcr:primaryType", with filter
> "jcr:primaryType is not null" is relatively low. However, there are many
> nodes with this property (well, _most_ nodes have a primary type). Because of
> that, this index is used for the following query, which tests the existence
> of a child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a
> where [a/b/c/jcr:primaryType] is not null
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and
> selectivity (or number of distinct index keys), and potentially number of
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms
> could be used: whenever adding an entry to the index, increment a counter;
> when removing an entry, decrement. This would result in many write operations
> and have a high risk of conflicts. Slightly improved: when adding/removing an
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000
> (which is a good enough accuracy for our case). This would improve
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background
> thread, but it could be done rather efficiently. We wouldn't need to read and
> count _all_ child node entries; instead, a random walk could be used to
> estimate the number of entries. I guess it would be somewhat less accurate
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is
> what I will try for now.
--
This message was sent by Atlassian JIRA
(v6.1#6144)