[ 
https://issues.apache.org/jira/browse/OAK-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Mueller updated OAK-894:
-------------------------------

    Description: 
Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
that the estimated cost for the index on "jcr:primaryType", with filter 
"jcr:primaryType is not null" is relatively low. However, there are many nodes 
with this property (well, _most_ nodes have a primary type). Because of that, 
this index is used for the following query, which tests the existence of a 
child node:

{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a 
where [a/b/c/jcr:primaryType] is not null 
and isdescendantnode(a, '/home/users')
{code}

With SQL-2, the query could be written as an inner join, but I guess the query 
would end up rather complex (4 selectors as far as I see).

It would be nice if the index would return a better cost estimate. There are 
multiple ways to do that:

* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)

With statistics I mean: number of entries in total within the index, and 
selectivity (or number of distinct index keys), and potentially number of 
entries for each index key.

To automatically update the statistics at runtime, the following algorithms 
could be used: whenever adding an entry to the index, increment a counter; when 
removing an entry, decrement. This would result in many write operations and 
have a high risk of conflicts. Slightly improved: when adding/removing an 
entry, and if Math.random()<0.001, increment/decrement the counter by 1000 
(which is a good enough accuracy for our case). This would improve performance 
and reduce the risk of conflicts; but there would still be a risk.

Updating the statistics in a background thread would require a background 
thread, but it could be done rather efficiently. We wouldn't need to read and 
count _all_ child node entries; instead, a random walk could be used to 
estimate the number of entries. I guess it would be somewhat less accurate than 
the automatic update.

Hardcoding the statistics should be relatively easy to do. I guess this is what 
I will try for now.


  was:
Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
that the estimated cost for the index on "jcr:primaryType", with filter 
"jcr:primarType is not null" is relatively low. However, there are many nodes 
with this property (well, _most_ nodes have a primary type). Because of that, 
this index is used for the following query, which tests the existence of a 
child node:

{code}
-- xpath:
/jcr:root/home/users//*[a/b/c/@jcr:primaryType]
-- sql2:
select [jcr:path], [jcr:score], * from [nt:base] as a 
where [a/b/c/jcr:primaryType] is not null 
and isdescendantnode(a, '/home/users')
{code}

With SQL-2, the query could be written as an inner join, but I guess the query 
would end up rather complex (4 selectors as far as I see).

It would be nice if the index would return a better cost estimate. There are 
multiple ways to do that:

* Automatically update the index statistics when modifying the index
* Update the statistics in a background thread
* Hardcode the statistics in the index (when declaring the index)

With statistics I mean: number of entries in total within the index, and 
selectivity (or number of distinct index keys), and potentially number of 
entries for each index key.

To automatically update the statistics at runtime, the following algorithms 
could be used: whenever adding an entry to the index, increment a counter; when 
removing an entry, decrement. This would result in many write operations and 
have a high risk of conflicts. Slightly improved: when adding/removing an 
entry, and if Math.random()<0.001, increment/decrement the counter by 1000 
(which is a good enough accuracy for our case). This would improve performance 
and reduce the risk of conflicts; but there would still be a risk.

Updating the statistics in a background thread would require a background 
thread, but it could be done rather efficiently. We wouldn't need to read and 
count _all_ child node entries; instead, a random walk could be used to 
estimate the number of entries. I guess it would be somewhat less accurate than 
the automatic update.

Hardcoding the statistics should be relatively easy to do. I guess this is what 
I will try for now.



> Query: better cost estimates for indexes
> ----------------------------------------
>
>                 Key: OAK-894
>                 URL: https://issues.apache.org/jira/browse/OAK-894
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Thomas Mueller
>            Assignee: Thomas Mueller
>             Fix For: 0.13
>
>
> Currently, the cost estimation for the PropertyIndex is somewhat limited, so 
> that the estimated cost for the index on "jcr:primaryType", with filter 
> "jcr:primaryType is not null" is relatively low. However, there are many 
> nodes with this property (well, _most_ nodes have a primary type). Because of 
> that, this index is used for the following query, which tests the existence 
> of a child node:
> {code}
> -- xpath:
> /jcr:root/home/users//*[a/b/c/@jcr:primaryType]
> -- sql2:
> select [jcr:path], [jcr:score], * from [nt:base] as a 
> where [a/b/c/jcr:primaryType] is not null 
> and isdescendantnode(a, '/home/users')
> {code}
> With SQL-2, the query could be written as an inner join, but I guess the 
> query would end up rather complex (4 selectors as far as I see).
> It would be nice if the index would return a better cost estimate. There are 
> multiple ways to do that:
> * Automatically update the index statistics when modifying the index
> * Update the statistics in a background thread
> * Hardcode the statistics in the index (when declaring the index)
> With statistics I mean: number of entries in total within the index, and 
> selectivity (or number of distinct index keys), and potentially number of 
> entries for each index key.
> To automatically update the statistics at runtime, the following algorithms 
> could be used: whenever adding an entry to the index, increment a counter; 
> when removing an entry, decrement. This would result in many write operations 
> and have a high risk of conflicts. Slightly improved: when adding/removing an 
> entry, and if Math.random()<0.001, increment/decrement the counter by 1000 
> (which is a good enough accuracy for our case). This would improve 
> performance and reduce the risk of conflicts; but there would still be a risk.
> Updating the statistics in a background thread would require a background 
> thread, but it could be done rather efficiently. We wouldn't need to read and 
> count _all_ child node entries; instead, a random walk could be used to 
> estimate the number of entries. I guess it would be somewhat less accurate 
> than the automatic update.
> Hardcoding the statistics should be relatively easy to do. I guess this is 
> what I will try for now.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to