Hi,

On Mon, Jun 23, 2014 at 11:18 AM, Thomas Mueller <[email protected]> wrote:
>>Sure, but we don't use a covered index.
>
> Yes, we are not there yet. The node is currently loaded to check access
> rights, but that's an implementation detail of access control part. And
> it's not needed for the admin. If (when) this is changed, it depends on
> the query. If the query only returns the path (for example), and the
> indexed property, then the node does not need to be loaded.

It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

>> At least in the current design
>>the query engine will always load all the matching nodes, regardless
>>of any extra information stored in the index. Thus we can't use the
>>performance of the index lookup as an accurate estimate of the overall
>>query performance.
>
> Yes, the query engine knows the cost overhead per entry, that's why the
> AdvancedQueryIndex interface is a bit different (does not just contain the
> cost).

Are there any potential indexes where the
AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
used in [2]) should return a value that's different from 1?

>>The overhead of the index lookup is probably significant when only few
>>matching paths are returned (the UUID index would be the ultimate
>>example), but in those cases query performance is probably best
>>optimized in other ways (caching, configuration, profiling, etc.) than
>>making a more accurate estimate of the index lookup performance,
>>especially since in most such cases there is only a single index with
>>exact matches.
>
> We have queries that have multiple property constraints (for example "size
> = 'L' and color = 'red'"). If there are two indexes, one for size and one
> for color, it's important to know which one is faster. (If there is a
> multi-property index, that one might be faster.)

Say I have 10k nodes distributed uniformly across two sizes and ten
colors. Querying for "size=L and color=red" would return 5000 paths
when using the size index, or 1000 paths when using the color index (a
multi-property index would return only the 500 exact matches). If the
size index is really fast and returns those 5000 paths in just 10ms
whereas the color index takes 100ms to return the 1000 matching paths,
it'll still be much slower to use the size index for the query if
accessing a single node takes say 1ms on average.

The index-level entry cost estimates only become relevant when the
cost of returning a path is more than a fraction of the cost of
loading a node. I don't believe that's the case for any reasonable
index implementations.

>>Agreed, but we should be making guesses about the overall query
>>performance, not just the index lookup time.
>
> Yes, but the index implementation doesn't know (can't know) the cost of
> the query engine. The query engine needs to calculate its own cost.

Agreed. I'm just worried about potential confusion about what the
getCostPerEntry() method (as used in [2]) should return. The value is
currently only set in [3], but there the estimate seems to be based on
the relative performance of the *index lookup*, not the overall
performance of a query. I believe either [2] or [3] should be adjusted
to fix the cost calculation.

[1] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L628
[2] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L810
[3] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/OrderedPropertyIndex.java#L73

BR,

Jukka Zitting

Reply via email to