Re: [DISCUSS] - QueryIndex selection

Thomas Mueller Mon, 23 Jun 2014 11:03:47 -0700

Hi,

>It's more than access control. The query engine needs to double-check
>the constraints of the query for each matching path before passing
>that node to the client (see the constraint.evaluate() call in [1]). I
>don't see any easy way to avoid that step without major refactoring.


If there is no other constraint, then no additional checks are needed.

>Are there any potential indexes where the
>AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
>used in [2]) should return a value that's different from 1?

Yes. For example, for an index that keeps all (relevant) entries in memory,
the cost should be close to zero.

>Say I have 10k nodes distributed uniformly across two sizes and ten
>colors. Querying for "size=L and color=red" would return 5000 paths
>when using the size index, or 1000 paths when using the color index (a
>multi-property index would return only the 500 exact matches). If the
>size index is really fast and returns those 5000 paths in just 10ms
>whereas the color index takes 100ms to return the 1000 matching paths,
>it'll still be much slower to use the size index for the query if
>accessing a single node takes say 1ms on average.

Yes. The AdvancedQueryIndex has separate methods so that the query engine
can better calculate the cost (getCostPerExecution, getCostPerEntry,
getEstimatedEntryCount). The query could contain a limit (let's say 100)
which should also be taken into account, and possibly an "order by"
restriction. Plus the query engine could take into account that typically,
only the first 50 entries are read (optimize for "fast first 50 entries" -
see also
http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-number-rows-fastfirstrow ).
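
To illustrate how a limit could enter the estimate, here is a minimal sketch. The class name, the method, and the formula itself are hypothetical and are not Oak's actual implementation; the three inputs only mirror what getCostPerExecution, getCostPerEntry and getEstimatedEntryCount would report. It also assumes the index can return entries in the requested order, so it ignores the "order by" case.

```java
// Hypothetical sketch, not Oak's real cost formula.
final class LimitAwareCost {

    // Assumed model: a fixed cost per execution plus a per-entry cost
    // for each entry actually read, capped by the query limit.
    static double estimate(double costPerExecution, double costPerEntry,
                           long estimatedEntryCount, long limit) {
        long entriesRead = Math.min(estimatedEntryCount, limit);
        return costPerExecution + costPerEntry * entriesRead;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 5000 estimated entries, no limit.
        System.out.println(estimate(10.0, 1.0, 5000, Long.MAX_VALUE));
        // Same index, but the query has "limit 100".
        System.out.println(estimate(10.0, 1.0, 5000, 100));
    }
}
```

With a limit, the per-entry part of the estimate shrinks while the per-execution part stays the same, so an index with a low per-execution cost becomes relatively more attractive.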

>The index-level entry cost estimates only become relevant when the
>cost of returning a path is more than a fraction of the cost of
>loading a node. I don't believe that's the case for any reasonable
>index implementations.

It depends on whether (and when) the node needs to be loaded.

>I'm just worried about potential confusion about what the
>getCostPerEntry() method (as used in [2]) should return. The value is
>currently only set in [3], but there the estimate seems to be based on
>the relative performance of the *index lookup*, not the overall
>performance of a query. I believe either [2] or [3] should be adjusted
>to fix the cost calculation.

Yes, you are right. Currently the formula assumes that the query engine
doesn't load the node. That's not correct. I created OAK-1910 to track
this.
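
As a rough sketch of the difference, using the numbers from the size/color example quoted above (10ms for 5000 paths, 100ms for 1000 paths, 1ms per node load): the class and method names below are made up for illustration, and the formula is an assumption, not what OAK-1910 will necessarily end up with.

```java
// Hypothetical sketch; names and formula are assumptions for illustration.
final class NodeLoadCost {

    // Cost if the query engine never loads the node: only the index lookup counts.
    static double lookupOnly(double lookupMillis) {
        return lookupMillis;
    }

    // Cost if the query engine loads every node the index returns.
    static double withNodeLoad(double lookupMillis, long paths, double nodeLoadMillis) {
        return lookupMillis + paths * nodeLoadMillis;
    }

    public static void main(String[] args) {
        double sizeLookup = 10.0;   // size index: 5000 paths in 10ms
        double colorLookup = 100.0; // color index: 1000 paths in 100ms

        // Ignoring node loads, the size index looks cheaper.
        System.out.println("lookup only: size=" + lookupOnly(sizeLookup)
                + " color=" + lookupOnly(colorLookup));

        // Including a 1ms node load per path, the color index is cheaper.
        System.out.println("with node load: size=" + withNodeLoad(sizeLookup, 5000, 1.0)
                + " color=" + withNodeLoad(colorLookup, 1000, 1.0));
    }
}
```

The ranking of the two indexes flips once the node load is counted, which is why the formula needs to account for it.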

Regards,
Thomas
