Hi,

On Mon, Jun 23, 2014 at 3:30 AM, Thomas Mueller <[email protected]> wrote:
>> Right. I don't believe the cost of the index lookup is significant (at
>> least in the asymptotic sense) compared to the overall cost of
>> executing a query.
>
> Sorry, I don't understand. The cost of the index lookup *is* significant
> of course, especially if the nodes are not in the cache (which is very
> common). In which case, index lookup can contribute to about 50% of the
> overall cost of a query (if, let's assume, one disk read is needed for the
> index lookup, and one disk read is needed to read the actual node).

The problem with that assumption is that typically a single disk read
to the index would return n paths, whereas loading those n nodes might
well take n more disk reads.

Similarly for the Solr case you mention: there is some overhead in the
network roundtrip, but the search server will typically return the
list of all matching paths in a single response, while loading
those matching nodes will (at least with our current code) require
O(n) network roundtrips or disk reads.
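To put the two points above into a rough sketch: one index lookup
returns all n matching paths at once, but loading those n nodes then
costs O(n) further disk reads or network roundtrips. The function
name and numbers below are illustrative assumptions, not Oak
internals:

```python
# Rough cost model: total reads/roundtrips for a query is the index
# lookup (usually a small fixed number of reads) plus one read per
# matching node that needs to be loaded.

def query_reads(matching_paths, index_reads=1):
    """Total reads/roundtrips = index lookup + one per matching node."""
    return index_reads + matching_paths

# With n = 100 matching paths, the single index read is ~1% of the total:
total = query_reads(100)
print(total)                 # 101
print(round(1 / total, 4))   # index lookup's share, roughly 0.0099
```

So as n grows, the index lookup's share of the total cost shrinks
toward zero, which is why its cost matters less and less.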

> For a covering index -
> http://stackoverflow.com/questions/62137/what-is-a-covered-index - the
> index lookup is nearly 100% of the query cost.

Sure, but we don't use a covered index. At least in the current design
the query engine will always load all the matching nodes, regardless
of any extra information stored in the index. Thus we can't use the
performance of the index lookup as an accurate estimate of the overall
query performance.

>> in the asymptotic sense
>
> Sorry could you translate that for me please? :-)

In the Big-O sense.

The overhead of the index lookup is probably significant when only a
few matching paths are returned (the UUID index would be the ultimate
example), but in those cases query performance is probably better
optimized in other ways (caching, configuration, profiling, etc.) than
by making a more accurate estimate of the index lookup cost,
especially since in most such cases there is only a single index with
exact matches.

When the number of matching paths becomes large, say in the 1k-1M
range, the relative performance overhead of different query indexes
becomes less of an issue. For example, say index A returns 10k
matching paths in 100ms and index B returns 20k less accurately
matching paths in just 10ms. If accessing each node takes 1ms on
average, the full query takes ~10.1s with index A and ~20s with
index B: the 10x faster lookup of index B is irrelevant next to the
cost of loading the matching nodes.
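A back-of-the-envelope check of those numbers (the 1ms per-node cost
is the assumed average from the example above, not a measurement):

```python
# Overall query time = index lookup time + time to load every
# matching node. Numbers are the illustrative figures from this
# thread, not benchmarks.

PER_NODE_MS = 1.0  # assumed average cost of loading one matching node

def total_query_ms(index_lookup_ms, matching_paths):
    """Index lookup plus one node load per matching path."""
    return index_lookup_ms + matching_paths * PER_NODE_MS

index_a = total_query_ms(100, 10_000)  # slower but more accurate index
index_b = total_query_ms(10, 20_000)   # 10x faster lookup, 2x the paths

print(index_a)  # 10100.0 -> ~10.1 s
print(index_b)  # 20010.0 -> ~20.0 s, despite the much faster lookup
```

Despite its 10x faster lookup, index B loses overall because node
loading dominates, which is exactly why the lookup time alone is a
poor cost estimate.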

> Trying to make an informed guess about the cost is the whole point of a
> cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization

Agreed, but we should be making guesses about the overall query
performance, not just the index lookup time.

BR,

Jukka Zitting
