Hi, On Mon, Jun 23, 2014 at 3:30 AM, Thomas Mueller <[email protected]> wrote: >>Right. I don't believe the cost of the index lookup is significant (at >>least in the asymptotic sense) compared to the overall cost of >>executing a query. > > Sorry, I don't understand. The cost of the index lookup *is* significant > of course, specially if the nodes are not in the cache (which is very > common). In which case, index lookup can contribute to about 50% of the > overall cost of a query (if, let's assume, one disk read is needed for the > index lookup, and one disk read is needed to read the actual node).
The problem with that assumption is that typically a single disk read to the index would return n paths, whereas loading those n nodes might well take n more disk reads. Similarly for the Solr case you mention: There is some overhead in the network roundtrip, but the search server will typically reply with the list of all matching paths in a single roundtrip, whereas then loading those matching nodes will (at least with our current code) require O(n) network roundtrips or disk reads. > For a covering index - > http://stackoverflow.com/questions/62137/what-is-a-covered-index - the > index lookup is nearly 100% of the query cost. Sure, but we don't use a covered index. At least in the current design the query engine will always load all the matching nodes, regardless of any extra information stored in the index. Thus we can't use the performance of the index lookup as an accurate estimate of the overall query performance. >> in the asymptotic sense > > Sorry could you translate that for me please? :-) In the Big-O sense. The overhead of the index lookup is probably significant when only few matching paths are returned (the UUID index would be the ultimate example), but in those cases query performance is probably best optimized in other ways (caching, configuration, profiling, etc.) than making a more accurate estimate of the index lookup performance, especially since in most such cases there is only a single index with exact matches. When the number of matching paths becomes large, say in the 1k-1M range, the relative performance overhead of different query indexes becomes less of an issue. For example, say index A returns 10k matching paths in 100ms and index B returns 20k less accurately matching paths in just 10ms. If accessing each node takes 1ms on average, the significantly faster performance of index B is totally irrelevant to the overall time it takes to execute the full query. > Trying to make an informed guess about the cost is the whole point of a > cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization Agreed, but we should be making guesses about the overall query performance, not just the index lookup time. BR, Jukka Zitting
