[
https://issues.apache.org/jira/browse/LUCENE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404750#comment-16404750
]
Amir Hadadi edited comment on LUCENE-8213 at 3/19/18 12:46 PM:
---------------------------------------------------------------
[~rcmuir] the issue is that the execution path for (q1 AND q2) depends on
whether q2 gets cached or not.
When q2 does not get cached, doc values are used to execute q2 and only the
single document matching q1 is evaluated against the range.
When q2 gets cached, it gets cached as a query that stands by itself, i.e. not
in the context of (q1 AND q2).
So all 10M documents that q2 matches are scanned in the BKD tree and
cached in a bit set.
To protect against the caching of q2 causing the latency of (q1 AND q2) to be
too high, [~jpountz] added maxCostFactor.
This factor bounds how much more expensive caching q2 may be than evaluating
(q1 AND q2): once the ratio of the two costs reaches maxCostFactor, q2 is not
cached and the uncached execution path is used instead.
This is the relevant code from LRUQueryCache:
{code:java}
double costFactor = (double) inSupplier.cost() / leadCost;
if (costFactor >= maxCostFactor) {
  // too costly, caching might make the query much slower
  return inSupplier.get(leadCost);
}
{code}
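To make the arithmetic of that check concrete, here is a small standalone sketch. It is not Lucene code: the class and method names are hypothetical, and the maxCostFactor value of 250 is only an assumed figure for illustration. It plugs in the numbers from this issue (q2 matching 10M documents, q1 matching a single document).

```java
/** Standalone illustration of the cost-factor check; not part of Lucene. */
final class CostFactorSketch {

    /**
     * Same comparison as the quoted snippet: returns true when caching
     * should be skipped and the uncached path used instead.
     */
    static boolean tooCostlyToCache(long cacheCost, long leadCost, double maxCostFactor) {
        double costFactor = (double) cacheCost / leadCost;
        return costFactor >= maxCostFactor;
    }

    public static void main(String[] args) {
        double maxCostFactor = 250; // assumed value, for illustration only

        // q2 matches 10M docs, q1 (the lead) matches 1 doc: factor = 10,000,000
        System.out.println(tooCostlyToCache(10_000_000L, 1L, maxCostFactor));

        // q2 matches 10M docs, the lead matches 5M docs: factor = 2
        System.out.println(tooCostlyToCache(10_000_000L, 5_000_000L, maxCostFactor));
    }
}
```

With a very selective lead the factor is huge and caching is skipped; with a non-selective lead the factor is small and q2 is cached on the querying thread.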
My suggestion is to always evaluate (q1 AND q2) using the optimal execution
path, and cache q2 asynchronously.
A refinement is to cache q2 synchronously if the cost of caching it is not too
high.
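A rough sketch of what this could look like follows. It is a toy model under stated assumptions, not a patch: AsyncCachingSketch, getOrSchedule, and the string cache key are all hypothetical names that do not exist in Lucene, and a plain BitSet stands in for the cached doc id set. On a miss it returns null, meaning the caller should run its optimal uncached path, and the expensive scan is handed to a separate executor.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Toy model of the proposal; none of these names exist in Lucene. */
final class AsyncCachingSketch {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    private final Executor cachingExecutor; // dedicated pool for cache population
    private final double maxCostFactor;

    AsyncCachingSketch(Executor cachingExecutor, double maxCostFactor) {
        this.cachingExecutor = cachingExecutor;
        this.maxCostFactor = maxCostFactor;
    }

    /**
     * Returns the cached bit set, or null on a miss that was scheduled for
     * background caching. A null result means: evaluate (q1 AND q2) with the
     * optimal uncached path now.
     */
    BitSet getOrSchedule(String key, long cacheCost, long leadCost, Supplier<BitSet> fullScan) {
        BitSet cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        double costFactor = (double) cacheCost / leadCost;
        if (costFactor < maxCostFactor) {
            // Refinement: cheap enough, so cache synchronously on the querying thread.
            return cache.computeIfAbsent(key, k -> fullScan.get());
        }
        // Too costly for the querying thread: populate the cache elsewhere.
        cachingExecutor.execute(() -> cache.computeIfAbsent(key, k -> fullScan.get()));
        return null;
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AsyncCachingSketch sketch = new AsyncCachingSketch(pool, 250);
        Supplier<BitSet> scan = () -> {
            BitSet bits = new BitSet();
            bits.set(0, 1000); // stand-in for scanning the BKD tree
            return bits;
        };

        BitSet first = sketch.getOrSchedule("q2", 10_000_000L, 1L, scan);
        System.out.println("first call served from cache: " + (first != null));

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("cached after background task: " + sketch.isCached("q2"));
    }
}
```

The point of the design is that the querying thread never pays for the full scan when the lead is selective, while later queries can still benefit from the cache entry once the background task has finished.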
> offload caching to a dedicated threadpool
> -----------------------------------------
>
> Key: LUCENE-8213
> URL: https://issues.apache.org/jira/browse/LUCENE-8213
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/query/scoring
> Affects Versions: 7.2.1
> Reporter: Amir Hadadi
> Priority: Minor
> Labels: performance
>
> IndexOrDocValuesQuery allows combining non-selective range queries with a
> selective lead iterator in an optimized way. However, the range query at some
> point gets cached by a querying thread in LRUQueryCache, which negates the
> optimization of IndexOrDocValuesQuery for that specific query.
> It would be nice to see a caching implementation that offloads to a different
> thread pool, so that queries involving IndexOrDocValuesQuery would have
> consistent performance characteristics.