[
https://issues.apache.org/jira/browse/LUCENE-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7897:
---------------------------------
Attachment: LUCENE-7897.patch
I tried to implement some of your ideas:
- The {{boolean randomAccess}} parameter has been replaced with {{long
leadCost}}, which gives the cost of the scorer that will be used to lead
iteration. This way we can move the decision of whether the query
should run with points or doc values from {{Boolean2ScorerSupplier}} to
{{IndexOrDocValuesQuery}}.
- The cost of {{IndexOrDocValuesQuery}} is now the cost of the wrapped
{{indexQuery}}; the doc values query is ignored.
- I gave doc values queries an arbitrary 8x penalty, meaning that when
intersecting a term query with a range query, we will only use doc values for
the range if it seems to match more than 8x more documents than the term query
(vs. 1x before this patch).
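To make the 8x rule in the last bullet concrete, here is a minimal sketch of the comparison. It is not the code from the patch: the class, method and constant names are hypothetical, and it only assumes that the decision compares the cost of the wrapped index query against the {{leadCost}}.

```java
// Hypothetical sketch; names are illustrative and not taken from the patch.
final class DocValuesChoiceSketch {
    // Arbitrary penalty applied to doc values, as described in the bullet above.
    static final long DOC_VALUES_PENALTY = 8;

    /**
     * @param indexCost estimated number of documents matched by the points-based range query
     * @param leadCost  cost of the scorer that leads iteration (for example the term query)
     * @return true if the range should be checked with doc values instead of points
     */
    static boolean useDocValues(long indexCost, long leadCost) {
        if (leadCost > Long.MAX_VALUE / DOC_VALUES_PENALTY) {
            // The penalized lead cost would overflow a long, so indexCost cannot exceed it.
            return false;
        }
        return indexCost > DOC_VALUES_PENALTY * leadCost;
    }

    public static void main(String[] args) {
        long termDocFreq = 1_000_000L; // the 1M-match term query from the issue
        System.out.println(useDocValues(1_000_000L, termDocFreq));
        System.out.println(useDocValues(9_000_000L, termDocFreq));
    }
}
```

Under this sketch, with the 1M-match term query from the issue leading iteration, the range would need to match more than 8M documents before doc values are preferred over points.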
I had not thought about it before writing this patch, but it should also make
the situation better for disjunctions in conjunctions, since the sparse clauses
will now use sequential access, reducing balancing operations on the priority
queue.
> RangeQuery optimization in IndexOrDocValuesQuery
> -------------------------------------------------
>
> Key: LUCENE-7897
> URL: https://issues.apache.org/jira/browse/LUCENE-7897
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: trunk, 7.0
> Reporter: Murali Krishna P
> Attachments: LUCENE-7897.patch
>
>
> For range queries, Lucene uses either Points or Docvalues based on cost
> estimation
> (https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/IndexOrDocValuesQuery.html).
> Scorer is chosen based on the minCost here:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java#L16
> However, the cost calculations for TermQuery and IndexOrDocValuesQuery seem to
> carry the same weight. Essentially, cost depends upon the docfreq in TermDict,
> the number of points visited and the number of doc values. In a situation where
> docfreq is not too restrictive, this means a lot of lookups for doc values, and
> using points would have been better.
> The following query, with 1M matches, takes 60ms with doc values but only 27ms
> with points. If I change the query to "message:*", which matches all docs, it
> chooses points (since the cost is the same), but with message:xyz it chooses
> doc values even though the doc frequency is 1 million, which results in many
> doc values fetches. Would it make sense to make the cost of the doc values
> query higher, or to use points if the docfreq is too high for the term query
> (find an optimum threshold where points cost < doc values cost)?
> {noformat}
> {
>   "query": {
>     "bool": {
>       "must": [
>         {
>           "query_string": {
>             "query": "message:xyz"
>           }
>         },
>         {
>           "range": {
>             "@timestamp": {
>               "gte": 1498652400000,
>               "lte": 1498905000000,
>               "format": "epoch_millis"
>             }
>           }
>         }
>       ]
>     }
>   }
> }
> {noformat}