[jira] [Commented] (LUCENE-7897) RangeQuery optimization in IndexOrDocValuesQuery

Murali Krishna P (JIRA) Wed, 05 Jul 2017 00:35:27 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074350#comment-16074350
 ]


Murali Krishna P commented on LUCENE-7897:
------------------------------------------

Adrien, the range query matches only 13% of the docs here, so most likely the 
negative range query won't kick in.

I agree it is going to be hard to figure out the threshold. I am trying to make 
sense of the cost calculation in IndexOrDocValuesQuery and 
Boolean2ScorerSupplier. To select points or docvalues, there are 3 costs being 
considered:
1. TermQuery cost -> docfreq (from other scorers)
2. PointsQuery cost -> estimatePointcount
3. DocvalueQuery cost -> maxDoc

Costs 2 and 3 belong to IndexOrDocValuesQuery, which returns the min of the two 
as its cost. But the choice between points and docvalues is not based on this 
cost. It is based on the minCost across all scorers: if the cost of 
IndexOrDocValuesQuery > minCost, it chooses docvalues. This is a bit 
counter-intuitive to me; I was expecting IndexOrDocValuesQuery to take a hint 
about the number of matches from the other scorers and calculate its cost 
accordingly. It seems that this happens in the scorer supplier instead, by 
comparing with minCost.
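
To make the selection rule described above concrete, here is a minimal sketch 
of it in Java. This is not the Lucene source: the class and method names are 
hypothetical, only a single term scorer is modelled, and the maxDoc of 10M used 
in main is an assumed figure (the 13% match rate and the 1M docfreq come from 
this issue).

```java
/** Sketch of the selection rule as described in this comment; not Lucene code. */
final class ScorerChoiceSketch {
    enum Choice { POINTS, DOC_VALUES }

    /** Cost reported by the combined query: the min of its two sub-costs. */
    static long reportedCost(long pointsCost, long docValuesCost) {
        return Math.min(pointsCost, docValuesCost);
    }

    /**
     * Docvalues are chosen when the reported cost is greater than the
     * minimum cost across all scorers (here: this query and one term query).
     */
    static Choice choose(long pointsCost, long docValuesCost, long termDocFreq) {
        long own = reportedCost(pointsCost, docValuesCost);
        long minCost = Math.min(own, termDocFreq);
        return own > minCost ? Choice.DOC_VALUES : Choice.POINTS;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;              // assumed index size
        long pointEstimate = maxDoc * 13 / 100; // range matches 13% of docs
        // Term with docfreq of 1M, as in message:xyz
        System.out.println(choose(pointEstimate, maxDoc, 1_000_000L));
        // Term matching every doc, as in message:*
        System.out.println(choose(pointEstimate, maxDoc, maxDoc));
    }
}
```

Because the docvalues cost is maxDoc, the reported cost is normally the point 
estimate, so the rule reduces to comparing docfreq against the point estimate.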

Here is a proposal based on my understanding. Consider a situation of fetching 
N docs via IndexOrDocValuesQuery:
1. Points: would cost estimatePointcount/1024. This accounts for the cost of 
reads (1024 is the number of docids in a point block); we probably need to 
factor in the cost of sorting the docids across multiple point splits as well.
2. Docvalues: N (assumes 1 read for each doc from the columnar store). Given 
the various encodings and the sequential reads, N may not be the right estimate 
though (thoughts?). Currently this cost from DocValueProducer seems to be maxDoc 
(or the number of values if the field is sparse) for the entry, irrespective of 
how many docs we are actually fetching. But this cost is probably not taken 
into account, since in most cases the current condition translates to 
"docfreq < pointEstimate ? docvalues : points".
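
The two proposed costs can be sketched as follows. This is only an 
illustration of the proposal, with hypothetical class and method names; 
rounding the block count up is my own assumption, and the cost of sorting 
docids is left out.

```java
/** Sketch of the proposed cost model; names are hypothetical, not Lucene API. */
final class ProposedCostSketch {
    /** Number of docids per point block, as stated in the proposal. */
    static final long DOCS_PER_POINT_BLOCK = 1024;

    /** Points cost: estimated point count expressed in block reads (rounded up). */
    static long pointsCost(long estimatedPointCount) {
        return (estimatedPointCount + DOCS_PER_POINT_BLOCK - 1) / DOCS_PER_POINT_BLOCK;
    }

    /** Docvalues cost: one read per doc that has to be checked. */
    static long docValuesCost(long docsToFetch) {
        return docsToFetch;
    }

    /** Prefer points when their block-read cost does not exceed the docvalues cost. */
    static boolean preferPoints(long estimatedPointCount, long docsToFetch) {
        return pointsCost(estimatedPointCount) <= docValuesCost(docsToFetch);
    }

    public static void main(String[] args) {
        // 1.3M estimated points, 1M docs coming from the term query
        System.out.println(preferPoints(1_300_000L, 1_000_000L));
        // Same range, but a very selective term query
        System.out.println(preferPoints(1_300_000L, 1_000L));
    }
}
```

Under this model points win whenever N is larger than the number of point 
blocks, which is a far lower bar than the current comparison against the point 
estimate.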

Let me know whether this approach of reducing the cost of points in 
IndexOrDocValuesQuery makes sense. I know we might now end up with the wrong 
decision in the other direction. We could probably benchmark this by changing 
the queries so that docfreq matches different percentages of the points.

> RangeQuery optimization in IndexOrDocValuesQuery 
> -------------------------------------------------
>
>                 Key: LUCENE-7897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7897
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: trunk, 7.0
>            Reporter: Murali Krishna P
>
> For range queries, Lucene uses either Points or Docvalues based on cost 
> estimation 
> (https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/IndexOrDocValuesQuery.html).
>  Scorer is chosen based on the minCost here: 
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java#L16
> However, the cost calculations for TermQuery and IndexOrDocValuesQuery seem 
> to be given the same weight. Essentially, cost depends upon the docfreq in 
> TermDict, the number of points visited and the number of docvalues. In a 
> situation where docfreq is not too restrictive, this is a lot of lookups for 
> docvalues and using points would have been better.
> The following query, with 1M matches, takes 60ms with docvalues but only 27ms 
> with points. If I change the query to "message:*", which matches all docs, it 
> chooses points (since the cost is the same), but with message:xyz it chooses 
> docvalues even though the doc frequency is 1 million, which results in many 
> docvalue fetches. Would it make sense to change the cost of the docvalues 
> query to be higher, or to use points if the docfreq is too high for the term 
> query (find an optimum threshold where points cost < docvalue cost)?
> {noformat}
> {
>   "query": {
>     "bool": {
>       "must": [
>         {
>           "query_string": {
>             "query": "message:xyz"
>           }
>         },
>         {
>           "range": {
>             "@timestamp": {
>               "gte": 1498652400000,
>               "lte": 1498905000000,
>               "format": "epoch_millis"
>             }
>           }
>         }
>       ]
>     }
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
