[ 
https://issues.apache.org/jira/browse/LUCENE-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074487#comment-16074487
 ] 

Adrien Grand commented on LUCENE-7897:
--------------------------------------

Thanks for checking how many documents match the range query. Your 
understanding of the way things are working today is correct.

bq. But the choice of points or docvalues is not made based on this cost. It 
considers the minCost across all scorers to decide that. If the cost of 
IndexOrDocValuesQuery > minCost, it chooses docvalues. This is a bit 
counter-intuitive for me, I was thinking IndexOrDocValuesQuery would take a 
hint of the matches from other scorers and calculate the cost accordingly. It 
seems like that happens in the scorer supplier by comparing with minScore.

Would it be more intuitive if {{IndexOrDocValuesQuery}} returned 
{{indexScorerSupplier.cost()}} directly? This is what should happen in practice 
anyway. Taking the min only helps when {{estimatePointCount}} returns an 
approximation that is greater than the number of docs that have a value in the 
index, but we could easily remove it and it should not hurt.
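
To make the difference concrete, here is a minimal sketch in plain Java. It is 
not Lucene code: the class, the method names and the numbers are all made up 
for illustration, and it assumes the doc-values side reports roughly the number 
of docs that have a value.

```java
final class CostChoiceSketch {
    // Hypothetical stand-in for the behaviour described above:
    // report the smaller of the two estimates.
    static long minCost(long indexCost, long docValuesCost) {
        return Math.min(indexCost, docValuesCost);
    }

    // Hypothetical stand-in for the proposal: report the index-side
    // estimate directly and ignore the doc-values side.
    static long indexCostOnly(long indexCost) {
        return indexCost;
    }

    public static void main(String[] args) {
        // Assumed numbers: the point-count estimate (1200) overshoots
        // the number of docs that actually have a value (1000).
        long indexCost = 1200;
        long docValuesCost = 1000;
        System.out.println("min of both: " + minCost(indexCost, docValuesCost));
        System.out.println("index only:  " + indexCostOnly(indexCost));

        // When the estimate does not overshoot, both variants agree.
        System.out.println("agree: " + (minCost(800, 1000) == indexCostOnly(800)));
    }
}
```

The two variants only differ in the overshoot case, which is why dropping the 
min should be harmless in practice.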

bq. Points: Would cost estimatePointcount/1024

Right now the cost we are using is an estimate of the number of matches. You 
are right that a more interesting metric would be the cost of building the 
scorer, but as you wrote this becomes more complicated as we need to fold in 
the cost of sorting the documents, etc. I am a bit afraid of opening a can of 
worms if we start doing something like this. However, you have a point that for 
a similar value of the {{cost}}, the index query can be expected to be more 
efficient than the doc-values based query because it can more easily amortize 
the cost of matching across documents. As a first step, maybe it would make 
sense to apply an arbitrary penalty to doc-values queries and only use them if 
we only need to check something like 1/8th of the matching documents? As you 
said, this kind of thing might lead to a wrong decision in the other direction, 
but maybe that is better, since queries that provide good iterators are a safer 
bet when in doubt?
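
As a rough illustration of the penalty idea, here is a minimal sketch in plain 
Java. Nothing in it is existing Lucene API: the class, the method, the 
parameter names and the sample numbers are hypothetical, and the factor of 8 is 
just the arbitrary "1/8th" from above.

```java
final class DocValuesPenaltySketch {
    // Arbitrary penalty: doc values are only considered when the number of
    // docs to verify is at most 1/8th of the range query's estimated matches.
    static final int PENALTY_FACTOR = 8;

    // docsToCheck: assumed number of candidate docs produced by the other clauses.
    // rangeCost:   assumed estimate of how many docs match the range.
    static boolean useDocValues(long docsToCheck, long rangeCost) {
        // Division instead of multiplication avoids overflow on large costs.
        return docsToCheck <= rangeCost / PENALTY_FACTOR;
    }

    public static void main(String[] args) {
        // Hypothetical: the term clause yields 1M candidates, the range matches 2M docs.
        // 1M > 2M / 8, so the index (points) query is preferred.
        System.out.println(useDocValues(1_000_000L, 2_000_000L));

        // Hypothetical: only 10k candidates against the same range.
        // 10k <= 250k, so verifying with doc values is preferred.
        System.out.println(useDocValues(10_000L, 2_000_000L));
    }
}
```

With a rule like this, a term that is not very selective would push the 
decision towards points, at the price of sometimes picking points when doc 
values would have been slightly cheaper.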

> RangeQuery optimization in IndexOrDocValuesQuery 
> -------------------------------------------------
>
>                 Key: LUCENE-7897
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7897
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: trunk, 7.0
>            Reporter: Murali Krishna P
>
> For range queries, Lucene uses either Points or Docvalues based on cost 
> estimation 
> (https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/IndexOrDocValuesQuery.html).
>  Scorer is chosen based on the minCost here: 
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java#L16
> However, the cost calculations for TermQuery and IndexOrDocValuesQuery seem to 
> carry the same weight. Essentially, cost depends upon the docfreq in TermDict, 
> number of points visited and number of docvalues. In a situation where 
> docfreq is not too restrictive, this is a lot of lookups for docvalues and 
> using points would have been better.
> The following query, with 1M matches, takes 60ms with docvalues but only 27ms 
> with points. If I change the query to "message:*", which matches all docs, it 
> chooses points (since the cost is the same), but with message:xyz it chooses 
> docvalues even though the doc frequency is 1 million, which results in many 
> docvalue fetches. Would it make sense to change the cost of the docvalues 
> query to be higher, or to use points if the docfreq is too high for the term 
> query (find an optimum threshold where points cost < docvalue cost)?
> {noformat}
> {
>   "query": {
>     "bool": {
>       "must": [
>         {
>           "query_string": {
>             "query": "message:xyz"
>           }
>         },
>         {
>           "range": {
>             "@timestamp": {
>               "gte": 1498652400000,
>               "lte": 1498905000000,
>               "format": "epoch_millis"
>             }
>           }
>         }
>       ]
>     }
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
