[
https://issues.apache.org/jira/browse/LUCENE-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074818#comment-16074818
]
Murali Krishna P commented on LUCENE-7897:
------------------------------------------
bq. Would it be more intuitive if IndexOrDocValuesQuery returned
indexScorerSupplier.cost() directly?
Definitely, that is more sensible. It is very unlikely that the doc-values cost
will be lower than the points estimate. If the doc-values query returned a
smaller cost than the docfreq (from the term scorer), it would have used points
anyway, as per the code.
bq. arbitrary penalty for doc-values queries
Theoretically, yes. The problem is that we currently ignore the doc-values cost
and compare the cost of the original scorer with that of points. So if the
original term had 1M matches and the points estimate is even 1M+1, we would end
up with doc values. That is why I was suggesting reducing the cost of points.
Maybe we could refactor this if we can pass the "#matchingdocs or minScore" to
the place where we decide the scorer.
{noformat}
public Scorer get(boolean randomAccess) throws IOException {
  return (randomAccess ? dvScorerSupplier : indexScorerSupplier).get(randomAccess);
}
{noformat}
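To make the 1M vs. 1M+1 case concrete, here is a minimal, self-contained sketch of the decision as described in this comment. It is not the Lucene code: the class, enum, and method names are hypothetical, the rule "use doc values when the points estimate exceeds the cheapest clause cost" is a simplification of the behaviour discussed here, and the discount factor is an assumed knob that only illustrates the "reduce the cost of points" suggestion.

```java
public class ScorerChoiceSketch {

    /** Which implementation the range clause would end up using. */
    public enum Choice { POINTS, DOC_VALUES }

    /**
     * Simplified model (an assumption, not Lucene's code): the range clause is
     * asked for a random-access scorer, i.e. doc values, when its points
     * estimate is strictly greater than the cheapest clause cost.
     */
    public static Choice choose(long termDocFreq, long pointsEstimate) {
        long leadCost = Math.min(termDocFreq, pointsEstimate);
        boolean randomAccess = pointsEstimate > leadCost;
        return randomAccess ? Choice.DOC_VALUES : Choice.POINTS;
    }

    /**
     * Same model, but the points estimate is divided by a hypothetical
     * discount factor before the comparison, so doc values are picked only
     * when the term is much more selective than the range.
     */
    public static Choice chooseWithDiscount(long termDocFreq, long pointsEstimate, long discount) {
        long discountedPoints = pointsEstimate / discount;
        return choose(termDocFreq, discountedPoints);
    }

    public static void main(String[] args) {
        // Term matches 1M docs, points estimate is 1M+1: doc values win by one doc.
        System.out.println(choose(1_000_000L, 1_000_001L));
        // Equal costs (the "message:*" case): points are kept.
        System.out.println(choose(1_000_000L, 1_000_000L));
        // With an assumed discount of 8, the 1M vs. 1M+1 case now picks points.
        System.out.println(chooseWithDiscount(1_000_000L, 1_000_001L, 8L));
        // A selective term (1000 docs) still gets doc values.
        System.out.println(chooseWithDiscount(1_000L, 1_000_001L, 8L));
    }
}
```

The value 8 is arbitrary; any real threshold would have to come from benchmarks such as the 60ms vs. 27ms measurement in the issue description.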
> RangeQuery optimization in IndexOrDocValuesQuery
> -------------------------------------------------
>
> Key: LUCENE-7897
> URL: https://issues.apache.org/jira/browse/LUCENE-7897
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: trunk, 7.0
> Reporter: Murali Krishna P
>
> For range queries, Lucene uses either Points or Docvalues based on cost
> estimation
> (https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/IndexOrDocValuesQuery.html).
> Scorer is chosen based on the minCost here:
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java#L16
> However, the cost calculations for TermQuery and IndexOrDocValuesQuery seem
> to carry the same weight. Essentially, cost depends on the docfreq in the
> term dictionary, the number of points visited and the number of doc values.
> In a situation where the docfreq is not very restrictive, this means a lot of
> doc-values lookups, and using points would have been better.
> The following query, with 1M matches, takes 60ms with doc values but only
> 27ms with points. If I change the query to "message:*", which matches all
> docs, it chooses points (since the cost is the same), but with message:xyz it
> chooses doc values even though the doc frequency is 1 million, which results
> in many doc-values fetches. Would it make sense to make the cost of the
> doc-values query higher, or to use points if the docfreq is too high for the
> term query (find an optimum threshold where points cost < doc-values cost)?
> {noformat}
> {
>   "query": {
>     "bool": {
>       "must": [
>         {
>           "query_string": {
>             "query": "message:xyz"
>           }
>         },
>         {
>           "range": {
>             "@timestamp": {
>               "gte": 1498652400000,
>               "lte": 1498905000000,
>               "format": "epoch_millis"
>             }
>           }
>         }
>       ]
>     }
>   }
> }
> {noformat}