Bruno Roustant created LUCENE-8836:
--------------------------------------

             Summary: Optimize DocValues TermsDict to continue scanning from 
the last position when possible
                 Key: LUCENE-8836
                 URL: https://issues.apache.org/jira/browse/LUCENE-8836
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Bruno Roustant


Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
term ordinal.

Currently it lacks an optimization that FSTEnum has: the ability to continue 
a sequential scan from where the last lookup ended in the IndexInput. 
For sparse lookups (when searching for only a few terms or ordinals) this is 
not an issue. But for multiple lookups in a row, this optimization could avoid 
re-scanning all the terms from the block start (since they are delta encoded).
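
To illustrate the idea, here is a minimal, self-contained sketch in Java. It 
is not the TermsDict code and not the patch: the class PrefixCodedTerms, the 
block size of 4, and the blockSeeks/termReads counters are all hypothetical 
names invented for this example. It only shows why resuming from the last 
position helps when terms inside a block are stored as deltas against the 
previous term.

```java
import java.util.List;

/** Hypothetical sketch of a delta-encoded terms block; not Lucene's TermsDict. */
final class PrefixCodedTerms {
  private static final int BLOCK_SIZE = 4; // assumed number of terms per block

  private final int[] prefixLengths; // chars shared with the previous term
  private final String[] suffixes;   // remaining chars of each term
  private final boolean resumeScan;  // true = continue from the last position

  private int currentOrd = -1;       // ordinal of the last decoded term
  private String currentTerm = null; // last decoded term
  int blockSeeks = 0;                // times we jumped back to a block start
  int termReads = 0;                 // terms decoded from the encoded data

  PrefixCodedTerms(List<String> sortedTerms, boolean resumeScan) {
    int n = sortedTerms.size();
    prefixLengths = new int[n];
    suffixes = new String[n];
    for (int i = 0; i < n; i++) {
      String term = sortedTerms.get(i);
      int shared = 0;
      if (i % BLOCK_SIZE != 0) { // the first term of a block is stored whole
        String prev = sortedTerms.get(i - 1);
        int max = Math.min(prev.length(), term.length());
        while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
          shared++;
        }
      }
      prefixLengths[i] = shared;
      suffixes[i] = term.substring(shared);
    }
    this.resumeScan = resumeScan;
  }

  /** Returns the term at the given ordinal, decoding as few terms as possible. */
  String lookupOrd(int ord) {
    int blockStart = (ord / BLOCK_SIZE) * BLOCK_SIZE;
    // Resume only if the last position is in the same block, at or before ord.
    boolean canResume = resumeScan && currentOrd >= blockStart && currentOrd <= ord;
    if (!canResume) {
      blockSeeks++;
      currentOrd = blockStart - 1;
      currentTerm = "";
    }
    while (currentOrd < ord) {
      currentOrd++;
      // Rebuild the term from the previous term's prefix plus the stored suffix.
      currentTerm = currentTerm.substring(0, prefixLengths[currentOrd]) + suffixes[currentOrd];
      termReads++;
    }
    return currentTerm;
  }
}
```

In this toy model, looking up ordinals 0 to 7 in order over 8 terms costs 2 
block seeks and 8 term reads when resuming, versus 8 block seeks and 20 term 
reads when every lookup restarts from the block start. These numbers describe 
the sketch only, not the measurements reported in this issue.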

This patch proposes the optimization.

To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
term reads in the IndexInput, with and without the optimization:

TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term 
reads.
TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 
82% term reads.

In some cases, such as scanning many terms in lexicographical order, the 
optimization saves a lot. In other cases, such as looking up only a few sparse 
terms, the optimization brings no improvement, but it does not add a 
meaningful penalty either. It seems worthwhile to always enable it.


