Adrien Grand created LUCENE-7462:
------------------------------------

             Summary: Faster search APIs for doc values
                 Key: LUCENE-7462
                 URL: https://issues.apache.org/jira/browse/LUCENE-7462
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: master (7.0)
            Reporter: Adrien Grand
            Priority: Minor


While the iterator API helps deal with sparse doc values more efficiently, it 
also makes search-time operations more costly. For instance, the old 
random-access API made it possible to compute facets on a given segment without 
any conditionals, by just incrementing the counter at index {{ordinal+1}}, while 
the new API requires advancing the iterator if necessary and then checking 
whether it is exactly on the right document or not.
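
To illustrate the difference, here is a rough sketch of both counting loops. The interfaces ({{RandomAccessOrds}}, {{OrdIterator}}) and the helper {{iteratorOver}} are hypothetical, simplified stand-ins written for this example; they are not the actual Lucene classes, and -1 is assumed to mean "no value".

```java
import java.util.Arrays;

/** Hypothetical, simplified stand-ins; these are not Lucene's real classes. */
final class FacetCountingSketch {

    /** Old style: any document can be looked up; -1 is assumed to mean "no value". */
    interface RandomAccessOrds {
        int getOrd(int docID);
    }

    /** New style: forward-only iterator over the documents that have a value. */
    interface OrdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int docID();
        int advance(int target);
        int ordValue();
    }

    /** Random access: no conditionals, a missing value (-1) lands in slot 0. */
    static int[] countRandomAccess(RandomAccessOrds ords, int[] hits, int ordCount) {
        int[] counts = new int[ordCount + 1];
        for (int doc : hits) {
            counts[ords.getOrd(doc) + 1]++;
        }
        return counts;
    }

    /** Iterator: advance if needed, then check that we landed exactly on the hit. */
    static int[] countIterator(OrdIterator it, int[] hits, int ordCount) {
        int[] counts = new int[ordCount + 1];
        for (int doc : hits) {
            if (it.docID() < doc) {
                it.advance(doc);
            }
            if (it.docID() == doc) {
                counts[it.ordValue() + 1]++;
            } else {
                counts[0]++; // the hit has no value
            }
        }
        return counts;
    }

    /** Test helper: an iterator over an array in which -1 marks a missing value. */
    static OrdIterator iteratorOver(int[] ordByDoc) {
        return new OrdIterator() {
            private int doc = -1;

            public int docID() { return doc; }

            public int advance(int target) {
                int d = target;
                while (d < ordByDoc.length && ordByDoc[d] < 0) {
                    d++;
                }
                doc = d < ordByDoc.length ? d : NO_MORE_DOCS;
                return doc;
            }

            public int ordValue() { return ordByDoc[doc]; }
        };
    }

    public static void main(String[] args) {
        int[] ordByDoc = {2, -1, 0, 2, 1}; // doc 1 has no value
        int[] hits = {0, 1, 3, 4};         // matching docs, in increasing order
        System.out.println(Arrays.toString(countRandomAccess(doc -> ordByDoc[doc], hits, 3)));
        System.out.println(Arrays.toString(countIterator(iteratorOver(ordByDoc), hits, 3)));
    }
}
```

Both loops produce the same counts; the second one just pays for two extra branches per hit.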

Since it is very common for fields to exist across most documents, I suspect 
codecs will keep an internal structure that is similar to the current codec in 
the dense case, by having a dense representation of the data and just making 
the iterator skip over the minority of documents that do not have a value.

I suggest that we add APIs that make things cheaper at search time. For 
instance in the case of SORTED doc values, it could look like 
{{LegacySortedDocValues}} with the additional restriction that documents can 
only be consumed in order. Codecs that can implement this API efficiently would 
hide it behind a {{SortedDocValues}} adapter. Then at search time, facets and 
comparators (which were better served by the {{LegacySortedDocValues}} API) 
would either unwrap the {{SortedDocValues}} they got or hide it behind a more 
random-access API (the latter would only happen in the truly sparse case if the 
codec optimizes the dense case).
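
A minimal sketch of the unwrap-or-hide idea, under stated assumptions: {{OrdIterator}} plays the role of {{SortedDocValues}}, {{InOrderOrds}} plays the role of the proposed in-order lookup API, and {{IteratorAdapter}}, {{asLookup}} and {{sparse}} are names invented for this example. None of them exist in Lucene.

```java
/** Hypothetical, simplified stand-ins; none of these types exist in Lucene. */
final class UnwrapSketch {

    /** Forward-only iterator, playing the role of SortedDocValues. */
    interface OrdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int docID();
        int nextDoc();
        int ordValue();
    }

    /** Lookup API; callers must pass non-decreasing docIDs. -1 means "no value". */
    interface InOrderOrds {
        int getOrd(int docID);
    }

    /** What a dense codec could return: an iterator facade over its lookup structure. */
    static final class IteratorAdapter implements OrdIterator {
        final InOrderOrds in;
        private final int maxDoc;
        private int doc = -1;
        private int ord = -1;

        IteratorAdapter(InOrderOrds in, int maxDoc) {
            this.in = in;
            this.maxDoc = maxDoc;
        }

        public int docID() { return doc; }

        public int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc;
            }
            // Skip over the minority of documents that have no value.
            for (int d = doc + 1; d < maxDoc; d++) {
                int o = in.getOrd(d);
                if (o >= 0) {
                    doc = d;
                    ord = o;
                    return doc;
                }
            }
            doc = NO_MORE_DOCS;
            return doc;
        }

        public int ordValue() { return ord; }
    }

    /** Search-time helper: unwrap in the dense case, hide the iterator otherwise. */
    static InOrderOrds asLookup(OrdIterator it) {
        if (it instanceof IteratorAdapter) {
            return ((IteratorAdapter) it).in; // dense case: no per-doc positioning
        }
        return docID -> {                     // sparse case: position, then check
            int d = it.docID();
            while (d < docID) {
                d = it.nextDoc();
            }
            return d == docID ? it.ordValue() : -1;
        };
    }

    /** Test helper: a truly sparse iterator over parallel (docs, ords) arrays. */
    static OrdIterator sparse(int[] docs, int[] ords) {
        return new OrdIterator() {
            private int i = -1;

            public int docID() {
                return i < 0 ? -1 : i < docs.length ? docs[i] : NO_MORE_DOCS;
            }

            public int nextDoc() {
                if (i < docs.length) {
                    i++;
                }
                return docID();
            }

            public int ordValue() { return ords[i]; }
        };
    }

    public static void main(String[] args) {
        int[] denseOrds = {1, 0, -1, 2};
        InOrderOrds dense = doc -> denseOrds[doc];
        OrdIterator adapter = new IteratorAdapter(dense, denseOrds.length);
        System.out.println("unwrapped: " + (asLookup(adapter) == dense));

        InOrderOrds wrapped = asLookup(sparse(new int[] {1, 3}, new int[] {5, 7}));
        for (int doc = 0; doc < 4; doc++) {
            System.out.println(doc + " -> " + wrapped.getOrd(doc));
        }
    }
}
```

Consumers only ever see {{InOrderOrds}}, so the positioning logic lives in one place instead of in every facet or comparator implementation.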

One challenge is that we already use the same idea for hiding single-valued 
impls behind multi-valued impls, so we would need to enforce the order in which 
the wrapping needs to happen. At first sight, it seems that it would be best to 
do the single-value-behind-multi-value-API wrapping above the 
random-access-behind-iterator-API wrapping. The complexity of 
wrapping/unwrapping in the right order could be contained in the {{DocValues}} 
helper class.
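
The ordering could be sketched as follows. All types and method names here are hypothetical and the iteration methods are omitted for brevity; the only point is that the multi-valued wrapper sits on the outside, so unwrapping has to peel it off first.

```java
/** Hypothetical sketch of the wrapping order; these are not Lucene's real classes. */
final class WrapOrderSketch {

    /** In-order lookup API for single-valued fields; -1 means "no value". */
    interface InOrderOrds {
        int getOrd(int docID);
    }

    /** Single-valued iterator API (iteration methods omitted for brevity). */
    interface SingleIterator { }

    /** Multi-valued iterator API (iteration methods omitted for brevity). */
    interface MultiIterator { }

    /** Inner layer: random-access-behind-iterator-API. */
    static final class IteratorOverLookup implements SingleIterator {
        final InOrderOrds in;
        IteratorOverLookup(InOrderOrds in) { this.in = in; }
    }

    /** Outer layer: single-value-behind-multi-value-API. */
    static final class MultiOverSingle implements MultiIterator {
        final SingleIterator in;
        MultiOverSingle(SingleIterator in) { this.in = in; }
    }

    /** Wrapping always happens in this order: lookup, then iterator, then multi. */
    static MultiIterator wrap(InOrderOrds ords) {
        return new MultiOverSingle(new IteratorOverLookup(ords));
    }

    /** Returns the single-valued view, or null if the field is truly multi-valued. */
    static SingleIterator unwrapSingleton(MultiIterator multi) {
        return multi instanceof MultiOverSingle ? ((MultiOverSingle) multi).in : null;
    }

    /** Returns the lookup view, or null if the codec only offers an iterator. */
    static InOrderOrds unwrapLookup(SingleIterator single) {
        return single instanceof IteratorOverLookup ? ((IteratorOverLookup) single).in : null;
    }

    /** Unwrapping mirrors the wrapping order: outer layer first, then inner. */
    static InOrderOrds unwrapFully(MultiIterator multi) {
        SingleIterator single = unwrapSingleton(multi);
        return single == null ? null : unwrapLookup(single);
    }

    public static void main(String[] args) {
        InOrderOrds ords = doc -> doc % 2;
        System.out.println(unwrapFully(wrap(ords)) == ords);
        System.out.println(unwrapFully(new MultiIterator() { }) == null);
    }
}
```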

I think this change would also simplify search-time consumption of doc values, 
which currently needs to spend several lines of code positioning the iterator 
every time it needs to do something interesting with doc values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
