Adrien Grand commented on LUCENE-7407:

I agree there are things to be improved there (see LUCENE-7462 too).

The comparison might not be entirely fair to the new API since the Lucene54 
format was really designed with the old random-access API in mind. I'm 
wondering how much we can get back by more naturally implementing the new API. 
But I suspect we will have to do more to get back to performance that is close 
to what we had before, at least in the dense case. To me there are two ways 
that we can do it:
 - adjust the iterator API of doc values to require less search-time work. 
Because of the advance() semantics, we currently need to guard all value 
accesses under something that looks like this:
      int curDocID = docTerms.docID();
      if (doc > curDocID) {
        // advance() requires a target strictly beyond the current doc
        curDocID = docTerms.advance(doc);
      }
      if (doc == curDocID) {
        // handle value
      } else {
        // handle missing value
      }
(copied from {{FieldComparator}}). The advance() semantics cost us in two ways. 
First, advance() returns the next document that has a value, which we never 
need at search time, so this is unnecessary work for the codec. Second, it 
requires that the target be strictly beyond the current document, which 
prevents us from calling advance(doc) blindly: we first need to check whether 
the iterator is already on the target document or beyond it. Maybe we could 
instead have something like an advanceExact(target) method that would only 
advance to the target document and return whether it has a value.
 - have a 2nd DV API that looks like the old API (with the additional 
constraint that doc ids need to be consumed in order) and helpers in the 
DocValues class to convert from dv producers with an iterator API to this 
random-access API (the LUCENE-7462 proposal). When the codec specializes the 
dense case (which would always be the case with the default codec), the 
conversion would only unwrap the iterator to return a random-access API. 
Otherwise it would wrap the iterator in order to check the current doc ID and advance if 
necessary (like almost every consumer of the doc values APIs needs to do now in 
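
To make the two options above concrete, here is a minimal sketch in Java. It 
assumes a simplified iterator interface: SketchDocValuesIterator, 
ArrayDocValues, DocValuesSketch and InOrderRandomAccess are hypothetical names 
made up for illustration, not Lucene classes, and advanceExact is emulated on 
top of advance() here, whereas the point of the proposal is that a codec could 
implement it directly without looking for the next document that has a value.

```java
import java.util.Arrays;

/** Hypothetical, simplified doc-values iterator; not Lucene's actual API. */
interface SketchDocValuesIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Current doc id, or -1 before the first advance. */
    int docID();

    /** Moves to the first doc >= target that has a value; target must be > docID(). */
    int advance(int target);

    /** Value of the current doc. */
    long longValue();
}

/** Sparse in-memory iterator over sorted doc ids, for illustration only. */
final class ArrayDocValues implements SketchDocValuesIterator {
    private final int[] docs;
    private final long[] values;
    private int idx = -1;

    ArrayDocValues(int[] sortedDocs, long[] values) {
        this.docs = sortedDocs;
        this.values = values;
    }

    @Override
    public int docID() {
        if (idx < 0) {
            return -1;
        }
        return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    @Override
    public int advance(int target) {
        if (target <= docID()) {
            throw new IllegalArgumentException("target must be beyond the current doc");
        }
        // Search only the docs after the current position.
        int pos = Arrays.binarySearch(docs, idx + 1, docs.length, target);
        idx = pos >= 0 ? pos : -pos - 1;
        return docID();
    }

    @Override
    public long longValue() {
        return values[idx];
    }
}

final class DocValuesSketch {
    /** First option (emulated): position on target, report whether it has a value. */
    static boolean advanceExact(SketchDocValuesIterator it, int target) {
        int cur = it.docID();
        if (target > cur) {
            cur = it.advance(target);
        }
        return cur == target;
    }

    public static void main(String[] args) {
        SketchDocValuesIterator it =
            new ArrayDocValues(new int[] {1, 5, 9}, new long[] {10L, 50L, 90L});
        InOrderRandomAccess values = new InOrderRandomAccess(it, -1L);
        for (int doc = 0; doc <= 10; doc++) {
            System.out.println(doc + " -> " + values.get(doc));
        }
    }
}

/** Second option: random-access-style view; doc ids must be requested in order. */
final class InOrderRandomAccess {
    private final SketchDocValuesIterator it;
    private final long missingValue;

    InOrderRandomAccess(SketchDocValuesIterator it, long missingValue) {
        this.it = it;
        this.missingValue = missingValue;
    }

    long get(int doc) {
        return DocValuesSketch.advanceExact(it, doc) ? it.longValue() : missingValue;
    }
}
```

With this shape the consumer only calls get(doc) with non-decreasing doc ids 
and never has to compare against docID() itself. A codec that specializes the 
dense case could return its own implementation of the same view and skip the 
wrapping entirely.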

> Explore switching doc values to an iterator API
> -----------------------------------------------
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>         Attachments: LUCENE-7407.patch
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.

