Nikolay Khitrin created LUCENE-8178:

             Summary: Bulk operations for LongValues and Sorted[Set]DocValues
                 Key: LUCENE-8178
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 7.2.1
            Reporter: Nikolay Khitrin

One-by-one DocValues iteration via {{advanceExact}} and {{nextOrd}}/{{ordValue}} 
is really slow for bulk operations such as faceting. Reading and unpacking 
integers in blocks is substantially faster, but DocValues can currently be 
queried only for a single document at a time.

To apply document-based bulk processing, {{DocIdSetIterator}} matches have to be 
split into sequential docID runs and remapped to the underlying {{LongValues}}.
After this transformation, relatively large linear scans can be performed over 
packed integers.
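
The run-splitting step can be sketched as follows. This is only an illustration, not code from the patch: {{DocIdRuns}} and {{Run}} are hypothetical names, and a plain ascending {{int[]}} stands in for the {{DocIdSetIterator}} so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups ascending doc IDs into contiguous [begin, end) runs. */
final class DocIdRuns {

  /** A half-open run of consecutive doc IDs: begin inclusive, end exclusive. */
  static final class Run {
    final int begin;
    final int end;

    Run(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + begin + ", " + end + ")";
    }
  }

  /** Splits strictly ascending doc IDs into maximal runs of consecutive IDs. */
  static List<Run> split(int[] sortedDocIds) {
    List<Run> runs = new ArrayList<>();
    if (sortedDocIds.length == 0) {
      return runs;
    }
    int begin = sortedDocIds[0];
    int prev = begin;
    for (int i = 1; i < sortedDocIds.length; i++) {
      int doc = sortedDocIds[i];
      if (doc != prev + 1) {
        // A gap ends the current run; close it and start a new one.
        runs.add(new Run(begin, prev + 1));
        begin = doc;
      }
      prev = doc;
    }
    runs.add(new Run(begin, prev + 1));
    return runs;
  }

  public static void main(String[] args) {
    // prints [[3, 6), [9, 11), [42, 43)]
    System.out.println(split(new int[] {3, 4, 5, 9, 10, 42}));
  }
}
```

Each resulting run can then be handed to a range-based scan in one call instead of being visited one document at a time.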


To do this, two new interfaces

1. {{LongValuesCollector}} ({{collectValue(long index, long value)}}).
 2. {{OrdStatsCollector}} ({{collectOrd(long ord)}}, {{collectMissing(int 

and three new functions are introduced

1. {{LongValues.forRange(long begin, long end, LongValuesCollector collector)}}
 2. {{SortedDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer 
 3. {{SortedSetDocValues.forEach(DocIdSetIterator disi, OrdStatsConsumer 

with reference implementations.
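
A one-by-one fallback for {{forRange}} might look like the sketch below. It is not the reference implementation from the patch, which is not shown in this message. {{SketchLongValues}} is a hypothetical, simplified stand-in for Lucene's {{LongValues}} (which exposes {{get(long index)}}), and treating {{begin}} as inclusive and {{end}} as exclusive is an assumption.

```java
/** Callback named in the issue text: receives each index together with its value. */
interface LongValuesCollector {
  void collectValue(long index, long value);
}

/** Hypothetical, simplified stand-in for Lucene's LongValues. */
abstract class SketchLongValues {
  abstract long get(long index);

  /** One-by-one fallback; assumes begin is inclusive and end is exclusive. */
  void forRange(long begin, long end, LongValuesCollector collector) {
    for (long i = begin; i < end; i++) {
      collector.collectValue(i, get(i));
    }
  }
}

final class ForRangeDemo {

  /** Sums data[begin..end) by pushing each value through a collector. */
  static long sumRange(long[] data, long begin, long end) {
    SketchLongValues values = new SketchLongValues() {
      @Override
      long get(long index) {
        return data[(int) index];
      }
    };
    long[] sum = {0};
    values.forRange(begin, end, (index, value) -> sum[0] += value);
    return sum[0];
  }

  public static void main(String[] args) {
    // prints 50
    System.out.println(sumRange(new long[] {10, 20, 30, 40}, 1, 3));
  }
}
```

An optimized subclass could override {{forRange}} to decode a whole block of packed integers at once while callers keep using the same collector.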

Optimized versions of these functions are provided for:
 1. {{DirectReader}} for non-32/64 bits per value cases (using 
 2. {{Lucene70DocValuesProducer}} {{getSorted}} and {{getSortedSet}} (both 
sparse and dense).


Measured Solr faceting performance improves by up to 2-2.5x on a real index.
A patch for Solr {{DocValuesFacets}} is also provided as a separate file.


Implementation notes:
 * {{OrdStatsCollector}} does not accept a document ID because passing one 
would ruin performance for {{SortedSetDocValues}} due to excessive position 
lookups.
 * This patch is fully compatible with the Lucene 7.0 DocValues format.
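
To illustrate why an ordinal-only callback is enough for faceting, here is a minimal counting sketch. {{OrdCounter}} is a hypothetical class, not part of the patch. It mirrors only {{collectOrd(long ord)}}; {{collectMissing}} is left out because its signature is cut off in this message.

```java
/** Hypothetical facet counter that needs only ordinals, never document IDs. */
final class OrdCounter {
  private final int[] counts;

  /** valueCount is the number of distinct ordinals in the field. */
  OrdCounter(int valueCount) {
    this.counts = new int[valueCount];
  }

  /** Mirrors collectOrd(long ord) from the issue text: bump the count for this ordinal. */
  void collectOrd(long ord) {
    counts[(int) ord]++;
  }

  int count(int ord) {
    return counts[ord];
  }

  /** Feeds a stream of ordinals through a counter and returns per-ordinal totals. */
  static int[] countOrds(long[] ords, int valueCount) {
    OrdCounter counter = new OrdCounter(valueCount);
    for (long ord : ords) {
      counter.collectOrd(ord);
    }
    int[] result = new int[valueCount];
    for (int i = 0; i < valueCount; i++) {
      result[i] = counter.count(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [1, 1, 3]
    System.out.println(java.util.Arrays.toString(countOrds(new long[] {0, 2, 2, 1, 2}, 3)));
  }
}
```

Because the totals depend only on which ordinals were seen, the producer is free to scan ordinals in bulk without tracking which document each one belongs to.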
