Christoph Kaser created LUCENE-7861:
---------------------------------------

             Summary: Hidden assumption that return value of 
IndexSearcher.slices is an array of continuous sequential slices of the index
                 Key: LUCENE-7861
                 URL: https://issues.apache.org/jira/browse/LUCENE-7861
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: 6.5.1, 6.0
            Reporter: Christoph Kaser


The IndexSearcher method 
{code:java}protected LeafSlice[] slices(List<LeafReaderContext> leaves){code}
can be overridden to customize how the index is searched with multiple threads. 
However, IndexSearcher assumes the result is an ordered array of continuous 
slices of the index. If the result is "interleaved" or unordered, searchAfter 
may skip results.

The issue seems to be how searchAfter works vs how TopDocs.merge works:

searchAfter skips every document with a higher score than the "after" document. 
In case of equal scores, it uses the document id and skips every document whose 
id is less than or equal to that of the "after" document (see 
PagingFieldCollector).

TopDocs.merge uses the score to determine which hits should be part of the 
merged TopDocs. In case of equal scores, it uses the shard index (this 
corresponds to the slices the IndexSearcher uses) to break ties (see 
ScoreMergeSortQueue.lessThan).

So if the shards are noncontinuous/unordered, searchAfter uses a different way 
of sorting the documents than TopDocs.merge, and therefore hits are skipped.
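
To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the Hit class and the page method are hypothetical names that only mimic the two tie-breaking rules described above (skip by document id in searchAfter, order by slice index in the merge).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the mismatch; none of this is Lucene code. */
final class SearchAfterSkipDemo {

    /** A hit with its score, global doc id, and the index of the slice that produced it. */
    static final class Hit {
        final float score;
        final int doc;
        final int slice;

        Hit(float score, int doc, int slice) {
            this.score = score;
            this.doc = doc;
            this.slice = slice;
        }
    }

    /** Builds hits that all share one score; slices[i] lists the doc ids in slice i. */
    static List<Hit> equalScoreHits(int[][] slices) {
        List<Hit> hits = new ArrayList<>();
        for (int slice = 0; slice < slices.length; slice++) {
            for (int doc : slices[slice]) {
                hits.add(new Hit(1.0f, doc, slice));
            }
        }
        return hits;
    }

    /** Returns one page of at most n hits that come after the given hit (null for the first page). */
    static List<Hit> page(List<Hit> hits, Hit after, int n) {
        List<Hit> candidates = new ArrayList<>();
        for (Hit h : hits) {
            // searchAfter rule: skip higher scores, and on equal scores skip doc ids <= after.doc
            boolean skipped = after != null
                    && (h.score > after.score || (h.score == after.score && h.doc <= after.doc));
            if (!skipped) {
                candidates.add(h);
            }
        }
        // Merge rule: score descending, ties broken by slice index, then by doc id within a slice
        candidates.sort((a, b) -> {
            int c = Float.compare(b.score, a.score);
            if (c != 0) {
                return c;
            }
            c = Integer.compare(a.slice, b.slice);
            return c != 0 ? c : Integer.compare(a.doc, b.doc);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

With interleaved slices {0, 2} and {1, 3} and a page size of 2, the first page in this model contains documents 0 and 2. The second page, requested after document 2, contains only document 3, so document 1 is never returned. With continuous slices {0, 1} and {2, 3}, the pages are 0, 1 and then 2, 3.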

On the mailing list, Michael McCandless suggested either improving 
TopDocs.merge to optionally use the docID for tie breaking (optionally, because 
the docID is apparently not always global across calls to TopDocs.merge) or 
at least documenting the requirement on the return value of 
IndexSearcher.slices().
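
As a rough illustration of the first suggestion, the sketch below merges per-slice hit lists and breaks score ties by docID instead of slice index. It is plain Java under stated assumptions, not a patch for TopDocs.merge: the Hit and Cursor classes and the merge method are hypothetical, and the docIDs are assumed to be global across all slices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy k-way merge with docID tie-breaking; not the Lucene implementation. */
final class DocIdTieBreakMerge {

    /** A hit with its score and a doc id that is assumed to be global across slices. */
    static final class Hit {
        final float score;
        final int doc;

        Hit(float score, int doc) {
            this.score = score;
            this.doc = doc;
        }
    }

    /** Read position in one slice's hits, which must be sorted by score descending, then doc id. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** Merges the per-slice lists and returns the top n hits. */
    static List<Hit> merge(List<List<Hit>> perSlice, int n) {
        // Higher score first; on equal scores the smaller doc id wins, regardless of slice
        PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> {
            int c = Float.compare(b.head().score, a.head().score);
            return c != 0 ? c : Integer.compare(a.head().doc, b.head().doc);
        });
        for (List<Hit> hits : perSlice) {
            if (!hits.isEmpty()) {
                queue.add(new Cursor(hits));
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < n && !queue.isEmpty()) {
            Cursor top = queue.poll();
            merged.add(top.head());
            top.pos++; // advance while the cursor is outside the queue
            if (top.pos < top.hits.size()) {
                queue.add(top);
            }
        }
        return merged;
    }
}
```

In this model, equally scored hits from interleaved slices {0, 2} and {1, 3} come out in docID order (0, 1, 2, 3), which is the same order the searchAfter skip rule assumes.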

In my use case (generating a fixed number of slices of approximately equal 
size), the requirement of ordered slices will lead to a less optimal result, 
but I am not sure whether this has a real impact on performance.
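
For reference, one way to respect the ordering requirement while still aiming for roughly equal sizes is a greedy split over the leaves in index order. The sketch below is plain Java and only an assumption about how this could be done: it works on leaf document counts and returns leaf indexes rather than LeafSlice objects, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Greedy split of leaves (in index order) into contiguous groups of similar total size. */
final class SlicePartitionSketch {

    /**
     * leafSizes[i] is the document count of leaf i. Returns at most sliceCount groups of
     * consecutive leaf indexes; a very large leaf can lead to fewer groups.
     */
    static List<List<Integer>> partition(int[] leafSizes, int sliceCount) {
        long total = 0;
        for (int size : leafSizes) {
            total += size;
        }
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long seen = 0;
        for (int leaf = 0; leaf < leafSizes.length; leaf++) {
            current.add(leaf);
            seen += leafSizes[leaf];
            // Close the group once the running total reaches the next equal-share boundary;
            // the last group always takes whatever remains.
            long boundary = total * (slices.size() + 1) / sliceCount;
            if (seen >= boundary && slices.size() < sliceCount - 1) {
                slices.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }
}
```

For leaf sizes 50, 10, 10, 30, 40, 60 and three slices, this yields the groups [0, 1, 2], [3, 4], [5] with 70, 70 and 60 documents. The groups stay continuous and ordered, at the cost of a less even split than an unordered assignment could reach.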



