Christoph Kaser created LUCENE-7861:
---------------------------------------
Summary: Hidden assumption that return value of
IndexSearcher.slices is an array of continuous sequential slices of the index
Key: LUCENE-7861
URL: https://issues.apache.org/jira/browse/LUCENE-7861
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Affects Versions: 6.5.1, 6.0
Reporter: Christoph Kaser
The IndexSearcher-method
{code:java}protected LeafSlice[] slices(List<LeafReaderContext> leaves){code}
can be overridden to customize how the index is searched with multiple threads.
However, the IndexSearcher assumes the result is an ordered array of continuous
slices of the index. If the result is "interleaved" or unordered, searchAfter
may skip results.
The issue seems to be how searchAfter works vs how TopDocs.merge works:
searchAfter skips every document with a higher score than the "after" document.
In case of equal scores, it uses the document id and skips every document whose
id is less than or equal to that of the "after" document (see PagingFieldCollector).
TopDocs.merge uses the score to determine which hits should be part of the
merged TopDocs. In case of equal scores, it uses the shard index (this
corresponds to the slices the IndexSearcher uses) to break ties (see
ScoreMergeSortQueue.lessThan).
So if the shards are noncontinuous/unordered, searchAfter uses a different way
of sorting the documents than TopDocs.merge, and therefore hits are skipped.
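To illustrate the mismatch described above, here is a minimal sketch in plain Java that does not use Lucene at all. The class SliceTieBreakDemo and its methods are hypothetical names; the model assumes global doc ids 0..n-1, that every doc appears in exactly one slice, and that the two tie-breaking rules behave as described in this issue (merge: score, then shard index; searchAfter: score, then doc id).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SliceTieBreakDemo {

    /** One hit: global doc id, score, and the index of the slice that produced it. */
    static final class Hit {
        final int doc;
        final float score;
        final int shardIndex;

        Hit(int doc, float score, int shardIndex) {
            this.doc = doc;
            this.score = score;
            this.shardIndex = shardIndex;
        }
    }

    /**
     * Simulated first page: hits ordered by score (descending), ties broken by
     * shard index, then by doc id within the shard (assumed merge behaviour).
     */
    static List<Hit> firstPage(int[][] slices, float[] scores, int pageSize) {
        List<Hit> all = new ArrayList<>();
        for (int s = 0; s < slices.length; s++) {
            for (int doc : slices[s]) {
                all.add(new Hit(doc, scores[doc], s));
            }
        }
        all.sort(Comparator.<Hit>comparingDouble(h -> -h.score)
                .thenComparingInt(h -> h.shardIndex)
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(all.subList(0, Math.min(pageSize, all.size())));
    }

    /** Docs that are neither on the first page nor eligible for the next page. */
    static List<Integer> skippedDocs(int[][] slices, float[] scores, int pageSize) {
        List<Hit> page = firstPage(slices, scores, pageSize);
        List<Integer> skipped = new ArrayList<>();
        if (page.isEmpty()) {
            return skipped;
        }
        Hit after = page.get(page.size() - 1);
        boolean[] onFirstPage = new boolean[scores.length];
        for (Hit h : page) {
            onFirstPage[h.doc] = true;
        }
        for (int doc = 0; doc < scores.length; doc++) {
            // Assumed searchAfter rule: keep lower scores, or equal scores with a larger doc id.
            boolean eligible = scores[doc] < after.score
                    || (scores[doc] == after.score && doc > after.doc);
            if (!onFirstPage[doc] && !eligible) {
                skipped.add(doc);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        float[] scores = new float[6];
        Arrays.fill(scores, 1.0f); // all six docs tie on score

        int[][] interleaved = {{0, 2, 4}, {1, 3, 5}};
        int[][] contiguous = {{0, 1, 2}, {3, 4, 5}};

        System.out.println(skippedDocs(interleaved, scores, 3)); // prints [1, 3]
        System.out.println(skippedDocs(contiguous, scores, 3));  // prints []
    }
}
```

With interleaved slices the first page is docs 0, 2, 4 (all from slice 0), so the "after" document is doc 4 and docs 1 and 3 are never returned. With contiguous, ordered slices the shard order and the doc id order agree, and nothing is lost.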
On the mailing list, Michael McCandless suggested either improving
TopDocs.merge to optionally use the docID for tie breaking (optionally, because
the docID is apparently not global in every call of TopDocs.merge) or
at least documenting the requirement on the return value of
IndexSearcher.slices().
In my use case (generating a fixed number of slices of approximately equal
size), the requirement of ordered slices will result in a less optimal result -
but I am not sure whether this has a real impact on performance.
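For reference, a rough sketch of what an ordered, contiguous partition into a fixed number of slices could look like. This is plain Java rather than the real slices() override: ContiguousSlicesDemo and contiguousSlices are hypothetical names, leafSizes stands in for the per-leaf document counts, and the result holds leaf indices instead of LeafSlice objects. It is a greedy split under those assumptions, not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.List;

class ContiguousSlicesDemo {

    /**
     * Splits leaves (given by their sizes, in index order) into at most
     * numSlices groups of consecutive leaves with roughly equal total size.
     */
    static List<List<Integer>> contiguousSlices(int[] leafSizes, int numSlices) {
        long total = 0;
        for (int size : leafSizes) {
            total += size;
        }
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long seen = 0;
        for (int i = 0; i < leafSizes.length; i++) {
            current.add(i);
            seen += leafSizes[i];
            boolean lastSlice = slices.size() == numSlices - 1;
            // Close the slice once the running total reaches the next equal-share boundary.
            if (!lastSlice && seen * numSlices >= total * (slices.size() + 1)) {
                slices.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        int[] leafSizes = {50, 10, 10, 30, 20, 80};
        System.out.println(contiguousSlices(leafSizes, 2)); // prints [[0, 1, 2, 3], [4, 5]]
    }
}
```

Because leaves are never reordered, a single large leaf in the middle of the index can make the slices noticeably uneven, which is the trade-off mentioned above.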