[jira] [Commented] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved

Atri Sharma (JIRA) Tue, 11 Jun 2019 02:12:32 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860713#comment-16860713 ]


Atri Sharma commented on LUCENE-8829:
-------------------------------------

bq. I mean ordering on score or sort fields, then shardIndex, then docID all 
the time. In the case that we are interested in, all documents will have a 
shardIndex of -1 so this would be equivalent to sorting on score or sort fields 
and then docID?

Ok, that makes sense.
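
For reference, the ordering described in the quote can be sketched as a plain comparator. This is only an illustration, not the code in TopDocs: HitOrdering, Hit and sortedDocs are hypothetical names, Hit is a stand-in rather than Lucene's ScoreDoc, and only the score case is covered (no sort fields).

```java
import java.util.Arrays;
import java.util.Comparator;

final class HitOrdering {
    /** Hypothetical stand-in for a scored hit; not Lucene's ScoreDoc class. */
    static final class Hit {
        final float score;
        final int shardIndex; // -1 means "not set", as in the discussion above
        final int doc;

        Hit(float score, int shardIndex, int doc) {
            this.score = score;
            this.shardIndex = shardIndex;
            this.doc = doc;
        }
    }

    /** Higher score first; ties go to the lower shardIndex, then the lower docID. */
    static final Comparator<Hit> ORDER = (a, b) -> {
        int c = Float.compare(b.score, a.score); // descending by score
        if (c != 0) return c;
        c = Integer.compare(a.shardIndex, b.shardIndex);
        if (c != 0) return c;
        return Integer.compare(a.doc, b.doc);
    };

    /** Returns the docIDs of the given hits in merged order. */
    static int[] sortedDocs(Hit... hits) {
        Hit[] copy = hits.clone();
        Arrays.sort(copy, ORDER);
        return Arrays.stream(copy).mapToInt(h -> h.doc).toArray();
    }
}
```

When every hit carries shardIndex == -1, the middle comparison always returns 0, so the order falls through to docID, which is the equivalence the quote points out.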

bq.  That said, it would probably be a bug if some hits have a shardIndex and 
others don't (value == -1) so maybe we could check this on the ScoreDocs that 
we are seeing instead of the existing check that we have today when 
setShardIndex==false and shardIndex==-1?

Yes, the reason I proposed checking all ScoreDocs upfront is to guard against 
malformed docs.

I think we will still need IndexSearcher to pass false for setShardIndex, so 
that mergeAux does not assume the hits come from different shards. It also 
probably makes sense to ensure that all ScoreDocs the PQ sees are consistent 
in their shardIndex setting, i.e. either all of them have valid shard indices 
or none of them has the index set. This check should run irrespective of 
whether setShardIndex is true or false.
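
A rough sketch of the consistency check I have in mind, assuming -1 is the "not set" value as above. ShardIndexCheck and allShardIndicesSet are made-up names for illustration, and the real check would look at the ScoreDocs themselves rather than an int[]:

```java
final class ShardIndexCheck {
    /**
     * Returns true if every shardIndex is set and false if none is set
     * (including the empty case). Throws if set and unset values are mixed.
     */
    static boolean allShardIndicesSet(int[] shardIndices) {
        boolean sawSet = false;
        boolean sawUnset = false;
        for (int shardIndex : shardIndices) {
            if (shardIndex == -1) {
                sawUnset = true;
            } else {
                sawSet = true;
            }
            if (sawSet && sawUnset) {
                throw new IllegalArgumentException(
                    "inconsistent shardIndex: some hits have it set and others do not");
            }
        }
        return sawSet;
    }
}
```

The check does not take setShardIndex as an input, so it behaves the same whichever value the caller passes.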

WDYT?

> TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8829
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8829
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Atri Sharma
>            Priority: Major
>         Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch
>
>
> While investigating LUCENE-8819, I found that the order of results from 
> TopDocs#merge depends indirectly on the number of collectors involved in the 
> merge. This is troubling because 1) the number of collectors involved in a 
> merge is cost based and directly dependent on the number of slices created 
> for the parallel searcher case, and 2) the TopN hits code path will invoke 
> merge with a single Collector. So running the same TopN query with a 
> single-threaded searcher and with a parallel searcher can return results in 
> a different order, which breaks an invariant that should hold.
>  
> This happens because of the subtle way TopDocs#merge sets shardIndex on each 
> ScoreDoc while populating the priority queue used for merging. shardIndex is 
> essentially set to the ordinal of the collector that generated the hit. This 
> means that the shardIndex depends on the number of collectors, even for the 
> same set of hits.
>  
> When no sort order is specified, shardIndex is used to break ties between 
> equal scores. The same hits can therefore come back in a different order 
> when their shardIndex values differ.
>  
> I propose that we remove shardIndex from the default tie-breaking mechanism 
> and replace it with docID. DocID order is the de facto order expected during 
> collection, so it makes sense to use the same criterion for tie breaking 
> when scores are equal.
>  
> CC: [~ivera]



