[jira] [Commented] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved

Atri Sharma (JIRA) Tue, 11 Jun 2019 00:29:12 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860621#comment-16860621
 ]


Atri Sharma commented on LUCENE-8829:
-------------------------------------

[~jpountz] Agreed that the API is a bit weird in the proposed approach.

I stared at TopDocs#mergeAux for a long time and realised that the primary 
reason mergeAux cares about shard indices when setShardIndex = false is that 
it assumes shard indices will participate in hit ordering in any case, so a 
hit whose shardIndex is not set will cause issues. However, 
setShardIndex = false also implies that the user does not care about shard 
index ordering, so we should really be using docIDs to resolve ties in that 
case.

The attached patch proposes a new approach: if setShardIndex = false, simply 
use docIDs to break ties. Note that we still have a safety net for the case 
when merge() is asked to set shard indices but the hit's shard index is not 
set for some reason (getShardIndex()).
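
To make the tie-breaking rule concrete, here is a minimal sketch. It is not 
the code in the patch: Hit, requireShardIndex and mergeOrder are hypothetical 
names, Hit is a stand-in rather than Lucene's ScoreDoc, and using -1 to mean 
"shard index not set" is an assumption of the sketch.

```java
import java.util.Comparator;

final class TieBreakSketch {

    // Hypothetical stand-in for a scored hit; NOT Lucene's ScoreDoc.
    static final class Hit {
        final int doc;        // doc ID
        final float score;
        final int shardIndex; // -1 means "not set" (assumption of this sketch)

        Hit(int doc, float score, int shardIndex) {
            this.doc = doc;
            this.score = score;
            this.shardIndex = shardIndex;
        }
    }

    // Safety net (assumed behaviour): refuse to order by an unset shard index.
    static int requireShardIndex(Hit h) {
        if (h.shardIndex == -1) {
            throw new IllegalStateException("shardIndex not set for doc " + h.doc);
        }
        return h.shardIndex;
    }

    // Higher scores sort first. Score ties are broken by doc ID alone when
    // useShardIndex is false, and by shard index then doc ID when it is true.
    static Comparator<Hit> mergeOrder(boolean useShardIndex) {
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        Comparator<Hit> byDoc = Comparator.comparingInt(h -> h.doc);
        if (!useShardIndex) {
            return byScoreDesc.thenComparing(byDoc);
        }
        Comparator<Hit> byShard =
            (a, b) -> Integer.compare(requireShardIndex(a), requireShardIndex(b));
        return byScoreDesc.thenComparing(byShard).thenComparing(byDoc);
    }
}
```

In this sketch the shard index is never read when useShardIndex is false, so 
hits with an unset shard index are ordered without error in that mode.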

IndexSearcher can then ask merge to ignore shard indices and tie-break on 
docIDs. When merging across different IndexSearchers, setShardIndex = true can 
do the right thing. WDYT?
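
As a rough illustration of why this matters within a single IndexSearcher, 
the sketch below merges the same four equal-score hits once from a single 
collector and once split across two collectors. It does not use Lucene: Hit 
and mergedDocOrder are hypothetical names, and the merge is simplified to a 
sort over all hits, with the shard index assigned as the ordinal of the source 
list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SliceMergeSketch {

    // Hypothetical stand-in for a scored hit; NOT Lucene's ScoreDoc.
    static final class Hit {
        final int doc;
        final float score;
        int shardIndex = -1; // assigned during the merge

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Merges per-collector hit lists and returns the resulting doc ID order.
    // Each hit gets shardIndex = ordinal of the list it came from.
    static List<Integer> mergedDocOrder(List<List<Hit>> perCollector,
                                        boolean tieBreakOnShard) {
        List<Hit> all = new ArrayList<>();
        for (int i = 0; i < perCollector.size(); i++) {
            for (Hit h : perCollector.get(i)) {
                Hit copy = new Hit(h.doc, h.score);
                copy.shardIndex = i;
                all.add(copy);
            }
        }
        // Higher scores first, then the chosen tie-breakers.
        Comparator<Hit> order = (a, b) -> Float.compare(b.score, a.score);
        if (tieBreakOnShard) {
            order = order.thenComparingInt(h -> h.shardIndex);
        }
        order = order.thenComparingInt(h -> h.doc);
        all.sort(order);

        List<Integer> docs = new ArrayList<>();
        for (Hit h : all) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

With docs 2 and 3 in the first list and docs 0 and 1 in the second, the shard 
index tie-break orders the docs 2, 3, 0, 1, while a single list holding all 
four gives 0, 1, 2, 3. With the doc ID tie-break both layouts give 0, 1, 2, 3.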

 [^LUCENE-8829.patch] 

> TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8829
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8829
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Atri Sharma
>            Priority: Major
>         Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch
>
>
> While investigating LUCENE-8819, I understood that the order of results from 
> TopDocs#merge is indirectly dependent on the number of collectors involved 
> in the merge. This is troubling because 1) the number of collectors involved 
> in a merge is cost based and directly dependent on the number of slices 
> created for the parallel searcher case, and 2) the TopN hits code path will 
> invoke merge with a single Collector, so running the same TopN query with a 
> single-threaded searcher and with a parallel searcher will produce different 
> result orders, which breaks an invariant that should hold.
>  
> This happens because of the subtle way TopDocs#merge sets shardIndex on each 
> ScoreDoc while populating the priority queue used for merging. shardIndex is 
> essentially set to the ordinal of the collector that generated the hit. This 
> means that shardIndex depends on the number of collectors, even for the same 
> set of hits.
>  
> When no sort order is specified, shardIndex is used for tie breaking when 
> scores are equal. This translates to different orders for the same hits with 
> different shardIndices.
>  
> I propose that we remove shardIndex from the default tie breaking mechanism 
> and replace it with docID. DocID order is the de facto order expected during 
> collection, so it makes sense to use the same criterion for tie breaking 
> when scores are equal.
>  
> CC: [~ivera]



