[GitHub] [solr] cpoerschke commented on a change in pull request #151: SOLR-15437: ReRanking/LTR does not work in combination with custom sort and SolrCloud

GitBox Mon, 31 May 2021 14:01:35 -0700


cpoerschke commented on a change in pull request #151:
URL: https://github.com/apache/solr/pull/151#discussion_r642675160




##########
File path: 
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
##########
@@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest 
sreq) {
 
           shardDoc.sortFieldValues = unmarshalledSortFieldValues;
 
-          queue.insertWithOverflow(shardDoc);
+          if(reRankQueue != null && docCounter++ <= reRankDocsSize) {
+              ShardDoc droppedShardDoc = 
reRankQueue.insertWithOverflow(shardDoc);
+              // FIXME: Only works if the original request does not sort by 
score

Review comment:
       > ... but that would entail a PR to Lucene and we would have to wait for 
a newer version ...
   
   Maybe, maybe not. In terms of thinking about the code it is perhaps easiest 
to imagine `ScoreDoc` having an additional `original_score` element but in 
terms of writing the code it might not necessarily work out that way e.g. 
instead imagine within Solr
   
   ```
   public class ScoreDocWithOriginalScore extends 
org.apache.lucene.search.ScoreDoc {
     public float original_score;
     ...
   }
   ```
   
   or something similar and for that class to be used if and only if needed 
thus reducing impact of the changes on code paths that don't need the change. 
But still even then there'd be quite a bit of Solr code to cover to build 
confidence that the original score makes it to all the right places and isn't 
somehow lost somewhere.
   
   > ... to collect the returned documents before reranking in one queue (queue 
A) and the reranked documents in another queue (queue B). ...
   > ... But the reranking happens on the shards and we are on the coordinator 
(in mergeIds), so we only have the already "mixed" (reranked and nor reranked) 
results from the shards. ...
   
   Yes, the idea was to collect reranked and not-reranked documents in separate 
non-overlapping queues and it _assumed_ that within our "mixed" results we _can 
already tell apart_ the reranked and not-reranked results i.e. within each 
shard's response the first `reRankDocs` documents will be reranked and 
documents (if any) after that are not-reranked i.e. the `i < reRankDocsSize` 
condition. I have not verified if that assumption is accurate but here's a 
short example to explore the idea, noting that in the dual sharded collection 
the client application chooses a halfed reRankDocs on the understanding that 
the collection has two shards.
   
   ```
   single sharded collection
   
   shard1: a b c d e f g h i j x y z # without reranking
   
   shard1: J I H G F E D C B A x y z # with reRankDocs=10 (assuming reranking 
reverses order of reranked docs and uppercases them)
   
   overall results: J I H G F E D C B A x y z
   ```
   
   ```
   dual sharded collection
   
   shard1: a c e g i   y # without reranking
   shard2: b d f h j x z # without reranking
   
   shard1: I G E C A   y # with reRankDocs=5 (assuming reranking reverses order 
of reranked docs and uppercases them)
   shard2: J H F D B x z # with reRankDocs=5 (assuming reranking reverses order 
of reranked docs and uppercases them)
   
   reRankedQueue: J I H G F E D C B A
   queue: x y z
   
   overall results: J I H G F E D C B A x y z
   ```
   
   > ... What do you think about excluding ... to another PR? ... people with 
custom sorting that could already benefit from the first part ...
   
   Yes, that possibility had occurred to me too :-)
   
   In principle I think a fix for half of the issue could be better than no fix:
   * if it can be clearly documented/communicated what is fixed and what is not 
fixed
   * if existing behaviour is maintained for the 'not (yet) fixed' scenario
   * if we anticipate that the fix for the second half is not going to 
break/change things for anyone who benefitted from the first fix
   
   In terms of the specifics here, so the "fixed" condition would be that 
"sort" did not use "score" in any way i.e. `sort=popularity desc` is "fixed" 
but both `sort=score desc` and `sort=popularity desc, score desc` is "not 
fixed" as yet. I wonder if `QueryComponent` and `SortedHitQueueManager` could 
accommodate that logic so that we would maintain existing behaviour for the 
not-yet-fixed use cases?
   
   > ... If I did not understand your last idea correctly, please let me know. 
...
   
   Thanks for identifying which bits were and weren't clear! I've tried to 
clarify with the example and making the key _assumption_ explicit. Please let 
me know if that was useful or not. I think that approach would solve both the 
without-score and the with-score scenario.
   
   > ... And please be aware that I do not want to create pressure by 
suggesting the exclusion of the fix from this PR. :)
   
   I did not perceive any pressure here at all but appreciate your openness and 
clarity w.r.t. no pressure being intended, thank you. :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [solr] cpoerschke commented on a change in pull request #151: SOLR-15437: ReRanking/LTR does not work in combination with custom sort and SolrCloud

Reply via email to