cpoerschke commented on a change in pull request #151:
URL: https://github.com/apache/solr/pull/151#discussion_r642675160
##########
File path:
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
##########
@@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest
sreq) {
shardDoc.sortFieldValues = unmarshalledSortFieldValues;
- queue.insertWithOverflow(shardDoc);
+ if(reRankQueue != null && docCounter++ <= reRankDocsSize) {
+ ShardDoc droppedShardDoc =
reRankQueue.insertWithOverflow(shardDoc);
+ // FIXME: Only works if the original request does not sort by
score
Review comment:
> ... but that would entail a PR to Lucene and we would have to wait for
a newer version ...
Maybe, maybe not. In terms of thinking about the code it is perhaps easiest
to imagine `ScoreDoc` having an additional `original_score` element but in
terms of writing the code it might not necessarily work out that way e.g.
instead imagine within Solr
```
public class ScoreDocWithOriginalScore extends
org.apache.lucene.search.ScoreDoc {
public float original_score;
...
}
```
or something similar and for that class to be used if and only if needed
thus reducing impact of the changes on code paths that don't need the change.
But still even then there'd be quite a bit of Solr code to cover to build
confidence that the original score makes it to all the right places and isn't
somehow lost somewhere.
> ... to collect the returned documents before reranking in one queue (queue
A) and the reranked documents in another queue (queue B). ...
> ... But the reranking happens on the shards and we are on the coordinator
(in mergeIds), so we only have the already "mixed" (reranked and nor reranked)
results from the shards. ...
Yes, the idea was to collect reranked and not-reranked documents in separate
non-overlapping queues and it _assumed_ that within our "mixed" results we _can
already tell apart_ the reranked and not-reranked results i.e. within each
shard's response the first `reRankDocs` documents will be reranked and
documents (if any) after that are not-reranked i.e. the `i < reRankDocsSize`
condition. I have not verified if that assumption is accurate but here's a
short example to explore the idea, noting that in the dual sharded collection
the client application chooses a halfed reRankDocs on the understanding that
the collection has two shards.
```
single sharded collection
shard1: a b c d e f g h i j x y z # without reranking
shard1: J I H G F E D C B A x y z # with reRankDocs=10 (assuming reranking
reverses order of reranked docs and uppercases them)
overall results: J I H G F E D C B A x y z
```
```
dual sharded collection
shard1: a c e g i y # without reranking
shard2: b d f h j x z # without reranking
shard1: I G E C A y # with reRankDocs=5 (assuming reranking reverses order
of reranked docs and uppercases them)
shard2: J H F D B x z # with reRankDocs=5 (assuming reranking reverses order
of reranked docs and uppercases them)
reRankedQueue: J I H G F E D C B A
queue: x y z
overall results: J I H G F E D C B A x y z
```
> ... What do you think about excluding ... to another PR? ... people with
custom sorting that could already benefit from the first part ...
Yes, that possibility had occurred to me too :-)
In principle I think a fix for half of the issue could be better than no fix:
* if it can be clearly documented/communicated what is fixed and what is not
fixed
* if existing behaviour is maintained for the 'not (yet) fixed' scenario
* if we anticipate that the fix for the second half is not going to
break/change things for anyone who benefitted from the first fix
In terms of the specifics here, so the "fixed" condition would be that
"sort" did not use "score" in any way i.e. `sort=popularity desc` is "fixed"
but both `sort=score desc` and `sort=popularity desc, score desc` is "not
fixed" as yet. I wonder if `QueryComponent` and `SortedHitQueueManager` could
accommodate that logic so that we would maintain existing behaviour for the
not-yet-fixed use cases?
> ... If I did not understand your last idea correctly, please let me know.
...
Thanks for identifying which bits were and weren't clear! I've tried to
clarify with the example and making the key _assumption_ explicit. Please let
me know if that was useful or not. I think that approach would solve both the
without-score and the with-score scenario.
> ... And please be aware that I do not want to create pressure by
suggesting the exclusion of the fix from this PR. :)
I did not perceive any pressure here at all but appreciate your openness and
clarity w.r.t. no pressure being intended, thank you. :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]