cpoerschke commented on a change in pull request #151:
URL: https://github.com/apache/solr/pull/151#discussion_r641971468
##########
File path:
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
##########
@@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest
sreq) {
shardDoc.sortFieldValues = unmarshalledSortFieldValues;
- queue.insertWithOverflow(shardDoc);
+ if(reRankQueue != null && docCounter++ <= reRankDocsSize) {
+ ShardDoc droppedShardDoc =
reRankQueue.insertWithOverflow(shardDoc);
+ // FIXME: Only works if the original request does not sort by
score
Review comment:
> The current solution only works if the original sort did not sort by
score. ...
> ... We have no access to the score before reRanking at this point and I
currently see no possibility to retrieve it again. ...
I had some more ideas around this, it's a semi-structured trail of thoughts,
but sharing in that form in the hope that it might be useful and/or trigger
further ideas and thoughts.
----
So as indicated by the `// FIXME: Only works if the original request does
not sort by score` comment location, the problem happens when a document drops
out of `reRankQueue` (where it's compared against other reranked documents via
the reranked score) and joins `queue` (where it needs to be compared against
other (reranked or unreranked) documents via the sort clause).
One way to avoid this is to ensure dropping out never happens but that is a
change to the overall reranking logic and so for now here let's assume that is
not the logic we want ...
OriginalScoreFeature via fl=[fv] providing an original score is too late
timing wise since it happens during "get fields" after the "merge ids".
ShardParams.DISTRIB_SINGLE_PASS "distrib.singlePass" could be used to bring
timing forward but that would have other performance etc. implications.
Implying "distrib.singlePass=true" and implying fl=[fv] and an
OriginalScoreFeature that might not have been there in the model or feature
store would also be messy both on shard and federator code level.
All that aside, this a reranking problem generally i.e. not an LTR reranking
problem specifically i.e. OriginalScoreFeature and fl=[fv] is too specific a
solution.
So let's leave the merging logic here alone for a moment and go to the shard
code and determine where exactly the "original_score" information is lost. If
we find where it's lost then we can preserve it there and pass it along via the
code path that the "score" information takes and eventually it would reach the
merging logic here, at least that's the theory.
----
The score changes during rescoring.
In the case of LTR rescoring
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRRescorer.java#L121-L131
allocates `reranked` which is then filled from `firstPassResults` i.e.
structurally this is where the information loss happens. For non-LTR rescoring
something equivalent presumably happens elsewhere.
For thinking purposes let's assume that we could add an `original_score`
field to `ScoreDoc` itself i.e. at line 210 we could preserve the score:
```
scorer.getDocInfo().setOriginalDocScore(hit.score);
+ hit.original_score = hit.score;
hit.score = scorer.score();
```
----
Next let's explore how a distributed query's scores get from the shard to
the federator.
(Shoutout to Hoss' "Lifecycle of a Solr Search Request" talk at this point
e.g. http://people.apache.org/~hossman/rev2017/ slides.)
Somewhat jumping into the middle of the code QueryComponent
[L368](https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L368)
and
[L1512](https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L1512)
```
QueryResult result = new QueryResult();
...
rb.setResult(result);
```
`QueryResult` contains `DocListAndSet` which contains `DocList` and
`DocSet`. `DocSlice` implements `DocList` and one of the places (there might be
others) where `ScoreDoc` turns into `DocSlice` is at
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1623
----
So for thinking purposes we added `original_score` field to `ScoreDoc`
already and `DocSlice` contains an optional `scores` list and so let's imagine
adding an optional `original_scores` list alongside it:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/DocSlice.java#L39
```
public class DocSlice extends DocSetBase implements DocList {
...
final int[] docs; // a slice of documents (docs 0-100 of the query)
final float[] scores; // optional score list
+ final float[] original_scores; // optional original_score list
final long matches;
final TotalHits.Relation matchesRelation;
...
```
----
Now then though, how will that new original score list get "transported"
from shard to federator?
Using an unusual(?) trick here: instead of navigating the code via
DocSlice.scores we could navigate via DocSlice.matchesRelation on the
assumption that it leads to the same area but more easily.
One of the areas is
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java#L189-L197
and there the `res.getReturnFields()` leads to
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrReturnFields.java#L51-L53
```
public class SolrReturnFields extends ReturnFields {
// Special Field Keys
public static final String SCORE = "score";
```
and the speculation that perhaps a SolrReturnFields.ORIGINAL_SCORE might be
needed.
----
That's where the trail of thoughts led me.
Of course, it is too early to reach any conclusions here but going
intuitively and based on the code trails above it seems to me that getting
access to the original_score is feasible but would be quite an undertaking and
so before investing effort in that direction it could be helpful to first
further consider and firmly rule out any overall reranking logic that requires
only the already available information e.g. for shards to use `reRankDocs` and
for the federator whilst combining results to _not_ use `reRankDocs` but to
combine all shards' reranked documents separately in a `reRankQueue` that has
the same sizing as `queue` (and then documents dropping out of `reRankQueue`
need not be added to `queue` since them dropping out means there's already
sufficient documents in `reRankQueue` to serve the request).
I hope that is not too opinionated a way of sharing one perspective and that
some bits of it are useful? This distributed reranking stuff is an intriguing
problem! :-)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]