[
https://issues.apache.org/jira/browse/SOLR-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032208#comment-18032208
]
Yue Yu commented on SOLR-17976:
-------------------------------
Capturing [~hossman]'s comment from the email:
"
Ugh.
I think you are 100% correct, the merge logic *should* use the "shard
name" as the tie-breaker.
As for how to fix this...
The key hiccup is that the concept of a "shard" predates the concept of
a "shard name" -- going back to before "SolrCloud" was an idea, and you
could send solr a "distributed search" request by specifying a ','
separated list of "shards", where each shard was a '|' separated list of
"replica urls".
Much of the low level code still works that way, and only the higher level
code uses the cluster state to map a "shard name" to a "list of replica
urls".
By the time the code gets low enough down to where/when a ShardDoc is
constructed, I don't think the "shard name" info is in scope.
"
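To make the legacy format described above concrete, here is a minimal sketch of how such a "shards" value breaks down. The class name LegacyShardsParam, the parse helper, and the host/core names are hypothetical and exist only for illustration; this is not Solr's actual parsing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Solr's real implementation.
class LegacyShardsParam {

    /**
     * Splits a legacy "shards" value: shards are separated by ',' and the
     * replica URLs within one shard are separated by '|'.
     */
    static List<List<String>> parse(String shardsParam) {
        List<List<String>> shards = new ArrayList<>();
        for (String shard : shardsParam.split(",")) {
            // '|' is a regex metacharacter, so it has to be escaped.
            shards.add(Arrays.asList(shard.trim().split("\\|")));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Made-up hosts and cores: two shards, two replicas each.
        String shardsParam =
            "host1:8983/solr/core1|host2:8983/solr/core1,"
                + "host1:8983/solr/core2|host2:8983/solr/core2";
        List<List<String>> shards = parse(shardsParam);
        for (int i = 0; i < shards.size(); i++) {
            // Only a position and a list of URLs is available here;
            // nothing in the parsed value carries a "shard name".
            System.out.println("shard #" + i + " -> " + shards.get(i));
        }
    }
}
```

The parsed structure identifies each shard only by its position and its replica URLs, which is consistent with the point that a "shard name" is not naturally in scope at this level.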
> Solr 9.5 distributed search tie breaking logic is non-deterministic
> -------------------------------------------------------------------
>
> Key: SOLR-17976
> URL: https://issues.apache.org/jira/browse/SOLR-17976
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Yue Yu
> Priority: Major
>
> In the mergeIds function of QueryComponent, this heap
> ShardFieldSortedHitQueue is used to order the ShardDoc. However, in the
> *lessThan* function:
>
> {code:java}
> protected boolean lessThan(ShardDoc docA, ShardDoc docB) {
>   // If these docs are from the same shard, then the relative order
>   // is how they appeared in the response from that shard.
>   if (Objects.equals(docA.shard, docB.shard)) {
>     // if docA has a smaller position, it should be "larger" so it
>     // comes before docB.
>     // This will handle sorting by docid within the same shard
>     // comment this out to test comparators.
>     return !(docA.orderInShard < docB.orderInShard);
>   }
>   // run comparators
>   final int n = comparators.length;
>   int c = 0;
>   for (int i = 0; i < n && c == 0; i++) {
>     c =
>         (fields[i].getReverse())
>             ? comparators[i].compare(docB, docA)
>             : comparators[i].compare(docA, docB);
>   }
>   // solve tiebreaks by comparing shards (similar to using docid)
>   // smaller docid's beat larger ids, so reverse the natural ordering
>   if (c == 0) {
>     c = -docA.shard.compareTo(docB.shard);
>   }
>   return c < 0;
> }
> {code}
> The last tie-breaking logic is comparing ShardDoc.shard:
> {code:java}
> // solve tiebreaks by comparing shards (similar to using docid)
> // smaller docid's beat larger ids, so reverse the natural ordering
> if (c == 0) {
>   c = -docA.shard.compareTo(docB.shard);
> }
> {code}
> Here ShardDoc.shard contains the node IP as well as the shard name, for example:
> [http://127.0.0.1:8983/solr/my_collection_shard1_replica_n1]
> Consider this setup: 1 collection with 2 shards and 2 replicas, running on a
> 2-node cluster. For the same query, we may have documents coming from the
> following core combinations:
> # [http://node1_ip:8983/solr/my_collection_shard1_replica_n1] +
> [http://node2_ip:8983/solr/my_collection_shard2_replica_n2]
> # [http://node2_ip:8983/solr/my_collection_shard1_replica_n2] +
> [http://node1_ip:8983/solr/my_collection_shard2_replica_n1]
> Hence the same request may have different document rankings when there are
> documents from both shards with the same scores. This can get worse with more
> nodes/shards/replicas.
> I'm wondering whether we should use just the shard name (without the node IP)
> for tie-breaking instead, if that's possible.
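
To illustrate the flip described in the issue, here is a minimal sketch that applies the same tie-break expression to the two core combinations listed above. The class ShardTieBreakDemo and its helpers are hypothetical. In particular, shardName assumes the default core naming pattern `<collection>_shardN_replica_<type><n>` and recovers the name with a regex purely for demonstration; it is not a proposed fix, since the shard name may not be available where ShardDoc is built.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; only the tie-break expression mirrors the snippet above.
class ShardTieBreakDemo {

    // Assumption: cores follow the default "<collection>_shardN_replica_<type><n>" naming.
    private static final Pattern SHARD_IN_CORE = Pattern.compile("_(shard\\d+)_replica_");

    /** Extracts "shardN" from a replica URL, falling back to the whole URL. */
    static String shardName(String replicaUrl) {
        Matcher m = SHARD_IN_CORE.matcher(replicaUrl);
        return m.find() ? m.group(1) : replicaUrl;
    }

    /** Tie-break as in the issue: c = -a.compareTo(b); return c < 0. */
    static boolean lessThanByUrl(String shardA, String shardB) {
        return -shardA.compareTo(shardB) < 0;
    }

    /** Same tie-break, but comparing only the extracted shard names. */
    static boolean lessThanByName(String shardA, String shardB) {
        return -shardName(shardA).compareTo(shardName(shardB)) < 0;
    }

    public static void main(String[] args) {
        // Combination 1: shard1 served by node1, shard2 served by node2.
        String s1n1 = "http://node1_ip:8983/solr/my_collection_shard1_replica_n1";
        String s2n2 = "http://node2_ip:8983/solr/my_collection_shard2_replica_n2";
        // Combination 2: shard1 served by node2, shard2 served by node1.
        String s1n2 = "http://node2_ip:8983/solr/my_collection_shard1_replica_n2";
        String s2n1 = "http://node1_ip:8983/solr/my_collection_shard2_replica_n1";

        // URL-based tie-break: the shard1-vs-shard2 result depends on the node.
        System.out.println("by URL,  combo 1: " + lessThanByUrl(s1n1, s2n2));
        System.out.println("by URL,  combo 2: " + lessThanByUrl(s1n2, s2n1));
        // Name-based tie-break: the result is the same for both combinations.
        System.out.println("by name, combo 1: " + lessThanByName(s1n1, s2n2));
        System.out.println("by name, combo 2: " + lessThanByName(s1n2, s2n1));
    }
}
```

With the URL-based comparison the two combinations disagree on which shard's document wins the tie, while the name-based comparison gives the same answer for both.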