Re: Solr 9.5 distributed search tie breaking logic is non-deterministic

Chris Hostetter Tue, 21 Oct 2025 16:24:44 -0700


Ugh.


I think you are 100% correct, the merge logic *should* use the "shard 
name" as the tie-breaker.
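
To make the failure mode concrete, here is a minimal standalone sketch (not
the actual Solr code) of the final tie-break applied to the replica URLs
from the report quoted below. The class and the shardName() helper are
hypothetical, and the regex assumes core names follow the
"<collection>_<shardN>_replica_<type><n>" pattern seen in those URLs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TieBreakDemo {
    // Assumption: the logical shard name is embedded in the core name as
    // "_shardN_replica_". Illustrative only; not how Solr resolves shard names.
    private static final Pattern SHARD_IN_CORE =
        Pattern.compile("_(shard\\d+)_replica_");

    /** Hypothetical helper: extract the shard name from a replica URL. */
    static String shardName(String replicaUrl) {
        Matcher m = SHARD_IN_CORE.matcher(replicaUrl);
        return m.find() ? m.group(1) : replicaUrl;
    }

    /** Same shape as the quoted tie-break: reversed natural order of the full string. */
    static boolean lessThanByUrl(String shardA, String shardB) {
        return -shardA.compareTo(shardB) < 0;
    }

    /** Alternative tie-break that looks only at the shard name. */
    static boolean lessThanByShardName(String shardA, String shardB) {
        return -shardName(shardA).compareTo(shardName(shardB)) < 0;
    }

    public static void main(String[] args) {
        // Combination 1: shard1 served by node1, shard2 served by node2.
        String c1Shard1 = "http://node1_ip:8983/solr/my_collection_shard1_replica_n1";
        String c1Shard2 = "http://node2_ip:8983/solr/my_collection_shard2_replica_n2";
        // Combination 2: shard1 served by node2, shard2 served by node1.
        String c2Shard1 = "http://node2_ip:8983/solr/my_collection_shard1_replica_n2";
        String c2Shard2 = "http://node1_ip:8983/solr/my_collection_shard2_replica_n1";

        System.out.println("by URL, combination 1:  " + lessThanByUrl(c1Shard1, c1Shard2));
        System.out.println("by URL, combination 2:  " + lessThanByUrl(c2Shard1, c2Shard2));
        System.out.println("by name, combination 1: " + lessThanByShardName(c1Shard1, c1Shard2));
        System.out.println("by name, combination 2: " + lessThanByShardName(c2Shard1, c2Shard2));
    }
}
```

Comparing full URLs prints false for combination 1 and true for
combination 2, so the relative order of two tied documents flips depending
on which replicas answered. Comparing shard names prints false for both.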

As for how to fix this...

The key hiccup is that the concept of a "shard" predates the concept of 
a "shard name" -- going back to before "SolrCloud" was an idea, when you 
could send Solr a "distributed search" request by specifying a ',' 
separated list of "shards", where each shard was a '|' separated list of 
"replica urls".

Much of the low level code still works that way, and only the higher level 
code uses the cluster state to map a "shard name" to a "list of replica 
urls".
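
As a rough illustration of that legacy format, the sketch below splits a
"shards" value into shards and their replica urls. It is not Solr's own
parsing code; the class name and the host/core values are made up. The
point is that the result contains only urls, with no shard name anywhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LegacyShardsParam {
    /**
     * Splits a legacy "shards" value: ',' separates shards and
     * '|' separates the equivalent replica urls within one shard.
     */
    static List<List<String>> parse(String shardsParam) {
        List<List<String>> shards = new ArrayList<>();
        for (String shard : shardsParam.split(",")) {
            // '|' is a regex metacharacter, so it has to be escaped for split().
            shards.add(Arrays.asList(shard.trim().split("\\|")));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Hypothetical hosts and cores, for illustration only.
        String shardsParam =
            "host1:8983/solr/core1|host2:8983/solr/core1,host3:8983/solr/core2";
        System.out.println(parse(shardsParam));
    }
}
```

This prints
[[host1:8983/solr/core1, host2:8983/solr/core1], [host3:8983/solr/core2]]
which is a list of url lists and nothing more.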

By the time the code gets low enough down to where/when a ShardDoc is 
constructed, I don't think the "shard name" info is in scope.


Either way: Can you please file a bug report capturing this discussion ... 
not sure how hard it will be to fix, but we should at least track it (even 
if it probably has been broken for 10+ years)




: Date: Tue, 21 Oct 2025 17:16:09 -0500
: From: Yue Yu <[email protected]>
: Reply-To: [email protected]
: To: [email protected]
: Subject: Solr 9.5 distributed search tie breaking logic is non-deterministic
: 
: Hello,
: In the mergeIds function of QueryComponent, this
: heap ShardFieldSortedHitQueue is used to order the ShardDoc. However, in
: the *lessThan* function:
: 
: protected boolean lessThan(ShardDoc docA, ShardDoc docB) {
:   // If these docs are from the same shard, then the relative order
:   // is how they appeared in the response from that shard.
:   if (Objects.equals(docA.shard, docB.shard)) {
:     // if docA has a smaller position, it should be "larger" so it
:     // comes before docB.
:     // This will handle sorting by docid within the same shard
: 
:     // comment this out to test comparators.
:     return !(docA.orderInShard < docB.orderInShard);
:   }
: 
:   // run comparators
:   final int n = comparators.length;
:   int c = 0;
:   for (int i = 0; i < n && c == 0; i++) {
:     c =
:         (fields[i].getReverse())
:             ? comparators[i].compare(docB, docA)
:             : comparators[i].compare(docA, docB);
:   }
: 
:   // solve tiebreaks by comparing shards (similar to using docid)
:   // smaller docid's beat larger ids, so reverse the natural ordering
:   if (c == 0) {
:     c = -docA.shard.compareTo(docB.shard);
:   }
: 
:   return c < 0;
: }
: 
: The last tie-breaking logic is comparing ShardDoc.shard:
: 
: // solve tiebreaks by comparing shards (similar to using docid)
: // smaller docid's beat larger ids, so reverse the natural ordering
: if (c == 0) {
:   c = -docA.shard.compareTo(docB.shard);
: }
: 
: 
: Here ShardDoc.shard contains node ip as well as shard name, for example:
: http://127.0.0.1:8983/solr/my_collection_shard1_replica_n1
: Consider this setup: 1 collection with 2 shards and 2 replicas per shard,
: running on a 2-node cluster. For the same query, we may have documents
: coming from the following core combinations:
: 
:    1. http://node1_ip:8983/solr/my_collection_shard1_replica_n1 +
:    http://node2_ip:8983/solr/my_collection_shard2_replica_n2
:    2. http://node2_ip:8983/solr/my_collection_shard1_replica_n2 +
:    http://node1_ip:8983/solr/my_collection_shard2_replica_n1
: 
: Hence the same request may have different document rankings when there are
: documents from both shards with the same scores. This can get worse with
: more nodes/shards/replicas.
: I'm wondering if we should just use the shard name for tie breaking instead
: (no node ip), if that's possible.
: 
: Thank you,
: Yue
: 

-Hoss
http://www.lucidworks.com/
