After debugging a little I can confirm that the dedup is happening in
QueryComponent.mergeIds.
Distributed search has always done a quick-n-dirty dedup (i.e. it's
considered an error condition to have the same ID in different shards
anyway).
Actually, the two documents with the same ID are in the same shard.
They are routed to the same shard because they have the same ID. Remember
that I tweak my request params (basically setting overwrite=false) so that I
end up with indexWriter.addDocument (for both documents) in
DirectUpdateHandler2 instead of indexWriter.updateDocument.
There is a little inconsistency. The dedup is not reflected in the total
numFound unless the duplicate document(s) actually come back in your
query results.
Simple example: I have only two documents in my entire collection
(consisting of several shards). They both live in the same shard and
have the same ID (actually they are complete duplicates). I get this
funny behavior when searching:
* Searching with rows=0 or rows=1, I get the numFound=2 back - and in
the case of rows=1 I get the document (or one of them)
* Searching with rows>=2, I get numFound=1 back - and the document (or
one of them)
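
To make the behavior above easier to follow, here is a small standalone
model of it in Java. This is only a sketch and is not copied from the Solr
source: the class MergeSketch and the names ShardResponse, Merged,
queryShard and merge are all hypothetical. It assumes the merge step sums
the per-shard numFound values and subtracts one each time it skips an ID
it has already seen, which would explain why the count only drops when
both duplicates are actually returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MergeSketch {
    /** Hypothetical stand-in for one shard's response: its hit count and top IDs. */
    static final class ShardResponse {
        final long numFound;
        final List<String> ids;
        ShardResponse(long numFound, List<String> ids) {
            this.numFound = numFound;
            this.ids = ids;
        }
    }

    /** Hypothetical stand-in for the merged, client-visible result. */
    static final class Merged {
        final long numFound;
        final List<String> ids;
        Merged(long numFound, List<String> ids) {
            this.numFound = numFound;
            this.ids = ids;
        }
    }

    /** Models a shard answering a match-all query: full hit count, at most `rows` IDs. */
    static ShardResponse queryShard(List<String> storedIds, int rows) {
        int n = Math.min(rows, storedIds.size());
        return new ShardResponse(storedIds.size(), new ArrayList<>(storedIds.subList(0, n)));
    }

    /** Assumed merge logic: sum numFound, drop repeated IDs, decrement once per drop. */
    static Merged merge(List<ShardResponse> responses) {
        long numFound = 0;
        Set<String> seen = new HashSet<>();
        List<String> ids = new ArrayList<>();
        for (ShardResponse r : responses) {
            numFound += r.numFound;
            for (String id : r.ids) {
                if (!seen.add(id)) {
                    numFound--; // duplicate only noticed because it was returned
                    continue;
                }
                ids.add(id);
            }
        }
        return new Merged(numFound, ids);
    }

    public static void main(String[] args) {
        // One shard holds two documents with the same ID; the other shard is empty.
        List<String> shardA = Arrays.asList("doc1", "doc1");
        List<String> shardB = new ArrayList<>();
        for (int rows = 0; rows <= 2; rows++) {
            Merged m = merge(Arrays.asList(queryShard(shardA, rows), queryShard(shardB, rows)));
            System.out.println("rows=" + rows + " numFound=" + m.numFound + " docs=" + m.ids);
        }
    }
}
```

Under these assumptions the model gives numFound=2 for rows=0 and rows=1,
and numFound=1 for rows=2, matching what I see.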
It should be in QueryComponent.mergeIds
-Yonik