benwtrent commented on issue #13611:
URL: https://github.com/apache/lucene/issues/13611#issuecomment-3357213692

   > seeing different results when using KnnFloatVectorQuery vs 
DiversifyingChildrenFloatKnnVectorQuery seems odd.
   
   I don't think seeing differences is that odd. 
   
   `DiversifyingChildrenFloatKnnVectorQuery` joins to parent documents 
EAGERLY, meaning the `k` you provide doesn't refer to `k` child 
matches, but to `k` parent docs. So, we keep searching for as long as we 
find a better `kth` nearest parent document.
   
   When using `KnnFloatVectorQuery`, we stop searching once we stop finding a 
child document that is nearer than the `kth` child document. You then join back 
to the parent docs afterwards.
   
   Intuitively, it means that running `KnnFloatVectorQuery` should require a 
much higher `k` to achieve similar quality to 
`DiversifyingChildrenFloatKnnVectorQuery` because it's doing the join later.
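
To make the two meanings of `k` concrete, here is a minimal brute-force sketch in Python. It is not Lucene code and does no HNSW graph search; `late_join`, `eager_join`, and the child ids, parent ids, and scores are all made up for illustration. It only shows what `k` counts in each case.

```python
from typing import Dict, List, Tuple

# Each child is (child_id, parent_id, similarity_score); all values are hypothetical.
Child = Tuple[str, str, float]


def late_join(children: List[Child], k: int) -> List[str]:
    """Take the k best CHILDREN first, then collapse them to distinct parents."""
    top_children = sorted(children, key=lambda c: c[2], reverse=True)[:k]
    parents: List[str] = []
    for _child_id, parent_id, _score in top_children:
        if parent_id not in parents:
            parents.append(parent_id)
    return parents


def eager_join(children: List[Child], k: int) -> List[str]:
    """Keep only the best child per parent, then take the k best PARENTS."""
    best: Dict[str, float] = {}
    for _child_id, parent_id, score in children:
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    return sorted(best, key=best.get, reverse=True)[:k]


children = [
    ("a1", "A", 0.99), ("a2", "A", 0.98), ("a3", "A", 0.97),
    ("b1", "B", 0.90), ("c1", "C", 0.85),
]

late = late_join(children, k=3)
eager = eager_join(children, k=3)
print(late)   # ['A']
print(eager)  # ['A', 'B', 'C']
```

With the same `k=3`, the late join spends all three slots on children of `A`, while the eager join returns three distinct parents.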
   
   > Why would using KnnFloatVectorQuery require a k of 2,822 to get the right 
top-5 results, but DiversifyingChildrenFloatKnnVectorQuery required a k of 
1,812?
   
   This is of course what you found.
   
   
   > Shouldn't the same k value return the same results (assuming we manually 
join with the parent on the KnnFloatVectorQuery results?). I found 
DiversifyingChildrenFloatKnnVectorQuery required a k of 1812 and 
KnnFloatVectorQuery a k of 2822 to match the right top-5 results from 
FloatVectorSimilarityQuery.
   
   No, it shouldn't.
   
   Let me try to get to the intuition another way. 
   
   If you have a parent document with many children, that parent can 
dominate the top-k queue. When searching the graph, you basically just find 
all of its children, which kicks out other potentially relevant parents, or 
even stops the search before the graph is explored enough to find them. 
Consequently, when joining later, that parent is overrepresented, which harms 
the overall results.
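
As a rough illustration of that effect, the Python sketch below counts how large a child-level `k` has to be before a late join sees three distinct parents. It is an exact brute-force toy with invented data and hypothetical helper names (`parents_in_top_k`, `min_k_for_parents`), not Lucene's implementation. In a real graph search the situation can be worse, since exploration may stop before the other parents are ever visited.

```python
from typing import List, Optional, Tuple

# Each entry is (parent_id, similarity_score); the data is invented for illustration.
Child = Tuple[str, float]


def parents_in_top_k(children: List[Child], k: int) -> List[str]:
    """Distinct parents among the k highest-scoring children (a late join)."""
    top = sorted(children, key=lambda c: c[1], reverse=True)[:k]
    seen: List[str] = []
    for parent_id, _score in top:
        if parent_id not in seen:
            seen.append(parent_id)
    return seen


def min_k_for_parents(children: List[Child], wanted: int) -> Optional[int]:
    """Smallest child-level k whose late join yields `wanted` distinct parents."""
    for k in range(1, len(children) + 1):
        if len(parents_in_top_k(children, k)) >= wanted:
            return k
    return None


# Parent A has 50 children, all scoring above the single children of B and C.
children = [("A", 0.99 - 0.001 * i) for i in range(50)]
children += [("B", 0.90), ("C", 0.85)]

needed = min_k_for_parents(children, wanted=3)
print(needed)  # 52
```

Here a parent-level `k` of 3 would be enough, but the child-level `k` has to reach 52 before `B` and `C` both show up.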


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

