benwtrent commented on issue #13611: URL: https://github.com/apache/lucene/issues/13611#issuecomment-3357213692
> seeing different results when using KnnFloatVectorQuery vs DiversifyingChildrenFloatKnnVectorQuery seems odd.

I don't think seeing differences is that odd.

`DiversifyingChildrenFloatKnnVectorQuery` joins to parent documents EAGERLY, meaning the `k` you provide doesn't refer to `k` child matches, but to `k` parent docs. So we keep searching as long as we keep finding a better `kth` nearest parent document.

When using `KnnFloatVectorQuery`, we stop searching once we stop finding child documents that are nearer than the `kth` child document. You then join back to the parent docs afterwards.

Intuitively, this means `KnnFloatVectorQuery` should require a much higher `k` to achieve quality similar to `DiversifyingChildrenFloatKnnVectorQuery`, because it's doing the join later.

> Why would using KnnFloatVectorQuery require a k of 2,822 to get the right top-5 results, but DiversifyingChildrenFloatKnnVectorQuery required a k of 1,812?

This is of course what you found.

> Shouldn't the same k value return the same results (assuming we manually join with the parent on the KnnFloatVectorQuery results?). I found DiversifyingChildrenFloatKnnVectorQuery required a k of 1812 and KnnFloatVectorQuery a k of 2822 to match the right top-5 results from FloatVectorSimilarityQuery.

No, it shouldn't. Let me try to get to the intuition another way.

If you have a parent document with many children, that parent dominates the top-k queue. When searching the graph, you basically just find all of its children, kicking out other potentially relevant parents, or not even exploring the graph enough to find them. Consequently, when you join later, that parent is overrepresented, which harms the overall results.
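To make the intuition concrete, here is a toy sketch in plain Java. It is not Lucene code and does not model HNSW graph traversal or early termination; it brute-forces a tiny, made-up set of scored children to show only the effect of when the parent join happens. The class `JoinTimingSketch`, the `Child` type, the scores, and the `lateJoin`/`eagerJoin` helpers are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only: brute force over made-up scores, not Lucene's implementation.
class JoinTimingSketch {

    /** Hypothetical child hit: the parent it belongs to and its similarity score. */
    static final class Child {
        final int parentId;
        final float score;

        Child(int parentId, float score) {
            this.parentId = parentId;
            this.score = score;
        }
    }

    /** Late join: take the top-k children first, then collapse them to distinct parents. */
    static List<Integer> lateJoin(List<Child> children, int k) {
        List<Child> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingDouble((Child c) -> c.score).reversed());
        Set<Integer> parents = new LinkedHashSet<>();
        for (Child c : sorted.subList(0, Math.min(k, sorted.size()))) {
            parents.add(c.parentId);
        }
        return new ArrayList<>(parents);
    }

    /** Eager join: keep only the best child per parent, then take the top-k parents. */
    static List<Integer> eagerJoin(List<Child> children, int k) {
        Map<Integer, Float> bestPerParent = new LinkedHashMap<>();
        for (Child c : children) {
            bestPerParent.merge(c.parentId, c.score, Float::max);
        }
        List<Map.Entry<Integer, Float>> parents = new ArrayList<>(bestPerParent.entrySet());
        parents.sort(Map.Entry.<Integer, Float>comparingByValue().reversed());
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Float> e : parents.subList(0, Math.min(k, parents.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    /** Made-up data: parent 1 has four strong children, parents 2 to 4 have one child each. */
    static List<Child> sample() {
        List<Child> children = new ArrayList<>();
        children.add(new Child(1, 0.99f));
        children.add(new Child(1, 0.98f));
        children.add(new Child(1, 0.97f));
        children.add(new Child(1, 0.96f));
        children.add(new Child(2, 0.90f));
        children.add(new Child(3, 0.85f));
        children.add(new Child(4, 0.80f));
        return children;
    }

    public static void main(String[] args) {
        List<Child> children = sample();
        System.out.println(lateJoin(children, 3));  // prints [1]
        System.out.println(eagerJoin(children, 3)); // prints [1, 2, 3]
        System.out.println(lateJoin(children, 6));  // prints [1, 2, 3]
    }
}
```

In this made-up data, the late join with `k = 3` returns a single parent because parent 1's children fill the whole queue, and it only recovers the same three parents as the eager join once `k` is raised to 6. The real queries differ in many details, but this is the same direction of effect as the 2,822 vs 1,812 numbers above.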
