Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15800

> One way to look at it is that (a) will contain many duplicates in the L sets of points, so (b) is more likely to have higher precision and recall.

I think this might be where we are not on the same page. I consider the output of (a)/(b) to be our "probing sequence" (or "probing buckets"), and in the next step we pick and return the k nearest keys from those buckets. Do you agree with this part?

If you agree, then I claim that more duplicates (it is really redundancy rather than duplication) give us a better chance of finding the correct k nearest neighbors, because they enlarge our search range. If you disagree, then I think we are not discussing the same NN search implementation (one that differs from the current implementation), and I would like to know how you would return the k nearest neighbors after (b).
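To make the strategy I have in mind concrete, here is a minimal, hypothetical sketch (not Spark's actual API; the hash functions, table layout, and names are all illustrative assumptions): union the query's L buckets into one candidate set, the "probing sequence", then rank the candidates by true distance and return the k nearest.

```python
# Hypothetical sketch of the k-NN step described above, NOT Spark's
# implementation: union the L probed buckets, then return the k nearest
# candidates by exact distance. Redundancy across tables widens the
# search range, which is the point argued in the comment.
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the illustration is reproducible

def make_hash_fn(dim, w=1.0):
    """One random-projection (p-stable) hash: floor((a . x + b) / w)."""
    a = [random.gauss(0, 1) for _ in range(dim)]
    b = random.uniform(0, w)
    return lambda x: int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)

def build_tables(points, hash_fns):
    """Build L hash tables mapping bucket id -> list of point indices."""
    tables = [defaultdict(list) for _ in hash_fns]
    for idx, p in enumerate(points):
        for table, h in zip(tables, hash_fns):
            table[h(p)].append(idx)
    return tables

def approx_knn(query, points, tables, hash_fns, k):
    # Union of the query's L buckets: duplicates across tables collapse
    # in the set, but probing L tables enlarges the candidate pool.
    candidates = set()
    for table, h in zip(tables, hash_fns):
        candidates.update(table[h(query)])
    # Rank the candidates by exact squared distance; keep the k nearest.
    dist = lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query))
    return sorted(candidates, key=dist)[:k]
```

Under this reading, more redundancy in step (a)/(b) only grows the `candidates` set that the final top-k selection scans, so it can improve recall at the cost of scanning more points.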