GitHub user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965

@merlintang
(1) `hashDistance` is only used for multi-probe NN search. The terms `numHashTables` and `numHashFunctions` are very hard to interpret in the OR-AND case.
(2) For similarity join, we actually explode first and then join. The join key would be of vector type.
(3) Yes. However, to get rows using hashes in the OR-AND case, we need to intersect large sets of rows. In the AND-OR case, we instead take the union of small sets of rows, which is more efficient.

I also suggest we limit the scope of this PR to the implementation of AND-amplification. We can open other tickets to discuss memory issues, etc.
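To make point (3) concrete, here is a minimal Python sketch of AND-OR amplification. It is not the code in this PR: the random-hyperplane sign hash and the helper names (`make_hash_tables`, `signatures`, `build_index`, `candidates`) are assumptions chosen for illustration. Only the roles of `numHashTables` and `numHashFunctions` follow the discussion above.

```python
import random


def make_hash_tables(dim, num_hash_tables, num_hash_functions, seed=0):
    """Draw num_hash_tables groups of num_hash_functions random hyperplanes.

    Assumption: a sign-of-projection hash family, picked only for brevity.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)]
         for _ in range(num_hash_functions)]
        for _ in range(num_hash_tables)
    ]


def signatures(tables, vec):
    """Return one compound key per table.

    AND step: a key is the tuple of all hash values in that table, so two
    vectors share a bucket only if every hash function in the table agrees.
    """
    return [
        tuple(
            1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
            for plane in table
        )
        for table in tables
    ]


def build_index(tables, rows):
    """Map each table's compound key to the set of row ids in that bucket."""
    index = [dict() for _ in tables]
    for row_id, vec in rows.items():
        for t, key in enumerate(signatures(tables, vec)):
            index[t].setdefault(key, set()).add(row_id)
    return index


def candidates(tables, index, query):
    """OR step: union of the (typically small) matching buckets across tables."""
    result = set()
    for t, key in enumerate(signatures(tables, query)):
        result |= index[t].get(key, set())
    return result
```

Each table contributes one bucket lookup, and each bucket has already been narrowed by the AND over its hash functions, so the final step is a union of small sets. Under OR-AND, the per-function OR sets would be built first and then intersected, which is the intersection over large sets mentioned in (3).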