Github user Yunni commented on the issue: https://github.com/apache/spark/pull/16965

@merlintang Sorry, I still don't quite see why we need to support OR-AND when the effective threshold is low. My understanding is that we can always tune `numHashTables` and `numHashFunctions` for AND-OR to make the collision probability as good as with OR-AND. Please correct me if I am wrong.

My concerns about supporting OR-AND are as follows:

(1) We would probably need some backward-incompatible API changes. `Array[Vector]`, `numHashTables`, and `numHashFunctions` seem to make less sense for OR-AND.

(2) To avoid a broadcast join, we would need a very different and more complicated mechanism for the join step in `approxSimilarityJoin` for OR-AND.

(3) I am thinking about building an index to improve performance for nearest neighbor search (https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit). Supporting OR-AND would make the index less efficient when we fetch records for given hash buckets.

@jkbradley @sethah @MLnick Any thoughts?
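
To make the AND-OR versus OR-AND comparison above concrete, here is a minimal sketch of the standard amplification formulas. It assumes `p` is the probability that a single hash function collides for a given pair and that the hash functions are independent. The function and parameter names are hypothetical and are not part of the Spark API; mapping `num_hash_functions` and `num_hash_tables` onto the parameters discussed in this PR is an assumption.

```python
def and_or_probability(p: float, num_hash_functions: int, num_hash_tables: int) -> float:
    """AND within each table, then OR across tables.

    A pair becomes a candidate if, in at least one table, all of that
    table's hash functions collide: 1 - (1 - p^k)^L.
    """
    return 1.0 - (1.0 - p ** num_hash_functions) ** num_hash_tables


def or_and_probability(p: float, group_size: int, num_groups: int) -> float:
    """OR within each group, then AND across groups.

    A pair becomes a candidate if, in every group, at least one hash
    function collides: (1 - (1 - p)^L)^k.
    """
    return (1.0 - (1.0 - p) ** group_size) ** num_groups


def collision_curve(scheme, a: int, b: int, steps: int = 10):
    """Evaluate a scheme at evenly spaced values of p in [0, 1]."""
    return [(i / steps, scheme(i / steps, a, b)) for i in range(steps + 1)]
```

Comparing, say, `collision_curve(and_or_probability, 2, 3)` with `collision_curve(or_and_probability, 3, 2)` shows how the two curves differ for the same number of hash functions; sweeping the two parameters of the AND-OR scheme is one way to check how closely it can approximate a given OR-AND curve.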