GitHub user Yunni commented on the issue:
https://github.com/apache/spark/pull/16965
@merlintang Sorry, I still don't quite get why we need to support OR-AND
when the effective threshold is low. My understanding is that we can always
tune numHashTables and numHashFunctions for AND-OR to make the collision
probability as good as with OR-AND. Please correct me if I am wrong.
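To make the comparison concrete, here is a minimal sketch of the standard amplification formulas. It assumes `p` is the probability that a single hash function collides for a given pair, that `k` plays the role of numHashFunctions and `L` the role of numHashTables, and that "AND-OR" means AND within a table followed by OR across tables; the function names `and_or` and `or_and` are illustrative only and are not part of any Spark API.

```python
def and_or(p: float, k: int, L: int) -> float:
    """AND over k hash functions inside a table, then OR across L tables."""
    return 1.0 - (1.0 - p ** k) ** L


def or_and(p: float, k: int, L: int) -> float:
    """OR over k hash functions inside a group, then AND across L groups."""
    return (1.0 - (1.0 - p) ** k) ** L


if __name__ == "__main__":
    # Compare both schemes at a few single-hash collision probabilities.
    for p in (0.2, 0.5, 0.8):
        print(f"p={p:.1f}  AND-OR={and_or(p, 3, 5):.4f}  OR-AND={or_and(p, 3, 5):.4f}")
```

Scanning `k` and `L` with these two functions is one way to check whether some AND-OR setting reaches the collision probability that a given OR-AND setting would give at the threshold of interest.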
My concerns about supporting OR-AND are as follows:
(1) We would probably need some backward-incompatible API changes.
`Array[Vector]`, numHashTables, and numHashFunctions seem to make less sense
for OR-AND.
(2) To avoid a broadcast join, we would need a very different and more
complicated mechanism for the join step in approxSimilarityJoin for OR-AND.
(3) I am thinking about building an index to improve performance for nearest
neighbor search
(https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit).
Supporting OR-AND would make the index less efficient when we retrieve records
for given hash buckets.
@jkbradley @sethah @MLnick Any thoughts?