Github user Yunni commented on the issue:

    https://github.com/apache/spark/pull/16965
  
    @merlintang Sorry, I still don't quite see why we need to support OR-AND 
when the effective threshold is low. My understanding is that we can always 
tune numHashTables and numHashFunctions for AND-OR to make the probability as 
good as with OR-AND. Please correct me if I am wrong.
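
    For reference, a minimal sketch of the two amplification schemes using the 
standard LSH formulas, so the curves can be compared for concrete settings. 
It assumes `p` is the probability that a single hash function collides for a 
pair, `k` plays the role of numHashFunctions, and `L` plays the role of 
numHashTables; the function names are illustrative and not part of any Spark API.

```python
def and_or(p: float, k: int, L: int) -> float:
    """AND over the k functions inside a table, then OR across the L tables."""
    return 1.0 - (1.0 - p ** k) ** L


def or_and(p: float, k: int, L: int) -> float:
    """OR over the k functions inside a group, then AND across the L groups."""
    return (1.0 - (1.0 - p) ** k) ** L


if __name__ == "__main__":
    # Print both curves side by side for a few single-function collision
    # probabilities. The (k, L) values here are arbitrary examples.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"p={p:.1f}  AND-OR(k=2, L=8)={and_or(p, 2, 8):.4f}  "
              f"OR-AND(k=8, L=2)={or_and(p, 8, 2):.4f}")
```

    Sweeping `k` and `L` for each scheme is one way to check whether an AND-OR 
setting can match a given OR-AND curve around a low threshold.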
    
    My concerns about supporting OR-AND are the following:
    (1) We would probably need some backward-incompatible API changes. 
`Array[Vector]`, numHashTables, and numHashFunctions seem to make less sense for 
OR-AND.
    (2) To avoid a broadcast join, we would need a very different and more 
complicated mechanism for the join step in approxSimilarityJoin for OR-AND.
    (3) I am thinking about building an index to improve performance for nearest 
neighbor search 
(https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit).
 Supporting OR-AND would make the index less efficient when we retrieve records 
for given hash buckets.
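
    To illustrate concerns (2) and (3), here is a small pure-Python sketch, not 
Spark code. It assumes each record has a signature of L tuples with k hash 
values each; all names are hypothetical. Under AND-OR, candidates fall out of 
an equality lookup on the key (table index, hash tuple), which is the shape 
that an equi-join or a bucket index can serve. Under OR-AND, a pair must 
agree on at least one function in every group, so no single key captures the 
match; the version below simply compares all pairs to show the predicate.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Signature = List[Tuple[int, ...]]  # L tuples, each holding k hash values


def and_or_candidates(sigs: Dict[str, Signature]) -> Set[Tuple[str, str]]:
    """Pairs that share the full hash tuple in at least one table."""
    buckets = defaultdict(list)
    for rid, sig in sigs.items():
        for table_idx, values in enumerate(sig):
            # One bucket key per (table, full tuple): an equality lookup.
            buckets[(table_idx, values)].append(rid)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def or_and_match(a: Signature, b: Signature) -> bool:
    """True if, in every group, at least one hash position agrees."""
    return all(any(x == y for x, y in zip(ga, gb)) for ga, gb in zip(a, b))


def or_and_candidates(sigs: Dict[str, Signature]) -> Set[Tuple[str, str]]:
    """Brute-force pairwise check, used here only to show the predicate."""
    return {(a, b) for a, b in combinations(sorted(sigs), 2)
            if or_and_match(sigs[a], sigs[b])}
```

    The brute-force OR-AND version is only for illustration; a scalable join 
would need a different mechanism, which is the point of concern (2).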
    
    @jkbradley @sethah @MLnick Any thoughts?

