[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893508#comment-15893508
 ] 

Yun Ni commented on SPARK-19771:
--------------------------------

[~merlin] What you are suggesting is to hash each AND hash vector into a single 
integer, which I don't think makes sense. It does little to improve running 
time, since SparkSQL does a hash join and the chance of a vector comparison is 
already close to minimal. It reduces the memory cost of each transformed row from 
O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of 
increasing the false positive rate, especially when NumHashFunctions is large.

From a user experience perspective, hiding the actual hash values from users is 
bad practice because users need to run their own algorithms based on the 
hash values. Besides that, we expect users to increase the number of hash 
functions when they want to lower the false positive rate. Hashing the vector 
would raise the false positive rate again, which users would not expect.
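
To make the trade-off concrete, here is a minimal sketch in plain Python. It is
not Spark's implementation: the names `collapse`, `find_collision`, and
`stored_values_per_row`, as well as the folding scheme itself, are assumptions
chosen only for illustration. It shows the per-row storage going from
NumHashFunctions*NumHashTables values to NumHashTables values, and how two
different AND hash vectors can fold to the same integer.

```python
from typing import List, Optional, Tuple

HashVector = Tuple[int, ...]


def collapse(hash_vector: HashVector, num_buckets: int) -> int:
    """Fold one AND hash vector into a single integer (hypothetical scheme)."""
    acc = 17
    for h in hash_vector:
        acc = (acc * 31 + h) % num_buckets
    return acc


def stored_values_per_row(num_hash_tables: int,
                          num_hash_functions: int,
                          collapsed: bool) -> int:
    """Number of hash values kept for each transformed row."""
    if collapsed:
        return num_hash_tables
    return num_hash_tables * num_hash_functions


def find_collision(vectors: List[HashVector],
                   num_buckets: int) -> Optional[Tuple[HashVector, HashVector]]:
    """Return two distinct vectors that fold to the same integer, if any."""
    seen = {}
    for v in vectors:
        key = collapse(v, num_buckets)
        if key in seen and seen[key] != v:
            return seen[key], v
        seen.setdefault(key, v)
    return None


if __name__ == "__main__":
    print(stored_values_per_row(3, 5, collapsed=False))
    print(stored_values_per_row(3, 5, collapsed=True))
    # Nine distinct vectors folded into eight buckets must collide somewhere.
    vectors = [(i, i + 1, i + 2) for i in range(9)]
    print(find_collision(vectors, num_buckets=8))
```

The bucket count is kept tiny here so that a collision is guaranteed. With a
realistic integer range collisions are far rarer, but any such collision makes
two rows look like matches even though their hash vectors differ.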

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
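
For reference, the two amplification schemes named in the issue differ in the
order in which individual hash collisions are combined. The following is a
minimal sketch, assuming independent hash functions that each collide with
probability p for a given pair of points. The function and parameter names are
hypothetical and are not part of Spark's API.

```python
def and_or_probability(p: float, funcs_per_table: int, num_tables: int) -> float:
    """AND within each table, then OR across tables.

    A pair becomes a candidate if, in at least one table, all of that
    table's hash functions collide.
    """
    return 1.0 - (1.0 - p ** funcs_per_table) ** num_tables


def or_and_probability(p: float, funcs_per_group: int, num_groups: int) -> float:
    """OR within each group, then AND across groups.

    A pair becomes a candidate if, in every group, at least one of that
    group's hash functions collides.
    """
    return (1.0 - (1.0 - p) ** funcs_per_group) ** num_groups


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, and_or_probability(p, 2, 2), or_and_probability(p, 2, 2))
```

Under the independence assumption, both schemes push the candidate probability
toward 0 for low p and toward 1 for high p, but they do so with differently
shaped curves for the same number of hash functions.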



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

