[
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893508#comment-15893508
]
Yun Ni commented on SPARK-19771:
--------------------------------
[~merlin] What you are suggesting is to hash each AND hash vector into a single
integer, which I don't think makes sense. It does little to improve running
time, since SparkSQL does a hash join and the chance of a full vector comparison
is already close to minimal. It reduces the memory cost of each transformed row
from O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of
increasing the false positive rate, especially when NumHashFunctions is large.
From a user-experience perspective, hiding the actual hash values from users is
bad practice because users need to run their own algorithms based on the
hash values. Besides that, we expect users to increase the number of hash
functions when they want to lower the false positive rate. Hashing the vector
would raise the false positive rate again, which users would not expect.
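The trade-off described above can be illustrated with a minimal Python sketch. It is not Spark's implementation: the function names (`collapse`, `collapse_row`) and the polynomial folding scheme are hypothetical, chosen only to show that collapsing each AND hash vector shrinks a row from NumHashFunctions*NumHashTables stored values to NumHashTables, while allowing distinct vectors to map to the same integer.

```python
from typing import List, Tuple

# One AND hash vector: NumHashFunctions integer hash values (assumed layout).
HashVector = Tuple[int, ...]


def collapse(vec: HashVector, modulus: int = 2**31 - 1) -> int:
    """Fold one AND hash vector into a single integer.

    Hypothetical polynomial fold, for illustration only.
    """
    acc = 17
    for v in vec:
        acc = (acc * 31 + v) % modulus
    return acc


def collapse_row(row: List[HashVector], modulus: int = 2**31 - 1) -> List[int]:
    """Collapse every hash table's vector in a transformed row."""
    return [collapse(vec, modulus) for vec in row]


# A row with NumHashTables = 3 and NumHashFunctions = 4.
row = [(5, 1, 9, 2), (7, 7, 0, 3), (4, 8, 6, 1)]
values_before = sum(len(vec) for vec in row)   # 3 * 4 stored values
values_after = len(collapse_row(row))          # 3 stored values

# Two different vectors that fold to the same integer. A deliberately tiny
# modulus makes the collision easy to see; a larger range only makes it rarer.
a, b = (1, 2), (1, 9)
collide = collapse(a, modulus=7) == collapse(b, modulus=7)
```

Once collapsed, the original per-function hash values can no longer be recovered from the row, which is the usability concern raised in the comment.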
> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to
> discuss the following questions before we move on to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
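For readers comparing the two amplification schemes named in the issue, here is a rough Python sketch of the standard textbook collision probabilities. The parameter names `k` and `L` are assumptions introduced for illustration (hash functions combined within a group, and number of groups); they are not taken from the Spark API, and the formulas assume the individual hash functions collide independently with probability `p`.

```python
def and_or_probability(p: float, k: int, L: int) -> float:
    """AND-OR: AND k functions within each group, then OR across L groups.

    A pair is a candidate if at least one group matches on all k functions.
    """
    return 1.0 - (1.0 - p**k) ** L


def or_and_probability(p: float, k: int, L: int) -> float:
    """OR-AND: OR k functions within each group, then AND across L groups.

    A pair is a candidate only if every group matches on at least one function.
    """
    return (1.0 - (1.0 - p) ** k) ** L


# With p = 0.5 and k = L = 2 the two schemes already differ.
p_and_or = and_or_probability(0.5, 2, 2)
p_or_and = or_and_probability(0.5, 2, 2)
```

With `k = 1` or `L = 1` each scheme reduces to a plain OR or a plain AND construction, so the functions only diverge when both parameters exceed 1.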
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)