[
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893117#comment-15893117
]
Mingjie Tang commented on SPARK-19771:
--------------------------------------
(1) Because you need to explode each tuple. In the example mentioned above,
one input tuple produces 3 rows, and each hash value is a vector whose length
is the number of hash functions. Thus, for one tuple, the memory overhead is
NumHashFunctions*NumHashTables=15. If the number of input tuples is N, the
overhead is NumHashFunctions*NumHashTables*N.
(2) Yes, the hash value can be anything, depending on your input bucket width
W. Actually, it should be very large to reduce collisions.
(3) I am not sure hashCode can work, because we need to use this function
for multi-probe searching.
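
As a rough illustration of the arithmetic in (1), here is a minimal Python
sketch. The helper name is hypothetical, and the value of 5 hash functions is
an assumption chosen only so that the product with the 3 rows from the example
matches the 15 quoted above.

```python
def exploded_hash_entries(num_hash_functions, num_hash_tables, num_tuples=1):
    """Count the hash values held after exploding every input tuple.

    Hypothetical helper for illustration only, not part of Spark.
    """
    rows_per_tuple = num_hash_tables      # explode: one row per hash table
    values_per_row = num_hash_functions   # each row carries one hash vector
    return rows_per_tuple * values_per_row * num_tuples


# 3 rows per tuple comes from the example; 5 hash functions is assumed.
print(exploded_hash_entries(num_hash_functions=5, num_hash_tables=3))
# Same settings for N = 1000 input tuples.
print(exploded_hash_entries(5, 3, num_tuples=1000))
```

The count grows linearly in N, which is why the explode step dominates memory
for large inputs.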
> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to
> discuss the following questions before we move on to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
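
For readers comparing the two schemes named in the issue, the sketch below
computes the collision probability under each, using the common textbook
convention: "AND-OR" applies AND within a group of k hash functions and then OR
across L groups, and "OR-AND" does the reverse. The function names and the
parameters p, k, and L are assumptions for illustration and are not Spark API.

```python
def and_or_probability(p, k, L):
    """AND over k functions per group, then OR over L groups."""
    return 1.0 - (1.0 - p ** k) ** L


def or_and_probability(p, k, L):
    """OR over k functions per group, then AND over L groups."""
    return (1.0 - (1.0 - p) ** k) ** L


# p is the probability that a single hash function collides for a pair.
for p in (0.2, 0.5, 0.8):
    print(p, and_or_probability(p, k=2, L=2), or_and_probability(p, k=2, L=2))
```

With k = 1 and L = 1 both functions reduce to p, so any difference between the
schemes comes entirely from how the hash functions are grouped.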