[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318 ]

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:
---------------------------------------------------------------

I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two 
types at least. One is to calculate hash value. The other is to calculate a 
similarity between a feature(vector) and another one. 

For example, random projection algorithm is a type of calculating a similarity. 
It is designed to approximate the cosine distance between vectors. On the other 
hand, min hash algorithm is a type of calculating a hash value. The hash 
function maps a d dimensional vector onto a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a Bitset) 
for each feature vector, and the Hamming distance between two hash values is 
used to estimate the cosine similarity between the corresponding feature 
vectors. The two "types" of LSH mentioned here seem more like two kinds of 
operations which are sometimes applied sequentially. Maybe this distinction 
makes more sense for other types of LSH?
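
To make the distinction concrete, here is a rough, untested sketch of sign random projection (names and structure are purely illustrative, not a proposed MLlib API, and the signature is an Array[Boolean] rather than a real BitSet for brevity). It shows the two operations as separate steps applied in sequence: first hash each vector to a bit signature, then estimate cosine similarity from the Hamming distance between two signatures:
{code:scala}
import scala.util.Random

object SignRandomProjectionSketch {

  // Draw `numBits` random hyperplanes (Gaussian entries) for vectors of dimension `dim`.
  def randomHyperplanes(dim: Int, numBits: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))
  }

  // Operation 1: hash a feature vector to a bit signature (one sign bit per hyperplane).
  def hash(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
    planes.map { p =>
      val dot = p.zip(v).map { case (pi, vi) => pi * vi }.sum
      dot >= 0.0
    }

  // Operation 2: estimate cosine similarity from the Hamming distance between signatures.
  // For sign random projection, P[bits agree] = 1 - theta/pi, so theta ~= pi * hamming / numBits.
  def estimateCosine(a: Array[Boolean], b: Array[Boolean]): Double = {
    val hamming = a.zip(b).count { case (x, y) => x != y }
    math.cos(math.Pi * hamming.toDouble / a.length)
  }

  def main(args: Array[String]): Unit = {
    val planes = randomHyperplanes(dim = 3, numBits = 1024, seed = 42L)
    val u = Array(1.0, 2.0, 3.0)
    val w = Array(1.0, 2.0, 2.5)
    // Should land close to the exact cosine similarity of u and w.
    println(estimateCosine(hash(u, planes), hash(w, planes)))
  }
}
{code}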


> Locality Sensitive Hashing (LSH) for MLlib
> ------------------------------------------
>
>                 Key: SPARK-5992
>                 URL: https://issues.apache.org/jira/browse/SPARK-5992
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.


