[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637337#comment-14637337 ]
Maruf Aytekin edited comment on SPARK-5992 at 7/31/15 11:33 AM: ---------------------------------------------------------------- In addition to Charikar's scheme for cosine [~karlhigley] pointed out, LSH schemes for the other known similarity/distance measures are as follows: 1. Hamming norm: A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999). http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf 2. Lp norms: M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf http://people.csail.mit.edu/indyk/nips-nn.ps -3. Jaccard distance:- -Mining Massive Data Sets chapter#3- _spark-hash package referenced above already implements this_ 4. Cosine distance and Earth movers distance (EMD): M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002). http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf was (Author: cepnyci): In addition to Charikar's scheme for cosine [~karlhigley] pointed out, LSH schemes for the other known similarity/distance measures are as follows: 1. Hamming norm: A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. of the 25th Intl. Conf. on Very Large Data Bases, VLDB(1999). http://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf 2. Lp norms: M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proc. of the 20th ACM Annual http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf http://people.csail.mit.edu/indyk/nips-nn.ps 3. Jaccard distance: Mining Massive Data Sets chapter#3: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf 4. Cosine distance and Earth movers distance (EMD): M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proc. of the 34th Annual ACM Symposium on Theory of Computing, STOC (2002). http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf > Locality Sensitive Hashing (LSH) for MLlib > ------------------------------------------ > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib > Affects Versions: 1.4.0 > Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org