[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

Xiangrui Meng (JIRA) Wed, 27 Aug 2014 02:42:05 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiangrui Meng updated SPARK-2885:
---------------------------------

    Target Version/s: 1.2.0

> All-pairs similarity via DIMSUM
> -------------------------------
>
>                 Key: SPARK-2885
>                 URL: https://issues.apache.org/jira/browse/SPARK-2885
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Reza Zadeh
>            Assignee: Reza Zadeh
>
> Build all-pairs similarity algorithm via DIMSUM. 
> Given a dataset of sparse vector data, the all-pairs similarity problem is to 
> find all similar vector pairs according to a similarity function such as 
> cosine similarity, and a given similarity score threshold. Sometimes, this 
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it 
> scales quadratically. For example, for a million vectors, it is not feasible 
> to check all roughly trillion pairs to see if they are above the similarity 
> threshold. Having said that, there exist clever sampling techniques to focus 
> the computational effort on those pairs that are above the similarity 
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
> computation vs accuracy from low computation to high accuracy. For a very 
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several 
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see 
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
>  . Twitter also open-sourced their version in scalding: 
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the 
> normalized gramian matrix, which is useful in RowMatrix alongside the 
> computeGramianMatrix method already in RowMatrix



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

Reply via email to