[
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng updated SPARK-2885:
---------------------------------
Target Version/s: 1.2.0
> All-pairs similarity via DIMSUM
> -------------------------------
>
> Key: SPARK-2885
> URL: https://issues.apache.org/jira/browse/SPARK-2885
> Project: Spark
> Issue Type: New Feature
> Reporter: Reza Zadeh
> Assignee: Reza Zadeh
>
> Build all-pairs similarity algorithm via DIMSUM.
> Given a dataset of sparse vector data, the all-pairs similarity problem is to
> find all similar vector pairs according to a similarity function such as
> cosine similarity, and a given similarity score threshold. Sometimes, this
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it
> scales quadratically. For example, for a million vectors, it is not feasible
> to check all roughly trillion pairs to see if they are above the similarity
> threshold. Having said that, there exist clever sampling techniques to focus
> the computational effort on those pairs that are above the similarity
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of
> computation vs accuracy from low computation to high accuracy. For a very
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
> . Twitter also open-sourced their version in scalding:
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the
> normalized gramian matrix, which is useful in RowMatrix alongside the
> computeGramianMatrix method already in RowMatrix
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]