[
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reza Zadeh updated SPARK-2885:
------------------------------
Description:
Build all-pairs similarity algorithm via DIMSUM.
Given a dataset of sparse vector data, the all-pairs similarity problem is to
find all similar vector pairs according to a similarity function such as cosine
similarity, and a given similarity score threshold. Sometimes, this problem is
called a “similarity join”.
The brute force approach of considering all pairs quickly breaks, since it
scales quadratically. For example, for a million vectors, it is not feasible to
check all roughly trillion pairs to see if they are above the similarity
threshold. Having said that, there exist clever sampling techniques to focus
the computational effort on those pairs that are above the similarity
threshold, which makes the problem feasible.
DIMSUM has a single parameter (called gamma) to tradeoff computation time vs
accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of
computation vs accuracy from low computation to high accuracy. For a very large
gamma, all cosine similarities are computed exactly with no sampling.
Current PR:
https://github.com/apache/spark/pull/1778
Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see
https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- Algorithm is used in large-scale production at Twitter. e.g. see
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
. Twitter also open-sourced their version in scalding:
https://github.com/twitter/scalding/pull/833
- When used with the gamma parameter set high, this algorithm becomes the
normalized gramian matrix, which is useful in RowMatrix alongside the
computeGramianMatrix method already in RowMatrix
More details about usage at Twitter:
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
For correctness proof, see Theorem 4.3 in
http://stanford.edu/~rezab/papers/dimsum.pdf
was:
Build all-pairs similarity algorithm via DIMSUM.
Given a dataset of sparse vector data, the all-pairs similarity problem is to
find all similar vector pairs according to a similarity function such as cosine
similarity, and a given similarity score threshold. Sometimes, this problem is
called a “similarity join”.
The brute force approach of considering all pairs quickly breaks, since it
scales quadratically. For example, for a million vectors, it is not feasible to
check all roughly trillion pairs to see if they are above the similarity
threshold. Having said that, there exist clever sampling techniques to focus
the computational effort on those pairs that are above the similarity
threshold, which makes the problem feasible.
DIMSUM has a single parameter (called gamma) to tradeoff computation time vs
accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of
computation vs accuracy from low computation to high accuracy. For a very large
gamma, all cosine similarities are computed exactly with no sampling.
Current PR:
https://github.com/apache/spark/pull/1778
Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see
https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- Algorithm is used in large-scale production at Twitter. e.g. see
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
. Twitter also open-sourced their version in scalding:
https://github.com/twitter/scalding/pull/833
- When used with the gamma parameter set high, this algorithm becomes the
normalized gramian matrix, which is useful in RowMatrix alongside the
computeGramianMatrix method already in RowMatrix
> All-pairs similarity via DIMSUM
> -------------------------------
>
> Key: SPARK-2885
> URL: https://issues.apache.org/jira/browse/SPARK-2885
> Project: Spark
> Issue Type: New Feature
> Reporter: Reza Zadeh
> Assignee: Reza Zadeh
>
> Build all-pairs similarity algorithm via DIMSUM.
> Given a dataset of sparse vector data, the all-pairs similarity problem is to
> find all similar vector pairs according to a similarity function such as
> cosine similarity, and a given similarity score threshold. Sometimes, this
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it
> scales quadratically. For example, for a million vectors, it is not feasible
> to check all roughly trillion pairs to see if they are above the similarity
> threshold. Having said that, there exist clever sampling techniques to focus
> the computational effort on those pairs that are above the similarity
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of
> computation vs accuracy from low computation to high accuracy. For a very
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
> . Twitter also open-sourced their version in scalding:
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the
> normalized gramian matrix, which is useful in RowMatrix alongside the
> computeGramianMatrix method already in RowMatrix
> More details about usage at Twitter:
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
> For correctness proof, see Theorem 4.3 in
> http://stanford.edu/~rezab/papers/dimsum.pdf
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]