Reza Zadeh created SPARK-2885:
---------------------------------
Summary: All-pairs similarity via DIMSUM
Key: SPARK-2885
URL: https://issues.apache.org/jira/browse/SPARK-2885
Project: Spark
Issue Type: New Feature
Reporter: Reza Zadeh
Build an all-pairs similarity algorithm via DIMSUM.
Given a dataset of sparse vectors, the all-pairs similarity problem is to
find all pairs of vectors whose similarity, under a function such as cosine
similarity, is above a given score threshold. This problem is sometimes
called a “similarity join”.
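To make the problem statement concrete, here is a minimal single-machine sketch of the exhaustive version in Python. It is only an illustration, not code from the PR: the dict-based sparse representation and the names `cosine` and `similar_pairs` are assumptions made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(val * v[idx] for idx, val in u.items() if idx in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return (i, j, score) for every pair i < j whose score is above threshold."""
    result = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score > threshold:
            result.append((i, j, score))
    return result


# Hypothetical toy data: four sparse vectors over three dimensions.
vectors = [
    {0: 1.0, 2: 2.0},
    {0: 2.0, 2: 4.0},
    {1: 3.0},
    {0: 1.0, 1: 1.0},
]
pairs = similar_pairs(vectors, threshold=0.5)
```

With this toy data, only the pairs (0, 1) and (2, 3) score above 0.5. The loop visits every one of the n(n-1)/2 pairs regardless of how few of them pass the threshold.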
The brute-force approach of considering all pairs quickly breaks down, since it
scales quadratically. For example, for a million vectors, it is not feasible to
check all of the roughly half a trillion distinct pairs to see whether they are
above the similarity threshold. However, there are sampling techniques that
focus the computational effort on those pairs that are above the similarity
threshold, which makes the problem feasible.
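One way such sampling can work is sketched below, in the spirit of DIMSUM but not taken from the PR. It assumes the vectors being compared are the columns of a sparse matrix whose rows are given as {column: value} dicts, and it keeps each entry with a probability that shrinks as its column norm grows. The function name `sampled_cosine` and the oversampling parameter `gamma` are hypothetical names chosen for this example.

```python
import math
import random
from collections import defaultdict


def sampled_cosine(rows, gamma, seed=0):
    """Estimate cosine similarity between columns of a sparse matrix.

    rows:  list of {column: value} dicts, one per matrix row.
    gamma: oversampling parameter (assumed name); larger values sample
           more entries and give more accurate estimates.
    """
    rng = random.Random(seed)

    # First pass: column norms.
    sq = defaultdict(float)
    for row in rows:
        for col, val in row.items():
            if val != 0.0:
                sq[col] += val * val
    norm = {c: math.sqrt(s) for c, s in sq.items()}

    root = math.sqrt(gamma)
    keep_prob = {c: min(1.0, root / n) for c, n in norm.items()}
    scale = {c: min(root, n) for c, n in norm.items()}

    # Second pass: emit a scaled product only when both entries are kept.
    # keep_prob[c] / scale[c] == 1 / norm[c], so each estimate is unbiased.
    est = defaultdict(float)
    for row in rows:
        items = sorted((c, v) for c, v in row.items() if v != 0.0)
        for a, (j, vj) in enumerate(items):
            if rng.random() >= keep_prob[j]:
                continue
            for k, vk in items[a + 1:]:
                if rng.random() < keep_prob[k]:
                    est[(j, k)] += (vj * vk) / (scale[j] * scale[k])
    return dict(est)


# Hypothetical toy matrix: 3 rows (dimensions) by 4 columns (vectors).
rows = [
    {0: 1.0, 1: 2.0, 3: 1.0},
    {2: 3.0, 3: 1.0},
    {0: 2.0, 1: 4.0},
]
# With gamma this large every keep probability is 1, so nothing is dropped.
exact = sampled_cosine(rows, gamma=1e12)
# With a small gamma, entries of high-norm columns are often skipped.
approx = sampled_cosine(rows, gamma=1.0, seed=1)
```

The intent of the design is that entries in high-norm columns are sampled less often, so less work goes into pairs that are unlikely to reach the threshold, while the rescaling keeps each estimate correct in expectation. How `gamma` should be chosen for a given threshold is left open here.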
The current PR for this is a work in progress:
https://github.com/apache/spark/pull/1778