[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118269#comment-14118269
 ] 

Clive Cox commented on SPARK-2885:
----------------------------------

Hi.

 I am testing this and am having problems with SparseVectors and the call to 
RowMatrix.columnMagnitudes(); Its throwing Assertions when the row matrix is 
built from SparseVectors. Seems to work ok with DenseVectors. It does not do 
this everytime and I have not tracked down yet what is the particular error. 
Its happening in the Breeze code. 
 Has anyone else seen these issue?
Creating random small sets of SparseVectors I sometimes get exception like::
java.lang.AssertionError: assertion failed: 2 3 3 2 2 SparseVector((72,1.0), 
(80,1.0), (87,1.0), (2,1.0)) SparseVector((72,1.0), (80,1.0), (87,1.0), (2,1.0))
        at scala.Predef$.assert(Predef.scala:179)
        at 
breeze.linalg.operators.SparseVectorOps$$anon$164.apply(SparseVectorOps.scala:426)
        at 
breeze.linalg.operators.SparseVectorOps$$anon$164.apply(SparseVectorOps.scala:388)
        at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
        at breeze.linalg.VectorOps$$anon$14.apply(Vector.scala:250)
        at breeze.linalg.NumericOps$class.$colon$times(NumericOps.scala:44)
        at breeze.linalg.SparseVector.$colon$times(SparseVector.scala:48)
        at 
org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$columnMagnitudes$2.apply(RowMatrix.scala:405)
        at 
org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$columnMagnitudes$2.apply(RowMatrix.scala:403)


> All-pairs similarity via DIMSUM
> -------------------------------
>
>                 Key: SPARK-2885
>                 URL: https://issues.apache.org/jira/browse/SPARK-2885
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Reza Zadeh
>            Assignee: Reza Zadeh
>
> Build all-pairs similarity algorithm via DIMSUM. 
> Given a dataset of sparse vector data, the all-pairs similarity problem is to 
> find all similar vector pairs according to a similarity function such as 
> cosine similarity, and a given similarity score threshold. Sometimes, this 
> problem is called a “similarity join”.
> The brute force approach of considering all pairs quickly breaks, since it 
> scales quadratically. For example, for a million vectors, it is not feasible 
> to check all roughly trillion pairs to see if they are above the similarity 
> threshold. Having said that, there exist clever sampling techniques to focus 
> the computational effort on those pairs that are above the similarity 
> threshold, which makes the problem feasible.
> DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
> accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
> computation vs accuracy from low computation to high accuracy. For a very 
> large gamma, all cosine similarities are computed exactly with no sampling.
> Current PR:
> https://github.com/apache/spark/pull/1778
> Justification for adding to MLlib:
> - All-pairs similarity is missing from MLlib and has been requested several 
> times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
> https://github.com/apache/spark/pull/1778#issuecomment-51300825)
> - Algorithm is used in large-scale production at Twitter. e.g. see 
> https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
>  . Twitter also open-sourced their version in scalding: 
> https://github.com/twitter/scalding/pull/833
> - When used with the gamma parameter set high, this algorithm becomes the 
> normalized gramian matrix, which is useful in RowMatrix alongside the 
> computeGramianMatrix method already in RowMatrix
> More details about usage at Twitter: 
> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
> For correctness proof, see Theorem 4.3 in 
> http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to