Hi all, 
I am using Spark 1.3.1 to write a spectral clustering algorithm, and
something really confused me today. At first I thought my implementation was
wrong, but it turns out to be an issue in MLlib. Fortunately, I've figured it
out.

I suggest adding a note to the MLlib user documentation (as far as I know,
there is no such note yet) that the indices of a local SparseVector must be
in ascending order. Because I did not know this, I spent a lot of time
trying to find out why computeSVD of RowMatrix did not produce correct
results on sparse data. I don't know how a SparseVector with unordered
indices affects other functions, but I believe we should either let users
know or fix it. The fix looks easy: sort the indices (for example with
sortBy) when a SparseVector is constructed.
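
Until something like that exists, the sorting can be done on the caller's
side before the vector is built. Below is a minimal sketch of that idea in
plain Scala. The object name SortedSparse and the method sortByIndex are
made up for illustration and are not part of MLlib.

```scala
object SortedSparse {
  // Hypothetical helper, not an MLlib API.
  // Reorders (indices, values) so that the indices are ascending while each
  // value stays paired with its original index.
  def sortByIndex(indices: Array[Int],
                  values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length,
      "indices and values must have the same length")
    indices.zip(values).sortBy(_._1).unzip
  }
}

// Usage in spark-shell (assumes Vectors is imported from
// org.apache.spark.mllib.linalg):
//   val (idx, vals) = SortedSparse.sortByIndex(Array(2, 1, 0), Array(5.0, 4.0, 3.0))
//   val v = Vectors.sparse(3, idx, vals)
```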

Here is an example that reproduces the effect of an unordered SparseVector
on computeSVD.
================================================ 
// in spark-shell, Spark 1.3.1
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}

  val sparseData_ordered = Seq( 
    Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)), 
    Vectors.sparse(3, Array(0,1,2), Array(3.0, 4.0, 5.0)), 
    Vectors.sparse(3, Array(0,1,2), Array(6.0, 7.0, 8.0)), 
    Vectors.sparse(3, Array(0,2), Array(9.0, 1.0)) 
  ) 
  val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))

  val sparseData_not_ordered = Seq( 
    Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)), 
    Vectors.sparse(3, Array(2,1,0), Array(5.0,4.0,3.0)), 
    Vectors.sparse(3, Array(0,1,2), Array(6.0, 7.0, 8.0)), 
    Vectors.sparse(3, Array(2,0), Array(1.0,9.0)) 
  ) 
  val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))

// sparseMat_ordered and sparseMat_not_ordered represent the same matrix,
// yet computeSVD returns different results for the two.
// Users should be told about this.
  println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
  println("===================")
  println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
====================================================== 
The results are: 
ordered: 
[-0.10972870132786407,-0.18850811494220537] 
[-0.44712472003608356,-0.24828866611663725] 
[-0.784520738744303,-0.3080692172910691] 
[-0.4154110101064339,0.8988385762953358] 

not ordered: 
[-0.10830447119599484,-0.1559341848984378] 
[-0.4522713511277327,-0.23449829541447448] 
[-0.7962382310594706,-0.3130624059305111] 
[-0.43131320303494614,0.8453864703362308] 

Looking into this issue, I can see that the cause lies in
RowMatrix.scala (line 629). The sparse dspr implementation there requires
ordered indices because it scans the indices consecutively to skip empty
columns.
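
To show how such a routine can depend on index order, here is a simplified
sketch of a sparse rank-1 update A += alpha * x * x^T into a packed upper
triangle. It is not the Spark source. The names PackedSpr and sparseSpr are
made up, and the storage layout (upper triangle stored column by column) is
an assumption for the sake of the example.

```scala
object PackedSpr {
  // Illustrative only, not the code in RowMatrix.scala.
  // Computes alpha * x * x^T for a sparse x of dimension n and returns the
  // upper triangle packed column by column: entry (r, c) with r <= c is
  // stored at position c * (c + 1) / 2 + r.
  def sparseSpr(n: Int, alpha: Double,
                indices: Array[Int], values: Array[Double]): Array[Double] = {
    val packed = new Array[Double](n * (n + 1) / 2)
    var j = 0
    while (j < indices.length) {
      val col = indices(j)
      val colStart = col * (col + 1) / 2
      var i = 0
      while (i <= j) {
        // Relies on indices(i) <= col for all i <= j, i.e. ascending indices.
        // If that does not hold, the write lands in the wrong slot.
        packed(colStart + indices(i)) += alpha * values(j) * values(i)
        i += 1
      }
      j += 1
    }
    packed
  }
}

// The same vector (3.0, 4.0, 5.0), once with ascending and once with
// descending indices.
val ordered = PackedSpr.sparseSpr(3, 1.0, Array(0, 1, 2), Array(3.0, 4.0, 5.0))
val unordered = PackedSpr.sparseSpr(3, 1.0, Array(2, 1, 0), Array(5.0, 4.0, 3.0))
println(ordered.mkString(", "))
println(unordered.mkString(", "))
```

In this sketch the two calls describe the same vector but produce different
packed arrays, because the inner loop treats every earlier index as a row at
or above the current column.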



-----
Feel the sparking Spark!
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Indices-of-SparseVector-must-be-ordered-while-computing-SVD-tp11731.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
