zhengruifeng created SPARK-13970:
------------------------------------
Summary: Add Non-Negative Matrix Factorization to MLlib
Key: SPARK-13970
URL: https://issues.apache.org/jira/browse/SPARK-13970
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: zhengruifeng
Priority: Minor
NMF finds two non-negative matrices (W, H) whose product W * H.T approximates a non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction.
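Formally (assuming the squared Frobenius norm as the loss, which is the objective the multiplicative updates summarized below minimize), NMF solves

    min_{W >= 0, H >= 0}  || X - W * H.T ||_F^2

where X is an m x n non-negative matrix, W is m x k, H is n x k, and k is the chosen rank.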
NMF is implemented in several existing packages:
Scikit-Learn
(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
LibNMF (http://www.univie.ac.at/rlcta/software/)
I have implemented it in MLlib based on the following papers; the multiplicative update rules are summarized after the references:
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis
on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
Algorithms for Non-negative Matrix Factorization
(http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)
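For reference, the multiplicative update rules from the Lee & Seung paper, rewritten here for the W * H.T convention (* denotes element-wise multiplication, / element-wise division):

    W <- W * (X H) / (W (H.T H))
    H <- H * (X.T W) / (H (W.T W))

Starting from non-negative initial factors, these updates keep W and H non-negative and monotonically decrease the Frobenius objective.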
It can be used like this:
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
val m = 4 // number of rows of the input matrix A
val n = 3 // number of columns of the input matrix A
val indexedRows = sc.parallelize(Seq(
  (0L, Vectors.dense(0.0, 1.0, 2.0)),
  (1L, Vectors.dense(3.0, 4.0, 5.0)),
  (3L, Vectors.dense(9.0, 0.0, 1.0)) // row 2 is omitted, so it is treated as all zeros
).map(x => IndexedRow(x._1, x._2)))
val A = new IndexedRowMatrix(indexedRows).toCoordinateMatrix()
val k = 2
// run the nmf algo
val r = NMF.solve(A, k, 10)
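// r.W is the m x k (4 x 2) factor and r.H is the n x k (3 x 2) factor, so r.W * r.H.T approximates A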
val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
1.1349295096806706 1.4423101890626953E-5
3.453054133110303 0.46312492493865615
0.0 0.0
0.3133764134585149 2.70684017255672
val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4184163313845057 3.2719352525149286
1.12188012613645 0.002939823716977737
1.456499371939653 0.18992996116069297
val R = rW.multiply(rH.transpose)
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4749202332761286 1.273254903877907 1.6530268574248572
2.9601290106732367 3.8752743120480346 5.117332475154927
0.0 0.0 0.0
8.987727592773672 0.35952840319637736 0.9705425982249293
val AD = A.toBlockMatrix().toLocalMatrix()
>>> org.apache.spark.mllib.linalg.Matrix =
0.0 1.0 2.0
3.0 4.0 5.0
0.0 0.0 0.0
9.0 0.0 1.0
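// accumulate the squared Frobenius norm of the reconstruction error A - W * H.T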
var loss = 0.0
for(i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
val diff = AD(i, j) - R(i, j)
loss += diff * diff
}
loss
>>> Double = 0.5817999580912183
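The same squared error can also be computed directly from the flattened column-major arrays (a minimal equivalent one-liner, not part of the proposed API):
val loss2 = AD.toArray.zip(R.toArray).map { case (a, r) => (a - r) * (a - r) }.sum // equals the loss above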