Imagine that 4 documents exist as shown below:

D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat

where each word in the vocabulary can be translated to its wordID:

0 the
1 cat
2 sat
3 on
4 mat

Now every document can be represented as a sparse vector of length 5 (the vocabulary size), as shown below:

Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)))  // D1
Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0)))            // D2
Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0)))                      // D3
Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))                      // D4
Finally, the principal components can be computed as follows:

import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One sparse term-frequency vector per document.
val data = Array(
    Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))),
    Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),
    Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),
    Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0))))

val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)

// pc is a 5 x 4 local matrix whose columns are the principal components.
val pc: Matrix = mat.computePrincipalComponents(4)
What I want to do is read the following dataset and represent each document as a sparse vector like the ones above, in order to compute the principal components.


Each line has the form: docID wordID count


1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3

However, I am not quite sure how to read this dataset and represent each
document as a sparse vector. Any help would be much appreciated.
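
For what it's worth, here is a minimal sketch of one way to do this with the RDD API. It assumes the triples sit in a whitespace-separated text file (the path "docs.txt" is a placeholder) and that the wordIDs are zero-based, so the vector dimension can be taken as the largest wordID plus one:

import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Parse each line "docID wordID count" into (docID, (wordID, count)).
val triples = sc.textFile("docs.txt").map { line =>
  val Array(docId, wordId, count) = line.trim.split("\\s+")
  (docId.toInt, (wordId.toInt, count.toDouble))
}

// The vector dimension must cover the largest wordID (assumed zero-based).
val vocabSize = triples.map(_._2._1).max() + 1

// Collect the (wordID, count) pairs of each document into one sparse vector.
val docVectors = triples.groupByKey().map { case (_, wordCounts) =>
  Vectors.sparse(vocabSize, wordCounts.toSeq)
}

val mat = new RowMatrix(docVectors)
val pc: Matrix = mat.computePrincipalComponents(4)

If the wordIDs are actually one-based, subtract 1 from each index when parsing. Note also that the rows of a RowMatrix carry no document index, so the docID-to-row correspondence is lost after groupByKey; that does not affect the principal components, which depend only on the set of rows.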


