GitHub user dusenberrymw opened a pull request:
https://github.com/apache/spark/pull/7554
[SPARK-6485] [MLlib] [Python] Add
CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix
distributed matrices to PySpark. Each distributed matrix class acts as a
wrapper around the Scala/Java counterpart by maintaining a reference to the
Java object. New distributed matrices can be created using factory methods
added to DistributedMatrices, which create the Java distributed matrix and
then wrap it in the corresponding PySpark class. This design allows for
simple conversion between the various distributed matrices, and lets us re-use
the Scala code. For simplicity, serialization between Python and Java uses
DataFrames where needed, namely for IndexedRowMatrix and CoordinateMatrix.
Associated documentation and unit-tests have also been added. To facilitate
code review, this PR implements access to the rows/entries as RDDs, the number
of rows & columns, and conversions between the various distributed matrices
(not including BlockMatrix). It does not implement the other linear algebra
functions of the matrices, although these should be simple to add now.
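As a rough illustration of the wrapper design described above, here is a minimal pure-Python sketch. It does not use Spark or Py4J: the FakeJava* classes are hypothetical stand-ins for the JVM-side objects, and the class bodies are assumptions made for illustration, not the code in this PR. The point it shows is that a conversion only has to call the wrapped object's method and wrap whatever comes back.

```python
class FakeJavaRowMatrix:
    """Hypothetical stand-in for a JVM-side RowMatrix (no Spark involved)."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def numRows(self):
        return len(self._rows)

    def numCols(self):
        return len(self._rows[0]) if self._rows else 0


class FakeJavaIndexedRowMatrix:
    """Hypothetical stand-in for a JVM-side IndexedRowMatrix."""

    def __init__(self, indexed_rows):
        self._indexed_rows = [(int(i), list(v)) for i, v in indexed_rows]

    def numRows(self):
        # Row count is taken as the largest row index plus one.
        return max((i for i, _ in self._indexed_rows), default=-1) + 1

    def numCols(self):
        return len(self._indexed_rows[0][1]) if self._indexed_rows else 0

    def toRowMatrix(self):
        # Converting to a RowMatrix drops the row indices.
        return FakeJavaRowMatrix(v for _, v in self._indexed_rows)


class DistributedMatrix:
    """Python-side wrapper that only holds a reference to the backing object."""

    def __init__(self, java_matrix):
        self._java_matrix = java_matrix

    def numRows(self):
        return self._java_matrix.numRows()

    def numCols(self):
        return self._java_matrix.numCols()


class RowMatrix(DistributedMatrix):
    pass


class IndexedRowMatrix(DistributedMatrix):
    def toRowMatrix(self):
        # Delegate the conversion, then wrap the returned backing object.
        return RowMatrix(self._java_matrix.toRowMatrix())


indexed = IndexedRowMatrix(
    FakeJavaIndexedRowMatrix([(0, [1.0, 2.0]), (3, [3.0, 4.0])])
)
converted = indexed.toRowMatrix()
```

Because each wrapper takes the backing object as its constructor argument, adding another conversion is one short delegating method per class.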
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dusenberrymw/spark SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7554.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7554
----
commit 4d715303e36c102e48b7345d96e53e0d4855a7d0
Author: Mike Dusenberry <[email protected]>
Date: 2015-06-26T23:21:26Z
Implemented the RowMatrix API in PySpark by doing the following: Added a
DistributedMatrices class to contain factory methods for creating the various
distributed matrices. Added a factory method for creating a RowMatrix from an
RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to
interface with the factory method. Added DistributedMatrix,
DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
commit 7186141c90e105199b2aeace8972e3438ed85ed3
Author: Mike Dusenberry <[email protected]>
Date: 2015-06-26T23:24:52Z
Adding unit tests for RowMatrix methods.
commit bdb9ae389f63c3cdc01fb05b5d6a98027eb6ec52
Author: Mike Dusenberry <[email protected]>
Date: 2015-06-29T18:40:10Z
Updating design to have a PySpark RowMatrix simply create and keep a
reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices
factory methods to accept numRows and numCols with default values. Updating
PySpark DistributedMatrices factory method to simply create a PySpark
RowMatrix. Adding additional doctests for numRows and numCols parameters.
commit 6e70fc468daa973b68a62fe91089acdc9c928af1
Author: Mike Dusenberry <[email protected]>
Date: 2015-06-29T19:01:21Z
Updating documentation to add PySpark RowMatrix. Inserting newline above
doctest so that it renders properly in API docs.
commit 9b434d50982173c1de5c148c6cc6b80421770500
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-20T18:27:51Z
Implemented the IndexedRowMatrix API in PySpark, following the idea of the
RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to
serialize the data between Python and Scala/Java, so we accept PySpark RDDs,
then convert to a DataFrame, then convert back to RDDs on the Scala/Java side
before constructing the IndexedRowMatrix.
commit 5655235a1cb4a36719633744fe63abadb89fcede
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-20T20:34:11Z
Updating the architecture a bit to make conversions between the various
distributed matrix types easier. The different distributed matrix classes are
now only wrappers around the Java objects, and take the Java object as an
argument during construction. This way, we can call for example on an , which
returns a reference to a Java RowMatrix object, and then construct a PySpark
RowMatrix object wrapped around the Java object. This is analogous to the
behavior of PySpark RDDs and DataFrames. We now delegate creation of the
various distributed matrices from scratch in PySpark to the factory methods on .
commit 8091cf72fa626e73610527def785d36971630c6c
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-20T22:49:16Z
Implemented the CoordinateMatrix API in PySpark, following the idea of the
IndexedRowMatrix API, including using DataFrames for serialization.
commit 2dae31420dab31a1e1027713fdc0eb25dd20c536
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-21T01:14:22Z
Added wrappers for the conversions between the various distributed
matrices. Added logic to be able to access the rows/entries of the distributed
matrices, which requires serialization through DataFrames for IndexedRowMatrix
and CoordinateMatrix types. Added unit tests.
commit 3fb6a5efe3847c2bd7bcd3b75e40738ff7f47fc0
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-21T01:43:07Z
Updating unit test to be more useful.
commit 06c39e215028584fc2b3849ee5be57fc2ee2a373
Author: Mike Dusenberry <[email protected]>
Date: 2015-07-21T01:43:42Z
Updating documentation for each of the distributed matrices.
----
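To make the conversions mentioned in the commits above more concrete, here is a small pure-Python sketch of what converting coordinate entries to indexed rows, and indexed rows to plain rows, means for the data. It is only an illustration under stated assumptions: the function names are hypothetical, entries are plain (i, j, value) tuples, and rows are dense Python lists. In the PR itself the conversions are delegated to the Scala implementations.

```python
from collections import defaultdict


def entries_to_indexed_rows(entries, num_cols=None):
    """Group (i, j, value) entries by row index into (i, dense_row) pairs.

    Hypothetical helper for illustration; rows with no entries are omitted.
    """
    entries = list(entries)
    if num_cols is None:
        # Infer the column count from the largest column index seen.
        num_cols = max((j for _, j, _ in entries), default=-1) + 1

    grouped = defaultdict(dict)
    for i, j, value in entries:
        grouped[i][j] = value

    indexed_rows = []
    for i in sorted(grouped):
        dense = [0.0] * num_cols
        for j, value in grouped[i].items():
            dense[j] = value
        indexed_rows.append((i, dense))
    return indexed_rows


def indexed_rows_to_rows(indexed_rows):
    """Drop the row indices, keeping only the row contents."""
    return [row for _, row in indexed_rows]


entries = [(0, 0, 1.2), (2, 1, 3.4), (0, 2, 5.6)]
indexed_rows = entries_to_indexed_rows(entries)
rows = indexed_rows_to_rows(indexed_rows)
```

Note that row 1 has no entries, so it does not appear in the indexed rows, and dropping the indices loses the information that a row was skipped.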