GitHub user dusenberrymw opened a pull request:

    https://github.com/apache/spark/pull/7554

    [SPARK-6485] [MLlib] [Python] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.

    This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix 
distributed matrices to PySpark.  Each distributed matrix class acts as a 
wrapper around its Scala/Java counterpart by maintaining a reference to the 
Java object.  New distributed matrices can be created using factory methods 
added to DistributedMatrices, which create the Java distributed matrix and 
then wrap it with the corresponding PySpark class.  This design allows for 
simple conversion between the various distributed matrices, and lets us re-use 
the Scala code.  For simplicity, serialization between Python and Java is 
implemented using DataFrames where needed, namely for IndexedRowMatrix and 
CoordinateMatrix.  Associated documentation and unit tests have also been 
added.  To facilitate code review, this PR implements only access to the 
rows/entries as RDDs, the number of rows and columns, and conversions between 
the various distributed matrices (not including BlockMatrix).  It does not 
implement the other linear algebra functions of the matrices, although these 
will be simple to add now.
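
    The wrapper design described above can be illustrated with a minimal, 
pure-Python sketch.  It is not the code in this PR: the FakeJava* classes are 
hypothetical stand-ins for the JVM objects that the real wrappers reach 
through Py4J, plain lists stand in for RDDs, and the factory method names 
(rowMatrix, indexedRowMatrix) are assumptions chosen for illustration.  It only 
shows the idea that each Python class holds a reference to a Java object, 
delegates to it, and wraps whatever Java object a conversion returns.

```python
# Hypothetical stand-ins for the Scala/Java matrices (not real Spark classes).
class FakeJavaRowMatrix:
    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def numRows(self):
        return len(self._rows)

    def numCols(self):
        return len(self._rows[0]) if self._rows else 0

    def rows(self):
        return self._rows


class FakeJavaIndexedRowMatrix:
    def __init__(self, indexed_rows):
        self._indexed_rows = [(int(i), list(r)) for i, r in indexed_rows]

    def numRows(self):
        # Largest row index plus one, so gaps in the indices count as rows.
        if not self._indexed_rows:
            return 0
        return max(i for i, _ in self._indexed_rows) + 1

    def numCols(self):
        return len(self._indexed_rows[0][1]) if self._indexed_rows else 0

    def toRowMatrix(self):
        # Conversion happens on the "Java" side and drops the indices.
        return FakeJavaRowMatrix(r for _, r in self._indexed_rows)


# Python-side wrappers: each one only keeps a reference and delegates.
class DistributedMatrix:
    def __init__(self, java_matrix):
        self._java_matrix = java_matrix

    def numRows(self):
        return self._java_matrix.numRows()

    def numCols(self):
        return self._java_matrix.numCols()


class RowMatrix(DistributedMatrix):
    @property
    def rows(self):
        return self._java_matrix.rows()


class IndexedRowMatrix(DistributedMatrix):
    def toRowMatrix(self):
        # Wrap the Java object returned by the conversion in a Python class.
        return RowMatrix(self._java_matrix.toRowMatrix())


class DistributedMatrices:
    """Factory methods: build the Java object, then wrap it (names assumed)."""

    @staticmethod
    def rowMatrix(rows):
        return RowMatrix(FakeJavaRowMatrix(rows))

    @staticmethod
    def indexedRowMatrix(indexed_rows):
        return IndexedRowMatrix(FakeJavaIndexedRowMatrix(indexed_rows))


irm = DistributedMatrices.indexedRowMatrix([(0, [1.0, 2.0]), (3, [3.0, 4.0])])
rm = irm.toRowMatrix()
print(irm.numRows(), rm.numRows(), rm.numCols())
```

    Because a conversion only has to wrap the object returned by the Java 
side, adding another conversion or delegated method in this style is a 
one-line change on the Python side.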

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dusenberrymw/spark SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7554
    
----
commit 4d715303e36c102e48b7345d96e53e0d4855a7d0
Author: Mike Dusenberry <[email protected]>
Date:   2015-06-26T23:21:26Z

    Implemented the RowMatrix API in PySpark by doing the following: Added a 
DistributedMatrices class to contain factory methods for creating the various 
distributed matrices.  Added a factory method for creating a RowMatrix from an 
RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to 
interface with the factory method.  Added DistributedMatrix, 
DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.

commit 7186141c90e105199b2aeace8972e3438ed85ed3
Author: Mike Dusenberry <[email protected]>
Date:   2015-06-26T23:24:52Z

    Adding unit tests for RowMatrix methods.

commit bdb9ae389f63c3cdc01fb05b5d6a98027eb6ec52
Author: Mike Dusenberry <[email protected]>
Date:   2015-06-29T18:40:10Z

    Updating design to have a PySpark RowMatrix simply create and keep a 
reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices 
factory methods to accept numRows and numCols with default values.  Updating 
PySpark DistributedMatrices factory method to simply create a PySpark 
RowMatrix. Adding additional doctests for numRows and numCols parameters.

commit 6e70fc468daa973b68a62fe91089acdc9c928af1
Author: Mike Dusenberry <[email protected]>
Date:   2015-06-29T19:01:21Z

    Updating documentation to add PySpark RowMatrix. Inserting newline above 
doctest so that it renders properly in API docs.

commit 9b434d50982173c1de5c148c6cc6b80421770500
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-20T18:27:51Z

    Implemented the IndexedRowMatrix API in PySpark, following the idea of the 
RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to 
serialize the data between Python and Scala/Java, so we accept PySpark RDDs, 
then convert to a DataFrame, then convert back to RDDs on the Scala/Java side 
before constructing the IndexedRowMatrix.

commit 5655235a1cb4a36719633744fe63abadb89fcede
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-20T20:34:11Z

    Updating the architecture a bit to make conversions between the various 
distributed matrix types easier.  The different distributed matrix classes are 
now only wrappers around the Java objects, and take the Java object as an 
argument during construction.  This way, we can call  for example on an , which 
returns a reference to a Java RowMatrix object, and then construct a PySpark 
RowMatrix object wrapped around the Java object.  This is analogous to the 
behavior of PySpark RDDs and DataFrames.  We now delegate creation of the 
various distributed matrices from scratch in PySpark to the factory methods on .

commit 8091cf72fa626e73610527def785d36971630c6c
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-20T22:49:16Z

    Implemented the CoordinateMatrix API in PySpark, following the idea of the 
IndexedRowMatrix API, including using DataFrames for serialization.

commit 2dae31420dab31a1e1027713fdc0eb25dd20c536
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-21T01:14:22Z

    Added wrappers for the conversions between the various distributed 
matrices.  Added logic to be able to access the rows/entries of the distributed 
matrices, which requires serialization through DataFrames for IndexedRowMatrix 
and CoordinateMatrix types. Added unit tests.

commit 3fb6a5efe3847c2bd7bcd3b75e40738ff7f47fc0
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-21T01:43:07Z

    Updating unit test to be more useful.

commit 06c39e215028584fc2b3849ee5be57fc2ee2a373
Author: Mike Dusenberry <[email protected]>
Date:   2015-07-21T01:43:42Z

    Updating documentation for each of the distributed matrices.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
