GitHub user dusenberrymw opened a pull request:
https://github.com/apache/spark/pull/9441
[WIP] [SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's
Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear
algebra classes as follows:
* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`
**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents`
are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark
`RowMatrix` constructor. As discussed on the dev list
[here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html),
there appears to be an issue with type erasure with RDDs coming from Java, and
by extension from PySpark. Although we are attempting to construct a
`RowMatrix` from an `RDD[Vector]` in
[PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115),
the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when
calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException`
in which an `Object` cannot be cast to a Spark `Vector`. As noted in the
aforementioned dev list thread, this issue was also encountered with
`DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a
`Vector` type. Thus, this PR currently contains that fix applied
to the `createRowMatrix` helper function in `PythonMLlibAPI`.
`IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue,
likely because their helper functions in `PythonMLlibAPI` build the RDDs
explicitly from DataFrames with pattern matching, which preserves the
types. However, this fix may be out of scope for this PR and better suited
to a separate JIRA/PR. I have therefore marked this PR as WIP and am open to
discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for
SPARK-6227.
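For reviewers less familiar with these operations, here is a minimal pure-Python sketch of what three of the listed methods compute, using a small local matrix. It is not the PySpark API: the helper names `gramian`, `covariance`, and `transpose_entries` are hypothetical, and the sample-covariance normalization (dividing by `m - 1`) is an assumption about the convention used by the distributed implementation.

```python
# Local, pure-Python reference for a few of the operations listed above.
# Illustrative only: these helpers are hypothetical and are NOT the PySpark API.


def gramian(rows):
    """Return the Gramian A^T A for a matrix given as a list of equal-length rows."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    # Accumulate the outer product of each row with itself.
    for row in rows:
        for i in range(n):
            for j in range(n):
                g[i][j] += row[i] * row[j]
    return g


def covariance(rows):
    """Return the sample covariance of the columns (assumes division by m - 1)."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    # Center each column, then reuse the Gramian of the centered matrix.
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    g = gramian(centered)
    return [[g[i][j] / (m - 1) for j in range(n)] for i in range(n)]


def transpose_entries(entries):
    """Transpose a coordinate-format matrix given as (row, col, value) triples."""
    return [(j, i, v) for (i, j, v) in entries]


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(gramian(rows))
print(covariance(rows))
print(transpose_entries([(0, 1, 7.0), (2, 0, 9.0)]))
```

A reference like this could serve as an oracle when writing tests for the distributed versions on small inputs, since the distributed results should agree with it up to floating-point tolerance.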
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dusenberrymw/spark SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9441.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9441
----
commit 7a98f55c60bfb90cda76ca1104db0486924e3667
Author: Mike Dusenberry <[email protected]>
Date: 2015-10-30T21:35:15Z
Adding remaining methods to PySpark BlockMatrix: cache, persist, validate,
transpose.
commit c713a27e3952bc3b533f129fb32a1e698af7bc13
Author: Mike Dusenberry <[email protected]>
Date: 2015-10-30T21:56:19Z
Adding remaining method to PySpark CoordinateMatrix: transpose.
commit 0532f12dbb5bc136f231a6028446f17ea90b7bb0
Author: Mike Dusenberry <[email protected]>
Date: 2015-10-30T22:27:37Z
Adding remaining method to PySpark IndexedRowMatrix: computeGramianMatrix.
Note that 'multiply' and 'computeSVD' are part of the SPARK-6227 PR.
commit cbddf10e717e74508ba512c8f303106959439d17
Author: Mike Dusenberry <[email protected]>
Date: 2015-11-02T21:53:15Z
Adding remaining methods to PySpark RowMatrix: computeGramianMatrix,
computeCovariance, computeColumnSummaryStatistics, columnSimilarities,
tallSkinnyQR.
----