GitHub user dusenberrymw opened a pull request:

    https://github.com/apache/spark/pull/9458

    [SPARK-11497] [MLlib] [Python] PySpark RowMatrix Constructor Has Type Erasure Issue

    As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug in the
    PySpark `RowMatrix` constructor.  As discussed on the dev list
    [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html),
    RDDs that come from Java, and by extension from PySpark, appear to be
    subject to type erasure.  Although we attempt to construct a `RowMatrix`
    from an `RDD[Vector]` in
    [PythonMLLibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115),
    the `Vector` type is erased, leaving an `RDD[Object]`.  As a result,
    calling Scala's `tallSkinnyQR` from PySpark throws a Java
    `ClassCastException` because an `Object` cannot be cast to a Spark
    `Vector`.  As noted in that dev list thread, the same issue was
    encountered with `DecisionTrees`, and the fix was an explicit `retag` of
    the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix`
    do not appear to have this issue, likely because their helper functions
    in `PythonMLLibAPI` create the RDDs explicitly from DataFrames with
    pattern matching, which preserves the types.
    
    This PR currently applies that retagging fix to the `createRowMatrix`
    helper function in `PythonMLLibAPI`.  This PR blocks #9441, so once it is
    merged, #9441 can be rebased.
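
    For readers unfamiliar with this failure mode, the following is a rough,
    plain-Java analogy of it and of a retag-style fix.  It is not Spark code
    and not the patch in this PR: the class `RetagSketch` and its methods are
    hypothetical names invented for illustration, and `String` stands in for
    Spark's `Vector`.  It only assumes that the problem comes from a generic
    container losing its element type at runtime.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Spark.
public class RetagSketch {

    // Erased version: with no runtime class token available, the only array
    // we can build is an Object[], whatever T is at the call site.
    @SuppressWarnings("unchecked")
    static <T> T[] collectErased(List<T> rows) {
        Object[] out = new Object[rows.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = rows.get(i);
        }
        return (T[]) out; // unchecked cast; nothing is verified here
    }

    // "Retagged" version: the caller supplies the element class, so the
    // array is created with the correct runtime component type.
    @SuppressWarnings("unchecked")
    static <T> T[] collectTagged(List<T> rows, Class<T> cls) {
        T[] out = (T[]) Array.newInstance(cls, rows.size());
        return rows.toArray(out);
    }

    // Returns true if treating the erased result as a String[] fails.
    static boolean erasedCollectFails(List<String> rows) {
        try {
            // The compiler inserts a cast to String[] at this assignment,
            // but the runtime value is an Object[].
            String[] typed = collectErased(rows);
            return typed.length < 0; // only reached if the cast succeeded
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>(Arrays.asList("a", "b"));
        System.out.println(erasedCollectFails(rows));                   // prints true
        System.out.println(collectTagged(rows, String.class).length);   // prints 2
    }
}
```

    The design point is the same in both settings: the element type has to
    be supplied explicitly at the boundary where it would otherwise be lost,
    which is the role the explicit `retag` plays in the fix described above.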
    
    cc @holdenk 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dusenberrymw/spark SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9458.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9458
    
----
commit c1258a6a8a1741ce67e912a74f8e9444a9b7a590
Author: Mike Dusenberry <[email protected]>
Date:   2015-11-04T02:33:13Z

    Retagging the rows RDD to be an RDD[Vector].

----


