Mike Dusenberry created SPARK-11497:
---------------------------------------

             Summary: PySpark RowMatrix Constructor Has Type Erasure Issue
                 Key: SPARK-11497
                 URL: https://issues.apache.org/jira/browse/SPARK-11497
             Project: Spark
          Issue Type: Bug
          Components: MLlib, PySpark
            Reporter: Mike Dusenberry


Implementing tallSkinnyQR in SPARK-9656 uncovered a bug in the PySpark 
RowMatrix constructor. As discussed on the dev list 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
 RDDs coming from Java, and by extension from PySpark, appear to be affected 
by type erasure. Although we attempt to construct a RowMatrix from an 
RDD[Vector] in 
[PythonMLLibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
 the Vector type is erased, resulting in an RDD[Object]. As a result, calling 
Scala's tallSkinnyQR from PySpark throws a Java ClassCastException because an 
Object cannot be cast to a Spark Vector. As noted in the dev list thread 
linked above, the same issue was encountered with DecisionTrees, and the fix 
there was an explicit retag of the RDD with the Vector type. The PR for this 
issue will apply the same fix to the createRowMatrix helper function in 
PythonMLLibAPI. 
IndexedRowMatrix and CoordinateMatrix do not appear to have this issue, likely 
because their helper functions in PythonMLLibAPI create the RDDs explicitly 
from DataFrames with pattern matching, which preserves the element types. 
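
The failure mode and the retag fix described above can be illustrated with a 
small plain-Java analogy. This is only a sketch and is not Spark code: Vec, 
Tagged, and collectFails are hypothetical names invented for the example, with 
Tagged standing in for an RDD that carries a runtime class tag alongside its 
statically declared element type. The actual fix lives in the Scala 
createRowMatrix helper.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

public class RetagSketch {
    /** Hypothetical stand-in for a Spark Vector. */
    static final class Vec {
        final double[] values;
        Vec(double... values) { this.values = values; }
    }

    /** Hypothetical stand-in for an RDD: elements plus the runtime class used to allocate arrays. */
    static final class Tagged<T> {
        private final List<T> elements;
        private final Class<?> tag;

        Tagged(List<T> elements, Class<?> tag) {
            this.elements = elements;
            this.tag = tag;
        }

        /** Same elements, new runtime tag (the analogue of an explicit retag). */
        Tagged<T> retag(Class<T> cls) {
            return new Tagged<>(elements, cls);
        }

        /** Allocates the result array from the runtime tag, not from the erased T. */
        @SuppressWarnings("unchecked")
        T[] collect() {
            Object[] out = (Object[]) Array.newInstance(tag, elements.size());
            for (int i = 0; i < out.length; i++) {
                out[i] = elements.get(i);
            }
            return (T[]) out; // unchecked: no runtime check happens here
        }
    }

    /** Returns true if collecting into a Vec[] fails with a ClassCastException. */
    static boolean collectFails(Tagged<Vec> data) {
        try {
            Vec[] rows = data.collect(); // the compiler inserts a cast to Vec[] here
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Vec> rows = Arrays.asList(new Vec(1.0, 2.0), new Vec(3.0, 4.0));

        // Statically a Tagged<Vec>, but the runtime tag has been erased to Object.
        Tagged<Vec> erased = new Tagged<>(rows, Object.class);
        System.out.println(collectFails(erased));                   // true

        // Retagging with the real element class restores the expected array type.
        System.out.println(collectFails(erased.retag(Vec.class)));  // false
    }
}
```

The elements themselves are never the problem in this sketch; only the runtime 
tag used to build the array is wrong, which is why swapping the tag is enough 
and no data needs to be copied or converted.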



