Mike Dusenberry created SPARK-11497:
---------------------------------------
Summary: PySpark RowMatrix Constructor Has Type Erasure Issue
Key: SPARK-11497
URL: https://issues.apache.org/jira/browse/SPARK-11497
Project: Spark
Issue Type: Bug
Components: MLlib, PySpark
Reporter: Mike Dusenberry
Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark
RowMatrix constructor. As discussed on the dev list
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
there appears to be an issue with type erasure with RDDs coming from Java, and
by extension from PySpark. Although we are attempting to construct a RowMatrix
from an RDD[Vector] in
[PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
the Vector type is erased, resulting in an RDD[Object]. Thus, when calling
Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an
Object cannot be cast to a Spark Vector. As noted in the aforementioned dev
list thread, this issue was also encountered with DecisionTrees, and the fix
involved an explicit retag of the RDD with a Vector type. Thus, this PR will
apply that fix to the createRowMatrix helper function in PythonMLlibAPI.
IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely
due to their related helper functions in PythonMLlibAPI creating the RDDs
explicitly from DataFrames with pattern matching, thus preserving the types.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]