[
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995420#comment-14995420
]
Joseph K. Bradley commented on SPARK-11497:
-------------------------------------------
Correction: Does this affect any functionality in current versions of Spark?
Or is this only an issue with your wrapper for tallSkinnyQR?
I'm asking because I can still call things like computePrincipalComponents
without this fix. I can't call tallSkinnyQR, but that's also because of its
return type not being converted to a Python object.
Do you have an example of current functionality which is not available? If
not, I might ask that we postpone this for 1.7 (unless it's needed for
something targeted for 1.6).
> PySpark RowMatrix Constructor Has Type Erasure Issue
> ----------------------------------------------------
>
> Key: SPARK-11497
> URL: https://issues.apache.org/jira/browse/SPARK-11497
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 1.5.0, 1.5.1, 1.6.0
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Priority: Minor
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark
> RowMatrix constructor. As discussed on the dev list
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
> there appears to be an issue with type erasure with RDDs coming from Java,
> and by extension from PySpark. Although we are attempting to construct a
> RowMatrix from an RDD[Vector] in
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
> the Vector type is erased, resulting in an RDD[Object]. Thus, when calling
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned
> dev list thread, this issue was also encountered with DecisionTrees, and the
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI.
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely
> due to their related helper functions in PythonMLlibAPI creating the RDDs
> explicitly from DataFrames with pattern matching, thus preserving the types.
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]