[ 
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SPARK-11497:
------------------------------------
    Description: 
Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
RowMatrix constructor. As discussed on the dev list 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
 there appears to be an issue with type erasure with RDDs coming from Java, and 
by extension from PySpark. Although we are attempting to construct a RowMatrix 
from an RDD[Vector] in 
[PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
 the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an 
Object cannot be cast to a Spark Vector. As noted in the aforementioned dev 
list thread, this issue was also encountered with DecisionTrees, and the fix 
involved an explicit retag of the RDD with a Vector type. Thus, this PR will 
apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
due to their related helper functions in PythonMLlibAPI creating the RDDs 
explicitly from DataFrames with pattern matching, thus preserving the types. 

The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:

{code}
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
mat = RowMatrix(rows)
mat._java_matrix_wrapper.call("tallSkinnyQR", True)
{code}

Should result in the following exception:

{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
[Lorg.apache.spark.mllib.linalg.Vector;
{code}

  was:
Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
RowMatrix constructor. As discussed on the dev list 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
 there appears to be an issue with type erasure with RDDs coming from Java, and 
by extension from PySpark. Although we are attempting to construct a RowMatrix 
from an RDD[Vector] in 
[PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
 the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an 
Object cannot be cast to a Spark Vector. As noted in the aforementioned dev 
list thread, this issue was also encountered with DecisionTrees, and the fix 
involved an explicit retag of the RDD with a Vector type. Thus, this PR will 
apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
due to their related helper functions in PythonMLlibAPI creating the RDDs 
explicitly from DataFrames with pattern matching, thus preserving the types. 

The following reproduces this issue:

{code}
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
mat = RowMatrix(rows)
mat._java_matrix_wrapper.call("tallSkinnyQR", True)
{code}

Should result in the following exception:

{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
[Lorg.apache.spark.mllib.linalg.Vector;
{code}


> PySpark RowMatrix Constructor Has Type Erasure Issue
> ----------------------------------------------------
>
>                 Key: SPARK-11497
>                 URL: https://issues.apache.org/jira/browse/SPARK-11497
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>            Reporter: Mike Dusenberry
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
> RowMatrix constructor. As discussed on the dev list 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
>  there appears to be an issue with type erasure with RDDs coming from Java, 
> and by extension from PySpark. Although we are attempting to construct a 
> RowMatrix from an RDD[Vector] in 
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
>  the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which 
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned 
> dev list thread, this issue was also encountered with DecisionTrees, and the 
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR 
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
> due to their related helper functions in PythonMLlibAPI creating the RDDs 
> explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to