[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-5361.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.2

I've cherry-picked the fix into `branch-1.2` (1.2.2), so I'm marking this as 
Fixed.

> Multiple Java RDD <-> Python RDD conversions not working correctly
> ------------------------------------------------------------------
>
>                 Key: SPARK-5361
>                 URL: https://issues.apache.org/jira/browse/SPARK-5361
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Winston Chen
>              Labels: backport-needed
>             Fix For: 1.3.0, 1.2.2
>
>
> This was found by reading an RDD with `sc.newAPIHadoopRDD` and writing it 
> back with `rdd.saveAsNewAPIHadoopFile` in PySpark.
> Whenever an RDD is converted from JavaRDD to PythonRDD and back to JavaRDD 
> more than once, the exception below is thrown:
> {noformat}
> 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
>       at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
>       at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
>       at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> {noformat}
> The test case code below reproduces it:
> {noformat}
> from pyspark.rdd import RDD
> # Run in the pyspark shell, where `sc` is the active SparkContext.
> dl = [
>     (u'2', {u'director': u'David Lean'}),
>     (u'7', {u'director': u'Andrew Dominik'})
> ]
> dl_rdd = sc.parallelize(dl)
> # First round trip: Python RDD -> Java object RDD -> Python RDD.
> tmp = dl_rdd._to_java_object_rdd()
> tmp2 = sc._jvm.SerDe.javaToPython(tmp)
> t = RDD(tmp2, sc)
> t.count()
> # Second round trip over the already-converted RDD.
> tmp = t._to_java_object_rdd()
> tmp2 = sc._jvm.SerDe.javaToPython(tmp)
> t = RDD(tmp2, sc)
> t.count()  # fails here, on the second conversion
> {noformat}



