[ https://issues.apache.org/jira/browse/SPARK-6008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-6008.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.2.2
                   1.3.0
         Assignee: Davies Liu

> zip two rdds derived from pickleFile fails
> ------------------------------------------
>
>                 Key: SPARK-6008
>                 URL: https://issues.apache.org/jira/browse/SPARK-6008
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Charles Hayden
>            Assignee: Davies Liu
>             Fix For: 1.3.0, 1.2.2
>
>
> Read an RDD from a pickle file, then derive a second RDD from it
> and zip the two together:
> from pyspark import SparkContext
> sc = SparkContext()
> print sc.version
> r = sc.parallelize(range(1, 1000))
> r.saveAsPickleFile('file')
> rdd = sc.pickleFile('file')
> res = rdd.map(lambda row: row, preservesPartitioning=True)
> z = rdd.zip(res)
> print z.take(1)
> This gives the following error:
>   File "bug.py", line 30, in <module>
>     print z.take(1)
>   File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1225, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/home/ubuntu/spark/python/pyspark/context.py", line 843, in runJob
>     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
>   File "/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
>       at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742)
>       at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406)
>       at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244)
>       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613)
>       at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
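
The error above says that zip needs the same number of elements in each
partition on both sides. The following is a minimal pure-Python sketch of that
check, not Spark's implementation. Partitions are modelled as plain lists, and
the names zip_partitions and batched are hypothetical helpers. The batching
step is an assumption about how such a mismatch could arise (the same logical
elements grouped into records of different sizes on each side); the report
itself does not state the cause.

```python
from itertools import islice


def batched(items, size):
    """Group items into lists of at most `size` elements (hypothetical helper)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def zip_partitions(left, right):
    """Zip two lists of partitions pairwise, requiring equal record counts."""
    if len(left) != len(right):
        raise ValueError("Can only zip RDDs with the same number of partitions")
    zipped = []
    for a, b in zip(left, right):
        if len(a) != len(b):
            raise ValueError(
                "Can only zip RDDs with same number of elements in each partition")
        zipped.append(list(zip(a, b)))
    return zipped


# One partition holding 20 logical elements.
elements = list(range(1, 21))

# Same record count on both sides: the zip succeeds.
same = zip_partitions([elements], [elements])

# ASSUMPTION: each side groups the same elements into records of a different
# size, so the per-partition record counts differ (2 records vs 5 records).
left = [list(batched(elements, 10))]
right = [list(batched(elements, 4))]

try:
    zip_partitions(left, right)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

In this sketch both sides carry the same 20 elements, yet the check fails
because it counts records rather than logical elements.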


