[ https://issues.apache.org/jira/browse/SPARK-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Charles Hayden updated SPARK-5558:
----------------------------------

> pySpark zip function unexpected errors
> --------------------------------------
>
>                 Key: SPARK-5558
>                 URL: https://issues.apache.org/jira/browse/SPARK-5558
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Charles Hayden
>              Labels: pyspark

Description:

Example:
{quote}
x = sc.parallelize(range(0,5))
y = x.map(lambda x: x+1000, preservesPartitioning=True)
y.take(10)
x.zip(y).collect()
{quote}
This fails in the JVM with a Py4J error:
org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition

If the range is changed to range(0,1000), it fails in the PySpark code instead:
ValueError: Can not deserialize RDD with different number of items in pair: (100, 1)

It also fails if y.take(10) is replaced with y.toDebugString(). It even fails if we print y._jrdd.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
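For context, the contract the JVM error is enforcing can be illustrated locally. The sketch below is a plain-Python model of that contract, not Spark's actual implementation: `zip_partitions` is a hypothetical helper, and the partition lists stand in for the per-partition contents of two RDDs. RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each corresponding partition, which is why a cached/evaluated RDD whose partitioning metadata no longer matches can trigger the errors above.

```python
def zip_partitions(parts_a, parts_b):
    """Model of RDD.zip's pairing contract (illustrative only, not Spark code).

    parts_a and parts_b are lists of partitions, each partition a list of
    elements. Zipping is element-wise within each corresponding partition.
    """
    if len(parts_a) != len(parts_b):
        raise ValueError("Can only zip RDDs with the same number of partitions")
    result = []
    for pa, pb in zip(parts_a, parts_b):
        # Spark raises a similar error at runtime when partition sizes differ.
        if len(pa) != len(pb):
            raise ValueError(
                "Can only zip RDDs with same number of elements in each partition"
            )
        result.append(list(zip(pa, pb)))
    return result


# Matching partition layout: zip succeeds.
ok = zip_partitions([[0, 1], [2, 3, 4]], [[1000, 1001], [1002, 1003, 1004]])

# Mismatched element counts per partition: zip must fail.
try:
    zip_partitions([[0, 1, 2]], [[1000, 1001]])
except ValueError as e:
    print(e)
```

When the per-partition counts cannot be guaranteed, a common workaround (assumption, not from this report) is to pair the two RDDs by index with zipWithIndex followed by a join, which does not depend on identical partitioning.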