[jira] [Created] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers

Josh Rosen (JIRA) Fri, 01 Aug 2014 12:50:11 -0700

Josh Rosen created SPARK-2790:
---------------------------------

             Summary: PySpark zip() doesn't work properly if RDDs have different serializers
                 Key: SPARK-2790
                 URL: https://issues.apache.org/jira/browse/SPARK-2790
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.0.0, 1.1.0
            Reporter: Josh Rosen
            Priority: Critical



In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have 
different serializers (e.g. batched vs. unbatched), even if those RDDs have the 
same number of partitions and the same number of elements in each partition.  
This problem occurs in the MLlib Python APIs, where we might want to zip a 
JavaRDD of LabeledPoints with a JavaRDD of batch-serialized Python objects.

This is problematic because whether {{zip()}} succeeds or fails depends on the 
partitioning / batching strategy, and we don't want to expose serialization 
details to users.
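
To illustrate the mismatch, here is a minimal pure-Python sketch that does not 
use Spark at all.  It assumes a batched serializer stores several elements per 
record while an unbatched one stores a single element per record; the helper 
names (batched, unbatched, unbatch) are hypothetical and are not PySpark APIs.  
Python's built-in zip() truncates silently here, whereas the report above is 
about zip() failing, so treat this only as a picture of why the record counts 
stop lining up, not as the actual PySpark code path.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to batch_size items (hypothetical batched record stream)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def unbatched(items):
    """Yield one item per record (hypothetical unbatched record stream)."""
    return iter(items)


def unbatch(stream):
    """Flatten a stream of batches back into individual items."""
    for batch in stream:
        yield from batch


partition_a = [1, 2, 3, 4]           # stored unbatched: 4 records
partition_b = ["a", "b", "c", "d"]   # stored in batches of 2: 2 records

# Pairing the raw record streams mixes single items with whole batches.
naive = list(zip(unbatched(partition_a), batched(partition_b, 2)))
print(naive)    # [(1, ['a', 'b']), (2, ['c', 'd'])]

# Normalizing both sides to one item per record restores element-wise pairing.
aligned = list(zip(unbatched(partition_a), unbatch(batched(partition_b, 2))))
print(aligned)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

Both partitions hold four elements, but the batched side exposes only two 
records, so a record-level zip cannot pair them one to one.  One possible 
direction, shown in the second call, is to bring both sides to a common 
representation before zipping.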



--
This message was sent by Atlassian JIRA
(v6.2#6252)
