Josh Rosen created SPARK-2790:
---------------------------------
Summary: PySpark zip() doesn't work properly if RDDs have
different serializers
Key: SPARK-2790
URL: https://issues.apache.org/jira/browse/SPARK-2790
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.0.0, 1.1.0
Reporter: Josh Rosen
Priority: Critical
In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have
different serializers (e.g. batched vs. unbatched), even if those RDDs have the
same number of partitions and the same number of elements in each partition.
This problem occurs in the MLlib Python APIs, where we might want to zip a
JavaRDD of LabeledPoint objects with a JavaRDD of batch-serialized Python
objects.
This is problematic because whether {{zip()}} succeeds or fails depends on the
partitioning / batching strategy, and we don't want to expose serialization
details to users.
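The mismatch can be illustrated without Spark. The sketch below is a simplified
pure-Python model, not PySpark's actual code path: it assumes that zipping pairs
up serialized records rather than individual elements, and the helper names
(serialize_unbatched, serialize_batched, and so on) are hypothetical. Both inputs
hold six elements, but the batched side produces fewer records, so pairing at
the record level misaligns.

```python
import pickle


def serialize_unbatched(items):
    """Hypothetical helper: one pickled record per element."""
    return [pickle.dumps(x) for x in items]


def serialize_batched(items, batch_size):
    """Hypothetical helper: one pickled record per batch of elements."""
    return [pickle.dumps(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def deserialize_unbatched(records):
    """Inverse of serialize_unbatched."""
    return [pickle.loads(r) for r in records]


def deserialize_batched(records):
    """Inverse of serialize_batched: flatten every batch back into elements."""
    return [x for r in records for x in pickle.loads(r)]


# Two "partitions" with the same number of elements.
elements_a = list(range(6))
elements_b = [str(i) for i in range(6)]

# Same element count, different serializers, different record counts.
records_a = serialize_unbatched(elements_a)
records_b = serialize_batched(elements_b, batch_size=4)

# Pairing serialized records directly: the record counts disagree, so
# elements can no longer be matched one to one.
naive_pairs = list(zip(records_a, records_b))

# Decoding each side with its own serializer first restores the
# element-wise pairing that the user expects.
safe_pairs = list(zip(deserialize_unbatched(records_a),
                      deserialize_batched(records_b)))

print(len(records_a), len(records_b), len(naive_pairs), len(safe_pairs))
```

In this model, the element-wise pairing is recovered once both sides are
brought to a common element-level representation before pairing, which is one
possible direction for a fix; the issue itself does not prescribe a particular
approach.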