Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19226#discussion_r138790747
--- Diff: python/pyspark/serializers.py ---
@@ -343,9 +346,6 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
-            if len(key_batch) != len(val_batch):
-                raise ValueError("Can not deserialize PairRDD with different number of items"
-                                 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
             # for correctness with repeated cartesian/zip this must be returned as one batch
             yield zip(key_batch, val_batch)
--- End diff --
How about returning this batch as a list (and as described in the doc)?
---