GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19226#discussion_r139032001
--- Diff: python/pyspark/serializers.py ---
@@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            key_batch = list(key_batch)
+            val_batch = list(val_batch)
--- End diff --
Ah, I should have been clearer. What I meant is that if
`Serializer._load_stream_without_unbatching` worked as documented and returned
`an iterator of deserialized batches (lists)`, everything would have worked
fine. So I think the reverse is actually more correct, because
`PairDeserializer` and `CartesianDeserializer` do not follow this contract.
I am okay with the current change, but I believe the reverse is better. WDYT,
@aray and @holdenk?
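
To make the contract issue concrete, here is a minimal standalone sketch, not the actual PySpark code. The function `load_pairs` and the sample streams are hypothetical. It assumes a pair loader that compares batch lengths with `len()`, which only works when each batch is a list as documented; a batch that arrives as a lazy iterator has no `len()`, so the sketch materializes each batch first, as the diff above does.

```python
def load_pairs(key_batch_stream, val_batch_stream):
    """Yield (key, value) pairs batch by batch (hypothetical helper)."""
    for key_batch, val_batch in zip(key_batch_stream, val_batch_stream):
        # Materialize each batch so len() works even when a serializer
        # yields a lazy iterator instead of the documented list.
        key_batch = list(key_batch)
        val_batch = list(val_batch)
        if len(key_batch) != len(val_batch):
            raise ValueError("key and value batches differ in length")
        for pair in zip(key_batch, val_batch):
            yield pair


# Hypothetical input: key batches are lists, value batches are lazy iterators.
keys = iter([[1, 2], [3]])
vals = iter([iter(["a", "b"]), iter(["c"])])
pairs = list(load_pairs(keys, vals))
```

Without the two `list(...)` calls, the `len(val_batch)` check would raise `TypeError` for the iterator batches. The alternative discussed here would be to make every serializer return lists, so that callers need no conversion.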