[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

HyukjinKwon Thu, 14 Sep 2017 16:05:57 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19226#discussion_r139032001
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
             key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
             val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
             for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
    +            key_batch = list(key_batch)
    +            val_batch = list(val_batch)
    --- End diff --
    
    Ah, I should have been clearer. What I meant is that if
`Serializer._load_stream_without_unbatching` worked as documented, returning
`an iterator of deserialized batches (lists)`, everything would have worked
fine. So I think the reverse is actually more correct, because
`PairDeserializer` and `CartesianDeserializer` do not follow that contract.
    
    I am okay with the current change, but I believe the reverse is better.
WDYT, @aray and @holdenk?
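
To make the contract point concrete, here is a minimal standalone sketch (not the actual PySpark code). It assumes the loop body needs `len()` of each batch, which the diff context above does not show; `pair_batches`, `keys`, and `vals` are hypothetical names used only for illustration. It shows why a batch that arrives as a lazy iterator rather than a list has to be materialized before its length can be checked.

```python
from itertools import chain


def pair_batches(key_batches, val_batches):
    """Zip two streams of batches into lists of (key, value) pairs."""
    for key_batch, val_batch in zip(key_batches, val_batches):
        # A batch may be a lazy iterator, which has no len().
        # Materializing it mirrors the two lines added in the diff.
        key_batch = list(key_batch)
        val_batch = list(val_batch)
        if len(key_batch) != len(val_batch):
            raise ValueError(
                "batch sizes differ: %d vs %d" % (len(key_batch), len(val_batch))
            )
        yield list(zip(key_batch, val_batch))


# Hypothetical stand-ins for two deserialized batch streams.
keys = iter([iter([1, 2]), iter([3])])  # batches as iterators (breaks the "lists" contract)
vals = iter([["a", "b"], ["c"]])        # batches as lists (follows the contract)

pairs = list(chain.from_iterable(pair_batches(keys, vals)))
```

If every `_load_stream_without_unbatching` implementation yielded lists as documented, the two `list(...)` calls would be no-ops apart from a copy, which is the trade-off between the two approaches being discussed.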


