Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/755#issuecomment-45974068
  
    Dug a little deeper and found the following.
    
    1) ```saveAsPickleFile``` calls ```saveAsObjectFile```, which does its own 
grouping by a factor of 10. Since no serialization is involved at that step, 
this is probably fine.
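    To illustrate what that grouping step looks like, here is a minimal sketch 
(the helper name is hypothetical, not Spark's actual implementation) of batching 
an iterator by a fixed factor of 10:

    ```python
    from itertools import islice

    def grouped(iterator, group_size):
        """Yield successive lists of up to group_size items -- a hypothetical
        helper illustrating the factor-of-10 grouping described above."""
        iterator = iter(iterator)
        while True:
            batch = list(islice(iterator, group_size))
            if not batch:
                return
            yield batch

    # 25 items grouped by 10 -> two full groups and one partial group
    batches = list(grouped(range(25), 10))
    print([len(b) for b in batches])  # [10, 10, 5]
    ```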
    
    2) ```RDD._reserialize``` currently does nothing if the target serializer 
differs from the current one only in batch size. This is due to our notion of 
serializer equality: *"output generated by equal serializers can be 
deserialized using the same serializer."* That is probably fine for operations 
like ```union```, since it avoids unnecessary re-serialization. However, for 
```saveAsPickleFile``` it means the actual batch size used may differ from what 
the user specified (and very likely does, since our current default batch size 
for SparkContext is 1024), which can be confusing. @mateiz, what are your 
thoughts on this?
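    To make the equality point concrete, here is a minimal sketch (a 
hypothetical class, not PySpark's actual ```BatchedSerializer```) of a batched 
serializer whose equality deliberately ignores batch size, which is why 
```_reserialize``` sees the two as equal and skips re-serialization:

    ```python
    class SketchBatchedSerializer:
        """Hypothetical batched serializer whose equality ignores batch size,
        matching the notion that output generated by equal serializers can be
        deserialized using the same serializer."""
        def __init__(self, serializer_name, batch_size):
            self.serializer_name = serializer_name
            self.batch_size = batch_size

        def __eq__(self, other):
            # Batch size is deliberately excluded: batching changes how many
            # objects go into each serialized record, not whether the bytes
            # can be deserialized by the same underlying serializer.
            return (isinstance(other, SketchBatchedSerializer)
                    and self.serializer_name == other.serializer_name)

    current = SketchBatchedSerializer("pickle", 1024)  # e.g. the RDD's serializer
    target = SketchBatchedSerializer("pickle", 10)     # e.g. what the user asked for
    print(current == target)  # True -> a _reserialize-style check would do nothing
    ```

    Under this notion of equality, the user-specified batch size of 10 is 
silently dropped, which is the confusing behavior described above.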

