Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/755#issuecomment-45974068
Dug a little deeper and found the following.
1) ```saveAsPickleFile``` calls ```saveAsObjectFile```, which does its own
grouping by a factor of 10. Since no serialization is involved at that point,
this is probably fine.
2) ```RDD._reserialize``` currently does nothing if the target
serialization differs from the current one only in terms of batch size. This is
due to our notion of serializer equality: *"output generated by equal
serializers can be deserialized using the same serializer."* That behavior is
probably fine for operations like ```union```, since it avoids unnecessary
re-serialization. However, for ```saveAsPickleFile```, it means the actual
batch size used may be different from what the user specified (and very likely
so since our current default batch size for SparkContext is 1024), which can be
confusing. @mateiz, what are your thoughts on this?
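
To make the issue in 2) concrete, here is a minimal self-contained sketch (not
the actual PySpark code; class and attribute names are simplified stand-ins)
of an equality notion that ignores batch size, and why a ```_reserialize```-style
"skip if serializers are equal" check then leaves the user's batch size unused:

```python
class PickleSerializer:
    """Stand-in for the inner serializer; all instances are interchangeable."""
    def __eq__(self, other):
        return isinstance(other, PickleSerializer)


class BatchedSerializer:
    """Stand-in for a batching wrapper around an inner serializer."""
    def __init__(self, serializer, batch_size):
        self.serializer = serializer
        self.batch_size = batch_size

    # Equality deliberately ignores batch_size: "output generated by equal
    # serializers can be deserialized using the same serializer" holds
    # regardless of how many records are grouped per batch.
    def __eq__(self, other):
        return (isinstance(other, BatchedSerializer)
                and self.serializer == other.serializer)


current = BatchedSerializer(PickleSerializer(), 1024)  # e.g. a context default
requested = BatchedSerializer(PickleSerializer(), 10)  # user-specified batch size

# A reserialize step that short-circuits on equality would see no difference
# and skip re-batching, so the data stays in batches of 1024, not 10.
if current == requested:
    print("serializers equal -> re-serialization skipped")
```

This keeps equality cheap for operations like ```union```, but it also means a
batch-size-only change is invisible to any code that relies on that check.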