Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1338#issuecomment-50186945
The batch size is a performance-tuning knob that controls the granularity
of data transfer between Python and Java. Larger batch sizes help to amortize
certain costs associated with pickling / unpickling and with storing pickled
objects in byte arrays. If the batch size is too large, the Python worker may
run out of memory while trying to serialize or deserialize too much data at
once.
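    For illustration only, here is a rough, self-contained sketch of what the write side of batching looks like (not PySpark's actual serializer code; `batched` and `dump_stream` are made-up names for this example). Pickling a whole list of items per call is what amortizes the per-call overhead:

    ```python
    import pickle
    from itertools import islice

    def batched(iterator, batch_size):
        # Hypothetical helper: group an iterator into lists of at most
        # `batch_size` items; the final list may be shorter.
        iterator = iter(iterator)
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            yield batch

    def dump_stream(iterator, stream, batch_size=1024):
        # Pickle one whole batch per call and write it as a length-prefixed
        # byte array; larger batches mean fewer pickle calls and fewer writes,
        # but each batch must fit in memory on the Python worker.
        for batch in batched(iterator, batch_size):
            data = pickle.dumps(batch, protocol=2)
            stream.write(len(data).to_bytes(4, "big"))
            stream.write(data)
    ```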
A batch size of _n_ means that we emit a new batch every _n_ items, but
this doesn't guarantee that every batch will contain exactly _n_ items (for
example, the last (or only) batch may contain fewer than _n_ items). As a
result, the batch deserializer doesn't make any assumptions about how many
items appear in a batch, so it's safe to union two batched datasets as long as
they have the same underlying serializer (such as PickleSerializer).
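    To make that concrete, a minimal read-side sketch (again illustrative, not the real `BatchedSerializer` code) just keeps reading length-prefixed batches and flattening them, without ever looking at the batch length:

    ```python
    import pickle

    def load_stream(stream):
        # Read length-prefixed pickled batches and yield individual items.
        # Nothing here depends on how many items a batch holds, so streams
        # written with different batch sizes deserialize identically.
        while True:
            header = stream.read(4)
            if not header:
                return
            length = int.from_bytes(header, "big")
            for item in pickle.loads(stream.read(length)):
                yield item
    ```

    With a reader like that, unioning two batched streams is just `itertools.chain(load_stream(a), load_stream(b))`, provided both were written with the same underlying pickler.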
I don't remember why we chose these default batch sizes; I suppose we could
run some experiments to sweep across different batch sizes, but I imagine that
the optimal size depends on the data type.
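    If anyone wants to try that, a crude sweep could be as simple as timing batched pickling of representative records at a few candidate sizes (a throwaway sketch under those assumptions, not a rigorous benchmark):

    ```python
    import pickle
    import time

    def sweep_batch_sizes(items, sizes=(16, 256, 1024, 4096)):
        # Rough micro-benchmark: time pickling the same items in batches of
        # each candidate size. Results will depend heavily on the element
        # type and size, so this would need to run on representative data.
        for batch_size in sizes:
            start = time.perf_counter()
            total_bytes = 0
            for i in range(0, len(items), batch_size):
                total_bytes += len(pickle.dumps(items[i:i + batch_size], 2))
            elapsed = time.perf_counter() - start
            print(f"batch_size={batch_size:5d}  {elapsed:.3f}s  {total_bytes} bytes")

    if __name__ == "__main__":
        sweep_batch_sizes([(i, str(i)) for i in range(200_000)])
    ```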