Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1338#issuecomment-50186945
  
    The batch size is a performance-tuning knob that controls the granularity
of data transfer between Python and Java.  Larger batch sizes help to amortize
certain costs associated with pickling / unpickling and with storing pickled
objects in byte arrays.  If the batch size is too large, though, the Python
worker may run out of memory while trying to serialize or deserialize too much
data at once.
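    
    To make the amortization point concrete, here's a toy, pure-Python
comparison (not Spark code; the helper names are just illustrative and the
timings will vary with data type and machine): pickling items one at a time
versus pickling them in batches of 1024.
    
    ```python
    import pickle
    import timeit
    
    data = list(range(10000))
    
    def pickle_per_item(items):
        # One pickle.dumps call per item: many small byte strings, more overhead.
        return [pickle.dumps(x) for x in items]
    
    def pickle_in_batches(items, batch_size=1024):
        # One pickle.dumps call per batch: fewer, larger byte strings.
        return [pickle.dumps(items[i:i + batch_size])
                for i in range(0, len(items), batch_size)]
    
    print("per item:", timeit.timeit(lambda: pickle_per_item(data), number=20))
    print("batched: ", timeit.timeit(lambda: pickle_in_batches(data), number=20))
    ```
    
    The batched version makes far fewer pickle calls and produces fewer,
larger byte arrays, which is where the savings come from.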
    
    A batch size of _n_ means that we emit a new batch every _n_ items, but
this doesn't guarantee that every batch will contain exactly _n_ items (for
example, the last (or only) batch may contain fewer than _n_ items).  As a
result, the batch deserializer doesn't make any assumptions about how many
items appear in a batch, so it's safe to union two batched datasets as long as
they have the same underlying serializer (such as PickleSerializer).
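    
    To illustrate why that's safe, here's a simplified standalone sketch
(hypothetical helpers, not the actual PySpark serializer classes) of how
batched pickling might write and read a stream; note that the read side never
looks at the batch size used by the write side.
    
    ```python
    import io
    import pickle
    from itertools import islice
    
    def dump_stream(iterator, stream, batch_size):
        # Write one pickled list per batch of up to `batch_size` items.
        # The final batch may contain fewer than `batch_size` items.
        it = iter(iterator)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            pickle.dump(batch, stream)
    
    def load_stream(stream):
        # Read pickled lists until EOF and flatten them.  Nothing here
        # depends on the batch size used on the write side.
        while True:
            try:
                batch = pickle.load(stream)
            except EOFError:
                return
            for item in batch:
                yield item
    
    # Streams written with different batch sizes can be concatenated and
    # read back by the same loader, which is why unioning two batched
    # datasets is safe as long as they share the underlying serializer.
    buf = io.BytesIO()
    dump_stream(range(10), buf, batch_size=3)
    dump_stream(range(10, 20), buf, batch_size=7)
    buf.seek(0)
    assert list(load_stream(buf)) == list(range(20))
    ```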
    
    I don't remember why we chose these default batch sizes; I suppose we could 
run some experiments to sweep across different batch sizes, but I imagine that 
the optimal size depends on the data type.

