Github user davies commented on the pull request:
https://github.com/apache/spark/pull/6714#issuecomment-113056954
@JoshRosen Consider the case where used memory is already above the memory
limit (because of a broadcasted object): spilling then starts after the very
first batch (batch = 100), and spilling once every 100 records makes it easy
to hit the maximum-open-files limit (1024 by default).
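To make the scale concrete, here is a back-of-the-envelope sketch; the
200,000-record workload is a made-up number, and one spill file per batch is
the assumption:

```python
# Hypothetical numbers to illustrate why batch = 100 is dangerous once
# memory is already over the limit and every batch gets spilled.
records = 200_000       # assumed workload size (hypothetical)
batch = 100             # spill after every batch of 100 records

spill_files = records // batch   # one spill file per spilled batch
max_open_files = 1024            # typical default `ulimit -n`

print(spill_files)                   # 2000
print(spill_files > max_open_files)  # True: merging would need to open
                                     # more files than the default allows
```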
To balance memory usage and batch size (neither too large nor too small), I
originally thought it was better to adjust the batch size both up and down
(but that version had a bug). After some experiments, I saw memory usage keep
climbing even when we shrank the batch size (because of memory fragmentation).
Finally, I'd like to switch to the simplest approach: assuming the items have
similar sizes, always keep using the batch size reached at the first spill.
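A minimal sketch of that policy, assuming items of roughly uniform size. The
helper parameters `get_used_memory` and `spill_chunk`, and the 1.5x growth
factor with its 10000 cap, are illustrative here, not the exact
pyspark.shuffle code:

```python
import itertools

def sort_with_spill(iterator, memory_limit, get_used_memory, spill_chunk):
    # Grow the batch while no spill has happened; once the first spill
    # occurs, keep that batch size for the rest of the run.
    batch = 100
    spilled, current = [], []
    iterator = iter(iterator)
    while True:
        chunk = list(itertools.islice(iterator, batch))
        current.extend(chunk)
        if len(chunk) < batch:                    # input exhausted
            break
        if get_used_memory() > memory_limit:
            spilled.append(spill_chunk(current))  # spill, then free memory
            current = []
        elif not spilled:
            # No spill yet: the batch can still grow (capped so it never
            # becomes too large before the first spill pins it).
            batch = min(int(batch * 1.5), 10000)
    return spilled, current
```

Once `spilled` is non-empty, `batch` is never touched again, which is the
"batch size at first spill" behavior described above.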
I agree that we could inline `_next_limit()`; for now it looks similar to the
other external classes (ExternalXXX), which is not that bad.
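For context, `_next_limit()` in pyspark/shuffle.py is roughly the following
(paraphrased from memory, so treat the exact constant as an assumption); it
bumps the spill threshold slightly above current usage so one spill does not
immediately trigger the next:

```python
def _next_limit(self):
    # get_used_memory() is the module-level helper in pyspark/shuffle.py;
    # raise the threshold ~5% above current usage, never below the
    # configured memory limit.
    return max(self.memory_limit, get_used_memory() * 1.05)
```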