Github user davies commented on the pull request:
https://github.com/apache/spark/pull/6714#issuecomment-113056954
@JoshRosen Consider the case where used memory is already above the memory
limit (because of a broadcasted object): spilling then starts after the very
first batch (batch = 100), and spilling once every 100 records makes it easy
to hit the maximum-open-files limit (1024 by default).
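To make the scale concrete, here is a back-of-the-envelope sketch; the
200,000-record workload is a made-up number, and one spill file per batch is
the assumption:

```python
# Hypothetical numbers to illustrate why batch = 100 is dangerous once
# memory is already over the limit and every batch gets spilled.
records = 200_000       # assumed workload size (hypothetical)
batch = 100             # spill after every batch of 100 records

spill_files = records // batch   # one spill file per spilled batch
max_open_files = 1024            # typical default `ulimit -n`

print(spill_files)                   # 2000
print(spill_files > max_open_files)  # True: merging would need to open
                                     # more files than the default allows
```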
To balance memory usage and batch size (neither too large nor too small), I
originally thought it was better to adjust the batch size both up and down
(but that version had a bug). After some experiments, I saw memory usage keep
climbing even when we shrank the batch size (because of memory fragmentation).
Finally, I'd like to switch to the simplest approach: assuming the items have
similar sizes, always keep using the batch size reached at the first spill.
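A minimal sketch of that policy, assuming items of roughly uniform size. The
helper parameters `get_used_memory` and `spill_chunk`, and the 1.5x growth
factor with its 10000 cap, are illustrative here, not the exact
pyspark.shuffle code:

```python
import itertools

def sort_with_spill(iterator, memory_limit, get_used_memory, spill_chunk):
    # Grow the batch while no spill has happened; once the first spill
    # occurs, keep that batch size for the rest of the run.
    batch = 100
    spilled, current = [], []
    iterator = iter(iterator)
    while True:
        chunk = list(itertools.islice(iterator, batch))
        current.extend(chunk)
        if len(chunk) < batch:                    # input exhausted
            break
        if get_used_memory() > memory_limit:
            spilled.append(spill_chunk(current))  # spill, then free memory
            current = []
        elif not spilled:
            # No spill yet: the batch can still grow (capped so it never
            # becomes too large before the first spill pins it).
            batch = min(int(batch * 1.5), 10000)
    return spilled, current
```

Once `spilled` is non-empty, `batch` is never touched again, which is the
"batch size at first spill" behavior described above.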
I agree that we could inline `_next_limit()`; for now it looks similar to the
other external classes (ExternalXXX), which is not that bad.
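For context, `_next_limit()` in pyspark/shuffle.py is roughly the following
(paraphrased from memory, so treat the exact constant as an assumption); it
bumps the spill threshold slightly above current usage so one spill does not
immediately trigger the next:

```python
def _next_limit(self):
    # get_used_memory() is the module-level helper in pyspark/shuffle.py;
    # raise the threshold ~5% above current usage, never below the
    # configured memory limit.
    return max(self.memory_limit, get_used_memory() * 1.05)
```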