Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/6220#issuecomment-103627537
  
    An internal `System.gc()` is a pretty good idea.  I was wondering whether a 
one-hour default might be too long, but maybe not:
    
    If the driver fills up with too much in-memory metadata, then the GC will 
kick in and clean it up, so I guess we're only worried about cases where we run 
out of a non-memory resource, such as disk space, because GC wasn't run on the 
driver.  You can probably do a back-of-the-envelope calculation of the right GC 
interval based on your disk capacity and the maximum write throughput of your 
disks: if you have 100 gigabytes of temporary space for shuffle files and can 
only write at a maximum speed of 100 MB/s, then running GC at least once every 
15 minutes should be sufficient to prevent the disks from filling up (since 
100 gigabytes / (100 megabytes / second) = 1000 seconds, or roughly 16.7 
minutes to fill the disks).
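
    For illustration, here's a minimal sketch of that back-of-the-envelope 
calculation; the 100 GB / 100 MB/s figures are just the example numbers from 
above, not Spark defaults or measured values:

```scala
object GcIntervalEstimate {
  // Seconds needed to fill the disk at the given sustained write rate.
  def secondsToFillDisk(diskCapacityBytes: Long, maxWriteBytesPerSec: Long): Double =
    diskCapacityBytes.toDouble / maxWriteBytesPerSec

  def main(args: Array[String]): Unit = {
    val capacity   = 100L * 1000 * 1000 * 1000 // 100 GB of temporary shuffle space (example)
    val throughput = 100L * 1000 * 1000        // 100 MB/s maximum write speed (example)
    val seconds = secondsToFillDisk(capacity, throughput)
    // ~1000 seconds, i.e. roughly 16.7 minutes to fill the disks, so running GC
    // at least every 15 minutes keeps cleanup ahead of disk exhaustion here.
    println(f"Time to fill disks: ${seconds / 60}%.1f minutes")
  }
}
```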

