GitHub user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/14039
  
    I haven't got anything more concrete to offer at this time than the
descriptions in the relevant JIRAs, but I do have this running in production
with 1.6, and it does work.  Essentially, you build a cache in your application
whose keys are a canonicalization of query fragments and whose values are the
RDDs associated with those fragments of the logical plan, which produce the
shuffle files.  For as long as you hold references to those RDDs in your
cache, Spark won't remove the shuffle files.  And for as long as the OS has
sufficient memory available, those shuffle files will be served from the OS
buffer cache, which is actually pretty quick and doesn't involve Java heap
management or garbage collection at all.  That was the original motivation for
using shuffle files in this way, before off-heap caching and unified memory
management were available.  It's less necessary now (at least once I figure
out how to do the mapping between logical plan fragments and tables cached
off-heap), but it is still a valid alternative caching mechanism.
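For illustration only, here is a minimal sketch of what such a cache might look like against the Spark 1.6 DataFrame API. The names here (`FragmentCache`, `getOrCompute`, `evict`, and the trivial `canonicalize`) are hypothetical, not anything in Spark itself; the one piece of real Spark behavior it relies on is that the ContextCleaner only removes shuffle files once the owning RDD becomes unreachable:

```scala
import scala.collection.concurrent.TrieMap

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical cache: keys are canonical renderings of logical-plan
// fragments, values are the RDDs whose shuffle files we want to keep alive.
class FragmentCache {
  private val cache = TrieMap.empty[String, RDD[Row]]

  // Return the RDD for this query fragment, computing it (and thereby
  // writing its shuffle files) on first use. Holding the strong reference
  // here is what keeps Spark from cleaning up those shuffle files.
  def getOrCompute(df: DataFrame): RDD[Row] =
    cache.getOrElseUpdate(canonicalize(df), {
      val rdd = df.rdd
      rdd.count()  // materialize once so the shuffle files exist on disk
      rdd
    })

  // Placeholder canonicalization: a real implementation would normalize
  // expression IDs, aliases, and ordering in the plan before rendering it,
  // so that equivalent fragments hash to the same key.
  private def canonicalize(df: DataFrame): String =
    df.queryExecution.analyzed.toString

  // Dropping the reference makes the RDD, and hence its shuffle files,
  // eligible for cleanup again.
  def evict(df: DataFrame): Unit = cache.remove(canonicalize(df))
}
```

The canonicalization is the hard part, as noted above: the naive `toString` shown here would miss fragments that are semantically identical but differ in expression IDs or ordering.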

