holdenk commented on a change in pull request #28038: [SPARK-31208][CORE] Add
an expiremental cleanShuffleDependencies
URL: https://github.com/apache/spark/pull/28038#discussion_r402525720
##########
File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala
##########
@@ -1700,6 +1723,40 @@ abstract class RDD[T: ClassTag](
}
}
+ /**
+ * :: Experimental ::
+ * Marks an RDD's shuffles and it's non-persisted ancestors as no longer needed.
+ * This cleans up shuffle files aggressively to allow nodes to be terminated.
+ * If the RDD will still be used downstream checkpoint and materialize it first.
+ * If you are uncertain of what you are doing please do not use this feature.
+ * Additional techniques for mitigating orphaned shuffle files:
+ * * Tuning the driver GC to be more aggressive so the regular context cleaner is triggered
+ * * Setting an appropriate TTL for shuffle files to be auto cleaned
+ */
+ @Experimental
+ @DeveloperApi
+ @Since("3.1.0")
+ def cleanShuffleDependencies(blocking: Boolean = false): Unit = {
+ sc.cleaner.foreach { cleaner =>
+ /**
+ * Clean the shuffles & all of its parents.
+ */
+ def cleanEagerly(dep: Dependency[_]): Unit = {
+ if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) {
+ val shuffleId = dep.asInstanceOf[ShuffleDependency[_, _, _]].shuffleId
+ cleaner.doCleanupShuffle(shuffleId, blocking)
+ }
+ val rdd = dep.rdd
+ val rddDepsOpt = rdd.internalDependencies
+ if (rdd.getStorageLevel == StorageLevel.NONE) {
Review comment:
If someone has persisted an RDD and not unpersisted it, we assume they intend
to reuse it, and cleaning the shuffle files behind it would be unwise. This is
important with the ALS code path (the original reason for this feature): a
persisted RDD acts as a cut point in the graph, so we don't go unexpectedly
cleaning up users' shuffle files.
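
To make the cut-point behaviour concrete, here is a minimal sketch of the same traversal on a toy lineage graph. It does not use Spark at all: `Node`, `NarrowDep`, `ShuffleDep`, and `shufflesToClean` are hypothetical stand-ins invented for this illustration, and the `persisted` flag plays the role of `getStorageLevel != StorageLevel.NONE`. It only collects the shuffle ids that would be cleaned instead of calling the context cleaner.

```scala
// Toy stand-ins for an RDD and its dependencies (hypothetical, not Spark classes).
sealed trait Dep { def parent: Node }
final case class NarrowDep(parent: Node) extends Dep
final case class ShuffleDep(shuffleId: Int, parent: Node) extends Dep

// `persisted` models "storage level is not NONE" on the real RDD.
final case class Node(name: String, persisted: Boolean, deps: List[Dep])

object CutPointSketch {
  /** Returns the shuffle ids an eager clean starting at `root` would remove. */
  def shufflesToClean(root: Node): List[Int] = {
    def visit(dep: Dep): List[Int] = {
      // A shuffle dependency reached by the walk is always cleaned.
      val own = dep match {
        case ShuffleDep(id, _) => List(id)
        case _                 => Nil
      }
      val parent = dep.parent
      // A persisted parent is a cut point: do not walk into its dependencies.
      val upstream = if (!parent.persisted) parent.deps.flatMap(visit) else Nil
      own ++ upstream
    }
    root.deps.flatMap(visit)
  }

  def main(args: Array[String]): Unit = {
    val source = Node("source", persisted = false, Nil)
    val cached = Node("cached", persisted = true, List(ShuffleDep(1, source)))
    val mid    = Node("mid", persisted = false, List(ShuffleDep(2, cached)))
    val leaf   = Node("leaf", persisted = false, List(ShuffleDep(3, mid)))
    // Shuffles 3 and 2 are reached; shuffle 1 sits behind the persisted node.
    println(shufflesToClean(leaf))
  }
}
```

In this toy graph the walk stops at `cached`, so the shuffle that produced it is left alone and the persisted data can still be recomputed from it if blocks are lost.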