Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/126#discussion_r10702886
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1025,6 +1025,14 @@ abstract class RDD[T: ClassTag](
checkpointData.flatMap(_.getCheckpointFile)
}
+ def cleanup() {
--- End diff --
Hope I am missing something here ...
After invoking cleanup, any use of rddA would result in errors. Coupled with
lazy execution, users can actually end up cleaning up RDDs which have
not yet been 'used'.
We should defer exposing this api until we have more clarity on this.
In particular, cleanup should ensure that all pending jobs which require
the rdd should have finished.
Contrived example:
--
rdd3 = rdd1.join(rdd2)
rdd1.count()
rdd1.cleanup()
...
rdd3.count()
--
would cause issues for rdd3. This is, arguably, bad code from a Spark dev's
point of view - but unlike with unpersist, where we can recover at a
performance penalty, with cleanup we will fail outright. From a (naive ?)
user's point of view, rdd1 has already been "used" by the time it is getting
cleaned up.
The best option would be to throw an exception when cleanup is invoked
in cases like this, and to throw the same exception if the rdd is subsequently
used (for any op) after a cleanup; the latter might be a more involved change
though.
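To make the suggestion concrete, a minimal sketch of that fail-fast behavior (names like TrackedRdd and CleanedUpException are illustrative only, not from this PR; a real implementation would also have to consult the scheduler for pending jobs that reference the RDD):

```scala
// Hypothetical sketch -- not the API in this PR. Illustrates the two
// suggested checks: cleanup() fails if the handle is already cleaned up,
// and any subsequent action throws the same exception type.
class CleanedUpException(msg: String) extends IllegalStateException(msg)

class TrackedRdd(val name: String) {
  @volatile private var cleanedUp = false

  private def assertLive(): Unit =
    if (cleanedUp) throw new CleanedUpException(s"$name was already cleaned up")

  // Every action would check liveness before running the real computation.
  def count(): Long = { assertLive(); 0L }

  def cleanup(): Unit = { assertLive(); cleanedUp = true }
}
```

Under this sketch, rdd1.cleanup() followed by rdd3.count() (which depends on rdd1's lineage) would surface a CleanedUpException instead of an obscure downstream failure.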