Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/126#discussion_r10681316
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1025,6 +1025,14 @@ abstract class RDD[T: ClassTag](
checkpointData.flatMap(_.getCheckpointFile)
}
+ def cleanup() {
--- End diff ---
If I understand the code in
[CoGroupedRDD](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala#L84)
correctly, a new dependency object is created every time a CoGroupedRDD is
created (join uses cogroup underneath). So even though rddA and rddB both
depend on the same rdd1, they should not share the shuffle dependency.
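To make that concrete, here is a minimal toy sketch (hypothetical `Toy*` names, not the real Spark classes) of the pattern in `CoGroupedRDD.getDependencies`: each CoGroupedRDD instance constructs its own fresh dependency objects, so two cogroups over the same parent never share one.

```scala
// Toy model of the CoGroupedRDD dependency pattern (names are illustrative).
class ToyRDD(val name: String)

class ToyShuffleDependency(val rdd: ToyRDD)

class ToyCoGroupedRDD(val parents: Seq[ToyRDD]) {
  // Mirrors the idea in getDependencies: a *new* dependency object is
  // created for each parent, every time a ToyCoGroupedRDD is constructed.
  val dependencies: Seq[ToyShuffleDependency] =
    parents.map(p => new ToyShuffleDependency(p))
}

val rdd1 = new ToyRDD("rdd1")
val rddA = new ToyCoGroupedRDD(Seq(rdd1)) // e.g. rdd1 joined with one RDD
val rddB = new ToyCoGroupedRDD(Seq(rdd1)) // e.g. rdd1 joined with another

// Both cogroups reference the same parent RDD instance...
assert(rddA.dependencies.head.rdd eq rddB.dependencies.head.rdd)
// ...but their ShuffleDependency objects are distinct, not shared.
assert(!(rddA.dependencies.head eq rddB.dependencies.head))
```

So cleaning up the dependency held by rddA would not, under this reading, invalidate the one held by rddB.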
Regarding the new code snippet: yes, that would cause problems for rdd2, but
the same problem can already be triggered today with rdd1.unpersist().