Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/126#discussion_r10702886
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1025,6 +1025,14 @@ abstract class RDD[T: ClassTag](
checkpointData.flatMap(_.getCheckpointFile)
}
+ def cleanup() {
--- End diff --
Hope I am missing something here ...
After invoking cleanup, any use of rddA would result in errors. Coupled with
lazy execution, users can actually end up cleaning up RDDs which have
not yet been 'used'.
We should defer exposing this api until we have more clarity on this.
In particular, cleanup should ensure that all pending jobs which require
the rdd should have finished.
Contrived example:
--
rdd3 = rdd1.join(rdd2)
rdd1.count()
rdd1.cleanup()
...
rdd3.count()
--
would cause issues for rdd3. This is, arguably, bad code from a Spark dev's
point of view - but unlike with unpersist, where we can recover at a
performance penalty, with cleanup we will fail outright. From a (naive ?)
user's point of view, rdd1 has already been "used" by the time it is getting
cleaned up.
The best option would be to throw an exception when cleanup is invoked
in cases like this, and to throw the same exception if the rdd is subsequently
used (for any op) after a cleanup; the latter might be a more involved change
though.
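To make the suggestion concrete, a minimal sketch of that fail-fast behavior (names like TrackedRdd and CleanedUpException are illustrative only, not from this PR; a real implementation would also have to consult the scheduler for pending jobs that reference the RDD):

```scala
// Hypothetical sketch -- not the API in this PR. Illustrates the two
// suggested checks: cleanup() fails if the handle is already cleaned up,
// and any subsequent action throws the same exception type.
class CleanedUpException(msg: String) extends IllegalStateException(msg)

class TrackedRdd(val name: String) {
  @volatile private var cleanedUp = false

  private def assertLive(): Unit =
    if (cleanedUp) throw new CleanedUpException(s"$name was already cleaned up")

  // Every action would check liveness before running the real computation.
  def count(): Long = { assertLive(); 0L }

  def cleanup(): Unit = { assertLive(); cleanedUp = true }
}
```

Under this sketch, rdd1.cleanup() followed by rdd3.count() (which depends on rdd1's lineage) would surface a CleanedUpException instead of an obscure downstream failure.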