Github user szhem commented on the issue:

    https://github.com/apache/spark/pull/19373
  
    @felixcheung
    
    > It is deleting earlier checkpoint after the current checkpoint is called 
though?
    
    Currently `PeriodicCheckpointer` can fail when checkpointing RDDs that depend on each other, as in the sample below.
    ```scala
    // create a periodic checkpointer with an interval of 2
    val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)
    val rdd1 = createRDD(sc)

    // on the second update rdd1 is checkpointed
    checkpointer.update(rdd1)
    checkpointer.update(rdd1)
    // the action materializes the checkpoint files of rdd1
    rdd1.count()

    // rdd2 depends on rdd1
    val rdd2 = rdd1.filter(_ => true)
    checkpointer.update(rdd2)
    // on the second update rdd2 is marked for checkpointing
    // and the checkpoint files of rdd1 are deleted
    checkpointer.update(rdd2)
    // the action now has to read the already removed checkpoint files of rdd1
    rdd2.count()
    ```
    The problem is that the checkpoint files of an already checkpointed and materialized RDD are deleted while another RDD still depends on them.
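    To make the failure mode concrete, here is a simplified paraphrase of the cleanup step in `PeriodicCheckpointer.update` (a sketch, not the exact Spark source; `checkpointQueue`, `updateCount` and `removeCheckpointFile` stand in for the checkpointer's internal state):

    ```scala
    // Sketch: how old checkpoints get removed too early.
    def update(newData: RDD[_]): Unit = {
      updateCount += 1
      if (updateCount % checkpointInterval == 0) {
        // checkpoint() only *marks* newData; its files are written
        // by the next action that materializes it
        newData.checkpoint()
        checkpointQueue.enqueue(newData)
        // Older checkpoints are removed as soon as their own files exist,
        // without verifying that newData's checkpoint has been materialized,
        // so a parent's files can disappear while newData's lineage still needs them.
        while (checkpointQueue.size > 1 && checkpointQueue.head.isCheckpointed) {
          removeCheckpointFile(checkpointQueue.dequeue())
        }
      }
    }
    ```

    Because the new RDD is only marked and not yet materialized, the deletion races ahead of the data it still depends on.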
    
    If RDDs are cached before checkpointing (as is often recommended), the issue is usually not visible, because the checkpointed RDD is read from the cache rather than from the materialized checkpoint files.
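    As an illustration of that masking effect (a hypothetical snippet reusing the `createRDD` helper from the sample above):

    ```scala
    import org.apache.spark.storage.StorageLevel

    val rdd = createRDD(sc)
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // cache before checkpointing
    rdd.checkpoint()
    rdd.count() // materializes both the cached blocks and the checkpoint files

    // Later actions read the cached blocks, so the checkpoint files are only
    // needed again if those blocks are evicted -- which is exactly when the
    // FileNotFoundException surfaces.
    ```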
    
    A good example of this behaviour is described in PR #19410, where GraphX fails with `FileNotFoundException` under insufficient memory: cached blocks of checkpointed and materialized RDDs are evicted from memory, causing those RDDs to be read from already deleted checkpoint files.
    
    > is this just an issue with DataSet.checkpoint(eager = true)?
    
    This PR does not modify the `Dataset` API; it affects mainly 
`PeriodicCheckpointer` and `PeriodicRDDCheckpointer`. 
    It was created as a preliminary PR to #19410 (where GraphX fails 
when reading cached RDDs that have been evicted from memory).


