GitHub user WeichenXu123 opened a pull request:

    [SPARK-19215] Add necessary check for `RDD.checkpoint` to avoid potential 

    ## What changes were proposed in this pull request?
    Currently `RDD.checkpoint` must be called before any job has been executed
    on the RDD; otherwise `doCheckpoint` will never be called. This is a
    pitfall: we should check for it and throw an exception (or at least log a
    warning?) in such cases.

    Also, if the RDD has not been persisted, checkpointing causes the RDD to be
    recomputed, because the current implementation runs a separate job for
    checkpointing. I think we should print a warning in this case as well,
    reminding the user to check whether they forgot to persist the RDD.
    ## How was this patch tested?
    Test case 1:
    val rdd = sc.makeRDD(Array(1, 2, 3), 3)
    rdd.count()
    // `rdd.count()` has already run, so the checkpoint would never take
    // effect; with this patch, this call throws an exception.
    rdd.checkpoint()
    Test case 2:
    val rdd = sc.makeRDD(Array(1, 2, 3), 3).map(_ + 10)
    // `rdd` is not persisted, so checkpointing will recompute it;
    // with this patch, a warning is printed here.
    rdd.checkpoint()
    rdd.count() // triggers `doCheckpoint`
    Test case 3:
    val rdd = sc.makeRDD(Array(1, 2, 3), 3).map(_ + 10)
    rdd.persist()
    // This is the correct usage; no warning is printed in this case.
    rdd.checkpoint()
    rdd.count() // triggers `doCheckpoint`
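
For anyone who wants to try the intended ordering outside the test suite, here is a minimal sketch of the persist, checkpoint, then first-action sequence as a standalone program. The object name `CheckpointOrderSketch`, the helper `checkpointedRdd`, the `local[2]` master, and the temporary checkpoint directory are assumptions made for illustration; none of them are part of this patch.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical standalone driver; not part of the patch.
object CheckpointOrderSketch {

  // Applies the ordering described above and returns the resulting RDD.
  def checkpointedRdd(sc: SparkContext): RDD[Int] = {
    val rdd = sc.makeRDD(Array(1, 2, 3), 3).map(_ + 10)
    rdd.persist()    // keep computed partitions so the checkpoint job can reuse them
    rdd.checkpoint() // only marks the RDD; must be called before the first action
    rdd.count()      // first action: runs the job, after which the checkpoint is written
    rdd
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for trying this out.
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-order-sketch")
    val sc = new SparkContext(conf)
    try {
      // A checkpoint directory must be set before calling `checkpoint()`.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
      val rdd = checkpointedRdd(sc)
      println(s"isCheckpointed = ${rdd.isCheckpointed}")
    } finally {
      sc.stop()
    }
  }
}
```

Moving `rdd.count()` above `rdd.checkpoint()` in the helper reproduces the situation from Test case 1, and dropping `rdd.persist()` reproduces Test case 2.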

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16576

commit 70fbfb07adbaca17831fd736661135f2d7b2b0e0
Author: WeichenXu <>
Date:   2017-01-13T14:17:28Z

    add check and warning msg for rdd checkpoint

