GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/16576

    [SPARK-19215] Add necessary check for `RDD.checkpoint` to avoid potential mistakes

    ## What changes were proposed in this pull request?
    
    Currently, `RDD.checkpoint` must be called before any job is executed on
    the RDD; otherwise `doCheckpoint` will never be called. This is a pitfall,
    so we should check for this case and throw an exception (or at least log a
    warning?).
    Also, if the RDD has not been persisted, checkpointing causes the RDD to be
    recomputed, because the current implementation runs a separate job for
    checkpointing. I think this case should also print a warning message
    reminding the user to check whether they forgot to persist the RDD.
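    
    The two checks described above can be sketched with a toy model in plain
    Scala. This is only an illustration under stated assumptions: `ToyRDD`,
    `runJob`, `persisted`, and `warnings` are hypothetical names invented for
    this sketch, and the code is neither Spark's implementation nor the code
    in this patch.
    
    ```scala
    import scala.collection.mutable.ListBuffer

    // Toy model only: ToyRDD and all of its members are hypothetical, not Spark classes.
    final class ToyRDD(val persisted: Boolean) {
      private var jobAlreadyRun = false        // set once any action has run
      private var checkpointRequested = false  // set by checkpoint()
      val warnings: ListBuffer[String] = ListBuffer.empty[String]

      // Stands in for an action such as count(). In this model a checkpoint is
      // only written as part of the first job, so a later request is never honoured.
      // Returns true if this job wrote the checkpoint.
      def runJob(): Boolean = {
        val checkpointed = !jobAlreadyRun && checkpointRequested
        jobAlreadyRun = true
        checkpointed
      }

      def checkpoint(): Unit = {
        // Check 1: a job has already run, so the checkpoint would never take effect.
        if (jobAlreadyRun)
          throw new IllegalStateException(
            "checkpoint() called after a job already ran on this RDD")
        // Check 2: the RDD is not persisted, so checkpointing would recompute it.
        if (!persisted)
          warnings += "RDD is not persisted; checkpointing will recompute it"
        checkpointRequested = true
      }
    }
    ```
    
    In this model, calling `checkpoint()` after `runJob()` throws, calling it on
    an unpersisted `ToyRDD` records one warning, and calling it on a persisted
    one records none.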
    
    ## How was this patch tested?
    
    Manual.
    
    Test case 1:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3)
    rdd.count()
    // Because `rdd.count()` has already run, the checkpoint would never take
    // effect, so with this patch the call below throws an exception.
    rdd.checkpoint()
    ```
    
    Test case 2:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
    // Because `rdd` is not persisted, checkpointing will recompute this RDD,
    // so with this patch the call below prints a warning.
    rdd.checkpoint()
    rdd.count() // trigger `doCheckpoint`
    ```
    
    Test case 3:
    ```
    val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
    rdd.persist()
    rdd.checkpoint() // correct usage: this patch prints no warning here
    rdd.count() // trigger `doCheckpoint`
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark add_check_and_warning_for_rdd_checkpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16576.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16576
    
----
commit 70fbfb07adbaca17831fd736661135f2d7b2b0e0
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2017-01-13T14:17:28Z

    add check and warning msg for rdd checkpoint

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
