GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/16576
[SPARK-19215] Add necessary check for `RDD.checkpoint` to avoid potential
mistakes
## What changes were proposed in this pull request?
Currently, `RDD.checkpoint` must be called before any job is executed on the
RDD; otherwise `doCheckpoint` will never be called. This is a pitfall, so we
should check for it and throw an exception (or at least log a warning?) in
that case.
Also, if the RDD has not been persisted, checkpointing will cause the RDD to
be recomputed, because the current implementation runs a separate job for
checkpointing. I think this case should also log a warning message that
reminds the user to check whether they forgot to persist the RDD.
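The two checks described above can be sketched with a small toy model. This is only an illustration of the intended behavior, not the code in this patch and not Spark's `RDD` class: `ToyRdd`, its flags, and its `warnings` buffer are all hypothetical names, and no Spark dependency is used.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical toy model of the proposed checks; NOT Spark's RDD class.
final class ToyRdd(val name: String) {
  private var jobExecuted = false          // set once any action has run
  private var persisted = false            // set by persist()
  private var checkpointRequested = false  // set by checkpoint()
  private var checkpointed = false         // set when the toy "doCheckpoint" runs

  // Collected warning messages, so callers can inspect them.
  val warnings: ListBuffer[String] = ListBuffer.empty

  def persist(): this.type = { persisted = true; this }

  def checkpoint(): Unit = {
    // Check 1: a job has already run, so the checkpoint would never take effect.
    if (jobExecuted) {
      throw new IllegalStateException(
        s"checkpoint() on '$name' must be called before any job is executed on it")
    }
    // Check 2: without persist(), the separate checkpoint job recomputes the data.
    if (!persisted) {
      warnings += s"'$name' is not persisted; checkpointing will recompute it"
    }
    checkpointRequested = true
  }

  // Stand-in for an action such as count(): runs a "job", then checkpoints if requested.
  def count(): Long = {
    jobExecuted = true
    if (checkpointRequested) checkpointed = true
    3L // fixed size, matching the three-element arrays in the test cases
  }

  def isCheckpointed: Boolean = checkpointed
}
```

In this sketch, calling `count()` before `checkpoint()` throws, calling `checkpoint()` without `persist()` records one warning, and `persist()` followed by `checkpoint()` and `count()` completes silently, which mirrors the three manual test cases.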
## How was this patch tested?
Manual.
Test case 1:
```
val rdd = sc.makeRDD(Array(1,2,3),3)
rdd.count()
// Because `rdd.count` has already executed, the checkpoint would never take
// effect, so this patch throws an exception here.
rdd.checkpoint()
```
Test case 2:
```
val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
// `rdd` is not persisted, so checkpointing will recompute it;
// this patch prints a warning here.
rdd.checkpoint()
rdd.count() // triggers `doCheckpoint`
```
Test case 3:
```
val rdd = sc.makeRDD(Array(1,2,3),3).map(_ + 10)
rdd.persist()
// This is the correct usage; this patch prints no warning here.
rdd.checkpoint()
rdd.count() // triggers `doCheckpoint`
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark add_check_and_warning_for_rdd_checkpoint
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16576.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16576
----
commit 70fbfb07adbaca17831fd736661135f2d7b2b0e0
Author: WeichenXu <[email protected]>
Date: 2017-01-13T14:17:28Z
add check and warning msg for rdd checkpoint
----