Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/2956#discussion_r19391269
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1204,6 +1204,8 @@ abstract class RDD[T: ClassTag](
} else if (checkpointData.isEmpty) {
checkpointData = Some(new RDDCheckpointData(this))
checkpointData.get.markForCheckpoint()
+ // There is supposed to be doCheckpoint in the following, reset doCheckpointCalled first
+ doCheckpointCalled = false
--- End diff --
From the docs, it's clear that `checkpoint()` is not intended to be called
after operations have executed on the RDD. These changes hack around that so
it doesn't fail directly, but are you certain this is valid? What about race
conditions and so on? What's the point of `doCheckpointCalled` after this
change, really? The criterion seems to collapse to "allow checkpoint if no
checkpoint data has been written". If it's that easy, I do wonder why it
wasn't this way in the first place.
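
To make the concern concrete, here is a rough toy model of the guard, not
Spark's actual code. `ToyRdd`, `runJob`, `written`, and the
`resetFlagOnCheckpoint` switch are hypothetical names invented for this sketch;
only `checkpointData` and `doCheckpointCalled` come from the diff. It assumes
the flag is set the first time a job finishes, and it is single-threaded, so it
says nothing about the race-condition question.

```scala
// Toy model only: ToyRdd is hypothetical and is NOT Spark's RDD class.
final class ToyRdd(resetFlagOnCheckpoint: Boolean) {
  // Stands in for Option[RDDCheckpointData]; the String payload is a placeholder.
  private var checkpointData: Option[String] = None
  private var doCheckpointCalled = false
  // Becomes true once this model "writes" a checkpoint.
  var written = false

  def checkpoint(): Unit = {
    if (checkpointData.isEmpty) {
      checkpointData = Some("marked")
      // Models the line added in the diff.
      if (resetFlagOnCheckpoint) doCheckpointCalled = false
    }
  }

  // Assumption of this sketch: invoked once at the end of every job.
  private def doCheckpoint(): Unit = {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) written = true
    }
  }

  def runJob(): Unit = doCheckpoint()
}

object CheckpointGuardDemo {
  // Runs a job, requests a checkpoint late, runs another job,
  // and reports whether the checkpoint was written.
  def writtenAfterLateCheckpoint(reset: Boolean): Boolean = {
    val rdd = new ToyRdd(reset)
    rdd.runJob()     // an action executes before checkpoint() is requested
    rdd.checkpoint() // late request
    rdd.runJob()
    rdd.written
  }

  def main(args: Array[String]): Unit = {
    println(s"without reset: ${writtenAfterLateCheckpoint(reset = false)}")
    println(s"with reset: ${writtenAfterLateCheckpoint(reset = true)}")
  }
}
```

In this model, once the reset is in place the flag no longer blocks a late
`checkpoint()`; the only condition left is `checkpointData.isEmpty`, which is
the collapse I'm asking about.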