GitHub user szhem opened a pull request:
https://github.com/apache/spark/pull/19373
[SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs
## What changes were proposed in this pull request?
Fix for [SPARK-22150](https://issues.apache.org/jira/browse/SPARK-22150)
JIRA issue.
When an RDD being checkpointed depends on a previously checkpointed RDD
(for example, in iterative algorithms), PeriodicCheckpointer removes the
checkpoint files of the already materialized RDD too early, which leads to
a FileNotFoundException.
Consider the following snippet:
// create a periodic checkpointer with interval of 2
val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)
val rdd1 = createRDD(sc)
checkpointer.update(rdd1)
// on the second update rdd1 is checkpointed
checkpointer.update(rdd1)
// on action the checkpointed rdd is materialized and its lineage is truncated
rdd1.count()
// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)
checkpointer.update(rdd2)
// on the second update rdd2 is checkpointed and the checkpoint files of rdd1 are deleted
checkpointer.update(rdd2)
// on action it is necessary to read the already removed checkpoint files of rdd1
rdd2.count()
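The ordering problem in the snippet above can be reproduced without Spark in a small toy model. The sketch below is only an illustration: `ToyRDD`, `ToyCheckpointer`, the `waitForSuccessor` flag and the object name are all hypothetical and are not part of Spark, and the guard shown (keep the older checkpoint until the next queued one is materialized) is one possible way to defer removal, not necessarily what this patch implements.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

object ToyCheckpointDemo {

  // Hypothetical stand-in for an RDD: `parent` is its lineage.
  final class ToyRDD(val name: String, val parent: Option[ToyRDD] = None) {
    private var marked = false   // checkpoint() was requested
    var materialized = false     // an action has written the checkpoint files
    var filesExist = false       // the checkpoint files are still on "disk"

    def checkpoint(): Unit = { marked = true }

    // A materialized RDD is read from its files; otherwise walk up the lineage.
    private def compute(): Unit =
      if (materialized) {
        if (!filesExist)
          throw new FileNotFoundException(s"checkpoint files of $name were removed")
      } else parent.foreach(_.compute())

    // An "action": computes the RDD, then writes the checkpoint if one was requested.
    def count(): Unit = {
      compute()
      if (marked && !materialized) { materialized = true; filesExist = true }
    }
  }

  // Hypothetical checkpointer: checkpoints on every `interval`-th update.
  final class ToyCheckpointer(interval: Int, waitForSuccessor: Boolean) {
    private val queue = mutable.Queue.empty[ToyRDD]
    private var updates = 0

    def update(rdd: ToyRDD): Unit = {
      updates += 1
      if (updates % interval == 0) {
        rdd.checkpoint()
        queue.enqueue(rdd)
        // Eager mode removes an older materialized checkpoint immediately.
        // Guarded mode also requires the next queued RDD to be materialized.
        while (queue.size > 1 && queue.head.materialized &&
               (!waitForSuccessor || queue(1).materialized)) {
          queue.dequeue().filesExist = false
        }
      }
    }
  }

  // Replays the sequence of calls from the snippet; true if rdd2.count() succeeds.
  def scenario(waitForSuccessor: Boolean): Boolean = {
    val checkpointer = new ToyCheckpointer(2, waitForSuccessor)
    val rdd1 = new ToyRDD("rdd1")
    checkpointer.update(rdd1)
    checkpointer.update(rdd1)
    rdd1.count()
    val rdd2 = new ToyRDD("rdd2", Some(rdd1))
    checkpointer.update(rdd2)
    checkpointer.update(rdd2)
    try { rdd2.count(); true } catch { case _: FileNotFoundException => false }
  }

  def main(args: Array[String]): Unit = {
    println(s"eager removal succeeds: ${scenario(waitForSuccessor = false)}")   // false
    println(s"guarded removal succeeds: ${scenario(waitForSuccessor = true)}")  // true
  }
}
```

In the eager mode the files of rdd1 are dropped on the fourth update, before rdd2 has been materialized, so the final count() fails; with the guard the files survive until rdd2 has its own checkpoint.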
## How was this patch tested?
Unit tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/szhem/spark SPARK-22150-early-checkpoints
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19373.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19373
----
commit 0c3338cd645f5824f08fe37fd7174e25c416529b
Author: Sergey Zhemzhitsky <[email protected]>
Date: 2017-09-27T21:33:18Z
[SPARK-22150][CORE] preventing too early removal of checkpoints in case of
dependant RDDs
----