[
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-4094.
------------------------------
Resolution: Won't Fix
> checkpoint should still be available after rdd actions
> ------------------------------------------------------
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Zhang, Liye
> Assignee: Zhang, Liye
>
> rdd.checkpoint() must be called before any action on the RDD; if any
> action has already run, the checkpoint never succeeds. Take the following
> code as an example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This RDD is never checkpointed. That is a problem for algorithms with many
> iterations. Graph algorithms, for example, run many iterations, which makes
> the RDD lineage very long, so the RDD may need to be checkpointed after a
> certain number of iterations. If there is also an action inside the
> iteration loop, checkpoint() never works in any iteration after the one
> that calls the action.
> This does not happen with RDD caching. Caching always succeeds, regardless
> of whether any actions ran before cache().
> So rdd.checkpoint() should behave the same way as rdd.cache().
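
A minimal sketch of the ordering described in the issue, assuming Spark 1.1.x
running in local mode: checkpoint() is called on each RDD before the first
action on that RDD, and inside a loop every iteration derives a new RDD so the
call can still be made before that RDD's first action. The object name, the
checkpoint directory, the loop body, and the interval of 10 iterations are
assumptions for illustration and are not part of the original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointBeforeAction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-before-action")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Hypothetical directory; on a cluster this should be a reliable
    // filesystem such as HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    try {
      // Mark the RDD for checkpointing BEFORE the first action on it.
      val rdd = sc.makeRDD(1 to 1000)
      rdd.checkpoint()
      rdd.count() // first action; the checkpoint is written as part of this job
      println(s"checkpointed: ${rdd.isCheckpointed}")

      // Iterative case: each iteration builds a new RDD, so checkpoint()
      // can be called on it before any action has touched it.
      var current: RDD[Int] = rdd
      for (i <- 1 to 30) {
        val next = current.map(_ + 1)
        if (i % 10 == 0) next.checkpoint() // assumed interval, truncates lineage
        next.count()                       // action inside the loop
        current = next
      }
    } finally {
      sc.stop()
    }
  }
}
```

The value of isCheckpointed is printed rather than asserted, since when the
checkpoint is actually written depends on the Spark version in use.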
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)