[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhang, Liye updated SPARK-4094:
-------------------------------
Description:
rdd.checkpoint() must be called before any action on the RDD. If any action
has already run on it, the checkpoint never succeeds. For example:
*rdd = sc.makeRDD(...)*
*rdd.collect()*
*rdd.checkpoint()*
*rdd.count()*
This RDD is never checkpointed. This is a problem for algorithms with many
iterations. In graph algorithms, for example, many iterations make the RDD
lineage very long, so the RDD may need to be checkpointed after a certain
number of iterations. If there is also an action inside the iteration loop,
checkpoint() never takes effect in any iteration after the one that calls
the action.
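The sequence above can be turned into a small program that reports whether the
checkpoint was written. This is only a sketch: the object name, the local
master setting, and the checkpoint directory are assumptions made for
illustration and are not part of the report.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointAfterAction {
  // Runs the sequence from the report and returns (count, isCheckpointed).
  def run(sc: SparkContext): (Long, Boolean) = {
    val rdd = sc.makeRDD(1 to 100)
    rdd.collect()        // first action, issued before checkpoint() is requested
    rdd.checkpoint()     // only marks the RDD; nothing is written yet
    val n = rdd.count()  // second action; per the report, no checkpoint is written here
    (n, rdd.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, chosen only so the sketch is self-contained.
    val conf = new SparkConf().setAppName("checkpoint-after-action").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir("/tmp/spark-4094-checkpoints") // hypothetical directory
      val (n, checkpointed) = run(sc)
      println(s"count = $n, isCheckpointed = $checkpointed")
    } finally {
      sc.stop()
    }
  }
}
```

On the affected version (1.1.0), the behavior described in this issue would
show up as isCheckpointed being false even though count() completed. Other
versions may behave differently.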
This does not happen with RDD cache: cache() always takes effect on the next
action, regardless of whether any actions ran before it was called.
rdd.checkpoint() should therefore behave the same way as rdd.cache().
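Until checkpoint() behaves like cache(), one way to stay within the ordering
constraint in an iterative job is to request the checkpoint on each new RDD
before the first action touches it. The sketch below assumes a hypothetical
loop that adds 1 to every element per iteration; the object name, the
parameters, and the checkpoint directory are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointInLoop {
  // Runs `iterations` steps and checkpoints every `checkpointEvery` steps.
  // Returns the sum of the final RDD so the result is easy to inspect.
  def iterate(sc: SparkContext, iterations: Int, checkpointEvery: Int): Long = {
    var current: RDD[Long] = sc.makeRDD(1 to 1000).map(_.toLong)
    for (i <- 1 to iterations) {
      val next = current.map(_ + 1L).cache()
      // Request the checkpoint BEFORE any action has run on `next`.
      if (i % checkpointEvery == 0) next.checkpoint()
      next.count()                         // first action on `next`
      current.unpersist(blocking = false)  // release the previous iteration's cache
      current = next
    }
    current.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, chosen only so the sketch is self-contained.
    val conf = new SparkConf().setAppName("checkpoint-in-loop").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir("/tmp/spark-4094-checkpoints") // hypothetical directory
      println(s"sum = ${iterate(sc, 30, 10)}")
    } finally {
      sc.stop()
    }
  }
}
```

Because every iteration builds a fresh RDD, the action inside the loop never
runs on `next` before checkpoint() is requested. The limitation reported here
still applies to any RDD that has already been through an action.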
> checkpoint should still be available after rdd actions
> ------------------------------------------------------
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Zhang, Liye
> Assignee: Zhang, Liye