[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246294#comment-17246294 ]
Chao Ma commented on SPARK-8582:
--------------------------------

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala#L164] It looks like we still have this problem right now?

> Optimize checkpointing to avoid computing an RDD twice
> -------------------------------------------------------
>
>                 Key: SPARK-8582
>                 URL: https://issues.apache.org/jira/browse/SPARK-8582
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Shixiong Zhu
>            Priority: Major
>              Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD
> and save the intermediate contents to HDFS for fault tolerance. However, this
> is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during
> the action that triggered the checkpointing in the first place, and once
> while we checkpoint (we iterate through the RDD's partitions and write them to
> disk). See this line for more detail:
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102
> Instead, we should have a `CheckpointingIterator` that writes checkpoint
> data to HDFS while we run the action. This would speed up many usages of
> `RDD#checkpoint` by 2x.
> (Alternatively, the user can cache the RDD before checkpointing it, but
> this is not always viable for very large input data, and it is not a great
> API to rely on in general.)
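Regarding the `CheckpointingIterator` idea in the quoted description, here is a minimal sketch of what such an iterator could look like: it wraps a partition's underlying iterator and persists each element as a side effect of the action consuming it, so the partition is computed only once. This is not Spark's actual implementation; the class name, constructor, and use of plain `java.io` serialization are assumptions made for illustration (a real version would go through Spark's configured serializer and write to HDFS via the Hadoop `FileSystem` API).

```scala
import java.io.{BufferedOutputStream, Closeable, FileOutputStream, ObjectOutputStream}

// Hypothetical sketch only: wraps a partition's iterator and writes every
// element to a checkpoint file while the running action consumes it, so the
// RDD partition does not have to be recomputed by a second checkpointing job.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    checkpointFile: String) extends Iterator[T] with Closeable {

  private val out = new ObjectOutputStream(
    new BufferedOutputStream(new FileOutputStream(checkpointFile)))
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    // Once the action has drained the iterator, the checkpoint file is complete.
    if (!more) close()
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    // Persist the element as a side effect of normal iteration.
    out.writeObject(elem.asInstanceOf[AnyRef])
    elem
  }

  override def close(): Unit = {
    if (!closed) {
      out.close()
      closed = true
    }
  }
}

// Illustrative use: wrap the iterator a task would otherwise return, e.g.
//   val it = new CheckpointingIterator(parentIter, s"$checkpointDir/part-$partitionId")
// and hand `it` to the action instead of `parentIter`.
```

One practical wrinkle with this approach is that the triggering action may not consume every partition fully (for example `take`), so partitions that are skipped or only partially read would still need a separate pass to be checkpointed.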