[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246294#comment-17246294 ]
Chao Ma commented on SPARK-8582:
--------------------------------

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala#L164] It looks like we still have this problem right now?

> Optimize checkpointing to avoid computing an RDD twice
> -------------------------------------------------------
>
>                 Key: SPARK-8582
>                 URL: https://issues.apache.org/jira/browse/SPARK-8582
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Assignee: Shixiong Zhu
>            Priority: Major
>              Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD
> and save the intermediate contents to HDFS for fault tolerance. However, this
> is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during
> the action that triggered the checkpointing in the first place, and once
> while we checkpoint (we iterate through the RDD's partitions and write them to
> disk). See this line for more detail:
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102
> Instead, we should have a `CheckpointingIterator` that writes checkpoint
> data to HDFS while we run the action. This would speed up many usages of
> `RDD#checkpoint` by 2x.
> (Alternatively, the user can cache the RDD before checkpointing it, but
> this is not always viable for very large input data, and it is not a great
> API to rely on in general.)
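Regarding the `CheckpointingIterator` idea in the quoted description, here is a minimal sketch of what such an iterator could look like: it wraps a partition's underlying iterator and persists each element as a side effect of the action consuming it, so the partition is computed only once. This is not Spark's actual implementation; the class name, constructor, and use of plain `java.io` serialization are assumptions made for illustration (a real version would go through Spark's configured serializer and write to HDFS via the Hadoop `FileSystem` API).

```scala
import java.io.{BufferedOutputStream, Closeable, FileOutputStream, ObjectOutputStream}

// Hypothetical sketch only: wraps a partition's iterator and writes every
// element to a checkpoint file while the running action consumes it, so the
// RDD partition does not have to be recomputed by a second checkpointing job.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    checkpointFile: String) extends Iterator[T] with Closeable {

  private val out = new ObjectOutputStream(
    new BufferedOutputStream(new FileOutputStream(checkpointFile)))
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    // Once the action has drained the iterator, the checkpoint file is complete.
    if (!more) close()
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    // Persist the element as a side effect of normal iteration.
    out.writeObject(elem.asInstanceOf[AnyRef])
    elem
  }

  override def close(): Unit = {
    if (!closed) {
      out.close()
      closed = true
    }
  }
}

// Illustrative use: wrap the iterator a task would otherwise return, e.g.
//   val it = new CheckpointingIterator(parentIter, s"$checkpointDir/part-$partitionId")
// and hand `it` to the action instead of `parentIter`.
```

One practical wrinkle with this approach is that the triggering action may not consume every partition fully (for example `take`), so partitions that are skipped or only partially read would still need a separate pass to be checkpointed.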