[
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738587#comment-14738587
]
Robert B. Kim commented on SPARK-8582:
--------------------------------------
[~andrewor14]
Because of this bug, we are suffering performance degradation in a large stateful
streaming app, and we are looking forward to a fix.
Is there any update or plan?
> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Andrew Or
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD
> and save the intermediate contents to HDFS for fault tolerance. However, this
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during
> the action that triggered the checkpointing in the first place, and once
> while we checkpoint (we iterate through an RDD's partitions and write them to
> disk). See this line for more detail:
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint
> data to HDFS while we run the action (a rough sketch of the idea follows
> below). This will speed up many usages of `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, as
> in the second snippet below, but this is not always viable for very large
> input data. It's also not a great API to use in general.)
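For illustration, the `CheckpointingIterator` proposed above could look roughly like the following: a wrapper that hands each record to the caller while also serializing it to a checkpoint file, so a single pass over the partition serves both the action and the checkpoint write. This is only a minimal sketch of the idea, not Spark's actual implementation; the constructor is hypothetical and it uses plain Java serialization for brevity, whereas Spark would use its own serializer and per-partition file layout.

```scala
import java.io.ObjectOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: pass each record through to the caller while also
// writing it to a checkpoint file, so one pass over the partition serves
// both the action and the checkpoint.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    checkpointFile: Path,
    hadoopConf: Configuration) extends Iterator[T] {

  private val out = new ObjectOutputStream(
    FileSystem.get(hadoopConf).create(checkpointFile))
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) {   // partition exhausted: flush and close the file
      out.close()
      closed = true
    }
    more
  }

  override def next(): T = {
    val record = underlying.next()
    out.writeObject(record)   // write the record out as it streams by
    record
  }
}
```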
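And the caching workaround mentioned in the description, again as a sketch (the paths and the `sc` SparkContext are placeholders): persisting before checkpointing lets the separate checkpoint job read the cached partitions instead of recomputing the lineage, which only helps while the cached data actually fits.

```scala
// Workaround: persist before checkpointing so the checkpoint job reads the
// cached partitions instead of recomputing them from the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // required before checkpoint()

val rdd = sc.textFile("hdfs:///tmp/input").map(_.length)
rdd.cache()        // keep computed partitions in memory after the first pass
rdd.checkpoint()   // mark the RDD for checkpointing
rdd.count()        // action: computes once; the checkpoint job then reads the cache
```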