[
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738587#comment-14738587
]
Robert B. Kim commented on SPARK-8582:
--------------------------------------
[~andrewor14]
Because of this bug, we are suffering performance degradation in a large stateful
streaming app, and we are looking forward to a fix.
Is there any update or plan?
> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Andrew Or
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD
> and save the intermediate contents to HDFS for fault tolerance. However, this
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during
> the action that triggered the checkpointing in the first place, and once
> while we checkpoint (we iterate through an RDD's partitions and write them to
> disk). See this line for more detail:
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint
> data to HDFS while we run the action (a rough sketch of the idea follows
> below). This will speed up many usages of `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, as
> in the second snippet below, but this is not always viable for very large
> input data. It's also not a great API to use in general.)
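For illustration, the `CheckpointingIterator` proposed above could look roughly like the following: a wrapper that hands each record to the caller while also serializing it to a checkpoint file, so a single pass over the partition serves both the action and the checkpoint write. This is only a minimal sketch of the idea, not Spark's actual implementation; the constructor is hypothetical and it uses plain Java serialization for brevity, whereas Spark would use its own serializer and per-partition file layout.

```scala
import java.io.ObjectOutputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: pass each record through to the caller while also
// writing it to a checkpoint file, so one pass over the partition serves
// both the action and the checkpoint.
class CheckpointingIterator[T](
    underlying: Iterator[T],
    checkpointFile: Path,
    hadoopConf: Configuration) extends Iterator[T] {

  private val out = new ObjectOutputStream(
    FileSystem.get(hadoopConf).create(checkpointFile))
  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) {   // partition exhausted: flush and close the file
      out.close()
      closed = true
    }
    more
  }

  override def next(): T = {
    val record = underlying.next()
    out.writeObject(record)   // write the record out as it streams by
    record
  }
}
```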
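And the caching workaround mentioned in the description, again as a sketch (the paths and the `sc` SparkContext are placeholders): persisting before checkpointing lets the separate checkpoint job read the cached partitions instead of recomputing the lineage, which only helps while the cached data actually fits.

```scala
// Workaround: persist before checkpointing so the checkpoint job reads the
// cached partitions instead of recomputing them from the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // required before checkpoint()

val rdd = sc.textFile("hdfs:///tmp/input").map(_.length)
rdd.cache()        // keep computed partitions in memory after the first pass
rdd.checkpoint()   // mark the RDD for checkpointing
rdd.count()        // action: computes once; the checkpoint job then reads the cache
```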