[ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8582:
-----------------------------
    Description: 
In Spark, checkpointing allows the user to truncate the lineage of an RDD and 
save the intermediate contents to HDFS for fault tolerance. However, the 
current implementation is not particularly efficient:

Every time we checkpoint an RDD, we actually compute it twice: once during the 
action that triggered the checkpointing in the first place, and once while we 
checkpoint (we iterate through an RDD's partitions and write them to disk). See 
this line for more detail: 
https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.

Instead, we should have a `CheckpointingIterator` that writes checkpoint data 
to HDFS while we run the action. This would speed up many usages of 
`RDD#checkpoint` by up to 2X.

(Alternatively, the user can just cache the RDD before checkpointing it, but 
this is not always viable for very large input data. It's also not a great API 
to use in general.)
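As an illustration only (this is not the actual Spark implementation), a minimal sketch of what such a wrapper could look like. The class name `CheckpointingIterator` comes from the proposal above; everything else is an assumption, and a plain `ObjectOutputStream` stands in for the HDFS output stream the real code would obtain from the checkpoint directory:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical sketch of the proposed CheckpointingIterator: it wraps the
// iterator an action is already consuming and persists each element as a
// side effect, so the partition is computed only once instead of twice.
// Assumes elements are serializable; error handling is omitted.
class CheckpointingIterator[T](underlying: Iterator[T], out: ObjectOutputStream)
    extends Iterator[T] {

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) out.close() // partition exhausted: finalize the checkpoint file
    more
  }

  override def next(): T = {
    val elem = underlying.next()
    out.writeObject(elem) // persist while the action consumes the element
    elem
  }
}
```

Wrapping each partition's iterator this way inside the task that runs the action would let the checkpoint be written during the first (and only) computation, rather than in a second pass over the RDD.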

  was:
In Spark, checkpointing allows the user to truncate the lineage of an RDD and 
save the intermediate contents to HDFS for fault tolerance. However, the 
current implementation is not particularly efficient:

Every time we checkpoint an RDD, we actually compute it twice: once during the 
action that triggered the checkpointing in the first place, and once while we 
checkpoint (we iterate through an RDD's partitions and write them to disk).

Instead, we should have a `CheckpointingIterator` that writes checkpoint data 
to HDFS while we run the action. This would speed up many usages of 
`RDD#checkpoint` by up to 2X.

(Alternatively, the user can just cache the RDD before checkpointing it, but 
this is not always viable for very large input data. It's also not a great API 
to use in general.)


> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
>                 Key: SPARK-8582
>                 URL: https://issues.apache.org/jira/browse/SPARK-8582
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
