Andrew Or created SPARK-8582:
--------------------------------
Summary: Optimize checkpointing to avoid computing an RDD twice
Key: SPARK-8582
URL: https://issues.apache.org/jira/browse/SPARK-8582
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
In Spark, checkpointing allows the user to truncate the lineage of an RDD and
save its intermediate contents to HDFS for fault tolerance. However, the
current implementation is inefficient:
Every time we checkpoint an RDD, we actually compute it twice: once during the
action that triggered the checkpointing in the first place, and once while we
checkpoint (we iterate through an RDD's partitions and write them to disk).
Instead, we should have a `CheckpointingIterator` that writes checkpoint data
to HDFS while we run the action. Because each partition would then be computed
once instead of twice, this should speed up many usages of `RDD#checkpoint` by
up to 2x.
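
A minimal sketch of the write-through iterator idea, in plain Python rather
than Spark internals. Everything here is an assumption made for illustration:
the names `checkpointing_iterator` and `read_checkpoint` are hypothetical, a
local file stands in for HDFS, and `pickle` stands in for Spark's serializer.
It only shows how records can be persisted during the same pass that feeds the
action, so the partition is computed once.

```python
import os
import pickle
import tempfile
from typing import Any, Iterable, Iterator


def checkpointing_iterator(records: Iterable[Any], path: str) -> Iterator[Any]:
    """Yield each record to the consumer while also writing it to `path`.

    Hypothetical helper: data goes to a temporary file first and is renamed
    to `path` only after the input is fully consumed, so an action that
    stops early never publishes a partial checkpoint.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as out:
        for record in records:
            pickle.dump(record, out)  # persist the record ...
            yield record              # ... and hand it to the running action
    # Reached only when the consumer drained the whole partition.
    os.replace(tmp_path, path)


def read_checkpoint(path: str) -> Iterator[Any]:
    """Read back the records written by `checkpointing_iterator`."""
    with open(path, "rb") as src:
        while True:
            try:
                yield pickle.load(src)
            except EOFError:
                return


if __name__ == "__main__":
    compute_calls = 0

    def compute_partition() -> Iterator[int]:
        """Stand-in for an expensive partition computation."""
        global compute_calls
        for i in range(5):
            compute_calls += 1
            yield i * i

    with tempfile.TemporaryDirectory() as tmp_dir:
        checkpoint_path = os.path.join(tmp_dir, "part-00000")
        # The "action": summing the partition also writes the checkpoint.
        total = sum(checkpointing_iterator(compute_partition(), checkpoint_path))
        assert total == 30
        assert compute_calls == 5  # each record was computed exactly once
        assert list(read_checkpoint(checkpoint_path)) == [0, 1, 4, 9, 16]
```

One design point the sketch makes visible: an action that does not consume
the whole partition (for example, one that takes only the first few records)
leaves the checkpoint incomplete, so a real implementation would need to
handle that case, here by never renaming the temporary file.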
(Alternatively, the user can cache the RDD before checkpointing it, but this
is not always viable for very large input data. Requiring users to do so is
also not a great API in general.)