Glenn Strycker created SPARK-8666:
-------------------------------------

             Summary: checkpointing does not take advantage of persisted/cached RDDs
                 Key: SPARK-8666
                 URL: https://issues.apache.org/jira/browse/SPARK-8666
             Project: Spark
          Issue Type: New Feature
            Reporter: Glenn Strycker


I have noticed that when an RDD is checkpointed, every operation in its 
lineage runs TWICE.

For example, when I run the following code and watch the stages:

{noformat}
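// Assumes a checkpoint directory has already been set with sc.setCheckpointDir(...)
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
newRDD.checkpoint()      // only marks the RDD; nothing is written until an action runs
println(newRDD.count())  // first action on newRDD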
{noformat}

I see the distinct and count operations appear TWICE, and the shuffle disk 
writes and reads (from the distinct) also occur TWICE.

My newRDD is persisted in memory, so why can't the checkpoint simply save those 
partitions to disk once the first operations have completed?
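
One way to check whether the partitions really are computed a second time, rather than relying on the stage list alone, is to count how many records pass through a step placed after the distinct. The following is only a minimal sketch under stated assumptions: Spark 2.0 or later (for `longAccumulator`), local mode, a hypothetical checkpoint directory `/tmp/spark-checkpoints`, a made-up three-record `prevRDD`, and an extra identity `map` that exists purely to do the counting. None of these come from the original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointRecomputeCheck {
  def main(args: Array[String]): Unit = {
    // Local mode and the app name are assumptions for illustration only.
    val conf = new SparkConf()
      .setAppName("checkpoint-recompute-check")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical checkpoint location; any writable directory would do.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Counts every record that flows through the post-distinct step.
    val computed = sc.longAccumulator("records computed after distinct")

    // Made-up stand-in for prevRDD: two distinct keys, three records.
    val prevRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val newRDD = prevRDD
      .map(a => (a._1, 1L))
      .distinct
      .map { x => computed.add(1L); x } // identity map, added only for counting
      .persist()

    newRDD.checkpoint()     // must be called before the first action on newRDD
    println(newRDD.count()) // first action; the checkpoint is written after it

    // Compare this with the count printed above.
    println(s"records computed after distinct: ${computed.value}")

    sc.stop()
  }
}
```

If the accumulator ends up equal to the count, the post-shuffle step ran once and the checkpoint was presumably served from the cached partitions; a value that is a multiple of the count would suggest those partitions were computed again. Accumulators updated inside transformations can also be incremented again by task retries, so this is a rough indicator rather than a proof.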



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
