[ 
https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker updated SPARK-8666:
----------------------------------
    Description: 
I have been noticing that when checkpointing RDDs, all operations are occurring 
TWICE.

For example, when I run the following code and watch the stages...

{noformat}
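import org.apache.spark.storage.StorageLevel

val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
newRDD.checkpoint()    // marks the RDD for checkpointing; nothing is written yet
print(newRDD.count())  // first action; the checkpoint is written after this job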
{noformat}

I see distinct and count operations appearing TWICE, and shuffle disk writes 
and reads (from the distinct) occurring TWICE.

My newRDD is persisted to memory, so why can't the checkpoint simply save those 
partitions to disk once the first operations have completed?

  was:
I have been noticing that when checkpointing RDDs, all operations are occurring 
TWICE.

For example, when I run the following code and watch the stages...

{noformat}
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
newRDD.checkpoint
print(newRDD.count())
{noformat}

I see distinct and count operations appearing TWICE, and shuffle disk writes 
and reads (from the distinct) occurring TWICE.

My newRDD is persisted to memory, so why can't the checkpoint simply save those 
partitions to disk once the first operations have completed?


> checkpointing does not take advantage of persisted/cached RDDs
> --------------------------------------------------------------
>
>                 Key: SPARK-8666
>                 URL: https://issues.apache.org/jira/browse/SPARK-8666
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Glenn Strycker
>
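
One way to check whether the checkpoint job really recomputes the lineage is to count how often a function placed after the shuffle is evaluated. The following is a minimal sketch, not the reporter's code: the object name CheckpointRecomputeProbe, the probe helper, the sample data standing in for prevRDD, and the temporary checkpoint directory are all hypothetical. It assumes Spark 2.0 or later (for sc.longAccumulator) running in local mode.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointRecomputeProbe {

  /** Returns (number of distinct keys, number of times the post-shuffle map ran). */
  def probe(sc: SparkContext, persistFirst: Boolean): (Long, Long) = {
    // Counts evaluations of the map placed AFTER the shuffle. Accumulator
    // updates made inside transformations can over-count on task retries,
    // so treat the value as a diagnostic hint rather than an exact figure.
    val evaluations = sc.longAccumulator("postShuffleEvaluations")

    // Hypothetical stand-in for the reporter's prevRDD.
    val prevRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 2)

    val newRDD = prevRDD
      .map(a => (a._1, 1L))
      .distinct()
      .map { kv => evaluations.add(1L); kv }

    if (persistFirst) newRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    newRDD.checkpoint()               // only marks the RDD; nothing is written yet
    val distinctKeys = newRDD.count() // first action; the checkpoint is written after it

    (distinctKeys, evaluations.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-probe")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temp directory is acceptable as the checkpoint dir.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
      for (persistFirst <- Seq(false, true)) {
        val (keys, evals) = probe(sc, persistFirst)
        println(s"persistFirst=$persistFirst distinctKeys=$keys postShuffleEvaluations=$evals")
      }
    } finally {
      sc.stop()
    }
  }
}
```

If postShuffleEvaluations stays close to distinctKeys, the checkpoint was probably written from the cached partitions. A value near twice distinctKeys would suggest the partitions were computed a second time. The counter sits after distinct on purpose: the map-side shuffle output is typically reused by the second job, so a counter placed before the shuffle would likely not show the recomputation.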


