Re: How can i remove the need for calling cache

jeff saremi Tue, 01 Aug 2017 12:18:05 -0700

Thanks Mark. I'll examine the status more carefully to observe this.

________________________________
From: Mark Hamstra <m...@clearstorydata.com>
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache


Very likely, much of the potential duplication is already being avoided even 
without calling cache/persist. When running the above code without 
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at least 
one of them you will likely see that many Stages are marked as "skipped", which 
means that prior shuffle files that cover the results of those Stages were 
still available, so Spark did not recompute those results. Spark will 
eventually clean up those shuffle files (unless you hold onto a reference to 
them), but if your Jobs using myrdd run fairly close together in time, then 
duplication is already minimized even without an explicit cache call.
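
A minimal sketch of the behaviour described above, assuming a local Spark installation. The object name, the input data, and the `reduceByKey` step are hypothetical stand-ins for the "whole bunch of transformations" in the original question; the only point is that they contain a shuffle. Two jobs run over the same RDD with no `cache` call, and the stage upstream of the shuffle should be reported as "skipped" for the second job. Note that only the work before the last shuffle is reused this way; anything after it is still computed once per job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkippedStagesSketch {

  // Runs two jobs over the same shuffled RDD without calling cache().
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    // Hypothetical stand-in for the expensive transformations.
    // reduceByKey introduces a shuffle, so its map-side output is
    // written to shuffle files.
    val myrdd = sc.parallelize(1 to 1000)
      .map(i => (i % 10, i))
      .reduceByKey(_ + _)

    // First job: runs the pre-shuffle stage and the post-shuffle stage.
    val result1 = myrdd.map { case (_, v) => v + 1 }.collect()

    // Second job: the pre-shuffle stage can be skipped because its
    // shuffle files are still available; the post-shuffle map runs again.
    val result2 = myrdd.map { case (_, v) => v * 2 }.collect()

    (result1, result2)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("skipped-stages-sketch")
      .setMaster("local[2]") // assumption: local run for illustration
    val sc = new SparkContext(conf)
    try {
      val (r1, r2) = run(sc)
      println(s"result1 sum = ${r1.sum}, result2 sum = ${r2.sum}")
    } finally {
      sc.stop()
    }
  }
}
```

The web UI (port 4040 by default) is only served while the application is running, so pasting the body of `run` into `spark-shell` is probably the easier way to inspect the Jobs and Stages pages.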

On Tue, Aug 1, 2017 at 11:05 AM, jeff saremi <jeffsar...@hotmail.com> wrote:

Calling cache/persist fails all our jobs (I have posted 2 threads on this).

And we're giving up hope of finding a solution.
So I'd like to find a workaround for that:

If I save an RDD to hdfs and read it back, can I use it in more than one 
operation?

Example: (using cache)
// do a whole bunch of transformations on an RDD

myrdd.cache()

val result1 = myrdd.map(op1(_))

val result2 = myrdd.map(op2(_))

// In the above I am assuming that a call to cache will prevent all previous
// transformations from being calculated twice


I'd like to somehow get result1 and result2 without duplicating work. How can I 
do that?

thanks

Jeff
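
On the save-and-read-back question above: an RDD read back from files has a lineage that starts at those files, so it can feed any number of later operations without re-running the transformations that produced it. Each action re-reads the files instead. The following is a rough sketch of that workaround, not a tested recommendation for the failing jobs. The object name, the bodies of `op1` and `op2`, the input data, and the output path are all hypothetical; in a real cluster the path would be an `hdfs://` URI and must not already exist.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveAndReloadSketch {

  // Hypothetical stand-ins for op1 and op2 from the question.
  def op1(x: Int): Int = x + 1
  def op2(x: Int): Int = x * 2

  // Writes the RDD once, reads it back, and uses the reloaded RDD twice.
  def run(sc: SparkContext, path: String): (Int, Int) = {
    // Hypothetical stand-in for the expensive transformations.
    val myrdd: RDD[Int] = sc.parallelize(1 to 100).map(_ * 3)

    // One job computes myrdd and writes it out. The target directory
    // must not exist yet, otherwise the save fails.
    myrdd.saveAsObjectFile(path)

    // A new RDD whose lineage begins at the saved files, not at myrdd.
    val reloaded: RDD[Int] = sc.objectFile[Int](path)

    // Both jobs read the files; neither re-runs the transformations above.
    val result1 = reloaded.map(op1(_)).reduce(_ + _)
    val result2 = reloaded.map(op2(_)).reduce(_ + _)
    (result1, result2)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("save-and-reload-sketch")
      .setMaster("local[2]") // assumption: local run for illustration
    val sc = new SparkContext(conf)
    try {
      // Assumption: a fresh local temp location instead of HDFS.
      val path = Files.createTempDirectory("myrdd-sketch")
        .resolve("myrdd").toUri.toString
      val (r1, r2) = run(sc, path)
      println(s"result1 = $r1, result2 = $r2")
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off is one extra write and two reads of the saved data, in exchange for not depending on `cache`/`persist` at all.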
