Thanks Mark. I'll examine the status more carefully to observe this.

________________________________
From: Mark Hamstra <m...@clearstorydata.com>
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache

Very likely, much of the potential duplication is already being avoided even without calling cache/persist. When running the above code without `myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at least one of them you will likely see that many Stages are marked as "skipped", which means that prior shuffle files that cover the results of those Stages were still available, so Spark did not recompute those results. Spark will eventually clean up those shuffle files (unless you hold onto a reference to them), but if your Jobs using myrdd run fairly close together in time, then duplication is already minimized even without an explicit cache call.

On Tue, Aug 1, 2017 at 11:05 AM, jeff saremi <jeffsar...@hotmail.com> wrote:

Calling cache/persist fails all our jobs (I have posted 2 threads on this). And we're giving up hope in finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to HDFS and read it back, can I use it in more than one operation?

Example: (using cache)

    // do a whole bunch of transformations on an RDD
    myrdd.cache()
    val result1 = myrdd.map(op1(_))
    val result2 = myrdd.map(op2(_))
    // in the above I am assuming that a call to cache will prevent all previous transformations from being calculated twice

I'd like to somehow get result1 and result2 without duplicating work. How can I do that?

thanks
Jeff
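
On the save-and-read-back question above, a minimal sketch of that workaround might look like the following. It is only an illustration under stated assumptions, not something posted in this thread: the input data, the bodies of op1 and op2, and the output path (taken from the first command-line argument) are all hypothetical placeholders, and the job is assumed to be launched through spark-submit so that the master is supplied externally. The RDD is written once with saveAsObjectFile and read back with SparkContext.objectFile, so both downstream maps start from the stored files instead of re-running the upstream transformations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveAndReuse {
  // Hypothetical stand-ins for op1 and op2 from the question.
  def op1(x: Int): Int = x + 1
  def op2(x: Int): Int = x * 2

  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is an output directory that does not exist yet,
    // for example a path on HDFS chosen by the caller.
    val path = args(0)

    // Assumption: the master URL is provided by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("save-and-reuse"))
    try {
      // Placeholder for "a whole bunch of transformations on an RDD".
      val myrdd: RDD[Int] = sc.parallelize(1 to 100).map(_ * 3).filter(_ % 2 == 0)

      // Runs the upstream transformations once and writes the result out.
      myrdd.saveAsObjectFile(path)

      // The reloaded RDD's lineage starts at the stored files.
      val reloaded: RDD[Int] = sc.objectFile[Int](path)

      val result1 = reloaded.map(op1(_))
      val result2 = reloaded.map(op2(_))

      // Each action re-reads the stored files, not the original pipeline.
      println(result1.count())
      println(result2.count())
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off in this sketch is one extra write plus one read per downstream job, in exchange for not depending on cache/persist at all. RDD.checkpoint(), used together with SparkContext.setCheckpointDir, is a built-in variant of the same idea that may be worth comparing.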