Re: Effects of persist(XYZ_2)

2015-02-25 Thread Sean Owen
Then every worker would have to hold the whole RDD in memory, which has some significant drawbacks. As long as you are able to execute all tasks locally to their partitions, any additional copies of the data don't help locality. And in general you need far fewer than N copies of the data (one per worker) for that.
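The memory cost mentioned above can be put in rough numbers. The following is a back-of-the-envelope sketch, not Spark code: the function name, the 100 GB RDD and the 20-worker cluster are made-up values for illustration, persist(MEMORY_ALL) is the hypothetical option from the question, and cached blocks are assumed to be spread evenly across workers.

```python
def replicated_footprint(rdd_size_gb, num_workers, replication):
    """Return (total_gb, avg_per_worker_gb) for an RDD cached `replication` times.

    Toy model: assumes every replica is the same size as the original block
    and that blocks are balanced evenly across the workers.
    """
    if not 1 <= replication <= num_workers:
        raise ValueError("replication must be between 1 and num_workers")
    total_gb = rdd_size_gb * replication
    return total_gb, total_gb / num_workers


# Hypothetical cluster: a 100 GB RDD cached on 20 workers.
print(replicated_footprint(100, 20, 2))   # _2 storage level: (200, 10.0)
print(replicated_footprint(100, 20, 20))  # one copy per worker: (2000, 100.0)
```

Under these assumptions, a copy on every worker means each worker holds the full 100 GB, ten times what the _2 level asks of it.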

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Yes. Effectively, could it avoid network transfers? Or put differently, would an option like persist(MEMORY_ALL) improve job speed by caching an RDD on every worker?

> On 25.02.2015, at 11:42, Sean Owen wrote:
>
> If you mean, can both copies of the blocks be used for computations?
> yes they

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Sean Owen
If you mean, can both copies of the blocks be used for computations? Yes, they can.

On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier wrote:
> Hi,
>
> just a quick question about calling persist with the _2 option. Is the 2x
> replication only useful for fault tolerance, or will it also increase j
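As a rough illustration of the reply in this message, a second cached copy gives a task one more worker where it can run next to its data. This is a toy model and not Spark's scheduler: the function name and the worker names are hypothetical, and "busy" simply means a worker has no free task slot.

```python
def local_candidates(replica_holders, busy):
    """Workers that hold a cached copy of the block and are free to run the task."""
    return sorted(w for w in replica_holders if w not in busy)


# Hypothetical block placement; worker names are made up for illustration.
single = {"w1"}        # one cached copy, e.g. MEMORY_ONLY
double = {"w1", "w2"}  # two cached copies, e.g. MEMORY_ONLY_2
busy = {"w1"}          # w1 has no free task slot right now

print(local_candidates(single, busy))  # []
print(local_candidates(double, busy))  # ['w2']
```

With a single copy, the task either waits for w1 or reads the block over the network. With two copies it can run locally on w2.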

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers, assuming I’m doing joins or other shuffle operations? Thanks