Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
> > I would guess in such shuffles the bottleneck is serializing the data > rather than raw IO, so I'm not sure explicitly buffering the data in the > JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffe

Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
Hey Corey, Yes, when shuffles are smaller than available memory to the OS, most often the outputs never get stored to disk. I believe this holds same for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess in s

Re: Fully in-memory shuffles

2015-06-10 Thread Davies Liu
If you have enough memory, you can put the temporary work directory in tempfs (in memory file system). On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet wrote: > Ok so it is the case that small shuffles can be done without hitting any > disk. Is this the same case for the aux shuffle service in yarn?

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Ok so it is the case that small shuffles can be done without hitting any disk. Is this the same case for the aux shuffle service in yarn? Can that be done without hitting disk? On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell wrote: > In many cases the shuffle will actually hit the OS buffer cac

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > So with this... to help my understanding of Spark under the hood- > > Is this

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
So with this... to help my understanding of Spark under the hood- Is this statement correct "When data needs to pass between multiple JVMs, a shuffle will *always* hit disk"? On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen wrote: > There's a discussion of this at https://github.com/apache/spark/pu

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote: > Is it possible to configure Spark to do all of its shuffling FULLY in > memory (given that I have enough memory to store all the data)? > > > >

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?