>
> I would guess in such shuffles the bottleneck is serializing the data
> rather than raw IO, so I'm not sure explicitly buffering the data in the
> JVM process would yield a large improvement.
Good guess! It is very hard to beat the performance of retrieving shuffle
outputs from the OS buffer cache.
Hey Corey,
Yes, when shuffles are smaller than the memory available to the OS, the
outputs most often never get stored to disk. I believe the same holds
for the YARN shuffle service, because the write path is actually the
same, i.e. we don't fsync the writes and force them to disk.
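To make the fsync point concrete: a plain write() lands in the OS page cache, and the kernel flushes it to physical disk later (or never, if the file is deleted first, which is what happens to short-lived shuffle files). A minimal Python sketch of the distinction (the file name is just an illustration, not Spark's actual naming):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shuffle_0_0.data")

# A normal write lands in the OS page cache; the kernel decides
# when (or whether) to flush it to the physical disk.
with open(path, "wb") as f:
    f.write(b"shuffle output bytes")
    # Spark does NOT do this for shuffle files; uncommenting it
    # would force the data onto disk before close() returns:
    # f.flush()
    # os.fsync(f.fileno())

# The read-back is served from the page cache -- no disk seek
# is needed while the data is still resident.
with open(path, "rb") as f:
    assert f.read() == b"shuffle output bytes"

# Short-lived shuffle files are often deleted before the kernel
# ever writes them out, so they may never touch spinning disk.
os.remove(path)
```
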
If you have enough memory, you can put the temporary work directory on
tmpfs (an in-memory file system).
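For reference, a sketch of that setup using Spark's `spark.local.dir` setting (the mount point and size below are placeholders, not recommendations; mounting requires root):

```shell
# Create an in-memory filesystem (size is a placeholder)
sudo mount -t tmpfs -o size=64g tmpfs /mnt/spark-tmp

# Point Spark's shuffle/work directory at it, e.g. in spark-defaults.conf:
#   spark.local.dir  /mnt/spark-tmp
# or per job on submit:
spark-submit --conf spark.local.dir=/mnt/spark-tmp ...
```

Note that shuffle files stored this way count against the machine's memory, so size the mount accordingly.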
On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet wrote:
> Ok so it is the case that small shuffles can be done without hitting any
> disk. Is this the same case for the aux shuffle service in yarn?
Ok so it is the case that small shuffles can be done without hitting any
disk. Is this the same case for the aux shuffle service in yarn? Can that
be done without hitting disk?
On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell wrote:
> In many cases the shuffle will actually hit the OS buffer cache and
> not ever touch spinning disk if it is a size that is less than memory
> on the machine.
In many cases the shuffle will actually hit the OS buffer cache and
not ever touch spinning disk if it is a size that is less than memory
on the machine.
- Patrick
On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote:
> So with this... to help my understanding of Spark under the hood-
>
> Is this statement correct "When data needs to pass between multiple JVMs, a
> shuffle will *always* hit disk"?
So with this... to help my understanding of Spark under the hood-
Is this statement correct "When data needs to pass between multiple JVMs, a
shuffle will *always* hit disk"?
On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen wrote:
> There's a discussion of this at https://github.com/apache/spark/pull/5403
There's a discussion of this at https://github.com/apache/spark/pull/5403
On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote:
> Is it possible to configure Spark to do all of its shuffling FULLY in
> memory (given that I have enough memory to store all the data)?
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?