Thanks for the info. I agree, it makes sense the way it's designed.

Pramod
On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> I agree, this is better handled by the filesystem cache - not to
> mention, being able to do zero-copy writes.
>
> Regards,
> Mridul
>
> On Sat, May 2, 2015 at 10:26 PM, Reynold Xin <r...@databricks.com> wrote:
> > I've personally prototyped completely in-memory shuffle for Spark 3 times.
> > However, it is unclear how big a gain it would be to put all of this in
> > memory under newer file systems (ext4, xfs). If the shuffle data is small,
> > it is still in the file system buffer cache anyway. Note that network
> > throughput is often lower than disk throughput, so it won't be a problem
> > to read it from disk. And not having to keep all of this in memory
> > substantially simplifies memory management.
> >
> > On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:
> >> Hi,
> >> I was trying to see if I can make Spark avoid hitting the disk for small
> >> jobs, but I see that SortShuffleWriter.write() always writes to disk. I
> >> found an older thread (
> >> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
> >> ) saying that it doesn't call fsync on this write path.
> >>
> >> My question is: why does it always write to disk?
> >> Does it mean the reduce phase reads the result from disk as well?
> >> Isn't it possible to read the data from the map/buffer in ExternalSorter
> >> directly during the reduce phase?
> >>
> >> Thanks,
> >> Pramod
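[Editor's note: the write-then-read behavior discussed in the thread can be sketched as below. This is illustrative Python, not Spark's actual Scala implementation; the function names `write_map_output` and `read_reduce_input` are hypothetical stand-ins for the shuffle write and fetch paths.]

```python
import os
import tempfile

def write_map_output(directory, partition, records):
    """Write one partition's sorted records to a file, like a shuffle writer.

    Note there is deliberately no os.fsync() call: as the thread says, the
    bytes may live only in the OS page cache (ext4/xfs), so a subsequent
    read is typically served from memory, not physical disk.
    """
    path = os.path.join(directory, f"shuffle_0_{partition}.data")
    with open(path, "wb") as f:
        for key, value in sorted(records):
            f.write(f"{key}\t{value}\n".encode())
    return path

def read_reduce_input(path):
    """Read the partition file back, as the reduce phase would fetch it."""
    with open(path, "rb") as f:
        return [tuple(line.decode().split("\t"))
                for line in f.read().splitlines()]

with tempfile.TemporaryDirectory() as d:
    path = write_map_output(d, 0, [("b", "2"), ("a", "1")])
    print(read_reduce_input(path))  # records come back sorted by key
```

The sketch shows why the disk detour is cheap for small jobs: the "disk" read in `read_reduce_input` usually never touches the device, which is the filesystem-cache argument Mridul and Reynold make above.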