Thanks for the info. I agree, it makes sense the way it's designed.

Pramod
On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> I agree, this is better handled by the filesystem cache - not to
> mention, being able to do zero-copy writes.
>
> Regards,
> Mridul
>
> On Sat, May 2, 2015 at 10:26 PM, Reynold Xin <r...@databricks.com> wrote:
> > I've personally prototyped completely in-memory shuffle for Spark 3 times.
> > However, it is unclear how big a gain it would be to put all of this in
> > memory under newer file systems (ext4, xfs). If the shuffle data is small,
> > it is still in the file system buffer cache anyway. Note that network
> > throughput is often lower than disk throughput, so it won't be a problem
> > to read it from disk. And not having to keep all of this in memory
> > substantially simplifies memory management.
> >
> > On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:
> >> Hi,
> >> I was trying to see if I can make Spark avoid hitting the disk for small
> >> jobs, but I see that SortShuffleWriter.write() always writes to disk. I
> >> found an older thread (
> >> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
> >> ) saying that it doesn't call fsync on this write path.
> >>
> >> My question is: why does it always write to disk?
> >> Does it mean the reduce phase reads the result from disk as well?
> >> Isn't it possible to read the data from the map/buffer in ExternalSorter
> >> directly during the reduce phase?
> >>
> >> Thanks,
> >> Pramod
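[Editor's note: the write-then-read behavior discussed in the thread can be sketched as below. This is illustrative Python, not Spark's actual Scala implementation; the function names `write_map_output` and `read_reduce_input` are hypothetical stand-ins for the shuffle write and fetch paths.]

```python
import os
import tempfile

def write_map_output(directory, partition, records):
    """Write one partition's sorted records to a file, like a shuffle writer.

    Note there is deliberately no os.fsync() call: as the thread says, the
    bytes may live only in the OS page cache (ext4/xfs), so a subsequent
    read is typically served from memory, not physical disk.
    """
    path = os.path.join(directory, f"shuffle_0_{partition}.data")
    with open(path, "wb") as f:
        for key, value in sorted(records):
            f.write(f"{key}\t{value}\n".encode())
    return path

def read_reduce_input(path):
    """Read the partition file back, as the reduce phase would fetch it."""
    with open(path, "rb") as f:
        return [tuple(line.decode().split("\t"))
                for line in f.read().splitlines()]

with tempfile.TemporaryDirectory() as d:
    path = write_map_output(d, 0, [("b", "2"), ("a", "1")])
    print(read_reduce_input(path))  # records come back sorted by key
```

The sketch shows why the disk detour is cheap for small jobs: the "disk" read in `read_reduce_input` usually never touches the device, which is the filesystem-cache argument Mridul and Reynold make above.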