Thanks. It's clear now. :)

On Wed, Jul 15, 2009 at 11:40 AM, Jothi Padmanabhan <[email protected]> wrote:
> It is true, the map writes its output to a memory buffer. But when the map
> process is complete, the contents of this buffer are sorted and spilled to
> disk so that the Task Tracker running on that node can serve these map
> outputs to the requesting reducers.
>
> On 7/15/09 7:59 AM, "Mu Qiao" <[email protected]> wrote:
>
> > Thanks. But when I refer to "Hadoop: The Definitive Guide", chapter 6, I
> > find that the map writes its outputs to a memory buffer (not to local
> > disk) whose size is controlled by io.sort.mb. Only when the buffer reaches
> > its threshold will it spill the outputs to local disk. If that is true, I
> > can't see any need for the map to store its outputs to disk if io.sort.mb
> > is large enough.
> >
> > On Wed, Jul 15, 2009 at 12:45 AM, Owen O'Malley <[email protected]> wrote:
> >
> >> There is no requirement that all of the reduces are running while the
> >> map is running. The dataflow is that the map writes its output to local
> >> disk and that the reduces pull the map outputs when they need them.
> >> There are threads handling sorting and spill of the records to disk, but
> >> that doesn't remove the need for the map to store its outputs to disk.
> >> (Of course, if there is enough RAM, the operating system will have the
> >> map outputs in its file cache and not need to read from disk.)
> >>
> >> It is an interesting question as to what the changes would need to be to
> >> have the maps push to the reduces, but they would be substantial.
> >>
> >> -- Owen

--
Best wishes,
Qiao Mu
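
For anyone reading this thread later, the point made above (the buffer is spilled when it fills up, and whatever remains is sorted and written out when the map finishes, however large the buffer is) can be illustrated with a toy model. This is only a sketch and not Hadoop's actual code: the class `SpillingMapBuffer`, the helper `run_demo`, and the record-count threshold standing in for io.sort.mb are all made up for illustration, and the real implementation is considerably more involved.

```python
import os
import pickle
import tempfile


class SpillingMapBuffer:
    """Toy model of a map-side output buffer (hypothetical, not Hadoop's code)."""

    def __init__(self, max_records, spill_dir):
        self.max_records = max_records  # stands in for the io.sort.mb threshold
        self.spill_dir = spill_dir
        self.records = []               # in-memory buffer of (key, value) pairs
        self.spill_files = []           # sorted runs already written to disk

    def collect(self, key, value):
        """Buffer one map output record, spilling if the threshold is reached."""
        self.records.append((key, value))
        if len(self.records) >= self.max_records:
            self._spill()

    def _spill(self):
        """Sort the buffered records by key and write them to disk as one run."""
        if not self.records:
            return
        path = os.path.join(self.spill_dir, f"spill{len(self.spill_files)}.out")
        with open(path, "wb") as f:
            pickle.dump(sorted(self.records), f)
        self.spill_files.append(path)
        self.records = []

    def close(self):
        """Called when the map finishes: flush whatever is still in memory."""
        self._spill()
        return self.spill_files


def run_demo(max_records):
    """Feed four records through the buffer and return the sorted runs on disk."""
    with tempfile.TemporaryDirectory() as spill_dir:
        buf = SpillingMapBuffer(max_records, spill_dir)
        for key, value in [("b", 1), ("a", 2), ("d", 3), ("c", 4)]:
            buf.collect(key, value)
        runs = []
        for path in buf.close():
            with open(path, "rb") as f:
                runs.append(pickle.load(f))
        return runs
```

With a threshold of 2, `run_demo(2)` leaves two sorted runs on disk. With a threshold far larger than the input, `run_demo(1000)` still leaves one run, because `close()` flushes the buffer so that the output exists as a file that can be served to reducers later.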
