Thanks. But when I refer to chapter 6 of "Hadoop: The Definitive Guide", I find that the map writes its outputs to a memory buffer (not to local disk) whose size is controlled by io.sort.mb. Only when the buffer reaches its threshold does it spill the outputs to local disk. If that is true, I can't see any need for the map to store its outputs on disk if io.sort.mb is large enough.
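
A minimal sketch of the buffering described above, written as a toy model in Python rather than taken from Hadoop's code. The class name, `buffer_bytes`, and `spill_fraction` are hypothetical stand-ins for the buffer size (io.sort.mb) and its spill threshold. The `close` step, which writes whatever is left in the buffer when the map finishes, is an assumption of this sketch, chosen to match the statement in the quoted message that the map stores its outputs on disk.

```python
class MapOutputBuffer:
    """Toy model of a map-side sort buffer; NOT Hadoop's real implementation."""

    def __init__(self, buffer_bytes, spill_fraction=0.8):
        # spill_fraction is a hypothetical stand-in for the spill threshold.
        self.limit = buffer_bytes * spill_fraction
        self.records = []
        self.used = 0
        self.spills = []  # each entry stands in for one spill file on local disk

    def collect(self, key, value):
        """Buffer one map output record; spill once the threshold is reached."""
        self.records.append((key, value))
        self.used += len(key) + len(value)
        if self.used >= self.limit:
            self._spill()

    def _spill(self):
        """Sort the buffered records and move them to a new 'spill file'."""
        if self.records:
            self.spills.append(sorted(self.records))
            self.records = []
            self.used = 0

    def close(self):
        """Assumed final step: write out whatever is still in memory."""
        self._spill()
        return self.spills


def run(buffer_bytes, records):
    """Feed all records through a buffer of the given size and return its spills."""
    buf = MapOutputBuffer(buffer_bytes)
    for key, value in records:
        buf.collect(key, value)
    return buf.close()


records = [("b", "1"), ("a", "2"), ("d", "3"), ("c", "4"), ("e", "5")]
small = run(10, records)    # small buffer: spills while the map is running
large = run(1000, records)  # large buffer: no spill until the map finishes
print(len(small), len(large))
```

Under these assumptions, a large buffer reduces the number of spills to one but does not eliminate it: the output still ends up on local disk, where the reduces can fetch it later.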

On Wed, Jul 15, 2009 at 12:45 AM, Owen O'Malley <[email protected]> wrote:

> There is no requirement that all of the reduces are running while the map is
> running. The dataflow is that the map writes its output to local disk and
> that the reduces pull the map outputs when they need them. There are threads
> handling sorting and spill of the records to disk, but that doesn't remove
> the need for the map to store its outputs to disk. (Of course, if there is
> enough ram, the operating system will have the map outputs in its file cache
> and not need to read from disk.)
>
> It is an interesting question as to what the changes would need to be to
> have the maps push to the reduces, but they would be substantial.
>
> -- Owen

--
Best wishes,
Qiao Mu
