Re: Why is Spilled Records always equal to Map output records

Mu Qiao Tue, 14 Jul 2009 19:30:37 -0700

Thanks. But according to chapter 6 of "Hadoop: The Definitive Guide", the map
writes its output to a memory buffer (not to local disk) whose size is
controlled by io.sort.mb. Only when the buffer reaches its threshold does it
spill the output to local disk. If that is true, I can't see any need for the
map to store its output on disk when io.sort.mb is large enough.
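
A toy Python sketch of the buffering described above (not Hadoop code) may
help frame the question. The function name, the 0.8 spill fraction, and the
final flush at the end of the map are all assumptions made for illustration;
the final flush is modeled on Owen's statement, quoted below, that the map
output ends up on local disk so the reduces can pull it.

```python
def count_spilled_records(record_sizes, buffer_bytes, spill_fraction=0.8):
    """Toy model of a map-side sort buffer.

    Returns (number_of_spills, spilled_record_count).
    buffer_bytes stands in for io.sort.mb; spill_fraction is an assumed
    threshold, not a value taken from this thread.
    """
    threshold = buffer_bytes * spill_fraction
    used = 0        # bytes currently held in the buffer
    buffered = 0    # records currently held in the buffer
    spills = 0
    spilled = 0
    for size in record_sizes:
        used += size
        buffered += 1
        if used >= threshold:
            # Buffer reached its threshold: write its contents to disk.
            spills += 1
            spilled += buffered
            used = 0
            buffered = 0
    # Assumption: whatever is still buffered when the map finishes is
    # written to disk once, so the reduces can fetch it later.
    if buffered:
        spills += 1
        spilled += buffered
    return spills, spilled


if __name__ == "__main__":
    records = [100] * 1000  # 1000 map output records of 100 bytes each
    print(count_spilled_records(records, buffer_bytes=1_000_000))
    print(count_spilled_records(records, buffer_bytes=10_000))
```

Under these assumptions, a buffer large enough to hold all the output still
produces one spill, and the spilled record count equals the number of map
output records. Whether the real implementation behaves this way is exactly
what this thread is asking.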


On Wed, Jul 15, 2009 at 12:45 AM, Owen O'Malley <[email protected]> wrote:

> There is no requirement that all of the reduces are running while the map
> is running. The dataflow is that the map writes its output to local disk
> and that the reduces pull the map outputs when they need them. There are
> threads handling sorting and spill of the records to disk, but that doesn't
> remove the need for the map to store its outputs to disk. (Of course, if
> there is enough ram, the operating system will have the map outputs in its
> file cache and not need to read from disk.)
>
> It is an interesting question as to what the changes would need to be to
> have the maps push to the reduces, but they would be substantial.
>
> -- Owen
>



-- 
Best wishes,
Qiao Mu
