Re: Why is Spilled Records always equal to Map output records

Mu Qiao Tue, 14 Jul 2009 21:01:04 -0700

Thanks. It's clear now. :)

On Wed, Jul 15, 2009 at 11:40 AM, Jothi Padmanabhan
<[email protected]> wrote:


> It is true that the map writes its output to a memory buffer. But when the
> map process is complete, the contents of this buffer are sorted and spilled
> to disk so that the Task Tracker running on that node can serve these map
> outputs to the requesting reducers.
>
>
> On 7/15/09 7:59 AM, "Mu Qiao" <[email protected]> wrote:
>
> > Thanks. But when I refer to "Hadoop: The Definitive Guide" chapter 6, I find
> > that the map writes its outputs to a memory buffer (not to local disk) whose
> > size is controlled by io.sort.mb. Only when the buffer reaches its threshold
> > will it spill the outputs to local disk. If that is true, I can't see any need
> > for the map to store its outputs to disk if io.sort.mb is large enough.
> >
> > On Wed, Jul 15, 2009 at 12:45 AM, Owen O'Malley <[email protected]> wrote:
> >
> >> There is no requirement that all of the reduces are running while the map is
> >> running. The dataflow is that the map writes its output to local disk and
> >> that the reduces pull the map outputs when they need them. There are threads
> >> handling sorting and spill of the records to disk, but that doesn't remove
> >> the need for the map to store its outputs to disk. (Of course, if there is
> >> enough RAM, the operating system will have the map outputs in its file cache
> >> and not need to read from disk.)
> >>
> >> It is an interesting question as to what the changes would need to be to
> >> have the maps push to the reduces, but they would be substantial.
> >>
> >> -- Owen
> >>
> >
> >
>
>


-- 
Best wishes,
Qiao Mu
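
For anyone finding this thread later, the behaviour discussed above can be
illustrated with a toy model. This is only a rough sketch, not Hadoop's
actual code: the function name simulate_map_spills and its parameters are
made up for illustration, the buffer is measured in records rather than in
megabytes (the real buffer is sized by io.sort.mb), and the 0.8 threshold is
an assumed value. It also ignores combiners and the merging of multiple spill
files. The point it shows is that whatever is still in the buffer when the
map finishes gets written to disk, so every map output record is spilled at
least once, even when the buffer never fills up.

```python
def simulate_map_spills(num_records, buffer_capacity, spill_threshold=0.8):
    """Toy model of one map task: return (number_of_spills, spilled_records).

    buffer_capacity is counted in records here, which is a simplification;
    all names and the default threshold are illustrative assumptions.
    """
    # The buffer is spilled once it holds this many records.
    limit = max(1, int(buffer_capacity * spill_threshold))
    buffered = 0
    spills = 0
    spilled_records = 0

    for _ in range(num_records):
        buffered += 1  # the map emits one record into the in-memory buffer
        if buffered >= limit:
            # Threshold reached: write the buffered records to local disk.
            spilled_records += buffered
            spills += 1
            buffered = 0

    # End of the map: whatever remains in memory is still written to disk,
    # so that the map output can be served to the reducers later.
    if buffered > 0:
        spilled_records += buffered
        spills += 1

    return spills, spilled_records


if __name__ == "__main__":
    # Buffer far larger than the output: one final spill of all records.
    print(simulate_map_spills(1000, 10000))
    # Small buffer: several spills, but the same total number of records.
    print(simulate_map_spills(1000, 100))
```

In both calls the second value equals the number of map output records,
which matches the counters asked about in the subject line. The size of the
buffer only changes how many spills it takes to get there.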
