There is no requirement that all of the reduces are running while the map is
running. The dataflow is that the map writes its output to local disk and
that the reduces pull the map outputs when they need them. There are threads
handling the sorting and spilling of records to disk, but that doesn't remove
the need for the map to store its outputs on disk. (Of course, if there is
enough RAM, the operating system will keep the map outputs in its file cache
and will not need to read them back from disk.)
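
To make the pull model concrete, here is a toy sketch in Python. It is not
Hadoop code: the names run_map, run_reduce and word_count, the JSON file
layout, and the byte-sum partitioner are all made up for illustration. Each
map writes one sorted file per reduce into a local directory, and the reduces
only start, and pull their partitions, after every map has finished.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_map(map_id, records, num_reduces, out_dir):
    """Partition and sort (key, value) pairs, then write one file per reduce."""
    partitions = defaultdict(list)
    for key, value in records:
        # Hypothetical partitioner: deterministic, unlike Python's str hash().
        partitions[sum(key.encode()) % num_reduces].append((key, value))
    for r in range(num_reduces):
        path = os.path.join(out_dir, f"map{map_id}_part{r}.json")
        with open(path, "w") as f:
            json.dump(sorted(partitions[r]), f)


def run_reduce(reduce_id, num_maps, out_dir):
    """Pull this reduce's partition from every map output and sum per key."""
    totals = defaultdict(int)
    for m in range(num_maps):
        path = os.path.join(out_dir, f"map{m}_part{reduce_id}.json")
        with open(path) as f:
            for key, value in json.load(f):
                totals[key] += value
    return dict(totals)


def word_count(splits, num_reduces=2):
    """Run every map to completion, then run the reduces one after another."""
    with tempfile.TemporaryDirectory() as out_dir:
        for m, split in enumerate(splits):
            run_map(m, [(w, 1) for w in split.split()], num_reduces, out_dir)
        # No reduce existed while the maps ran; the outputs waited on disk.
        result = {}
        for r in range(num_reduces):
            result.update(run_reduce(r, len(splits), out_dir))
        return result
```

Calling word_count(["a b a", "b c"]) runs two maps and then two reduces. The
reduces get the right answer even though none of them was running during the
map phase, because the map outputs were sitting on local disk.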

It is an interesting question as to what the changes would need to be to
have the maps push to the reduces, but they would be substantial.

-- Owen
