There is no requirement that all of the reduces be running while the map is running. The dataflow is that the map writes its output to local disk, and the reduces pull the map outputs when they need them. There are threads that handle sorting and spilling the records to disk, but that doesn't remove the need for the map to store its outputs on disk. (Of course, if there is enough RAM, the operating system will keep the map outputs in its file cache and the reads will not need to go to disk.)
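To make the pull model above concrete, here is a toy sketch in Python. It is not Hadoop's code: the function names, the file naming scheme, and the byte-sum partitioner are all made up for illustration. It only shows that the reduces can start after every map has finished, because the map outputs are sitting on disk waiting to be fetched.

```python
import os
import tempfile
from collections import defaultdict


def run_map(map_id, records, num_reduces, out_dir):
    """Hypothetical map task: partition the records and write one
    sorted file per reduce to local disk."""
    partitions = defaultdict(list)
    for key, value in records:
        # Toy deterministic partitioner (an assumption, not Hadoop's).
        partitions[sum(key.encode()) % num_reduces].append((key, value))
    for r in range(num_reduces):
        path = os.path.join(out_dir, f"map{map_id}_part{r}.txt")
        with open(path, "w") as f:
            for key, value in sorted(partitions[r]):
                f.write(f"{key}\t{value}\n")


def run_reduce(reduce_id, num_maps, out_dir):
    """Hypothetical reduce task: pull its partition from every map's
    output file, then sum the values per key."""
    totals = defaultdict(int)
    for m in range(num_maps):
        path = os.path.join(out_dir, f"map{m}_part{reduce_id}.txt")
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                totals[key] += int(value)
    return dict(totals)


def word_count(splits, num_reduces=2):
    with tempfile.TemporaryDirectory() as out_dir:
        # All maps run to completion first; their output stays on disk.
        for m, split in enumerate(splits):
            run_map(m, [(w, 1) for w in split.split()], num_reduces, out_dir)
        # The reduces start only now and pull what they need.
        result = {}
        for r in range(num_reduces):
            result.update(run_reduce(r, len(splits), out_dir))
        return result
```

Calling word_count(["a b a", "b c"]) runs two maps and then two reduces. In a push design the map would need a live reduce on the other end to send to, which is why the files on disk are what decouple the two phases here.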
What changes would be needed to have the maps push to the reduces is an interesting question, but they would be substantial. -- Owen
