Kevin,

As the data is sorted, it is spilled to disk as needed, and there are parameters you can set that affect the spill mechanics. Each map task has a circular memory buffer that it writes its output to. When the contents of the buffer reach a certain threshold (one that you can tune), a background thread begins to spill the contents to disk. As it spills, the thread divides the data into partitions and sorts each partition by key (there are tricks you can apply here to plug in fancier algorithms). Before the map task finishes, the spill files are merged into a single file that is both partitioned and sorted. This is the file that the reduce tasks will later fetch.
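To make the idea concrete, here is a minimal sketch (plain Python, not Hadoop code) of the spill-and-merge technique described above: buffer records in memory, spill a sorted run to disk whenever the buffer passes a threshold, then stream-merge the sorted runs at the end. The threshold and tab-separated file format are illustrative choices, not Hadoop's.

```python
# Sketch of external sort via spill files; buffer size is tiny for illustration.
import heapq
import os
import tempfile

SPILL_THRESHOLD = 4  # Hadoop's buffer threshold is tunable; this is just a demo value

def external_sort(records):
    """Sort an arbitrarily large stream of (key, value) pairs
    using bounded memory, by spilling sorted runs to disk and merging them."""
    buffer, spill_files = [], []

    def spill():
        buffer.sort()  # sort the run in memory before writing it out
        f = tempfile.NamedTemporaryFile(mode="w", delete=False)
        for key, value in buffer:
            f.write(f"{key}\t{value}\n")
        f.close()
        spill_files.append(f.name)
        buffer.clear()

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= SPILL_THRESHOLD:
            spill()
    if buffer:
        spill()  # final spill for whatever is left in memory

    def read_run(path):
        # stream one sorted spill file back as (key, value) pairs
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                yield key, value

    # heapq.merge streams the runs lazily, so memory stays bounded
    # no matter how many spill files there are
    merged = list(heapq.merge(*(read_run(p) for p in spill_files)))
    for p in spill_files:
        os.remove(p)
    return merged
```

The reduce side of Hadoop does essentially the same streaming merge over the map outputs it has fetched.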
Something else you can look at is enabling LZO support and compressing the map task output, which improves throughput by touching the disk less. For a better understanding of the shuffle, check out Tom White's O'Reilly book "Hadoop: The Definitive Guide", chapter 6, section "Shuffle and Sort".

Josh Patterson
Solutions Architect
Cloudera

On Sun, May 30, 2010 at 12:22 PM, Kevin Tse <[email protected]> wrote:
> If the data generated by the Map function and Reduce function is far bigger
> than the available RAM on my server, is it possible to sort the data?
>
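For reference, turning on map-output compression in the 0.20-era configuration looks roughly like the fragment below. The LzoCodec class comes from the separate hadoop-lzo library, which has to be installed on the cluster first; this is a sketch, not a complete mapred-site.xml.

```xml
<!-- mapred-site.xml: compress intermediate map output with LZO -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```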
