I think I've found a bug in the Merger code for Hadoop. When a map task runs, it creates spill files whose size is governed by io.sort.mb. It then merges io.sort.factor of those (already sorted) files at a time to produce the output file that is passed to the reduce phase. The higher these two settings are configured, the more memory the merge is expected to use.
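For reference, here are the two settings as they would appear in mapred-site.xml, with the values from my example below (these are the classic property names; newer Hadoop releases rename them to mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor):

```xml
<property>
  <name>io.sort.mb</name>
  <value>256</value>   <!-- buffer size for each spill, in MB -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>10</value>    <!-- max number of streams merged per pass -->
</property>
```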
However, as far as I can tell, the memory used is the same no matter what the io.sort parameters are set to. For example, with io.sort.mb set to 256, io.sort.factor set to 10, and 10 spill files, we get the following scenario: the merger merges all 10 spill files into one output file, using roughly 2.5 GB of memory.

If we instead set io.sort.factor to 4, the merger merges 4 of the 10 spill files, writes the result as a temp file on disk, and adds that file back into the merge queue. It repeats this with the next 4 spill files. Now 2 spill files remain, plus the 2 temp files, each of which is 4 spill files combined. So on the third pass, the merger is back to merging everything into one output file, again using roughly 2.5 GB of memory.

In other words, no matter what you set io.sort.factor to, the final pass ends up using the same amount of memory; lower factors just take longer because of the intermediate passes. As a result, if you only have 2 GB of memory available for the map task, you get an OutOfMemoryError every time you attempt to run the job. Can anyone confirm what I'm seeing, or point out a flaw in my reasoning? Thanks.
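To check my arithmetic, here is a small stand-alone simulation (my own sketch, not actual Hadoop code; the MergeSim class and finalPassBytes method are hypothetical names). It mimics the pass behaviour I described, merging up to `factor` files at a time and re-queuing the result, and reports how many megabytes flow through the final merge pass:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class MergeSim {
    // Simulate a multi-pass merge: each pass takes up to 'factor' files
    // from the queue, merges them into one file, and re-queues the result.
    // Returns the number of megabytes read by the final pass.
    static long finalPassBytes(long[] spillSizesMb, int factor) {
        Deque<Long> queue = new ArrayDeque<>();
        for (long s : spillSizesMb) queue.add(s);
        long lastMerge = 0;
        while (queue.size() > 1) {
            long merged = 0;
            for (int i = 0; i < factor && !queue.isEmpty(); i++) {
                merged += queue.poll();
            }
            queue.add(merged);   // intermediate file goes back into the queue
            lastMerge = merged;  // remember the size of the most recent pass
        }
        return lastMerge;
    }

    public static void main(String[] args) {
        long[] spills = new long[10];
        Arrays.fill(spills, 256L);                       // ten 256 MB spill files
        System.out.println(finalPassBytes(spills, 10));  // prints 2560
        System.out.println(finalPassBytes(spills, 4));   // prints 2560
    }
}
```

Whether the factor is 10 or 4, the final pass touches all 2560 MB, which is why I believe the peak memory is the same either way. (I know the real Merger chooses the width of the first pass more carefully so later passes are exactly factor-wide, but that doesn't change the size of the final pass.)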
