Hi Donovan,

This is sort of tangential to your question, but we tried upping our io.sort.mb to a really high value and it actually resulted in slower performance (I think we bumped it up to 1400MB, and that was slower than leaving it at 256 on a machine with 32GB of RAM). I'm not entirely sure why; it could have been a garbage collection issue or some other secondary effect slowing things down.
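For what it's worth, we were setting it per job rather than cluster-wide. A rough sketch with the old mapred API (the job name below is made up; the property keys are the real ones):

import org.apache.hadoop.mapred.JobConf;

public class SortBufferConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setJobName("sort-buffer-test");  // hypothetical job name
        // Sort buffer size in MB -- the knob we experimented with.
        conf.setInt("io.sort.mb", 256);
        // Fraction of the buffer that fills before a background
        // spill to disk starts (default 0.80).
        conf.setFloat("io.sort.spill.percent", 0.80f);
    }
}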
Keep in mind that Hadoop will always spill map outputs to disk, no matter how large your sort buffer is: if a reduce attempt fails, the data needs to exist on disk somewhere for the next attempt to fetch, so it might be counterproductive to try to "eliminate" spills.

~Ed

On Tue, Oct 19, 2010 at 8:02 AM, Donovan Hide <[email protected]> wrote:
> Hi,
> is there a reason why the io.sort.mb setting is hard-coded to the
> maximum of 2047MB?
>
> MapTask.java 789-791
>
> if ((sortmb & 0x7FF) != sortmb) {
>     throw new IOException("Invalid \"io.sort.mb\": " + sortmb);
> }
>
> Given that the EC2 High-Memory Quadruple Extra Large Instance has
> 68.4GB of memory and 8 cores, it would make sense to be able to set
> io.sort.mb to close to 8GB. I have a map task that outputs
> 144,586,867 records of average size 12 bytes, and a sort buffer
> greater than 2047MB would allow me to prevent the inevitable spills.
> I know I can reduce the size of the map inputs to solve the problem,
> but 2047MB seems a bit arbitrary given the spec of EC2 instances.
>
> Cheers,
> Donovan.
>
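P.S. On the 2047 limit itself: 0x7FF is 2047 in decimal, so that check rejects any value that needs more than 11 bits. My guess is the cap exists because the sort buffer is a single Java byte[], which is indexed by int and so can't exceed ~2GB; 2047MB is the largest whole-megabyte size that still fits. A standalone illustration of the arithmetic (not Hadoop code):

// Demo of the check from MapTask.java -- 0x7FF == 2047, so any
// sortmb value needing more than 11 bits fails the test.
public class SortMbCheck {
    static boolean valid(int sortmb) {
        return (sortmb & 0x7FF) == sortmb;
    }

    public static void main(String[] args) {
        System.out.println(valid(2047)); // true  -- largest accepted value
        System.out.println(valid(2048)); // false -- needs a 12th bit
        System.out.println(valid(8192)); // false -- so ~8GB is rejected
    }
}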
