Hey guys, just wanted to ask: are there any best practices to follow for improving Hadoop shuffle performance?
I am running Hadoop 0.20.205 on an 8-node cluster. Each node has 24 cores/CPUs and 48 GB RAM. I have set the following parameters:

fs.inmemory.size.mb = 2000
io.sort.mb = 2000
io.sort.factor = 200
io.file.buffer.size = 262544
mapred.map.tasks = 200
mapred.reduce.tasks = 40
mapred.reduce.parallel.copies = 80
mapred.map.child.java.opts = 1024 MB
mapred.reduce.child.java.opts = 1024 MB
mapred.job.tracker.handler.count = 60
tasktracker.http.threads = 50
mapred.job.reuse.jvm.num.tasks = -1
mapred.compress.map.output = true
mapred.reduce.slowstart.completed.maps = 0.5
mapred.tasktracker.map.tasks.maximum = 24
mapred.tasktracker.reduce.tasks.maximum = 12

Can anyone please validate the above tuning parameters and suggest any further improvements? My mappers are running fine, but the shuffle and reduce phases are slower than expected for normal jobs. I would like to know what I am doing wrong or missing.

Thanks,
Praveenesh
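For clarity, here is a minimal sketch of how the shuffle-related subset of these settings looks in my mapred-site.xml (property names as in Hadoop 0.20.x; the heap sizes are expressed in the usual `-Xmx` JVM-option form, which is how `*.child.java.opts` values are actually specified):

```xml
<!-- Sketch: shuffle-related subset of mapred-site.xml, values as listed above -->
<configuration>
  <!-- Map-side sort buffer and merge fan-in -->
  <property>
    <name>io.sort.mb</name>
    <value>2000</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>200</value>
  </property>
  <!-- Reduce-side parallel fetchers pulling map output -->
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>80</value>
  </property>
  <!-- TaskTracker threads serving map output over HTTP -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>50</value>
  </property>
  <!-- Compress map output to shrink shuffle traffic -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Start reducers after 50% of maps complete -->
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.5</value>
  </property>
  <!-- Child JVM heaps (JVM-option form of the 1024 MB above) -->
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
```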