I remember having a problem like this at one point; it was related to the mean run time of my tasks and the rate at which the jobtracker could start new tasks. By increasing the split size until the mean run time of my tasks was in the minutes, I was able to drive up the utilization. A rough sketch of the kind of setting involved follows.
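For instance, something along these lines in the Hive session raises the minimum split size (a minimal sketch, assuming Hadoop 0.20-era property names; the 256 MB value is only illustrative, not a recommendation, and the comment lines are explanatory):

    -- Illustrative only: force larger input splits so each map task runs for
    -- minutes rather than seconds (assumes the Hadoop 0.20-era property name).
    SET mapred.min.split.size=268435456;
    -- 268435456 bytes = 256 MB. Fewer, longer-running map tasks mean the
    -- jobtracker spends less time starting tasks relative to useful work.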
On Wed, Oct 14, 2009 at 7:31 AM, Chris Seline <[email protected]> wrote:

> No, there doesn't seem to be all that much network traffic. Most of the
> time traffic (measured with nethogs) is about 15-30K/s on the master and
> slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe
> 5-10 seconds on a query that takes 10 minutes, but that is still less than
> what I see in scp transfers on EC2, which is typically about 30 MB/s.
>
> thanks
>
> Chris
>
>
> Jason Venner wrote:
>
>> Are your network interfaces or the namenode/jobtracker/datanodes saturated?
>>
>> On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline <[email protected]> wrote:
>>
>>> I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of
>>> 11 c1.xlarge instances (1 master, 10 slaves); that is the biggest instance
>>> available, with 20 compute units and 4x 400 GB disks.
>>>
>>> I wrote some scripts to test many (hundreds of) configurations running a
>>> particular Hive query, to try to make it as fast as possible, but no
>>> matter what I don't seem to be able to get above roughly 45% CPU
>>> utilization on the slaves, and not more than about 1.5% wait state. I have
>>> also measured network traffic and there don't seem to be bottlenecks there
>>> at all.
>>>
>>> Here are some typical CPU utilization lines from top on a slave when
>>> running a query:
>>>
>>> Cpu(s): 33.9%us, 7.4%sy, 0.0%ni, 56.8%id, 0.6%wa, 0.0%hi, 0.5%si, 0.7%st
>>> Cpu(s): 33.6%us, 5.9%sy, 0.0%ni, 58.7%id, 0.9%wa, 0.0%hi, 0.4%si, 0.5%st
>>> Cpu(s): 33.9%us, 7.2%sy, 0.0%ni, 56.8%id, 0.5%wa, 0.0%hi, 0.6%si, 1.0%st
>>> Cpu(s): 38.6%us, 8.7%sy, 0.0%ni, 50.8%id, 0.5%wa, 0.0%hi, 0.7%si, 0.7%st
>>> Cpu(s): 36.8%us, 7.4%sy, 0.0%ni, 53.6%id, 0.4%wa, 0.0%hi, 0.5%si, 1.3%st
>>>
>>> It seems like, if tuned properly, I should be able to max out my CPU (or
>>> my disk) and get roughly twice the performance I am seeing now. None of
>>> the parameters I am tuning seem to be able to achieve this. Adjusting
>>> mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting
>>> io.file.buffer.size to 4096 does better than the default, but the rest of
>>> the values I am testing seem to have little positive effect.
>>>
>>> These are the parameters I am testing, and the values tried:
>>>
>>> io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
>>> mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> io.bytes.per.checksum=256,512,1024,2048,4192
>>> mapred.output.compress=true,false
>>> hive.exec.compress.intermediate=true,false
>>> hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
>>> mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
>>> mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
>>> mapred.merge.recordsBeforeProgress=5000,10000,20000,30000
>>> mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
>>> io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
>>> mapred.job.tracker.handler.count=3,4,5,7,10,15,25
>>> hive.merge.size.per.task=64000000,128000000,168000000,256000000,300000000,400000000
>>> hive.optimize.ppd=true,false
>>> hive.merge.mapredfiles=false,true
>>> io.sort.record.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> hive.map.aggr.hash.percentmemory=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>>> mapred.tasktracker.reduce.tasks.maximum=1,2,3,4,5,6,8,10,12,15,20,30
>>> mapred.reduce.parallel.copies=1,2,4,6,8,10,13,16,20,25,30,50
>>> io.seqfile.lazydecompress=true,false
>>> io.sort.mb=20,50,75,100,150,200,250,350,500
>>> mapred.compress.map.output=true,false
>>> io.file.buffer.size=1024,2048,4096,8192,16384,32768,65536,131072,262144
>>> hive.exec.reducers.bytes.per.reducer=1000000000
>>> dfs.datanode.handler.count=1,2,3,4,5,6,8,10,15
>>> mapred.tasktracker.map.tasks.maximum=5,8,12,20
>>>
>>> Anyone have any thoughts for other parameters I might try? Am I going
>>> about this the wrong way? Am I missing some other bottleneck?
>>>
>>> thanks
>>>
>>> Chris Seline
>>>

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
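For reference, a single trial in a sweep like the one quoted above can be expressed as a short block of SET statements run in the same Hive session before the query being benchmarked. This is only a sketch: the particular values are an example combination, not recommended settings, and it assumes a Hadoop 0.20 / Hive deployment like the one described.

    -- One example combination from a sweep like the one quoted above
    -- (values illustrative only; run before the query being benchmarked).
    SET mapred.map.tasks=20;
    SET mapred.reduce.tasks=30;
    SET io.file.buffer.size=4096;
    SET io.sort.mb=200;
    SET mapred.compress.map.output=true;
    SET mapred.min.split.size=268435456;
    -- ...then run the Hive query under test in the same session.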
