Are your network interface or the namenode/jobtracker/datanodes saturated?
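One way to check the network side directly: on Linux, per-interface throughput can be computed from two samples of /proc/net/dev and compared against what the instance's NIC can sustain. A minimal sketch (the interface names and the half-second sampling window are arbitrary choices, not anything from this thread):

```python
import time

def parse_proc_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_bytes, tx_bytes)}."""
    stats = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        iface, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is cumulative RX bytes, field 8 is cumulative TX bytes.
        stats[iface.strip()] = (int(fields[0]), int(fields[8]))
    return stats

def sample_throughput(interval=0.5):
    """Return {iface: (rx_bytes/s, tx_bytes/s)} averaged over `interval`."""
    with open("/proc/net/dev") as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        after = parse_proc_net_dev(f.read())
    return {i: ((after[i][0] - before[i][0]) / interval,
                (after[i][1] - before[i][1]) / interval)
            for i in before if i in after}

if __name__ == "__main__":
    for iface, (rx, tx) in sample_throughput().items():
        print(f"{iface}: rx {rx / 1e6:.1f} MB/s, tx {tx / 1e6:.1f} MB/s")
```

Running this on a slave during the reduce/shuffle phase would show whether any interface is anywhere near line rate.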

On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline <[email protected]> wrote:

> I am using the Cloudera 0.3 scripts to start a Hadoop cluster on EC2 of 11
> c1.xlarge instances (1 master, 10 slaves); that is the biggest instance
> available, with 20 EC2 compute units and 4x 400 GB disks.
>
> I wrote some scripts to test many (hundreds of) configurations running a
> particular Hive query, trying to make it as fast as possible, but no matter
> what I can't seem to get above roughly 45% CPU utilization on the
> slaves, or more than about 1.5% wait state. I have also measured
> network traffic, and there don't seem to be any bottlenecks there.
>
> Here are some typical CPU utilization lines from top on a slave when
> running a query:
> Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,
>  0.7%st
> Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,
>  0.5%st
> Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,
>  1.0%st
> Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,
>  0.7%st
> Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,
>  1.3%st
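Averaging the %id fields in those five samples puts the slaves at roughly 55% idle, which matches the ~45% utilization figure. A small parser makes that concrete (an illustration only, not part of the test scripts described above):

```python
import re

# The five Cpu(s) samples quoted above, with the wrapped lines rejoined.
lines = [
    "Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,  0.7%st",
    "Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,  0.5%st",
    "Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,  1.0%st",
    "Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,  0.7%st",
    "Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,  1.3%st",
]

def idle_pct(top_line):
    """Pull the idle percentage out of a `top` Cpu(s) summary line."""
    m = re.search(r"([\d.]+)%id", top_line)
    if m is None:
        raise ValueError("no %id field in line")
    return float(m.group(1))

avg = sum(idle_pct(l) for l in lines) / len(lines)
print(f"average idle: {avg:.1f}%")  # about 55% idle across these samples
```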
>
> It seems that, if tuned properly, I should be able to max out my CPU (or my
> disk) and get roughly twice the performance I am seeing now. None of the
> parameters I am tuning seems to achieve this. Adjusting
> mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting
> io.file.buffer.size to 4096 does better than the default, but the rest of
> the values I am testing seem to have little positive effect.
>
> These are the parameters I am testing, and the values tried:
>
> io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
>
> mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
> io.bytes.per.checksum=256,512,1024,2048,4192
> mapred.output.compress=true,false
> hive.exec.compress.intermediate=true,false
>
> hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
> mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
>
> mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
> mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
> mapred.merge.recordsBeforeProgress=5000,10000,20000,30000
>
> mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
>
> io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
> mapred.job.tracker.handler.count=3,4,5,7,10,15,25
>
> hive.merge.size.per.task=64000000,128000000,168000000,256000000,300000000,400000000
> hive.optimize.ppd=true,false
> hive.merge.mapredfiles=false,true
>
> io.sort.record.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
>
> hive.map.aggr.hash.percentmemory=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
> mapred.tasktracker.reduce.tasks.maximum=1,2,3,4,5,6,8,10,12,15,20,30
> mapred.reduce.parallel.copies=1,2,4,6,8,10,13,16,20,25,30,50
> io.seqfile.lazydecompress=true,false
> io.sort.mb=20,50,75,100,150,200,250,350,500
> mapred.compress.map.output=true,false
> io.file.buffer.size=1024,2048,4096,8192,16384,32768,65536,131072,262144
> hive.exec.reducers.bytes.per.reducer=1000000000
> dfs.datanode.handler.count=1,2,3,4,5,6,8,10,15
> mapred.tasktracker.map.tasks.maximum=5,8,12,20
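For what it's worth, a one-factor-at-a-time sweep over a grid like that is easy to drive from a short script. Here is a minimal sketch; the grid below is a small subset of the values listed above, and BASELINE, QUERY, and the table name are made-up placeholders, not the actual scripts from this thread (settings could equally be passed as `-hiveconf` flags instead of `set` statements):

```python
# Hypothetical sweep driver for per-query Hadoop/Hive settings.
GRID = {
    "mapred.map.tasks": [20, 40, 60, 100],
    "mapred.reduce.tasks": [10, 20, 40],
    "io.file.buffer.size": [4096, 65536],
}
BASELINE = {"mapred.map.tasks": 40, "mapred.reduce.tasks": 20,
            "io.file.buffer.size": 4096}
QUERY = "SELECT count(1) FROM some_table"  # stand-in for the real query

def one_at_a_time(grid, baseline):
    """Yield configs that vary one parameter at a time from the baseline."""
    for key, values in grid.items():
        for v in values:
            cfg = dict(baseline)
            cfg[key] = v
            yield cfg

def hive_command(cfg, query):
    """Build an argv for `hive -e`, prefixing the query with `set k=v;`."""
    sets = "".join(f"set {k}={v}; " for k, v in sorted(cfg.items()))
    return ["hive", "-e", sets + query]

if __name__ == "__main__":
    # Print the commands rather than running them; on a real cluster,
    # run each with subprocess.run(...) and record the wall-clock time.
    for cfg in one_at_a_time(GRID, BASELINE):
        print(" ".join(hive_command(cfg, QUERY)))
```

One-at-a-time sweeps keep the run count linear in the number of values, at the cost of missing interactions between parameters (e.g. io.sort.mb versus the child JVM heap).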
>
> Does anyone have thoughts on other parameters I might try? Am I going about
> this the wrong way? Am I missing some other bottleneck?
>
> thanks
>
> Chris Seline
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
