That definitely helps a lot! I saw a few people talking about it on the web, and they say to set the value to Long.MAX_VALUE, but that is not what I have found to be best. I see about a 25% improvement at 300MB (300000000), and CPU utilization is now up around 50-70%, but I am still fine-tuning.
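For anyone following along, applying a value like that per-query from the Hive CLI looks roughly like the sketch below. I am assuming the knob under discussion is mapred.min.split.size (the Hive CLI passes set keys through to the underlying job conf); if a different split-size parameter is being tuned, substitute accordingly:

    -- prepended to the Hive query script; 300000000 = the 300MB value above
    -- (assumes the split-size knob being tuned is mapred.min.split.size)
    set mapred.min.split.size=300000000;
    -- ... the query under test follows ...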

thanks!

Chris

Jason Venner wrote:
I remember having a problem like this at one point; it was related to the mean run time of my tasks and the rate at which the jobtracker could start new tasks.

By increasing the split size until the mean run time of my tasks was in the
minutes, I was able to drive up the utilization.
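(A rough worked example of that rule of thumb, using an assumed scan rate rather than a measured one: at ~10 MB/s per map task, a 64 MB split finishes in about 6 seconds, which is dominated by task startup and scheduling overhead, while a 300 MB split runs for about 30 seconds and a ~2 GB split for over 3 minutes, long enough that the jobtracker's task-launch rate stops being the limiting factor.)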


On Wed, Oct 14, 2009 at 7:31 AM, Chris Seline <[email protected]> wrote:

No, there doesn't seem to be all that much network traffic. Most of the time, traffic (measured with nethogs) is about 15-30 KB/s on the master and slaves during the map phase; it sometimes bursts up to 5-10 MB/s on a slave for maybe 5-10 seconds during a query that takes 10 minutes, but that is still less than what I see in scp transfers on EC2, which is typically about 30 MB/s.

thanks

Chris


Jason Venner wrote:

Are your network interfaces or the namenode/jobtracker/datanodes saturated?


On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline <[email protected]> wrote:

I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 c1.xlarge instances (1 master, 10 slaves); that is the biggest instance type available, with 20 EC2 compute units and 4x 400GB disks.

I wrote some scripts to test many (hundreds of) configurations running a particular Hive query, trying to make it as fast as possible, but no matter what, I don't seem to be able to get above roughly 45% CPU utilization on the slaves, and not more than about 1.5% wait state. I have also measured network traffic, and there don't seem to be any bottlenecks there at all.

Here are some typical CPU utilization lines from top on a slave when
running a query:
Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,  0.7%st
Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,  0.5%st
Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,  1.0%st
Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,  0.7%st
Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,  1.3%st

It seems like, if tuned properly, I should be able to max out my CPU (or my disk) and get roughly twice the performance I am seeing now, but none of the parameters I am tuning seem able to achieve this. Adjusting mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting io.file.buffer.size to 4096 does better than the default, but the rest of the values I am testing seem to have little positive effect.
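To be concrete, those two helpful settings can be applied per-query at the top of the Hive script; the task counts below are illustrative values, not the best ones found:

    set io.file.buffer.size=4096;   -- beat the default in these tests
    set mapred.map.tasks=60;        -- illustrative; a hint to the framework, not a hard limit
    set mapred.reduce.tasks=30;     -- illustrative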

These are the parameters I am testing, and the values tried (a sketch of applying one combination follows the list):

io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
io.bytes.per.checksum=256,512,1024,2048,4192
mapred.output.compress=true,false
hive.exec.compress.intermediate=true,false
hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
mapred.merge.recordsBeforeProgress=5000,10000,20000,30000
mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99
mapred.job.tracker.handler.count=3,4,5,7,10,15,25
hive.merge.size.per.task=64000000,128000000,168000000,256000000,300000000,400000000
hive.optimize.ppd=true,false
hive.merge.mapredfiles=false,true
io.sort.record.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
hive.map.aggr.hash.percentmemory=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
mapred.tasktracker.reduce.tasks.maximum=1,2,3,4,5,6,8,10,12,15,20,30
mapred.reduce.parallel.copies=1,2,4,6,8,10,13,16,20,25,30,50
io.seqfile.lazydecompress=true,false
io.sort.mb=20,50,75,100,150,200,250,350,500
mapred.compress.map.output=true,false
io.file.buffer.size=1024,2048,4096,8192,16384,32768,65536,131072,262144
hive.exec.reducers.bytes.per.reducer=1000000000
dfs.datanode.handler.count=1,2,3,4,5,6,8,10,15
mapred.tasktracker.map.tasks.maximum=5,8,12,20
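A sketch of a single test run from the grid above, written as a Hive script and executed with hive -f; the values shown are just one sample point, and note the caveat that daemon-side settings (mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, dfs.datanode.handler.count) are read when the tasktracker/datanode starts, so they must go in hadoop-site.xml (or the equivalent *-site.xml files) on the cluster nodes and require a daemon restart rather than a per-query set:

    -- test-run.hql: one sample point from the search grid (illustrative values)
    set io.sort.mb=200;
    set io.sort.factor=25;
    set io.file.buffer.size=4096;
    set mapred.compress.map.output=true;
    set mapred.child.java.opts=-Xmx800m;
    set mapred.reduce.parallel.copies=16;
    -- daemon-side knobs (tasktracker slots, datanode handlers) cannot be set here;
    -- they belong in hadoop-site.xml and require restarting the daemons
    -- ... the Hive query under test goes here ...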

Anyone have any thoughts on other parameters I might try? Am I going about this the wrong way? Am I missing some other bottleneck?

thanks

Chris Seline