Here's another data point from a small cluster running Cloudera's distribution of Hadoop 0.20.1:

4 slaves, each with 2 quad-core Xeon E5405 CPUs (2.00 GHz), 8 GB RAM, and 4 x 1 TB SATA drives
1 master running the NameNode, SecondaryNameNode, and JobTracker


dfs.replication=2
io.sort.factor=25
io.sort.mb=250
io.file.buffer.size=65536
mapred.child.java.opts=-Xmx400M
mapred.tasktracker.map.tasks.maximum=7
mapred.tasktracker.reduce.tasks.maximum=7
mapred.job.reuse.jvm.num.tasks=10
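
As a rough sanity check on these settings, the worst-case task heap per slave can be tallied from the slot counts and the -Xmx value above. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it counts task heap only, ignoring JVM overhead, the DataNode and TaskTracker daemons, and the OS page cache.

```python
def task_heap_budget_mb(map_slots, reduce_slots, child_heap_mb):
    """Heap used by task JVMs on one slave when every slot is busy."""
    return (map_slots + reduce_slots) * child_heap_mb

slaves = 4
map_slots = 7          # mapred.tasktracker.map.tasks.maximum
reduce_slots = 7       # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb = 400    # -Xmx400M from mapred.child.java.opts
sort_mb = 250          # io.sort.mb, allocated inside each task's heap
node_ram_mb = 8 * 1024

heap_mb = task_heap_budget_mb(map_slots, reduce_slots, child_heap_mb)
map_capacity = slaves * map_slots

print("task heap per slave: %d MB of %d MB RAM" % (heap_mb, node_ram_mb))
print("cluster map capacity: %d slots" % map_capacity)
print("sort buffer fits in child heap: %s" % (sort_mb < child_heap_mb))
```

With every slot busy, the task JVMs alone can claim most of the 8 GB on a slave, which is worth keeping in mind before raising the slot counts or the child heap any further.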

$>hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar randomwriter -D dfs.block.size=134217728 input

Takes about 4 mins


$>hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar sort input output

Takes about 11 mins (the map phase takes about 4.5 mins)



With the default configuration, the map tasks ran for just a couple of seconds each, and the average number of tasks running at any one time was just 20% of the map task capacity. Increasing the block size and reusing JVMs across tasks had the most noticeable impact on performance.
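
To illustrate why the block size matters: the sort job gets roughly one map task per input block, so doubling the block size halves the number of map tasks and lets each one run longer relative to its startup cost. A minimal sketch, assuming a hypothetical 40 GiB input (the actual randomwriter output size is not stated here); 64 MB is the stock dfs.block.size in 0.20, and the helper name is made up for illustration.

```python
import math

def map_task_count(input_bytes, block_bytes):
    """Approximate number of map tasks, assuming one input split per block."""
    return math.ceil(input_bytes / block_bytes)

MB = 1024 * 1024
input_bytes = 40 * 1024 * MB   # hypothetical input size, not a measured value
map_capacity = 4 * 7           # 4 slaves x 7 map slots each

for block_bytes in (64 * MB, 134217728):
    tasks = map_task_count(input_bytes, block_bytes)
    waves = math.ceil(tasks / map_capacity)
    print("%d MB blocks: %d map tasks, about %d waves"
          % (block_bytes // MB, tasks, waves))
```

Fewer, longer tasks mean less time lost to per-task scheduling and JVM startup, which is also what JVM reuse attacks from the other side.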


-Bryan




On Oct 19, 2009, at 7:14 AM, Usman Waheed wrote:

io.sort.factor: 10
io.sort.mb: 100
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx200M
dfs.datanode.handler.count=3
2 Mappers
2 Reducer
