Here's another data point from a small cluster running Cloudera's
distribution of Hadoop 0.20.1:
4 slaves, each with 2 quad-core (E5405) 2.00 GHz CPUs, 8 GB RAM, and 4 1 TB SATA drives
1 master running the NameNode (nn), SecondaryNameNode (2nn) and JobTracker (jt)
dfs.replication=2
io.sort.factor: 25
io.sort.mb: 250
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx400M
mapred.tasktracker.map.tasks.maximum=7
mapred.tasktracker.reduce.tasks.maximum=7
mapred.job.reuse.jvm.num.tasks=10
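
A rough sanity check of what these settings imply for slot capacity and memory, as a minimal sketch. It counts only the -Xmx heap of the task JVMs (an assumption: JVM overhead and the DataNode/TaskTracker daemons are ignored), and the helper names are made up for illustration:

```python
# Values copied from the settings listed above.
SLAVES = 4
MAP_SLOTS_PER_NODE = 7       # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 7    # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB = 400          # -Xmx400M in mapred.child.java.opts
IO_SORT_MB = 250             # io.sort.mb, allocated inside the task heap
NODE_RAM_MB = 8 * 1024       # 8 GB RAM per slave


def cluster_slots(slaves, slots_per_node):
    """Total task slots of one kind across the cluster."""
    return slaves * slots_per_node


def worst_case_heap_mb(map_slots, reduce_slots, heap_mb):
    """Heap used on one node if every slot runs a task at full -Xmx."""
    return (map_slots + reduce_slots) * heap_mb


total_map_slots = cluster_slots(SLAVES, MAP_SLOTS_PER_NODE)
total_reduce_slots = cluster_slots(SLAVES, REDUCE_SLOTS_PER_NODE)
node_heap_mb = worst_case_heap_mb(
    MAP_SLOTS_PER_NODE, REDUCE_SLOTS_PER_NODE, CHILD_HEAP_MB
)
headroom_mb = NODE_RAM_MB - node_heap_mb

print("map slots:", total_map_slots)
print("reduce slots:", total_reduce_slots)
print("worst-case task heap per node (MB):", node_heap_mb)
print("RAM left for daemons and OS (MB):", headroom_mb)
print("sort buffer share of task heap:", IO_SORT_MB / CHILD_HEAP_MB)
```

Under these assumptions the 14 slots per node fit in 8 GB with some room left over, and the sort buffer takes more than half of each task's heap.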
$>hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar randomwriter -D dfs.block.size=134217728 input
Takes about 4 mins
$>hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar sort input output
Takes about 11 mins (map takes about 4.5 mins)
With the default configuration, the map tasks run for just a couple
of seconds each, and the average number of tasks running at any one
time is only about 20% of the map task capacity. Increasing the block
size and reusing JVMs across tasks (mapred.job.reuse.jvm.num.tasks)
had the most noticeable impact on performance.
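
To illustrate why the block size matters, here is a small sketch of how the map task count scales with it. It assumes one map task per HDFS block, and the input of 40 files of 1 GB each is a hypothetical figure, not a measurement from this cluster:

```python
import math

MB = 1024 * 1024
DEFAULT_BLOCK = 64 * MB    # dfs.block.size default in Hadoop 0.20
LARGER_BLOCK = 134217728   # 128 MB, the value passed to randomwriter

# Hypothetical input for illustration only.
FILES = 40
FILE_BYTES = 1024 * MB


def splits_for_file(file_bytes, block_bytes):
    """Number of map tasks for one file, assuming one split per block."""
    return math.ceil(file_bytes / block_bytes)


def total_map_tasks(files, file_bytes, block_bytes):
    """Map tasks for a set of equally sized files."""
    return files * splits_for_file(file_bytes, block_bytes)


maps_default = total_map_tasks(FILES, FILE_BYTES, DEFAULT_BLOCK)
maps_larger = total_map_tasks(FILES, FILE_BYTES, LARGER_BLOCK)

print("map tasks with 64 MB blocks:", maps_default)
print("map tasks with 128 MB blocks:", maps_larger)
```

Doubling the block size halves the number of map tasks, so each task processes twice as much data and the fixed per-task startup cost is paid half as often. JVM reuse attacks the same startup cost from the other side.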
-Bryan
On Oct 19, 2009, at 7:14 AM, Usman Waheed wrote:
io.sort.factor: 10
io.sort.mb: 100
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx200M
dfs.datanode.handler.count=3
2 Mappers
2 Reducers