Hi all,

We have setup a small cluster (13 nodes) using CDH3

We have been tuning it using TeraSort and Hive queries on our data,
and the copy phase is very slow, so I'd like to ask if anyone can look
over our config.

We have an unbalanced set of machines (all on a single switch):
- 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2 reducers)
- 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
mappers, 12 reducers)

We monitored the load using $top on machines, to settle on the number
of mappers and reducers to stop overloading them, and the map() and
reduce() is working very nicely - all our time

The config:

io.sort.mb=400
io.sort.factor=100
mapred.reduce.parallel.copies=20
tasktracker.http.threads=80
mapred.compress.map.output=true/false (no notible difference)
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
mapred.output.compression.type=BLOCK
mapred.inmem.merge.threshold=0
mapred.job.reduce.input.buffer.percent=0.7
mapred.job.reuse.jvm.num.tasks=50

An example job:
(select basis_of_record,count(1) from occurrence_record group by
basis_of_record)
Map input records 262,573,931 finished in 2mins30 using 833 mappers
Reduce was at 24% at 2mins30 finished map with all 55 running
Map output records: 1,855
Map output bytes: 28,724
REDUCE COPY PHASE finished after 7mins01 secs
Reduce finished after 7mins17secs

I am correct that 28,724 bytes emitted from a map should not take 4mins30 right?

We're running puppet so can test changes quickly.

Any pointers on how we can debug / improve this are greatly appreciated!
Tim

Reply via email to