Hi all,

We have set up a small cluster (13 nodes) using CDH3.
We have been tuning it using TeraSort and Hive queries on our data, and the copy phase is very slow, so I'd like to ask if anyone can look over our config.

We have an unbalanced set of machines (all on a single switch):
- 10 x Intel @ 2.83GHz Quad, 8GB, 2x500GB 7.2K SATA (3 mappers, 2 reducers)
- 3 x Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12 mappers, 12 reducers)

We monitored the load using top on the machines to settle on the number of mappers and reducers and stop overloading them. The map() and reduce() are working very nicely; all our time is spent in the copy phase.

The config:
  io.sort.mb=400
  io.sort.factor=100
  mapred.reduce.parallel.copies=20
  tasktracker.http.threads=80
  mapred.compress.map.output=true/false (no noticeable difference)
  mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
  mapred.output.compression.type=BLOCK
  mapred.inmem.merge.threshold=0
  mapred.job.reduce.input.buffer.percent=0.7
  mapred.job.reuse.jvm.num.tasks=50

An example job:
  (select basis_of_record,count(1) from occurrence_record group by basis_of_record)

  Map input records: 262,573,931
  Maps finished in 2mins30 using 833 mappers
  Reduce was at 24% when the maps finished at 2mins30, with all 55 running
  Map output records: 1,855
  Map output bytes: 28,724
  REDUCE COPY PHASE finished after 7mins01
  Reduce finished after 7mins17

Am I correct that 28,724 bytes emitted from the maps should not take 4mins30 to copy?

We're running puppet, so we can test changes quickly. Any pointers on how we can debug or improve this are greatly appreciated!

Tim
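
For reference, a rough sanity check of the numbers in the example job, as a small Python sketch. The figures are copied from the counters above. The one assumption (not confirmed in this thread) is that each of the 55 reducers fetches one, possibly empty, segment from each of the 833 maps, so the fetch count is an estimate only.

```python
# Figures copied from the example job above.
num_maps = 833
num_reducers = 55
map_output_bytes = 28_724
maps_finished_s = 2 * 60 + 30   # maps done at 2mins30
copy_finished_s = 7 * 60 + 1    # reduce copy phase done at 7mins01

# Time the copy phase kept running after the last map finished.
copy_tail_s = copy_finished_s - maps_finished_s

# ASSUMPTION: every reducer fetches one segment from every map,
# even when that segment carries no records.
fetches = num_maps * num_reducers

bytes_per_s = map_output_bytes / copy_tail_s
fetches_per_s = fetches / copy_tail_s
bytes_per_fetch = map_output_bytes / fetches

print(f"copy tail after maps finished: {copy_tail_s} s")
print(f"estimated fetches:             {fetches}")
print(f"payload rate:                  {bytes_per_s:.0f} bytes/s")
print(f"fetch rate:                    {fetches_per_s:.0f} fetches/s")
print(f"average payload per fetch:     {bytes_per_fetch:.2f} bytes")
```

If that assumption holds, the copy phase here is moving well under one byte of payload per fetch on average, which would suggest the time goes into per-fetch overhead rather than into network bandwidth.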