Hi everyone,
we are seeing very poor MapReduce performance on the following cluster setup:

- Hadoop - nightly build from 2007-10-25_17-03-53
- 5 tasktracker/datanodes + 1 jobtracker/namenode
- 3.7GB of input data (53,050 files in 18,150 folders - avg. file size: 74kB)

A MapReduce job takes up to 24 hours to complete on our cluster. We measured and monitored network bandwidth, disk I/O, paging, swapping and CPU utilization in order to rule out the host machines and the network itself as bottlenecks. However, a closer look at the log files and the source code revealed the following:

During the map phase, we observed that each tasktracker processes only one or at most two file splits at a time, which amounts to almost sequential reading/processing of the 10,000+ files. (Reading a file off the DFS takes only a second or less, and we measured relatively high throughput of ~1.5MB/s.) Increasing the "mapred.tasktracker.tasks.maximum" parameter from its default value of 2 to 10 didn't work - each node still processes at most two map tasks at a time...
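Since the increased value seems to be ignored, a tiny sanity check like the one below (our own throw-away snippet, not part of Hadoop; the class name is made up) should at least confirm which value each node actually resolves from its local hadoop-site.xml:

import org.apache.hadoop.conf.Configuration;

// Throw-away check: print the resolved value of mapred.tasktracker.tasks.maximum.
// Run on each tasktracker node with the Hadoop jars and the conf/ directory
// (hadoop-default.xml + hadoop-site.xml) on the classpath.
public class TasksMaximumCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // loads hadoop-default.xml and hadoop-site.xml
    int max = conf.getInt("mapred.tasktracker.tasks.maximum", 2);  // 2 is the shipped default
    System.out.println("mapred.tasktracker.tasks.maximum = " + max);
  }
}

(If every node prints 10 and we still only ever see two concurrent map tasks per node, the limit must be enforced somewhere else.)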

Another major performance gap seems to lie in the reduce copy phase: the load-balancing / anti-swamping policy at line 972 of ReduceTask.java, which guarantees that each tasktracker copies/fetches only one map output at a time from the "neighboring" tasktrackers, causes very low average throughput of 20kB/s and less. After we disabled (commented out) the "duplicate hosts" check, we reached average throughput of up to ~1MB/s.
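For anyone who doesn't have ReduceTask.java open: the toy program below (our own illustration, not the actual Hadoop source; all names are made up) shows the kind of "one in-flight fetch per host" policy that the duplicate-hosts check enforces, and why copy parallelism collapses on a cluster with only five tasktrackers:

import java.util.*;

// Toy model of the copy-scheduling policy: a map output is only scheduled
// for fetching if no other fetch from the same host is already in flight.
public class CopySchedulingSketch {
  public static void main(String[] args) {
    // Pending map outputs, represented here only by the host serving them.
    List<String> pendingOutputs = Arrays.asList("node1", "node1", "node2", "node3", "node3");

    Set<String> busyHosts = new HashSet<String>();
    List<String> scheduled = new ArrayList<String>();

    for (String host : pendingOutputs) {
      if (busyHosts.contains(host)) {
        continue;  // the "duplicate hosts" check: host already busy, skip for now
      }
      busyHosts.add(host);
      scheduled.add(host);
    }

    // Only one fetch per host runs concurrently, so with 5 tasktrackers a
    // reduce task never has more than 5 copies in flight, however slow each one is.
    System.out.println("Scheduled this round: " + scheduled);
  }
}

Commenting out that contains() check is essentially what we did; it trades the anti-swamping protection for much better copy throughput on our small cluster.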

It seems that Hadoop only scales well when processing large files on large clusters, whereas we would like to use it for a huge number of small files on a small cluster... Does anyone have similar experiences or cluster setups?

Any thoughts and ideas are much appreciated!
Thanks in advance!

Cu on the 'net,
                       Bye - bye,

                                  <<<<< André <<<< >>>> èrbnA >>>>>

