Hi everyone,

I ran two experiments, sorting 1 Gb and 10 Gb of data with Hadoop, on (1) a single machine and (2) a 5-node cluster on a LAN.
The command is:

    bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>] <in-dir> <out-dir>

The results are shown here: [image: image.png]

The map phase scales well. The problem is that the reduce phase takes much longer than expected on the cluster. As far as I know, the Hadoop sort example uses an identity function for reduce, which simply writes the map output to a file. I measured the LAN bandwidth at about 100 Mbps, but the average LAN traffic during reduce is only about 10 Mbps (for both sending and receiving), so the network does not look saturated and the slow reduce seems a bit odd to me.

I am quite new to Hadoop, so please forgive me if this is a naive question.

Thanks.

Best regards,
Jingwei Lu
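
P.S. To put a rough number on the network side, here is a small Python sketch of a lower bound on the shuffle time. It rests on several assumptions: the sizes above are taken as gigabytes, the map output is spread evenly over the nodes, each node has a single link at the stated speed, and disk I/O and HDFS output replication are ignored. The name shuffle_seconds is only for illustration and is not part of Hadoop.

```python
def shuffle_seconds(data_bytes, nodes, link_mbps):
    """Rough lower bound on shuffle time (illustrative helper, not a Hadoop API).

    Assumes keys are spread evenly, so each node reduces 1/nodes of the data
    and (nodes - 1)/nodes of that share was produced on other machines.
    """
    if nodes < 2:
        return 0.0  # single machine: nothing crosses the network
    per_node_bytes = data_bytes / nodes                   # reduce input per node
    remote_bytes = per_node_bytes * (nodes - 1) / nodes   # part fetched over the LAN
    return remote_bytes * 8 / (link_mbps * 1e6)           # bytes -> bits, Mbps -> bps


GB = 10**9  # assumption: decimal gigabytes
for size_gb in (1, 10):
    for mbps in (100, 10):  # measured link speed vs. observed traffic during reduce
        t = shuffle_seconds(size_gb * GB, nodes=5, link_mbps=mbps)
        print(f"{size_gb} GB, 5 nodes, {mbps} Mbps per node: ~{t:.0f} s")
```

Under these assumptions, 10 GB on 5 nodes needs about 128 s of pure transfer at 100 Mbps, and about 1280 s if the links only carry 10 Mbps, so even an idle-looking network can account for a noticeable part of the reduce time.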