Question about data sorting on Hadoop

Jingwei Lu Wed, 29 Jun 2011 13:40:21 -0700

Hi Everyone:

I ran two experiments sorting 1 GB and 10 GB of data with Hadoop, on
(1) a single machine and (2) a 5-node cluster on a LAN.


The command is:

bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>]
<in-dir> <out-dir>

The results are shown here:

[image: image.png]

The map phase scales well. The problem is that the reduce phase takes much
longer than expected on the cluster.
As far as I know, the Hadoop sort example uses an identity function for
reduce, which simply writes the map output to a file. I measured the LAN
bandwidth at about 100 Mbps, while the average LAN traffic during the reduce
phase was only about 10 Mbps (for both sending and receiving), so the network
does not seem to be saturated.
As a result, the long reduce time looks a bit odd to me.
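
To put the bandwidth figures in context, here is a rough back-of-the-envelope
sketch of how long the reduce-side copy alone might take. It rests on several
assumptions that are not measurements: the data sizes are gigabytes of 10^9
bytes, map output is spread evenly over the 5 nodes, each node reduces an
equal share, and HDFS output replication is ignored. The function name
shuffle_seconds is made up for illustration.

```python
def shuffle_seconds(total_bytes, nodes, link_mbps):
    """Estimate the time for one node to receive its remote share of map output.

    Assumption: data is evenly partitioned, so each node reduces 1/nodes of the
    total, and (nodes - 1)/nodes of that share arrives from other machines.
    """
    remote_bytes_per_node = total_bytes * (nodes - 1) / nodes ** 2
    link_bits_per_second = link_mbps * 1_000_000
    return remote_bytes_per_node * 8 / link_bits_per_second


if __name__ == "__main__":
    GB = 10 ** 9  # assumed: decimal gigabytes
    for size_gb in (1, 10):
        for mbps in (100, 10):  # measured link rate vs. observed average traffic
            t = shuffle_seconds(size_gb * GB, nodes=5, link_mbps=mbps)
            print(f"{size_gb} GB at {mbps} Mbps: about {t:.0f} s per node")
```

Under these assumptions, the 10 GB run would need about 128 s per node to
copy map output at the full 100 Mbps, and about 1280 s at the observed
10 Mbps, so the low network utilization by itself could account for a large
part of the reduce time.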

I am quite new to Hadoop, so please forgive me if these are stupid questions.

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu
