500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and running the job again; a mapper should optimally run for at least a minute. Small files also don't make good use of HDFS blocks.
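A minimal sketch of one way such a merge could look with the FileSystem API (the paths and class name here are made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: merge every file under one HDFS directory into a
    // single file so each mapper gets a full block's worth of data to chew on.
    public class ConcatSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileUtil.copyMerge(fs, new Path("/user/jander/small-input"),
                           fs, new Path("/user/jander/merged/input.txt"),
                           false,   // keep the original small files
                           conf,
                           "\n");   // write a newline after each merged file
      }
    }

Another common route is a combining input format that packs many small files into one split, but a plain merge like the above is the simplest fix.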
Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

2010/10/5 Jander <442950...@163.com>:
> Hi Jeff,
>
> Thank you very much for your reply.
>
> I know hadoop has overhead, but is it too large in my problem?
>
> The 1GB text input has about 500 map tasks because the input is composed of
> little text files. And the time each map takes is from 8 to 20 seconds. I
> use compression like conf.setCompressMapOutput(true).
>
> Thanks,
> Jander
>
> At 2010-10-05 16:28:55, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
>> Hi Jander,
>>
>> Hadoop has overhead compared to a single-machine solution. How many tasks
>> did you get when you ran your hadoop job? And how much time does each map
>> and reduce task take?
>>
>> There are lots of tips for performance tuning of hadoop, such as
>> compression and jvm reuse.
>>
>>
>> 2010/10/5 Jander <442950...@163.com>:
>>> Hi all,
>>> I am building an application using hadoop.
>>> I take 1GB of text data as input, with the results as follows:
>>> (1) the cluster of 3 PCs: the time consumed is 1020 seconds.
>>> (2) the cluster of 4 PCs: the time is about 680 seconds.
>>> But the application before I used Hadoop took about 280 seconds, so at
>>> the speed above I would need 8 PCs to reach the same speed as before.
>>> Now the question: is this correct?
>>>
>>> Jander,
>>> Thanks.
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Harsh J
www.harshj.com
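For reference, the compression and JVM-reuse settings mentioned in the thread look roughly like this with the old mapred API (the job class and codec choice are placeholders, not taken from the original messages):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static JobConf configure() {
        JobConf conf = new JobConf(TuningExample.class);
        conf.setCompressMapOutput(true);                    // compress intermediate map output
        conf.setMapOutputCompressorClass(GzipCodec.class);  // codec choice is just an example
        conf.setNumTasksToExecutePerJvm(-1);                // -1 = reuse one JVM for all of a job's tasks
        return conf;
      }
    }

JVM reuse mainly helps jobs with many short-lived tasks, which is exactly the situation 500 small input files create.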