500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and running the job again; a mapper should optimally run for at least a minute. Small files also don't make good use of HDFS blocks.
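A minimal sketch of one way such a merge could look with the FileSystem API (the paths and class name here are made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: merge every file under one HDFS directory into a
    // single file so each mapper gets a full block's worth of data to chew on.
    public class ConcatSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileUtil.copyMerge(fs, new Path("/user/jander/small-input"),
                           fs, new Path("/user/jander/merged/input.txt"),
                           false,   // keep the original small files
                           conf,
                           "\n");   // write a newline after each merged file
      }
    }

Another common route is a combining input format that packs many small files into one split, but a plain merge like the above is the simplest fix.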
Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

2010/10/5 Jander <442950...@163.com>:
> Hi Jeff,
>
> Thank you very much for your reply.
>
> I know hadoop has overhead, but is it too large in my problem?
>
> The 1GB text input has about 500 map tasks because the input is composed of
> little text files. And the time each map takes is from 8 to 20 seconds. I
> use compression like conf.setCompressMapOutput(true).
>
> Thanks,
> Jander
>
> At 2010-10-05 16:28:55, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
>> Hi Jander,
>>
>> Hadoop has overhead compared to a single-machine solution. How many tasks
>> did you get when you ran your hadoop job? And how much time does each map
>> and reduce task take?
>>
>> There are lots of tips for performance tuning of hadoop, such as
>> compression and jvm reuse.
>>
>>
>> 2010/10/5 Jander <442950...@163.com>:
>>> Hi all,
>>> I am building an application using hadoop.
>>> I take 1GB of text data as input, with the results as follows:
>>> (1) the cluster of 3 PCs: the time consumed is 1020 seconds.
>>> (2) the cluster of 4 PCs: the time is about 680 seconds.
>>> But the application before I used Hadoop took about 280 seconds, so at
>>> the speed above I would need 8 PCs to reach the same speed as before.
>>> Now the question: is this correct?
>>>
>>> Jander,
>>> Thanks.
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Harsh J
www.harshj.com
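For reference, the compression and JVM-reuse settings mentioned in the thread look roughly like this with the old mapred API (the job class and codec choice are placeholders, not taken from the original messages):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static JobConf configure() {
        JobConf conf = new JobConf(TuningExample.class);
        conf.setCompressMapOutput(true);                    // compress intermediate map output
        conf.setMapOutputCompressorClass(GzipCodec.class);  // codec choice is just an example
        conf.setNumTasksToExecutePerJvm(-1);                // -1 = reuse one JVM for all of a job's tasks
        return conf;
      }
    }

JVM reuse mainly helps jobs with many short-lived tasks, which is exactly the situation 500 small input files create.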