I agree with you for the most part, but I have some further questions. Mappers 
work on the local machine, so there are no network transfers during this phase; 
if the original data stored in HDFS is compressed, it will only decrease the I/O 
time. My main doubt is whether a mapper can work on only part of the data when 
the data is compressed, since it seems a compressed file can't be split. I tried 
a "select sum()" in Hive and traced the job: the .tar.gz data seems to be 
processed on a single machine only, and it is stuck there for quite a long time 
(it seems to be waiting for the other parts of the data to be copied from other 
machines), while uncompressed data is processed on different machines in 
parallel. Do you know anything about this?
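
The splitting concern can be illustrated with a small sketch. It uses only the Python standard library and is not Hadoop's actual reader code; the sample records and the helper names read_plain_split and can_decode_from are made up for the illustration. Plain text can be picked up at any byte offset by skipping to the next newline, whereas a gzip stream can only be decoded from its beginning:

```python
import gzip
import zlib

# Hypothetical sample data: 20,000 newline-terminated records.
records = [f"row-{i},{i * 7}\n".encode() for i in range(20000)]
plain = b"".join(records)
packed = gzip.compress(plain)


def read_plain_split(data, start, end):
    """Return the records that begin in [start, end), as a text split reader would."""
    pos = start
    if start > 0:
        # Skip the partial record at the front; the previous split owns it.
        pos = data.index(b"\n", start - 1) + 1
    out = []
    while pos < end and pos < len(data):
        nxt = data.index(b"\n", pos) + 1
        out.append(data[pos:nxt])  # may run past `end` to finish the record
        pos = nxt
    return out


def can_decode_from(data, offset):
    """Try to gunzip starting at an arbitrary byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


mid = len(plain) // 2
first_half = read_plain_split(plain, 0, mid)
second_half = read_plain_split(plain, mid, len(plain))
print("plain text, two splits cover all records:",
      first_half + second_half == records)
print("gzip decodes from offset 0:", can_decode_from(packed, 0))
print("gzip decodes from the middle:", can_decode_from(packed, len(packed) // 2))
```

If the same holds for the .tar.gz input, a single mapper has to read the whole file from the start, which would match the behaviour traced above.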

2010-08-26 



shangan 



From: Harsh J 
Sent: 2010-08-26  12:15:49 
To: common-user 
Cc: 
Subject: Re: data in compression format affect mapreduce speed 
 
Logically it 'should' increase time, as it's an extra step beyond the
Mapper/Reducer. But while your processing time would slightly (very
very slightly) increase, your IO and network transfer time would
decrease by a large margin -- giving you a clear impression that your
total job time has decreased overall. The difference being in writing
out say 10 GB before, and writing out 5-7 GB this time (a crude
example).
With the fast CPUs available these days, compressing and decompressing
should hardly take a noticeable amount of extra time. It's almost
negligible when using gzip, lzo or plain deflate.
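
A rough way to check this trade-off on your own data is sketched below. It uses only the Python standard library; the log-like sample text is made up for the illustration, and the timings and ratio will vary with the data and the hardware, so no particular numbers are implied:

```python
import gzip
import time

# Hypothetical sample: repetitive log-like text, which tends to compress well.
raw = b"".join(
    f"2010-08-26 12:15:{i % 60:02d} INFO task_{i % 100} done\n".encode()
    for i in range(200000)
)

# Time one compression pass.
t0 = time.perf_counter()
packed = gzip.compress(raw)
compress_s = time.perf_counter() - t0

# Time one decompression pass.
t0 = time.perf_counter()
restored = gzip.decompress(packed)
decompress_s = time.perf_counter() - t0

print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(packed)}")
print(f"size ratio:       {len(packed) / len(raw):.3f}")
print(f"compress time:    {compress_s:.4f} s")
print(f"decompress time:  {decompress_s:.4f} s")
```

Comparing the bytes saved against the seconds spent gives a first estimate of whether the extra CPU step pays for itself on a given disk and network.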
On Thu, Aug 26, 2010 at 9:13 AM, Ted Yu <[email protected]> wrote:
> Compressed data would increase processing time in mapper/reducer but
> decrease the amount of data transferred between tasktracker nodes.
> Normally you should consider applying some form of compression.
>
> On Wed, Aug 25, 2010 at 7:32 PM, shangan <[email protected]> wrote:
>
>> will data stored in  compression format affect mapreduce job speed?
>> increase or decrease? or more complex relationship between these two ?  can
>> anybody give some explanation in detail?
>>
>> 2010-08-26
>>
>>
>>
>> shangan
>>
>
-- 
Harsh J
www.harshj.com