Re: compressed input splits to Map tasks

Rekha Joshi Thu, 15 Apr 2010 02:40:16 -0700

By default, with compressed files you lose the ability to control splits, and 
each file is essentially read as a single split by a single mapper.
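
To make the effect concrete, here is a rough sketch in plain Java (no Hadoop 
dependency) of how the split count for one file comes out. The class and 
method names are made up for illustration, the 64 MB split size is an 
assumption, and the 10% slop factor is modeled on what FileInputFormat does, 
so treat this as an approximation rather than the actual Hadoop code.

```java
// Illustrative model only; this is not Hadoop source code.
final class SplitSketch {
    // Assumed to mirror the 10% slop FileInputFormat allows on the last split.
    static final double SPLIT_SLOP = 1.1;

    /** Returns how many input splits (and so map tasks) one file would yield. */
    static int countSplits(long fileBytes, long splitBytes, boolean splittable) {
        if (fileBytes <= 0) {
            return 0; // simplification: this model ignores empty files
        }
        if (!splittable) {
            return 1; // non-splittable codec: whole file goes to one mapper
        }
        int splits = 0;
        long remaining = fileBytes;
        // Carve off full-size splits while clearly more than one split remains.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        if (remaining > 0) {
            splits++; // the tail becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 75 MB file with an assumed 64 MB split size.
        System.out.println(countSplits(75 * mb, 64 * mb, false)); // prints 1
        System.out.println(countSplits(75 * mb, 64 * mb, true));  // prints 2
    }
}
```

So under these assumptions a 75 MB file goes to a single mapper when the codec 
cannot be split, and to two mappers when it can.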


There has been some discussion around this for bzip2 and gzip, and fixes have 
been made to allow bzip2 files to be split. See HADOOP-4012.

Also, Kevin came up with LZO compression support and LzoTextInputFormat, which 
overcomes this limitation and is faster. See http://github.com/kevinweil/hadoop-lzo

Cheers,
/R

On 4/15/10 6:56 AM, "abhishek sharma" <[email protected]> wrote:

Hi all,

I created some data using the randomwriter utility and compressed the
map task outputs using the options
-D mapred.output.compress=true
-D mapred.map.output.compression.type=BLOCK

I set the bytes per map to be 128 MB, but due to compression the final
size of each map task's output is around 75 MB.

I want to use these individual 75MB compressed files as input to
another Map task.
How do I get Hadoop to first decompress the files before computing the
input splits for the map tasks?

Thanks,
Abhishek
