By default, compressed input files cannot be split, so each file is essentially read as a single split by a single mapper.
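To illustrate why, here is a minimal sketch in plain Python (standard library only, no Hadoop involved). It assumes gzip as the codec; the helper name `can_decode_from` and the sample records are made up for the illustration. A gzip stream can be decoded from its first byte, but not from an arbitrary offset in the middle, which is where a second split would have to start.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip reader can start decoding at `offset`.

    Hypothetical helper for this illustration; not part of any Hadoop API.
    """
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this offset, or the stream is corrupt/truncated
        # from the reader's point of view.
        return False


# Build some line-oriented sample data and compress it as one gzip stream.
records = b"".join(b"record %06d\n" % i for i in range(50000))
blob = gzip.compress(records)

# A reader starting at byte 0 sees the gzip header and can decode everything.
print("from start: ", can_decode_from(blob, 0))

# A reader starting mid-file (where a second split would begin) cannot.
print("from middle:", can_decode_from(blob, len(blob) // 2))
```

A splittable format needs some marker or index that lets a reader resynchronize at a split boundary; a plain gzip stream has nothing like that for the reader to look for.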
There has been some discussion around this for bzip2 and gzip, and some fixes have been made to allow bzip2 to be splittable. Refer to HADOOP-4012.

Also, Kevin came up with LZO compression support and LzoTextInputFormat, which overcomes this disadvantage and is faster. Refer to http://github.com/kevinweil/hadoop-lzo

Cheers,
/R

On 4/15/10 6:56 AM, "abhishek sharma" <[email protected]> wrote:

Hi all,

I created some data using the randomwriter utility and compressed the map task outputs using the options

-D mapred.output.compress=true
-D mapred.map.output.compression.type=BLOCK

I set the bytes per map to 128 MB, but due to compression the final size of each map task's output is around 75 MB.

I want to use these individual 75 MB compressed files as input to another map task. How do I get Hadoop to first decompress the files before computing the input splits for the map tasks?

Thanks,
Abhishek
