Hello,

On Mon, Jan 31, 2011 at 10:41 PM, Sean Bigdatafun <sean.bigdata...@gmail.com> wrote:
>
> On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes <ni...@basjes.nl> wrote:
>>
>> Hi,
>>
>> 2011/1/31 Sean Bigdatafun <sean.bigdata...@gmail.com>:
>> > GZIP is not splittable.
>>
>> Correct, gzip is a stream compression system which effectively means
>> you can only start at the beginning of the data with decompressing.
>>
>> > Does that mean a GZIP block compressed sequencefile can't take advantage
>> > of MR parallelism?
>>
>> AFAIK it should be splittable in the same blocks as the compression was
>> done.
>
> Splittable within the same block?
> Normally, each mapper would pick a HDFS block (64MB in an HDFS with default
> configuration) of a 1GB file for map processing, were the file not GZIP
> compressed; this is the scenario for an uncompressed file.
> But as GZIP is not splittable, if/how can a mapper pick a block? (If it
> can't, then we can't utilize the MapReduce framework for the parallelism.)
> Can you explain in more detail?
>

The basic fact is that GZip is not a splittable compression algorithm.
SequenceFiles, however, can be written with a set 'block size' for their
records, and can also be Block-Compressed with a chosen algorithm.

A SequenceFile draws its own 'block' boundaries, and GZip is applied to each
of those blocks separately rather than to the file as one stream. A reader
can therefore start at any of those boundaries, which gives you a splittable
file even though GZip itself is not splittable.

-- 
Harsh J
www.harshj.com
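
To make the idea of per-block compression concrete, here is a minimal Python sketch of the principle. It is not SequenceFile's actual on-disk format and not Hadoop's API: the names write_blocks, read_split and the SYNC marker value are made up for illustration, and the layout (marker, length, gzip'd block) is an assumption chosen to keep the example short. It only shows why compressing each block independently lets several readers work on byte ranges of one file.

```python
import gzip
import struct

# Hypothetical 16-byte marker written before every block (illustration only).
SYNC = b"\x00SYNC-MARKER-16\xff"


def write_blocks(records, block_size):
    """Pack records into blocks and gzip each block independently."""
    out = bytearray()
    buf = []
    buffered = 0

    def flush():
        nonlocal buf, buffered
        if not buf:
            return
        # Length-prefix each record so the block can be taken apart again.
        raw = b"".join(struct.pack(">I", len(r)) + r for r in buf)
        comp = gzip.compress(raw)
        out.extend(SYNC)
        out.extend(struct.pack(">I", len(comp)))
        out.extend(comp)
        buf, buffered = [], 0

    for r in records:
        buf.append(r)
        buffered += len(r)
        if buffered >= block_size:
            flush()
    flush()  # write the final, possibly smaller, block
    return bytes(out)


def read_split(data, start, end):
    """Return the records of every block whose marker begins in [start, end)."""
    records = []
    # A split may begin mid-block; skip forward to the next marker.
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (clen,) = struct.unpack(">I", data[header:header + 4])
        body = header + 4
        # Each block is a complete gzip stream, so it decompresses on its own.
        raw = gzip.decompress(data[body:body + clen])
        i = 0
        while i < len(raw):
            (rlen,) = struct.unpack(">I", raw[i:i + 4])
            records.append(raw[i + 4:i + 4 + rlen])
            i += 4 + rlen
        pos = data.find(SYNC, body + clen)
    return records


if __name__ == "__main__":
    recs = [("record-%04d" % i).encode() * 20 for i in range(500)]
    blob = write_blocks(recs, block_size=4096)
    half = len(blob) // 2
    # Two "mappers", each handed one half of the byte range.
    first = read_split(blob, 0, half)
    second = read_split(blob, half, len(blob))
    print(len(first), len(second), first + second == recs)
```

Each reader is handed an arbitrary byte range, skips to the first marker at or after its start, and reads whole blocks from there, so every block is processed by exactly one reader. A block that begins inside a range is read to its end even if it runs past the range boundary.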