One way to reap the benefits of both compression and better parallelism is to 
use compressed SequenceFiles, which stay splittable even when compressed.

Of course, this means you will have to convert from gzip to a .seq file and 
load it onto HDFS for your job, which should be a fairly simple piece of code.

We really need someone to contribute an InputFormat for bzip2 files. This has come up before: bzip2 is a standard compression format that is splittable.

Another InputFormat that would be handy is zip. Zip archives, unlike tar files, can be split by reading the table of contents. So one could package a bunch of tiny files as a zip file, and the input format could then divide the zip file into splits that each contain a number of the files inside it. Each map task would still have to read the table of contents from the file, but could then seek directly to the files in its split without scanning the entire archive.
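
To make the zip idea a bit more concrete, here is a rough sketch using only java.util.zip, not Hadoop's InputFormat API. ZipSplitPlanner, planSplits and readEntry are hypothetical names made up for illustration, and the sketch assumes the archive is a local file; a real InputFormat would have to do the equivalent against an HDFS stream and hand each group of names to a map task. It only shows the two steps described above: group the entry names from the table of contents into splits, then read a given entry directly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Hadoop.
final class ZipSplitPlanner {

    // Reads the archive's table of contents and groups the file names
    // into splits of at most filesPerSplit entries each.
    static List<List<String>> planSplits(Path zip, int filesPerSplit) throws IOException {
        if (filesPerSplit <= 0) {
            throw new IllegalArgumentException("filesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            List<String> current = new ArrayList<>();
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no data to process
                }
                current.add(entry.getName());
                if (current.size() == filesPerSplit) {
                    splits.add(current);
                    current = new ArrayList<>();
                }
            }
            if (!current.isEmpty()) {
                splits.add(current); // last, possibly smaller, split
            }
        }
        return splits;
    }

    // Reads one named entry from an already opened archive. ZipFile looks
    // the entry up by name, so earlier entries are not decompressed.
    static byte[] readEntry(ZipFile zf, String name) throws IOException {
        ZipEntry entry = zf.getEntry(name);
        if (entry == null) {
            throw new IOException("no such entry: " + name);
        }
        try (InputStream in = zf.getInputStream(entry)) {
            return in.readAllBytes();
        }
    }
}
```

A map task would then open the archive once and call readEntry for each name in its own split. Splitting by a fixed number of files is just the simplest choice here; grouping by compressed size would probably balance the tasks better.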

Should we file JIRA issues for these? Are there any volunteers interested in implementing them?


