Arun C Murthy wrote:
One way to reap the benefits of both compression and better parallelism is to use
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
Of course this means you will have to convert your .gz files to .seq files
and load them onto HDFS for your job, which should be a fairly simple piece of code.

We really need someone to contribute an InputFormat for bzip2 files.
This has come up before: bzip2 is a standard compression format that is
splittable, since it compresses data in independent blocks.
Another InputFormat that would be handy is zip. Zip archives, unlike
tar files, can be split by reading the table of contents. So one could
package a bunch of tiny files as a zip file, and the input format could
then divide the archive into splits that each cover a number of the files
inside it. Each map task would still have to read the table of
contents from the file, but could then seek directly to the files in its
split without scanning the entire file.
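
To make the zip idea concrete, here is a rough sketch in Python rather than as a real Hadoop InputFormat. It only illustrates the split planning described above, using the standard zipfile module. The names plan_zip_splits, read_split and files_per_split are made up for this example, and grouping by a fixed number of files per split is an assumption; a real input format might group by compressed size instead.

```python
import zipfile


def plan_zip_splits(zip_source, files_per_split):
    """Group member names into splits using only the table of contents.

    Hypothetical helper: zip_source is a path or a seekable file object.
    """
    if files_per_split < 1:
        raise ValueError("files_per_split must be at least 1")
    with zipfile.ZipFile(zip_source) as archive:
        # infolist() is built from the central directory (the table of
        # contents), so no member data is decompressed at this point.
        members = [info for info in archive.infolist() if not info.is_dir()]
    # Order members by the offset of their local header so that each split
    # covers a contiguous region of the archive.
    members.sort(key=lambda info: info.header_offset)
    names = [info.filename for info in members]
    return [names[i:i + files_per_split]
            for i in range(0, len(names), files_per_split)]


def read_split(zip_source, split):
    """What one map task would do: reopen the archive, read its own members.

    Opening the archive reads the table of contents again; each read() then
    seeks straight to the requested member instead of scanning the file.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return {name: archive.read(name) for name in split}
```

With an archive of five small files and files_per_split=2, plan_zip_splits yields three splits, and each call to read_split touches only the members named in its own split.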

Should we file JIRA issues for these? Are there any volunteers interested
in implementing them?

Doug