On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote:
>Arun C Murthy wrote:
>>One way to reap benefits of both compression and better parallelism is to
>>use compressed SequenceFiles:
>>http://wiki.apache.org/lucene-hadoop/SequenceFile
>>
>>Of course this means you will have to do a conversion from .gzip to .seq
>>file and load it onto hdfs for your job, which should be fairly simple
>>piece of code.
>
>We really need someone to contribute an InputFormat for bzip files.
>This has come up before: bzip is a standard compression format that is
>splittable.
>
>Another InputFormat that would be handy is zip. Zip archives, unlike
>tar files, can be split by reading the table of contents. So one could
>package a bunch of tiny files as a zip file, then the input format could
>split the zip file into splits that each contain a number of files
>inside the zip. Each map task would then have to read the table of
>contents from the file, but could then seek directly to the files in its
>split without scanning the entire file.
>
>Should we file jira issues for these? Any volunteers who're interested
>in implementing these?
>
Please file the bzip and zip issues, Doug. I'll try to get to them in the
short term unless someone is more interested and wants to scratch that
itch right away.

thanks,
Arun

>Doug
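
For illustration only, here is a rough sketch of the zip idea quoted above: read the archive's table of contents, group the entries into splits, and let each task read only its own entries. It is plain Python using the standard zipfile module, not Hadoop code. The names plan_zip_splits, read_split and max_split_bytes are made up for this sketch, and grouping by compressed size is just one possible policy.

```python
import io
import zipfile


def plan_zip_splits(zip_source, max_split_bytes):
    """Group archive members into splits using only the table of contents.

    zip_source is a path or a seekable file-like object. No member data is
    read here; sizes come from the central directory entries.
    """
    splits, current, current_size = [], [], 0
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Start a new split once adding this member would exceed the budget.
            if current and current_size + info.compress_size > max_split_bytes:
                splits.append(current)
                current, current_size = [], 0
            current.append(info.filename)
            current_size += info.compress_size
    if current:
        splits.append(current)
    return splits


def read_split(zip_source, names):
    """Yield (name, data) for the members of one split.

    The task re-reads the table of contents, then reads each named member
    directly instead of scanning the whole archive.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in names:
            yield name, archive.read(name)
```

Each list returned by plan_zip_splits would correspond to one map task's input, and read_split is what that task would run. A single member larger than max_split_bytes still gets a split of its own, since a zip entry cannot be divided this way.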
