Re: Compression using Hadoop...

Arun C Murthy Fri, 31 Aug 2007 22:46:35 -0700

On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote:
>Arun C Murthy wrote:
>>One way to reap benefits of both compression and better parallelism is to 
>>use compressed SequenceFiles: 
>>http://wiki.apache.org/lucene-hadoop/SequenceFile
>>
>>Of course this means you will have to convert from .gzip to a .seq 
>>file and load it onto HDFS for your job, which should be a fairly simple 
>>piece of code.
>
>We really need someone to contribute an InputFormat for bzip files. 
>This has come up before: bzip is a standard compression format that is 
>splittable.
>
>Another InputFormat that would be handy is zip.  Zip archives, unlike 
>tar files, can be split by reading the table of contents.  So one could 
>package a bunch of tiny files as a zip file, then the input format could 
>split the zip file into splits that each contain a number of files 
>inside the zip.  Each map task would then have to read the table of 
>contents from the file, but could then seek directly to the files in its 
>split without scanning the entire file.
>
>Should we file jira issues for these?  Any volunteers who're interested 
>in implementing these?
>


Please file the bzip and zip issues, Doug. I'll try to get to them in the 
short term unless someone is more interested and wants to scratch that itch 
right away.
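
For reference, here is a minimal sketch of the zip idea quoted above: plan the 
splits from the table of contents alone, then have each task read only its own 
members. It is plain Python, not a Hadoop InputFormat, and it assumes the 
archive is available as an in-memory byte string. The names plan_splits, 
read_split, build_demo_archive and files_per_split are hypothetical and exist 
only for illustration.

```python
import io
import zipfile


def plan_splits(zip_bytes, files_per_split):
    """Group member names into splits using only the table of contents."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # infolist() is built from the central directory at the end of the
        # archive, so no member data is decompressed while planning.
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
    return [names[i:i + files_per_split]
            for i in range(0, len(names), files_per_split)]


def read_split(zip_bytes, split):
    """What a single map task would do: read only the members in its split."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Each read seeks to the offset recorded in the table of contents
        # instead of scanning the archive from the beginning.
        return {name: zf.read(name) for name in split}


def build_demo_archive(count):
    """Build a small archive of tiny files to exercise the sketch."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(count):
            zf.writestr("part-%03d.txt" % i, "record %d\n" % i)
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_demo_archive(5)
    for number, split in enumerate(plan_splits(archive, 2)):
        print(number, sorted(read_split(archive, split)))
```

Splitting by file count keeps the sketch short; a real input format would 
more likely balance splits by the compressed sizes recorded in the table of 
contents.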

thanks,
Arun

>Doug
