On Fri, Aug 31, 2007 at 10:22:18AM -0700, Ted Dunning wrote:
>
> They will only be a non-issue if you have enough of them to get the
> parallelism you want. If you have number of gzip files > 10*number of task
> nodes you should be fine.
>
One way to reap the benefits of both compression and better parallelism is to use compressed SequenceFiles:
http://wiki.apache.org/lucene-hadoop/SequenceFile

Of course this means you will have to do a conversion from .gzip to a .seq file and load it onto HDFS for your job, which should be a fairly simple piece of code (a rough sketch follows the quoted messages below).

Arun

> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of jason gessner
> Sent: Fri 8/31/2007 9:38 AM
> To: [email protected]
> Subject: Re: Compression using Hadoop...
>
> ted, will the gzip files be a non-issue as far as splitting goes if
> they are under the default block size?
>
> C G, glad i could help a little.
>
> -jason
>
> On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
>> Thanks Ted and Jason for your comments. Ted, your comments about gzip not
>> being splittable were very timely...I'm watching my 8 node cluster saturate
>> one node (with one gz file) and was wondering why. Thanks for the "answer
>> in advance" :-).
>>
>> Ted Dunning <[EMAIL PROTECTED]> wrote:
>> With gzipped files, you do face the problem that your parallelism in the map
>> phase is pretty much limited to the number of files you have (because
>> gzip'ed files aren't splittable). This is often not a problem, since most
>> people can arrange to have dozens to hundreds of input files more easily than
>> they can arrange to have dozens to hundreds of CPU cores working on their
>> data.
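A minimal sketch of the .gz-to-.seq conversion Arun describes, using the org.apache.hadoop.io.SequenceFile writer with block compression. The class name GzipToSequenceFile, the command-line arguments, and the choice of LongWritable line offsets as keys with Text lines as values are illustrative assumptions, not anything specified on the list:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class GzipToSequenceFile {
      public static void main(String[] args) throws Exception {
        String localGz = args[0];          // local .gz file to convert (illustrative)
        Path seqPath = new Path(args[1]);  // output .seq path on the default filesystem

        Configuration conf = new Configuration();
        // Writes to whatever fs.default.name points at (HDFS on a cluster).
        FileSystem fs = FileSystem.get(conf);

        // BLOCK compression compresses batches of records together, so the
        // resulting SequenceFile can still be split across map tasks.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, seqPath, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK);

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(localGz))));
        try {
          String line;
          long offset = 0;
          while ((line = in.readLine()) != null) {
            // Key = byte offset of the line, value = the line itself.
            writer.append(new LongWritable(offset), new Text(line));
            offset += line.length() + 1;
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }

The job can then take the .seq output as input via SequenceFileInputFormat and get one split per compressed block rather than one split per gzip file.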
