On Fri, Aug 31, 2007 at 10:22:18AM -0700, Ted Dunning wrote:
>
>They will only be a non-issue if you have enough of them to get the 
>parallelism you want.  If the number of gzip files is > 10 * the number of 
>task nodes, you should be fine.
>

One way to reap the benefits of both compression and better parallelism is to use 
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile

Of course this means you will have to convert each .gz file to a .seq file 
and load it onto HDFS for your job, which should be a fairly simple piece of code.
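
A minimal sketch of such a conversion, assuming the Hadoop 0.x Java API and a 
line-oriented text file; class and path names here are illustrative, not from 
the thread:

  // Hypothetical sketch: read a local .gz text file and write it to HDFS as a
  // block-compressed SequenceFile, one record per line.
  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.zip.GZIPInputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class GzipToSequenceFile {
    public static void main(String[] args) throws IOException {
      String localGz = args[0];          // local input, e.g. /data/input.gz
      Path seqPath = new Path(args[1]);  // HDFS output, e.g. /user/me/input.seq

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // BLOCK compression keeps the file splittable for the map phase,
      // unlike a single gzip stream.
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, seqPath, LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK);

      BufferedReader in = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(localGz))));
      try {
        String line;
        long lineNo = 0;
        while ((line = in.readLine()) != null) {
          // Key = line number, value = line text.
          writer.append(new LongWritable(lineNo++), new Text(line));
        }
      } finally {
        in.close();
        writer.close();
      }
    }
  }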

Arun

>
>-----Original Message-----
>From: [EMAIL PROTECTED] on behalf of jason gessner
>Sent: Fri 8/31/2007 9:38 AM
>To: [email protected]
>Subject: Re: Compression using Hadoop...
> 
>ted, will the gzip files be a non-issue as far as splitting goes if
>they are under the default block size?
>
>C G, glad i could help a little.
>
>-jason
>
>On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
>> Thanks Ted and Jason for your comments.  Ted, your comment about gzip not 
>> being splittable was very timely...I'm watching my 8 node cluster saturate 
>> one node (with one gz file) and was wondering why.  Thanks for the "answer 
>> in advance" :-).
>>
>> Ted Dunning <[EMAIL PROTECTED]> wrote:
>> With gzipped files, you do face the problem that your parallelism in the map
>> phase is pretty much limited to the number of files you have (because
>> gzip'ed files aren't splittable). This is often not a problem since most
>> people can arrange to have dozens to hundreds of input files more easily than
>> they can arrange to have dozens to hundreds of CPU cores working on their
>> data.
