With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzipped files aren't splittable).  This is often not a problem, since most
people can arrange to have dozens to hundreds of input files more easily
than they can arrange to have dozens to hundreds of CPU cores working on
their data.
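
For concreteness, here is a minimal sketch of a job driver that reads
gzipped text input, written against the classic org.apache.hadoop.mapred
API (exact method names shifted a bit across the 0.x releases, so treat it
as illustrative rather than version-exact; the paths, job name, and the
identity map/reduce are just placeholders).  TextInputFormat picks a
decompression codec from the .gz suffix, so the same driver works unchanged
on plain or gzipped text.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class GzipInputDemo {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(GzipInputDemo.class);
    conf.setJobName("gzip-input-demo");

    // TextInputFormat finds a codec from the .gz suffix and
    // decompresses each file on the fly -- no extra configuration
    // is needed to read compressed input.
    conf.setInputFormat(TextInputFormat.class);

    // Because a gzip stream isn't splittable, every .gz file under
    // the input directory becomes exactly one split, i.e. one map task.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Identity map/reduce just passes the (offset, line) records
    // through, which is enough to show the input is readable.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}

If you run something like this, the number of map tasks reported for the
job should match the number of .gz input files, which is exactly the
parallelism limit described above.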


On 8/30/07 8:46 AM, "jason gessner" <[EMAIL PROTECTED]> wrote:

> If you put .gz files up on your HDFS cluster you don't need to do
> anything to read them.  I see lots of extra control via the API, but I
> have simply put the files up and run my jobs on them.
> 
> -jason
> 
> On 8/30/07, C G <[EMAIL PROTECTED]> wrote:
>> Hello All:
>> 
>>   I think I must be missing something fundamental.  Is it possible to load
>> compressed data into HDFS, and then operate on it directly with map/reduce?
>> I see a lot of stuff in the docs about writing compressed outputs, but
>> nothing about reading compressed inputs.
>> 
>>   Am I being ponderously stupid here?
>> 
>>   Any help/comments appreciated...
>> 
>>   Thanks,
>>   C G
>> 
>> 
