Thanks Ted and Jason for your comments.  Ted, your comment about gzip not 
being splittable was very timely...I'm watching my 8-node cluster saturate one 
node (with one gz file) and was wondering why.  Thanks for the "answer in 
advance" :-).

Ted Dunning <[EMAIL PROTECTED]> wrote:  
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzipped files aren't splittable). This is often not a problem, since most
people can arrange to have dozens to hundreds of input files more easily than
they can arrange to have dozens to hundreds of CPU cores working on their
data. 
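
The reason gzip files can't be split is that a DEFLATE stream can only be decompressed from its beginning; a mapper handed a byte range starting mid-file has no gzip header and no way to synchronize with the stream. A quick illustrative sketch in plain Python (not Hadoop code) shows this:

```python
import gzip
import zlib

# Build a gzip blob in memory, standing in for a large .gz file on HDFS.
lines = b"".join(b"record %d\n" % i for i in range(10_000))
compressed = gzip.compress(lines)

# Reading from byte 0 works: header and DEFLATE stream are intact,
# so one reader (one mapper) can recover everything.
assert gzip.decompress(compressed) == lines

# Reading from an arbitrary offset fails: a hypothetical second mapper
# given the back half of the file cannot decode it, so splitting the
# file across mappers buys nothing.
try:
    gzip.decompress(compressed[len(compressed) // 2 :])
    can_split = True
except (gzip.BadGzipFile, zlib.error, EOFError):
    can_split = False
assert not can_split
```

This is why the number of .gz input files, not the number of blocks, caps map-phase parallelism.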


On 8/30/07 8:46 AM, "jason gessner" wrote:

> If you put .gz files up on your HDFS cluster, you don't need to do
> anything to read them. I see lots of extra control via the API, but I
> have simply put the files up and run my jobs on them.
> 
> -jason
> 
> On 8/30/07, C G wrote:
>> Hello All:
>> 
>> I think I must be missing something fundamental. Is it possible to load
>> compressed data into HDFS, and then operate on it directly with map/reduce?
>> I see a lot of stuff in the docs about writing compressed outputs, but
>> nothing about reading compressed inputs.
>> 
>> Am I being ponderously stupid here?
>> 
>> Any help/comments appreciated...
>> 
>> Thanks,
>> C G
>> 
>> 


