Ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad I could help a little.

-jason

On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
> Thanks Ted and Jason for your comments.  Ted, your comments about gzip not 
> being splittable were very timely...I'm watching my 8-node cluster saturate 
> one node (with one gz file) and was wondering why.  Thanks for the "answer in 
> advance" :-).
>
> Ted Dunning <[EMAIL PROTECTED]> wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzipped files aren't splittable). This is often not a problem, since most
> people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.
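
(One rough workaround, if you're stuck with a single huge file: re-chunk it
into many smaller .gz parts before loading it into HDFS, so each part becomes
its own map task.  A stand-alone sketch, with a made-up file name and a
made-up lines-per-part argument, nothing Hadoop-specific in it:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Hypothetical re-chunker: reads one big .gz file and rewrites it as many
    // smaller .gz parts, a fixed number of lines each, so the map phase later
    // gets one task per part instead of one task total.
    public class GzipRechunk {

      public static void main(String[] args) throws IOException {
        String input = args[0];                        // e.g. big-log.gz
        int linesPerPart = Integer.parseInt(args[1]);  // e.g. 1000000

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(input)), "UTF-8"));

        int part = 0;
        int linesInPart = 0;
        Writer out = openPart(input, part);
        for (String line; (line = in.readLine()) != null; ) {
          if (linesInPart == linesPerPart) {   // roll over on a line boundary
            out.close();
            out = openPart(input, ++part);
            linesInPart = 0;
          }
          out.write(line);
          out.write('\n');
          linesInPart++;
        }
        out.close();
        in.close();
      }

      private static Writer openPart(String input, int part) throws IOException {
        String name = String.format("%s.part-%04d.gz", input, part);
        return new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(name)), "UTF-8");
      }
    }

Upload the resulting part files instead of the original and the maps should
spread across the cluster.)
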
>
>
> On 8/30/07 8:46 AM, "jason gessner" wrote:
>
> > If you put .gz files up on your HDFS cluster you don't need to do
> > anything to read them. I see lots of extra control via the API, but I
> > have simply put the files up and run my jobs on them.
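
(Concretely, "don't need to do anything" looks something like the sketch
below: an ordinary line-count job pointed straight at .gz input paths.  Class
names, the job name, and the paths here are made up, and it is written against
the classic org.apache.hadoop.mapred API as it appears in later releases, so
the exact FileInputFormat/FileOutputFormat calls may differ a little from the
0.14-era methods.  The point is that nothing gzip-specific is configured:
TextInputFormat decompresses each file based on its .gz suffix.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class GzipLineCount {

      // Emits ("lines", 1) for every decompressed input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text LINES = new Text("lines");
        private static final LongWritable ONE = new LongWritable(1);
        public void map(LongWritable key, Text value,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
          output.collect(LINES, ONE);
        }
      }

      // Sums the per-map counts into a single total.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
            OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
          long sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(GzipLineCount.class);
        conf.setJobName("gzip-line-count");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Point the job straight at the gzipped files, e.g. /data/logs/*.gz
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

Each .gz file still arrives as a single, unsplit map task, which is exactly
the limit Ted describes above.)
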
> >
> > -jason
> >
> > On 8/30/07, C G wrote:
> >> Hello All:
> >>
> >> I think I must be missing something fundamental. Is it possible to load
> >> compressed data into HDFS, and then operate on it directly with map/reduce?
> >> I see a lot of stuff in the docs about writing compressed outputs, but
> >> nothing about reading compressed inputs.
> >>
> >> Am I being ponderously stupid here?
> >>
> >> Any help/comments appreciated...
> >>
> >> Thanks,
> >> C G
> >>
> >>
>
