Ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size?
C G, glad i could help a little.

-jason

On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
> Thanks Ted and Jason for your comments. Ted, your comments about gzip not
> being splittable were very timely...I'm watching my 8-node cluster saturate
> one node (with one .gz file) and was wondering why. Thanks for the "answer
> in advance" :-).
>
> Ted Dunning <[EMAIL PROTECTED]> wrote:
> With gzipped files, you do face the problem that your parallelism in the
> map phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem, since most
> people can arrange to have dozens to hundreds of input files more easily
> than they can arrange to have dozens to hundreds of CPU cores working on
> their data.
>
>
> On 8/30/07 8:46 AM, "jason gessner" wrote:
>
> > if you put .gz files up on your HDFS cluster you don't need to do
> > anything to read them. I see lots of extra control via the API, but i
> > have simply put the files up and run my jobs on them.
> >
> > -jason
> >
> > On 8/30/07, C G wrote:
> >> Hello All:
> >>
> >> I think I must be missing something fundamental. Is it possible to load
> >> compressed data into HDFS, and then operate on it directly with
> >> map/reduce? I see a lot of stuff in the docs about writing compressed
> >> outputs, but nothing about reading compressed inputs.
> >>
> >> Am I being ponderously stupid here?
> >>
> >> Any help/comments appreciated...
> >>
> >> Thanks,
> >> C G
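Ted's point about gzip files not being splittable can be demonstrated without Hadoop at all. The sketch below is plain Python (not Hadoop's `InputFormat` API): a gzip member has a single header at byte 0 and a DEFLATE stream that cannot be resynchronized at an arbitrary byte offset, so a hypothetical second mapper handed the middle of a `.gz` file has nothing it can decompress. That is why the framework assigns each gzip file to one mapper, and why parallelism is bounded by the file count.

```python
import gzip
import zlib

# Build a gzip blob large enough that a "split" in the middle is meaningful.
data = b"one line of map/reduce input\n" * 10_000
blob = gzip.compress(data)

# Reading from the start works: the gzip header sits at offset 0.
assert gzip.decompress(blob) == data

# Reading from an arbitrary midpoint fails: there is no gzip header there,
# and the DEFLATE stream cannot be re-entered at a random byte offset.
try:
    gzip.decompress(blob[len(blob) // 2:])
    splittable = True
except (OSError, zlib.error, EOFError):  # OSError covers gzip.BadGzipFile
    splittable = False

print(splittable)  # False: a second worker cannot start mid-file
```

This also explains Jason's observation: HDFS stores the `.gz` bytes as-is and the job decompresses each file from the front, so reading works with no extra configuration, but each file is consumed by exactly one map task.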