With gzipped files, you do face the problem that your parallelism in the map phase is pretty much limited to the number of files you have (because gzipped files aren't splittable). This is often not a problem, since most people can arrange to have dozens to hundreds of input files more easily than they can arrange to have dozens to hundreds of CPU cores working on their data.
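
In case a concrete example helps, here is a minimal sketch (not code from this thread; the class name, paths, and job name are made up) of a pass-through job that reads gzipped text straight off HDFS. It uses the classic org.apache.hadoop.mapred API, so exact method locations may differ slightly between Hadoop releases. The point is that reading .gz inputs needs no extra configuration, while compressed output is the part you opt into:

    // Hypothetical driver: a pass-through job that reads gzipped text files directly.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class GzipPassThrough {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GzipPassThrough.class);
        conf.setJobName("gzip-pass-through");

        // Reading: nothing special. TextInputFormat picks the codec from the
        // file extension (.gz -> GzipCodec) and decompresses the stream, but
        // each .gz file becomes exactly one map task because gzip streams
        // can't be split.
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. a dir of .gz files
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Writing compressed output, by contrast, is opt-in:
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        JobClient.runJob(conf);
      }
    }

Run it the usual way (bin/hadoop jar <your jar> GzipPassThrough <in> <out>) and you get one map task per .gz input file, which is exactly the parallelism limit described above.
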
On 8/30/07 8:46 AM, "jason gessner" <[EMAIL PROTECTED]> wrote:

> if you put .gz files up on your HDFS cluster you don't need to do
> anything to read them.  I see lots of extra control via the API, but i
> have simply put the files up and run my jobs on them.
>
> -jason
>
> On 8/30/07, C G <[EMAIL PROTECTED]> wrote:
>> Hello All:
>>
>> I think I must be missing something fundamental.  Is it possible to load
>> compressed data into HDFS, and then operate on it directly with map/reduce?
>> I see a lot of stuff in the docs about writing compressed outputs, but
>> nothing about reading compressed inputs.
>>
>> Am I being ponderously stupid here?
>>
>> Any help/comments appreciated...
>>
>> Thanks,
>> C G
