On Wed, Sep 3, 2008 at 9:24 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Their answer was that if you care enough, then any compression algorithm
> around will compress away the type information.

I understand the argument, but there are certainly cases where having the
type information once in the header is a big win. If I have a dataset with,
say, 100 billion rows and 300 columns in each row, having 1k of type
information on each row is pretty much a non-starter: at 1 KB per row times
100 billion rows, that is roughly 100 TB of repeated metadata. I wish that
were a hypothetical case. *smile*

> So if you have a splittable compressed format (bz2 works with hadoop), you
> are set except for the compression cost. Decompression cost is usually
> compensated for by the I/O advantage.

bz2 is *really* expensive and will almost always substantially slow down
your job. The default codec (gz) is usually a win for compressing outputs,
but is still fairly expensive and is *not* splittable. LZO is great for
speed and is almost always a win for overall job time, even on map outputs.
It is also not splittable. It would be really nice to have a codec with
compression/CPU cost similar to gzip's that was also splittable.

-- Owen
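
To make the first point concrete, here is a minimal sketch of the
header-once layout being argued for. The format and every name in it
(HeaderTypedWriter, rows.dat, the example column types) are hypothetical,
not any Hadoop API; the only point is that the type list is written once
per file rather than once per row.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class HeaderTypedWriter {
      public static void main(String[] args) throws IOException {
        DataOutputStream out =
            new DataOutputStream(new FileOutputStream("rows.dat"));

        // Header: the column types appear exactly once per file.
        String[] columnTypes = {"long", "string", "double"}; // 300 in practice
        out.writeInt(columnTypes.length);
        for (String type : columnTypes) {
          out.writeUTF(type);
        }

        // Body: rows carry raw values only, with no per-row type tags,
        // so the ~1k of type metadata is not repeated 100 billion times.
        out.writeLong(42L);
        out.writeUTF("example");
        out.writeDouble(2.5);
        out.close();
      }
    }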

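For reference, a minimal sketch of how those codec choices get wired up
with the JobConf API of this era. I am assuming the 0.18-vintage class
names (LzoCodec still shipped in org.apache.hadoop.io.compress at the
time), so treat the imports as illustrative rather than authoritative.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.LzoCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressionSetup {
      public static void configure(JobConf conf) {
        // Final outputs: gzip is a good size win, but the resulting
        // files are not splittable, so a downstream job gets exactly
        // one map per file.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Intermediate map outputs: LZO is cheap enough on CPU that it
        // is almost always a net win for overall job time.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(LzoCodec.class);
      }
    }

Splittability does not matter for the map-output setting, since each
reducer fetches whole segments anyway; it only bites when the compressed
file is the input to a later job.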