On Wed, Sep 3, 2008 at 9:24 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Their answer was that if you care enough, then any compression algorithm
> around will compress away the type information.

I understand the argument, but there are certainly cases where having the
type information once in the header is a big win. If I have a dataset with,
say, 100 billion rows and 300 columns in each row, having 1k of type
information on each row is pretty much a non-starter: at 1 KB per row times
100 billion rows, that is roughly 100 TB of repeated metadata. I wish that
were a hypothetical case. *smile*

> So if you have a splittable compressed format (bz2 works with hadoop), you
> are set except for the compression cost. Decompression cost is usually
> compensated for by the I/O advantage.

bz2 is *really* expensive and will almost always substantially slow down
your job. The default codec (gz) is usually a win for compressing outputs,
but is still fairly expensive and is *not* splittable. LZO is great for
speed and is almost always a win for overall job time, even on map outputs.
It is also not splittable. It would be really nice to have a codec with
compression/CPU cost similar to gzip's that was also splittable.

-- Owen
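
To make the first point concrete, here is a minimal sketch of the
header-once layout being argued for. The format and every name in it
(HeaderTypedWriter, rows.dat, the example column types) are hypothetical,
not any Hadoop API; the only point is that the type list is written once
per file rather than once per row.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class HeaderTypedWriter {
      public static void main(String[] args) throws IOException {
        DataOutputStream out =
            new DataOutputStream(new FileOutputStream("rows.dat"));

        // Header: the column types appear exactly once per file.
        String[] columnTypes = {"long", "string", "double"}; // 300 in practice
        out.writeInt(columnTypes.length);
        for (String type : columnTypes) {
          out.writeUTF(type);
        }

        // Body: rows carry raw values only, with no per-row type tags,
        // so the ~1k of type metadata is not repeated 100 billion times.
        out.writeLong(42L);
        out.writeUTF("example");
        out.writeDouble(2.5);
        out.close();
      }
    }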

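For reference, a minimal sketch of how those codec choices get wired up
with the JobConf API of this era. I am assuming the 0.18-vintage class
names (LzoCodec still shipped in org.apache.hadoop.io.compress at the
time), so treat the imports as illustrative rather than authoritative.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.LzoCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressionSetup {
      public static void configure(JobConf conf) {
        // Final outputs: gzip is a good size win, but the resulting
        // files are not splittable, so a downstream job gets exactly
        // one map per file.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Intermediate map outputs: LZO is cheap enough on CPU that it
        // is almost always a net win for overall job time.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(LzoCodec.class);
      }
    }

Splittability does not matter for the map-output setting, since each
reducer fetches whole segments anyway; it only bites when the compressed
file is the input to a later job.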