On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> OK. I think it's a nice use case (and nice gains on the compression
> ratio), demonstrating the datatype-aware compression. The question is
> why shouldn't this be built into the datatypes directly?

Tomas, thanks for running benchmarks of this.  I was surprised to see
how little improvement there was from other modern compression
methods, although lz4 did appear to be a modest win on both size and
speed.  But I share your intuition that a lot of the interesting work
is in datatype-specific compression algorithms.  I have noticed in a
number of papers that I've read that teaching other parts of the
system to operate directly on the compressed data, especially for
column stores, is a critical performance optimization; of course, that
only makes sense if the compression is datatype-specific.  I don't
know exactly what that means for the design of this patch, though.

As a general point, no matter which way you go, you have to somehow
deal with on-disk compatibility.  If you want to build in compression
to the datatype itself, you need to find at least one bit someplace to
mark the fact that you applied built-in compression.  If you want to
build it in as a separate facility, you need to denote the compression
used someplace else.  I haven't looked at how this patch does it, but
the proposal in the past has been to add a value to vartag_external.
One nice thing about the latter method is that it can be used for any
data type generically, regardless of how much bit-space is available
in the data type representation itself.  It's realistically hard to
think of a data-type that has no bit space available anywhere but is
still subject to data-type specific compression; bytea definitionally
has no bit space but also can't benefit from special-purpose
compression, whereas even something like text could be handled by
starting the varlena with a NUL byte to indicate compressed data
following.  However, you'd have to come up with a different trick for
each data type.  Piggybacking on the TOAST machinery avoids that.  It
also implies that we only try to compress values that are "big", which
is probably desirable if we're talking about a kind of compression
that makes comprehending the value slower. Not all types of
compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
and for those that don't it probably makes more sense to just build it
into the data type.
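
To make the NUL-byte trick concrete, here's a rough sketch -- purely
illustrative, not anything this patch or core does today -- of how a
text-specific scheme could flag a value as compressed.  The
VARDATA_ANY/VARSIZE_ANY_EXHDR macros are real; the helper itself is
made up:

#include "postgres.h"

/*
 * Hypothetical helper: valid text never contains a NUL byte, so a
 * leading '\0' in the payload could mark a value as holding
 * custom-compressed data, with the real payload following it.  The
 * argument is assumed to be already detoasted (e.g. obtained via
 * PG_GETARG_TEXT_PP), though it may still be packed.
 */
static bool
text_payload_is_compressed(text *t)
{
    /* the _ANY_ macros cope with both 1-byte and 4-byte headers */
    return VARSIZE_ANY_EXHDR(t) > 0 && VARDATA_ANY(t)[0] == '\0';
}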

All of that is a somewhat separate question from whether we should
have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
the ACCESS METHOD stuff instead).  Even if we agree that piggybacking
on TOAST is a good way to implement pluggable compression methods, it
doesn't follow that the compression method is something that should be
attached to the datatype from the outside; it could be built into it
in a deep way.  For example, "packed" varlenas (1-byte header) are a
form of compression, and the default functions for detoasting always
produce unpacked values, but the operators for the text data type
know how to operate on the packed representation.  That's sort of a
trivial example, but it might well be that there are other cases where
we can do something similar.  Maybe jsonb, for example, can compress
data in such a way that some of the jsonb functions can operate
directly on the compressed representation -- perhaps the number of
keys is still easily visible, or maybe more than that.  In this view of
the world, each data type should get to define its own compression
method (or methods) but they are hard-wired into the datatype and you
can't add more later, or if you do, you lose the advantages of the
hard-wired stuff.
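
To illustrate the "packed" point with a trivial example, a function
written against the PG_GETARG_TEXT_PP / VARSIZE_ANY_EXHDR macros can
read a short value without ever expanding it to the 4-byte-header
form.  This is a simplified demo, not the real textoctetlen():

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(demo_text_octet_length);

/*
 * Simplified demo: the _PP and _ANY_ macros read the length straight
 * off a packed (1-byte header) datum instead of unpacking it first.
 */
Datum
demo_text_octet_length(PG_FUNCTION_ARGS)
{
    text   *t = PG_GETARG_TEXT_PP(0);

    PG_RETURN_INT32(VARSIZE_ANY_EXHDR(t));
}

A datatype whose whole stack of functions and operators is written
that way gets the benefit without any external configuration, which
is the sense in which the compression is built in rather than bolted
on from the outside.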

BTW, another related concept that comes up a lot in discussions of
this area is that we could do a lot better compression of columns if
we had some place to store a per-column dictionary.  I don't really
know how to make that work.  We could have a catalog someplace that
stores an opaque blob for each column configured to use a compression
method, and let the compression method store whatever it likes in
there.  That's probably fine if you are compressing the whole table at
once and the blob is static thereafter.  But if you want to update
that blob as you see new column values there seem to be almost
insurmountable problems.
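
Just to make the shape of the idea visible, a per-column dictionary
entry might look roughly like this -- entirely hypothetical, no such
catalog exists today, and all the names are invented:

#include "postgres.h"

/*
 * Hypothetical layout of a per-column dictionary entry: one opaque
 * blob per (relation, column) pair, interpreted only by whatever
 * compression method is configured for that column.
 */
typedef struct ColumnDictionaryEntry
{
    Oid         relid;      /* relation the column belongs to */
    int16       attnum;     /* attribute number within the relation */
    bytea      *dict;       /* opaque blob owned by the compression
                             * method */
} ColumnDictionaryEntry;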

To be clear, I'm not trying to load this patch down with a requirement
to solve every problem in the universe.  On the other hand, I think it
would be easy to beat a patch like this into shape in a fairly
mechanical way and then commit-and-forget.  That might be leaving a
lot of money on the table; I'm glad you are thinking about the bigger
picture and hope that my thoughts here somehow contribute.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
