On Tue, Dec 12, 2017 at 5:07 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>> I definitely think there's a place for compression built right into
>> the data type.  I'm still happy about commit
>> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
>> needs to be done there.  But that type of improvement and what is
>> proposed here are basically orthogonal.  Having either one is good;
>> having both is better.
>>
> Why orthogonal?

I mean, they are different things.  Data types are already free to
invent more compact representations, and that does not preclude
applying pglz to the result.
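
Just to make the distinction concrete, here's a minimal sketch of the
first kind of improvement, using a made-up "packed integer list" type:
the datatype itself picks a more compact encoding, and since the result
is still an ordinary varlena, TOAST remains free to pglz-compress it on
top.  All the names here are hypothetical.

    #include "postgres.h"

    typedef struct PackedIntList
    {
        int32   vl_len_;    /* varlena header, set via SET_VARSIZE */
        int32   base;       /* first value, stored in full */
        int16   deltas[FLEXIBLE_ARRAY_MEMBER];  /* value[i] - value[i-1] */
    } PackedIntList;

    static PackedIntList *
    pack_int_list(const int32 *vals, int nvals)
    {
        int     ndeltas = (nvals > 0) ? nvals - 1 : 0;
        Size    size = offsetof(PackedIntList, deltas) +
                       ndeltas * sizeof(int16);
        PackedIntList *result = (PackedIntList *) palloc0(size);
        int     i;

        SET_VARSIZE(result, size);
        result->base = (nvals > 0) ? vals[0] : 0;
        /* assumes adjacent values differ by less than 32768 */
        for (i = 1; i < nvals; i++)
            result->deltas[i - 1] = (int16) (vals[i] - vals[i - 1]);
        return result;
    }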

> For example, why couldn't (or shouldn't) the tsvector compression be
> done by tsvector code itself? Why should we be doing that at the varlena
> level (so that the tsvector code does not even know about it)?

We could do that, but then:

1. The compression algorithm would be hard-coded into the system
rather than changeable.  Pluggability has some value.

2. If several data types can benefit from a similar approach, it has
to be separately implemented for each one.

3. Compression is only applied to large-ish values.  If you are just
making the data type representation more compact, you probably want to
apply the new representation to all values.  If you are compressing in
the sense that the original data gets smaller but harder to interpret,
then you probably only want to apply the technique where the value is
already pretty wide, and maybe respect the user's configured storage
attributes.  TOAST already knows about some of that kind of stuff (see
the sketch just after this list).
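
To make #3 concrete, here's roughly the gating each datatype would have
to reimplement on its own if compression moved into the type.  The
compress function is made up; the threshold macro is the real one from
access/tuptoaster.h.

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/tuptoaster.h"   /* TOAST_TUPLE_THRESHOLD */

    /* mytype_compress_payload() is hypothetical. */
    extern struct varlena *mytype_compress_payload(struct varlena *val);

    Datum
    mytype_maybe_compress(Datum value)
    {
        struct varlena *val = (struct varlena *) PG_DETOAST_DATUM(value);

        /*
         * Below a size threshold, an encoded form is just harder to
         * interpret for no real gain, so keep the plain representation.
         * TOAST applies this test centrally, and also honors the
         * column's storage attribute; in-type compression has to
         * repeat all of that.
         */
        if (VARSIZE(val) < TOAST_TUPLE_THRESHOLD)
            return PointerGetDatum(val);

        return PointerGetDatum(mytype_compress_payload(val));
    }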

> It seems to me the main reason is that tsvector actually does not allow
> us to do that, as there's no good way to distinguish the different
> internal formats (e.g. by storing a flag or format version in some sort
> of header, etc.).

That is also a potential problem, although I suspect it is possible to
work around it somehow for most data types.  It might be annoying,
though.
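
For instance, a type could reserve a couple of header bits as a format
tag and key the old or new layout off that.  Everything below is
hypothetical -- today's tsvector has no such field, which is exactly
the problem Tomas is pointing at:

    /* Hypothetical versioned header for a tsvector-like type. */
    typedef struct TSVectorDataV2
    {
        int32   vl_len_;        /* varlena header */
        uint32  size_and_fmt;   /* low 30 bits: entry count;
                                 * high 2 bits: format tag */
    } TSVectorDataV2;

    #define TSV_FORMAT_MASK     0xC0000000
    #define TSV_FORMAT_ORIG     0x00000000  /* existing layout */
    #define TSV_FORMAT_PACKED   0x40000000  /* new compact layout */

    #define TSV_IS_PACKED(ptr) \
        ((((TSVectorDataV2 *) (ptr))->size_and_fmt & TSV_FORMAT_MASK) == \
         TSV_FORMAT_PACKED)

The annoying part is that every value already on disk has to land in
one of the recognizable states, which constrains which bits you can
actually steal.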

>> I think there may also be a place for declaring that a particular data
>> type has a "privileged" type of TOAST compression; if you use that
>> kind of compression for that data type, the data type will do smart
>> things, and if not, it will have to decompress in more cases.  But I
>> think this infrastructure makes that kind of thing easier, not harder.
>
> I don't quite understand how that would be done. Isn't TOAST meant to be
> entirely transparent for the datatypes? I can imagine custom TOAST
> compression (which is pretty much what the patch does, after all), but I
> don't see how the datatype could do anything smart about it, because it
> has no idea which particular compression was used. And considering the
> OIDs of the compression methods do change, I'm not sure that's fixable.

I don't think TOAST needs to be entirely transparent for the
datatypes.  We've already dipped our toe in the water by allowing some
operations on "short" varlenas, and there's really nothing to prevent
a given datatype from going further.  The OID problem you mentioned
would presumably be solved by hard-coding the OIDs for any built-in,
privileged compression methods.
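
For what it's worth, the macros for poking at a datum's TOAST state are
already there in postgres.h; only the "privileged compression" behavior
layered on top would be new.  A sketch (out-of-line datums omitted for
brevity):

    #include "postgres.h"

    /* Payload size of an input without forcing a full detoast. */
    static int32
    datum_raw_size(struct varlena *datum)
    {
        if (VARATT_IS_SHORT(datum))
            return VARSIZE_SHORT(datum) - VARHDRSZ_SHORT;  /* 1-byte header */
        if (VARATT_IS_COMPRESSED(datum))
            return VARRAWSIZE_4B_C(datum);  /* pglz stores the raw size
                                             * up front */
        return VARSIZE(datum) - VARHDRSZ;
    }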

> Well, it wasn't my goal to suddenly widen the scope of the patch and
> require it to add all these pieces. My intent was more to point out
> pieces that will need to be filled in later.

Sure, that's fine.  I'm not worked up about this, just explaining why
it seems reasonably well-designed to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company