Hi!

Let me add my two cents too.

On Tue, Dec 12, 2017 at 2:41 PM, Ildus Kurbangaliev <
i.kurbangal...@postgrespro.ru> wrote:

> On Mon, 11 Dec 2017 20:53:29 +0100 Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> > Replacing the algorithm used to compress all varlena values (in a way
> > that makes it transparent for the data type code).
> > ----------------------------------------------------------------------
> >
> > While pglz served us well over time, it was repeatedly mentioned that
> > in some cases it becomes the bottleneck. So supporting other state of
> > the art compression algorithms seems like a good idea, and this patch
> > is one way to do that.
> >
> > But perhaps we should simply make it an initdb option (in which case
> > the whole cluster would simply use e.g. lz4 instead of pglz)?
> >
> > That seems like a much simpler approach - it would only require some
> > ./configure options to add --with-lz4 (and other compression
> > libraries), an initdb option to pick compression algorithm, and
> > probably noting the choice in cluster controldata.
>
> Replacing pglz for all varlena values wasn't the goal of the patch, but
> it's possible to do with it, and I think that's good. And as Robert
> mentioned, pglz could appear as a builtin undroppable compression method,
> so the others could be added too. And in the future it can open the way
> to specifying compression for a specific database or cluster.
>

Yes, using custom compression methods to replace the generic,
non-type-specific compression method is really not the primary goal of this
patch.  However, I would consider that a useful side effect.  And even in
this case I see some advantages of custom compression methods over an
initdb option.

1) In order to support alternative compression methods in initdb, we have
to provide builtin support for them.  Then we immediately run into the
dependencies/incompatible-licenses problem.  We would also tie the
appearance of new compression methods to our release cycle.  In real life,
flexibility means a lot.  Giving users the freedom to experiment with
various compression libraries without having to recompile the PostgreSQL
core is great.
2) It's not a given that users would be satisfied with applying a single
compression method to the whole database cluster.  Various columns may have
different data distributions and different workloads.  The optimal
compression type for one column is not necessarily optimal for another.
3) The possibility to change the compression method on the fly, without
re-initdb, is very valuable too.

> > Custom datatype-aware compression (e.g. the tsvector)
> > ----------------------------------------------------------------------
> >
> > Exploiting knowledge of the internal data type structure is a
> > promising way to improve compression ratio and/or performance.
> >
> > The obvious question of course is why shouldn't this be done by the
> > data type code directly, which would also allow additional benefits
> > like operating directly on the compressed values.
> >
> > Another thing is that if the datatype representation changes in some
> > way, the compression method has to change too. So it's tightly coupled
> > to the datatype anyway.
> >
> > This does not really require any new infrastructure, all the pieces
> > are already there.
> >
> > In some cases that may not be quite possible - the datatype may not be
> > flexible enough to support alternative (compressed) representation,
> > e.g. because there are no bits available for "compressed" flag, etc.
> >
> > Conclusion: IMHO if we want to exploit the knowledge of the data type
> > internal structure, perhaps doing that in the datatype code directly
> > would be a better choice.
>
> It could be, but let's imagine there will be internal compression for
> tsvector. It means that tsvector has two formats now and minus one bit
> somewhere in the header. After a while we found a better compression
> but we can't add it because there is already one and it's not good to
> have three different formats for one type. Or, the compression methods
> were implemented and we decided to use dictionaries for tsvector (if
> the user is going to store a limited number of words). But it will mean
> that tsvector will go through two compression stages (one for its internal
> compression and one for dictionaries).


I would like to add that even for a single datatype, various compression
methods may have different tradeoffs.  For instance, one compression method
can have a better compression ratio, while another one has faster
decompression.  And it's fine for a user to choose different compression
methods for different columns.
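
To make that concrete, below is a minimal standalone sketch (plain Python,
standard library only; the log-like sample data and the zlib-vs-lzma pairing
are invented for illustration and are not part of the patch) that measures
compression ratio and decompression time on the same column-like sample.
Run per column on real data, this is the kind of measurement that would
guide a per-column choice of compression method.

# Illustrative only: compare two generic codecs on the same column-like
# sample, measuring compression ratio and decompression speed.
import lzma
import time
import zlib

# Hypothetical column sample: repetitive, log-like text values.
sample = b"\n".join(
    b"2017-12-12 12:00:%02d UTC user=%d action=login" % (i % 60, i)
    for i in range(5000)
)

for name, compress, decompress in [
    ("zlib", lambda d: zlib.compress(d, 6), zlib.decompress),
    ("lzma", lambda d: lzma.compress(d, preset=6), lzma.decompress),
]:
    packed = compress(sample)
    start = time.perf_counter()
    for _ in range(20):
        decompress(packed)
    elapsed = time.perf_counter() - start
    print("%-4s  ratio %.2f  decompress x20: %.3fs"
          % (name, len(sample) / len(packed), elapsed))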

Making extensions depend on a datatype's internal representation doesn't
seem evil to me.  We already have a bunch of extensions depending on much
deeper guts of PostgreSQL.  On a major release of PostgreSQL, extensions
must adapt to the changes; that is the rule.  And note that datatype
internal representations change relatively rarely in comparison with other
internals, because they are tied to the on-disk format and the ability to
pg_upgrade.

> > Custom datatype-aware compression with additional column-specific
> > metadata (e.g. the jsonb with external dictionary).
> > ----------------------------------------------------------------------
> >
> > Exploiting redundancy in multiple values in the same column (instead
> > of compressing them independently) is another attractive way to help
> > the compression. It is inherently datatype-aware, but currently can't
> > be implemented directly in datatype code as there's no concept of
> > column-specific storage (e.g. to store dictionary shared by all values
> > in a particular column).
> >
> > I believe any patch addressing this use case would have to introduce
> > such column-specific storage, and any solution doing that would
> > probably need to introduce the same catalogs, etc.
> >
> > The obvious disadvantage of course is that we need to decompress the
> > varlena value before doing pretty much anything with it, because the
> > datatype is not aware of the compression.
> >
> > So I wonder if the patch should instead provide infrastructure for
> > doing that in the datatype code directly.
> >
> > The other question is if the patch should introduce some
> > infrastructure for handling the column context (e.g. column
> > dictionary). Right now, whoever implements the compression has to
> > implement this bit too.
>
> Column specific storage sounds optional to me. For example compressing
> timestamp[] using some delta compression will not require it.


It could also be useful to have a custom compression method with a fixed
(not dynamically built) dictionary.  See [1] for an example of what other
databases do.  We could specify a fixed dictionary directly in the
compression method options; I see no problem with that.  That way we could
compress not only jsonb or other special data types, but also
natural-language texts.  Using fixed dictionaries for natural language, we
can effectively compress short texts, where lz and other generic compression
methods don't have enough information to effectively build a per-value
dictionary.
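
As a rough standalone illustration of that point (plain Python using zlib's
preset-dictionary support; the dictionary and the sample value below are
invented for the example, not something the patch provides), compressing a
short text on its own gains almost nothing, while a preset dictionary lets
long matches be encoded as back-references:

# Illustrative only: a fixed ("preset") dictionary lets the compressor
# reference common substrings that a single short value could never teach it.
import zlib

# Hypothetical fixed dictionary of phrases common in the column's domain.
dictionary = (
    b"customer order has been shipped to the address on file "
    b"please contact support if you have any questions"
)

value = b"your order has been shipped to the address on file, thank you"

def deflate(data, zdict=None):
    if zdict:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

def inflate(data, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(data) + d.flush()

plain = deflate(value)
with_dict = deflate(value, dictionary)
assert inflate(with_dict, dictionary) == value
print("original %d bytes, plain deflate %d bytes, with preset dictionary %d"
      % (len(value), len(plain), len(with_dict)))

In a custom compression method, the equivalent dictionary would simply live
in the compression method options, as suggested above.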

For sure, further work to improve the infrastructure is required, including
per-column storage for dictionaries and tighter integration between
compression methods and datatypes.  However, we typically deal with such
complex tasks in a step-by-step manner.  And I'm not convinced that custom
compression methods are a bad first step in this direction.  To me they look
clean and already very useful in this shape.

1. https://www.percona.com/doc/percona-server/LATEST/flexibility/compressed_columns.html

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
