On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> We have some kind of "roadmap" for "extensible pglz". We plan to
>> provide an implementation in the November CF.
>>
>>> I don't understand why it's a good idea to improve the compression
>>> side of pglz. There are plenty of other people who have spent a lot
>>> of time developing better compression algorithms.
>>
>> Isn't it beneficial for existing systems, which will be stuck with
>> pglz even if we end up adding other algorithms?
>>
>> Currently, pglz starts with an empty cache map: there are no prior
>> 4kB of history before the start of the data. We can prepend an
>> imaginary prefix of common substrings to any data: this will improve
>> the compression ratio. It is hard to decide on a training data set
>> for this "common prefix", so we want to produce an extension with an
>> aggregate function that derives an "adapted common prefix" from the
>> user's data. Then we can "reserve" a few negative bytes for
>> "decompression commands". Such a command can instruct the database
>> which common prefix to use, but a command can also say "invoke
>> decompression from an extension".
>>
>> Thus, the user will be able to train database compression on their
>> data and seamlessly substitute pglz compression with a custom
>> compression method.
>>
>> This would make a hard choice of compression algorithm unnecessary,
>> but it seems overly hacky. Still, there would be no need to have lz4,
>> zstd, brotli, lzma and others in core. Why not provide e.g. "time
>> series compression"? Or "DNA compression"? Whatever gun the user
>> wants for their foot.

> I think this is way too complicated, and will not provide particularly
> much benefit for the majority of users.

I agree with this. I do see value in the feature, but probably not as a
drop-in replacement for the default compression algorithm. I'd compare
it to the "custom compression methods" patch that was submitted some
time ago.

> In fact, I'll argue that we should flat out reject any such patch until
> we have at least one decent default compression algorithm in core.
> You're trying to work around a poor compression algorithm with a
> complicated dictionary improvement that requires user interaction, will
> only work in a relatively small subset of cases, and will very often
> increase compression times.

I wouldn't be so strict, I guess. But I do agree that an algorithm which
requires additional steps (training, ...) is unlikely to be a good
candidate for the default instance compression algorithm.

regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services