On Tue, Dec 10, 2013 at 1:18 AM, Thomas Munro <mu...@ip9.org> wrote:
> Hi
>
> I have been wondering what the minimum useful heap table compression
> system would be for Postgres, in order to reduce the disk footprint of
> large, mostly static datasets. Do you think an approach similar to the
> static row-level compression of DB2 could make sense? I imagine
> something like this:
>
> 1. You have a table which already has data in it.
>
> 2. You run a COMPRESS operation, which builds a static dictionary
> and rewrites the whole table with compressed frozen tuples. Frozen
> tuples have CTIDs just like regular tuples and can be pointed to by
> indexes. They are decompressed on the fly when needed.
>
> Clearly things get tricky once you need to update rows. Assume for
> simplicity that future UPDATEs and INSERTs produce normal,
> non-compressed tuples that would only be compressed by a subsequent
> COMPRESS operation. The question is how to deal with the existing
> compressed rows when they are UPDATEd or DELETEd. Some approaches:
>
> 1. Just don't allow updates of compressed rows (!).
>
> 2. Exclusively lock the whole page when trying to update any
> compressed row, while you explode it into regular uncompressed tuples
> on new pages which you can work on (!).
>
> 3. Pull the minimum header fields out of the compressed tuples so
> that the MVCC machinery can work, to support updates of compressed
> tuples. Perhaps just the t_xmax and t_ctid values (t_xmin == frozen is
> implied), so that a writer can update them. This means an overhead of
> at least 10 bytes per tuple over the compressed size (plus the item
> offsets in the page header).
>
> 4. Something far cleverer.
>
> Well, these are straw-man suggestions really, and I probably don't
> understand enough about PG internals (MVCC and its implications for
> VACUUM) to be making them. But I'm curious to know if anyone has
> researched something like this.
>
> For example, I have a system that occupies a couple of TB on disk, but
> only a few to a few hundred MB per day change, mostly adding data to
> an active partition. I periodically run CLUSTER on any partition that
> has pg_stats.correlation < 0.9 (this effectively just re-CLUSTERs the
> active one). If I ran COMPRESS at the same time, the DB could
> potentially fit on much smaller SSDs.
>
> Most commercial database systems I encounter these days use
> compression of some sort (more sophisticated than the above, with
> dynamic dictionaries, and sometimes column-based storage etc.).
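Side note first: the correlation-driven re-CLUSTER step you describe can be scripted directly against the stats views, roughly like this (a sketch only; the partition, column, and index names here are hypothetical):

  -- find partitions whose physical ordering on the cluster key has decayed
  SELECT schemaname, tablename
    FROM pg_stats
   WHERE tablename LIKE 'events_%'      -- hypothetical partition set
     AND attname = 'created_at'         -- the clustering column
     AND correlation < 0.9;

  -- then, for each partition returned:
  CLUSTER events_2013_12 USING events_2013_12_created_at_idx;

On the compression question: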
postgres compresses TOASTed data: one strategy could be to arrange your data to take advantage of TOAST (per-column behavior can be tuned with ALTER TABLE ... ALTER COLUMN ... SET STORAGE). I doubt you'll ever see generally compressed heap data in the way you're thinking: postgres has a strong informal policy of not implementing features which are dubious and/or excessively complicated for limited benefit, particularly if there are ways to handle them outside the database. There are various operating-system-level tricks that can make a compressed file, or even an entire tablespace (an o/s folder), masquerade as a regular structure. So maybe you are asking for a feature we already have: CREATE TABLESPACE. For example, take a look here:

https://btrfs.wiki.kernel.org/index.php/Compression#How_do_I_enable_compression.3F

(Out of curiosity: if this strategy fits the bill for you, I wouldn't mind seeing a follow-up on how it handles your static data use case.)

merlin
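P.S. For concreteness, here is roughly what the tablespace route could look like, assuming a btrfs volume mounted with compression as described in the link above. This is a sketch, not something I've benchmarked; the mount point and all object names are hypothetical.

  -- assumes the filesystem was mounted with transparent compression,
  -- e.g. (as root): mount -o compress=lzo /dev/sdb1 /mnt/btrfs_compressed
  -- and that the postgres user owns /mnt/btrfs_compressed/pg_ts
  CREATE TABLESPACE compressed_ts LOCATION '/mnt/btrfs_compressed/pg_ts';

  -- move a large static partition (and its indexes) onto it;
  -- note that SET TABLESPACE rewrites the table, much like CLUSTER does
  ALTER TABLE events_2012 SET TABLESPACE compressed_ts;
  ALTER INDEX events_2012_created_at_idx SET TABLESPACE compressed_ts;

The nice property is that compression stays entirely the filesystem's problem: postgres just sees ordinary 8kb pages, and nothing about MVCC, VACUUM, or the tuple headers has to change.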