Hi hackers,

I've been experimenting with the pluggable storage API recently and would like to share my first experience. First of all, it's great to have this API and that the community now has the opportunity to implement alternative storage engines. A few applications come to mind, and compressed storage is one of them.
Recently I've been working on a simple append-only compressed storage [1]. My first idea was to store data in compressed 1mb blocks in a continuous file and keep a separate file for block offsets (similar to Knizhnik's CFS proposal). But then I realized that this way I wouldn't be able to use most of postgres' infrastructure, such as WAL-logging, and also wouldn't be able to implement some of the TableAmRoutine callbacks (like bitmap scan or analyze). So I adjusted the extension to use standard postgres 8kb blocks: compressed 1mb blocks are split into chunks and distributed among 8kb blocks. The current page layout looks like this:

┌───────────┐
│ metapage  │
└───────────┘
┌───────────┐ ┐
│  block 1  │ │
├────...────┤ │ compressed 1mb block
│  block k  │ │
└───────────┘ ┘
┌───────────┐ ┐
│ block k+1 │ │
├────...────┤ │ another compressed 1mb block
│  block m  │ │
└───────────┘ ┘

Inside the compressed blocks there are regular postgres heap tuples.

The following is a list of things I stumbled upon while implementing the storage. Since the API has only just come out, there are not many examples of pluggable storages, and even fewer external extensions (I managed to find only blackhole_am by Michael Paquier, which doesn't do much), so many things I had to figure out by myself. Hopefully some of these issues have a solution that I just can't see.

1. Unlike the FDW API, the pluggable storage API has no routines like "begin modify table" and "end modify table", and there is no shared state between insert/update/delete calls. For a compressed storage that means there is no definite moment at which writes can be finalized (compressed, split into chunks, etc.). We can set a callback at the end of the transaction, but in that case we have to keep the latest modifications for every table in memory until the transaction ends (a rough sketch of this approach is at the end of this mail). As for shared state, we could also maintain some kind of key-value data structure with per-relation state, but that again requires memory. Because of this I have currently implemented only COPY semantics.

2. It looks like I cannot implement custom storage options. E.g. for a compressed storage it makes sense to support different compression methods (lz4, zstd, etc.) and the corresponding options (like compression level). But as far as I can see, storage options (like fillfactor, etc.) are hardcoded and not extensible. A possible workaround is to use GUCs (second sketch at the end of this mail), which would work but is not very convenient.

3. A somewhat surprising limitation: in order to use bitmap scans, the maximum number of tuples per page must not exceed 291, due to the MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on the 8kb page size. With 1mb logical pages this restriction feels really limiting.

4. In order to use WAL-logging, each page must start with a standard 24-byte PageHeaderData, even if the storage itself has no use for it. Not a big deal though. Another (actually documented) WAL-related limitation is that only generic WAL can be used within an extension (third sketch at the end of this mail). So unless inserts are made in bulk, it is going to require a lot of disk space to accommodate the logs and wide bandwidth for replication.

The pg_cryogen extension is still in development, so if other issues arise I'll post them here. At this point the extension already supports inserts via COPY, index and bitmap scans, vacuum (freezing only), and analyze. It uses lz4 compression, and currently I'm working on adding different compression methods. I'm also willing to work on the aforementioned API issues if the community confirms them as valid.

[1] https://github.com/adjust/pg_cryogen

Thanks,
Ildar
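
P.S. Regarding item 1, below is a rough sketch (not the actual pg_cryogen code) of what I mean by keeping modifications in memory: buffer per-relation state until the end of the transaction and finalize it from a callback registered with RegisterXactCallback(). CryoModifyState, modify_states and cryo_flush_state() are made-up names for illustration.

/*
 * Sketch only: keep pending modifications per relation in transaction
 * memory and finalize them at pre-commit, while WAL can still be written.
 */
#include "postgres.h"
#include "fmgr.h"
#include "access/xact.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

typedef struct CryoModifyState
{
    Oid         relid;              /* relation being modified */
    List       *pending_tuples;     /* tuples not yet compressed and written */
} CryoModifyState;

/* allocated in TopTransactionContext, so it vanishes with the transaction */
static List *modify_states = NIL;

static void
cryo_flush_state(CryoModifyState *state)
{
    /*
     * Placeholder: the real code would compress state->pending_tuples,
     * split the result into 8kb chunks and write them out WAL-logged.
     */
}

static void
cryo_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_PRE_COMMIT)
    {
        ListCell   *lc;

        /* finalize writes while we are still allowed to write WAL */
        foreach(lc, modify_states)
            cryo_flush_state((CryoModifyState *) lfirst(lc));
    }

    if (event == XACT_EVENT_PRE_COMMIT || event == XACT_EVENT_ABORT)
        modify_states = NIL;
}

void
_PG_init(void)
{
    RegisterXactCallback(cryo_xact_callback, NULL);
}

The obvious downside is exactly the one mentioned above: everything pending has to sit in memory until commit.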
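
Second sketch, for item 2: this is roughly how the GUC workaround would look (called from _PG_init()). The pg_cryogen.compression_method and pg_cryogen.compression_level parameter names are hypothetical. Compared to real storage options, the drawback is that a GUC is per-session rather than per-table.

#include "postgres.h"
#include "utils/guc.h"

typedef enum CryoCompressionMethod
{
    CRYO_COMPRESSION_LZ4,
    CRYO_COMPRESSION_ZSTD
} CryoCompressionMethod;

static const struct config_enum_entry compression_method_options[] = {
    {"lz4", CRYO_COMPRESSION_LZ4, false},
    {"zstd", CRYO_COMPRESSION_ZSTD, false},
    {NULL, 0, false}
};

static int  compression_method = CRYO_COMPRESSION_LZ4;
static int  compression_level = 1;

/* would be called from _PG_init() */
void
cryo_define_gucs(void)
{
    DefineCustomEnumVariable("pg_cryogen.compression_method",
                             "Compression method used for newly written blocks.",
                             NULL,
                             &compression_method,
                             CRYO_COMPRESSION_LZ4,
                             compression_method_options,
                             PGC_USERSET,
                             0,
                             NULL, NULL, NULL);

    DefineCustomIntVariable("pg_cryogen.compression_level",
                            "Compression level used for newly written blocks.",
                            NULL,
                            &compression_level,
                            1, 1, 19,
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}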
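
Third sketch, for item 4: roughly how one chunk of a compressed 1mb block gets written to an 8kb page under generic WAL. Again just an illustration, not the actual code; cryo_write_chunk() is a made-up name and blkno is assumed to refer to an already existing block.

#include "postgres.h"
#include "access/generic_xlog.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

void
cryo_write_chunk(Relation rel, BlockNumber blkno, const char *chunk, Size len)
{
    Buffer              buf;
    Page                page;
    GenericXLogState   *state;

    Assert(len <= BLCKSZ - MAXALIGN(SizeOfPageHeaderData));

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* modifications must be applied to the page image returned here */
    state = GenericXLogStart(rel);
    page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);

    /*
     * The standard 24-byte page header is required even though the
     * storage itself has no use for it.
     */
    PageInit(page, BLCKSZ, 0);
    memcpy(PageGetContents(page), chunk, len);

    GenericXLogFinish(state);       /* emits the (full-image) generic WAL record */
    UnlockReleaseBuffer(buf);
}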