Hi hackers,

I've been experimenting with the pluggable storage API recently and would like to share my first experience. First of all, it's great to have this API and that the community now has the opportunity to implement alternative storage engines. A few applications come to mind, and compressed storage is one of them.
Recently I've been working on a simple append-only compressed storage [1]. My first idea was to store data in compressed 1mb blocks in a continuous file and keep a separate file for block offsets (similar to Knizhnik's CFS proposal). But then I realized that this way I wouldn't be able to use most of postgres' infrastructure, such as WAL-logging, and also wouldn't be able to implement some of the TableAmRoutine callbacks (like bitmap scan or analyze). So I adjusted the extension to use standard postgres 8kb blocks: compressed 1mb blocks are split into chunks and distributed among 8kb blocks. The current page layout looks like this:

┌───────────┐
│ metapage  │
└───────────┘
┌───────────┐ ┐
│  block 1  │ │
├────...────┤ │ compressed 1mb block
│  block k  │ │
└───────────┘ ┘
┌───────────┐ ┐
│ block k+1 │ │
├────...────┤ │ another compressed 1mb block
│  block m  │ │
└───────────┘ ┘

Inside the compressed blocks there are regular postgres heap tuples.

The following is a list of things I stumbled upon while implementing the storage. Since the API has only just come out, there are not many examples of pluggable storages, and even fewer external extensions (I managed to find only blackhole_am by Michael Paquier, which doesn't do much), so many things I had to figure out by myself. Hopefully some of these issues have a solution that I just can't see.

1. Unlike the FDW API, the pluggable storage API has no routines like "begin modify table" and "end modify table", and there is no shared state between insert/update/delete calls. For a compressed storage that means there is no definite moment at which writes can be finalized (compressed, split into chunks, etc.). We can set a callback at the end of the transaction, but in that case we have to keep the latest modifications for every table in memory until the transaction ends (a rough sketch of this approach is at the end of this mail). As for shared state, we could also maintain some kind of key-value data structure with per-relation state, but that again requires memory. Because of this I have currently implemented only COPY semantics.

2. It looks like I cannot implement custom storage options. E.g. for a compressed storage it makes sense to support different compression methods (lz4, zstd, etc.) and the corresponding options (like compression level). But as far as I can see, storage options (like fillfactor, etc.) are hardcoded and not extensible. A possible workaround is to use GUCs (second sketch at the end of this mail), which would work but is not very convenient.

3. A somewhat surprising limitation: in order to use bitmap scans, the maximum number of tuples per page must not exceed 291, due to the MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on the 8kb page size. With 1mb logical pages this restriction feels really limiting.

4. In order to use WAL-logging, each page must start with a standard 24-byte PageHeaderData, even if the storage itself has no use for it. Not a big deal though. Another (actually documented) WAL-related limitation is that only generic WAL can be used within an extension (third sketch at the end of this mail). So unless inserts are made in bulk, it is going to require a lot of disk space to accommodate the logs and wide bandwidth for replication.

The pg_cryogen extension is still in development, so if other issues arise I'll post them here. At this point the extension already supports inserts via COPY, index and bitmap scans, vacuum (freezing only), and analyze. It uses lz4 compression, and currently I'm working on adding different compression methods. I'm also willing to work on the aforementioned API issues if the community confirms them as valid.

[1] https://github.com/adjust/pg_cryogen

Thanks,
Ildar
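
P.S. Regarding item 1, below is a rough sketch (not the actual pg_cryogen code) of what I mean by keeping modifications in memory: buffer per-relation state until the end of the transaction and finalize it from a callback registered with RegisterXactCallback(). CryoModifyState, modify_states and cryo_flush_state() are made-up names for illustration.

/*
 * Sketch only: keep pending modifications per relation in transaction
 * memory and finalize them at pre-commit, while WAL can still be written.
 */
#include "postgres.h"
#include "fmgr.h"
#include "access/xact.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

typedef struct CryoModifyState
{
    Oid         relid;              /* relation being modified */
    List       *pending_tuples;     /* tuples not yet compressed and written */
} CryoModifyState;

/* allocated in TopTransactionContext, so it vanishes with the transaction */
static List *modify_states = NIL;

static void
cryo_flush_state(CryoModifyState *state)
{
    /*
     * Placeholder: the real code would compress state->pending_tuples,
     * split the result into 8kb chunks and write them out WAL-logged.
     */
}

static void
cryo_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_PRE_COMMIT)
    {
        ListCell   *lc;

        /* finalize writes while we are still allowed to write WAL */
        foreach(lc, modify_states)
            cryo_flush_state((CryoModifyState *) lfirst(lc));
    }

    if (event == XACT_EVENT_PRE_COMMIT || event == XACT_EVENT_ABORT)
        modify_states = NIL;
}

void
_PG_init(void)
{
    RegisterXactCallback(cryo_xact_callback, NULL);
}

The obvious downside is exactly the one mentioned above: everything pending has to sit in memory until commit.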
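
Second sketch, for item 2: this is roughly how the GUC workaround would look (called from _PG_init()). The pg_cryogen.compression_method and pg_cryogen.compression_level parameter names are hypothetical. Compared to real storage options, the drawback is that a GUC is per-session rather than per-table.

#include "postgres.h"
#include "utils/guc.h"

typedef enum CryoCompressionMethod
{
    CRYO_COMPRESSION_LZ4,
    CRYO_COMPRESSION_ZSTD
} CryoCompressionMethod;

static const struct config_enum_entry compression_method_options[] = {
    {"lz4", CRYO_COMPRESSION_LZ4, false},
    {"zstd", CRYO_COMPRESSION_ZSTD, false},
    {NULL, 0, false}
};

static int  compression_method = CRYO_COMPRESSION_LZ4;
static int  compression_level = 1;

/* would be called from _PG_init() */
void
cryo_define_gucs(void)
{
    DefineCustomEnumVariable("pg_cryogen.compression_method",
                             "Compression method used for newly written blocks.",
                             NULL,
                             &compression_method,
                             CRYO_COMPRESSION_LZ4,
                             compression_method_options,
                             PGC_USERSET,
                             0,
                             NULL, NULL, NULL);

    DefineCustomIntVariable("pg_cryogen.compression_level",
                            "Compression level used for newly written blocks.",
                            NULL,
                            &compression_level,
                            1, 1, 19,
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}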
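
Third sketch, for item 4: roughly how one chunk of a compressed 1mb block gets written to an 8kb page under generic WAL. Again just an illustration, not the actual code; cryo_write_chunk() is a made-up name and blkno is assumed to refer to an already existing block.

#include "postgres.h"
#include "access/generic_xlog.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

void
cryo_write_chunk(Relation rel, BlockNumber blkno, const char *chunk, Size len)
{
    Buffer              buf;
    Page                page;
    GenericXLogState   *state;

    Assert(len <= BLCKSZ - MAXALIGN(SizeOfPageHeaderData));

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* modifications must be applied to the page image returned here */
    state = GenericXLogStart(rel);
    page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);

    /*
     * The standard 24-byte page header is required even though the
     * storage itself has no use for it.
     */
    PageInit(page, BLCKSZ, 0);
    memcpy(PageGetContents(page), chunk, len);

    GenericXLogFinish(state);       /* emits the (full-image) generic WAL record */
    UnlockReleaseBuffer(buf);
}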