On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmh...@gmail.com> wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), resource
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionId could reasonably be omitted for certain WAL
records (for example, index insertions don't really need the XID).
Additionally, the length field could be made variable-length, and any
padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B), for a minimal size of 14 bytes. I don't
yet know the performance of this, but considering that there will be
a lot more conditionals in header decoding, it might be slower for any
one backend but faster overall (fewer overall IOps).

The flags field would indicate which additional information is present:
[flag name (bits): explanation (additional xlog header data in bytes)]

- len_size (0..1): xlog record size is at most xlrec_header_only (0B),
  uint8_max (1B), uint16_max (2B), uint32_max (4B)
- has_xid (2): contains the transaction ID of the logging transaction
  (4B, or probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
  (rationale for logging CID in [0]; now in the record header because
  the XID is included there as well, and both are required for
  consistent snapshots)
- has_rminfo (4): has a non-zero resource-manager flags field (1B)
  (rationale for a separate field in [1]; non-zero allows a 1B space
  optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
  compression, or other purposes.

A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.

Kind regards,

Matthias van de Meent

[0] https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com
[1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de
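
For illustration only, the header-size arithmetic described in this
mail can be sketched in C as below. This is not the prototype patch:
every name here (the XLH_* macros, xlh_len_bytes, xlh_header_size) is
hypothetical, and the bit positions are simply taken from the flag
list above. It assumes a 4-byte XID and only computes how many header
bytes a given flags byte would imply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout, following the bit numbers in the flag list. */
#define XLH_LEN_SIZE_MASK      0x03  /* bits 0..1: width class of the length field */
#define XLH_HAS_XID            0x04  /* bit 2: 4-byte transaction ID present */
#define XLH_HAS_CID            0x08  /* bit 3: 4-byte command ID present */
#define XLH_HAS_RMINFO         0x10  /* bit 4: 1-byte rmgr flags field present */
#define XLH_SPECIAL_REL        0x20  /* bit 5: pre-existing definition */
#define XLH_CHECK_CONSISTENCY  0x40  /* bit 6: pre-existing definition */

/* Always present: prevptr (8B) + CRC (4B) + rmgr (1B) + flags (1B). */
#define XLH_FIXED_SIZE 14

/* Size of the current XLogRecord header, for comparison. */
#define XLOG_RECORD_CURRENT_SIZE 24

static_assert(XLH_FIXED_SIZE == 8 + 4 + 1 + 1,
              "fixed part is prevptr + CRC + rmgr + flags");

/* Bytes taken by the length field for len_size values 0..3: 0, 1, 2, 4. */
static size_t
xlh_len_bytes(uint8_t flags)
{
    static const size_t widths[4] = {0, 1, 2, 4};

    return widths[flags & XLH_LEN_SIZE_MASK];
}

/* Total header size implied by a flags byte (XID assumed to be 4 bytes). */
static size_t
xlh_header_size(uint8_t flags)
{
    size_t size = XLH_FIXED_SIZE + xlh_len_bytes(flags);

    if (flags & XLH_HAS_XID)
        size += 4;
    if (flags & XLH_HAS_CID)
        size += 4;
    if (flags & XLH_HAS_RMINFO)
        size += 1;
    return size;
}
```

With a 1-byte length and no optional fields, xlh_header_size gives 15
bytes; with a 2-byte length, rminfo and XID it gives 21 bytes, which
matches the figures quoted above and compares against
XLOG_RECORD_CURRENT_SIZE (24).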