On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote: > On 20/03/2024 03:36, Melanie Plageman wrote: > > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote: > > > On 15/03/2024 02:56, Melanie Plageman wrote: > > > > Okay, so I was going to start using xl_heap_prune for vacuum here too, > > > > but I realized it would be bigger because of the > > > > snapshotConflictHorizon. Do you think there is a non-terrible way to > > > > make the snapshotConflictHorizon optional? Like with a flag? > > > > > > Yeah, another flag would do the trick. > > > > Okay, I've done this in attached v4 (including removing > > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the > > "main chunk" of data available at replay regardless of whether or not > > the record ended up including an FPI. > > > > I made it its own sub-record (xlhp_conflict_horizon) less to help with > > alignment (though we can use all the help we can get there) and more to > > keep it from getting lost. When you look at heapam_xlog.h, you can see > > what a XLOG_HEAP2_PRUNE record will contain starting with the > > xl_heap_prune struct and then all the sub-record types. > > Ok, now that I look at this, I wonder if we're being overly cautious about > the WAL size. We probably could just always include the snapshot field, and > set it to InvalidTransactionId and waste 4 bytes when it's not needed. For > the sake of simplicity. I don't feel strongly either way though, the flag is > pretty simple too.
That will mean that all vacuum records are at least 3 bytes bigger than before -- which makes it somewhat less defensible to get rid of xl_heap_vacuum. That being said, I ended up doing an unaligned access when I packed it and made it optional, so maybe it is less user-friendly. But I also think that making it optional is more clear for vacuum which will never use it. > I realized that the WAL record format changes are pretty independent from > the rest of the patches. They could be applied before the rest. Without the > rest of the changes, we'll still write two WAL records per page in vacuum, > one to prune and another one to freeze, but it's another meaningful > incremental step. So I reshuffled the patches, so that the WAL format is > changed first, before the rest of the changes. Ah, great idea! That eliminates the issue of preliminary commits having larger WAL records that then get streamlined. > 0001-0008: These are the WAL format changes. There's some comment cleanup > needed, but as far as the code goes, I think these are pretty much ready to > be squashed & committed. My review in this email is *only* for 0001-0008. I have not looked at the rest yet. > From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 14:50:14 +0200 > Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats > > The new combined WAL record is now used for pruning, freezing and 2nd > pass of vacuum. This is in preparation of changing vacuuming to write > a combined prune+freeze record per page, instead of separate two > records. The new WAL record format now supports that, but the code > still always writes separate records for pruning and freezing. Attached patch changes-for-0001.patch has a bunch of updated comments -- especially for heapam_xlog.h (including my promised diagram). And a few suggestions (mostly changes that I should have made before). > > XXX I tried to lift-and-shift the code from v4 patch set as unchanged > as possible, for easier review, but some noteworthy changes: In the final commit message, I think it is worth calling out that all of those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and shifted from one file to another. When I am reviewing a big diff, it is always helpful to know where I need to review line-by-line. > > - Instead of passing PruneState and PageFreezeResult to > log_heap_prune_and_freeze(), pass the arrays of frozen, redirected > et al offsets directly. That way, it can be called from other places. good idea. > - moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to > heapdesc.c. (Because that's clearly where it belongs) :) > From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 13:45:01 +0200 > Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records > > Doesn't matter much because the WAL format is not compatible across > major versions anyway. But still seems nice to keep the identifiers > unchanged when we can. (There's some precedence for this if you search > the git history for "is free, was"). sounds good. > From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 13:48:29 +0200 > Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE > > To clarify that it also freezes now, and to make it clear that it's > significantly different from the old XLOG_HEAP2_PRUNE format. +1 > From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 13:49:59 +0200 > Subject: [PATCH v5 04/26] 'nplans' is a pointer > > I'm surprised the compiler didn't warn about this oops. while looking at this, I noticed that the asserts I added that nredirected > 0, ndead > 0, and nunused > 0 have the same problem. > --- > src/backend/access/rmgrdesc/heapdesc.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/src/backend/access/rmgrdesc/heapdesc.c > b/src/backend/access/rmgrdesc/heapdesc.c > index 8b94c869faf..9ef8a745982 100644 > --- a/src/backend/access/rmgrdesc/heapdesc.c > +++ b/src/backend/access/rmgrdesc/heapdesc.c > @@ -155,8 +155,7 @@ heap_xlog_deserialize_prune_and_freeze(char *cursor, > uint8 flags, > cursor += sizeof(OffsetNumber) * *nunused; > } > > - if (nplans > 0) > From 59f3f80f82ed7a63d86c991d0cb025e4cde2caec Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 13:36:41 +0200 > Subject: [PATCH v5 06/26] Fix logging snapshot conflict horizon. > > - it was accessed without proper alignment, which won't work on > architectures that are strict about alignment. Use memcpy. wow, oops. thanks for fixing this! > - in heap_xlog_prune_freeze, the code tried to access the xid with > "(xlhp_conflict_horizon *) (xlrec + SizeOfHeapPrune);" But 'xlrec' > was "xl_heap_prune *" rather than "char *". That happened to work, > because sizeof(xl_heap_prune) == 1, but make it more robust by > adding a cast to char *. good catch. > - remove xlhp_conflict_horizon and store a TransactionId directly. A > separate struct would make sense if we needed to store anything else > there, but for now it just seems like more code. makes sense. I just want to make sure heapam_xlog.h makes it extra clear that this is happening. I see your comment addition. If you combine it with my comment additions in the attached patch for 0001, hopefully that makes it clear enough. > From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 14:03:06 +0200 > Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze(). > > XXX: This should be rewritten, but I tried to at least list some > important points. Are you thinking that it needs to mention more things or that the things it mentions need more detail? > From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakan...@iki.fi> > Date: Wed, 20 Mar 2024 14:53:31 +0200 > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze() > > Mostly to make local variables more tightly-scoped. So, I don't think you can move those sub-records into the tighter scope. If you run tests with this applied, you'll see it crashes and a lot of the data in the record is wrong. If you move the sub-record declarations out to a wider scope, the tests pass. The WAL record data isn't actually copied into the buffer until recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE); after registering everything. So all of those sub-records you made are out of scope the time it tries to copy from them. I spent the better part of a day last week trying to figure out what was happening after I did the exact same thing. I must say that I found the xloginsert API incredibly unhelpful on this point. I would like to review the rest of the suggested changes in this patch after we fix the issue I just mentioned. - Melanie
>From 93d2790fac9c66c67165555d541410777ec9ad3b Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplage...@gmail.com> Date: Wed, 20 Mar 2024 14:13:33 -0400 Subject: [PATCH 2/9] comments on 0001 --- src/backend/access/heap/heapam.c | 4 +- src/backend/access/heap/pruneheap.c | 7 ++- src/include/access/heapam_xlog.h | 88 ++++++++++++++++++++--------- 3 files changed, 68 insertions(+), 31 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e6cfffd9f3e..17b733fd706 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -8706,6 +8706,8 @@ heap_xlog_prune(XLogReaderState *record) MarkBufferDirty(buffer); } + // TODO: should we avoid this if we only froze? heap_xlog_freeze() doesn't + // do it if (BufferIsValid(buffer)) { Size freespace = PageGetHeapFreeSpace(BufferGetPage(buffer)); @@ -8713,7 +8715,7 @@ heap_xlog_prune(XLogReaderState *record) UnlockReleaseBuffer(buffer); /* - * After modifying records on a page, it's useful to update the FSM + * After modifying tuples on a page, it's useful to update the FSM * about it, as it may cause the page become target for insertions * later even if vacuum decides not to visit it (which is possible if * gets marked all-visible.) diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 9773681868c..6fc5c22a22d 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -1294,7 +1294,8 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer, int nplans = 0; xl_heap_freeze_plan plans[MaxHeapTuplesPerPage]; OffsetNumber frz_offsets[MaxHeapTuplesPerPage]; - bool do_freeze = (nfrozen > 0); + bool do_freeze = (nfrozen > 0); // don't need these parantheses + // actually probably just lose this variable xlrec.flags = 0; @@ -1311,8 +1312,8 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer, xlrec.flags |= XLHP_HAS_CONFLICT_HORIZON; /* - * Prepare deduplicated representation for use in WAL record Destructively - * sorts tuples array in-place. + * Prepare deduplicated representation for use in WAL record. This + * destructively sorts frozen tuples array in-place. */ if (do_freeze) nplans = heap_log_freeze_plan(frozen, nfrozen, plans, frz_offsets); diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index dfeb703d136..14e0e49e539 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -225,22 +225,32 @@ typedef struct xl_heap_update #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber)) /* - * This is what we need to know about page pruning and freezing, both during - * VACUUM and during opportunistic pruning. + * These structures and flags encode VACUUM pruning and freezing and on-access + * pruning page modifications. * - * If XLPH_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS, or XLHP_HAS_NOW_UNUSED is set, - * acquires a full cleanup lock. Otherwise an ordinary exclusive lock is - * enough. This can happen if freezing was the only modification to the page. + * xl_heap_prune is the main record. The XLHP_HAS_* flags indicate which + * "sub-records" are included and the other XLHP_* flags provide additional + * information about the conditions for replay. * - * The data for block reference 0 contains "sub-records" depending on which - * of the XLHP_HAS_* flags are set. See xlhp_* struct definitions below. - * The layout is in the same order as the XLHP_* flags. + * The data for block reference 0 contains "sub-records" depending on which of + * the XLHP_HAS_* flags are set. Offset numbers are in the block reference data + * following each sub-record. See xlhp_* struct definitions below. The layout + * is in the same order as the XLHP_* flags. * - * OFFSET NUMBERS are in the block reference 0 - * - * If only unused item offsets are included because the record is constructed - * during vacuum's second pass (marking LP_DEAD items LP_UNUSED) then only an - * ordinary exclusive lock is required to replay. + * An example record with every sub-record included. + *----------------------------------------------------------------------------- + * uint8 flags (begin xl_heap_prune) + * TransactionId snapshot_conflict_horizon + * uint16 nplans (begin xlhp_freeze) + * xl_heap_freeze_plan plans[nplans] + * uint16 nredirected (begin xlhp_prune_items) + * OffsetNumber redirected[2 * nredirected] + * uint16 ndead (begin xlhp_prune_items) + * OffsetNumber nowdead[ndead] + * uint16 nunused (begin xlhp_prune_items) + * OffsetNumber nowunused[nunused] + * OffsetNumber frz_offsets[sum([plan.ntuples for plan in plans])] + *----------------------------------------------------------------------------- */ typedef struct xl_heap_prune { @@ -251,20 +261,42 @@ typedef struct xl_heap_prune #define XLHP_IS_CATALOG_REL (1 << 1) /* - * During vacuum's second pass which sets LP_DEAD items LP_UNUSED, we will only - * truncate the line pointer array, not call PageRepairFragmentation. We need - * this flag to differentiate what kind of lock (exclusive or cleanup) to take - * on the buffer and whether to call PageTruncateLinePointerArray() or - * PageRepairFragementation(). + * Vacuum's second pass sets LP_DEAD items LP_UNUSED and truncates the line + * pointer array with PageTruncateLinePointerArray(). It will emit a WAL + * record with XLHP_LP_TRUNCATE_ONLY set to indicate that only an ordinary + * exclusive lock is needed to replay the record. When XLHP_LP_TRUNCATE_ONLY is + * unset, we take a cleanup lock and call PageRepairFragementation(). */ #define XLHP_LP_TRUNCATE_ONLY (1 << 2) /* * Vacuum's first pass and on-access pruning may need to include a snapshot - * conflict horizon. + * conflict horizon. The snapshot conflict horizon is needed regardless of + * whether or not a full-page image was emitted, so the + * snapshot_conflict_horizon is located in the "main chunk" of the WAL record, + * available at replay with XLogRecGetData(), while all of the sub-records are + * located in the block reference data, available at replay with + * XLogRecGetBlockData(). */ #define XLHP_HAS_CONFLICT_HORIZON (1 << 3) + +/* + * Indicates that an xlhp_freeze sub-record and one or more xl_heap_freeze_plan + * sub-records are present. If XLHP_HAS_FREEZE_PLANS is set and no other page + * modifications will be made, an ordinary exclusive lock on the buffer is + * sufficient to replay the record. + */ #define XLHP_HAS_FREEZE_PLANS (1 << 4) + +/* + * XLPH_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS, and XLHP_HAS_NOW_UNUSED indicate + * that xlhp_prune_items sub-records with redirected, dead, and unused item + * offsets are present in the record. If XLHP_HAS_REDIRECTIONS or + * XLHP_HAS_DEAD_ITEMS is set or if XLHP_HAS_NOW_UNUSED items is set and + * XLHP_LP_TRUNCATE_ONLY is not set, a full cleanup lock on the buffer is + * needed to replay the record. Otherwise, an ordinary exclusive lock is + * sufficient. + */ #define XLHP_HAS_REDIRECTIONS (1 << 5) #define XLHP_HAS_DEAD_ITEMS (1 << 6) #define XLHP_HAS_NOW_UNUSED_ITEMS (1 << 7) @@ -299,15 +331,16 @@ typedef struct xl_heap_freeze_plan } xl_heap_freeze_plan; /* - * As of Postgres 17, XLOG_HEAP2_PRUNE records replace - * XLOG_HEAP2_FREEZE_PAGE records. + * As of Postgres 17, XLOG_HEAP2_PRUNE records replace XLOG_HEAP2_FREEZE_PAGE + * records. * * This is what we need to know about a block being frozen during vacuum * - * Backup block 0's data contains an array of xl_heap_freeze_plan structs - * (with nplans elements), followed by one or more page offset number arrays. - * Each such page offset number array corresponds to a single freeze plan - * (REDO routine freezes corresponding heap tuples using freeze plan). + * The backup block's data contains an array of xl_heap_freeze_plan structs + * (with nplans elements). The individual item offsets are located in an array + * at the end of the entire record with with nplans * (each plan's ntuples) + * members. Those offsets are in the same order as the plans. The REDO routine + * uses the offsets to freeze the corresponding heap tuples. */ typedef struct xlhp_freeze { @@ -316,8 +349,9 @@ typedef struct xlhp_freeze } xlhp_freeze; /* - * Sub-record type contained in block reference 0 of a prune record if - * XLHP_HAS_REDIRECTIONS/XLHP_HAS_DEAD_ITEMS/XLHP_HAS_NOW_UNUSED_ITEMS is set. + * Generic sub-record type contained in block reference 0 of an xl_heap_prune + * record and used for redirect, dead, and unused items if any of + * XLHP_HAS_REDIRECTIONS/XLHP_HAS_DEAD_ITEMS/XLHP_HAS_NOW_UNUSED_ITEMS are set. * Note that in the XLHP_HAS_REDIRECTIONS variant, there are actually 2 * * length number of OffsetNumbers in the data. */ -- 2.40.1