On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 03:36, Melanie Plageman wrote:
> > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:
> > > On 15/03/2024 02:56, Melanie Plageman wrote:
> > > > Okay, so I was going to start using xl_heap_prune for vacuum here too,
> > > > but I realized it would be bigger because of the
> > > > snapshotConflictHorizon. Do you think there is a non-terrible way to
> > > > make the snapshotConflictHorizon optional? Like with a flag?
> > > 
> > > Yeah, another flag would do the trick.
> > 
> > Okay, I've done this in attached v4 (including removing
> > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
> > "main chunk" of data available at replay regardless of whether or not
> > the record ended up including an FPI.
> > 
> > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > alignment (though we can use all the help we can get there) and more to
> > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > xl_heap_prune struct and then all the sub-record types.
> 
> Ok, now that I look at this, I wonder if we're being overly cautious about
> the WAL size. We probably could just always include the snapshot field, and
> set it to InvalidTransactionId and waste 4 bytes when it's not needed. For
> the sake of simplicity. I don't feel strongly either way though, the flag is
> pretty simple too.

That will mean that all vacuum records are at least 3 bytes bigger than
before -- which makes it somewhat less defensible to get rid of
xl_heap_vacuum.

That being said, I ended up doing an unaligned access when I
packed it and made it optional, so maybe it is less user-friendly.

But I also think that making it optional is more clear for vacuum which
will never use it.

> I realized that the WAL record format changes are pretty independent from
> the rest of the patches. They could be applied before the rest. Without the
> rest of the changes, we'll still write two WAL records per page in vacuum,
> one to prune and another one to freeze, but it's another meaningful
> incremental step. So I reshuffled the patches, so that the WAL format is
> changed first, before the rest of the changes.

Ah, great idea! That eliminates the issue of preliminary commits having
larger WAL records that then get streamlined.

> 0001-0008: These are the WAL format changes. There's some comment cleanup
> needed, but as far as the code goes, I think these are pretty much ready to
> be squashed & committed.

My review in this email is *only* for 0001-0008. I have not looked at
the rest yet.

> From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 14:50:14 +0200
> Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats
> 
> The new combined WAL record is now used for pruning, freezing and 2nd
> pass of vacuum. This is in preparation of changing vacuuming to write
> a combined prune+freeze record per page, instead of separate two
> records. The new WAL record format now supports that, but the code
> still always writes separate records for pruning and freezing.

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

> 
> XXX I tried to lift-and-shift the code from v4 patch set as unchanged
> as possible, for easier review, but some noteworthy changes:

In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.

> 
> - Instead of passing PruneState and PageFreezeResult to
>   log_heap_prune_and_freeze(), pass the arrays of frozen, redirected
>   et al offsets directly. That way, it can be called from other places.

good idea.

> - moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to
>   heapdesc.c. (Because that's clearly where it belongs)

:)

> From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 13:45:01 +0200
> Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records
> 
> Doesn't matter much because the WAL format is not compatible across
> major versions anyway. But still seems nice to keep the identifiers
> unchanged when we can. (There's some precedence for this if you search
> the git history for "is free, was").

sounds good.

> From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 13:48:29 +0200
> Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE
> 
> To clarify that it also freezes now, and to make it clear that it's
> significantly different from the old XLOG_HEAP2_PRUNE format.

+1 

> From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 13:49:59 +0200
> Subject: [PATCH v5 04/26] 'nplans' is a pointer
> 
> I'm surprised the compiler didn't warn about this

oops. while looking at this, I noticed that the asserts I added that
nredirected > 0, ndead > 0, and nunused > 0 have the same problem.

> ---
>  src/backend/access/rmgrdesc/heapdesc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/src/backend/access/rmgrdesc/heapdesc.c 
> b/src/backend/access/rmgrdesc/heapdesc.c
> index 8b94c869faf..9ef8a745982 100644
> --- a/src/backend/access/rmgrdesc/heapdesc.c
> +++ b/src/backend/access/rmgrdesc/heapdesc.c
> @@ -155,8 +155,7 @@ heap_xlog_deserialize_prune_and_freeze(char *cursor, 
> uint8 flags,
>               cursor += sizeof(OffsetNumber) * *nunused;
>       }
>  
> -     if (nplans > 0)

> From 59f3f80f82ed7a63d86c991d0cb025e4cde2caec Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 13:36:41 +0200
> Subject: [PATCH v5 06/26] Fix logging snapshot conflict horizon.
> 
> - it was accessed without proper alignment, which won't work on
>   architectures that are strict about alignment. Use memcpy.

wow, oops. thanks for fixing this!

> - in heap_xlog_prune_freeze, the code tried to access the xid with
>   "(xlhp_conflict_horizon *) (xlrec + SizeOfHeapPrune);" But 'xlrec'
>   was "xl_heap_prune *" rather than "char *". That happened to work,
>   because sizeof(xl_heap_prune) == 1, but make it more robust by
>   adding a cast to char *.

good catch.

> - remove xlhp_conflict_horizon and store a TransactionId directly. A
>   separate struct would make sense if we needed to store anything else
>   there, but for now it just seems like more code.

makes sense. I just want to make sure heapam_xlog.h makes it extra clear
that this is happening. I see your comment addition. If you combine it
with my comment additions in the attached patch for 0001, hopefully that
makes it clear enough.

> From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 14:03:06 +0200
> Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().
> 
> XXX: This should be rewritten, but I tried to at least list some
> important points.

Are you thinking that it needs to mention more things or that the things
it mentions need more detail?

> From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
> Date: Wed, 20 Mar 2024 14:53:31 +0200
> Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
> 
> Mostly to make local variables more tightly-scoped.

So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

        recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.

I would like to review the rest of the suggested changes in this patch
after we fix the issue I just mentioned.

- Melanie
>From 93d2790fac9c66c67165555d541410777ec9ad3b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Wed, 20 Mar 2024 14:13:33 -0400
Subject: [PATCH 2/9] comments on 0001

---
 src/backend/access/heap/heapam.c    |  4 +-
 src/backend/access/heap/pruneheap.c |  7 ++-
 src/include/access/heapam_xlog.h    | 88 ++++++++++++++++++++---------
 3 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e6cfffd9f3e..17b733fd706 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8706,6 +8706,8 @@ heap_xlog_prune(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 
+	// TODO: should we avoid this if we only froze? heap_xlog_freeze() doesn't
+	// do it
 	if (BufferIsValid(buffer))
 	{
 		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
@@ -8713,7 +8715,7 @@ heap_xlog_prune(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After modifying records on a page, it's useful to update the FSM
+		 * After modifying tuples on a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9773681868c..6fc5c22a22d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -1294,7 +1294,8 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
 	int			nplans = 0;
 	xl_heap_freeze_plan plans[MaxHeapTuplesPerPage];
 	OffsetNumber frz_offsets[MaxHeapTuplesPerPage];
-	bool		do_freeze = (nfrozen > 0);
+	bool		do_freeze = (nfrozen > 0); // don't need these parantheses
+										   // actually probably just lose this variable
 
 	xlrec.flags = 0;
 
@@ -1311,8 +1312,8 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
 		xlrec.flags |= XLHP_HAS_CONFLICT_HORIZON;
 
 	/*
-	 * Prepare deduplicated representation for use in WAL record Destructively
-	 * sorts tuples array in-place.
+	 * Prepare deduplicated representation for use in WAL record. This
+	 * destructively sorts frozen tuples array in-place.
 	 */
 	if (do_freeze)
 		nplans = heap_log_freeze_plan(frozen, nfrozen, plans, frz_offsets);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index dfeb703d136..14e0e49e539 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -225,22 +225,32 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about page pruning and freezing, both during
- * VACUUM and during opportunistic pruning.
+ * These structures and flags encode VACUUM pruning and freezing and on-access
+ * pruning page modifications.
  *
- * If XLPH_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS, or XLHP_HAS_NOW_UNUSED is set,
- * acquires a full cleanup lock. Otherwise an ordinary exclusive lock is
- * enough. This can happen if freezing was the only modification to the page.
+ * xl_heap_prune is the main record. The XLHP_HAS_* flags indicate which
+ * "sub-records" are included and the other XLHP_* flags provide additional
+ * information about the conditions for replay.
  *
- * The data for block reference 0 contains "sub-records" depending on which
- * of the XLHP_HAS_* flags are set. See xlhp_* struct definitions below.
- * The layout is in the same order as the XLHP_* flags.
+ * The data for block reference 0 contains "sub-records" depending on which of
+ * the XLHP_HAS_* flags are set. Offset numbers are in the block reference data
+ * following each sub-record. See xlhp_* struct definitions below. The layout
+ * is in the same order as the XLHP_* flags.
  *
- * OFFSET NUMBERS are in the block reference 0
- *
- * If only unused item offsets are included because the record is constructed
- * during vacuum's second pass (marking LP_DEAD items LP_UNUSED) then only an
- * ordinary exclusive lock is required to replay.
+ * An example record with every sub-record included.
+ *-----------------------------------------------------------------------------
+ * uint8 		       flags (begin xl_heap_prune)
+ * TransactionId       snapshot_conflict_horizon
+ * uint16		       nplans (begin xlhp_freeze)
+ * xl_heap_freeze_plan plans[nplans]
+ * uint16			   nredirected (begin xlhp_prune_items)
+ * OffsetNumber		   redirected[2 * nredirected]
+ * uint16			   ndead (begin xlhp_prune_items)
+ * OffsetNumber		   nowdead[ndead]
+ * uint16			   nunused (begin xlhp_prune_items)
+ * OffsetNumber		   nowunused[nunused]
+ * OffsetNumber		   frz_offsets[sum([plan.ntuples for plan in plans])]
+ *-----------------------------------------------------------------------------
  */
 typedef struct xl_heap_prune
 {
@@ -251,20 +261,42 @@ typedef struct xl_heap_prune
 #define		XLHP_IS_CATALOG_REL			(1 << 1)
 
 /*
- * During vacuum's second pass which sets LP_DEAD items LP_UNUSED, we will only
- * truncate the line pointer array, not call PageRepairFragmentation. We need
- * this flag to differentiate what kind of lock (exclusive or cleanup) to take
- * on the buffer and whether to call PageTruncateLinePointerArray() or
- * PageRepairFragementation().
+ * Vacuum's second pass sets LP_DEAD items LP_UNUSED and truncates the line
+ * pointer array with PageTruncateLinePointerArray(). It will emit a WAL
+ * record with XLHP_LP_TRUNCATE_ONLY set to indicate that only an ordinary
+ * exclusive lock is needed to replay the record. When XLHP_LP_TRUNCATE_ONLY is
+ * unset, we take a cleanup lock and call PageRepairFragementation().
  */
 #define		XLHP_LP_TRUNCATE_ONLY       (1 << 2)
 
 /*
  * Vacuum's first pass and on-access pruning may need to include a snapshot
- * conflict horizon.
+ * conflict horizon. The snapshot conflict horizon is needed regardless of
+ * whether or not a full-page image was emitted, so the
+ * snapshot_conflict_horizon is located in the "main chunk" of the WAL record,
+ * available at replay with XLogRecGetData(), while all of the sub-records are
+ * located in the block reference data, available at replay with
+ * XLogRecGetBlockData().
  */
 #define		XLHP_HAS_CONFLICT_HORIZON   (1 << 3)
+
+/*
+ * Indicates that an xlhp_freeze sub-record and one or more xl_heap_freeze_plan
+ * sub-records are present. If XLHP_HAS_FREEZE_PLANS is set and no other page
+ * modifications will be made, an ordinary exclusive lock on the buffer is
+ * sufficient to replay the record.
+ */
 #define		XLHP_HAS_FREEZE_PLANS		(1 << 4)
+
+/*
+ * XLPH_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS, and XLHP_HAS_NOW_UNUSED indicate
+ * that xlhp_prune_items sub-records with redirected, dead, and unused item
+ * offsets are present in the record. If XLHP_HAS_REDIRECTIONS or
+ * XLHP_HAS_DEAD_ITEMS is set or if XLHP_HAS_NOW_UNUSED items is set and
+ * XLHP_LP_TRUNCATE_ONLY is not set, a full cleanup lock on the buffer is
+ * needed to replay the record. Otherwise, an ordinary exclusive lock is
+ * sufficient.
+ */
 #define		XLHP_HAS_REDIRECTIONS		(1 << 5)
 #define		XLHP_HAS_DEAD_ITEMS	        (1 << 6)
 #define		XLHP_HAS_NOW_UNUSED_ITEMS   (1 << 7)
@@ -299,15 +331,16 @@ typedef struct xl_heap_freeze_plan
 } xl_heap_freeze_plan;
 
 /*
- * As of Postgres 17, XLOG_HEAP2_PRUNE records replace
- * XLOG_HEAP2_FREEZE_PAGE records.
+ * As of Postgres 17, XLOG_HEAP2_PRUNE records replace XLOG_HEAP2_FREEZE_PAGE
+ * records.
  *
  * This is what we need to know about a block being frozen during vacuum
  *
- * Backup block 0's data contains an array of xl_heap_freeze_plan structs
- * (with nplans elements), followed by one or more page offset number arrays.
- * Each such page offset number array corresponds to a single freeze plan
- * (REDO routine freezes corresponding heap tuples using freeze plan).
+ * The backup block's data contains an array of xl_heap_freeze_plan structs
+ * (with nplans elements). The individual item offsets are located in an array
+ * at the end of the entire record with with nplans * (each plan's ntuples)
+ * members. Those offsets are in the same order as the plans. The REDO routine
+ * uses the offsets to freeze the corresponding heap tuples.
  */
 typedef struct xlhp_freeze
 {
@@ -316,8 +349,9 @@ typedef struct xlhp_freeze
 } xlhp_freeze;
 
 /*
- * Sub-record type contained in block reference 0 of a prune record if
- * XLHP_HAS_REDIRECTIONS/XLHP_HAS_DEAD_ITEMS/XLHP_HAS_NOW_UNUSED_ITEMS is set.
+ * Generic sub-record type contained in block reference 0 of an xl_heap_prune
+ * record and used for redirect, dead, and unused items if any of
+ * XLHP_HAS_REDIRECTIONS/XLHP_HAS_DEAD_ITEMS/XLHP_HAS_NOW_UNUSED_ITEMS are set.
  * Note that in the XLHP_HAS_REDIRECTIONS variant, there are actually 2 *
  * length number of OffsetNumbers in the data.
  */
-- 
2.40.1

Reply via email to