Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Peter Geoghegan
On Wed, Apr 3, 2024 at 1:04 PM Melanie Plageman wrote:
> Thanks! And thanks for updating the commitfest entry!

Nice work!

-- 
Peter Geoghegan




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 12:34 PM Heikki Linnakangas  wrote:
>
> On 03/04/2024 17:18, Melanie Plageman wrote:
> > I noticed you didn't make the comment updates I suggested in my
> > version 13 here [1]. A few of them are outdated references to
> > heap_page_prune() and one to a now deleted variable name
> > (all_visible_except_removable).
> >
> > I applied them to your v13 and attached the diff.
>
> Applied those changes, and committed. Thank you!

Thanks! And thanks for updating the commitfest entry!

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Heikki Linnakangas

On 03/04/2024 17:18, Melanie Plageman wrote:

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.


Applied those changes, and committed. Thank you!


Off-list, Melanie reported that there is a small regression with the
benchmark script she posted yesterday, after all, but I'm not able to
reproduce that.


Actually, I think it was noise.


Ok, phew.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Melanie Plageman
On Wed, Apr 3, 2024 at 8:39 AM Heikki Linnakangas  wrote:
>
> On 02/04/2024 16:11, Heikki Linnakangas wrote:
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> >> Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> >> you didn't modify them much/at all, but I noticed some things in my code
> >> that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> >
> >> This probably doesn't belong here. I noticed spgdoinsert.c had a static
> >> function for sorting OffsetNumbers, but I didn't see anything general
> >> purpose anywhere else.
> >
> > I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> > be nice to have just one copy of this in some common place, but I also
> > wasn't sure where to put it.
>
> One more version, with two small fixes:
>
> 1. I fumbled the offsetnumber-cmp function at the last minute so that it
> didn't compile. Fixed that.

I noticed you didn't make the comment updates I suggested in my
version 13 here [1]. A few of them are outdated references to
heap_page_prune() and one to a now deleted variable name
(all_visible_except_removable).

I applied them to your v13 and attached the diff.

> Off-list, Melanie reported that there is a small regression with the
> benchmark script she posted yesterday, after all, but I'm not able to
> reproduce that.

Actually, I think it was noise.

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_aPqZkThyfr0USaHp-3cN_ruEdAHBKtNQJqXDTjWUz0rw%40mail.gmail.com
From 882e937c122f5e83bc9ba643443c1a27c807d82e Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 3 Apr 2024 10:12:44 -0400
Subject: [PATCH 3/3] comment updates

---
 src/backend/access/heap/pruneheap.c | 53 +
 src/backend/storage/ipc/procarray.c |  6 ++--
 src/include/access/heapam.h |  2 +-
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8ed44ba93d..4a6a4cee4d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -328,7 +328,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
- * heap_page_prune_and_freeze() is responsible for initializing it.
+ * heap_page_prune_and_freeze() is responsible for initializing it. Required by
+ * all callers.
  *
  * reason indicates why the pruning is performed.  It is included in the WAL
  * record for debugging and analysis purposes, but otherwise has no effect.
@@ -393,6 +394,7 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.pagefrz.freeze_required = false;
 	if (prstate.freeze)
 	{
+		Assert(new_relfrozen_xid && new_relmin_mxid);
 		prstate.pagefrz.FreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.NoFreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.FreezePageRelminMxid = *new_relmin_mxid;
@@ -415,19 +417,19 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.deadoffsets = presult->deadoffsets;
 
 	/*
-	 * Caller may update the VM after we're done.  We keep track of whether
-	 * the page will be all_visible and all_frozen, once we're done with the
-	 * pruning and freezing, to help the caller to do that.
+	 * Caller may update the VM after we're done.  We can keep track of
+	 * whether the page will be all-visible and all-frozen after pruning and
+	 * freezing to help the caller to do that.
 	 *
 	 * Currently, only VACUUM sets the VM bits.  To save the effort, only do
-	 * only the bookkeeping if the caller needs it.  Currently, that's tied to
-	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag, if you wanted
-	 * to update the VM bits without also freezing, or freezing without
+	 * the bookkeeping if the caller needs it.  Currently, that's tied to
+	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag if you wanted
+	 * to update the VM bits without also freezing or freeze without also
 	 * setting the VM bits.
 	 *
 	 * In addition to telling the caller whether it can set the VM bit, we
 	 * also use 'all_visible' and 'all_frozen' for our own decision-making. If
-	 * the whole page will become frozen, we consider opportunistically
+	 * the whole page would become frozen, we consider opportunistically
 	 * freezing tuples.  We will not be able to freeze the whole page if there
 	 * are tuples present that are 

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-03 Thread Heikki Linnakangas

On 02/04/2024 16:11, Heikki Linnakangas wrote:

On 01/04/2024 20:22, Melanie Plageman wrote:

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.


Ok, here's what I have now. I made a lot of small comment changes here
and there, and some minor local refactorings, but nothing major. I lost
track of all the individual changes I'm afraid, so I'm afraid you'll
have to just diff against the previous version if you want to see what's
changed. I hope I didn't break anything.

I'm pretty happy with this now. I will skim through it one more time
later today or tomorrow, and commit. Please review once more if you have
a chance.


This probably doesn't belong here. I noticed spgdoinsert.c had a static
function for sorting OffsetNumbers, but I didn't see anything general
purpose anywhere else.


I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
be nice to have just one copy of this in some common place, but I also
wasn't sure where to put it.


One more version, with two small fixes:

1. I fumbled the offsetnumber-cmp function at the last minute so that it 
didn't compile. Fixed that.


2. When VACUUMing an unlogged or temp table, the logic always assumed that
we would be generating an FPI, causing it to always freeze when it
could. But of course, you never generate FPIs on an unlogged table.
Fixed that. (Perhaps we should indeed freeze more aggressively on an
unlogged table, but changing the heuristic is out of scope for this patch.)
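
(A minimal sketch of the gate behind that fix, using the existing
RelationNeedsWAL() and XLogCheckBufferNeedsBackup() APIs; the helper name
and its placement are illustrative only:)

/* Sketch: the FPI-based freezing heuristic only applies to WAL-logged rels */
static bool
prune_would_emit_fpi(Relation relation, Buffer buffer)
{
	/* Unlogged and temp relations never write WAL, so they never emit FPIs */
	if (!RelationNeedsWAL(relation))
		return false;

	/* Otherwise, check whether modifying this buffer would require an FPI */
	return XLogCheckBufferNeedsBackup(buffer);
}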


Off-list, Melanie reported that there is a small regression with the 
benchmark script she posted yesterday, after all, but I'm not able to 
reproduce that.


--
Heikki Linnakangas
Neon (https://neon.tech)
From e04bda4666d7eaff0c520a9c9e1468a9c4cc9f51 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 2 Apr 2024 15:47:06 +0300
Subject: [PATCH v13 1/3] Refactor how heap_prune_chain() updates prunable_xid

In preparation of freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman 
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
---
 src/backend/access/heap/pruneheap.c | 125 
 1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index ef563e19aa5..1b5bf990d21 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -78,7 +78,11 @@ static void heap_prune_record_redirect(PruneState *prstate,
 static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
-static void heap_prune_record_unchanged(PruneState *prstate, OffsetNumber offnum);
+
+static void heap_prune_record_unchanged_lp_unused(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_dead(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_redirect(PruneState *prstate, OffsetNumber offnum);
 
 static void page_verify_redirects(Page page);
 
@@ -311,7 +315,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/* Nothing to do if slot doesn't contain a tuple */
 		if (!ItemIdIsUsed(itemid))
 		{
-			heap_prune_record_unchanged(&prstate, offnum);
+			heap_prune_record_unchanged_lp_unused(page, &prstate, offnum);
 			continue;
 		}
 
@@ -324,7 +328,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			if (unlikely(prstate.mark_unused_now))
 				heap_prune_record_unused(&prstate, offnum, false);
 			else
-				heap_prune_record_unchanged(&prstate, offnum);
+				heap_prune_record_unchanged_lp_dead(page, &prstate, offnum);
 			continue;
 		}
 
@@ -434,7 +438,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			}
 		}
 		else
-			heap_prune_record_unchanged(&prstate, offnum);
+			heap_prune_record_unchanged_lp_normal(page, presult->htsv, &prstate, offnum);
 	}
 
 	/* We should now have processed every tuple exactly once  */
@@ -652,9 +656,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 		 */
 		chainitems[nchain++] = offnum;
 
-		/*
-		 * Check tuple's visibility status.
-		 */
 		switch 

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Melanie Plageman
On Tue, Apr 02, 2024 at 01:24:27PM -0400, Melanie Plageman wrote:
> On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas  wrote:
> >
> > On 01/04/2024 20:22, Melanie Plageman wrote:
> > > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > > you didn't modify them much/at all, but I noticed some things in my code
> > > that could be better.
> >
> > Ok, here's what I have now. I made a lot of small comment changes here
> > and there, and some minor local refactorings, but nothing major. I lost
> > track of all the individual changes I'm afraid, so I'm afraid you'll
> > have to just diff against the previous version if you want to see what's
> > changed. I hope I didn't break anything.
> >
> > I'm pretty happy with this now. I will skim through it one more time
> > later today or tomorrow, and commit. Please review once more if you have
> > a chance.
> 
> Thanks!
> 
> 0001 looks good. Attached are some comment updates and such on top of
> 0001 and 0002.
> 
> I started some performance testing of 0002 but haven't finished yet. I
> wanted to provide my other review first.

I tried to do some performance tests of just on-access HOT pruning with
the patches in this thread applied. I'm not sure if I succeeded in being
targeted enough to have usable results.

Off-list Andres gave me some suggestions of how to improve my test case
and setup and this is what I ended up doing:


On-access pruning during a SELECT query:


# Make a table with a single not NULL column of a small datatype to fit
# as many tuples as possible on the page so each page we prune exercises
# those loops in heap_page_prune_and_freeze() and heap_prune_chain() as
# much as possible

psql -c "create table small(col smallint not null)"

# Insert data that is the same except for ~1 row per page with a different value

for i in $(seq 1000)
do
  psql -c "INSERT INTO small VALUES(2);" -c "INSERT INTO small SELECT 1 FROM 
(SELECT generate_series(1,220));"
done

# COPY this data to a file
psql -c "COPY small TO '/tmp/small.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

# Make the table unlogged so we don't see the effects of WAL writes in the
# results
#
# Make sure autovacuum doesn't run on the table
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
  copy small from '/tmp/small.data';
  update small set col = 9 where col = 2;
  select * from small where col = 0;

pgbench -n -f query.sql -t 100 -M prepared -r

# (I made sure that HOT pruning was actually happening for the SELECT
# query before running this with pgbench)

# There seemed to be no meaningful difference for this example with the
# patches:

on current master:
statement latencies in milliseconds and failures:
        12.387   0  drop table if exists small;
         1.914   0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
       100.152   0  copy small from '/tmp/small.data';
        49.480   0  update small set col = 9 where col = 2;
        46.835   0  select * from small where col = 0;

with the patches applied:
statement latencies in milliseconds and failures:
        13.281   0  drop table if exists small;
         1.952   0  create unlogged table small(col smallint not null) with (autovacuum_enabled = false);
        99.418   0  copy small from '/tmp/small.data';
        47.397   0  update small set col = 9 where col = 2;
        46.887   0  select * from small where col = 0;


On-access pruning during UPDATE:


# The idea is to test a query which would be calling the new
# heap_prune_record_unchanged_lp_normal() function a lot

# I made the same table but filled entirely with the same value

psql -c "create table small(col smallint not null)" \
-c "INSERT INTO small SELECT 1 FROM (SELECT generate_series(1, 
221000));"

# COPY this data to a file
psql -c "COPY small TO '/tmp/small_univalue.data';"

# Start postgres bound to a single CPU core

# Run the following script with pgbench

# Pick a low fillfactor so we have room for the HOT updates
  drop table if exists small;
  create unlogged table small(col smallint not null) with (autovacuum_enabled = false, fillfactor=25);
  copy small from '/tmp/small_univalue.data';
  update small set col = 3;
  update small set col = 4;
  update small set col = 5;
  update small set col = 6;

pgbench -n -f update.sql -t 20 -M prepared -r

# There again seems to be no meaningful difference with the patches
# applied

on master:
statement latencies in milliseconds and failures:
19.880   0  drop table if exists small;
 2.099   0  create unlogged table small(col 

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Melanie Plageman
On Tue, Apr 2, 2024 at 9:11 AM Heikki Linnakangas  wrote:
>
> On 01/04/2024 20:22, Melanie Plageman wrote:
> > Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
> > you didn't modify them much/at all, but I noticed some things in my code
> > that could be better.
>
> Ok, here's what I have now. I made a lot of small comment changes here
> and there, and some minor local refactorings, but nothing major. I lost
> track of all the individual changes I'm afraid, so I'm afraid you'll
> have to just diff against the previous version if you want to see what's
> changed. I hope I didn't break anything.
>
> I'm pretty happy with this now. I will skim through it one more time
> later today or tomorrow, and commit. Please review once more if you have
> a chance.

Thanks!

0001 looks good. Attached are some comment updates and such on top of
0001 and 0002.

I started some performance testing of 0002 but haven't finished yet. I
wanted to provide my other review first.

> > This probably doesn't belong here. I noticed spgdoinsert.c had a static
> > function for sorting OffsetNumbers, but I didn't see anything general
> > purpose anywhere else.
>
> I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would
> be nice to have just one copy of this in some common place, but I also
> wasn't sure where to put it.

I looked a bit through utils and common and didn't see anywhere
obvious to put it. We could make a new file? 0003 fixes where you
forgot to change the name of the qsort function, though.
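
(For reference, the comparator in question is tiny; given the common/int.h
include added in the attached 0003, it is presumably something like this
sketch:)

/* qsort() comparator for OffsetNumber, which is a uint16 */
static int
cmpOffsetNumbers(const void *a, const void *b)
{
	return pg_cmp_u16(*(const OffsetNumber *) a, *(const OffsetNumber *) b);
}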

- Melanie
From 3b69cf3123732c3296a784be8f4fc08ec024c0d5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 2 Apr 2024 11:11:07 -0400
Subject: [PATCH v13 3/4] fix qsort func

---
 src/backend/access/heap/vacuumlazy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ace95a4de2..c3a9dc1ad6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -46,6 +46,7 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/int.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1496,7 +1497,7 @@ lazy_scan_prune(LVRelState *vacrel,
 		 * sorted.
 		 */
 		qsort(presult.deadoffsets, presult.lpdead_items, sizeof(OffsetNumber),
-			  OffsetNumber_cmp);
+			  cmpOffsetNumbers);
 
 		dead_items_add(vacrel, blkno, presult.deadoffsets, presult.lpdead_items);
 	}
-- 
2.40.1

From 021407cee292c0b1ef5145f37aef889b68b739b0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 2 Apr 2024 11:22:35 -0400
Subject: [PATCH v13 4/4] update few more outdated comments

---
 src/backend/access/heap/pruneheap.c | 53 +
 src/backend/storage/ipc/procarray.c |  6 ++--
 src/include/access/heapam.h |  2 +-
 3 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 41c919b15b..ddc228c86d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -328,7 +328,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
- * heap_page_prune_and_freeze() is responsible for initializing it.
+ * heap_page_prune_and_freeze() is responsible for initializing it. Required by
+ * all callers.
  *
  * reason indicates why the pruning is performed.  It is included in the WAL
  * record for debugging and analysis purposes, but otherwise has no effect.
@@ -393,6 +394,7 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.pagefrz.freeze_required = false;
 	if (prstate.freeze)
 	{
+		Assert(new_relfrozen_xid && new_relmin_mxid);
 		prstate.pagefrz.FreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.NoFreezePageRelfrozenXid = *new_relfrozen_xid;
 		prstate.pagefrz.FreezePageRelminMxid = *new_relmin_mxid;
@@ -415,19 +417,19 @@ heap_page_prune_and_freeze(Relation relation, Buffer buffer,
 	prstate.deadoffsets = presult->deadoffsets;
 
 	/*
-	 * Caller may update the VM after we're done.  We keep track of whether
-	 * the page will be all_visible and all_frozen, once we're done with the
-	 * pruning and freezing, to help the caller to do that.
+	 * Caller may update the VM after we're done.  We can keep track of
+	 * whether the page will be all-visible and all-frozen after pruning and
+	 * freezing to help the caller to do that.
 	 *
 	 * Currently, only VACUUM sets the VM bits.  To save the effort, only do
-	 * only the bookkeeping if the caller needs it.  Currently, that's tied to
-	 * HEAP_PAGE_PRUNE_FREEZE, but it could be a separate flag, if you wanted
-	 * to update the VM bits without also freezing, or freezing without
+	 * the bookkeeping if the caller needs it.  

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-02 Thread Heikki Linnakangas

On 01/04/2024 20:22, Melanie Plageman wrote:

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.


Ok, here's what I have now. I made a lot of small comment changes here 
and there, and some minor local refactorings, but nothing major. I lost 
track of all the individual changes I'm afraid, so I'm afraid you'll 
have to just diff against the previous version if you want to see what's 
changed. I hope I didn't break anything.


I'm pretty happy with this now. I will skim through it one more time 
later today or tomorrow, and commit. Please review once more if you have 
a chance.



This probably doesn't belong here. I noticed spgdoinsert.c had a static
function for sorting OffsetNumbers, but I didn't see anything general
purpose anywhere else.


I copied the spgdoinsert.c implementation to vacuumlazy.c as is. Would 
be nice to have just one copy of this in some common place, but I also 
wasn't sure where to put it.


--
Heikki Linnakangas
Neon (https://neon.tech)
From be8891155c93f3555c49371f9804bdf5ba578f6e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 2 Apr 2024 15:47:06 +0300
Subject: [PATCH v12 1/2] Refactor how heap_prune_chain() updates prunable_xid

In preparation of freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman 
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
---
 src/backend/access/heap/pruneheap.c | 125 
 1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index ef563e19aa5..1b5bf990d21 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -78,7 +78,11 @@ static void heap_prune_record_redirect(PruneState *prstate,
 static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
 static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum, bool was_normal);
-static void heap_prune_record_unchanged(PruneState *prstate, OffsetNumber offnum);
+
+static void heap_prune_record_unchanged_lp_unused(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_normal(Page page, int8 *htsv, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_dead(Page page, PruneState *prstate, OffsetNumber offnum);
+static void heap_prune_record_unchanged_lp_redirect(PruneState *prstate, OffsetNumber offnum);
 
 static void page_verify_redirects(Page page);
 
@@ -311,7 +315,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		/* Nothing to do if slot doesn't contain a tuple */
 		if (!ItemIdIsUsed(itemid))
 		{
-			heap_prune_record_unchanged(&prstate, offnum);
+			heap_prune_record_unchanged_lp_unused(page, &prstate, offnum);
 			continue;
 		}
 
@@ -324,7 +328,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			if (unlikely(prstate.mark_unused_now))
 				heap_prune_record_unused(&prstate, offnum, false);
 			else
-				heap_prune_record_unchanged(&prstate, offnum);
+				heap_prune_record_unchanged_lp_dead(page, &prstate, offnum);
 			continue;
 		}
 
@@ -434,7 +438,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 			}
 		}
 		else
-			heap_prune_record_unchanged(&prstate, offnum);
+			heap_prune_record_unchanged_lp_normal(page, presult->htsv, &prstate, offnum);
 	}
 
 	/* We should now have processed every tuple exactly once  */
@@ -652,9 +656,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 		 */
 		chainitems[nchain++] = offnum;
 
-		/*
-		 * Check tuple's visibility status.
-		 */
 		switch (htsv_get_valid_status(htsv[offnum]))
 		{
 			case HEAPTUPLE_DEAD:
@@ -670,9 +671,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 			case HEAPTUPLE_RECENTLY_DEAD:
 
 /*
- * This tuple may soon become DEAD.  Update the hint field so
- * that the page is reconsidered for pruning in future.
- *
  * We don't need to advance the conflict horizon for
  * RECENTLY_DEAD tuples, even if we are removing them.  This
  * is because we only remove RECENTLY_DEAD tuples if they
@@ -681,8 +679,6 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
  * tuple by virtue of being later in the chain.  We will have
  * advanced the conflict horizon for the DEAD 

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-01 Thread Heikki Linnakangas

On 01/04/2024 20:22, Melanie Plageman wrote:

 From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 29 Mar 2024 19:43:09 -0400
Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions

We will eventually take additional actions in heap_page_prune() at the
discretion of the caller. For now, introduce these PRUNE_DO_* macros and
turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_


paramter -> parameter


action.
---
  src/backend/access/heap/pruneheap.c  | 51 ++--
  src/backend/access/heap/vacuumlazy.c | 11 --
  src/include/access/heapam.h  | 13 ++-
  3 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c 
b/src/backend/access/heap/pruneheap.c
index fb0ad834f1b..30965c3c5a1 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -29,10 +29,11 @@
  /* Working data for heap_page_prune and subroutines */
  typedef struct
  {
+   /* PRUNE_DO_* arguments */
+   uint8   actions;


I wasn't sure if actions is a good name. What do you think?


Committed this part, with the name 'options'. There's some precedent for 
that in heap_insert().


I decided to keep it a separate bool field here in the PruneState 
struct, though, and only changed it in the heap_page_prune() function 
signature. It didn't feel worth the code churn here, and 
'prstate.mark_unused_now' is shorter than "(prstate.options & 
HEAP_PRUNE_PAGE_MARK_UNUSED_NOW) != 0" anyway.
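
(Condensed, the resulting shape is: the caller passes an 'options' bitmask,
and the bit tested in hot paths is cached once as a plain bool.  A minimal
sketch, assuming <stdbool.h>; the flag names and bit values here are
illustrative, not necessarily the committed spellings:)

#define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW	(1 << 0)
#define HEAP_PAGE_PRUNE_FREEZE			(1 << 1)

typedef struct
{
	int		options;			/* flags as passed by the caller */
	bool	mark_unused_now;	/* cached bit test, for readability */
} PruneStateSketch;

static void
prune_state_sketch_init(PruneStateSketch *prstate, int options)
{
	prstate->options = options;
	prstate->mark_unused_now =
		(options & HEAP_PAGE_PRUNE_MARK_UNUSED_NOW) != 0;
}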


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-04-01 Thread Melanie Plageman
On Mon, Apr 1, 2024 at 1:37 PM Heikki Linnakangas  wrote:
>
> Committed the first of the remaining patches with those changes. And
> also this, which is worth calling out:
>
> > 		if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
> > 		{
> > 			ItemId		itemid = PageGetItemId(page, offnum);
> > 			HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);
> >
> > 			if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
> > 			{
> > 				HeapTupleHeaderAdvanceConflictHorizon(htup,
> > 													  &prstate.latest_xid_removed);
> > 				heap_prune_record_unused(&prstate, offnum, true);
> > 			}
> > 			else
> > 			{
> > 				/*
> > 				 * This tuple should've been processed and removed as part of
> > 				 * a HOT chain, so something's wrong.  To preserve evidence,
> > 				 * we don't dare to remove it.  We cannot leave behind a DEAD
> > 				 * tuple either, because that will cause VACUUM to error out.
> > 				 * Throwing an error with a distinct error message seems like
> > 				 * the least bad option.
> > 				 */
> > 				elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
> > 					 blockno, offnum);
> > 			}
> > 		}
> > 		else
> > 			heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);
>
> As you can see, I turned that into a hard error. Previously, that code
> was at the top of heap_prune_chain(), and it was normal to see DEAD
> heap-only tuples there, because they would presumably get processed
> later as part of a HOT chain. But now this is done after all the HOT
> chains have already been processed.
>
> Previously if there was a dead heap-only tuple like that on the page for
> some reason, it was silently not processed by heap_prune_chain()
> (because it was assumed that it would be processed later as part of a
> HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the
> pruning was done as part of VACUUM, VACUUM would fail with "ERROR:
> unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?

I think you are right. I wasn't sure if there was some way for a HOT,
DEAD tuple to be not HOT-updated, but that doesn't make much sense.

> Now you get that above error also on on-access pruning, which is not
> ideal. But I don't remember hearing about corruption like that, and
> you'd get the error on VACUUM anyway.

Yea, that makes sense. One thing I don't really understand is why
vacuum has its own system for saving and restoring error information
for context messages (LVSavedErrInfo and
update/restore_vacuum_err_info()). I'll confess I don't know much
about how error cleanup works in any sub-system. But it stuck out to
me that vacuum has its own. I assume it is okay to add new error
messages and they somehow will work with the existing system?
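
(For what it's worth, LVSavedErrInfo is just a save/restore of the fields
that vacuum's error-context callback reads, so new elog()/ereport() calls
pick up the right CONTEXT line automatically.  A generic sketch of that
callback pattern; only the elog.h API is real, all other names here are
illustrative:)

typedef struct
{
	const char *phase;
	BlockNumber	blkno;
} SketchErrInfo;

static void
sketch_error_callback(void *arg)
{
	SketchErrInfo *info = (SketchErrInfo *) arg;

	errcontext("while %s block %u", info->phase, info->blkno);
}

static void
sketch_do_work(SketchErrInfo *info)
{
	ErrorContextCallback errcallback;

	/* push our callback onto the error-context stack */
	errcallback.callback = sketch_error_callback;
	errcallback.arg = info;
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* any ERROR raised in here gets our CONTEXT line appended */

	/* pop it again when done */
	error_context_stack = errcallback.previous;
}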

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-04-01 Thread Heikki Linnakangas

On 01/04/2024 19:08, Melanie Plageman wrote:

On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:

diff --git a/src/backend/access/heap/pruneheap.c 
b/src/backend/access/heap/pruneheap.c
@@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
tup.t_tableOid = RelationGetRelid(relation);
  
  	/*

-* Determine HTSV for all tuples.
+* Determine HTSV for all tuples, and queue them up for processing as 
HOT
+* chain roots or as a heap-only items.


Reading this comment now as a whole, I would add something like
"Determining HTSV for all tuples once is required for correctness" to
the start of the second paragraph. The new conjunction on the first
paragraph sentence followed by the next paragraph is a bit confusing
because it sounds like queuing them up for processing is required for
correctness (which, I suppose it is on some level). Basically, I'm just
saying that it is now less clear what is required for correctness.


Fixed.


While I recognize this is a matter of style and not important, I
personally prefer this for reverse looping:

for (int i = prstate.nheaponly_items; i --> 0;)


I don't think we use that style anywhere in the Postgres source tree 
currently. (And I don't like it ;-) )



I do think a comment about the reverse order would be nice. I know it
says something above the first loop to this effect:

* Processing the items in reverse order (and thus the tuples in
* increasing order) increases prefetching efficiency significantly /
* decreases the number of cache misses.

So perhaps we could just say "as above, process the items in reverse
order"


I'm actually not sure why it makes a difference. I would assume all the
data is already in the CPU cache at this point, since the first loop
already accessed it, so I think there's something else going on. But I
didn't investigate further. Anyway, added a comment.


Committed the first of the remaining patches with those changes. And 
also this, which is worth calling out:



		if (prstate.htsv[offnum] == HEAPTUPLE_DEAD)
		{
			ItemId		itemid = PageGetItemId(page, offnum);
			HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);

			if (likely(!HeapTupleHeaderIsHotUpdated(htup)))
			{
				HeapTupleHeaderAdvanceConflictHorizon(htup,
													  &prstate.latest_xid_removed);
				heap_prune_record_unused(&prstate, offnum, true);
			}
			else
			{
				/*
				 * This tuple should've been processed and removed as part of
				 * a HOT chain, so something's wrong.  To preserve evidence,
				 * we don't dare to remove it.  We cannot leave behind a DEAD
				 * tuple either, because that will cause VACUUM to error out.
				 * Throwing an error with a distinct error message seems like
				 * the least bad option.
				 */
				elog(ERROR, "dead heap-only tuple (%u, %d) is not linked to from any HOT chain",
					 blockno, offnum);
			}
		}
		else
			heap_prune_record_unchanged_lp_normal(page, &prstate, offnum);


As you can see, I turned that into a hard error. Previously, that code 
was at the top of heap_prune_chain(), and it was normal to see DEAD 
heap-only tuples there, because they would presumably get processed 
later as part of a HOT chain. But now this is done after all the HOT 
chains have already been processed.


Previously if there was a dead heap-only tuple like that on the page for 
some reason, it was silently not processed by heap_prune_chain() 
(because it was assumed that it would be processed later as part of a 
HOT chain), and was left behind as a HEAPTUPLE_DEAD tuple. If the 
pruning was done as part of VACUUM, VACUUM would fail with "ERROR: 
unexpected HeapTupleSatisfiesVacuum result". Or am I missing something?


Now you get that above error also on on-access pruning, which is not 
ideal. But I don't remember hearing about corruption like that, and 
you'd get the error on VACUUM anyway.


With the next patches, heap_prune_record_unchanged() will do more, and 
will also throw an error on a HEAPTUPLE_LIVE tuple, so even though in 
the first patch we could print just a WARNING and move on, it gets more 
awkward with the rest of the patches.


(Continuing with the remaining patches..)

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-04-01 Thread Melanie Plageman
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> 
> > The final state of the code could definitely use more cleanup. I've been
> > staring at it for awhile, so I could use some thoughts/ideas about what
> > part to focus on improving.
> 
> Committed some of the changes. I plan to commit at least the first of these
> remaining patches later today. I'm happy with it now, but I'll give it a
> final glance over after dinner.
> 
> I'll continue to review the rest after that, but attached is what I have
> now.

Review for 0003-0006 (I didn't have any new thoughts on 0002). I know
you didn't modify them much/at all, but I noticed some things in my code
that could be better.

> From 17e183835a968e81daf7b74a4164b243e2de35aa Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Fri, 29 Mar 2024 19:43:09 -0400
> Subject: [PATCH v11 3/7] Introduce PRUNE_DO_* actions
> 
> We will eventually take additional actions in heap_page_prune() at the
> discretion of the caller. For now, introduce these PRUNE_DO_* macros and
> turn mark_unused_now, a paramter to heap_page_prune(), into a PRUNE_DO_

paramter -> parameter

> action.
> ---
>  src/backend/access/heap/pruneheap.c  | 51 ++--
>  src/backend/access/heap/vacuumlazy.c | 11 --
>  src/include/access/heapam.h  | 13 ++-
>  3 files changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> index fb0ad834f1b..30965c3c5a1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -29,10 +29,11 @@
>  /* Working data for heap_page_prune and subroutines */
>  typedef struct
>  {
> + /* PRUNE_DO_* arguments */
> + uint8   actions;

I wasn't sure if actions is a good name. What do you think?

> +
>   /* tuple visibility test, initialized for the relation */
>   GlobalVisState *vistest;
> - /* whether or not dead items can be set LP_UNUSED during pruning */
> -	bool		mark_unused_now;
>  
>   TransactionId new_prune_xid;/* new prune hint value for page */
>   TransactionId snapshotConflictHorizon;  /* latest xid removed */

> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 32a3fbce961..35b8486c34a 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -191,6 +191,17 @@ typedef struct HeapPageFreeze
>  
>  } HeapPageFreeze;
>  
> +/*
> + * Actions that can be taken during pruning and freezing. By default, we will
> + * at least attempt regular pruning.
> + */
> +
> +/*
> + * PRUNE_DO_MARK_UNUSED_NOW indicates whether or not dead items can be set
> + * LP_UNUSED during pruning.
> + */
> +#define  PRUNE_DO_MARK_UNUSED_NOW (1 << 1)


No reason for me to waste the zeroth bit here. I just realized that I
did this with XLHP_IS_CATALOG_REL too.

#define XLHP_IS_CATALOG_REL (1 << 1)

> From 083690b946e19ab5e536a9f2689772e7b91d2a70 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Fri, 29 Mar 2024 21:22:14 -0400
> Subject: [PATCH v11 4/7] Prepare freeze tuples in heap_page_prune()
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 35b8486c34a..ac129692c13 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
>  
> +/*
> + * Prepare to freeze if advantageous or required and try to advance
> + * relfrozenxid and relminmxid. To attempt freezing, we will need to 
> determine
> + * if the page is all frozen. So, if this action is set, we will also inform
> + * the caller if the page is all-visible and/or all-frozen and calculate a

I guess we don't inform the caller if the page is all-visible, so this
is not quite right.

> + * snapshot conflict horizon for updating the visibility map.
> + */
> +#define  PRUNE_DO_TRY_FREEZE (1 << 2)

> From ef8cb2c089ad9474a6da309593029c08f71b0bb9 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Fri, 29 Mar 2024 21:36:37 -0400
> Subject: [PATCH v11 5/7] Set hastup in heap_page_prune
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> index 8bdd6389b25..65b0ed185ff 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -66,6 +66,9 @@ typedef struct
>  	bool		processed[MaxHeapTuplesPerPage + 1];
>  
>  	int			ndeleted;		/* Number of tuples deleted from the page */
> +
> +	/* Whether or not the page makes rel truncation unsafe */
> +	bool		hastup;
>  } PruneState;
>  
>  /* Local functions */
> @@ -271,6 +274,7 @@ heap_page_prune(Relation relation, Buffer buffer,
>   prstate.ndeleted = 0;
>   prstate.nroot_items = 0;
>   prstate.nheaponly_items = 0;
> + prstate.hastup = false;
>  
>   /*
>* If we will prepare to 

Re: Combine Prune and Freeze records emitted by vacuum

2024-04-01 Thread Melanie Plageman
On Mon, Apr 01, 2024 at 05:17:51PM +0300, Heikki Linnakangas wrote:
> On 30/03/2024 07:57, Melanie Plageman wrote:
> > On Fri, Mar 29, 2024 at 12:32:21PM -0400, Melanie Plageman wrote:
> > > On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas wrote:
> > > > Here's another idea: In the first loop through the offsets, where we
> > > > gather the HTSV status of each item, also collect the offsets of all HOT
> > > > and non-HOT items to two separate arrays. Call heap_prune_chain() for
> > > > all the non-HOT items first, and then process any remaining HOT tuples
> > > > that haven't been marked yet.
> > > 
> > > That's an interesting idea. I'll try it out and see how it works.
> > 
> > Attached v10 implements this method of dividing tuples into HOT and
> > non-HOT and processing the potential HOT chains first then processing
> > tuples not marked by calling heap_prune_chain().
> > 
> > I have applied the refactoring of heap_prune_chain() to master and then
> > built the other patches on top of that.
> 
> Committed some of the changes. Continuing to reviewing the rest.

Thanks!

> > The early patches in the set include some additional comment cleanup as
> > well. 0001 is fairly polished. 0004 could use some variable renaming
> > (this patch partitions the tuples into HOT and not HOT and then
> > processes them). I was struggling with some of the names here
> > (chainmembers and chaincandidates is confusing).
> 
> I didn't understand why you wanted to juggle both partitions in the same
> array. So I separated them into two arrays, and called them 'root_items' and
> 'heaponly_items'.

I thought it was worth it to save the space. And the algorithm for doing
it seemed pretty straightforward. But looking at your patch, it is a lot
easier to understand with two arrays (since, for example, they can each
have a name).

> In some micro-benchmarks, the order that the items were processed made a
> measurable difference. So I'm processing the items in the reverse order.
> That roughly matches the order the items are processed in master, as it
> iterates the offsets from high-to-low in the first loop, and low-to-high in
> the second loop.

This makes sense. I noticed there isn't a comment about this above the
loop. It might be worth mentioning it.

Below is a review of only 0001. I'll look at the others shortly.

> From a6ab891779876e7cc1b4fb6fddb09f52f0094646 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 1 Apr 2024 16:59:38 +0300
> Subject: [PATCH v11 1/7] Handle non-chain tuples outside of heap_prune_chain()
> ---
>  src/backend/access/heap/pruneheap.c | 264 +---
>  1 file changed, 166 insertions(+), 98 deletions(-)
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> @@ -256,15 +270,16 @@ heap_page_prune(Relation relation, Buffer buffer,
>   tup.t_tableOid = RelationGetRelid(relation);
>  
>   /*
> -  * Determine HTSV for all tuples.
> +  * Determine HTSV for all tuples, and queue them up for processing as 
> HOT
> +  * chain roots or as a heap-only items.

Reading this comment now as a whole, I would add something like
"Determining HTSV for all tuples once is required for correctness" to
the start of the second paragraph. The new conjunction on the first
paragraph sentence followed by the next paragraph is a bit confusing
because it sounds like queuing them up for processing is required for
correctness (which, I suppose it is on some level). Basically, I'm just
saying that it is now less clear what is required for correctness.

>* This is required for correctness to deal with cases where running 
> HTSV
>* twice could result in different results (e.g. RECENTLY_DEAD can turn 
> to
>* DEAD if another checked item causes GlobalVisTestIsRemovableFullXid()
>* to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the
> -  * inserting transaction aborts, ...). That in turn could cause
> -  * heap_prune_chain() to behave incorrectly if a tuple is reached twice,
> -  * once directly via a heap_prune_chain() and once following a HOT 
> chain.
> +  * inserting transaction aborts, ...).  VACUUM assumes that there are no
> +  * normal DEAD tuples left on the page after pruning, so it needs to 
> have
> +  * the same understanding of what is DEAD and what is not.
>*
>* It's also good for performance. Most commonly tuples within a page 
> are
>* stored at decreasing offsets (while the items are stored at 
> increasing
> @@ -282,52 +297,140 @@ heap_page_prune(Relation relation, Buffer buffer,

> + /*
> +  * Process any heap-only tuples that were not already processed as part 
> of
> +  * a HOT chain.
> +  */

While I recognize this is a matter of style and not important, I
personally prefer this for reverse looping:

for (int i = prstate.nheaponly_items; i --> 0;)

I do think a comment about the 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-30 Thread Melanie Plageman
On Sat, Mar 30, 2024 at 8:00 AM Robert Haas  wrote:
>
> On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman wrote:
> > I think that we are actually successfully removing more RECENTLY_DEAD
> > HOT tuples than in master with heap_page_prune()'s new approach, and I
> > think it is correct; but let me know if I am missing something.
>
> /me blinks.
>
> Isn't zero the only correct number of RECENTLY_DEAD tuples to remove?

At the top of the comment for heap_prune_chain() in master, it says

 * If the item is an index-referenced tuple (i.e. not a heap-only tuple),
 * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
 * chain.  We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
 * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
 * DEAD, our visibility test is just too coarse to detect it.

Heikki had added a comment in one of his patches to the fast path for
HOT tuples at the top of heap_prune_chain():

 * Note that we might first arrive at a dead heap-only tuple
 * either while following a chain or here (in the fast
path).  Whichever path
 * gets there first will mark the tuple unused.
 *
 * Whether we arrive at the dead HOT tuple first here or while
 * following a chain above affects whether preceding RECENTLY_DEAD
 * tuples in the chain can be removed or not.  Imagine that you
 * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
 * reach the RECENTLY_DEAD tuple first, the chain-following logic
 * will find the DEAD tuple and conclude that both tuples are in
 * fact dead and can be removed.  But if we reach the DEAD tuple
 * at the end of the chain first, when we reach the RECENTLY_DEAD
 * tuple later, we will not follow the chain because the DEAD
 * TUPLE is already 'marked', and will not remove the
 * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
 * RECENTLY_DEAD tuple will be removed by a later VACUUM.

My patch splits the tuples into HOT and non-HOT while gathering their
visibility information and first calls heap_prune_chain() on the
non-HOT tuples and then processes the yet unmarked HOT tuples in a
separate loop afterward. This will follow all of the chains and
process them completely as well as processing all HOT tuples which may
not be reachable from a valid chain. The fast path contains a special
check to ensure that line pointers for DEAD not HOT-updated HOT tuples
(dead orphaned tuples from aborted HOT updates) are still marked
LP_UNUSED even though they are not reachable from a valid HOT chain.
By doing this later, we don't preclude ourselves from following all
chains.
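
(In code form, the flow described above is roughly the following
self-contained toy model; all names, types, and data here are illustrative,
not PostgreSQL code:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint16_t OffsetNumber;

#define NITEMS 8

static bool processed[NITEMS + 1];

/* stand-in for heap_prune_chain(): pretend each chain is just its root */
static void
prune_chain(OffsetNumber root)
{
	processed[root] = true;
}

int
main(void)
{
	/* which offsets hold heap-only tuples (index 0 unused, like a heap page) */
	bool		is_heap_only[NITEMS + 1] =
	{false, false, true, false, true, true, false, true, false};
	OffsetNumber root_items[NITEMS];
	OffsetNumber heaponly_items[NITEMS];
	int			nroot = 0;
	int			nheaponly = 0;

	/* Pass 1: classify every item (the real code also computes HTSV here) */
	for (int off = 1; off <= NITEMS; off++)
	{
		if (is_heap_only[off])
			heaponly_items[nheaponly++] = (OffsetNumber) off;
		else
			root_items[nroot++] = (OffsetNumber) off;
	}

	/* Pass 2: prune each HOT chain starting from its root */
	for (int i = 0; i < nroot; i++)
		prune_chain(root_items[i]);

	/* Pass 3: heap-only items no chain reached, processed in reverse order */
	for (int i = nheaponly - 1; i >= 0; i--)
	{
		OffsetNumber off = heaponly_items[i];

		if (!processed[off])
		{
			processed[off] = true;	/* e.g. orphaned DEAD tuple, or unchanged */
			printf("heap-only item %u was not part of any chain\n", (unsigned) off);
		}
	}
	return 0;
}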

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-30 Thread Robert Haas
On Sat, Mar 30, 2024 at 1:57 AM Melanie Plageman wrote:
> I think that we are actually successfully removing more RECENTLY_DEAD
> HOT tuples than in master with heap_page_prune()'s new approach, and I
> think it is correct; but let me know if I am missing something.

/me blinks.

Isn't zero the only correct number of RECENTLY_DEAD tuples to remove?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-29 Thread Melanie Plageman
On Fri, Mar 29, 2024 at 12:00 PM Heikki Linnakangas  wrote:
>
> On 29/03/2024 07:04, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:
> >> These comments could use another pass. I had added some extra
> >> (probably redundant) content when I thought I was refactoring it a
> >> certain way and then changed my mind.
> >>
> >> Attached is a diff with some ideas I had for a bit of code simplification.
> >>
> >> Are you working on cleaning this patch up or should I pick it up?
> >
> > Attached v9 is rebased over master. But, more importantly, I took
> > another pass at heap_prune_chain() and am pretty happy with what I came
> > up with. See 0021. I simplified the traversal logic and then grouped the
> > chain processing into three branches at the end. I find it much easier
> > to understand what we are doing for different types of HOT chains.
>
> Ah yes, agreed, that's nicer.
>
> The 'survivor' variable is a little confusing, especially here:
>
> 	if (!survivor)
> 	{
> 		int			i;
> 
> 		/*
> 		 * If no DEAD tuple was found, and the root is redirected, mark it as
> 		 * such.
> 		 */
> 		if ((i = ItemIdIsRedirected(rootlp)))
> 			heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);
> 
> 		/* the rest of tuples in the chain are normal, unchanged tuples */
> 		for (; i < nchain; i++)
> 			heap_prune_record_unchanged(dp, prstate, chainitems[i]);
> 	}
>
> You would think that "if(!survivor)" means that there are no live
> tuples on the chain, i.e. no survivors. But in fact it's the opposite;
> it means that they are all live. Maybe call it 'ndeadchain' instead,
> meaning the number of dead items in the chain.

Makes sense.

> > I've done that in my version. While testing this, I found that only
> > on-access pruning needed this final loop calling record_unchanged() on
> > items not yet marked. I know we can't skip this final loop entirely in
> > the ON ACCESS case because it calls record_prunable(), but we could
> > consider moving that back out into heap_prune_chain()? Or what do you
> > think?
>
> Hmm, why is that different with on-access pruning?

Well, it is highly possible we just don't hit cases like this with
vacuum in our test suite (not that it is unreachable by vacuum).

It's just very easy to get in this situation with on-access pruning.
Imagine an UPDATE which caused the following chain:

RECENTLY_DEAD -> DELETE_IN_PROGRESS -> INSERT_IN_PROGRESS

It invokes heap_page_prune_and_freeze() (assume the page meets the
criteria for on-access pruning) and eventually enters
heap_prune_chain() with the first offset in this chain.

The first item is LP_NORMAL and the tuple is RECENTLY_DEAD, so the
survivor variable stays 0 and we record_unchanged() for that tuple and
return. The next two items are LP_NORMAL and the tuples are HOT
tuples, so we just return from the "fast path" at the top of
heap_prune_chain(). After invoking heap_prune_chain() for all of them,
the first offset is marked but the other two are not. Thus, we end up
having to record_unchanged() later. This kind of thing is a super
common case that we see all the time in queries in the regression test
suite.

> Here's another idea: In the first loop through the offsets, where we
> gather the HTSV status of each item, also collect the offsets of all HOT
> and non-HOT items to two separate arrays. Call heap_prune_chain() for
> all the non-HOT items first, and then process any remaining HOT tuples
> that haven't been marked yet.

That's an interesting idea. I'll try it out and see how it works.

> > I haven't finished updating all the comments, but I am really interested
> > to know what you think about heap_prune_chain() now.
>
> Looks much better now, thanks!

I am currently doing chain traversal refactoring in heap_prune_chain()
on top of master as the first patch in the set.

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-29 Thread Heikki Linnakangas

On 29/03/2024 07:04, Melanie Plageman wrote:

On Thu, Mar 28, 2024 at 11:07:10AM -0400, Melanie Plageman wrote:

These comments could use another pass. I had added some extra
(probably redundant) content when I thought I was refactoring it a
certain way and then changed my mind.

Attached is a diff with some ideas I had for a bit of code simplification.

Are you working on cleaning this patch up or should I pick it up?


Attached v9 is rebased over master. But, more importantly, I took
another pass at heap_prune_chain() and am pretty happy with what I came
up with. See 0021. I simplified the traversal logic and then grouped the
chain processing into three branches at the end. I find it much easier
to understand what we are doing for different types of HOT chains.


Ah yes, agreed, that's nicer.

The 'survivor' variable is a little confusing, especially here:

	if (!survivor)
	{
		int			i;

		/*
		 * If no DEAD tuple was found, and the root is redirected, mark it as
		 * such.
		 */
		if ((i = ItemIdIsRedirected(rootlp)))
			heap_prune_record_unchanged_lp_redirect(prstate, rootoffnum);

		/* the rest of tuples in the chain are normal, unchanged tuples */
		for (; i < nchain; i++)
			heap_prune_record_unchanged(dp, prstate, chainitems[i]);
	}

You would think that "if(!survivor)" means that there are no live
tuples on the chain, i.e. no survivors. But in fact it's the opposite;
it means that they are all live. Maybe call it 'ndeadchain' instead,
meaning the number of dead items in the chain.



I got rid of revisited. We can put it back, but I was thinking: we stash
all HOT tuples and then loop over them later, calling record_unchanged()
on the ones that aren't marked. But, if we have a lot of HOT tuples, is
this really that much better than just looping through all the offsets
and calling record_unchanged() on just the ones that aren't marked?


Well, it requires looping through all the offsets one more time, and 
unless you have a lot of HOT tuples, most items would be marked already. 
But maybe the overhead is negligible anyway.



I've done that in my version. While testing this, I found that only
on-access pruning needed this final loop calling record_unchanged() on
items not yet marked. I know we can't skip this final loop entirely in
the ON ACCESS case because it calls record_prunable(), but we could
consider moving that back out into heap_prune_chain()? Or what do you
think?


Hmm, why is that different with on-access pruning?

Here's another idea: In the first loop through the offsets, where we 
gather the HTSV status of each item, also collect the offsets of all HOT 
and non-HOT items to two separate arrays. Call heap_prune_chain() for 
all the non-HOT items first, and then process any remaining HOT tuples 
that haven't been marked yet.



I haven't finished updating all the comments, but I am really interested
to know what you think about heap_prune_chain() now.


Looks much better now, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-03-28 Thread Melanie Plageman
On Thu, Mar 28, 2024 at 4:49 AM Heikki Linnakangas  wrote:
>
> On 28/03/2024 03:53, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> >> One change with this is that live_tuples and many of the other fields are
> >> now again updated, even if the caller doesn't need them. It was hard to 
> >> skip
> >> them in a way that would save any cycles, with the other refactorings.
> >
> > I am worried we are writing checks that are going to have to come out of
> > SELECT queries' bank accounts, but I'll do some benchmarking when we're
> > all done with major refactoring.
>
> Sounds good, thanks.

Actually, after having looked at it again, there really are only a few
more increments of various counters, the memory consumed by revisit,
and the additional setting of offsets in marked. I think a few
carefully constructed queries which do on-access pruning could test
the impact of this (as opposed to a bigger benchmarking endeavor).

I also wonder if there would be any actual impact of marking the
various record_*() routines inline.

> >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
> >>* DEAD, our visibility test is just too coarse to detect it.
> >>*
> >>  * In general, pruning must never leave behind a DEAD tuple that still has
> >> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
> >> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
> >> + * That's why
> >>  * VACUUM prunes the same heap page a second time (without dropping its lock
> >>  * in the interim) when it sees a newly DEAD tuple that we initially saw as
> >> - * in-progress.  Retrying pruning like this can only happen when an inserting
> >> + * in-progress (FIXME: Really? Does it still do that?).
> >
> > Yea, I think all that is no longer true. I missed this comment back
> > then.
>
> Committed a patch to remove it.
>
> >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
> >>   * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> >>   * but we can't advance past anything else.  We have to make sure that
> >>   * we don't miss any DEAD tuples, since DEAD tuples that still have
> >> - * tuple storage after pruning will confuse VACUUM.
> >> + * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
> >> + * I think?).
> >
> > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
> > with storage after pruning anymore?
>
> I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't
> loop through the items on the page checking their visibility anymore.
>
> Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove
> all the dead line pointers that we have now removed from the indexes.
> When we do that, we assume them all to be dead line pointers, without
> storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So
> it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples,
> they are not included in 'deadoffsets'.

It seems like master only adds items it is marking LP_DEAD to
deadoffsets. And things marked LP_DEAD have lp_len set to 0.
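
(For context, that is what ItemIdSetDead() does -- paraphrasing itemid.h 
from memory, so worth double-checking the real header -- which is also why 
the ItemIdHasStorage() assert suggested below holds:)

#define ItemIdSetDead(itemId) \
( \
	(itemId)->lp_flags = LP_DEAD, \
	(itemId)->lp_off = 0, \
	(itemId)->lp_len = 0 \
)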

> In any case, let's just make sure that pruning doesn't leave
> HEAPTUPLE_DEAD tuples. There's no reason it should.

Maybe worth adding an assert to

static void
heap_prune_record_unchanged_lp_dead(ItemId itemid, PruneState
*prstate, OffsetNumber offnum)
{
...
Assert(!ItemIdHasStorage(itemid));
prstate->deadoffsets[prstate->lpdead_items++] = offnum;
}

By the way, I wasn't quite sure about the purpose of
heap_prune_record_unchanged_lp_dead(). It is called in
heap_prune_chain() in a place where we didn't add things to
deadoffsets before, no?

/*
 * Likewise, a dead line pointer can't be part of the chain. (We
 * already eliminated the case of dead root tuple outside this
 * function.)
 */
if (ItemIdIsDead(lp))
{
/*
 * If the caller set PRUNE_DO_MARK_UNUSED_NOW, we can set dead
 * line pointers LP_UNUSED now.
 */
if (unlikely(prstate->actions & PRUNE_DO_MARK_UNUSED_NOW))
heap_prune_record_unused(prstate, offnum, false);
else
heap_prune_record_unchanged_lp_dead(lp, prstate, offnum);
break;
}

> >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, 
> >> PruneState *prstate, OffsetNu
> >>   * ensure the math works out. The assumption that the transaction will
> >>   * commit and update the counters after we report is a bit shaky; but 
> >> it
> >>   * is what acquire_sample_rows() does, so we do the same to be 
> >> consistent.
> >> + *
> >> + * HEAPTUPLE_DEAD are handled by the other 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-28 Thread Heikki Linnakangas

On 28/03/2024 03:53, Melanie Plageman wrote:

On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:

One change with this is that live_tuples and many of the other fields are
now again updated, even if the caller doesn't need them. It was hard to skip
them in a way that would save any cycles, with the other refactorings.


I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.


Sounds good, thanks.


+			 *
+			 * Whether we arrive at the dead HOT tuple first here or while
+			 * following a chain below affects whether preceding RECENTLY_DEAD
+			 * tuples in the chain can be removed or not.  Imagine that you
+			 * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
+			 * reach the RECENTLY_DEAD tuple first, the chain-following logic
+			 * will find the DEAD tuple and conclude that both tuples are in
+			 * fact dead and can be removed.  But if we reach the DEAD tuple
+			 * at the end of the chain first, when we reach the RECENTLY_DEAD
+			 * tuple later, we will not follow the chain because the DEAD
+			 * tuple is already 'marked', and will not remove the
+			 * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
+			 * RECENTLY_DEAD tuple will be removed by a later VACUUM.
 */
if (!HeapTupleHeaderIsHotUpdated(htup))


Is this intentional? Like would it be correct to remove the
RECENTLY_DEAD tuple during the current vacuum?


Yes, it would be correct. And if we happen to visit the items in a 
different order, reaching the RECENTLY_DEAD tuple first, it will get removed.


(This is just based on me reading the code, I'm not sure if I've missed 
something. Would be nice to construct a test case for that and step 
through it with a debugger to really see what happens. But this is not a 
new issue, doesn't need to be part of this patch)



 From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 28 Mar 2024 00:45:26 +0200
Subject: [PATCH v8 22/22] WIP

- Got rid of all_visible_except_removable again. We're back to
   roughly the same mechanism as on 'master', where all_visible
   doesn't include LP_DEAD items, but at the end of
   heap_page_prune_and_freeze() when we return to the caller, we clear
   it if there were any LP_DEAD items. I considered calling the
   variable 'all_visible_except_lp_dead', which would be more accurate,
   but it's also very long.


not longer than all_visible_except_removable. I would be happy to keep
it more exact, but I'm also okay with just all_visible.


Ok, let's make it 'all_visible_except_lp_dead' for clarity.


What's this "cutoffs TODO"?


+ * cutoffs TODO
+ *
   * presult contains output parameters needed by callers such as the number of
   * tuples removed and the number of line pointers newly marked LP_DEAD.
   * heap_page_prune_and_freeze() is responsible for initializing it.


All the other arguments are documented in the comment, except 'cutoffs'.


@@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
   * DEAD, our visibility test is just too coarse to detect it.
   *
   * In general, pruning must never leave behind a DEAD tuple that still has
- * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
+ * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
+ * That's why
   * VACUUM prunes the same heap page a second time (without dropping its lock
   * in the interim) when it sees a newly DEAD tuple that we initially saw as
- * in-progress.  Retrying pruning like this can only happen when an inserting
+ * in-progress (FIXME: Really? Does it still do that?).


Yea, I think all that is no longer true. I missed this comment back
then.


Committed a patch to remove it.


@@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
		 * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
		 * but we can't advance past anything else.  We have to make sure that
		 * we don't miss any DEAD tuples, since DEAD tuples that still have
-		 * tuple storage after pruning will confuse VACUUM.
+		 * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
+		 * I think?).


Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
with storage after pruning anymore?


I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't 
loop through the items on the page checking their visibility anymore.


Hmm, one confusion remains though: In the 2nd 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> On 27/03/2024 20:26, Melanie Plageman wrote:
> > On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
> > > 
> > > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > > I need some way to modify the control flow or accounting such that I
> > > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > > And a way to consider freezing and do live tuple accounting for these
> > > > and HEAPTUPLE_LIVE tuples exactly once.
> > > 
> > > Just a quick update: I've been massaging this some more today, and I
> think I've got something palatable. I'll send an updated patch later
> today, but the key is to note that for each item on the page, there is
> one point where we determine the fate of the item, whether it's pruned
> or not. That can happen at different points in heap_page_prune().
> That's also when we set marked[offnum] = true. Whenever that happens, we
> always call one of the heap_page_prune_record_*() subroutines. We already
> > > have those subroutines for when a tuple is marked as dead or unused, but
> > > let's add similar subroutines for the case that we're leaving the tuple
> > > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > > can ensure that it gets done exactly once for each tuple, and at that
> > > point we know what we are going to do to the tuple, so we can count it
> > > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > > ensures that each tuple is marked only once, and the subroutines update
> > > all the variables, add the item to the correct arrays etc. depending on
> > > what we're doing with it.
> > 
> > Yes, this would be ideal.
> 
> Well, that took me a lot longer than expected. My approach of "make sure you
> call the right heap_prune_record_*() subroutine in all cases" didn't work out
> quite as easily as I thought. Because, as you pointed out, it's difficult to
> know if a non-DEAD tuple that is part of a HOT chain will be visited later
> as part of the chain processing, or needs to be counted at the top of
> heap_prune_chain().
> 
> The solution I came up with is to add a third phase to pruning. At the top
> of heap_prune_chain(), if we see a live heap-only tuple, and we're not sure
> if it will be counted later as part of a HOT chain, we stash it away and
> revisit it later, after processing all the hot chains. That's somewhat
> similar to your 'counted' array, but not quite.

I like this idea (the third phase). I've just started reviewing this but
I thought I would give the initial thoughts I had inline.

> One change with this is that live_tuples and many of the other fields are
> now again updated, even if the caller doesn't need them. It was hard to skip
> them in a way that would save any cycles, with the other refactorings.

I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.

> From 2f38628373ccfb6e8f8fd883955056030092569d Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Thu, 28 Mar 2024 00:16:09 +0200
> Subject: [PATCH v8 21/22] Add comment about a pre-existing issue
> 
> Not sure if we want to keep this, but I wanted to document it for
> discussion at least.
> ---
>  src/backend/access/heap/pruneheap.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/src/backend/access/heap/pruneheap.c 
> b/src/backend/access/heap/pruneheap.c
> index e37ba655a7d..2b720ab6aa1 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -792,6 +792,23 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
>			 * Note that we might first arrive at a dead heap-only tuple
>			 * either here or while following a chain below.  Whichever path
>			 * gets there first will mark the tuple unused.
> +			 *
> +			 * FIXME: The next paragraph isn't new with these patches, but
> +			 * just something I realized while looking at this. But maybe we should
> +			 * add a comment like this? Or is it too much detail?

I think a comment is a good idea.

> +			 *
> +			 * Whether we arrive at the dead HOT tuple first here or while
> +			 * following a chain below affects whether preceding RECENTLY_DEAD
> +			 * tuples in the chain can be removed or not.  Imagine that you
> +			 * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
> +			 * reach the RECENTLY_DEAD tuple first, the chain-following logic
> +			 * will find the DEAD tuple and conclude that both tuples are in
> +			 * fact dead and can be removed.  But if 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 2:26 PM Melanie Plageman
 wrote:
>
> On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
> >
> > On 27/03/2024 17:18, Melanie Plageman wrote:
> > > I need some way to modify the control flow or accounting such that I
> > > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > > And a way to consider freezing and do live tuple accounting for these
> > > and HEAPTUPLE_LIVE tuples exactly once.
> >
> > Just a quick update: I've been massaging this some more today, and I
> > think I've got something palatable. I'll send an updated patch later
> > today, but the key is to note that for each item on the page, there is
> > one point where we determine the fate of the item, whether it's pruned
> > or not. That can happen at different points in heap_page_prune().
> > That's also when we set marked[offnum] = true. Whenever that happens, we
> > always call one of the heap_page_prune_record_*() subroutines. We already
> > have those subroutines for when a tuple is marked as dead or unused, but
> > let's add similar subroutines for the case that we're leaving the tuple
> > unchanged. If we move all the bookkeeping logic to those subroutines, we
> > can ensure that it gets done exactly once for each tuple, and at that
> > point we know what we are going to do to the tuple, so we can count it
> > correctly. So heap_prune_chain() decides what to do with each tuple, and
> > ensures that each tuple is marked only once, and the subroutines update
> > all the variables, add the item to the correct arrays etc. depending on
> > what we're doing with it.
>
> Yes, this would be ideal.
>
> I was doing some experimentation with pageinspect today (trying to
> find that single place where live tuples' fates are decided) and it
> seems like a heap-only tuple that is not HOT-updated will usually be
> the one at the end of the chain. Which seems like it would be covered
> by adding a record_live() type function call  in the loop of
> heap_prune_chain():
>
> /*
>  * If the tuple is not HOT-updated, then we are at the end of this
>  * HOT-update chain.
>  */
> if (!HeapTupleHeaderIsHotUpdated(htup))
> {
> heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);
> break;
> }
>
> but that doesn't end up producing the same results as
>
> if (HeapTupleHeaderIsHeapOnly(htup)
> && !HeapTupleHeaderIsHotUpdated(htup) &&
> presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
> heap_prune_record_live_or_recently_dead(dp, prstate,
> offnum, presult);

sorry, that should say presult->htsv[rootoffnum] != HEAPTUPLE_DEAD.

The latter should be a subset of the former. But instead it seems
there are cases I missed by doing only the former.

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Wed, Mar 27, 2024 at 12:18 PM Heikki Linnakangas  wrote:
>
> On 27/03/2024 17:18, Melanie Plageman wrote:
> > I need some way to modify the control flow or accounting such that I
> > know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
> > And a way to consider freezing and do live tuple accounting for these
> > and HEAPTUPLE_LIVE tuples exactly once.
>
> Just a quick update: I've been massaging this some more today, and I
> think I've got something palatable. I'll send an updated patch later
> today, but the key is to note that for each item on the page, there is
> one point where we determine the fate of the item, whether it's pruned
> or not. That can happen at different points in heap_page_prune().
> That's also when we set marked[offnum] = true. Whenever that happens, we
> always call one of the heap_page_prune_record_*() subroutines. We already
> have those subroutines for when a tuple is marked as dead or unused, but
> let's add similar subroutines for the case that we're leaving the tuple
> unchanged. If we move all the bookkeeping logic to those subroutines, we
> can ensure that it gets done exactly once for each tuple, and at that
> point we know what we are going to do to the tuple, so we can count it
> correctly. So heap_prune_chain() decides what to do with each tuple, and
> ensures that each tuple is marked only once, and the subroutines update
> all the variables, add the item to the correct arrays etc. depending on
> what we're doing with it.

Yes, this would be ideal.

I was doing some experimentation with pageinspect today (trying to
find that single place where live tuples' fates are decided) and it
seems like a heap-only tuple that is not HOT-updated will usually be
the one at the end of the chain. Which seems like it would be covered
by adding a record_live() type function call  in the loop of
heap_prune_chain():

/*
 * If the tuple is not HOT-updated, then we are at the end of this
 * HOT-update chain.
 */
if (!HeapTupleHeaderIsHotUpdated(htup))
{
heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);
break;
}

but that doesn't end up producing the same results as

if (HeapTupleHeaderIsHeapOnly(htup)
&& !HeapTupleHeaderIsHotUpdated(htup) &&
presult->htsv[rootoffnum] == HEAPTUPLE_DEAD)
heap_prune_record_live_or_recently_dead(dp, prstate,
offnum, presult);

at the top of heap_prune_chain().

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Peter Geoghegan
On Tue, Mar 19, 2024 at 9:36 PM Melanie Plageman
 wrote:
>  * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
>  *
>  * Trackers used when heap_freeze_execute_prepared freezes, or when 
> there
>  * are zero freeze plans for a page.  It is always valid for 
> vacuumlazy.c
>  * to freeze any page, by definition.  This even includes pages that 
> have
>  * no tuples with storage to consider in the first place.  That way 
> the
>  * 'totally_frozen' results from heap_prepare_freeze_tuple can always 
> be
>  * used in the same way, even when no freeze plans need to be 
> executed to
>  * "freeze the page".  Only the "freeze" path needs to consider the 
> need
>  * to set pages all-frozen in the visibility map under this scheme.
>  *
>  * When we freeze a page, we generally freeze all XIDs < OldestXmin, 
> only
>  * leaving behind XIDs that are ineligible for freezing, if any.  And 
> so
>  * you might wonder why these trackers are necessary at all; why 
> should
>  * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
>  * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
>  *
>  * It is useful to use a definition of "freeze the page" that does not
>  * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
>  * generally prefers to remove Multis eagerly, but lazy processing is 
> used
>  * in cases where laziness allows VACUUM to avoid allocating a new 
> Multi.
>  * The "freeze the page" trackers enable this flexibility.
>  */
>
> So, I don't really know if it is right to just check presult->nfrozen >
> 0 when updating relminmxid. I have changed it to the way you suggested.
> But we can change it back.

I think that this is just about safe. I had to check, though. I see
that the FRM_NOOP case (within
FreezeMultiXactId/heap_prepare_freeze_tuple) will ratchet back both
sets of trackers (both the freeze and no freeze variants). However,
it's rather hard to see that this is true.

The intent here was that cases where "presult->nfrozen == 0" would
always take the "freeze" path. That seems more natural to me, at
least, since I think of the freeze path as the default choice. By
definition, lazy_scan_prune() can always take the freeze path -- even
when the page has no tuples with storage. But it cannot always take
the no-freeze path -- "disobeying" pagefrz.freeze_required creates the
risk that relfrozenxid/relminmxid will be advanced to unsafe values at
the end of the VACUUM. IMV you should stick with that approach now,
even if it is currently safe to do it the other way around.

-- 
Peter Geoghegan




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Heikki Linnakangas

On 27/03/2024 17:18, Melanie Plageman wrote:

I need some way to modify the control flow or accounting such that I
know which HEAPTUPLE_RECENTLY_DEAD tuples will not be marked LP_DEAD.
And a way to consider freezing and do live tuple accounting for these
and HEAPTUPLE_LIVE tuples exactly once.


Just a quick update: I've been massaging this some more today, and I 
think I've got something palatable. I'll send an updated patch later 
today, but the key is to note that for each item on the page, there is 
one point where we determine the fate of the item, whether it's pruned 
or not. That can happen at different points in heap_page_prune(). 
That's also when we set marked[offnum] = true. Whenever that happens, we 
always call one of the heap_page_prune_record_*() subroutines. We already 
have those subroutines for when a tuple is marked as dead or unused, but 
let's add similar subroutines for the case that we're leaving the tuple 
unchanged. If we move all the bookkeeping logic to those subroutines, we 
can ensure that it gets done exactly once for each tuple, and at that 
point we know what we are going to do to the tuple, so we can count it 
correctly. So heap_prune_chain() decides what to do with each tuple, and 
ensures that each tuple is marked only once, and the subroutines update 
all the variables, add the item to the correct arrays etc. depending on 
what we're doing with it.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-03-27 Thread Melanie Plageman
On Tue, Mar 26, 2024 at 5:46 PM Melanie Plageman
 wrote:
>
> On Mon, Mar 25, 2024 at 09:33:38PM +0200, Heikki Linnakangas wrote:
> > On 24/03/2024 18:32, Melanie Plageman wrote:
> > > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  
> > > wrote:
> > > >
> > > > In heap_page_prune_and_freeze(), we now do some extra work on each live
> > > > tuple, to set the all_visible_except_removable correctly. And also to
> > > > update live_tuples, recently_dead_tuples and hastup. When we're not
> > > > freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> > > > enough that it doesn't matter, but is it?
> > >
> > > Last year on an early version of the patch set I did some pgbench
> > > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > > that workload -- and I don't remember it being a showstopper. The code
> > > has changed a fair bit since then. However, I think it might be safer
> > > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > the rest of the loop after heap_prune_satisfies_vacuum() when
> > > on-access pruning invokes it. I had avoided that because it felt ugly
> > > and error-prone, however it addresses a few other of your points as
> > > well.
> >
> > Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> > argument described what it does, rather than who it's for. For example,
> > 'need_all_visible'. If set to true, the function determines 'all_visible',
> > otherwise it does not.
>
> A very rough v7 is attached. The whole thing is rebased over master and
> then 0016 contains an attempt at the refactor we discussed in this
> email.
>
> Instead of just using the PruneReason to avoid doing the extra steps
> when on-access pruning calls heap_page_prune_and_freeze(), I've made an
> "actions" variable and defined different flags for it. One of them is
> a replacement for the existing mark_unused_now flag. I defined another
> one, PRUNE_DO_TRY_FREEZE, which could be used in place of checking if
> pagefrz is NULL.
>
> There is a whole group of activities that only the vacuum caller does
> outside of freezing -- setting hastup, counting live and recently dead
> tuples, determining whole page visibility and a snapshot conflict
> horizon for updating the VM. But I didn't want to introduce separate
> flags for each of them, because then I would have to check each of them
> before taking the action. That would be lots of extra branching and
> on-access pruning does none of those actions while vacuum does all of
> them.
>
> > I started to look closer at the loops in heap_prune_chain() and how they
> > update all the various flags and counters. There's a lot going on there. We
> > have:
> >
> > - live_tuples counter
> > - recently_dead_tuples counter
> > - all_visible[_except_removable]
> > - all_frozen
> > - visibility_cutoff_xid
> > - hastup
> > - prstate.frozen array
> > - nnewlpdead
> > - deadoffsets array
> >
> > And that doesn't even include all the local variables and the final
> > dead/redirected arrays.
> >
> > Some of those are set in the first loop that initializes 'htsv' for each
> > tuple on the page. Others are updated in heap_prune_chain(). Some are
> > updated in both. It's hard to follow which are set where.
> >
> > I think recently_dead_tuples is updated incorrectly, for tuples that are
> > part of a completely dead HOT chain. For example, imagine a hot chain with
> > two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow the
> > chain, see the DEAD tuple at the end of the chain, and mark both tuples for
> > pruning. However, we already updated 'recently_dead_tuples' in the first
> > loop, which is wrong if we remove the tuple.
>
> Ah, yes, you are so right about this bug.
>
> > Maybe that's the only bug like this, but I'm a little scared. Is there
> > something we could do to make this simpler? Maybe move all the new work that
> > we added to the first loop, into heap_prune_chain() ? Maybe introduce a few
> > more helper heap_prune_record_*() functions, to update the flags and
> > counters also for live and insert/delete-in-progress tuples and for dead
> > line pointers? Something like heap_prune_record_live() and
> > heap_prune_record_lp_dead().
>
> I like the idea of a heap_prune_record_live_or_recently_dead() function.
> That's what I've attempted to implement in the attached 0016. I haven't
> updated and cleaned up everything (especially comments) in the refactor,
> but there are two major issues:
>
> 1) In heap_prune_chain(), a heap-only tuple which is not HOT updated may
> end up being a live tuple not part of any chain or it may end up the
> redirect target in a HOT chain. At the top of heap_prune_chain(), we
> return if (HeapTupleHeaderIsHeapOnly(htup)). We may come back to this
> tuple later if it is part of a chain. If we don't, we need to have
> called heap_prune_record_live_or_recently_dead(). However, there are
> other tuples that get redirected to which do not meet this criteria, so
> 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-25 Thread Melanie Plageman
Thanks for committing the new WAL format!

On Mon, Mar 25, 2024 at 3:33 PM Heikki Linnakangas  wrote:
>
> On 24/03/2024 18:32, Melanie Plageman wrote:
> > On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:
> >>
> >> In heap_page_prune_and_freeze(), we now do some extra work on each live
> >> tuple, to set the all_visible_except_removable correctly. And also to
> >> update live_tuples, recently_dead_tuples and hastup. When we're not
> >> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> >> enough that it doesn't matter, but is it?
> >
> > Last year on an early version of the patch set I did some pgbench
> > tpcb-like benchmarks -- since there is a lot of on-access pruning in
> > that workload -- and I don't remember it being a showstopper. The code
> > has changed a fair bit since then. However, I think it might be safer
> > to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
> > the rest of the loop after heap_prune_satisifies_vacuum() when
> > on-access pruning invokes it. I had avoided that because it felt ugly
> > and error-prone, however it addresses a few other of your points as
> > well.
>
> Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the
> argument described what it does, rather than who it's for. For example,
> 'need_all_visible'. If set to true, the function determines
> 'all_visible', otherwise it does not.

I like that way of putting it -- describing what it does instead of
who it is for. However, we now have PruneReason as an argument to
heap_page_prune(), which would be usable for this purpose (for
skipping the rest of the first loop). It is not descriptive of how we
would use it in this scenario, though.

> I started to look closer at the loops in heap_prune_chain() and how they
> update all the various flags and counters. There's a lot going on there.
> We have:
>
> - live_tuples counter
> - recently_dead_tuples counter
> - all_visible[_except_removable]
> - all_frozen
> - visibility_cutoff_xid
> - hastup
> - prstate.frozen array
> - nnewlpdead
> - deadoffsets array
>
> And that doesn't even include all the local variables and the final
> dead/redirected arrays.

Yes, there are a lot of things happening. In an early version, I had
hoped for the first loop to be just getting the visibility information
and then to do most of the other stuff as we went in
heap_prune_chain() as you mention below. I couldn't quite get a
version of that working that looked nice. I agree that the whole thing
feels a bit brittle and error-prone. It's hard to be objective after
fiddling with something over the course of a year. I'm trying to take
a step back now and rethink it.

> Some of those are set in the first loop that initializes 'htsv' for each
> tuple on the page. Others are updated in heap_prune_chain(). Some are
> updated in both. It's hard to follow which are set where.

Yep.

> I think recently_dead_tuples is updated incorrectly, for tuples that are
> part of a completely dead HOT chain. For example, imagine a hot chain
> with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow
> the chain, see the DEAD tuple at the end of the chain, and mark both
> tuples for pruning. However, we already updated 'recently_dead_tuples'
> in the first loop, which is wrong if we remove the tuple.
>
> Maybe that's the only bug like this, but I'm a little scared. Is there
> something we could do to make this simpler? Maybe move all the new work
> that we added to the first loop, into heap_prune_chain() ? Maybe
> introduce a few more helper heap_prune_record_*() functions, to update
> the flags and counters also for live and insert/delete-in-progress
> tuples and for dead line pointers? Something like
> heap_prune_record_live() and heap_prune_record_lp_dead().

I had discarded previous attempts to get everything done in
heap_prune_chain() because it was hard to make sure I was doing the
right thing given that it visits the line pointers out of order so
making sure you've considered all of them once and only once was hard.
I hadn't thought of the approach you suggested with record_live() --
that might help. I will work on this tomorrow. I had hoped to get
something out today, but I am still in the middle of rebasing the back
20 patches from your v5 over current master and then adding in the
suggestions that I made in the various diffs on the thread.

> > Note that I still don't think we have a resolution on what to
> > correctly update new_relfrozenxid and new_relminmxid to at the end
> > when presult->nfrozen == 0 and presult->all_frozen is true.
> >
> >  if (presult->nfrozen > 0)
> >  {
> >  presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
> >  presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
> >  }
> >  else
> >  {
> >  presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
> >  presult->new_relminmxid = 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-25 Thread Heikki Linnakangas

On 24/03/2024 18:32, Melanie Plageman wrote:

On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:


In heap_page_prune_and_freeze(), we now do some extra work on each live
tuple, to set the all_visible_except_removable correctly. And also to
update live_tuples, recently_dead_tuples and hastup. When we're not
freezing, that's a waste of cycles, the caller doesn't care. I hope it's
enough that it doesn't matter, but is it?


Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisfies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.


Ok. I'm not a fan of the name 'by_vacuum' though. It'd be nice if the 
argument described what it does, rather than who it's for. For example, 
'need_all_visible'. If set to true, the function determines 
'all_visible', otherwise it does not.


I started to look closer at the loops in heap_prune_chain() and how they 
update all the various flags and counters. There's a lot going on there. 
We have:


- live_tuples counter
- recently_dead_tuples counter
- all_visible[_except_removable]
- all_frozen
- visibility_cutoff_xid
- hastup
- prstate.frozen array
- nnewlpdead
- deadoffsets array

And that doesn't even include all the local variables and the final 
dead/redirected arrays.


Some of those are set in the first loop that initializes 'htsv' for each 
tuple on the page. Others are updated in heap_prune_chain(). Some are 
updated in both. It's hard to follow which are set where.


I think recently_dead_tuples is updated incorrectly, for tuples that are 
part of a completely dead HOT chain. For example, imagine a hot chain 
with two tuples: RECENTLY_DEAD -> DEAD. heap_prune_chain() would follow 
the chain, see the DEAD tuple at the end of the chain, and mark both 
tuples for pruning. However, we already updated 'recently_dead_tuples' 
in the first loop, which is wrong if we remove the tuple.


Maybe that's the only bug like this, but I'm a little scared. Is there 
something we could do to make this simpler? Maybe move all the new work 
that we added to the first loop, into heap_prune_chain() ? Maybe 
introduce a few more helper heap_prune_record_*() functions, to update 
the flags and counters also for live and insert/delete-in-progress 
tuples and for dead line pointers? Something like 
heap_prune_record_live() and heap_prune_record_lp_dead().



The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
these patches' fault). This at least is wrong, because Max(a, b)
doesn't handle XID wraparound correctly:


   if (do_freeze)
   conflict_xid = 
Max(prstate.snapshotConflictHorizon,
  
presult->frz_conflict_horizon);
   else
   conflict_xid = prstate.snapshotConflictHorizon;


Then there's this in lazy_scan_prune():


   /* Using same cutoff when setting VM is now unnecessary */
   if (presult.all_frozen)
   presult.frz_conflict_horizon = InvalidTransactionId;

This does the right thing in the end, but if all the tuples are frozen
shouldn't frz_conflict_horizon already be InvalidTransactionId? The
comment says it's "newest xmin on the page", and if everything was
frozen, all xmins are FrozenTransactionId. In other words, that should
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
caller. Also, frz_conflict_horizon is only set correctly if
'all_frozen==true', would be good to mention that in the comments too.


Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.


Makes sense.


I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the
latest xid of all live tuples on the page.

Though we end up using visibility_cutoff_xid as the freeze conflict
horizon if the page is 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-25 Thread Heikki Linnakangas

On 24/03/2024 21:55, Melanie Plageman wrote:

On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:

On 20/03/2024 21:17, Melanie Plageman wrote:
There is another patch in the commitfest that touches this area: "Recording
whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
[1]. That actually goes in the opposite direction than this patch. That
patch wants to add more information, to show whether a record was emitted by
VACUUM or on-access pruning, while this patch makes the freezing and
VACUUM's 2nd phase records also look the same. We could easily add more
flags to xl_heap_prune to distinguish them. Or assign different xl_info code
for them, like that other patch proposed. But I don't think that needs to
block this patch, that can be added as a separate patch.

[1] 
https://www.postgresql.org/message-id/cah2-wzmsevhox+hjpfmqgcxwwdgnzp0r9f+vbnpok6tgxmp...@mail.gmail.com


I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.


Ok, I used separate 'info' codes for records generated by on-access 
pruning and vacuum's 1st and 2nd pass. Similar to Peter's patch on that 
other thread, except that I didn't reserve the whole high bit for this, 
but used three different 'info' codes. Freezing uses the same 
XLOG_HEAP2_PRUNE_VACUUM_SCAN as the pruning in vacuum's 1st pass. You 
can distinguish them based on whether the record has nfrozen > 0, and 
with the rest of the patches, they will be merged anyway.
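
For concreteness, the split looks roughly like this in heapam_xlog.h 
(XLOG_HEAP2_PRUNE_VACUUM_SCAN is the name used above; the other two names 
and the numeric values here are illustrative guesses, not necessarily what 
got committed):

/* one 'info' code per pruning caller */
#define XLOG_HEAP2_PRUNE_ON_ACCESS		0x10
#define XLOG_HEAP2_PRUNE_VACUUM_SCAN	0x20
#define XLOG_HEAP2_PRUNE_VACUUM_CLEANUP	0x30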



 From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 22 Mar 2024 23:10:22 +0200
Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
  
  /*

- * Handles XLOG_HEAP2_PRUNE record type.
- *
- * Acquires a full cleanup lock.
+ * Replay XLOG_HEAP2_PRUNE_FREEZE record.
   */
  static void
-heap_xlog_prune(XLogReaderState *record)
+heap_xlog_prune_freeze(XLogReaderState *record)
  {
XLogRecPtr  lsn = record->EndRecPtr;
-   xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
+   char   *ptr;
+   xl_heap_prune *xlrec;
Buffer  buffer;
RelFileLocator rlocator;
BlockNumber blkno;
XLogRedoAction action;
  
  	XLogRecGetBlockTag(record, 0, , NULL, );

+   ptr = XLogRecGetData(record);


I don't love having two different pointers that alias each other and we
don't know which one is for what. Perhaps we could memcpy xlrec like in
my attached diff (log_updates.diff). It also might perform slightly
better than accessing flags through an xl_heap_prune * -- since it
wouldn't be doing pointer dereferencing.


Ok.


/*
-* We're about to remove tuples. In Hot Standby mode, ensure that 
there's
-* no queries running for which the removed tuples are still visible.
+* We will take an ordinary exclusive lock or a cleanup lock depending 
on
+* whether the XLHP_CLEANUP_LOCK flag is set.  With an ordinary 
exclusive
+* lock, we better not be doing anything that requires moving existing
+* tuple data.
 */
-   if (InHotStandby)
-   
ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
-  
 xlrec->isCatalogRel,
+   Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
+  (xlrec->flags & (XLHP_HAS_REDIRECTIONS | 
XLHP_HAS_DEAD_ITEMS)) == 0);
+
+   /*
+* We are about to remove and/or freeze tuples.  In Hot Standby mode,
+* ensure that there are no queries running for which the removed tuples
+* are still visible or which still consider the frozen xids as running.
+* The conflict horizon XID comes after xl_heap_prune.
+*/
+   if (InHotStandby && (xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
+   {


My attached patch has a TODO here for the comment. It sticks out that
the serialization and deserialization conditions are different for the
snapshot conflict horizon. We don't deserialize if InHotStandby is
false. That's still correct now because we don't write anything else to
the main data chunk afterward. But, if someone were to add another
member after snapshot_conflict_horizon, they would want to know to
deserialize snapshot_conflict_horizon first even if InHotStandby is
false.


Good point. Fixed so that 'snapshot_conflict_horizon' is deserialized 
even if !InHotStandby. A memcpy is cheap, and is probably optimized away 
by the compiler anyway.
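
So the redo side now has roughly this shape (a sketch stitched together 
from the fragments quoted in this thread, not necessarily the exact 
committed code):

	if ((xlrec->flags & XLHP_HAS_CONFLICT_HORIZON) != 0)
	{
		TransactionId snapshot_conflict_horizon;

		/* memcpy() because the field is not guaranteed to be aligned */
		memcpy(&snapshot_conflict_horizon, ptr, sizeof(TransactionId));
		ptr += sizeof(TransactionId);

		/* deserialized unconditionally, but only acted on in Hot Standby */
		if (InHotStandby)
			ResolveRecoveryConflictWithSnapshot(snapshot_conflict_horizon,
												xlrec->isCatalogRel,
												rlocator);
	}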



+   TransactionId snapshot_conflict_horizon;
+


I would throw a comment in about the memcpy 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-24 Thread Melanie Plageman
On Sat, Mar 23, 2024 at 01:09:30AM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 21:17, Melanie Plageman wrote:
> > Attached patch changes-for-0001.patch has a bunch of updated comments --
> > especially for heapam_xlog.h (including my promised diagram). And a few
> > suggestions (mostly changes that I should have made before).
> 
> New version of these WAL format changes attached. Squashed to one patch now.
> I spent more time on the comments throughout the patch. And one
> notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with
> XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a
> cleanup lock to replay the record. It must always be set when
> XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying those
> always needs a cleanup lock. That felt easier to document and understand
> than XLHP_LP_TRUNCATE_ONLY.

Makes sense to me.

> > >  From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas 
> > > Date: Wed, 20 Mar 2024 14:53:31 +0200
> > > Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()
> > > 
> > > Mostly to make local variables more tightly-scoped.
> > 
> > So, I don't think you can move those sub-records into the tighter scope.
> > If you run tests with this applied, you'll see it crashes and a lot of
> > the data in the record is wrong. If you move the sub-record declarations
> > out to a wider scope, the tests pass.
> > 
> > The WAL record data isn't actually copied into the buffer until
> > 
> > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);
> > 
> > after registering everything.
> > So all of those sub-records you made are out of scope by the time it
> > tries to copy from them.
> > 
> > I spent the better part of a day last week trying to figure out what was
> > happening after I did the exact same thing. I must say that I found the
> > xloginsert API incredibly unhelpful on this point.
> 
> Oops. I had that in mind and that was actually why I moved the
> XLogRegisterData() call to the end of the function, because I found it
> confusing to register the struct before filling it in completely, even
> though it works perfectly fine. But then I missed it anyway when I moved the
> local variables. I added a brief comment on that.

Comment was a good idea.

> There is another patch in the commitfest that touches this area: "Recording
> whether Heap2/PRUNE records are from VACUUM or from opportunistic pruning"
> [1]. That actually goes in the opposite direction than this patch. That
> patch wants to add more information, to show whether a record was emitted by
> VACUUM or on-access pruning, while this patch makes the freezing and
> VACUUM's 2nd phase records also look the same. We could easily add more
> flags to xl_heap_prune to distinguish them. Or assign different xl_info code
> for them, like that other patch proposed. But I don't think that needs to
> block this patch, that can be added as a separate patch.
> 
> [1] 
> https://www.postgresql.org/message-id/cah2-wzmsevhox+hjpfmqgcxwwdgnzp0r9f+vbnpok6tgxmp...@mail.gmail.com

I think it would be very helpful to distinguish amongst vacuum pass 1,
2, and on-access pruning. I often want to know what did most of the
pruning -- and I could see also wanting to know if the first or second
vacuum pass was responsible for removing the tuples. I agree it could be
done separately, but it would be very helpful to have as soon as
possible now that the record type will be the same for all three.

> From 042185d3de14dcb7088bbe50e9c64e365ac42c2a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 22 Mar 2024 23:10:22 +0200
> Subject: [PATCH v6] Merge prune, freeze and vacuum WAL record formats
>  
>  /*
> - * Handles XLOG_HEAP2_PRUNE record type.
> - *
> - * Acquires a full cleanup lock.
> + * Replay XLOG_HEAP2_PRUNE_FREEZE record.
>   */
>  static void
> -heap_xlog_prune(XLogReaderState *record)
> +heap_xlog_prune_freeze(XLogReaderState *record)
>  {
>   XLogRecPtr  lsn = record->EndRecPtr;
> - xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
> + char   *ptr;
> + xl_heap_prune *xlrec;
>   Buffer  buffer;
>   RelFileLocator rlocator;
>   BlockNumber blkno;
>   XLogRedoAction action;
>  
>   XLogRecGetBlockTag(record, 0, , NULL, );
> + ptr = XLogRecGetData(record);

I don't love having two different pointers that alias each other and we
don't know which one is for what. Perhaps we could memcpy xlrec like in
my attached diff (log_updates.diff). It also might perform slightly
better than accessing flags through an xl_heap_prune * -- since it
wouldn't be doing pointer dereferencing.

> + xlrec = (xl_heap_prune *) ptr;
> + ptr += SizeOfHeapPrune;
>  
>   /*
> -  * We're about to remove tuples. In Hot Standby mode, ensure that 
> there's
> -  * no queries running for which the removed tuples are still visible.
> +  * We will take 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-24 Thread Melanie Plageman
On Thu, Mar 21, 2024 at 9:28 AM Heikki Linnakangas  wrote:
>
> In heap_page_prune_and_freeze(), we now do some extra work on each live
> tuple, to set the all_visible_except_removable correctly. And also to
> update live_tuples, recently_dead_tuples and hastup. When we're not
> freezing, that's a waste of cycles, the caller doesn't care. I hope it's
> enough that it doesn't matter, but is it?

Last year on an early version of the patch set I did some pgbench
tpcb-like benchmarks -- since there is a lot of on-access pruning in
that workload -- and I don't remember it being a showstopper. The code
has changed a fair bit since then. However, I think it might be safer
to pass a flag "by_vacuum" to heap_page_prune_and_freeze() and skip
the rest of the loop after heap_prune_satisfies_vacuum() when
on-access pruning invokes it. I had avoided that because it felt ugly
and error-prone, however it addresses a few other of your points as
well.

For example, I think we should set a bit in the prune/freeze WAL
record's flags to indicate if pruning was done by vacuum or on access
(mentioned in another of your recent emails).

> The first commit (after the WAL format changes) changes the all-visible
> check to use GlobalVisTestIsRemovableXid. The commit message says that
> it's because we don't have 'cutoffs' available, but we only care about
> that when we're freezing, and when we're freezing, we actually do have
> 'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems
> sensible anyway, because that's what we use in
> heap_prune_satisfies_vacuum() too, but just wanted to point that out.

Yes, this is true. If we skip this part of the loop when on-access
pruning invokes it, we can go back to using the OldestXmin. I have
done that as well as some other changes in the attached patch,
conflict_horizon_updates.diff. Note that this patch may not apply on
your latest patch as I wrote it on top of an older version. Switching
back to using OldestXmin for page visibility determination makes this
patch set more similar to master as well. We could keep the
alternative check (with GlobalVisState) to maintain the illusion that
callers passing by_vacuum as True can pass NULL for pagefrz, but I was
worried about the extra overhead.

It would be nice to pick a single reasonable visibility horizon (the
oldest running xid we compare things to) at the beginning of
heap_page_prune_and_freeze() and use it for determining if we can
remove tuples, if we can freeze tuples, and if the page is all
visible. It makes it confusing that we use OldestXmin for freezing and
setting the page visibility in the VM and GlobalVisState for
determining prunability. I think using GlobalVisState for freezing has
been discussed before and ruled out for a variety of reasons, and I
definitely don't suggest making that change in this patch set.

> The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily
> these patches' fault). This at least is wrong, because Max(a, b)
> doesn't handle XID wraparound correctly:
>
> >   if (do_freeze)
> >   conflict_xid = 
> > Max(prstate.snapshotConflictHorizon,
> >  
> > presult->frz_conflict_horizon);
> >   else
> >   conflict_xid = 
> > prstate.snapshotConflictHorizon;
>
> Then there's this in lazy_scan_prune():
>
> >   /* Using same cutoff when setting VM is now unnecessary */
> >   if (presult.all_frozen)
> >   presult.frz_conflict_horizon = InvalidTransactionId;
> This does the right thing in the end, but if all the tuples are frozen
> shouldn't frz_conflict_horizon already be InvalidTransactionId? The
> comment says it's "newest xmin on the page", and if everything was
> frozen, all xmins are FrozenTransactionId. In other words, that should
> be moved to heap_page_prune_and_freeze() so that it doesn't lie to its
> caller. Also, frz_conflict_horizon is only set correctly if
> 'all_frozen==true', would be good to mention that in the comments too.

Yes, this is a good point. I've spent some time swapping all of this
back into my head. I think we should change the names of all these
conflict horizon variables and introduce some local variables again.
In the attached patch, I've updated the name of the variable in
PruneFreezeResult to vm_conflict_horizon, as it is only used for
emitting a VM update record. Now, I don't set it until the end of
heap_page_prune_and_freeze(). It is only updated from
InvalidTransactionId if the page is not all frozen. As you say, if the
page is all frozen, there can be no conflict.
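
In code, the end of heap_page_prune_and_freeze() now does roughly this 
(a sketch; 'visibility_cutoff_xid' is the local variable mentioned below):

	if (presult->all_frozen)
		presult->vm_conflict_horizon = InvalidTransactionId;
	else
		presult->vm_conflict_horizon = visibility_cutoff_xid;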

I've also changed PruneState->snapshotConflictHorizon to
PruneState->latest_xid_removed.

And I introduced the local variables visibility_cutoff_xid and
frz_conflict_horizon. I think it is important we distinguish between
the latest xid pruned, the latest xmin of tuples frozen, and the

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-22 Thread Heikki Linnakangas

On 20/03/2024 21:17, Melanie Plageman wrote:

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).


New version of these WAL format changes attached. Squashed to one patch now.


+   // TODO: should we avoid this if we only froze? heap_xlog_freeze() 
doesn't
+   // do it


Ah yes, that makes sense, did that.


In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.


Done.


 From 5d6fc2ffbdd839e0b69242af16446a46cf6a2dc7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2024 13:49:59 +0200
Subject: [PATCH v5 04/26] 'nplans' is a pointer

I'm surprised the compiler didn't warn about this


oops. while looking at this, I noticed that the asserts I added that
nredirected > 0, ndead > 0, and nunused > 0 have the same problem.


Good catch, fixed.


- remove xlhp_conflict_horizon and store a TransactionId directly. A
   separate struct would make sense if we needed to store anything else
   there, but for now it just seems like more code.


makes sense. I just want to make sure heapam_xlog.h makes it extra clear
that this is happening. I see your comment addition. If you combine it
with my comment additions in the attached patch for 0001, hopefully that
makes it clear enough.


Thanks. I spent more time on the comments throughout the patch. And one 
notable code change: I replaced the XLHP_LP_TRUNCATE_ONLY flag with 
XLHP_CLEANUP_LOCK. XLHP_CLEANUP_LOCK directly indicates if you need a 
cleanup lock to replay the record. It must always be set when 
XLHP_HAS_REDIRECTIONS or XLHP_HAS_DEAD_ITEMS is set, because replaying 
those always needs a cleanup lock. That felt easier to document and 
understand than XLHP_LP_TRUNCATE_ONLY.
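
In other words, the redo routine can simply assert the rule:

	/* replaying redirects or dead item changes always needs a cleanup lock */
	Assert((xlrec->flags & XLHP_CLEANUP_LOCK) != 0 ||
		   (xlrec->flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);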



 From 8af186ee9dd8c7dc20f37a69b34cab7b95faa43b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2024 14:03:06 +0200
Subject: [PATCH v5 07/26] Add comment to log_heap_prune_and_freeze().

XXX: This should be rewritten, but I tried to at least list some
important points.


Are you thinking that it needs to mention more things or that the things
it mentions need more detail?


I previously just quickly jotted down things that seemed worth 
mentioning in the comment. It was not so bad actually, but I reworded it 
a little.



 From b26e36ba8614d907a6e15810ed4f684f8f628dd2 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2024 14:53:31 +0200
Subject: [PATCH v5 08/26] minor refactoring in log_heap_prune_and_freeze()

Mostly to make local variables more tightly-scoped.


So, I don't think you can move those sub-records into the tighter scope.
If you run tests with this applied, you'll see it crashes and a lot of
the data in the record is wrong. If you move the sub-record declarations
out to a wider scope, the tests pass.

The WAL record data isn't actually copied into the buffer until

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);

after registering everything.
So all of those sub-records you made are out of scope by the time it tries
to copy from them.

I spent the better part of a day last week trying to figure out what was
happening after I did the exact same thing. I must say that I found the
xloginsert API incredibly unhelpful on this point.


Oops. I had that in mind and that was actually why I moved the 
XLogRegisterData() call to the end of the function, because I found it 
confusing to register the struct before filling it in completely, even 
though it works perfectly fine. But then I missed it anyway when I moved 
the local variables. I added a brief comment on that.
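
For anyone hitting the same thing later: the constraint is only that the 
registered data must still be valid when XLogInsert() copies it, so the 
shape is roughly this (a simplified sketch, not the actual 
log_heap_prune_and_freeze() code):

	xl_heap_prune xlrec = {0};	/* must stay in scope until XLogInsert() */
	XLogRecPtr	recptr;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);

	/*
	 * ... fill in xlrec and register buffer data here; nothing is copied
	 * out of the registered pointers until XLogInsert() runs ...
	 */

	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_FREEZE);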



I would like to review the rest of the suggested changes in this patch
after we fix the issue I just mentioned.


Thanks, review is appreciated. I feel this is ready now, so barring any 
big new issues, I plan to commit this early next week.


There is another patch in the commitfest that touches this area: 
"Recording whether Heap2/PRUNE records are from VACUUM or from 
opportunistic pruning" [1]. That actually goes in the opposite direction 
than this patch. That patch wants to add more information, to show 
whether a record was emitted by VACUUM or on-access pruning, while this 
patch makes the freezing and VACUUM's 2nd phase records also look the 
same. We could easily add more flags to xl_heap_prune to distinguish 
them. Or assign different xl_info code for them, like that other patch 
proposed. But I don't think that needs to block this patch, that can be 
added as a separate patch.


[1] 
https://www.postgresql.org/message-id/cah2-wzmsevhox+hjpfmqgcxwwdgnzp0r9f+vbnpok6tgxmp...@mail.gmail.com


--
Heikki Linnakangas
Neon (https://neon.tech)
From 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-21 Thread Heikki Linnakangas

On 20/03/2024 23:03, Melanie Plageman wrote:

On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ee0eca8ae02..b2015f5a1ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
	int			recently_dead_tuples;
	int			ndeleted;		/* Number of tuples deleted from the page */
	int			nnewlpdead;		/* Number of newly LP_DEAD items */
+   int nfrozen;


Let's add a comment after int nfrozen like
/* Number of newly frozen tuples */


+
	bool		all_visible;	/* Whether or not the page is all visible */
	bool		hastup;			/* Does page make rel truncation unsafe */
  
+	/* The following fields are only set if freezing */


So, all_frozen will be set correctly if we are even considering freezing
(if pagefrz is passed). all_frozen will be true even if we didn't freeze
anything if the page is all-frozen and can be set as such in the VM. And
it will be false if the page is not all-frozen. So, maybe we say
"considering freezing".

But, I'm glad you thought to call out which of these fields will make
sense to the caller.

Also, maybe we should just name the members to which this applies. It is
a bit confusing that I can't tell if the comment also refers to the
other members following it (lpdead_items and deadoffsets) -- which it
doesn't.


Right, sorry, I spotted the general issue that it's not clear which 
fields are valid when. I added that comment to remind myself about that, 
but then forgot about it.


In heap_page_prune_and_freeze(), we now do some extra work on each live 
tuple, to set all_visible_except_removable correctly, and also to update 
live_tuples, recently_dead_tuples and hastup. When we're not freezing, 
that's a waste of cycles: the caller doesn't care. I hope it's cheap 
enough that it doesn't matter, but is it?


The first commit (after the WAL format changes) changes the all-visible 
check to use GlobalVisTestIsRemovableXid. The commit message says that 
it's because we don't have 'cutoffs' available, but we only care about 
that when we're freezing, and when we're freezing, we actually do have 
'cutoffs' in HeapPageFreeze. Using GlobalVisTestIsRemovableXid seems 
sensible anyway, because that's what we use in 
heap_prune_satisfies_vacuum() too, but just wanted to point that out.



The 'frz_conflict_horizon' stuff is still fuzzy to me. (Not necessarily 
these patches' fault.) This at least is wrong, because Max(a, b) 
doesn't handle XID wraparound correctly:



		if (do_freeze)
			conflict_xid = Max(prstate.snapshotConflictHorizon,
							   presult->frz_conflict_horizon);
		else
			conflict_xid = prstate.snapshotConflictHorizon;
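
For comparison, a wraparound-safe way to take the newer of the two XIDs 
would be to use TransactionIdFollows() instead of Max(); roughly (my 
sketch, not the patch's code):

	conflict_xid = prstate.snapshotConflictHorizon;
	if (do_freeze &&
		TransactionIdFollows(presult->frz_conflict_horizon, conflict_xid))
		conflict_xid = presult->frz_conflict_horizon;	/* modulo-2^32 aware */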


Then there's this in lazy_scan_prune():


/* Using same cutoff when setting VM is now unnecessary */
if (presult.all_frozen)
presult.frz_conflict_horizon = InvalidTransactionId;
This does the right thing in the end, but if all the tuples are frozen 
shouldn't frz_conflict_horizon already be InvalidTransactionId? The 
comment says it's "newest xmin on the page", and if everything was 
frozen, all xmins are FrozenTransactionId. In other words, that should 
be moved to heap_page_prune_and_freeze() so that it doesn't lie to its 
caller. Also, frz_conflict_horizon is only set correctly if 
'all_frozen==true'; it would be good to mention that in the comments too.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Combine Prune and Freeze records emitted by vacuum

2024-03-20 Thread Melanie Plageman
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
>
> 0009-: The rest of the v4 patches, rebased over the WAL format changes. I
> also added a few small commits for little cleanups that caught my eye, let
> me know if you disagree with those.

This review is just of the patches containing changes you made in
0009-0026.

> From d36138b5bf0a93557273b5e47f8cd5ea089057c7 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 11:47:42 +0200
> Subject: [PATCH v5 13/26] still use a local 'cutoffs' variable
> 
> Given how often 'cutoffs' is used in the function, I think it still
> makes sense to have a local variable for it, just to keep the source
> lines shorter.

Works for me.

> From 913617ed98cfddd678a6f620db7dee68d1d61c89 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 10:51:13 +0200
> Subject: [PATCH v5 21/26] move whole_page_freezable to tighter scope

Works for me.

> From e2b50f9b64f7e4255f4f764e2a348e1b446573dc Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 11:43:31 +0200
> Subject: [PATCH v5 22/26] make 'all_visible_except_removable' local
> 
> The caller doesn't need it, so it doesn't belong in PruneFreezeResult

Makes sense to me.

> From e993e0d98cd0ef1ecbd229f6ddbe23d59a427e3a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 11:40:34 +0200
> Subject: [PATCH v5 26/26] reorder PruneFreezeResult fields
> 
> ---
>  src/include/access/heapam.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ee0eca8ae02..b2015f5a1ac 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -202,14 +202,17 @@ typedef struct PruneFreezeResult
>   int recently_dead_tuples;
>   int ndeleted;   /* Number of tuples 
> deleted from the page */
>   int nnewlpdead; /* Number of newly 
> LP_DEAD items */
> + int nfrozen;

Let's add a comment after int nfrozen like
/* Number of newly frozen tuples */

> +
>   boolall_visible;/* Whether or not the page is all 
> visible */
>   boolhastup; /* Does page make rel 
> truncation unsafe */
>  
> + /* The following fields are only set if freezing */

So, all_frozen will be set correctly if we are even considering freezing
(if pagefrz is passed). all_frozen will be true even if we didn't freeze
anything if the page is all-frozen and can be set as such in the VM. And
it will be false if the page is not all-frozen. So, maybe we say
"considering freezing".

But, I'm glad you thought to call out which of these fields will make
sense to the caller.

Also, maybe we should just name the members to which this applies. It is
a bit confusing that I can't tell if the comment also refers to the
other members following it (lpdead_items and deadoffsets) -- which it
doesn't.

> +
>   /* Whether or not the page can be set all frozen in the VM */
>   boolall_frozen;
>  
>   /* Number of newly frozen tuples */
> - int nfrozen;
>   TransactionId frz_conflict_horizon; /* Newest xmin on the page */
>  
>   /* New value of relfrozenxid found by heap_page_prune_and_freeze() */
> -- 
> 2.39.2
> 

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-20 Thread Peter Geoghegan
On Wed, Mar 20, 2024 at 4:06 PM Melanie Plageman
 wrote:
> > What about the issue of cleanup locks, which aren't needed and aren't
> > taken with the current heapam VACUUM record type? Will you preserve
> > that aspect of the existing design?
>
> Yep, we have a flag to indicate whether or not a cleanup lock is needed.

Thanks for confirming.

I realize that this is fairly obvious, but thought it better to ask.

-- 
Peter Geoghegan




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-20 Thread Melanie Plageman
On Wed, Mar 20, 2024 at 4:04 PM Peter Geoghegan  wrote:
>
> On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas  wrote:
> > > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > > alignment (though we can use all the help we can get there) and more to
> > > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > > xl_heap_prune struct and then all the sub-record types.
> >
> > Ok, now that I look at this, I wonder if we're being overly cautious
> > about the WAL size. We probably could just always include the snapshot
> > field, and set it to InvalidTransactionId and waste 4 bytes when it's
> > not needed. For the sake of simplicity. I don't feel strongly either way
> > though, the flag is pretty simple too.
>
> What about the issue of cleanup locks, which aren't needed and aren't
> taken with the current heapam VACUUM record type? Will you preserve
> that aspect of the existing design?

Yep, we have a flag to indicate whether or not a cleanup lock is needed.

- Melanie




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-20 Thread Peter Geoghegan
On Wed, Mar 20, 2024 at 9:15 AM Heikki Linnakangas  wrote:
> > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > alignment (though we can use all the help we can get there) and more to
> > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > xl_heap_prune struct and then all the sub-record types.
>
> Ok, now that I look at this, I wonder if we're being overly cautious
> about the WAL size. We probably could just always include the snapshot
> field, and set it to InvalidTransactionId and waste 4 bytes when it's
> not needed. For the sake of simplicity. I don't feel strongly either way
> though, the flag is pretty simple too.

What about the issue of cleanup locks, which aren't needed and aren't
taken with the current heapam VACUUM record type? Will you preserve
that aspect of the existing design?

-- 
Peter Geoghegan




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-20 Thread Melanie Plageman
On Wed, Mar 20, 2024 at 03:15:49PM +0200, Heikki Linnakangas wrote:
> On 20/03/2024 03:36, Melanie Plageman wrote:
> > On Mon, Mar 18, 2024 at 01:15:21AM +0200, Heikki Linnakangas wrote:
> > > On 15/03/2024 02:56, Melanie Plageman wrote:
> > > > Okay, so I was going to start using xl_heap_prune for vacuum here too,
> > > > but I realized it would be bigger because of the
> > > > snapshotConflictHorizon. Do you think there is a non-terrible way to
> > > > make the snapshotConflictHorizon optional? Like with a flag?
> > > 
> > > Yeah, another flag would do the trick.
> > 
> > Okay, I've done this in attached v4 (including removing
> > XLOG_HEAP2_VACUUM). I had to put the snapshot conflict horizon in the
> > "main chunk" of data available at replay regardless of whether or not
> > the record ended up including an FPI.
> > 
> > I made it its own sub-record (xlhp_conflict_horizon) less to help with
> > alignment (though we can use all the help we can get there) and more to
> > keep it from getting lost. When you look at heapam_xlog.h, you can see
> > what a XLOG_HEAP2_PRUNE record will contain starting with the
> > xl_heap_prune struct and then all the sub-record types.
> 
> Ok, now that I look at this, I wonder if we're being overly cautious about
> the WAL size. We probably could just always include the snapshot field, and
> set it to InvalidTransactionId and waste 4 bytes when it's not needed. For
> the sake of simplicity. I don't feel strongly either way though, the flag is
> pretty simple too.

That will mean that all vacuum records are at least 3 bytes bigger than
before -- which makes it somewhat less defensible to get rid of
xl_heap_vacuum.

That being said, I ended up doing an unaligned access when I
packed it and made it optional, so maybe it is less user-friendly.

But I also think that making it optional is more clear for vacuum which
will never use it.

> I realized that the WAL record format changes are pretty independent from
> the rest of the patches. They could be applied before the rest. Without the
> rest of the changes, we'll still write two WAL records per page in vacuum,
> one to prune and another one to freeze, but it's another meaningful
> incremental step. So I reshuffled the patches, so that the WAL format is
> changed first, before the rest of the changes.

Ah, great idea! That eliminates the issue of preliminary commits having
larger WAL records that then get streamlined.

> 0001-0008: These are the WAL format changes. There's some comment cleanup
> needed, but as far as the code goes, I think these are pretty much ready to
> be squashed & committed.

My review in this email is *only* for 0001-0008. I have not looked at
the rest yet.

> From 06d5ff5349a8aa95cbfd06a8043fe503b7b1bf7b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 14:50:14 +0200
> Subject: [PATCH v5 01/26] Merge prune, freeze and vacuum WAL record formats
> 
> The new combined WAL record is now used for pruning, freezing and 2nd
> pass of vacuum. This is in preparation of changing vacuuming to write
> a combined prune+freeze record per page, instead of separate two
> records. The new WAL record format now supports that, but the code
> still always writes separate records for pruning and freezing.

Attached patch changes-for-0001.patch has a bunch of updated comments --
especially for heapam_xlog.h (including my promised diagram). And a few
suggestions (mostly changes that I should have made before).

> 
> XXX I tried to lift-and-shift the code from v4 patch set as unchanged
> as possible, for easier review, but some noteworthy changes:

In the final commit message, I think it is worth calling out that all of
those freeze logging functions heap_log_freeze_eq/cmp/etc are lifted and
shifted from one file to another. When I am reviewing a big diff, it is
always helpful to know where I need to review line-by-line.

> 
> - Instead of passing PruneState and PageFreezeResult to
>   log_heap_prune_and_freeze(), pass the arrays of frozen, redirected
>   et al offsets directly. That way, it can be called from other places.

good idea.

> - moved heap_xlog_deserialize_prune_and_freeze() from xactdesc.c to
>   heapdesc.c. (Because that's clearly where it belongs)

:)

> From cd6cdaebb362b014733e99ecd868896caf0fb3aa Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 13:45:01 +0200
> Subject: [PATCH v5 02/26] Keep the original numbers for existing WAL records
> 
> Doesn't matter much because the WAL format is not compatible across
> major versions anyway. But still seems nice to keep the identifiers
> unchanged when we can. (There's some precedence for this if you search
> the git history for "is free, was").

sounds good.

> From d3207bb557aa1d2868a50d357a06318a6c0cb5cd Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 20 Mar 2024 13:48:29 +0200
> Subject: [PATCH v5 03/26] Rename record to XLOG_HEAP2_PRUNE_FREEZE
> 
> To clarify that it also freezes now, and 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-17 Thread Heikki Linnakangas

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?


Yeah, another flag would do the trick.


I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.


I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.


See src/tools/pgindent/README, section "Cleaning up in case of failure 
or ugly output":


/*--
 * Text here will not be touched by pgindent.
 */


There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before, the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.


I have added a helper function to do the deserialization,
heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
heap2_desc() because of the way the pg_waldump build file works. Do you
think the helper belongs in any of waldump's existing sources?

pg_waldump_sources = files(
  'compat.c',
  'pg_waldump.c',
  'rmgrdesc.c',
)

pg_waldump_sources += rmgr_desc_sources
pg_waldump_sources += xlogreader_sources
pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am supposed to avoid adding some big new include to
waldump.


Didn't look closely at that, but there's some precedent with 
commit/prepare/abort records. See ParseCommitRecord, xl_xact_commit, 
xl_parsed_commit et al.


Note that we don't provide WAL compatibility across major versions. You 
can fully remove the old xl_heap_freeze_page format. (We should bump 
XLOG_PAGE_MAGIC when this is committed though)



On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that; however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?


I found another approach and just do the pre-freeze checks if we are
considering freezing for any reason other than the FPI criterion.


Hmm, I think you can make this simpler if you use 
XLogCheckBufferNeedsBackup() to check if the hint-bit update will 
generate a full-page-image before entering the critical section. Like 
you did to check if pruning will generate a full-page-image. I included 
that change in the attached patches.
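
In code form, the check is roughly (a sketch of the idea, not the exact 
change in the attached patches; variable names are illustrative):

	/*
	 * Before entering the critical section, ask whether WAL-logging a
	 * hint-bit change on this buffer would have to emit a full-page image.
	 * Hint bits are only WAL-logged when checksums or wal_log_hints are
	 * enabled, hence the XLogHintBitIsNeeded() guard.
	 */
	hint_bit_fpi = XLogHintBitIsNeeded() &&
		XLogCheckBufferNeedsBackup(buffer);

The result can then feed the same "an FPI is being emitted anyway, so 
freezing is cheap" heuristic that the pruning-FPI check already feeds, 
without doing any of that work inside the critical section.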


I don't fully understand this:


/*
 * If we will freeze tuples on the page or they were all already frozen
	 * on the page, if we will set the page all-frozen in the visibility map,
 * we can advance relfrozenxid and relminmxid to the values in
 * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
 */
if (presult->all_frozen || presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}


Firstly, the comment is outdated, because we have already done the 
freezing at this point. But more importantly, I don't understand the 
logic here. Could we check just presult->nfrozen > 0 here and get the 
same result?



I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.


Ok.


Here are some patches on top of your patches for some further 
refactorings. Some notable changes in heap_page_prune_and_freeze():


- I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 
1st one. It seems more clear and efficient that way.


- I extracted the code to generate the WAL record to a separate function.

- Refactored the "will setting hint bit generate FPI" check as discussed 
above


These 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-11 Thread Heikki Linnakangas

On 09/03/2024 22:41, Melanie Plageman wrote:

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas  wrote:

Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?


Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

  * It is crucial that this only gets called for xids from a source that
  * protects against xid wraparounds (e.g. from a table and thus protected by
  * relfrozenxid).

and then in

  * Convert 32 bit argument to FullTransactionId. We can do so safely
  * because we know the xid has to, at the very least, be between
  * [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(


I just tested it: no, GlobalVisTestIsRemovableXid does not work for 
FrozenTransactionId. With state->definitely_needed == 
{0, 40} and xid == FrozenTransactionId, it incorrectly 
returned 'false'. It treats FrozenTransactionId as if it were a regular xid '2'.
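
One way to keep using GlobalVisTestIsRemovableXid() safely in that 
all-visible check is to special-case a frozen xmin before consulting the 
GlobalVisState; a minimal sketch (mine, not necessarily what the patch 
does):

	xmin = HeapTupleHeaderGetXmin(htup);

	/*
	 * A frozen xmin is committed and visible to everyone by definition, so
	 * only consult the GlobalVisState for normal XIDs;
	 * GlobalVisTestIsRemovableXid() would treat FrozenTransactionId as the
	 * ordinary xid 2 and give a bogus answer.
	 */
	if (xmin != FrozenTransactionId &&
		!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
	{
		all_visible = false;
		break;
	}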



The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
the case that there's no pruning, just freezing. The record format
(xl_heap_prune) looks pretty complex as it is, I think it could be made
both more compact and more clear with some refactoring.


I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets.


Sounds good.


Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).


I don't know, it might turn out not that complex. If you define the 
formats of each of those "sub-record types" as explicit structs, per 
attached sketch, you won't need so many macros. Some care is still 
needed with alignment though.


--
Heikki Linnakangas
Neon (https://neon.tech)/*
 * XXX: This one combined record type replaces existing
 * XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM, and XLOG_HEAP2_FREEZE_PAGE record types
 */

/*
 * This is what we need to know about page pruning and freezing, both during
 * VACUUM and during opportunistic pruning.
 *
 * If XLHP_HAS_REDIRECTIONS, XLHP_HAS_DEAD_ITEMS or XLHP_HAS_FREEZE_PLANS is
 * set, acquires a full cleanup lock. Otherwise an ordinary exclusive lock is
 * enough (VACUUM's second heap pass generates such records).
 *
 * The data for block reference 0 contains "sub-records" depending on which
 * of the XLHP_HAS_* flags are set. See xlhp_* struct definitions below.
 */
typedef struct xl_heap_prune
{
TransactionId snapshotConflictHorizon;
uint8   flags;
} xl_heap_prune;

#define XLHP_IS_CATALOG_REL			0x01	/* to handle recovery conflict
											 * during logical decoding on
											 * standby */
#define XLHP_HAS_REDIRECTIONS   0x02
#define XLHP_HAS_DEAD_ITEMS 0x04
#define XLHP_HAS_NOW_UNUSED_ITEMS   0x08
#define XLHP_HAS_FREEZE_PLANS   0x10

#define SizeOfHeapPrune (offsetof(xl_heap_prune, flags) + sizeof(uint8))

/*
 * Sub-record contained in block reference 0 of a prune record if
 * XLHP_HAS_REDIRECTIONS is set
 */
typedef struct xl_heap_prune_redirections
{
uint16  nitems;
struct
{
OffsetNumber from;
OffsetNumber to;
}   items[FLEXIBLE_ARRAY_MEMBER];
} xl_heap_prune_redirections;

/*
 * Sub-record contained in block reference 0 of a prune record if
 * XLHP_HAS_DEAD_ITEMS is set
 */
typedef struct xlhp_dead_items
{
uint16  nitems;
OffsetNumber items[FLEXIBLE_ARRAY_MEMBER];
} xlhp_dead_items;

/*
 * 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-09 Thread Melanie Plageman
Thanks so much for the review!

On Wed, Mar 6, 2024 at 7:59 AM Heikki Linnakangas  wrote:
>
> On 25/01/2024 00:49, Melanie Plageman wrote:
>
> > The attached patch set is broken up into many separate commits for
> > ease of review. Each patch does a single thing which can be explained
> > plainly in the commit message. Every commit passes tests and works on
> > its own.
>
> About this very first change:
>
> > --- a/src/backend/access/heap/vacuumlazy.c
> > +++ b/src/backend/access/heap/vacuumlazy.c
> > @@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
> >* that everyone sees it as committed?
> >*/
> >   xmin = HeapTupleHeaderGetXmin(htup);
> > - if (!TransactionIdPrecedes(xmin,
> > -   
> >  vacrel->cutoffs.OldestXmin))
> > + if 
> > (!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
> >   {
> >   all_visible = false;
> >   break;
>
> Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

Okay, so I thought a lot about this, and I don't understand how
GlobalVisTestIsRemovableXid() would not handle FrozenTransactionId
correctly.

vacrel->cutoffs.OldestXmin is computed initially from
GetOldestNonRemovableTransactionId() which uses ComputeXidHorizons().
GlobalVisState is updated by ComputeXidHorizons() (when it is
updated).

I do see that the comment above GlobalVisTestIsRemovableXid() says

 * It is crucial that this only gets called for xids from a source that
 * protects against xid wraparounds (e.g. from a table and thus protected by
 * relfrozenxid).

and then in

 * Convert 32 bit argument to FullTransactionId. We can do so safely
 * because we know the xid has to, at the very least, be between
 * [oldestXid, nextXid), i.e. within 2 billion of xid.

I'm not sure what oldestXid is here.
It's true that I don't see GlobalVisTestIsRemovableXid() being called
anywhere else with an xmin as an input. I think that hints that it is
not safe with FrozenTransactionId. But I don't see what could go
wrong.

Maybe it has something to do with converting it to a FullTransactionId?

   FullTransactionIdFromU64(U64FromFullTransactionId(rel)  + (int32)
(xid - rel_xid));

Sorry, I couldn't quite figure it out :(

> I read through all the patches in order, and aside from the above they
> all look correct to me. Some comments on the set as whole:
>
> I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type
> anymore. By removing that, you also get rid of the freeze-only codepath
> near the end of heap_page_prune(), and the
> heap_freeze_execute_prepared() function.

That makes sense to me.

> The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than
> XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for
> the case that there's no pruning, just freezing. The record format
> (xl_heap_prune) looks pretty complex as it is, I think it could be made
> both more compact and more clear with some refactoring.

I'm happy to change up xl_heap_prune format. In its current form,
according to pahole, it has no holes and just 3 bytes of padding at
the end.

One way we could make it smaller is by moving the isCatalogRel member
to directly after snapshotConflictHorizon and then adding a flags
field and defining flags to indicate whether or not other members
exist at all. We could set bits for HAS_FREEZE_PLANS, HAS_REDIRECTED,
HAS_UNUSED, HAS_DEAD. Then I would remove the non-optional uint16
nredirected, nunused, nplans, ndead and just put the number of
redirected/unused/etc at the beginning of the arrays containing the
offsets. Then I could write various macros for accessing them. That
would make it smaller, but it definitely wouldn't make it less complex
(IMO).

> FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use
> pagefrz->cutoffs now.

Yep, I forgot to add a commit to do this. Thanks!

> HeapPageFreeze has two "trackers", for the "freeze" and "no freeze"
> cases. heap_page_prune() needs to track both, until it decides whether
> to freeze or not. But it doesn't make much sense that the caller
> (lazy_scan_prune()) has to initialize both, and has to choose which of
> the values to use after the call depending on whether heap_page_prune()
> froze or not. The two trackers should be just heap_page_prune()'s
> internal business.
>
> HeapPageFreeze is a bit confusing in general, as it's both an input and
> an output to heap_page_prune(). Not sure what exactly to do there, but I
> feel that we should make heap_page_prune()'s interface more clear.
> Perhaps move the output fields to PruneResult.

Great point. Perhaps I just add NewRelfrozenXid and NewRelminMxid 

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-06 Thread Heikki Linnakangas

On 25/01/2024 00:49, Melanie Plageman wrote:

Generates 30% fewer WAL records and 12% fewer WAL bytes -- which,
depending on what else is happening on the system, can lead to vacuum
spending substantially less time on WAL writing and syncing (often 15%
less time on WAL writes and 10% less time on syncing WAL in my
testing).


Nice!


The attached patch set is broken up into many separate commits for
ease of review. Each patch does a single thing which can be explained
plainly in the commit message. Every commit passes tests and works on
its own.


About this very first change:


--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1526,8 +1526,7 @@ lazy_scan_prune(LVRelState *vacrel,
 * that everyone sees it as committed?
 */
xmin = HeapTupleHeaderGetXmin(htup);
-   if (!TransactionIdPrecedes(xmin,
-  
vacrel->cutoffs.OldestXmin))
+   if 
(!GlobalVisTestIsRemovableXid(vacrel->vistest, xmin))
{
all_visible = false;
break;


Does GlobalVisTestIsRemovableXid() handle FrozenTransactionId correctly?

I read through all the patches in order, and aside from the above they 
all look correct to me. Some comments on the set as whole:


I don't think we need XLOG_HEAP2_FREEZE_PAGE as a separate record type 
anymore. By removing that, you also get rid of the freeze-only codepath 
near the end of heap_page_prune(), and the 
heap_freeze_execute_prepared() function.


The XLOG_HEAP2_FREEZE_PAGE record is a little smaller than 
XLOG_HEAP2_PRUNE. But we could optimize the XLOG_HEAP2_PRUNE format for 
the case that there's no pruning, just freezing. The record format 
(xl_heap_prune) looks pretty complex as it is, I think it could be made 
both more compact and more clear with some refactoring.


FreezeMultiXactId still takes a separate 'cutoffs' arg, but it could use 
pagefrz->cutoffs now.


HeapPageFreeze has two "trackers", for the "freeze" and "no freeze" 
cases. heap_page_prune() needs to track both, until it decides whether 
to freeze or not. But it doesn't make much sense that the caller 
(lazy_scan_prune()) has to initialize both, and has to choose which of 
the values to use after the call depending on whether heap_page_prune() 
froze or not. The two trackers should be just heap_page_prune()'s 
internal business.


HeapPageFreeze is a bit confusing in general, as it's both an input and 
an output to heap_page_prune(). Not sure what exactly to do there, but I 
feel that we should make heap_page_prune()'s interface more clear. 
Perhaps move the output fields to PruneResult.
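
Concretely, something along these lines (a rough sketch of the idea only; 
the new_* field names are the ones the later patch versions end up using):

typedef struct PruneResult
{
	/* ... existing counters and flags ... */

	/*
	 * Outputs of the freeze decision: heap_page_prune() picks between its
	 * internal "freeze" and "no freeze" trackers and reports only the
	 * final values to the caller.
	 */
	TransactionId new_relfrozenxid;
	MultiXactId new_relminmxid;
} PruneResult;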


Let's rename heap_page_prune() to heap_page_prune_and_freeze(), as 
freezing is now an integral part of the function. And mention it in the 
function comment, too.


In heap_prune_chain:


 * Tuple visibility information is provided in presult->htsv.


Not this patch's fault directly, but it's not immediately clear what "is 
provided" means here. Does the caller provide it, or does the function 
set it, ie. is it an input or output argument? Looking at the code, it's 
an input, but now it looks a bit weird that an input argument is called 
'presult'.


--
Heikki Linnakangas
Neon (https://neon.tech)