Re: Flush pgstats file during checkpoints

Michael Paquier Mon, 22 Jul 2024 20:52:42 -0700

On Mon, Jul 22, 2024 at 07:01:41AM +0000, Bertrand Drouvot wrote:
> 1 ===
> Not related with your patch but this comment in the GetRedoRecPtr() function:
> 
>      * grabbed a WAL insertion lock to read the authoritative value in
>      * Insert->RedoRecPtr
> 
> sounds weird. Should'nt that be s/Insert/XLogCtl/?


No, the comment is right.  We are retrieving a copy of
Insert->RedoRecPtr here.

> 2 ===
> 
> +       /* Write the redo LSN, used to cross check the file loaded */
> 
> Nit: s/loaded/read/?

WFM.

> 3 ===
> 
> +       /*
> +        * Read the redo LSN stored in the file.
> +        */
> +       if (!read_chunk_s(fpin, &file_redo) ||
> +               file_redo != redo)
> +               goto error;
> 
> I wonder if it would make sense to have dedicated error messages for
> "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
> ease to diagnose as to why the stat file is discarded.

Yep.  This has been itching me quite a bit, and that's a bit more than
just the format ID or the redo LSN: it relates to all the read_chunk()
callers.  I've taken a shot at this with patch 0001, implemented on
top of the rest.  Adjusted as well the redo LSN read to have more
error context, now in 0002.

> Looking at 0003:
> 
> 4 ===
> 
> @@ -5638,10 +5634,7 @@ StartupXLOG(void)
>          * TODO: With a bit of extra work we could just start with a pgstat 
> file
>          * associated with the checkpoint redo location we're starting from.
>          */
> -       if (didCrash)
> -               pgstat_discard_stats();
> -       else
> -               pgstat_restore_stats(checkPoint.redo);
> +       pgstat_restore_stats(checkPoint.redo)
> 
> remove the TODO comment?

Pretty sure I've removed that more than one time already, and that
this is a rebase accident.  Thanks for noticing.

> 5 ===
> 
> + * process) if the stats file has a redo LSN that matches with the .
> 
> unfinished sentence?

This is missing a reference to the control file.

> 6 ===
> 
> - * Should only be called by the startup process or in single user mode.
> + * This is called by the checkpointer or in single-user mode.
>   */
>  void
> -pgstat_discard_stats(void)
> +pgstat_flush_stats(XLogRecPtr redo)
>  {
> 
> Would that make sense to add an Assert in pgstat_flush_stats()? (checking 
> what 
> the above comment states).

There is one in pgstat_write_statsfile(), not sure there is a point in
duplicating the assertion in both.

Attaching a new v4 series, with all these comments addressed.
--
Michael

From efcbb76efe406d59c2ba8b4a09e04c01158ba575 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 22 Jul 2024 16:13:56 -0400
Subject: [PATCH v4 1/4] Revert "Test that vacuum removes tuples older than
 OldestXmin"

This reverts commit aa607980aee08416211f003ab41aa750f5559712.

This test proved to be unstable on the buildfarm, timing out before the
standby could catch up on 32-bit machines where more rows were required
and failing to reliably trigger multiple index vacuum rounds on 64-bit
machines where fewer rows should be required.

Because the instability is only known to be present on versions of
Postgres with TIDStore used for dead TID storage by vacuum, this is only
being reverted on master and REL_17_STABLE.

As having this coverage may be valuable, there is a discussion on the
thread of possible ways to stabilize the test. If that happens, a fixed
test can be committed again.

Backpatch-through: 17
Reported-by: Tom Lane

Discussion: https://postgr.es/m/614152.1721580711%40sss.pgh.pa.us
---
 src/test/recovery/meson.build                 |   1 -
 .../recovery/t/043_vacuum_horizon_floor.pl    | 268 ------------------
 2 files changed, 269 deletions(-)
 delete mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 1d55d6bf56..b1eb77b1ec 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,7 +51,6 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
-      't/043_vacuum_horizon_floor.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl
deleted file mode 100644
index 7ccf5ca850..0000000000
--- a/src/test/recovery/t/043_vacuum_horizon_floor.pl
+++ /dev/null
@@ -1,268 +0,0 @@
-use strict;
-use warnings;
-use PostgreSQL::Test::Cluster;
-use Test::More;
-
-# Test that vacuum prunes away all dead tuples killed before OldestXmin
-#
-# This test creates a table on a primary, updates the table to generate dead
-# tuples for vacuum, and then, during the vacuum, uses the replica to force
-# GlobalVisState->maybe_needed on the primary to move backwards and precede
-# the value of OldestXmin set at the beginning of vacuuming the table.
-
-# Set up nodes
-my $node_primary = PostgreSQL::Test::Cluster->new('primary');
-$node_primary->init(allows_streaming => 'physical');
-
-$node_primary->append_conf(
-	'postgresql.conf', qq[
-hot_standby_feedback = on
-log_recovery_conflict_waits = true
-autovacuum = off
-log_min_messages = INFO
-maintenance_work_mem = 1024
-]);
-$node_primary->start;
-
-my $node_replica = PostgreSQL::Test::Cluster->new('standby');
-
-$node_primary->backup('my_backup');
-$node_replica->init_from_backup($node_primary, 'my_backup',
-	has_streaming => 1);
-
-$node_replica->start;
-
-my $test_db = "test_db";
-$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
-
-# Save the original connection info for later use
-my $orig_conninfo = $node_primary->connstr();
-
-my $table1 = "vac_horizon_floor_table";
-
-# Long-running Primary Session A
-my $psql_primaryA =
-  $node_primary->background_psql($test_db, on_error_stop => 1);
-
-# Long-running Primary Session B
-my $psql_primaryB  =
-  $node_primary->background_psql($test_db, on_error_stop => 1);
-
-# The TIDStore vacuum uses to store dead items is optimized for its target
-# system. On a 32-bit system, our example requires twice as many pages with
-# the same number of dead items per page to fill the TIDStore and trigger a
-# second round of index vacuuming.
-my $is_64bit = $node_primary->safe_psql($test_db,
-	qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
-
-my $nrows = $is_64bit eq 't' ? 400000 : 800000;
-
-# Because vacuum's first pass, pruning, is where we use the GlobalVisState to
-# check tuple visibility, GlobalVisState->maybe_needed must move backwards
-# during pruning before checking the visibility for a tuple which would have
-# been considered HEAPTUPLE_DEAD prior to maybe_needed moving backwards but
-# HEAPTUPLE_RECENTLY_DEAD compared to the new, older value of maybe_needed.
-#
-# We must not only force the horizon on the primary to move backwards but also
-# force the vacuuming backend's GlobalVisState to be updated. GlobalVisState
-# is forced to update during index vacuuming.
-#
-# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
-# end of a round of index vacuuming, updating the backend's GlobalVisState
-# and, in our case, moving maybe_needed backwards.
-#
-# Then vacuum's first (pruning) pass will continue and pruning will find our
-# later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
-# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
-#
-# Thus, we must force at least two rounds of index vacuuming to ensure that
-# some tuple visibility checks will happen after a round of index vacuuming.
-# To accomplish this, we set maintenance_work_mem to its minimum value and
-# insert and update enough rows that we force at least one round of index
-# vacuuming before getting to a dead tuple which was killed after the standby
-# is disconnected.
-$node_primary->safe_psql($test_db, qq[
-	CREATE TABLE ${table1}(col1 int)
-		WITH (autovacuum_enabled=false, fillfactor=10);
-	INSERT INTO $table1 VALUES(7);
-	INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
-	CREATE INDEX on ${table1}(col1);
-	UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
-	INSERT INTO $table1 VALUES(7);
-]);
-
-# We will later move the primary forward while the standby is disconnected.
-# For now, however, there is no reason not to wait for the standby to catch
-# up.
-my $primary_lsn = $node_primary->lsn('flush');
-$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
-
-# Test that the WAL receiver is up and running.
-$node_replica->poll_query_until($test_db, qq[
-	select exists (select * from pg_stat_wal_receiver);] , 't');
-
-# Set primary_conninfo to something invalid on the replica and reload the
-# config. Once the config is reloaded, the startup process will force the WAL
-# receiver to restart and it will be unable to reconnect because of the
-# invalid connection information.
-$node_replica->safe_psql($test_db, qq[
-		ALTER SYSTEM SET primary_conninfo = '';
-		SELECT pg_reload_conf();
-	]);
-
-# Wait until the WAL receiver has shut down and been unable to start up again.
-$node_replica->poll_query_until($test_db, qq[
-	select exists (select * from pg_stat_wal_receiver);] , 'f');
-
-# Now insert and update a tuple which will be visible to the vacuum on the
-# primary but which will have xmax newer than the oldest xmin on the standby
-# that was recently disconnected.
-my $res = $psql_primaryA->query_safe(
-	qq[
-		INSERT INTO $table1 VALUES (99);
-		UPDATE $table1 SET col1 = 100 WHERE col1 = 99;
-		SELECT 'after_update';
-        ]
-	);
-
-# Make sure the UPDATE finished
-like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
-
-# Open a cursor on the primary whose pin will keep VACUUM from getting a
-# cleanup lock on the first page of the relation. We want VACUUM to be able to
-# start, calculate initial values for OldestXmin and GlobalVisState and then
-# be unable to proceed with pruning our dead tuples. This will allow us to
-# reconnect the standby and push the horizon back before we start actual
-# pruning and vacuuming.
-my $primary_cursor1 = "vac_horizon_floor_cursor1";
-
-# The first value inserted into the table was a 7, so FETCH FORWARD should
-# return a 7. That's how we know the cursor has a pin.
-$res = $psql_primaryB->query_safe(
-	qq[
-	BEGIN;
-	DECLARE $primary_cursor1 CURSOR FOR SELECT * FROM $table1 WHERE col1 = 7;
-	FETCH $primary_cursor1;
-	]
-	);
-
-is($res, 7, qq[Cursor query returned $res. Expected value 7.]);
-
-# Get the PID of the session which will run the VACUUM FREEZE so that we can
-# use it to filter pg_stat_activity later.
-my $vacuum_pid = $psql_primaryA->query_safe("SELECT pg_backend_pid();");
-
-# Now start a VACUUM FREEZE on the primary. It will call vacuum_get_cutoffs()
-# and establish values of OldestXmin and GlobalVisState which are newer than
-# all of our dead tuples. Then it will be unable to get a cleanup lock to
-# start pruning, so it will hang.
-#
-# We use VACUUM FREEZE because it will wait for a cleanup lock instead of
-# skipping the page pinned by the cursor. Note that works because the target
-# tuple's xmax precedes OldestXmin which ensures that lazy_scan_noprune() will
-# return false and we will wait for the cleanup lock.
-$psql_primaryA->{stdin} .= qq[
-		VACUUM (VERBOSE, FREEZE) $table1;
-		\\echo VACUUM
-        ];
-
-# Make sure the VACUUM command makes it to the server.
-$psql_primaryA->{run}->pump_nb();
-
-# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is
-# just waiting on the lock to start vacuuming. We don't want the standby to
-# re-establish a connection to the primary and push the horizon back until
-# we've saved initial values in GlobalVisState and calculated OldestXmin.
-$node_primary->poll_query_until($test_db,
-	qq[
-	SELECT count(*) >= 1 FROM pg_stat_activity
-		WHERE pid = $vacuum_pid
-		AND wait_event = 'BufferPin';
-	],
-	't');
-
-# Ensure the WAL receiver is still not active on the replica.
-$node_replica->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);] , 'f');
-
-# Allow the WAL receiver connection to re-establish.
-$node_replica->safe_psql(
-	$test_db, qq[
-		ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
-		SELECT pg_reload_conf();
-	]);
-
-# Ensure the new WAL receiver has connected.
-$node_replica->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);] , 't');
-
-# Once the WAL sender is shown on the primary, the replica should have
-# connected with the primary and pushed the horizon backward. Primary Session
-# A won't see that until the VACUUM FREEZE proceeds and does its first round
-# of index vacuuming.
-$node_primary->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROm pg_stat_replication);] , 't');
-
-# Move the cursor forward to the next 7. We inserted the 7 much later, so
-# advancing the cursor should allow vacuum to proceed vacuuming most pages of
-# the relation. Because we set maintanence_work_mem sufficiently low, we
-# expect that a round of index vacuuming has happened and that the vacuum is
-# now waiting for the cursor to release its pin on the last page of the
-# relation.
-$res = $psql_primaryB->query_safe("FETCH $primary_cursor1");
-is($res, 7,
-	qq[Cursor query returned $res from second fetch. Expected value 7.]);
-
-# Prevent the test from incorrectly passing by confirming that we did indeed
-# do a pass of index vacuuming.
-$node_primary->poll_query_until($test_db, qq[
-	SELECT index_vacuum_count > 0
-	FROM pg_stat_progress_vacuum
-	WHERE datname='$test_db' AND relid::regclass = '$table1'::regclass;
-	] , 't');
-
-# Commit the transaction with the open cursor so that the VACUUM can finish.
-$psql_primaryB->query_until(
-		qr/^commit$/m,
-		qq[
-			COMMIT;
-			\\echo commit
-        ]
-	);
-
-# VACUUM proceeds with pruning and does a visibility check on each tuple. In
-# older versions of Postgres, pruning found our final dead tuple
-# non-removable (HEAPTUPLE_RECENTLY_DEAD) since its xmax is after the new
-# value of maybe_needed. Then heap_prepare_freeze_tuple() would decide the
-# tuple xmax should be frozen because it precedes OldestXmin. Vacuum would
-# then error out in heap_pre_freeze_checks() with "cannot freeze committed
-# xmax". This was fixed by changing pruning to find all
-# HEAPTUPLE_RECENTLY_DEAD tuples with xmaxes preceding OldestXmin
-# HEAPTUPLE_DEAD and removing them.
-
-# With the fix, VACUUM should finish successfully, incrementing the table
-# vacuum_count.
-$node_primary->poll_query_until($test_db,
-	qq[
-	SELECT vacuum_count > 0
-	FROM pg_stat_all_tables WHERE relname = '${table1}';
-	]
-	, 't');
-
-$primary_lsn = $node_primary->lsn('flush');
-
-# Make sure something causes us to flush
-$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
-
-# Nothing on the replica should cause a recovery conflict, so this should
-# finish successfully.
-$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
-
-## Shut down psqls
-$psql_primaryA->quit;
-$psql_primaryB->quit;
-
-$node_replica->stop();
-$node_primary->stop();
-
-done_testing();
-- 
2.45.2

From cb1f44150c0f047facd0ad7fa9a782fa5101d769 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Tue, 23 Jul 2024 12:31:08 +0900
Subject: [PATCH v4 2/4] Add more debugging information when reading stats
 files

This is useful to know which part of a stats file is corrupted when
loading it.
---
 src/backend/utils/activity/pgstat.c | 65 +++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 2e22bf2707..4a1724d342 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1537,9 +1537,18 @@ pgstat_read_statsfile(void)
 	/*
 	 * Verify it's of the expected format.
 	 */
-	if (!read_chunk_s(fpin, &format_id) ||
-		format_id != PGSTAT_FILE_FORMAT_ID)
+	if (!read_chunk_s(fpin, &format_id))
+	{
+		elog(WARNING, "could not read format ID");
 		goto error;
+	}
+
+	if (format_id != PGSTAT_FILE_FORMAT_ID)
+	{
+		elog(WARNING, "found incorrect format ID %d (expected %d)",
+			 format_id, PGSTAT_FILE_FORMAT_ID);
+		goto error;
+	}
 
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
@@ -1559,22 +1568,37 @@ pgstat_read_statsfile(void)
 
 					/* entry for fixed-numbered stats */
 					if (!read_chunk_s(fpin, &kind))
+					{
+						elog(WARNING, "could not read stats kind for entry of type %c", t);
 						goto error;
+					}
 
 					if (!pgstat_is_kind_valid(kind))
+					{
+						elog(WARNING, "invalid stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					info = pgstat_get_kind_info(kind);
 
 					if (!info->fixed_amount)
+					{
+						elog(WARNING, "invalid fixed_amount in stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					/* Load back stats into shared memory */
 					ptr = ((char *) shmem) + info->shared_ctl_off +
 						info->shared_data_off;
 
 					if (!read_chunk(fpin, ptr, info->shared_data_len))
+					{
+						elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					break;
 				}
@@ -1591,10 +1615,17 @@ pgstat_read_statsfile(void)
 					{
 						/* normal stats entry, identified by PgStat_HashKey */
 						if (!read_chunk_s(fpin, &key))
+						{
+							elog(WARNING, "could not read key for entry of type %c", t);
 							goto error;
+						}
 
 						if (!pgstat_is_kind_valid(key.kind))
+						{
+							elog(WARNING, "invalid stats kind for entry %d/%u/%u of type %c",
+								 key.kind, key.dboid, key.objoid, t);
 							goto error;
+						}
 					}
 					else
 					{
@@ -1604,22 +1635,41 @@ pgstat_read_statsfile(void)
 						NameData	name;
 
 						if (!read_chunk_s(fpin, &kind))
+						{
+							elog(WARNING, "could not read stats kind for entry of type %c", t);
 							goto error;
+						}
 						if (!read_chunk_s(fpin, &name))
+						{
+							elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 						if (!pgstat_is_kind_valid(kind))
+						{
+							elog(WARNING, "invalid stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 
 						kind_info = pgstat_get_kind_info(kind);
 
 						if (!kind_info->from_serialized_name)
+						{
+							elog(WARNING, "invalid from_serialized_name in stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 
 						if (!kind_info->from_serialized_name(&name, &key))
 						{
 							/* skip over data for entry we don't care about */
 							if (fseek(fpin, pgstat_get_entry_len(kind), SEEK_CUR) != 0)
+							{
+								elog(WARNING, "could not seek \"%s\" of stats kind %d for entry of type %c",
+									 NameStr(name), kind, t);
 								goto error;
+							}
 
 							continue;
 						}
@@ -1638,8 +1688,8 @@ pgstat_read_statsfile(void)
 					if (found)
 					{
 						dshash_release_lock(pgStatLocal.shared_hash, p);
-						elog(WARNING, "found duplicate stats entry %d/%u/%u",
-							 key.kind, key.dboid, key.objoid);
+						elog(WARNING, "found duplicate stats entry %d/%u/%u of type %c",
+							 key.kind, key.dboid, key.objoid, t);
 						goto error;
 					}
 
@@ -1649,7 +1699,11 @@ pgstat_read_statsfile(void)
 					if (!read_chunk(fpin,
 									pgstat_get_entry_data(key.kind, header),
 									pgstat_get_entry_len(key.kind)))
+					{
+						elog(WARNING, "could not read data for entry %d/%u/%u of type %c",
+							 key.kind, key.dboid, key.objoid, t);
 						goto error;
+					}
 
 					break;
 				}
@@ -1660,7 +1714,10 @@ pgstat_read_statsfile(void)
 				 * file
 				 */
 				if (fgetc(fpin) != EOF)
+				{
+					elog(WARNING, "could not read end-of-file");
 					goto error;
+				}
 
 				goto done;
 
-- 
2.45.2

From 5a4108f1c23619558f205033964a4257545af6a4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Tue, 23 Jul 2024 12:40:08 +0900
Subject: [PATCH v4 3/4] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 35 +++++++++++++++++++++++------
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d70ba67bac..fa076050f4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5642,7 +5642,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 4a1724d342..593763eb4c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file read */
+	write_chunk_s(fpout, &redo);
+
 	/* Write various stats structs for fixed number of objects */
 	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
 	{
@@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1550,6 +1555,22 @@ pgstat_read_statsfile(void)
 		goto error;
 	}
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo))
+	{
+		elog(WARNING, "could not read redo LSN");
+		goto error;
+	}
+
+	if (file_redo != redo)
+	{
+		elog(WARNING, "found incorrect redo LSN %X/%X (expected %X/%X)",
+			 LSN_FORMAT_ARGS(file_redo), LSN_FORMAT_ARGS(redo));
+		goto error;
+	}
+
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
 	 * data into place.
-- 
2.45.2

From 287dd0dd23b734a9f452a08ff959db7bb080e074 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v4 4/4] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 30 +++++---
 src/backend/utils/activity/pgstat.c           | 68 +++++--------------
 src/test/recovery/t/029_stats_restart.pl      | 26 +++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 5 files changed, 59 insertions(+), 71 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fa076050f4..1b52b03841 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5391,7 +5391,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5511,10 +5510,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5635,14 +5631,8 @@ StartupXLOG(void)
 	 *
 	 * NB: Restoring replication slot stats relies on slot state to have
 	 * already been restored from disk.
-	 *
-	 * TODO: With a bit of extra work we could just start with a pgstat file
-	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7213,6 +7203,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7682,6 +7681,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 593763eb4c..cda21b2814 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the one stored
+ * in the control file. They are written out by the checkpointer during
+ * checkpoints and restart points, as well as before shutting down, except
+ * when shutting down in immediate mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,10 +1334,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1422,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1750,9 +1717,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

signature.asc
Description: PGP signature

Re: Flush pgstats file during checkpoints

Reply via email to