On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > Needed a rebase. I'm attaching the v13 patch for further consideration.
Needed a rebase. I'm attaching the v14 patch. It also has the following changes: - Ran pgindent on the new source code. - Ran pgperltidy on the new TAP test. - Improved the newly added TAP test a bit. Used the new wait_for_log core TAP function in place of custom find_in_log. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
From e6010e3b2e4c52d32a5d2f3eb2d59954617b221b Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Date: Sat, 21 Oct 2023 13:41:46 +0000 Subject: [PATCH v14] Allow standby to switch WAL source from archive to streaming A standby typically switches to streaming replication (get WAL from primary), only when receive from WAL archive finishes (no more WAL left there) or fails for any reason. Reading WAL from archive may not always be as efficient and fast as reading from primary. This can be due to the differences in disk types, IO costs, network latencies etc., all impacting recovery performance on standby. And, while standby is reading WAL from archive, primary accumulates WAL because the standby's replication slot stays inactive. To avoid these problems, one can use this parameter to make standby switch to stream mode sooner. This feature adds a new GUC that specifies amount of time after which standby attempts to switch WAL source from WAL archive to streaming replication (getting WAL from primary). However, standby exhausts all the WAL present in pg_wal before switching. If standby fails to switch to stream mode, it falls back to archive mode. Reported-by: SATYANARAYANA NARLAPURAM Author: Bharath Rupireddy Reviewed-by: Cary Huang, Nathan Bossart Reviewed-by: Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com --- doc/src/sgml/config.sgml | 45 ++++++ doc/src/sgml/high-availability.sgml | 15 +- src/backend/access/transam/xlogrecovery.c | 146 ++++++++++++++++-- src/backend/utils/misc/guc_tables.c | 12 ++ src/backend/utils/misc/postgresql.conf.sample | 4 + src/include/access/xlogrecovery.h | 1 + src/test/recovery/meson.build | 1 + src/test/recovery/t/040_wal_source_switch.pl | 130 ++++++++++++++++ 8 files changed, 333 insertions(+), 21 deletions(-) create mode 100644 src/test/recovery/t/040_wal_source_switch.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3839c72c86..3a18ba9b26 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4791,6 +4791,51 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" </listitem> </varlistentry> + <varlistentry id="guc-streaming-replication-retry-interval" xreflabel="streaming_replication_retry_interval"> + <term><varname>streaming_replication_retry_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>streaming_replication_retry_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies amount of time after which standby attempts to switch WAL + source from WAL archive to streaming replication (getting WAL from + primary). However, standby exhausts all the WAL present in pg_wal + before switching. If standby fails to switch to stream mode, it falls + back to archive mode. If this parameter's value is specified without + units, it is taken as milliseconds. Default is five minutes + (<literal>5min</literal>). With a lower value for this parameter, + standby makes frequent WAL source switch attempts. To avoid this, it is + recommended to set a reasonable value. A setting of <literal>0</literal> + disables the feature. When disabled, standby typically switches to + stream mode only when receive from WAL archive finishes (no more WAL + left there) or fails for any reason. This parameter can only be set in + the <filename>postgresql.conf</filename> file or on the server command + line. + </para> + <note> + <para> + Standby may not always attempt to switch source from WAL archive to + streaming replication at exact <varname>streaming_replication_retry_interval</varname> + intervals. For example, if the parameter is set to <literal>1min</literal> + and fetching WAL file from archive takes about <literal>2min</literal>, + then the source switch attempt happens for the next WAL file after + current WAL file fetched from archive is fully applied. + </para> + </note> + <para> + Reading WAL from archive may not always be as efficient and fast as + reading from primary. This can be due to the differences in disk types, + IO costs, network latencies etc., all impacting recovery performance on + standby. And, while standby is reading WAL from archive, primary + accumulates WAL because the standby's replication slot stays inactive. + To avoid these problems, one can use this parameter to make standby + switch to stream mode sooner. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-recovery-min-apply-delay" xreflabel="recovery_min_apply_delay"> <term><varname>recovery_min_apply_delay</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 5f9257313a..b04dae7ce0 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -628,12 +628,15 @@ protocol to make nodes agree on a serializable transactional order. In standby mode, the server continuously applies WAL received from the primary server. The standby server can read WAL from a WAL archive (see <xref linkend="guc-restore-command"/>) or directly from the primary - over a TCP connection (streaming replication). The standby server will - also attempt to restore any WAL found in the standby cluster's - <filename>pg_wal</filename> directory. That typically happens after a server - restart, when the standby replays again WAL that was streamed from the - primary before the restart, but you can also manually copy files to - <filename>pg_wal</filename> at any time to have them replayed. + over a TCP connection (streaming replication) or attempt to switch to + streaming replication after reading from archive when + <xref linkend="guc-streaming-replication-retry-interval"/> parameter is + set. The standby server will also attempt to restore any WAL found in the + standby cluster's <filename>pg_wal</filename> directory. That typically + happens after a server restart, when the standby replays again WAL that was + streamed from the primary before the restart, but you can also manually + copy files to <filename>pg_wal</filename> at any time to have them + replayed. </para> <para> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index d6f2bb8286..c1a2a83a0b 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -68,6 +68,11 @@ #define RECOVERY_COMMAND_FILE "recovery.conf" #define RECOVERY_COMMAND_DONE "recovery.done" +#define SwitchFromArchiveToStreamEnabled() \ + (streaming_replication_retry_interval > 0 && \ + StandbyMode && \ + currentSource == XLOG_FROM_ARCHIVE) + /* * GUC support */ @@ -91,6 +96,7 @@ TimestampTz recoveryTargetTime; const char *recoveryTargetName; XLogRecPtr recoveryTargetLSN; int recovery_min_apply_delay = 0; +int streaming_replication_retry_interval = 300000; /* options formerly taken from recovery.conf for XLOG streaming */ char *PrimaryConnInfo = NULL; @@ -297,6 +303,11 @@ bool reachedConsistency = false; static char *replay_image_masked = NULL; static char *primary_image_masked = NULL; +/* + * Holds the timestamp at which WaitForWALToBecomeAvailable()'s state machine + * switches to XLOG_FROM_ARCHIVE. + */ +static TimestampTz switched_to_archive_at = 0; /* * Shared-memory state for WAL recovery. @@ -440,6 +451,8 @@ static bool HotStandbyActiveInReplay(void); static void SetCurrentChunkStartTime(TimestampTz xtime); static void SetLatestXTime(TimestampTz xtime); +static bool ShouldSwitchWALSourceToPrimary(void); + /* * Initialization of shared memory for WAL recovery */ @@ -3471,8 +3484,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool nonblocking) { static TimestampTz last_fail_time = 0; + static bool canSwitchSource = false; + bool switchSource = false; TimestampTz now; bool streaming_reply_sent = false; + XLogSource readFrom; /*------- * Standby mode is implemented by a state machine: @@ -3492,6 +3508,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * those actions are taken when reading from the previous source fails, as * part of advancing to the next state. * + * Try reading WAL from primary after being in XLOG_FROM_ARCHIVE state for + * at least streaming_replication_retry_interval milliseconds. However, + * exhaust all the WAL present in pg_wal before switching. If successful, + * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls + * back to XLOG_FROM_ARCHIVE state. + * * If standby mode is turned off while reading WAL from stream, we move * to XLOG_FROM_ARCHIVE and reset lastSourceFailed, to force fetching * the files (which would be required at end of recovery, e.g., timeline @@ -3515,19 +3537,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool startWalReceiver = false; /* - * First check if we failed to read from the current source, and - * advance the state machine if so. The failure to read might've + * First check if we failed to read from the current source or we + * intentionally want to switch the source from archive to primary, + * and advance the state machine if so. The failure to read might've * happened outside this function, e.g when a CRC check fails on a * record, or within this loop. */ - if (lastSourceFailed) + if (lastSourceFailed || switchSource) { /* * Don't allow any retry loops to occur during nonblocking - * readahead. Let the caller process everything that has been - * decoded already first. + * readahead if we failed to read from the current source. Let the + * caller process everything that has been decoded already first. */ - if (nonblocking) + if (nonblocking && lastSourceFailed) return XLREAD_WOULDBLOCK; switch (currentSource) @@ -3659,9 +3682,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, } if (currentSource != oldSource) - elog(DEBUG2, "switched WAL source from %s to %s after %s", - xlogSourceNames[oldSource], xlogSourceNames[currentSource], - lastSourceFailed ? "failure" : "success"); + { + /* Save the timestamp at which we're switching to archive. */ + if (SwitchFromArchiveToStreamEnabled()) + switched_to_archive_at = GetCurrentTimestamp(); + + if (switchSource) + ereport(DEBUG1, + (errmsg_internal("switched WAL source to %s after fetching WAL from %s for at least %d milliseconds", + xlogSourceNames[currentSource], + xlogSourceNames[oldSource], + streaming_replication_retry_interval))); + else + ereport(DEBUG1, + (errmsg_internal("switched WAL source from %s to %s after %s", + xlogSourceNames[oldSource], + xlogSourceNames[currentSource], + lastSourceFailed ? "failure" : "success"))); + } /* * We've now handled possible failure. Try to read from the chosen @@ -3669,6 +3707,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, */ lastSourceFailed = false; + if (switchSource) + { + Assert(canSwitchSource == true); + switchSource = false; + canSwitchSource = false; + } + switch (currentSource) { case XLOG_FROM_ARCHIVE: @@ -3690,13 +3735,24 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (randAccess) curFileTLI = 0; + /* + * See if we can switch the source to streaming from archive. + */ + if (!canSwitchSource) + canSwitchSource = ShouldSwitchWALSourceToPrimary(); + /* * Try to restore the file from archive, or read an existing - * file from pg_wal. + * file from pg_wal. However, before switching the source to + * stream mode, give it a chance to read all the WAL from + * pg_wal. */ - readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, - currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY : - currentSource); + if (canSwitchSource) + readFrom = XLOG_FROM_PG_WAL; + else + readFrom = currentSource; + + readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom); if (readFile >= 0) return XLREAD_SUCCESS; /* success! */ @@ -3704,6 +3760,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * Nope, not found in archive or pg_wal. */ lastSourceFailed = true; + + /* + * Exhausted all the WAL in pg_wal, now ready to switch to + * streaming. + */ + if (canSwitchSource) + switchSource = true; + break; case XLOG_FROM_STREAM: @@ -3934,6 +3998,54 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, return XLREAD_FAIL; /* not reached */ } +/* + * This function tells if a standby can make an attempt to read WAL from + * primary after reading from archive for at least + * streaming_replication_retry_interval milliseconds. Reading WAL from archive + * may not always be as efficient and fast as reading from primary. This can be + * due to the differences in disk types, IO costs, network latencies etc., all + * impacting recovery performance on standby. And, while standby is reading WAL + * from archive, primary accumulates WAL because the standby's replication slot + * stays inactive. To avoid these problems, we try to make standby switch to + * stream mode sooner. + */ +static bool +ShouldSwitchWALSourceToPrimary(void) +{ + bool shouldSwitchSource = false; + + if (!SwitchFromArchiveToStreamEnabled()) + return shouldSwitchSource; + + if (switched_to_archive_at > 0) + { + TimestampTz curr_time; + + curr_time = GetCurrentTimestamp(); + + if (TimestampDifferenceExceeds(switched_to_archive_at, curr_time, + streaming_replication_retry_interval)) + { + ereport(DEBUG1, + (errmsg_internal("trying to switch WAL source to %s after fetching WAL from %s for at least %d milliseconds", + xlogSourceNames[XLOG_FROM_STREAM], + xlogSourceNames[currentSource], + streaming_replication_retry_interval))); + + shouldSwitchSource = true; + } + } + else if (switched_to_archive_at == 0) + { + /* + * Save the timestamp if we're about to fetch WAL from archive for the + * first time. + */ + switched_to_archive_at = GetCurrentTimestamp(); + } + + return shouldSwitchSource; +} /* * Determine what log level should be used to report a corrupt WAL record @@ -4259,7 +4371,11 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) continue; } - if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE) + /* + * When failed to read from archive, try reading from pg_wal, see + * below. + */ + if (source == XLOG_FROM_ARCHIVE) { fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true); @@ -4272,7 +4388,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source) } } - if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL) + if (source == XLOG_FROM_ARCHIVE || source == XLOG_FROM_PG_WAL) { fd = XLogFileRead(segno, emode, tli, XLOG_FROM_PG_WAL, true); diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 4c58574166..b6dfa74e3f 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -3168,6 +3168,18 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"streaming_replication_retry_interval", PGC_SIGHUP, REPLICATION_STANDBY, + gettext_noop("Sets the time after which standby attempts to switch WAL " + "source from archive to streaming replication."), + gettext_noop("0 turns this feature off."), + GUC_UNIT_MS + }, + &streaming_replication_retry_interval, + 300000, 0, INT_MAX, + NULL, NULL, NULL + }, + { {"wal_segment_size", PGC_INTERNAL, PRESET_OPTIONS, gettext_noop("Shows the size of write ahead log segments."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index d08d55c3fe..0e03887683 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -352,6 +352,10 @@ # in milliseconds; 0 disables #wal_retrieve_retry_interval = 5s # time to wait before retrying to # retrieve WAL after a failed attempt +#streaming_replication_retry_interval = 5min # time after which standby + # attempts to switch WAL source from archive to + # streaming replication + # in milliseconds; 0 disables #recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery # - Subscribers - diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h index 47c29350f5..dfa0301d61 100644 --- a/src/include/access/xlogrecovery.h +++ b/src/include/access/xlogrecovery.h @@ -57,6 +57,7 @@ extern PGDLLIMPORT char *PrimarySlotName; extern PGDLLIMPORT char *recoveryRestoreCommand; extern PGDLLIMPORT char *recoveryEndCommand; extern PGDLLIMPORT char *archiveCleanupCommand; +extern PGDLLIMPORT int streaming_replication_retry_interval; /* indirectly set via GUC system */ extern PGDLLIMPORT TransactionId recoveryTargetXid; diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index 9d8039684a..06041a5f74 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -45,6 +45,7 @@ tests += { 't/037_invalid_database.pl', 't/038_save_logical_slots_shutdown.pl', 't/039_end_of_wal.pl', + 't/040_wal_source_switch.pl', ], }, } diff --git a/src/test/recovery/t/040_wal_source_switch.pl b/src/test/recovery/t/040_wal_source_switch.pl new file mode 100644 index 0000000000..9388fa3a39 --- /dev/null +++ b/src/test/recovery/t/040_wal_source_switch.pl @@ -0,0 +1,130 @@ +# Copyright (c) 2021-2022, PostgreSQL Global Development Group + +# Test for WAL source switch feature +use strict; +use warnings; + +use PostgreSQL::Test::Utils; +use PostgreSQL::Test::Cluster; +use Test::More; + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize primary node, setting wal-segsize to 1MB +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); +$node_primary->init( + allows_streaming => 1, + has_archiving => 1, + extra => ['--wal-segsize=1']); + +# Ensure checkpoint doesn't come in our way +$node_primary->append_conf( + 'postgresql.conf', qq( + min_wal_size = 2MB + max_wal_size = 1GB + checkpoint_timeout = 1h + wal_recycle = off +)); +$node_primary->start; + +# Create an inactive replication slot to keep the WAL +$node_primary->safe_psql('postgres', + "SELECT pg_create_physical_replication_slot('rep1')"); + +# Take backup +my $backup_name = 'my_backup'; +$node_primary->backup($backup_name); + +# Create a standby linking to it using the replication slot +my $node_standby = PostgreSQL::Test::Cluster->new('standby'); +$node_standby->init_from_backup( + $node_primary, $backup_name, + has_streaming => 1, + has_restoring => 1); +$node_standby->append_conf( + 'postgresql.conf', qq( +primary_slot_name = 'rep1' +min_wal_size = 2MB +max_wal_size = 1GB +checkpoint_timeout = 1h +streaming_replication_retry_interval = 10ms +wal_recycle = off +log_min_messages = 'debug1' +)); + +$node_standby->start; + +# Wait until standby has replayed enough data +$node_primary->wait_for_catchup($node_standby); + +$node_standby->stop; + +# Advance WAL by 10 segments (= 10MB) on primary +advance_wal($node_primary, 10); + +# Wait for primary to generate requested WAL files +$node_primary->poll_query_until('postgres', + "SELECT COUNT(*) >= 10 FROM pg_ls_waldir();") + or die "Timed out while waiting for primary to generate WAL"; + +# Wait until generated WAL files have been stored on the archives of the +# primary. This ensures that the standby created below will be able to restore +# the WAL files. +my $primary_archive = $node_primary->archive_dir; +$node_primary->poll_query_until('postgres', + "SELECT COUNT(*) >= 10 FROM pg_ls_dir('$primary_archive', false, false) a WHERE a ~ '^[0-9A-F]{24}\$';" +) or die "Timed out while waiting for archiving of WAL by primary"; + +# Generate some data on the primary +$node_primary->safe_psql('postgres', + "CREATE TABLE test_tbl AS SELECT a FROM generate_series(1,10) AS a;"); + +my $offset = -s $node_standby->logfile; + +# Standby now connects to primary during inital recovery after fetching WAL +# from archive for about streaming_replication_retry_interval milliseconds. +$node_standby->start; + +# Wait until standby has replayed enough data +$node_primary->wait_for_catchup($node_standby); + +$node_standby->wait_for_log( + qr/LOG: ( [A-Z0-9]+:)? restored log file ".*" from archive/, $offset); + +$node_standby->wait_for_log( + qr/DEBUG: ( [A-Z0-9]+:)? trying to switch WAL source to stream after fetching WAL from archive for at least 10 milliseconds/, + $offset); + +$node_standby->wait_for_log( + qr/DEBUG: ( [A-Z0-9]+:)? switched WAL source to stream after fetching WAL from archive for at least 10 milliseconds/, + $offset); + +$node_standby->wait_for_log( + qr/LOG: ( [A-Z0-9]+:)? started streaming WAL from primary at .* on timeline .*/, + $offset); + +# Check the streamed data on the standby +my $result = + $node_standby->safe_psql('postgres', "SELECT COUNT(*) FROM test_tbl;"); + +is($result, '10', 'check streamed data on standby'); + +$node_standby->stop; +$node_primary->stop; + +##################################### +# Advance WAL of $node by $n segments +sub advance_wal +{ + my ($node, $n) = @_; + + # Advance by $n segments (= (wal_segment_size * $n) bytes) on primary. + for (my $i = 0; $i < $n; $i++) + { + $node->safe_psql('postgres', + "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();"); + } + return; +} + +done_testing(); -- 2.34.1