Your point of divergence is in the middle of the 7718/000000BF file. So,
you should have 2 such files eventually, one on timeline 1 and the other on
timeline 2.

Are you archiving WAL on the promoted machine in a way that your
restore_command can find it? Check archive_command and archive_mode on the
promoted machine.

Also, do your archive/restore scripts work properly for history files?

On Wed, Nov 6, 2024 at 7:48 PM Craig McIlwee <cra...@vt.edu> wrote:

> I have a primary -> standby 1 -> standby 2 setup with all servers running
> PG 13.8 (this effort is part of getting on to a newer version, but I think
> those details aren't relevant to this problem).  The first standby uses
> streaming replication from the primary and the second standby is using a
> WAL archive with a restore_command.  To make this standby chain work,
> standby 1 is set to archive_mode = always with a command that populates the
> WAL archive.
>
> I would like to be able to promote standby 2 (hereon referred to just as
> 'standby'), perform some writes, then rewind it back to the point before
> promotion so it can become a standby again.  The documentation for
> pg_rewind says that this is supported and it seems like it should be
> straightforward, but I'm not having any luck getting this to work so I'm
> hoping someone can point out what I'm doing wrong.  Here's what I did:
>
> First, observe that WAL is properly being applied from the archive.  Note
> that we are currently on timeline 1.
>
> 2024-11-06 09:51:23.286 EST [5438] LOG:  restored log file
> "0000000100007711000000F9" from archive
> 2024-11-06 09:51:23.434 EST [5438] LOG:  restored log file
> "0000000100007711000000FA" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007711000000FB.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
>
> Next, stop postgres, set wal_log_hints = on as required by pg_rewind, and
> restart postgres.  I also make a copy of the data directory while the
> postgres is not running so I can repeat my test, which works fine on a
> small test database but won't be possible for the multi TB database that I
> will eventually be doing this on.
>
> Now promote the standby using "select pg_promote()" and see that it
> switches to a new timeline.  You can also see that the last WAL applied
> from the archive is 7718/BF.
>
> 2024-11-06 12:10:10.831 EST [4336] LOG:  restored log file
> "0000000100007718000000BD" from archive
> 2024-11-06 12:10:10.996 EST [4336] LOG:  restored log file
> "0000000100007718000000BE" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000BF.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.384 EST [4336] LOG:  restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000C0.gz: No such file or
> directory
> 2024-11-06 12:10:15.457 EST [4336] LOG:  received promote request
> 2024-11-06 12:10:15.457 EST [4336] LOG:  redo done at 7718/BFFFFF30
> 2024-11-06 12:10:15.457 EST [4336] LOG:  last completed transaction was at
> log time 2024-11-06 12:10:22.627074-05
> 2024-11-06 12:10:15.593 EST [4336] LOG:  restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.611 EST [4336] LOG:  selected new timeline ID: 2
> 2024-11-06 12:10:15.640 EST [4336] LOG:  archive recovery complete
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000001.history.gz: No such file or directory
> 2024-11-06 12:10:17.028 EST [4329] LOG:  database system is ready to
> accept connections
>
> Next, insert a record into just to make some changes that I want to
> rollback later.  (What I will eventually be doing is creating a publication
> to ship data to a newer version, but again, that's not what's important
> here.)
>
> Finally, shutdown postgres and attempt a rewind.  The address used in the
> --source-server connection string is the address of the primary.
>
> 2024-11-06 12:11:11.139 EST [4329] LOG:  received fast shutdown request
> 2024-11-06 12:11:11.143 EST [4329] LOG:  aborting any active transactions
> 2024-11-06 12:11:11.144 EST [4329] LOG:  background worker "logical
> replication launcher" (PID 5923) exited with exit code 1
> 2024-11-06 12:11:40.933 EST [4342] LOG:  shutting down
> 2024-11-06 12:11:41.753 EST [4329] LOG:  database system is shut down
>
> /usr/pgsql-13/bin/pg_rewind --target-pgdata=/data/pgsql/operational
> --source-server="host=x.x.x.x dbname=postgres user=xxx password=xxx"
> --dry-run --progress --restore-target-wal
>
> pg_rewind: connected to server
> pg_rewind: servers diverged at WAL location 7718/BFFFFFE8 on timeline 1
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000200007718000000BF.gz: No such file or
> directory
> pg_rewind: error: could not restore file "0000000200007718000000BF" from
> archive
> pg_rewind: fatal: could not find previous WAL record at 7718/BFFFFFE8
>
> pg_rewind shows the point of divergence as 7718/BF on timeline 1, but when
> it tries to replay WAL using the restore command it is trying to find WAL
> from timeline 2 rather than picking back up on timeline 1.  I tried
> setting recovery_target_timeline on the target database to 'current' and
> '1' but that gave the same result. Searching the archives, [1] mentions the
> need to force a checkpoint after promotion which I tried even though the
> problem description isn't the same.  [2] mentions a problem that looks more
> like the one I am facing but has no responses.  At this point I don't know
> what to do next and hope someone can point me in the right direction.
>
> [1]
> https://www.postgresql.org/message-id/e7b16ddea93a92575cb6d143b6ef602cab22432e.camel%40cybertec.at
> [2]
> https://www.postgresql.org/message-id/CALp3DH1fLZmPvkOteAbUo4TOLZP-LstKOs6Gcw3Bm7acmJqk=w...@mail.gmail.com
>
> Craig
>

Reply via email to