Your point of divergence (7718/BFFFFFE8) is in the middle of the ...7718000000BF segment file. So you should eventually have two such files, one on timeline 1 and the other on timeline 2.
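For example, once the promoted server has archived its copy of that segment and the new history file, the archive directory used by your restore script should contain something like this (file names are taken from your logs; the listing itself is only an illustration):

    $ ls /data/wal_archive/operational/ | grep -E '7718000000BF|history'
    0000000100007718000000BF.gz    # the segment as archived on timeline 1
    0000000200007718000000BF.gz    # the same segment range, re-archived by the promoted server on timeline 2
    00000002.history.gz            # written at promotion; describes the switch to timeline 2

In the quoted output below, pg_rewind is asking your restore_command for the timeline 2 copy, which is the one that is missing.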
Are you archiving WAL on the promoted machine in a way that your restore_command can find it? Check archive_command and archive_mode on the promoted machine. Also, do your archive/restore scripts work properly for history files? (A rough sketch of what I mean is at the bottom of this mail, after the quoted message.)

On Wed, Nov 6, 2024 at 7:48 PM Craig McIlwee <cra...@vt.edu> wrote:

> I have a primary -> standby 1 -> standby 2 setup with all servers running PG 13.8 (this effort is part of getting onto a newer version, but I think those details aren't relevant to this problem). The first standby uses streaming replication from the primary and the second standby is using a WAL archive with a restore_command. To make this standby chain work, standby 1 is set to archive_mode = always with a command that populates the WAL archive.
>
> I would like to be able to promote standby 2 (hereon referred to just as 'standby'), perform some writes, then rewind it back to the point before promotion so it can become a standby again. The documentation for pg_rewind says that this is supported and it seems like it should be straightforward, but I'm not having any luck getting this to work, so I'm hoping someone can point out what I'm doing wrong. Here's what I did:
>
> First, observe that WAL is properly being applied from the archive. Note that we are currently on timeline 1.
>
> 2024-11-06 09:51:23.286 EST [5438] LOG: restored log file "0000000100007711000000F9" from archive
> 2024-11-06 09:51:23.434 EST [5438] LOG: restored log file "0000000100007711000000FA" from archive
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/0000000100007711000000FB.gz: No such file or directory
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/00000002.history.gz: No such file or directory
>
> Next, stop postgres, set wal_log_hints = on as required by pg_rewind, and restart postgres. I also make a copy of the data directory while postgres is not running so I can repeat my test, which works fine on a small test database but won't be possible for the multi-TB database that I will eventually be doing this on.
>
> Now promote the standby using "select pg_promote()" and see that it switches to a new timeline. You can also see that the last WAL applied from the archive is 7718/BF.
>
> 2024-11-06 12:10:10.831 EST [4336] LOG: restored log file "0000000100007718000000BD" from archive
> 2024-11-06 12:10:10.996 EST [4336] LOG: restored log file "0000000100007718000000BE" from archive
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/0000000100007718000000BF.gz: No such file or directory
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.384 EST [4336] LOG: restored log file "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/0000000100007718000000C0.gz: No such file or directory
> 2024-11-06 12:10:15.457 EST [4336] LOG: received promote request
> 2024-11-06 12:10:15.457 EST [4336] LOG: redo done at 7718/BFFFFF30
> 2024-11-06 12:10:15.457 EST [4336] LOG: last completed transaction was at log time 2024-11-06 12:10:22.627074-05
> 2024-11-06 12:10:15.593 EST [4336] LOG: restored log file "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.611 EST [4336] LOG: selected new timeline ID: 2
> 2024-11-06 12:10:15.640 EST [4336] LOG: archive recovery complete
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/00000001.history.gz: No such file or directory
> 2024-11-06 12:10:17.028 EST [4329] LOG: database system is ready to accept connections
>
> Next, insert a record just to make some changes that I want to roll back later. (What I will eventually be doing is creating a publication to ship data to a newer version, but again, that's not what's important here.)
>
> Finally, shut down postgres and attempt a rewind. The address used in the --source-server connection string is the address of the primary.
>
> 2024-11-06 12:11:11.139 EST [4329] LOG: received fast shutdown request
> 2024-11-06 12:11:11.143 EST [4329] LOG: aborting any active transactions
> 2024-11-06 12:11:11.144 EST [4329] LOG: background worker "logical replication launcher" (PID 5923) exited with exit code 1
> 2024-11-06 12:11:40.933 EST [4342] LOG: shutting down
> 2024-11-06 12:11:41.753 EST [4329] LOG: database system is shut down
>
> /usr/pgsql-13/bin/pg_rewind --target-pgdata=/data/pgsql/operational --source-server="host=x.x.x.x dbname=postgres user=xxx password=xxx" --dry-run --progress --restore-target-wal
>
> pg_rewind: connected to server
> pg_rewind: servers diverged at WAL location 7718/BFFFFFE8 on timeline 1
> /data/wal_archive/restore_operational.sh: line 2: /data/wal_archive/operational/0000000200007718000000BF.gz: No such file or directory
> pg_rewind: error: could not restore file "0000000200007718000000BF" from archive
> pg_rewind: fatal: could not find previous WAL record at 7718/BFFFFFE8
>
> pg_rewind shows the point of divergence as 7718/BF on timeline 1, but when it tries to replay WAL using the restore command it is trying to find WAL from timeline 2 rather than picking back up on timeline 1. I tried setting recovery_target_timeline on the target database to 'current' and '1', but that gave the same result. Searching the archives, [1] mentions the need to force a checkpoint after promotion, which I tried even though the problem description isn't the same. [2] mentions a problem that looks more like the one I am facing but has no responses.
> At this point I don't know what to do next and hope someone can point me in the right direction.
>
> [1] https://www.postgresql.org/message-id/e7b16ddea93a92575cb6d143b6ef602cab22432e.camel%40cybertec.at
> [2] https://www.postgresql.org/message-id/CALp3DH1fLZmPvkOteAbUo4TOLZP-LstKOs6Gcw3Bm7acmJqk=w...@mail.gmail.com
>
> Craig
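To make the questions above concrete: for pg_rewind --restore-target-wal to succeed, the promoted standby has to keep archiving after it switches to timeline 2, and the archive/restore scripts have to handle .history files as well as ordinary segments. Here is a minimal sketch of what I'd expect on the promoted machine. The restore script path and the gzip naming are taken from your log output; the archive script name (archive_operational.sh) and the exact commands are assumptions about your setup, not a drop-in configuration.

    # postgresql.conf on the promoted standby (sketch)
    archive_mode = always        # archives received WAL as a standby and its own WAL after promotion
    archive_command = '/data/wal_archive/archive_operational.sh %p %f'
    restore_command = '/data/wal_archive/restore_operational.sh %f %p'

    #!/bin/bash
    # archive_operational.sh (sketch): $1 = %p (path of the file to archive), $2 = %f (file name)
    # history files such as 00000002.history must be archived the same way as segments
    gzip -c "$1" > "/data/wal_archive/operational/$2.gz"

    #!/bin/bash
    # restore_operational.sh (sketch): $1 = %f (requested file name), $2 = %p (destination path)
    # exit non-zero when the file is not in the archive yet instead of letting gunzip fail noisily
    [ -f "/data/wal_archive/operational/$1.gz" ] || exit 1
    gunzip -c "/data/wal_archive/operational/$1.gz" > "$2"

You can confirm that the promoted server is archiving at all with "select * from pg_stat_archiver;" — if it never archives anything on timeline 2, the 0000000200007718000000BF file that pg_rewind asks for will never appear in the archive.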