Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery record
> and starts accepting queries before creating a real checkpoint.
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the
> control file, but that isn't updated until the checkpoint completes. I
> don't see it on my laptop normally, but I can reproduce it if I insert
> a "sleep(5)" in StartupXLOG, just before it requests the checkpoint:
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>  	 * than is appropriate now that we're not in standby mode anymore.
>  	 */
>  	if (fast_promoted)
> +	{
> +		sleep(5);
>  		RequestCheckpoint(CHECKPOINT_FORCE);
> +	}
>  }
>
> The simplest fix would be to force a checkpoint in the regression test,
> before running pg_rewind. It's a bit of a cop-out, since you'd still
> get the same issue when you tried to do the same thing in the real
> world. It should be rare in practice - you'd not normally run pg_rewind
> immediately after promoting the standby - but a better error message at
> least would be nice.
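The window is easy to observe from the outside: right after pg_ctl promote returns, pg_controldata still reports the old "Latest checkpoint's TimeLineID" until the post-promotion checkpoint completes. As a minimal sketch of waiting that window out (the data directory, timeout and polling interval are made-up illustration values, not anything pg_rewind or Patroni actually does):

# Sketch: poll pg_controldata until the promoted node's control file
# shows a latest-checkpoint TLI newer than the pre-promotion one.
import re
import subprocess
import time

def latest_checkpoint_tli(pgdata):
    out = subprocess.check_output(['pg_controldata', pgdata]).decode()
    return int(re.search(r"Latest checkpoint's TimeLineID:\s*(\d+)", out).group(1))

def wait_for_new_tli(pgdata, old_tli, timeout=30.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if latest_checkpoint_tli(pgdata) > old_tli:
            return True
        time.sleep(0.5)  # the post-promotion checkpoint usually finishes quickly
    return False

# e.g. after "pg_ctl -D data/postgresql1 promote":
# wait_for_new_tli('data/postgresql1', old_tli=1)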
I think we came across this issue in production. We run a Python daemon called Patroni (https://github.com/zalando/patroni) to automate failovers. It stops the current master before promoting one of the replicas, and runs pg_rewind on the former master to make it "compatible" with the new one. Naturally, since pg_rewind is launched automatically, it is called right after the daemon detects that the new master is running. What we sometimes get is "source and target cluster are on the same timeline". Per Heikki's explanation, this happens when pg_rewind connects to the new master after the promotion but before the first checkpoint on the new timeline. Since "fast promotion" is the only mode pg_ctl promote supports, I think it would be beneficial to add an option for pg_rewind to issue a checkpoint itself. (I worked around this problem by adding a checkpoint call to Patroni; another way would be to create the 'fallback_promote' file and do a 'slow' promote without relying on the pg_ctl promote functionality.)
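For illustration, the workaround boils down to something like the following sketch (this is not the actual Patroni code; the connection string and data directory are placeholders, and CHECKPOINT requires superuser rights):

# Sketch: force a checkpoint on the newly promoted master so its control
# file reflects the new timeline, then run pg_rewind against it.
import subprocess
import psycopg2

def checkpoint_then_rewind(source_conninfo, target_pgdata):
    conn = psycopg2.connect(source_conninfo)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute('CHECKPOINT')  # needs superuser privileges
    finally:
        conn.close()
    subprocess.check_call([
        'pg_rewind',
        '--target-pgdata=' + target_pgdata,
        '--source-server=' + source_conninfo,
    ])

# checkpoint_then_rewind('host=127.0.0.1 port=5433 user=postgres dbname=postgres',
#                        'data/postgresql0')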
There is another weird case I cannot explain easily. If I do a single promotion/rewind while executing an insert that takes quite some time (insert into test select id from generate_series(1, 10000000) id), so that the insert is interrupted by the Patroni promotion (which does a checkpoint and pg_ctl -m fast stop on the former master, and pg_ctl promote on the replica candidate), I get the following (postgresql0 is the rewound node; note that its latest checkpoint location is lower than the prior checkpoint location):

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:16 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/6020580
Min recovery ending loc's timeline:   2
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

Furthermore, on the newly promoted master the latest checkpoint's timeline ID is different:

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:07 2015
Latest checkpoint location:           0/6000138
Prior checkpoint location:            0/2000060
Latest checkpoint's REDO location:    0/60000C8
Latest checkpoint's REDO WAL file:    000000020000000000000006
Latest checkpoint's TimeLineID:       2
Latest checkpoint's PrevTimeLineID:   2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  944
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:05 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

If I do a second round of promotion/rewind, this time waiting until the insert completes, I get this (note that the target timeline history goes ahead of the source one):

2015-12-08 17:21:01,991 INFO: running pg_rewind from user=postgres host=127.0.0.1 port=5432 dbname=postgres sslmode=prefer sslcompression=1
fetched file "global/pg_control", length 8192
Source timeline history:
1: 0/0 - 0/0
Target timeline history:
1: 0/0 - 0/6000098
2: 0/6000098 - 0/0
servers diverged at WAL position 0/6000098 on timeline 1
rewinding from last common checkpoint at 0/6000028 on timeline 1
....
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
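To make the timeline history lines above easier to read: they are derived from the NNNNNNNN.history files pg_rewind fetches (e.g. pg_xlog/00000002.history), where each line records "parentTLI switchpoint reason" and "0/0" marks an open-ended boundary. A rough sketch of that mapping (illustrative code only, not pg_rewind's actual implementation):

# Sketch: turn a timeline history file into "tli: begin - end" ranges
# like the ones pg_rewind prints above.
def history_ranges(path, current_tli):
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip comments and blank lines
            tli, switchpoint, _reason = line.split(None, 2)
            entries.append((int(tli), switchpoint))
    ranges = []
    begin = '0/0'
    for tli, switchpoint in entries:
        # each ancestor timeline ends where its successor begins
        ranges.append('%d: %s - %s' % (tli, begin, switchpoint))
        begin = switchpoint
    ranges.append('%d: %s - 0/0' % (current_tli, begin))  # current timeline is open-ended
    return ranges

# For the target above, 00000002.history contains a line like
# "1   0/6000098   no recovery target specified", so
# history_ranges('00000002.history', 2) returns
# ['1: 0/0 - 0/6000098', '2: 0/6000098 - 0/0']

After this second rewind, startup of the rewound node fails: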
LOG:  restored log file "00000002.history" from archive
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/00000004.history: No such file or directory
LOG:  entering standby mode
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/000000030000000000000006: No such file or directory
LOG:  restored log file "000000020000000000000006" from archive
DEBUG:  got WAL segment from archive
DEBUG:  checkpoint record is at 0/6000028
FATAL:  requested timeline 3 does not contain minimum recovery point 0/A01D180 on timeline 1

Again, pg_controldata after the second promote:

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A0000C8
Min recovery ending loc's timeline:   3
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A01D180
Min recovery ending loc's timeline:   1
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

This is with the master branch of PostgreSQL. In order to produce the timeline comparison output above, I've applied a small fix to pg_rewind (the debug output otherwise always loops over the target's entry count, even when printing the source history):

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ab82f0..a3ae8a4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -468,9 +468,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		Assert(false);
 
 	/*
-	 * Print the target timeline history.
+	 * Print the timeline history.
 	 */
-	for (i = 0; i < targetNentries; i++)
+	for (i = 0; i < *nentries; i++)
 	{
 		TimeLineHistoryEntry *entry;

-- 
Oleksii