Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery record
> and starts accepting queries before creating a real checkpoint.
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the
> control file, but that isn't updated until the checkpoint completes. I
> don't see it on my laptop normally, but I can reproduce it if I insert
> a "sleep(5)" in StartupXLOG, just before it requests the checkpoint:
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>  	 * than is appropriate now that we're not in standby mode anymore.
>  	 */
>  	if (fast_promoted)
> +	{
> +		sleep(5);
>  		RequestCheckpoint(CHECKPOINT_FORCE);
> +	}
>  }
>
> The simplest fix would be to force a checkpoint in the regression test,
> before running pg_rewind. It's a bit of a cop-out, since you'd still
> get the same issue when you tried to do the same thing in the real
> world. It should be rare in practice - you'd not normally run pg_rewind
> immediately after promoting the standby - but a better error message at
> least would be nice.
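The window is easy to observe from the outside: right after pg_ctl promote returns, pg_controldata still reports the old "Latest checkpoint's TimeLineID" until the post-promotion checkpoint completes. As a minimal sketch of waiting that window out (the data directory, timeout and polling interval are made-up illustration values, not anything pg_rewind or Patroni actually does):

# Sketch: poll pg_controldata until the promoted node's control file
# shows a latest-checkpoint TLI newer than the pre-promotion one.
import re
import subprocess
import time

def latest_checkpoint_tli(pgdata):
    out = subprocess.check_output(['pg_controldata', pgdata]).decode()
    return int(re.search(r"Latest checkpoint's TimeLineID:\s*(\d+)", out).group(1))

def wait_for_new_tli(pgdata, old_tli, timeout=30.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if latest_checkpoint_tli(pgdata) > old_tli:
            return True
        time.sleep(0.5)  # the post-promotion checkpoint usually finishes quickly
    return False

# e.g. after "pg_ctl -D data/postgresql1 promote":
# wait_for_new_tli('data/postgresql1', old_tli=1)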
I think we came across this issue in production. We run a Python daemon called Patroni (https://github.com/zalando/patroni) to automate failovers. It stops the current master before promoting one of the replicas, and runs pg_rewind on the former master to make it "compatible" with the new one. Naturally, since pg_rewind is launched automatically, it is called right after the daemon detects that the new master is running. What we sometimes get is "source and target cluster are on the same timeline". Per Heikki's explanation, this happens when pg_rewind connects to the new master after the promotion but before the first checkpoint on the new timeline. Since "fast promotion" is the only mode pg_ctl promote supports, I think it would be beneficial to add an option for pg_rewind to issue a checkpoint itself. (I worked around this problem by adding a checkpoint call to Patroni; another way would be to create the 'fallback_promote' file and do a 'slow' promote without relying on the pg_ctl promote functionality.)
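For illustration, the workaround boils down to something like the following sketch (this is not the actual Patroni code; the connection string and data directory are placeholders, and CHECKPOINT requires superuser rights):

# Sketch: force a checkpoint on the newly promoted master so its control
# file reflects the new timeline, then run pg_rewind against it.
import subprocess
import psycopg2

def checkpoint_then_rewind(source_conninfo, target_pgdata):
    conn = psycopg2.connect(source_conninfo)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute('CHECKPOINT')  # needs superuser privileges
    finally:
        conn.close()
    subprocess.check_call([
        'pg_rewind',
        '--target-pgdata=' + target_pgdata,
        '--source-server=' + source_conninfo,
    ])

# checkpoint_then_rewind('host=127.0.0.1 port=5433 user=postgres dbname=postgres',
#                        'data/postgresql0')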
There is another weird case I cannot explain easily. If I do a single promotion/rewind while executing an insert that takes quite some time (insert into test select id from generate_series(1, 10000000) id), so that the insert is interrupted by the Patroni promotion (which does a checkpoint and pg_ctl -m fast stop on the former master, and pg_ctl promote on the replica candidate), I get the following (postgresql0 is the rewound node; note that its latest checkpoint location is lower than the prior checkpoint location):

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:16 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/6020580
Min recovery ending loc's timeline:   2
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

Furthermore, on the newly promoted master the latest checkpoint's timeline ID is different:

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:07 2015
Latest checkpoint location:           0/6000138
Prior checkpoint location:            0/2000060
Latest checkpoint's REDO location:    0/60000C8
Latest checkpoint's REDO WAL file:    000000020000000000000006
Latest checkpoint's TimeLineID:       2
Latest checkpoint's PrevTimeLineID:   2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  944
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:05 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

If I do a second round of promotion/rewind, this time waiting until the insert completes, I get this (note that the target timeline history goes ahead of the source one):

2015-12-08 17:21:01,991 INFO: running pg_rewind from user=postgres host=127.0.0.1 port=5432 dbname=postgres sslmode=prefer sslcompression=1
fetched file "global/pg_control", length 8192
Source timeline history:
1: 0/0 - 0/0
Target timeline history:
1: 0/0 - 0/6000098
2: 0/6000098 - 0/0
servers diverged at WAL position 0/6000098 on timeline 1
rewinding from last common checkpoint at 0/6000028 on timeline 1
....
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
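To make the timeline history lines above easier to read: they are derived from the NNNNNNNN.history files pg_rewind fetches (e.g. pg_xlog/00000002.history), where each line records "parentTLI switchpoint reason" and "0/0" marks an open-ended boundary. A rough sketch of that mapping (illustrative code only, not pg_rewind's actual implementation):

# Sketch: turn a timeline history file into "tli: begin - end" ranges
# like the ones pg_rewind prints above.
def history_ranges(path, current_tli):
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip comments and blank lines
            tli, switchpoint, _reason = line.split(None, 2)
            entries.append((int(tli), switchpoint))
    ranges = []
    begin = '0/0'
    for tli, switchpoint in entries:
        # each ancestor timeline ends where its successor begins
        ranges.append('%d: %s - %s' % (tli, begin, switchpoint))
        begin = switchpoint
    ranges.append('%d: %s - 0/0' % (current_tli, begin))  # current timeline is open-ended
    return ranges

# For the target above, 00000002.history contains a line like
# "1   0/6000098   no recovery target specified", so
# history_ranges('00000002.history', 2) returns
# ['1: 0/0 - 0/6000098', '2: 0/6000098 - 0/0']

After this second rewind, startup of the rewound node fails: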
LOG:  restored log file "00000002.history" from archive
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/00000004.history: No such file or directory
LOG:  entering standby mode
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/000000030000000000000006: No such file or directory
LOG:  restored log file "000000020000000000000006" from archive
DEBUG:  got WAL segment from archive
DEBUG:  checkpoint record is at 0/6000028
FATAL:  requested timeline 3 does not contain minimum recovery point 0/A01D180 on timeline 1

Again, pg_controldata after the second promote:

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A0000C8
Min recovery ending loc's timeline:   3
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A01D180
Min recovery ending loc's timeline:   1
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

This is with the master branch of PostgreSQL. In order to produce the timeline comparison output above, I've applied a small fix to pg_rewind (the debug output otherwise always loops over the target's entry count, even when printing the source history):

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ab82f0..a3ae8a4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -468,9 +468,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
 		Assert(false);
 
 	/*
-	 * Print the target timeline history.
+	 * Print the timeline history.
 	 */
-	for (i = 0; i < targetNentries; i++)
+	for (i = 0; i < *nentries; i++)
 	{
 		TimeLineHistoryEntry *entry;

-- 
Oleksii