[HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

Tsunakawa, Takayuki Thu, 25 Aug 2016 19:34:35 -0700

Hello,

Our customer hit a problem with cascading replication, and we found the cause.  
They are using the latest PostgreSQL 9.2.18.  The bug seems to have been fixed 
in 9.4 and higher during the big modification of xlog.c, but the fix has not 
been applied to older releases.


The attached patch is for 9.2.18.  This just borrows the idea from 9.4 and 
higher.

But we haven't been able to reproduce the problem.  Could you review the patch 
and help to test it?  I would very much appreciate it if you could figure out 
how to reproduce the problem easily.


PROBLEM
========================================

The customer's configuration consists of three nodes: node1 is a primary, node2 
is a synchronous standby, and node3 is a cascading standby.  The primary 
archives WAL to a shared (network?) storage and the standbys read archived WAL 
from there with restore_command.  recovery_target_timeline is set to 'latest' 
on the standbys.

When node1 dies and node2 is promoted to primary, node3 can never catch up with 
node2, and emits the following message repeatedly:

LOG:  out-of-sequence timeline ID 140 (after 141) in log file 652, segment 117, 
offset 0

The expected behavior is that node3 catches up with node2 and stays in sync.
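
As a reading aid for the message above: in 9.2 a WAL segment file name is three 
8-digit hexadecimal fields (timeline ID, log file number, segment number), so 
"log file 652, segment 117" on timeline 140 corresponds to the archived file 
0000008C0000028C00000075 quoted in the logs under CAUSE.  The Python sketch 
below is only an illustration.  The function names are made up for this 
example, and the 16 MB segment size is the default, assumed here rather than 
confirmed for the customer's build.

```python
# Illustrative helpers only; these names are hypothetical, not PostgreSQL APIs.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16 MB)


def wal_file_name(timeline, log, seg):
    """Build a 9.2-style WAL segment file name from its three numeric parts."""
    return "%08X%08X%08X" % (timeline, log, seg)


def parse_wal_file_name(name):
    """Split a 24-character WAL file name into (timeline, log, seg)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment file name")
    return (int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16))


def locate_lsn(lsn):
    """Map an LSN such as '28C/75FFF738' to (log, seg, offset within segment)."""
    log_hex, off_hex = lsn.split("/")
    log = int(log_hex, 16)
    recoff = int(off_hex, 16)
    return (log, recoff // WAL_SEGMENT_SIZE, recoff % WAL_SEGMENT_SIZE)


if __name__ == "__main__":
    print(wal_file_name(140, 652, 117))
    print(parse_wal_file_name("0000008C0000028C00000075"))
    log, seg, offset = locate_lsn("28C/75FFF738")
    print(log, seg, WAL_SEGMENT_SIZE - offset)
```

Under these assumptions, the "redo done at 28C/75FFF738" position reported by 
node2 falls in log file 652, segment 117, with 2248 bytes left before the 
segment boundary.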


CAUSE
========================================

The processing went as follows.

1. node1's timeline is 140.  It wrote a WAL record at the end of WAL segment 
117.  The WAL record didn't fit in the last page, so it was split across 
segments 117 and 118.

2. WAL segment 117 was archived.

3. node1 went down, and node2 was promoted.

4. As part of the recovery process, node2 retrieved WAL segment 117 from 
archive.  It found a WAL record fragment at the end of the segment but could 
not find the remaining fragment in segment 118, so node2 stopped recovery there.

LOG:  restored log file "0000008C0000028C00000075" from archive
LOG:  received promote request
LOG:  redo done at 28C/75FFF738

5. node2 becomes the primary, and its timeline becomes 141.  node3 is 
disconnected by node2 (but later reconnects to node2).

LOG:  terminating all walsender processes to force cascaded standby(s) to 
update timeline and reconnect

6. node3 retrieves and applies WAL segment 117 from archive.

LOG:  restored log file "0000008C0000028C00000075" from archive

7. node3 finds the .history file for timeline 141 and switches its timeline to 
141.

8. Because node3 found a WAL record fragment at the end of segment 117, it 
expects to find the remaining fragment at the beginning of WAL segment 118 
streamed from node2.  But it finds a fragment of a different WAL record there, 
because node2 wrote a different WAL record spanning the end of segment 117 and 
the beginning of 118.

LOG:  invalid contrecord length 5892 in log file 652, segment 118, offset 0

9. node3 then retrieves segment 117 from archive again to get the WAL record at 
the end of segment 117.  However, as node3's timeline is already 141, it 
complains about the older timeline when it sees the timeline 140 at the 
beginning of segment 117.

LOG:  out-of-sequence timeline ID 140 (after 141) in log file 652, segment 117, 
offset 0
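
To make the loop in steps 6 to 9 easier to follow, here is a toy Python model 
of it.  It is not the xlog.c code.  It only assumes, as the log message 
suggests, that the reader rejects any page whose timeline ID is lower than the 
last one it accepted.  The class and function names are hypothetical.

```python
# Toy model of the stuck loop described in steps 6-9.
# All names are hypothetical; this is not how xlog.c is structured.


class ToyStandby:
    def __init__(self):
        # Timeline ID of the last page accepted while reading sequentially.
        self.last_seen_tli = 0

    def read_page(self, segment, page_tli):
        """Return an error message if the page's timeline goes backwards."""
        if page_tli < self.last_seen_tli:
            return ("out-of-sequence timeline ID %d (after %d) in segment %d"
                    % (page_tli, self.last_seen_tli, segment))
        self.last_seen_tli = page_tli
        return None


def simulate(retries=3):
    node3 = ToyStandby()
    messages = []
    # Step 6: segment 117 is restored from archive; it was written on TLI 140.
    node3.read_page(117, 140)
    # Steps 7-8: node3 switches to TLI 141 and reads segment 118 from node2.
    node3.read_page(118, 141)
    # Step 9: every retry goes back to the archived copy of segment 117,
    # which still carries TLI 140, so the same complaint repeats.
    for _ in range(retries):
        messages.append(node3.read_page(117, 140))
    return messages


if __name__ == "__main__":
    for message in simulate():
        print(message)
```

In this model nothing ever lowers last_seen_tli or replaces the archived copy 
of segment 117, so each retry produces the same message, which matches the 
repeated log line the customer observed.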



Regards
Takayuki Tsunakawa

cascading_standby_stuck.patch
Description: cascading_standby_stuck.patch

