(sorry for the top post, bitchy K9 Mail) James,
are you sure you're scp'ing from the archive, not from pg_xlog? Regards, Gunnar "Nick" Bluth Am 27. Juli 2017 05:00:17 MESZ schrieb James Sewell <james.sew...@jirotech.com>: >Hi all, > >I've got two servers (A,B) which are part of a streaming replication >pair. >A is the master, B is a hot standby. I'm sending archived WAL to a >directory on A, B is reading it via SCP. > >This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm >currently working to reproduce with standard 9.6) > >We have recently seen a situation where B does not catch up when taken >offline for maintenance. > >When B is started we see the following in the logs: > >2017-07-27 11:55:57 AEST [21432]: [979-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AA" from >archive >2017-07-27 11:55:58 AEST [21432]: [980-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AB" from >archive >2017-07-27 11:55:58 AEST [21432]: [981-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AC" from >archive >2017-07-27 11:55:59 AEST [21432]: [982-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AD" from >archive >2017-07-27 11:55:59 AEST [21432]: [983-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AE" from >archive >2017-07-27 11:56:00 AEST [21432]: [984-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000AF" from >archive >2017-07-27 11:56:00 AEST [21432]: [985-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B0" from >archive >2017-07-27 11:56:01 AEST [21432]: [986-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B1" from >archive >2017-07-27 11:56:01 AEST [21432]: [987-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B2" from >archive >2017-07-27 11:56:02 AEST [21432]: [988-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B3" from >archive >2017-07-27 11:56:02 AEST [21432]: [989-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B4" from >archive >2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client= >(0:00000)LOG: restored log file "0000000C0000005A000000B5" from >archive >scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory >2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client= >(0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on >timeline 12 >2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client= >(0:XX000)FATAL: could not receive data from WAL stream: ERROR: >requested WAL segment 0000000C0000005A000000B5 has already been >removed > >scp: /archive/xlog//0000000D.history: No such file or directory >scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory >2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client= >(0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on >timeline 12 >2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client= >(0:XX000)FATAL: could not receive data from WAL stream: ERROR: >requested WAL segment 0000000C0000005A000000B5 has already been >removed > >This will loop indefinitely. At this stage the master reports no >connected >standbys in pg_stat_replication, and the standby has no running WAL >receiver process. > >This can be 'fixed' by running pg_switch_xlog() on the master, at which >time a connection is seen from the standby and the logs show the >following: > >scp: /archive/xlog//0000000D.history: No such file or directory >2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client= >(0:00000)LOG: > restored log file "0000000C0000005A000000B5" from archive >scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory >2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client= >(0:00000)LOG: > started streaming WAL from primary at 5A/B5000000 on timeline 12 >2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client= >(0:XX000)FATAL: > could not receive data from WAL stream: ERROR: requested WAL segment >0000000C0000005A000000B5 has already been removed > >scp: /archive/xlog//0000000D.history: No such file or directory >2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client= >(0:00000)LOG: > restored log file "0000000C0000005A000000B5" from archive >2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client= >(0:00000)LOG: > restored log file "0000000C0000005A000000B6" from archive >scp: /archive/xlog//0000000C0000005A000000B7: No such file or directory >2017-07-27 12:03:25 AEST [21432]: [1032-1] user=,db=,client= >(0:00000)LOG: >unexpected pageaddr 56/E7000000 in log segment >0000000C0000005A000000B7, >offset 0 >2017-07-27 12:03:25 AEST [63196]: [1-1] user=,db=,client= >(0:00000)LOG: > started streaming WAL from primary > >At this time replication is running as normal and all changes are >streamed. > >Am I missing something here, this seems very odd. One thing I have >noticed >is it only seems to be caused after a lot of WAL is produced and the >pg_xlog >directory is sitting at max_wal_size > >James Sewell, >PostgreSQL Team Lead / Solutions Architect > > > >Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009 >*P *(+61) 2 8099 9000 <(+61)%202%208099%209000> *W* www.jirotech.com >*F * >(+61) 2 8099 9099 <(+61)%202%208099%209000> > >-- > >------------------------------ >The contents of this email are confidential and may be subject to legal >or >professional privilege and copyright. No representation is made that >this >email is free of viruses or other defects. If you have received this >communication in error, you may not copy or distribute any part of it >or >otherwise disclose its contents to anyone. Please advise the sender of >your >incorrect receipt of this correspondence. -- Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail gesendet.