On Fri, Aug 12, 2016 at 1:39 AM, James Sewell <james.sew...@jirotech.com> wrote:
> Hello, > > We recently experienced a critical failure when failing to a DR > environment. > > This is in the following environment: > > > - 3 x PostgreSQL machines in Prod in a sync replication cluster > - 3 x PostgreSQL machines in DR, with a single machine async and the > other two cascading from the first machine. > > There was network failure which isolated Production from everything else, > Production has no errors during this time (and has now come back OK). > > DR did not tolerate the break, the following appeared in the logs and none > of them can start postgres. There were no queries coming into DR at the > time of the break. > > Please note that the "Host Key verification failed" messages are due to > the scp command not functioning. This means restore_command is not working > to restore from the XLOG archive, but should not effect anything else. > > 2016-08-12 02:52:55 GMT [23205]: [9077-1] user=,db=,client= (0:00000)LOG: > restartpoint starting: time > 2016-08-12 02:57:25 GMT [23205]: [9078-1] user=,db=,client= (0:00000)LOG: > restartpoint complete: wrote 13361 buffers (0.6%); 0 transaction log > file(s) added, 0 removed, 3 recycled; write=269.972 s, sync=0.031 s, > total=270.008 s; sync files=41, longest=0.004 s, average=0.000 s; > distance=37747 kB, estimate=48968 kB > 2016-08-12 02:57:25 GMT [23205]: [9079-1] user=,db=,client= (0:00000)LOG: > recovery restart point at 3/7C7B0228 > 2016-08-12 02:57:25 GMT [23205]: [9080-1] user=,db=,client= > (0:00000)DETAIL: last completed transaction was at log time 2016-08-12 > 02:57:24.033588+00 > 2016-08-12 02:57:55 GMT [23205]: [9081-1] user=,db=,client= (0:00000)LOG: > restartpoint starting: time > 2016-08-12 03:02:25 GMT [23205]: [9082-1] user=,db=,client= (0:00000)LOG: > restartpoint complete: wrote 7196 buffers (0.3%); 0 transaction log > file(s) added, 0 removed, 2 recycled; write=269.988 s, sync=0.014 s, > total=270.007 s; sync files=42, longest=0.002 s, average=0.000 s; > distance=79888 kB, estimate=79888 kB > 2016-08-12 03:02:25 GMT [23205]: [9083-1] user=,db=,client= (0:00000)LOG: > recovery restart point at 3/815B42C8 > 2016-08-12 03:02:25 GMT [23205]: [9084-1] user=,db=,client= > (0:00000)DETAIL: last completed transaction was at log time 2016-08-12 > 03:02:15.07639+00 > 2016-08-12 03:02:55 GMT [23205]: [9085-1] user=,db=,client= (0:00000)LOG: > restartpoint starting: time > 2016-08-12 03:04:00 GMT [22350]: [2-1] user=,db=,client= (0:XX000)FATAL: > terminating walreceiver due to timeout > Host key verification failed.^M > Host key verification failed.^M > 2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client= (0:XX000)FATAL: > invalid memory alloc request size 3445219328 > 2016-08-12 03:04:01 GMT [23182]: [5-1] user=,db=,client= (0:00000)LOG: > startup process (PID 23188) exited with exit code 1 > 2016-08-12 03:04:01 GMT [23182]: [6-1] user=,db=,client= (0:00000)LOG: > terminating any other active server processes > 2016-08-12 03:04:01 GMT [22285]: [1-1] > user=replication,db=[unknown],client=10.8.0.14 > 10.8.0.14(55826) (0:57P02)WARNING: terminating connection because of crash > of another server process > 2016-08-12 03:04:01 GMT [22285]: [2-1] > user=replication,db=[unknown],client=10.8.0.14 > 10.8.0.14(55826) (0:57P02)DETAIL: The postmaster has commanded this server > process to roll back the current transaction and exit, because another > server process exited abnormally and possibly corrupted shared memory. > 2016-08-12 03:04:01 GMT [22285]: [3-1] > user=replication,db=[unknown],client=10.8.0.14 > 10.8.0.14(55826) (0:57P02)HINT: In a moment you should be able to > reconnect to the database and repeat your command. > 2016-08-12 03:04:01 GMT [22286]: [1-1] > user=replication,db=[unknown],client=10.8.0.32 > 10.8.0.32(56442) (0:57P02)WARNING: terminating connection because of crash > of another server process > 2016-08-12 03:04:01 GMT [22286]: [2-1] > user=replication,db=[unknown],client=10.8.0.32 > 10.8.0.32(56442) (0:57P02)DETAIL: The postmaster has commanded this server > process to roll back the current transaction and exit, because another > server process exited abnormally and possibly corrupted shared memory. > 2016-08-12 03:04:01 GMT [22286]: [3-1] > user=replication,db=[unknown],client=10.8.0.32 > 10.8.0.32(56442) (0:57P02)HINT: In a moment you should be able to > reconnect to the database and repeat your command. > 2016-08-12 03:04:01 GMT [23182]: [7-1] user=,db=,client= (0:00000)LOG: > archiver process (PID 23207) exited with exit code 1 > 2016-08-12 04:43:51 GMT [23614]: [1-1] user=,db=,client= (0:00000)LOG: > database system was interrupted while in recovery at log time 2016-08-12 > 02:53:19 GMT > 2016-08-12 04:43:51 GMT [23614]: [2-1] user=,db=,client= (0:00000)HINT: > If this has occurred more than once some data might be corrupted and you > might need to choose an earlier recovery target. > 2016-08-12 04:43:51 GMT [23615]: [1-1] > user=postgres,db=postgres,client=[local] > [local] (0:57P03)FATAL: the database system is starting up > Host key verification failed.^M > 2016-08-12 04:43:51 GMT [23614]: [3-1] user=,db=,client= (0:00000)LOG: > entering standby mode > Host key verification failed.^M > Host key verification failed.^M > 2016-08-12 04:43:51 GMT [23614]: [4-1] user=,db=,client= (0:00000)LOG: > redo starts at 3/815B42C8 > Host key verification failed.^M > Host key verification failed.^M > Host key verification failed.^M > 2016-08-12 04:43:52 GMT [23643]: [1-1] > user=postgres,db=postgres,client=[local] > [local] (0:57P03)FATAL: the database system is starting up > Host key verification failed.^M > Host key verification failed.^M > Host key verification failed.^M > Host key verification failed.^M > 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client= (0:00000)LOG: > consistent recovery state reached at 3/8811DFF0 > 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client= (0:XX000)FATAL: > invalid memory alloc request size 3445219328 > 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client= (0:00000)LOG: > database system is ready to accept read only connections > 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client= (0:00000)LOG: > startup process (PID 23614) exited with exit code 1 > 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client= (0:00000)LOG: > terminating any other active server processes > 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client= (0:00000)LOG: > archiver process (PID 23627) exited with exit code 1 > 2016-08-12 04:44:04 GMT [23710]: [1-1] user=,db=,client= (0:00000)LOG: > database system was interrupted while in recovery at log time 2016-08-12 > 02:53:19 GMT > 2016-08-12 04:44:04 GMT [23710]: [2-1] user=,db=,client= (0:00000)HINT: > If this has occurred more than once some data might be corrupted and you > might need to choose an earlier recovery target. > > Does anyone have any ideas? It looks to me like some sort of bug / error > with the replication protocol or maybe some corruption on the master which > wasn't noticed and fed across? > > If that's the case would checksums help here? Are the computed on the > standby? > > Cheers, > > > James Sewell, > Solutions Architect > > > > Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009 > *P *(+61) 2 8099 9000 <(+61)%202%208099%209000> *W* www.jirotech.com > *F *(+61) 2 8099 9099 <(+61)%202%208099%209000> > > ------------------------------ > The contents of this email are confidential and may be subject to legal or > professional privilege and copyright. No representation is made that this > email is free of viruses or other defects. If you have received this > communication in error, you may not copy or distribute any part of it or > otherwise disclose its contents to anyone. Please advise the sender of your > incorrect receipt of this correspondence. >2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client= (0:XX000)FATAL: invalid memory alloc request size 3445219328 I'm not sure, but I'd double check your shared_memory spec both in postgresql.conf and /proc/sys/kernel/shmmax (or /etc/sysctl.conf) in DR -- *Melvin Davidson* I reserve the right to fantasize. Whether or not you wish to share my fantasy is entirely up to you.