On Fri, Aug 12, 2016 at 1:39 AM, James Sewell <james.sew...@jirotech.com>
wrote:

> Hello,
>
> We recently experienced a critical failure when failing over to a DR
> environment.
>
> This is in the following environment:
>
>
>    - 3 x PostgreSQL machines in Prod in a sync replication cluster
>    - 3 x PostgreSQL machines in DR, with a single machine async and the
>    other two cascading from the first machine.
>
> There was a network failure which isolated Production from everything else.
> Production had no errors during this time (and has now come back OK).
>
> DR did not tolerate the break: the following appeared in the logs and none
> of the DR machines can start postgres. There were no queries coming into
> DR at the time of the break.
>
> Please note that the "Host Key verification failed" messages are due to
> the scp command not functioning. This means restore_command is not working
> to restore from the XLOG archive, but should not affect anything else.
>
> 2016-08-12 02:52:55 GMT [23205]: [9077-1] user=,db=,client=  (0:00000)LOG:
>  restartpoint starting: time
> 2016-08-12 02:57:25 GMT [23205]: [9078-1] user=,db=,client=  (0:00000)LOG:
>  restartpoint complete: wrote 13361 buffers (0.6%); 0 transaction log
> file(s) added, 0 removed, 3 recycled; write=269.972 s, sync=0.031 s,
> total=270.008 s; sync files=41, longest=0.004 s, average=0.000 s;
> distance=37747 kB, estimate=48968 kB
> 2016-08-12 02:57:25 GMT [23205]: [9079-1] user=,db=,client=  (0:00000)LOG:
>  recovery restart point at 3/7C7B0228
> 2016-08-12 02:57:25 GMT [23205]: [9080-1] user=,db=,client=
>  (0:00000)DETAIL:  last completed transaction was at log time 2016-08-12
> 02:57:24.033588+00
> 2016-08-12 02:57:55 GMT [23205]: [9081-1] user=,db=,client=  (0:00000)LOG:
>  restartpoint starting: time
> 2016-08-12 03:02:25 GMT [23205]: [9082-1] user=,db=,client=  (0:00000)LOG:
>  restartpoint complete: wrote 7196 buffers (0.3%); 0 transaction log
> file(s) added, 0 removed, 2 recycled; write=269.988 s, sync=0.014 s,
> total=270.007 s; sync files=42, longest=0.002 s, average=0.000 s;
> distance=79888 kB, estimate=79888 kB
> 2016-08-12 03:02:25 GMT [23205]: [9083-1] user=,db=,client=  (0:00000)LOG:
>  recovery restart point at 3/815B42C8
> 2016-08-12 03:02:25 GMT [23205]: [9084-1] user=,db=,client=
>  (0:00000)DETAIL:  last completed transaction was at log time 2016-08-12
> 03:02:15.07639+00
> 2016-08-12 03:02:55 GMT [23205]: [9085-1] user=,db=,client=  (0:00000)LOG:
>  restartpoint starting: time
> 2016-08-12 03:04:00 GMT [22350]: [2-1] user=,db=,client=  (0:XX000)FATAL:
>  terminating walreceiver due to timeout
> Host key verification failed.
> Host key verification failed.
> 2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client=  (0:XX000)FATAL:
>  invalid memory alloc request size 3445219328
> 2016-08-12 03:04:01 GMT [23182]: [5-1] user=,db=,client=  (0:00000)LOG:
>  startup process (PID 23188) exited with exit code 1
> 2016-08-12 03:04:01 GMT [23182]: [6-1] user=,db=,client=  (0:00000)LOG:
>  terminating any other active server processes
> 2016-08-12 03:04:01 GMT [22285]: [1-1] 
> user=replication,db=[unknown],client=10.8.0.14
> 10.8.0.14(55826) (0:57P02)WARNING:  terminating connection because of crash
> of another server process
> 2016-08-12 03:04:01 GMT [22285]: [2-1] 
> user=replication,db=[unknown],client=10.8.0.14
> 10.8.0.14(55826) (0:57P02)DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-08-12 03:04:01 GMT [22285]: [3-1] 
> user=replication,db=[unknown],client=10.8.0.14
> 10.8.0.14(55826) (0:57P02)HINT:  In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-08-12 03:04:01 GMT [22286]: [1-1] 
> user=replication,db=[unknown],client=10.8.0.32
> 10.8.0.32(56442) (0:57P02)WARNING:  terminating connection because of crash
> of another server process
> 2016-08-12 03:04:01 GMT [22286]: [2-1] 
> user=replication,db=[unknown],client=10.8.0.32
> 10.8.0.32(56442) (0:57P02)DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-08-12 03:04:01 GMT [22286]: [3-1] 
> user=replication,db=[unknown],client=10.8.0.32
> 10.8.0.32(56442) (0:57P02)HINT:  In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-08-12 03:04:01 GMT [23182]: [7-1] user=,db=,client=  (0:00000)LOG:
>  archiver process (PID 23207) exited with exit code 1
> 2016-08-12 04:43:51 GMT [23614]: [1-1] user=,db=,client=  (0:00000)LOG:
>  database system was interrupted while in recovery at log time 2016-08-12
> 02:53:19 GMT
> 2016-08-12 04:43:51 GMT [23614]: [2-1] user=,db=,client=  (0:00000)HINT:
>  If this has occurred more than once some data might be corrupted and you
> might need to choose an earlier recovery target.
> 2016-08-12 04:43:51 GMT [23615]: [1-1] 
> user=postgres,db=postgres,client=[local]
> [local] (0:57P03)FATAL:  the database system is starting up
> Host key verification failed.
> 2016-08-12 04:43:51 GMT [23614]: [3-1] user=,db=,client=  (0:00000)LOG:
>  entering standby mode
> Host key verification failed.
> Host key verification failed.
> 2016-08-12 04:43:51 GMT [23614]: [4-1] user=,db=,client=  (0:00000)LOG:
>  redo starts at 3/815B42C8
> Host key verification failed.
> Host key verification failed.
> Host key verification failed.
> 2016-08-12 04:43:52 GMT [23643]: [1-1] 
> user=postgres,db=postgres,client=[local]
> [local] (0:57P03)FATAL:  the database system is starting up
> Host key verification failed.
> Host key verification failed.
> Host key verification failed.
> Host key verification failed.
> 2016-08-12 04:43:53 GMT [23614]: [5-1] user=,db=,client=  (0:00000)LOG:
>  consistent recovery state reached at 3/8811DFF0
> 2016-08-12 04:43:53 GMT [23614]: [6-1] user=,db=,client=  (0:XX000)FATAL:
>  invalid memory alloc request size 3445219328
> 2016-08-12 04:43:53 GMT [23612]: [3-1] user=,db=,client=  (0:00000)LOG:
>  database system is ready to accept read only connections
> 2016-08-12 04:43:53 GMT [23612]: [4-1] user=,db=,client=  (0:00000)LOG:
>  startup process (PID 23614) exited with exit code 1
> 2016-08-12 04:43:53 GMT [23612]: [5-1] user=,db=,client=  (0:00000)LOG:
>  terminating any other active server processes
> 2016-08-12 04:43:53 GMT [23612]: [6-1] user=,db=,client=  (0:00000)LOG:
>  archiver process (PID 23627) exited with exit code 1
> 2016-08-12 04:44:04 GMT [23710]: [1-1] user=,db=,client=  (0:00000)LOG:
>  database system was interrupted while in recovery at log time 2016-08-12
> 02:53:19 GMT
> 2016-08-12 04:44:04 GMT [23710]: [2-1] user=,db=,client=  (0:00000)HINT:
>  If this has occurred more than once some data might be corrupted and you
> might need to choose an earlier recovery target.
>
> Does anyone have any ideas? It looks to me like some sort of bug / error
> with the replication protocol or maybe some corruption on the master which
> wasn't noticed and fed across?
>
> If that's the case, would checksums help here? Are they computed on the
> standby?
>
> Cheers,
>
>
> James Sewell,
> Solutions Architect
>
>
>
> Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
> *P* (+61) 2 8099 9000  *W* www.jirotech.com
> *F* (+61) 2 8099 9099


>2016-08-12 03:04:00 GMT [23188]: [9-1] user=,db=,client=  (0:XX000)FATAL:
 invalid memory alloc request size 3445219328

I'm not sure, but I'd double-check your shared memory settings on the DR
machines, both in postgresql.conf (shared_buffers) and in the kernel
(/proc/sys/kernel/shmmax, or kernel.shmmax in /etc/sysctl.conf).
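
For the kernel side of that check, here is a minimal Python sketch that reads
the shmmax limit and compares it with the size in the FATAL line. The function
names (read_shmmax, describe) are made up for illustration, and the comparison
only tests the guess above; it does not show that shared memory limits are the
cause of the failure.

```python
from pathlib import Path

# Size taken from the log line "invalid memory alloc request size 3445219328".
FAILED_REQUEST_BYTES = 3445219328

# Standard Linux location of the SHMMAX limit (bytes).
SHMMAX_PATH = "/proc/sys/kernel/shmmax"


def read_shmmax(path=SHMMAX_PATH):
    """Return the kernel SHMMAX value, in bytes, read from the given file."""
    return int(Path(path).read_text().strip())


def describe(shmmax, request=FAILED_REQUEST_BYTES):
    """Say whether shmmax is below, or at or above, the failed request size."""
    gib = 1024 ** 3
    relation = "below" if shmmax < request else "at or above"
    return (
        f"shmmax={shmmax} bytes ({shmmax / gib:.2f} GiB) is {relation} "
        f"the failed request of {request} bytes ({request / gib:.2f} GiB)"
    )


if __name__ == "__main__":
    # Only meaningful on a Linux host; skip quietly elsewhere.
    if Path(SHMMAX_PATH).exists():
        print(describe(read_shmmax()))
```

Run it on each DR machine and compare the output across them; a value that
differs from Production would be worth a closer look.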




-- 
*Melvin Davidson*
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.
