Greetings,

We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.

We recently experienced two similar outages on two different prod
databases. The error messages from the logs were as follows:

LOG:  server process (PID 138529) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

We are still investigating what may have triggered these errors, since
there were no recent changes to these databases. Unfortunately, core dumps
were not configured correctly, so we may have to wait for the next outage
before we can do a good root cause analysis.

My question, meanwhile, is about what remedial action to take when this
happens.

In one case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 365/FDFA738
LOG:  invalid record length at 365/12420978: wanted 24, got 0
LOG:  redo done at 365/12420950
LOG:  last completed transaction was at log time 2018-06-05 10:59:27.049443-05
LOG:  checkpoint starting: end-of-recovery immediate
LOG:  checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s, total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s; distance=39064 kB, estimate=39064 kB
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

In that case, the database restarted immediately, with only 30 seconds of
downtime.

In the other case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  dynamic shared memory control segment is corrupt
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7

In that case, the database did not restart on its own. It was 5 am on
Sunday, so the on-call SRE just manually started the database up, and it
appears to have been running fine since.
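
To make the difference between the two cases concrete, here is a minimal
sketch of how the two outcomes could be told apart from the log alone. It
assumes that the two message strings quoted above are reliable markers of
"crash restart began" and "crash restart finished"; the function name
recovered_after_crash is hypothetical, and this is an illustration rather
than a tested monitoring check.

```python
from typing import Iterable, Optional

# Message strings taken from the log excerpts above. Treating them as
# stable markers is an assumption of this sketch.
CRASH_MARKER = "all server processes terminated; reinitializing"
READY_MARKER = "database system is ready to accept connections"


def recovered_after_crash(log_lines: Iterable[str]) -> Optional[bool]:
    """Report whether the most recent crash restart completed.

    Returns True if a "ready" message follows the last crash marker,
    False if the last crash marker is never followed by "ready",
    and None if the log contains no crash marker at all.
    """
    state: Optional[bool] = None
    for line in log_lines:
        if CRASH_MARKER in line:
            # A new crash restart began; it has not completed yet.
            state = False
        elif READY_MARKER in line and state is False:
            # The pending crash restart reached the ready state.
            state = True
    return state
```

Fed the first excerpt, this returns True; fed the second, it returns False,
which is the situation that needed a manual start.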

My question is whether the corrupt dynamic shared memory control segment,
and the failure of Postgres to restart automatically, mean that the database
should not be started back up automatically, and whether there is anything
we should do before restarting it.

Do we potentially have corrupt data or indices as a result of our last
outage? If so, what should we do to investigate?

Thanks,
Sherrylyn
