[HACKERS] Funny WAL corruption issue

Chris Travers Thu, 10 Aug 2017 05:10:46 -0700

Hi;

I ran into a funny situation today regarding PostgreSQL replication and wal
corruption and wanted to go over what I think happened and what I wonder
about as a possible solution.


Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB
database with variable load.  Master database has two slaves and generates
10-20MB of WAL traffic a second.  The data_checksum option is off.


The problem occurred when I attempted to restart the service on the slave
using pg_ctl (I believe the service had been started with sys V init
scripts).  On trying to restart, it gave me a nice "Invalid memory
allocation request" error and promptly stopped.

The main logs showed a lot of messages like before the restart:

2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any
process

2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any
process


On restart, the following was logged to stderr:

LOG:  entering standby mode

LOG:  redo starts at 1E39C/8B77B458

LOG:  consistent recovery state reached at 1E39C/E1117FF8

FATAL:  invalid memory alloc request size 3456458752

LOG:  startup process (PID 18167) exited with exit code 1

LOG:  terminating any other active server processes

LOG:  database system is shut down

After some troubleshooting I found that the wal segment had become corrupt,
I copied the correct one from the master and everything came up to present.

So It seems like somewhere something crashed big time on the back-end and
when we tried to restart, the wal ended in an invalid way.

I am wondering what can be done to prevent these sorts of things from
happening in the future if, for example, a replica dies in the middle of a
wal fsync.
-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more

[HACKERS] Funny WAL corruption issue

Reply via email to