We have a Postgres server (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, 
compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit) that does 
streaming replication to some slaves and has another set of slaves replaying 
the WAL archive for WAL-based replication. We had a bit of fun yesterday when, 
suddenly, the master started spewing errors like:

2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  
invalid memory alloc request size 1968078400
2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  
invalid page header in block 2948 of relation 
pg_tblspc/16435/PG_9.1_201105231/188417/56951641
2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  
could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" 
(target block 3936767042): No such file or directory
2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  
could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" 
(target block 3936767042): No such file or directory
2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  
could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" 
(target block 3936767042): No such file or directory
2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  
invalid memory alloc request size 1968078400
2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  
could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" 
(target block 3936767042): No such file or directory
2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  
invalid page header in block 38887 of relation 
pg_tblspc/16435/PG_9.1_201105231/188417/58206627
2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  
invalid page header in block 2368 of relation 
pg_tblspc/16435/PG_9.1_201105231/188417/60418945

There didn't seem to be any pattern to which files were affected, and this was 
a critical server, so once we realized a simple REINDEX wasn't going to fix 
things, we shut the master down and promoted a slave as the new master.
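
(For reference, the paths in those errors are filenode paths, laid out as 
pg_tblspc/<tablespace oid>/<catalog version>/<database oid>/<relfilenode>, so 
188417 is a database OID and the last component is a relfilenode. The rough 
Python sketch below is one way to map them back to relation names; the DSN and 
database name are placeholders, and you have to connect to the database whose 
OID appears in the path.)

# Rough sketch: map a logged filenode path back to a relation name by
# looking the last path component up in pg_class of the database whose
# OID appears in the path (188417 in the logs above).
import psycopg2

def relation_for_filenode(dsn, filenode):
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT n.nspname, c.relname "
            "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace "
            "WHERE c.relfilenode = %s",
            (filenode,))
        return cur.fetchone()  # None if nothing in this database matches
    finally:
        conn.close()

if __name__ == "__main__":
    # e.g. .../188417/56951641 from the log above; "mydb" is a placeholder
    # for whatever SELECT datname FROM pg_database WHERE oid = 188417
    # returns on your cluster.
    print(relation_for_filenode("dbname=mydb", 56951641))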

While the failover seemed to fix those errors, we soon noticed problems with 
missing clog files. The missing clogs were outside the range of the existing 
ones, so we tried papering over the gap with dummy clog files (sketched 
below). It didn't help, and running pg_check we found that one block of one 
table was definitely corrupt. Worse, that corruption had spread to all our 
replicas.
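
(By "dummy clog files" I mean full-size pg_clog segments filled with a 
constant status byte, roughly as in the sketch below: assuming the standard 
segment size of 256 kB and 2 status bits per transaction, 0x55 marks every 
transaction committed and 0x00 marks them all in progress. The data directory 
and segment name are placeholders, and this only papers over the missing 
files; it doesn't recover the real commit status.)

# Rough sketch of a dummy pg_clog segment: 32 pages of 8 kB, filled with a
# constant status byte.  0x55 = four "committed" statuses per byte,
# 0x00 = "in progress".  A band-aid, not a repair.
SEGMENT_SIZE = 32 * 8192  # 256 kB per clog segment

def write_dummy_clog(path, committed=True):
    fill = b"\x55" if committed else b"\x00"
    with open(path, "wb") as f:
        f.write(fill * SEGMENT_SIZE)

if __name__ == "__main__":
    # Data directory and segment name are placeholders for whichever
    # file the server complained about.
    write_dummy_clog("/var/lib/postgresql/9.1/main/pg_clog/0123")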

I know this is a little sparse on details, but my questions are:

1. What kind of fault should I be looking to fix? Because the corruption 
spread to all the replicas, both those that stream and those that replicate by 
replaying WALs from the archive, I assume it's not a storage issue. (My 
understanding is that streaming replicas stream their changes from the 
master's memory, not from the WAL files.) So that leaves bad memory on the 
master, or a bug in Postgres. Or a flawed assumption... :)

2. Is it possible that the corruption on the master got replicated to the 
slaves when I tried to cleanly shut down the master, before promoting one of 
the slaves to be the new master and switching the other slaves over to 
replicate from it?

