Hi All,

We are having a thorny problem I'm hoping someone will be able to help with.

We have a pair of machines set up as an active / hot standby pair. The database
they contain is quite large - approx. 9 TB. They were working fine on 9.1, and
we recently upgraded the active DB to 9.2.1.

After upgrading the active DB, we re-mirrored the standby (using pg_basebackup) 
and started it up. It began replaying the WAL files as expected.

After a few hours this happened:

WARNING:  page 1 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1123460086 
is uninitialized
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, 
lastBlockVacuumed 0
PANIC:  WAL contains references to invalid pages
CONTEXT:  xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, 
lastBlockVacuumed 0
LOG:  startup process (PID 24195) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

We tried starting it up again, and the same thing happened.

After some googling and re-reading the release notes, we noticed the mention in 
the 9.2.1 release notes about the potential for corrupted visibility maps, so 
as per the recommendation we did a full VACUUM of the whole database (with 
vacuum_freeze_table_age set to zero), then re-mirrored the standby again.

After re-mirroring was completed, we started the standby again. Strangely, it
reached consistency after replaying only 33 WAL files. Since the base backup
took 5 days to complete, this does not seem right to me. Anyway, WAL recovery
continued, with occasional warnings like this:

[2013-02-04 10:30:51 EST]  13546@  WARNING:  xlog min recovery request 
1A13A/9BC425A0 is past current point 19F1E/725043E8
[2013-02-04 10:30:51 EST]  13546@  CONTEXT:  writing block 0 of relation 
pg_tblspc/16408/PG_9.2_201204301/16409/12525_vm
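
To put a number on how far "past current point" that request is, here is a
minimal sketch that parses the two LSNs from the warning and compares them. The
helper name parse_lsn is made up for illustration; the only assumption is that
an LSN is two hexadecimal fields separated by a slash (log id / offset).

```python
def parse_lsn(text):
    """Split an 'xlogid/xrecoff' LSN string into a pair of integers."""
    hi, lo = text.split("/")
    return int(hi, 16), int(lo, 16)

# Values copied from the WARNING line above.
requested = parse_lsn("1A13A/9BC425A0")  # "xlog min recovery request"
current = parse_lsn("19F1E/725043E8")    # "current point"

# Tuples compare element by element, so this orders LSNs correctly.
is_ahead = requested > current

# Difference in the high (log id) field only.
logid_gap = requested[0] - current[0]

print(is_ahead, logid_gap)
```

The log id fields differ by 540. If each log id covers roughly 4 GB of WAL
(an assumption based on the default 16 MB segment size), the page being written
carries an LSN on the order of 2 TB of WAL ahead of the replay position.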

After a few hours, this happened:

[2013-02-04 13:43:24 EST]  13538@  WARNING:  page 1248 of relation 
pg_tblspc/16408/PG_9.2_201204301/16409/1128746393 does not exist
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 
16408/16409/1128746393; blk 1248
[2013-02-04 13:43:24 EST]  13538@  PANIC:  WAL contains references to invalid 
pages
[2013-02-04 13:43:24 EST]  13538@  CONTEXT:  xlog redo visible: rel 
16408/16409/1128746393; blk 1248
[2013-02-04 13:43:25 EST]  13532@  LOG:  startup process (PID 13538) was 
terminated by signal 6: Aborted
[2013-02-04 13:43:25 EST]  13532@  LOG:  terminating any other active server 
processes

This looks similar to the first case, but with a different context ('redo
visible' rather than 'redo vacuum'). We thought that perhaps an index had become
corrupted (apparently also a possibility with the bug mentioned above); however,
the file mentioned belongs to a normal table, not an index. And 'redo visible'
sounds like it might be to do with the visibility map?
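
One check that can be run on the standby is whether the relation file is
really shorter than the WAL record expects. The following is a rough sketch
only: it assumes the default 8 kB block size and 1 GB relation segments, and
the function names are hypothetical, not part of any PostgreSQL tool.

```python
import os

BLOCK_SIZE = 8192                  # assumed: default PostgreSQL page size
SEGMENT_SIZE = 1024 * 1024 * 1024  # assumed: default 1 GB relation segment

def locate_block(block_number):
    """Return (segment_number, byte_offset) for a block in a relation."""
    blocks_per_segment = SEGMENT_SIZE // BLOCK_SIZE
    segment, block_in_segment = divmod(block_number, blocks_per_segment)
    return segment, block_in_segment * BLOCK_SIZE

def block_exists(segment_path, block_number):
    """True if the segment file is long enough to hold the whole block.

    segment_path must point at the segment returned by locate_block():
    the bare relfilenode file for segment 0, or '<relfilenode>.N' for
    segment N.
    """
    _, offset = locate_block(block_number)
    return os.path.getsize(segment_path) >= offset + BLOCK_SIZE

# Block 1248 from the 'redo visible' record above.
print(locate_block(1248))
```

Under those assumptions block 1248 falls in the first segment at byte offset
10223616, so a file 16409/1128746393 smaller than 10231808 bytes on the standby
would be consistent with the "does not exist" warning.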

We restarted it again with debugging cranked up. It didn't reveal anything more 
interesting. We then upgraded the standby to 9.2.2 and started it again. Again 
no dice. In each case it fails at exactly the same point with the same error.

Any ideas for a next troubleshooting step?

Regards // Mike


