On Fri, Mar 14, 2014 at 7:32 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello, we found that PostgreSQL never completes archive recovery
> in some situations. This occurs on HEAD, 9.3.3, 9.2.7, 9.1.12.
>
> Restarting the server with archive recovery fails as follows just
> after it was killed with SIGKILL after pg_start_backup and some
> WAL writes but before pg_stop_backup.
>
> | FATAL: WAL ends before end of online backup
> | HINT: Online backup started with pg_start_backup() must be
> | ended with pg_stop_backup(), and all WAL up to that point must
> | be available at recovery.
>
> The mess is that, once in this situation, I could find no
> formal operation to exit from it.

Though this is not a formal way, you can exit from that situation by

(1) Remove recovery.conf and start the server with crash recovery
(2) Execute pg_start_backup() after crash recovery ends
(3) Copy backup_label to somewhere
(4) Execute pg_stop_backup() and shut down the server
(5) Copy backup_label back to $PGDATA
(6) Create recovery.conf and start the server with archive recovery

> In this situation, 'Backup start location' in controldata has
> some valid location but the corresponding 'end of backup' WAL record
> will never come.
>
> But I think PG cannot tell distinctly whether the
> 'end of backup' record does not exist at all or will come
> later, especially when the server starts as a streaming
> replication hot-standby.
>
> One solution for it would be a new parameter in recovery.conf
> which tells that the operator wants the server to start as if
> there had never been a backup label when this situation
> comes. It looks ugly and somewhat dangerous but seems necessary.
>
> The first attached file is the script to reproduce the problem, and
> the second is the patch trying to do what is described above.
>
> After applying this patch on HEAD and uncommenting the
> 'cancel_backup_label_on_failure = true' in test.sh, the test
> script runs as follows,
>
> | LOG: record with zero length at 0/2010F40
> | WARNING: backup_label was canceled.
> | HINT: server might have crashed during backup mode.
> | LOG: consistent recovery state reached at 0/2010F40
> | LOG: redo done at 0/2010DA0
>
> What do you think about this?

What about adding a new option to pg_resetxlog so that we can reset
pg_control's backup start location? Even after we've accidentally
entered the situation that you described, we could exit from it
by resetting the backup start location in pg_control. This option
also seems helpful for salvaging data, as a last resort, from a
corrupted backup.
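
For reference, steps (1)-(6) of the workaround could be laid out roughly
as below. This is only a sketch that builds the command list and runs
nothing. It is not tested against a real cluster. The function name
workaround_plan, the stash directory, the 'escape' label and the
'postgres' database are illustrative assumptions. It also assumes that
WAL archiving is already configured (pg_start_backup() needs it) and that
the old recovery.conf is moved aside and reused in step (6) rather than
written from scratch.

```python
import shlex
from pathlib import PurePosixPath


def workaround_plan(pgdata, stash_dir, label="escape", dbname="postgres"):
    """Return steps (1)-(6) as (description, argv) pairs. Nothing is executed."""
    pgdata = PurePosixPath(pgdata)
    stash = PurePosixPath(stash_dir)  # hypothetical scratch dir outside PGDATA
    recovery_conf = str(pgdata / "recovery.conf")
    backup_label = str(pgdata / "backup_label")
    saved_conf = str(stash / "recovery.conf")
    saved_label = str(stash / "backup_label")

    def pg_ctl(mode, *extra):
        # -w makes pg_ctl wait until the start or stop has completed
        return ["pg_ctl", mode, "-D", str(pgdata), "-w", *extra]

    def psql(sql):
        # -X skips ~/.psqlrc so the call behaves the same everywhere
        return ["psql", "-X", "-d", dbname, "-c", sql]

    quoted = label.replace("'", "''")  # escape quotes for the SQL literal

    return [
        # Assumption: keep the old recovery.conf instead of deleting it
        ("(1) move recovery.conf aside", ["mv", recovery_conf, saved_conf]),
        ("(1) start with crash recovery", pg_ctl("start")),
        ("(2) start a new backup", psql(f"SELECT pg_start_backup('{quoted}')")),
        ("(3) copy backup_label away", ["cp", backup_label, saved_label]),
        ("(4) end the backup", psql("SELECT pg_stop_backup()")),
        ("(4) shut down the server", pg_ctl("stop", "-m", "fast")),
        ("(5) copy backup_label back", ["cp", saved_label, backup_label]),
        # Assumption: step (6) reuses the recovery.conf saved in step (1)
        ("(6) restore recovery.conf", ["cp", saved_conf, recovery_conf]),
        ("(6) start with archive recovery", pg_ctl("start")),
    ]


if __name__ == "__main__":
    # Print the plan for review; the paths here are placeholders.
    for description, argv in workaround_plan("/path/to/pgdata", "/path/to/stash"):
        print(f"{description}: {shlex.join(argv)}")
```

Printing the plan instead of running it is deliberate: each step depends
on the previous one having succeeded, so it seems safer to review the
commands and run them by hand.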

Regards,

-- 
Fujii Masao