[HACKERS] Hot Backup with rsync fails at pg_clog if under load

Linas Virbalas Wed, 21 Sep 2011 07:55:32 -0700

Hello,

* Context *


I'm observing problems when provisioning a standby from the master by
following the basic, documented "Making a Base Backup" [1] procedure with
rsync if, in the meantime, heavy load is applied on the master.

After searching the archives, the only similar issue I found that was
discussed at any length was reported by Daniel Farina in the thread "hot
backups: am I doing it wrong, or do we have a problem with pg_clog?" [2].
It seems that issue was dismissed because of the non-standard backup
procedure Daniel used. However, I'm observing the same error with a simple
procedure, hence this message.

* Details *

Procedure:

1. Start load generator on the master (WAL archiving enabled).
2. Prepare a Streaming Replication standby (accepting WAL files too):
2.1. pg_switch_xlog() on the master;
2.2. pg_start_backup('backup_under_load') on the master (this will take a
while as the master is loaded up);
2.3. rsync data/global/pg_control to the standby;
2.4. rsync all other data/ (without pg_xlog) to the standby;
2.5. pg_stop_backup() on the master;
2.6. Wait to receive all WAL files, generated during the backup, on the
standby;
2.7. Start the standby PG instance.
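
For reference, steps 2.1 to 2.5 can be sketched as an ordered command plan.
This is only an illustration, not the exact commands I ran: the helper name
backup_plan, the data directory path, the host name "standby" and the rsync
flags (-a, --exclude=pg_xlog) are assumptions, and nothing is executed.

```python
from typing import List, Tuple


def backup_plan(data_dir: str, standby: str,
                label: str = "backup_under_load") -> List[Tuple[str, List[str]]]:
    """Return ordered (step, argv) pairs for steps 2.1-2.5; runs nothing.

    data_dir, standby and the rsync flags are hypothetical placeholders.
    """
    def sql(stmt: str) -> List[str]:
        # Run a single SQL statement on the master through psql.
        return ["psql", "-c", stmt]

    src = data_dir.rstrip("/")
    dst = f"{standby}:{src}"
    return [
        ("2.1", sql("SELECT pg_switch_xlog();")),
        ("2.2", sql(f"SELECT pg_start_backup('{label}');")),
        # pg_control is copied first, before the rest of the data directory.
        ("2.3", ["rsync", "-a", f"{src}/global/pg_control", f"{dst}/global/"]),
        # Everything else, skipping pg_xlog.
        ("2.4", ["rsync", "-a", "--exclude=pg_xlog", f"{src}/", f"{dst}/"]),
        ("2.5", sql("SELECT pg_stop_backup();")),
    ]


if __name__ == "__main__":
    for step, argv in backup_plan("/opt/PostgreSQL/9.1/data", "standby"):
        print(step, " ".join(argv))
```

Steps 2.6 and 2.7 (waiting for the archived WAL and starting the standby) are
left out because they depend on the local archive setup.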

The last step usually fails with an error like this:

2011-09-21 13:41:05 CEST LOG:  database system was interrupted; last known
up at 2011-09-21 13:40:50 CEST
Restoring 00000014.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file
or directory
Restoring 00000013.history
2011-09-21 13:41:05 CEST LOG:  restored log file "00000013.history" from
archive
2011-09-21 13:41:05 CEST LOG:  entering standby mode
Restoring 0000001300000006000000DC
2011-09-21 13:41:05 CEST LOG:  restored log file "0000001300000006000000DC"
from archive
Restoring 0000001300000006000000DB
2011-09-21 13:41:05 CEST LOG:  restored log file "0000001300000006000000DB"
from archive
2011-09-21 13:41:05 CEST FATAL:  could not access status of transaction
1188673
2011-09-21 13:41:05 CEST DETAIL:  Could not read from file "pg_clog/0001" at
offset 32768: Success.
2011-09-21 13:41:05 CEST LOG:  startup process (PID 13819) exited with exit
code 1
2011-09-21 13:41:05 CEST LOG:  aborting startup due to startup process
failure
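
As a sanity check, the file name and offset in the FATAL message can be
derived from the transaction ID. The sketch below assumes the default build
settings (8 kB blocks, 2 status bits per transaction, 32 pages per pg_clog
segment file); the function name clog_location is made up for illustration.

```python
# Assumed defaults: 8 kB block size, 2 status bits per transaction
# (4 transactions per byte), 32 pages per pg_clog segment file.
BLCKSZ = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE  # 32768 transactions per page


def clog_location(xid: int):
    """Map a transaction ID to (segment file name, byte offset of its page)."""
    page = xid // XACTS_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    offset = (page % PAGES_PER_SEGMENT) * BLCKSZ
    return "pg_clog/%04X" % segment, offset


print(clog_location(1188673))  # ('pg_clog/0001', 32768)
```

So transaction 1188673 lives in the fifth page of pg_clog/0001, which matches
the log. Reading that page needs the file to be at least 40960 bytes long;
the "Success" suffix may indicate a short read rather than an I/O error,
which would be consistent with a truncated copy of that file on the standby,
although the log alone does not prove it.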

The procedure works very reliably if there is little or no load on the
master, but fails very often with the pg_clog error when the load generator
(a few thousand SELECTs, ~60 INSERTs, ~60 DELETEs and ~60 UPDATEs per second)
is running.

I assumed that a file system backup taken between pg_start_backup and
pg_stop_backup is guaranteed to be recoverable to a consistent state, and
that any missing pieces will be taken from the WAL files generated and
shipped during the backup. Is that really the case?

Is this procedure missing some steps? Or is this a known issue?

Thank you,
Linas

[1] http://www.postgresql.org/docs/current/static/continuous-archiving.html
[2] http://archives.postgresql.org/pgsql-hackers/2011-04/msg01132.php


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
