Hello,

* Context *

I'm observing problems when provisioning a standby from the master by following the basic, documented "Making a Base Backup" [1] procedure with rsync if, in the meantime, heavy load is applied to the master.

After searching the archives, the only similar issue I found discussed at any length was the one hit by Daniel Farina in the thread "hot backups: am I doing it wrong, or do we have a problem with pg_clog?" [2], but it seems that issue was dismissed because of the non-standard backup procedure Daniel used. However, I'm observing the same error with a simple procedure, hence this message.

* Details *

Procedure:

1. Start load generator on the master (WAL archiving enabled).
2. Prepare a Streaming Replication standby (accepting WAL files too):
2.1. pg_switch_xlog() on the master;
2.2. pg_start_backup('backup_under_load') on the master (this will take a while as the master is loaded up);
2.3. rsync data/global/pg_control to the standby;
2.4. rsync all other data/ (without pg_xlog) to the standby;
2.5. pg_stop_backup() on the master;
2.6. Wait to receive all WAL files, generated during the backup, on the standby;
2.7. Start the standby PG instance.
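
To make steps 2.1 to 2.5 concrete, here is a rough dry-run sketch of the commands involved. It only builds and prints the command lines and does not run anything. The host name, the data directory paths and the exact rsync options are assumptions for illustration, not the literal commands I ran; steps 2.6 and 2.7 are left out.

```python
import shlex


def backup_commands(master_data, standby_host, standby_data,
                    label="backup_under_load"):
    """Return the commands for steps 2.1-2.5, in order, as argument lists."""
    target = f"{standby_host}:{standby_data}"

    def sql(query):
        # psql invocation against the master; connection options omitted
        return ["psql", "-At", "-c", query]

    return [
        # 2.1 force a switch to a new WAL segment
        sql("SELECT pg_switch_xlog()"),
        # 2.2 mark the start of the base backup
        sql(f"SELECT pg_start_backup('{label}')"),
        # 2.3 copy pg_control first
        ["rsync", "-a", f"{master_data}/global/pg_control",
         f"{target}/global/pg_control"],
        # 2.4 copy everything else, skipping pg_xlog contents and pg_control
        # (the exclude patterns are my reading of "all other data/")
        ["rsync", "-a", "--exclude=/pg_xlog/*",
         "--exclude=/global/pg_control",
         f"{master_data}/", f"{target}/"],
        # 2.5 mark the end of the base backup
        sql("SELECT pg_stop_backup()"),
    ]


# Hypothetical host and paths, printed as shell-quoted command lines.
for cmd in backup_commands("/opt/PostgreSQL/9.1/data",
                           "standby.example.com",
                           "/opt/PostgreSQL/9.1/data"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The order matters here: both rsync runs fall strictly between pg_start_backup() and pg_stop_backup().
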

The last step will usually fail with an error similar to this:

2011-09-21 13:41:05 CEST LOG: database system was interrupted; last known up at 2011-09-21 13:40:50 CEST
Restoring 00000014.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file or directory
Restoring 00000013.history
2011-09-21 13:41:05 CEST LOG: restored log file "00000013.history" from archive
2011-09-21 13:41:05 CEST LOG: entering standby mode
Restoring 0000001300000006000000DC
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DC" from archive
Restoring 0000001300000006000000DB
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DB" from archive
2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673
2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at offset 32768: Success.
2011-09-21 13:41:05 CEST LOG: startup process (PID 13819) exited with exit code 1
2011-09-21 13:41:05 CEST LOG: aborting startup due to startup process failure

The procedure works very reliably if there is little or no load on the master, but fails very often with the pg_clog error when the load generator (a few thousand SELECTs, ~60 INSERTs, ~60 DELETEs and ~60 UPDATEs per second) is started.

I assumed that a file system backup taken between pg_start_backup and pg_stop_backup is guaranteed to be consistent, and that any missing pieces will be taken from the WAL files generated and shipped during the backup, but is that really so? Is this procedure missing some steps? Or is this maybe a known issue?

Thank you,
Linas

[1] http://www.postgresql.org/docs/current/static/continuous-archiving.html
[2] http://archives.postgresql.org/pgsql-hackers/2011-04/msg01132.php
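
P.S. For reference, the file name and offset in the FATAL/DETAIL lines follow directly from the transaction ID. The sketch below redoes that arithmetic, assuming the default 8 kB block size, 2 status bits per transaction and 32 pages per pg_clog segment file; the function name is made up for illustration.

```python
BLCKSZ = 8192                # assumed: default PostgreSQL block size
CLOG_XACTS_PER_BYTE = 4      # 2 status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32  # pages per pg_clog segment file


def clog_location(xid):
    """Map a transaction ID to (pg_clog file name, byte offset of its page)."""
    xacts_per_page = BLCKSZ * CLOG_XACTS_PER_BYTE
    page = xid // xacts_per_page
    segment = page // SLRU_PAGES_PER_SEGMENT
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return f"pg_clog/{segment:04X}", offset


print(clog_location(1188673))  # ('pg_clog/0001', 32768)
```

So the startup process was looking for the page that holds the status of transaction 1188673 itself. The trailing "Success" is how errno 0 is printed, so this looks like a short read of the copied file rather than an I/O error, but that interpretation is an assumption on my part.
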