On Thu, Apr 21, 2011 at 7:15 AM, Daniel Farina <dan...@heroku.com> wrote:
> To start at the end of this story: "DETAIL: Could not read from file
> "pg_clog/007D" at offset 65536: Success."
>
> This is a message we received on a standby that we were bringing
> online as part of a test. The clog file was present, but apparently
> too small for Postgres (or at least I think this is what the message
> meant), so one could stub in another clog file and then continue
> recovery successfully (modulo the voodoo of stubbing in clog files in
> general). I am unsure whether this is due to an interesting race
> condition in Postgres or a result of my somewhat-interesting
> hot-backup protocol, which is slightly more involved than the norm.
> I will describe what it does here:
>
> 1) Call pg_start_backup
> 2) Crawl the entire Postgres cluster directory structure, except
>    pg_xlog, taking note of the size of every file present
> 3) Begin writing tar files, but *only up to the size noted during the
>    original crawl of the cluster directory*, so if a file grows
>    between the original size snapshot and the subsequent read(), the
>    extra bytes are not added to the tar
> 3a) If a file has been partially truncated, I pad the tarfile member
>    with "\0" bytes up to the size sampled in step 2, since I am
>    streaming the tar file and cannot seek back in the stream to
>    adjust the member's recorded size
> 4) Call pg_stop_backup
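To make the clog-stubbing workaround concrete: a minimal sketch,
assuming the standard SLRU segment size of 256 kB (32 pages of 8 kB);
zero-filled clog bytes read back as "transaction in progress", which
replay then overwrites with the real commit/abort status. This is
hypothetical code, not what Daniel actually ran:

    # Segment named in the error above; sized per pg_clog's SLRU layout
    # (SLRU_PAGES_PER_SEGMENT = 32, BLCKSZ = 8192 => 256 kB).
    CLOG_SEGMENT = "pg_clog/007D"
    SEGMENT_SIZE = 32 * 8192

    # Zero-extend the undersized segment so recovery can read past the
    # old EOF; truncate() to a larger size fills the gap with \0 bytes.
    with open(CLOG_SEGMENT, "r+b") as f:
        f.truncate(SEGMENT_SIZE)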
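And for reference, a minimal sketch of the size-capping logic in steps
2 through 3a (hypothetical Python; snapshot_sizes and add_capped are
illustrative names, not part of any real tool):

    import io
    import os
    import tarfile

    def snapshot_sizes(cluster_dir):
        """Step 2: walk the cluster, skipping pg_xlog, and record the
        current size of every file."""
        sizes = {}
        for root, dirs, files in os.walk(cluster_dir):
            dirs[:] = [d for d in dirs if d != "pg_xlog"]
            for name in files:
                path = os.path.join(root, name)
                sizes[path] = os.path.getsize(path)
        return sizes

    def add_capped(tar, path, size):
        """Steps 3/3a: emit exactly `size` bytes for `path`, ignoring
        growth past the snapshot and zero-padding a shrunken file,
        since a streamed tar member's size cannot be revised after its
        header has been written."""
        info = tar.gettarinfo(path)
        info.size = size
        with open(path, "rb") as f:
            data = f.read(size)             # cap a file that has grown
        data += b"\0" * (size - len(data))  # pad one that has shrunk
        tar.addfile(info, io.BytesIO(data))

    # Usage, between pg_start_backup() and pg_stop_backup():
    #   sizes = snapshot_sizes("/var/lib/postgresql/data")
    #   with tarfile.open("base.tar", "w") as tar:
    #       for path, size in sorted(sizes.items()):
    #           add_capped(tar, path, size)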
In theory I would expect any defects introduced by the, ahem, exciting
procedure described in steps 3 and 3a to be corrected automatically by
recovery when you start the new cluster. It shouldn't matter exactly
when you read the file, recovery for unrelated blocks ought to proceed
totally independently, and an all-zeros block should be treated the
same way as one that isn't allocated yet, so it seems like it ought to
work.

But you may be stressing some paths in the recovery code that don't
get regular exercise, since the manner in which you are taking the
backup can produce backups different from any that the normal method
could take, and those paths might have bugs. It's also possible, as
others have said, that you've botched it. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company