Hackers,

While reading through [1] I saw there were two instances where backup_label was removed to achieve a "successful" restore. This might work on trivial test restores but is an invitation to (silent) disaster in a production environment where the checkpoint stored in backup_label is almost certain to be earlier than the one stored in pg_control.

A while back I had an idea for how to prevent this, so I decided to give it a try. Basically, before writing pg_control to the backup, I set its checkpoint location to 0xFFFFFFFFFFFFFFFF.

Recovery worked perfectly as long as backup_label was present and failed hard when it was not:

LOG:  invalid primary checkpoint record
PANIC:  could not locate a valid checkpoint record

It's not a very good message, but at least the foot gun has been removed. We could treat this as a special value in order to give a better message, and maybe use something more distinctive, like 0xFFFFFFFFFADEFADE (or whatever), as the value.
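
To make the special-value idea concrete, here is a minimal sketch of the decision recovery could make. It is written in Python purely as a model, not as PostgreSQL source; the constant name, the function name, and the message wording are all hypothetical, and a real check would live in the server's startup code.

```python
# Hypothetical sentinel written into the backup's copy of pg_control.
# This is the value floated above; nothing in PostgreSQL defines it today.
BACKUP_CHECKPOINT_SENTINEL = 0xFFFFFFFFFADEFADE


def missing_label_hint(control_checkpoint: int, have_backup_label: bool):
    """Return a hint for the sentinel-without-backup_label case, else None."""
    if have_backup_label:
        # With backup_label present, recovery starts from the checkpoint
        # recorded there, so the poisoned value in pg_control is harmless.
        return None
    if control_checkpoint == BACKUP_CHECKPOINT_SENTINEL:
        # Assumed wording; the point is only that the case is recognizable.
        return ("pg_control was taken from a backup but backup_label is "
                "missing; restore backup_label and retry")
    # Any other value falls through to the existing checkpoint validation.
    return None
```

The only property that matters is that the sentinel can never be mistaken for a real checkpoint location, so the error path can say what actually went wrong.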

This is all easy enough for pg_basebackup to do, but will certainly be non-trivial for most backup software to implement. In [2] we have discussed perhaps returning pg_control from pg_backup_stop() for the backup software to save, or it could become part of the backup_label (encoded as hex or base64, presumably). I prefer the latter as this means less work for the backup software (except for the need to exclude pg_control from the backup).
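
As a rough illustration of the second option, the sketch below shows how the pg_control contents might travel inside backup_label as a single base64 line. The "PG_CONTROL: " field name and both helper functions are assumptions made for this example; backup_label has no such field today, and the eventual format could look quite different.

```python
import base64

# Hypothetical field name; backup_label has no such line today.
FIELD = "PG_CONTROL: "


def append_control(label_text: str, control_bytes: bytes) -> str:
    """Append the pg_control contents to backup_label as one base64 line."""
    encoded = base64.b64encode(control_bytes).decode("ascii")
    if not label_text.endswith("\n"):
        label_text += "\n"
    return label_text + FIELD + encoded + "\n"


def extract_control(label_text: str) -> bytes:
    """Recover the pg_control bytes from a label built by append_control."""
    for line in label_text.splitlines():
        if line.startswith(FIELD):
            # validate=True rejects stray characters instead of skipping them.
            return base64.b64decode(line[len(FIELD):], validate=True)
    raise ValueError("no pg_control field in backup_label")
```

With something along these lines, the backup software only has to store backup_label as it already does and leave pg_control out of the copy.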

I don't have a patch for this yet because I did not test this idea using pg_basebackup, but I'll be happy to work up a patch if there is interest.

I feel like we should do *something* here. If even advanced users are making this mistake, then we should take it pretty seriously.

Regards,
-David

[1] https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288
[2] https://www.postgresql.org/message-id/CA%2BhUKGKiZJcfZSA5G5Rm8oC78SNOQ4c8az5Ku%3D4wMTjw1FZ40g%40mail.gmail.com

