On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost <sfr...@snowman.net> wrote:
> * Thomas Munro (thomas.mu...@gmail.com) wrote:
> > Here's a new minimal patch that solves only the bugs in basebackup +
> > the simple SQL-facing functions that read the control file, by simply
> > acquiring ControlFileLock in the obvious places. This should be
> > simple enough for back-patching?
>
> I don't particularly like fixing this in a way that only works for
> pg_basebackup and means that the users of other backup tools don't have
> a solution to this problem. What are we supposed to tell users of
> pgBackRest when they see this fix for pg_basebackup in the next set of
> minor releases and they ask us if we've addressed this risk?
>
> We might be able to accept the 're-read on CRC failure' approach, if it
> were being used for pg_controldata and we documented that external
> tools and especially backup tools using the low-level API are required
> to check the CRC and to re-read on failure if accessing a running
> system.
Hi Stephen, David, and thanks for looking.

Alright, let's try that idea out.

0001 + 0002. Acquire the lock in the obvious places in the backend, to
completely exclude the chance of anything going wrong for the easy
cases, including pg_basebackup. (This is the v4 from yesterday).

And...

0003. Document this problem and how to detect it, in the
low-level-backup section. Better words welcome.

And...

0004. Retries for front-end programs, using the idea suggested by Tom
(to wit: if it fails, retry until it fails with the same CRC value
twice). It's theoretically imperfect, but it's highly unlikely to fail
in practice and the best we can do without file locks or a connection
to the server AFAICT. (This is the first patch I posted, adjusted to
give up after 10 (?) loops, and not to bother at all in backend code
since that takes ControlFileLock with the 0001 patch).

> While it's not ideal, maybe we could get away with changing the contents
> of the backup_label as part of a back-patch? The documentation, at
> least on a quick look, says to copy the second field from pg_backup_stop
> into a backup_label file but doesn't say anything about what those
> contents are or if they can change. That would at least address the
> concern of backups ending up with a corrupt pg_control file and not
> being able to be restored, even if tools aren't updated to verify the
> CRC or similar. Of course, that's a fair bit of code churn for a
> backpatch, which I certainly understand as being risky. If we can't
> back-patch that, it might still be the way to go moving forward, while
> also telling tools to check the CRC. (I'm not going to try to figure
> out some back-patchable pretend solution for this for shell scripts that
> pretend to be able to backup running PG databases; this issue is
> probably the least of their issues anyway...)

I think you're probably right that anyone following those instructions
would be OK if we just back-patched such a thing, but it all seems a
little too much to me. +1 to looking into that for v17 (and I guess
maybe someone could eventually argue for back-patching much later with
experience). I'm sure other solutions are possible too... other places
to put a safe atomic copy of the control file could include: in the
WAL, or in extra files (pg_control.XXX) in the data directory + some
infallible way for recovery to choose which one to start up from. Or
something.
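To make the retry idea in 0004 a bit more concrete before the patches
themselves, here's the shape of the loop in isolation. This is only an
illustrative sketch: read_control_copy() and compute_control_crc() are
made-up stand-ins for the read and CRC steps, and the real change lives
in get_controlfile() in src/common/controldata_utils.c (see 0004 below).

    /*
     * Illustrative sketch only (assumes the usual PostgreSQL headers):
     * keep re-reading pg_control until the checksum matches, or until we
     * see the same bad CRC twice in a row, or we run out of retries.
     * read_control_copy() and compute_control_crc() are hypothetical
     * helpers standing in for the read and CRC-32C steps.
     */
    static bool
    read_control_tolerating_torn_reads(ControlFileData *control)
    {
        pg_crc32c   last_crc;
        int         retries = 0;

        INIT_CRC32C(last_crc);

        for (;;)
        {
            pg_crc32c   crc;

            read_control_copy(control);             /* hypothetical */
            crc = compute_control_crc(control);     /* hypothetical */

            /* A matching CRC means the copy is good: accept it. */
            if (EQ_CRC32C(crc, control->crc))
                return true;

            /*
             * Mismatch.  If the same bad CRC shows up twice in a row, or
             * we've retried enough, conclude it's genuine corruption
             * rather than a torn read racing a concurrent writer.
             */
            if ((retries > 0 && EQ_CRC32C(crc, last_crc)) || retries >= 10)
                return false;

            last_crc = crc;
            retries++;
            pg_usleep(10000);   /* give a concurrent writer time to finish */
        }
    }

The point being that a torn read caused by a concurrent writer should
keep changing between attempts, so two identical bad reads in a row are
treated as real corruption.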
From 6322d7159ae418582392fa6c0e11863a016378f4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 31 Jan 2023 20:01:49 +1300
Subject: [PATCH v5 1/4] Acquire ControlFileLock in base backups.

The control file could previously be overwritten while a base backup
was reading it. On some file systems (ext4, ntfs), the concurrent read
could see a mixture of old and new contents. The resulting copy would
be corrupted, and the new cluster would not be able to start up. Other
files are read without interlocking, but those will be corrected by
replaying the log.

Exclude this possibility by acquiring ControlFileLock. Copy into a
temporary buffer first because we don't want to hold the lock while in
sendFile() code that might sleep if throttling is configured. This
requires a minor adjustment to sendFileWithContent() to support binary
data.

This does not address the same possibility when using external file
system copy tools (as described in the documentation under "Making a
Base Backup Using the Low Level API").

Back-patch to all supported releases.

Reviewed-by: Anton A. Melnikov <aamelni...@inbox.ru> (earlier version)
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/backup/basebackup.c | 36 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..f03496665f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -24,6 +24,7 @@
 #include "backup/basebackup_target.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
+#include "common/controldata_utils.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
@@ -39,6 +40,7 @@
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -84,7 +86,7 @@ static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfile
 					 struct stat *statbuf, bool missing_ok, Oid dboid,
 					 backup_manifest_info *manifest, const char *spcoid);
 static void sendFileWithContent(bbsink *sink, const char *filename,
-								const char *content,
+								const char *content, int len,
 								backup_manifest_info *manifest);
 static int64 _tarWriteHeader(bbsink *sink, const char *filename,
 							 const char *linktarget, struct stat *statbuf,
@@ -315,7 +317,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 		if (ti->path == NULL)
 		{
-			struct stat statbuf;
+			ControlFileData *control_file;
+			bool		crc_ok;
 			bool		sendtblspclinks = true;
 			char	   *backup_label;
 
@@ -324,14 +327,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* In the main tar, include the backup_label first... */
 			backup_label = build_backup_content(backup_state, false);
 			sendFileWithContent(sink, BACKUP_LABEL_FILE,
-								backup_label, &manifest);
+								backup_label, -1, &manifest);
 			pfree(backup_label);
 
 			/* Then the tablespace_map file, if required... */
 			if (opt->sendtblspcmapfile)
 			{
 				sendFileWithContent(sink, TABLESPACE_MAP,
-									tablespace_map->data, &manifest);
+									tablespace_map->data, -1, &manifest);
 				sendtblspclinks = false;
 			}
@@ -340,13 +343,13 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 					sendtblspclinks, &manifest, NULL);
 
 			/* ... and pg_control after everything else. */
-			if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m",
-								XLOG_CONTROL_FILE)));
-			sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-					 false, InvalidOid, &manifest, NULL);
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			control_file = get_controlfile(DataDir, &crc_ok);
+			LWLockRelease(ControlFileLock);
+			sendFileWithContent(sink, XLOG_CONTROL_FILE,
+								(char *) control_file, sizeof(*control_file),
+								&manifest);
+			pfree(control_file);
 		}
 		else
 		{
@@ -591,7 +594,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 		 * complete segment.
 		 */
 		StatusFilePath(pathbuf, walFileName, ".done");
-		sendFileWithContent(sink, pathbuf, "", &manifest);
+		sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 	}
 
 	/*
@@ -619,7 +622,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* unconditionally mark file as archived */
 			StatusFilePath(pathbuf, fname, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/* Properly terminate the tar file. */
@@ -1030,18 +1033,19 @@ SendBaseBackup(BaseBackupCmd *cmd)
  */
 static void
 sendFileWithContent(bbsink *sink, const char *filename, const char *content,
+					int len,
 					backup_manifest_info *manifest)
 {
 	struct stat statbuf;
-	int			bytes_done = 0,
-				len;
+	int			bytes_done = 0;
 	pg_checksum_context checksum_ctx;
 
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
			 filename);
 
-	len = strlen(content);
+	if (len < 0)
+		len = strlen(content);
 
 	/*
	 * Construct a stat struct for the backup_label file we're injecting in
-- 
2.39.2
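A small note on the sendFileWithContent() signature change above, in
case it saves a reviewer a double-take: callers sending NUL-terminated
text keep passing -1, so the length is still computed with strlen(),
while the new pg_control caller passes an explicit byte count because
the control file is binary data that may contain NUL bytes. From the
patch:

    /* text content: pass -1, length computed internally with strlen() */
    sendFileWithContent(sink, BACKUP_LABEL_FILE, backup_label, -1, &manifest);

    /* binary content: pass an explicit length */
    sendFileWithContent(sink, XLOG_CONTROL_FILE,
                        (char *) control_file, sizeof(*control_file),
                        &manifest);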
From e55b05d52b792c2de35cb519e3a6f0de6030b8f9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sat, 22 Jul 2023 11:50:27 +1200
Subject: [PATCH v5 2/4] Acquire ControlFileLock in SQL functions.

Commit dc7d70ea added functions that read the control file, but didn't
acquire ControlFileLock. With unlucky timing and on certain file
systems, they could read corrupted data that is in the process of being
overwritten, and the checksum would fail.

Back-patch to all supported releases.

Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/utils/misc/pg_controldata.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index f2c1084797..a1003a464d 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -24,6 +24,7 @@
 #include "common/controldata_utils.h"
 #include "funcapi.h"
 #include "miscadmin.h"
+#include "storage/lwlock.h"
 #include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 #include "utils/timestamp.h"
@@ -42,7 +43,9 @@ pg_control_system(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -80,7 +83,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* Read the control file. */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -169,7 +174,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
@@ -208,7 +215,9 @@ pg_control_init(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	/* read the control file */
+	LWLockAcquire(ControlFileLock, LW_SHARED);
 	ControlFile = get_controlfile(DataDir, &crc_ok);
+	LWLockRelease(ControlFileLock);
 	if (!crc_ok)
 		ereport(ERROR,
 				(errmsg("calculated CRC checksum does not match value stored in file")));
-- 
2.39.2
From 3b28603553bf483c3cda280eeae6790693b930b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 25 Jul 2023 11:11:17 +1200
Subject: [PATCH v5 3/4] Doc: Warn about pg_control corruption in low-level backups.

On some file systems (ext4, ntfs), copying the control file while it's
being modified can result in a corrupted file with a bad checksum, and
then recovery fails. Document a way to test for this problem as part of
the low-level-backup procedure.

Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 doc/src/sgml/backup.sgml | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..60e110ea0a 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -920,6 +920,23 @@ SELECT pg_backup_start(label => 'label', fast => false);
     consider during this backup.
    </para>
   </listitem>
+  <listitem>
+   <para>
+    One file copied by the previous step requires special consideration.
+    For all other files, the effects of concurrent modification will be
+    corrected by WAL replay, but <literal>global/pg_control</literal> must
+    be self-consistent before replay begins. On some systems, it might be
+    corrupted by a concurrent modification during the backup. It can be
+    verified by installing it (after extracting from archive file if
+    applicable) into an otherwise empty directory (e.g.
+    <literal>test_pgdata/global/pg_control</literal>) and then
+    passing the path of the top directory to the
+    <application>pg_controldata</application> program.
+    <application>pg_controldata</application> will verify
+    the internal checksum, and fail if corruption is detected. In that
+    unlikely case, the file should be backed up again and re-tested.
+   </para>
+  </listitem>
   <listitem>
    <para>
     In the same connection as before, issue the command:
-- 
2.39.2
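For tool authors following the new documentation text: the "internal
checksum" that pg_controldata verifies is a CRC-32C computed over
everything in ControlFileData that precedes the stored crc field.
Roughly, paraphrasing the existing check in
src/common/controldata_utils.c (shown only for illustration, not new
code):

    #include "catalog/pg_control.h"
    #include "common/pg_crc32c.h"

    /* Does the stored CRC match the control file contents we read? */
    static bool
    control_crc_ok(const ControlFileData *control)
    {
        pg_crc32c   crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, (const char *) control, offsetof(ControlFileData, crc));
        FIN_CRC32C(crc);

        return EQ_CRC32C(crc, control->crc);
    }

So a backup tool using the low-level API can perform the same
verification itself, or simply run pg_controldata against a scratch
directory as the documentation suggests.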
From 7c40600581799b12eeb8550aa095385e7adfb5a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH v5 4/4] Try to tolerate torn reads of control file in frontend.

Some of our src/bin tools read the control file without any kind of
interlocking against concurrent writes. In the backend we avoid this
problem with ControlFileLock, but we can't do that from a stand-alone
program. Tolerate the torn read that can occur on some systems (ext4,
ntfs) by retrying if checksum fails, until we get two reads in a row
with the same checksum. This is only a last ditch effort and not
guaranteed to reach the right conclusion with extremely unlucky
scheduling, but it seems at least very likely to.

Thanks to Tom Lane for this suggestion.

Back-patch to all supported releases.

Reviewed-by: Anton A. Melnikov <aamelni...@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/common/controldata_utils.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 9723587466..8b1786512f 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,22 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 	char		ControlFilePath[MAXPGPATH];
 	pg_crc32c	crc;
 	int			r;
+#ifdef FRONTEND
+	pg_crc32c	last_crc;
+	int			retries = 0;
+#endif
 
 	Assert(crc_ok_p);
 
 	ControlFile = palloc_object(ControlFileData);
 	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
 
+#ifdef FRONTEND
+	INIT_CRC32C(last_crc);
+
+retry:
+#endif
+
 #ifndef FRONTEND
 	if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
 		ereport(ERROR,
@@ -117,6 +127,26 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 
 	*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
 
+#ifdef FRONTEND
+
+	/*
+	 * With unlucky timing on filesystems that don't implement atomicity of
+	 * concurrent reads and writes, we might have seen garbage if the server
+	 * was writing to the file at the same time. Keep retrying until we see
+	 * the same CRC twice, with a tiny sleep to give a concurrent writer a
+	 * good chance of making progress.
+	 */
+	if (!*crc_ok_p &&
+		(retries == 0 || !EQ_CRC32C(crc, last_crc)) &&
+		retries < 10)
+	{
+		retries++;
+		last_crc = crc;
+		pg_usleep(10000);
+		goto retry;
+	}
+#endif
+
 	/* Make sure the control file is valid byte order. */
 	if (ControlFile->pg_control_version % 65536 == 0 &&
 		ControlFile->pg_control_version / 65536 != 0)
-- 
2.39.2