On Mon, Apr 25, 2022 at 5:17 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <mich...@paquier.xyz> wrote: > > > > On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote: > > > Right. We find enough disk space and go to write and suddenly the > > > write operations fail for some reason or the VM crashes because of a > > > reason other than disk space. I think the foolproof solution is to > > > figure out the available disk space before prepadding or compressing > > > and also use the > > > write-first-to-temp-file-and-then-rename-it-to-original-file as > > > proposed in the earlier patches in this thread. > > > > Yes, what would count here is only the amount of free space in a > > partition. The total amount of space available becomes handy once you > > begin introducing things like percentage-based quota policies for the > > disk when archiving. The free amount of space could be used to define > > a policy based on the maximum number of bytes you need to leave > > around, as well, but this is not perfect science as this depends of > > what FSes decide to do underneath. There are a couple of designs > > possible here. When I had to deal with my upthread case I have chosen > > one as I had no need to worry only about Linux, it does not mean that > > this is the best choice that would fit with the long-term community > > picture. This comes down to how much pg_receivewal should handle > > automatically, and how it should handle it. > > Thanks. I'm not sure why we are just thinking of crashes due to > out-of-disk space. Figuring out free disk space before writing a huge > file (say a WAL file) is a problem in itself to the core postgres as > well, not just pg_receivewal. > > I think we are off-track a bit here. Let me illustrate what's the > whole problem is and the idea: > > If the node/VM on which pg_receivewal runs, goes down/crashes or fails > during write operation while padding the target WAL file (the .partial > file) with zeros, the unfilled target WAL file ((let me call this file > a partially padded .partial file) will be left over and subsequent > reads/writes to that it will fail with "write-ahead log file \"%s\" > has %zd bytes, should be 0 or %d" error which requires manual > intervention to remove it. In a service, this manual intervention is > what we would like to avoid. Let's not much bother right now for > compressed file writes (for now at least) as they don't have a > prepadding phase. > > The proposed solution is to make the prepadding atomic - prepad the > XXXX.partial file as XXXX.partial.tmp name and after the prepadding > rename (durably if sync option is chosen for pg_receivewal) to > XXXX.partial. Before prepadding XXXX.partial.tmp, delete the > XXXX.partial.tmp if it exists. > > The above problem isn't unique to pg_receivewal alone, pg_basebackup > too uses CreateWalDirectoryMethod and dir_open_for_write via > ReceiveXlogStream. > > IMHO, pg_receivewal checking for available disk space before writing > any file should better be discussed separately?
Here's the v3 patch after rebasing. I just would like to reiterate the issue the patch is trying to solve: At times (when no space is left on the device or when the process is taken down forcibly (VM/container crash)), there can be leftover uninitialized .partial files (note that padding i.e. filling 16MB WAL files with all zeros is done in non-compression cases) due to which pg_receivewal fails to come up after the crash. To address this, the proposed patch makes the padding 16MB WAL files atomic (first write a .temp file, pad it and durably rename it to the original .partial file, ready to be filled with received WAL records). This approach is similar to what core postgres achieves atomicity while creating new WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file first and then how InstallXLogFileSegment() durably renames it to original WAL file). Thoughts? Regards, Bharath Rupireddy.
From f57d4cc3d3750e9104989a38450a91beea78aafd Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Date: Mon, 25 Jul 2022 11:02:38 +0000 Subject: [PATCH v3] Avoid partially-padded files under pg_receivewal/pg_basebackup dirs pg_receivewal and pg_basebackup can have partially-padded files in their target directories when a failure occurs while padding. The failure can be because of no space left on device (which the user can later increase/clean up to make some free disk space) or host VM/server crashed. To address the above problem, first create a temporary file, pad it and then rename it to the target file. --- src/bin/pg_basebackup/walmethods.c | 99 +++++++++++++++++++++++++++--- 1 file changed, 90 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index e90aa0ba37..083861e217 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -82,6 +82,8 @@ typedef struct DirectoryMethodFile #define dir_set_error(msg) \ (dir_data->lasterrstring = _(msg)) +static bool dir_existsfile(const char *pathname); + static const char * dir_getlasterror(void) { @@ -107,7 +109,10 @@ dir_get_file_name(const char *pathname, const char *temp_suffix) static Walfile dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_size) { - char tmppath[MAXPGPATH]; + char targetpath[MAXPGPATH]; + char tmpsuffixpath[MAXPGPATH]; + bool shouldcreatetempfile = false; + int flags; char *filename; int fd; DirectoryMethodFile *f; @@ -123,7 +128,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_ dir_clear_error(); filename = dir_get_file_name(pathname, temp_suffix); - snprintf(tmppath, sizeof(tmppath), "%s/%s", + snprintf(targetpath, sizeof(targetpath), "%s/%s", dir_data->basedir, filename); pg_free(filename); @@ -132,12 +137,58 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_ * the file descriptor is important for dir_sync() method as gzflush() * does not do any system calls to fsync() to make changes permanent on * disk. + * + * Create a temporary file for padding and no compression cases to ensure + * a fully initialized file is created. */ - fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode); + if (pad_to_size > 0 && + dir_data->compression_algorithm == PG_COMPRESSION_NONE) + { + shouldcreatetempfile = true; + flags = O_WRONLY | PG_BINARY; + } + else + { + shouldcreatetempfile = false; + flags = O_WRONLY | O_CREAT | PG_BINARY; + } + + fd = open(targetpath, flags, pg_file_create_mode); if (fd < 0) { - dir_data->lasterrno = errno; - return NULL; + if (errno != ENOENT || !shouldcreatetempfile) + { + dir_data->lasterrno = errno; + return NULL; + } + else if (shouldcreatetempfile) + { + /* + * Actual file doesn't exist. Now, create a temporary file pad it + * and rename to the target file. The temporary file may exist from + * the last failed attempt (failed during partial padding or + * renaming or some other). If it exists, let's play safe and + * delete it before creating a new one. + */ + snprintf(tmpsuffixpath, sizeof(tmpsuffixpath), "%s.%s", + targetpath, "temp"); + + if (dir_existsfile(tmpsuffixpath)) + { + if (unlink(tmpsuffixpath) != 0) + { + dir_data->lasterrno = errno; + return NULL; + } + } + + fd = open(tmpsuffixpath, flags | O_CREAT, pg_file_create_mode); + if (fd < 0) + { + dir_data->lasterrno = errno; + return NULL; + } + } } #ifdef HAVE_LIBZ @@ -233,16 +284,46 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_ } } + /* Rename the temporary file to target file */ + if (shouldcreatetempfile) + { + int rc; + + if (dir_data->sync) + rc = durable_rename(tmpsuffixpath, targetpath); + else + rc = rename(tmpsuffixpath, targetpath); + + if (rc != 0) + { + dir_data->lasterrno = errno; + close(fd); + + /* + * Removing the temporary file may as well fail here, but we are + * not concerned about it right now as we are already returning an + * error while renaming. However, the leftover temporary file gets + * deleted during the next padding cycle. + */ + unlink(tmpsuffixpath); + + return NULL; + } + } + /* * fsync WAL file and containing directory, to ensure the file is * persistently created and zeroed (if padded). That's particularly * important when using synchronous mode, where the file is modified and * fsynced in-place, without a directory fsync. + * + * For padding/temporary file case, durable_rename would have already + * fsynced file and containing directory. */ - if (dir_data->sync) + if (!shouldcreatetempfile && dir_data->sync) { - if (fsync_fname(tmppath, false) != 0 || - fsync_parent_path(tmppath) != 0) + if (fsync_fname(targetpath, false) != 0 || + fsync_parent_path(targetpath) != 0) { dir_data->lasterrno = errno; #ifdef HAVE_LIBZ @@ -282,7 +363,7 @@ dir_open_for_write(const char *pathname, const char *temp_suffix, size_t pad_to_ f->fd = fd; f->currpos = 0; f->pathname = pg_strdup(pathname); - f->fullpath = pg_strdup(tmppath); + f->fullpath = pg_strdup(targetpath); if (temp_suffix) f->temp_suffix = pg_strdup(temp_suffix); -- 2.34.1