Hi,

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:
> +/*
> + * Sync data directory to ensure that what has been generated up to now is
> + * persistent in case of a crash, and this is done once globally for
> + * performance reasons as sync requests on individual files would be
> + * a negative impact on the running time of pg_rewind. This is invoked at
> + * the end of processing once everything has been processed and written.
> + */
> +static void
> +syncTargetDirectory(const char *argv0)
> +{
> +     int             ret;
> +     char    exec_path[MAXPGPATH];
> +     char    cmd[MAXPGPATH];
> +
> +     if (dry_run)
> +             return;

I think it makes more sense to return after detecting the binary, so
you'd find out about problems around not finding initdb during the dry
run, not later.


> +     /* Grab and invoke initdb to perform the sync */
> +     if ((ret = find_other_exec(argv0, "initdb",
> +                                                        "initdb (PostgreSQL) 
> " PG_VERSION "\n",
> +                                                        exec_path)) < 0)
> +     {
> +             char        full_path[MAXPGPATH];
> +
> +             if (find_my_exec(argv0, full_path) < 0)
> +                     strlcpy(full_path, progname, sizeof(full_path));
> +
> +             if (ret == -1)
> +                     pg_fatal("The program \"initdb\" is needed by %s but 
> was \n"
> +                                      "not found in the same directory as 
> \"%s\".\n"
> +                                      "Check your installation.\n", 
> progname, full_path);
> +             else
> +                     pg_fatal("The program \"postgres\" was found by \"%s\" 
> but was \n"
> +                                      "not the same version as %s.\n"
> +                                      "Check your installation.\n", 
> progname, full_path);

Wrong binary name.


> +     }
> +
> +     /* now run initdb */
> +     if (debug)
> +             snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
> +                              exec_path, datadir_target);
> +     else
> +             snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
> +                              exec_path, datadir_target, DEVNULL);
> +
> +     if (system(cmd) != 0)
> +             pg_fatal("sync of target directory with initdb -S failed\n");
> +     pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
> +}

Don't see need for emitting "done", for now at least.


>  /*
> + * Wrapper of rename() similar to the backend version with the same function
> + * name aimed at making the renaming durable on disk. Note that this version
> + * does not fsync the old file before the rename as all the code paths 
> leading
> + * to this function are already doing this operation. The new file is also
> + * normally not present on disk before the renaming so there is no need to
> + * bother about it.

I don't think it's a good idea to skip fsyncing the old file based on
that; it's way too likely that that'll not be done for the next user of
durable_rename.


> + */
> +static int
> +durable_rename(const char *oldfile, const char *newfile)
> +{
> +     int             fd;
> +     char    parentpath[MAXPGPATH];
> +
> +     if (rename(oldfile, newfile) != 0)
> +     {
> +             /* durable_rename produced a log entry */

Uh?


> +             fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
> +                             progname, current_walfile_name, 
> strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.


> +     if (fsync(fd) != 0)
> +     {
> +             close(fd);
> +             fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +                             progname, newfile, strerror(errno));
> +             return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).


> +     fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
> +
> +     /*
> +      * Some OSs don't allow us to open directories at all (Windows returns
> +      * EACCES), just ignore the error in that case.  If desired also 
> silently
> +      * ignoring errors about unreadable files. Log others.
> +      */

Comment is not applicable as a whole.


> +     if (fd < 0 && (errno == EISDIR || errno == EACCES))
> +             return 0;
> +     else if (fd < 0)
> +     {
> +             fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
> +                             progname, parentpath, strerror(errno));
> +             return -1;
> +     }
> +
> +     if (fsync(fd) != 0)
> +     {
> +             close(fd);
> +             fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
> +                             progname, parentpath, strerror(errno));
> +             return -1;

close() needs to be moved again.


It'd be easier to apply these if the rate of trivial issues were lower
:(.


Attached is an heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb, I think it's easier to understand
  code this way, and it'll make unifying the code easier
* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
  - otherwise the directory entry itself isn't safely persistent, as we
    don't fsync the directory in the stream->synchronous fsync() cases.
  - we refuse to resume in open_walfile(), if a file isn't 16MB when
    restarting. Without an fsync that's actually not unlikely after a
    crash. Even with an fsync that's not guaranteed not to happen, but
    the chance of it is much lower.

I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...


I've also noticed that

a) pg_basebackup doesn't do anything about durability (it probably needs
   a very similar patch to the one pg_rewind just received).
b) nor does pg_dump[all]

I think it's pretty unacceptable for backup tools to be so cavalier
about durability.


So we're going to have another round of fsync stuff in the next set of
releases anyway...


Greetings,

Andres Freund
>From 7c909e9913ea9acac6d0fc6ac8a40e62584568a3 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 28 Mar 2016 01:24:42 +0200
Subject: [PATCH] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup -X stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyns.

This introduces a near-copy of initdb's fsync_fname_ext(), and of the
backend's durable_rename(), fsync_parent_path(). At least the frontend
duplication should be avoided; but that'd end up being hard to
backpatch.

Author: Michael Paquier, heavily revised by me
Discussion: cab7npqrmm+cx6bvxw0y7mmvgmfj1s8kwhevt8tap83yefrf...@mail.gmail.com
Backpatch: 9.1 (in parts)
---
 src/bin/pg_basebackup/receivelog.c | 188 ++++++++++++++++++++++++++++++++-----
 1 file changed, 164 insertions(+), 24 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..c533ad1 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -50,6 +50,9 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	fsync_parent_path(const char *fname);
+static int	fsync_fname_ext(const char *fname, bool isdir);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -68,18 +71,14 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
-
-		return false;
-	}
-
 	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
+		return false;
+
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
+
 	return true;
 }
 
@@ -116,6 +115,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +132,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return false;
+		if (fsync_parent_path(fn) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +170,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return false;
+	if (fsync_parent_path(fn) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +241,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +361,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +371,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -786,6 +800,132 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+static int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
-- 
2.7.0.229.g701fa7f.dirty

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to