Hi Andres, Robert.

I've attached four patches here.

1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
   earlier in StartupXLOG.

2. Inside that function, issue fsync()s for the main forks we create by
   copying the _init fork.

3. A small fixup to add a const to "typedef char *FileName", because the
   earlier patch gave me warnings about discarding const-ness. This is
   consistent with many other functions in fd.c that take const char *.

4. Issue an fsync() on the data directory at startup if we need to
   perform crash recovery.

I created some unlogged relations, performed an immediate shutdown, and
then straced postgres as it performed crash recovery. The changes in (2)
do indeed fsync the files we create by copying *_init, and don't fsync
anything else (at least not after I fixed a bug ;-).

I did not do anything about the END_OF_RECOVERY checkpoint setting the
ControlFile->state to DB_SHUTDOWNED, because it wasn't clear to me if
there was any agreement on what to do. I would be happy to submit a
followup patch if we reach some decision about it.

Is this what you had in mind?

-- Abhijit
>From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 24 Sep 2014 14:43:18 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier

We need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
again at startup.

Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
in the first cleanup, this leaves the database with a bunch of _init
forks for unlogged relations, but no main forks, leading to scary
errors.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 46eef5f..218f7fb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6863,6 +6863,14 @@ StartupXLOG(void)
 	ShutdownWalRcv();
 
 	/*
+	 * Reset initial contents of unlogged relations.  This has to be done
+	 * AFTER recovery is complete so that any unlogged relations created
+	 * during recovery also get picked up.
+	 */
+	if (InRecovery)
+		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
@@ -7130,14 +7138,6 @@ StartupXLOG(void)
 	PreallocXlogFiles(EndOfLog);
 
 	/*
-	 * Reset initial contents of unlogged relations.  This has to be done
-	 * AFTER recovery is complete so that any unlogged relations created
-	 * during recovery also get picked up.
-	 */
-	if (InRecovery)
-		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
-
-	/*
 	 * Okay, we're officially UP.
 	 */
 	InRecovery = false;
-- 
1.9.1

>From 7eba57b5ed9e1eda1fa1b14b32a82828617d823e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 24 Sep 2014 16:01:37 +0530
Subject: =?UTF-8?q?Make=20ResetUnloggedRelations(=E2=80=A6=5FINIT)=20fsync?=
 =?UTF-8?q?=20newly-created=20main=20forks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x, we
issue fsync()s for the newly-created files. We depend on their existence
and a checkpoint isn't going to fsync them for us during recovery.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/storage/file/reinit.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 3229f41..4febf6f 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -339,6 +339,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		}
 
 		FreeDir(dbspace_dir);
+
+		/*
+		 * copy_file() above has already called pg_flush_data() on the
+		 * files it created. Now we need to fsync those files, because
+		 * a checkpoint won't do it for us while we're in recovery.
+		 */
+
+		dbspace_dir = AllocateDir(dbspacedirname);
+		while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+		{
+			ForkNumber forkNum;
+			int			oidchars;
+			char		oidbuf[OIDCHARS + 1];
+			char		mainpath[MAXPGPATH];
+
+			/* Skip anything that doesn't look like a relation data file. */
+			if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+													 &forkNum))
+				continue;
+
+			/* Also skip it unless this is the init fork. */
+			if (forkNum != INIT_FORKNUM)
+				continue;
+
+			/* Construct main fork pathname. */
+			memcpy(oidbuf, de->d_name, oidchars);
+			oidbuf[oidchars] = '\0';
+			snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
+					 dbspacedirname, oidbuf, de->d_name + oidchars + 1 +
+					 strlen(forkNames[INIT_FORKNUM]));
+
+			fsync_fname(mainpath, false);
+		}
+		FreeDir(dbspace_dir);
+
+		fsync_fname(dbspacedirname, true);
 	}
 }
 
-- 
1.9.1

>From 2b7eb360cd8789bb75cc9c02b85b1df6a9903f0f Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 24 Sep 2014 16:07:06 +0530
Subject: Add const to FileName typedef, make fsync_fname use it

The filename arguments aren't modified anyway, and this avoids warnings
when fsync_fname is called from ResetUnloggedRelationsInDbspaceDir. It
is also consistent with several other functions in fd.c that already
take const char * arguments.
---
 src/backend/storage/file/fd.c | 2 +-
 src/include/storage/fd.h      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1f69c9e..33ab471 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -390,7 +390,7 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * indicate the OS just doesn't allow/require fsyncing directories.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(FileName fname, bool isdir)
 {
 	int			fd;
 	int			returncode;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a6df8fb..2fa4e18 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,7 +46,7 @@
  * FileSeek uses the standard UNIX lseek(2) flags.
  */
 
-typedef char *FileName;
+typedef const char *FileName;
 
 typedef int File;
 
@@ -113,7 +113,7 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(FileName fname, bool isdir);
 
 /* Filename components for OpenTemporaryFile */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
-- 
1.9.1

>From bc07b596b8778258da661d7c9aea1f4c0d00a2e0 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 24 Sep 2014 16:20:43 +0530
Subject: If we need to perform crash recovery, fsync the data directory

This is so that we don't lose older unflushed writes in the event of a
power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 218f7fb..95d57cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5973,6 +5973,18 @@ StartupXLOG(void)
 		ereport(FATAL,
 				(errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory to try to ensure that any data written before the
+	 * crash are flushed to disk. Otherwise a power failure in the near
+	 * future might mean that earlier unflushed writes are lost, but the
+	 * more recent data written to disk from here on are persisted.
+	 */
+
+	if (ControlFile->state != DB_SHUTDOWNED &&
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		fsync_fname(data_directory, true);
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
-- 
1.9.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to