[HACKERS] silent data loss with ext4 / all current versions

Tomas Vondra Fri, 27 Nov 2015 03:19:15 -0800

Hi,

I've been doing some power failure tests (i.e. unexpectedlyinterrupting power) a few days ago, and I've discovered a fairly seriouscase of silent data loss on ext3/ext4. Initially i thought it's afilesystem bug, but after further investigation I'm pretty sure it's ourfault.

What happens is that when we recycle WAL segments, we rename them andthen sync them using fdatasync (which is the default on Linux). Howeverfdatasync does not force fsync on the parent directory, so in case ofpower failure the rename may get lost. The recovery won't realize thosesegments actually contain changes from "future" and thus does not replaythem. Hence data loss. The recovery completes as if everything went OK,so the data loss is entirely silent.

Reproducing this is rather trivial. I've prepared a simple C programsimulating our WAL recycling, that I intended to send to ext4 mailinglist to demonstrate the ext4 bug before (I realized it's most likely ourbug and not theirs).

The example program is called ext4-data-loss.c and is available here(along with other stuff mentioned in this message):


    https://github.com/2ndQuadrant/ext4-data-loss

Compile it, run it (over ssh from another host), interrupt the power andafter restart you should see some of the segments be lost (the renamereverted).

The git repo also contains a bunch of python scripts that I initiallyused to reproduce this on PostgreSQL - insert.py, update.py andxlog-watch.py. I'm not going to explain the details here, it's a bitmore complicated but the cause is exactly the same as with the Cprogram, just demonstrated in database. See README for instructions.

So, what's going on? The problem is that while the rename() is atomic,it's not guaranteed to be durable without an explicit fsync on theparent directory. And by default we only do fdatasync on the recycledsegments, which may not force fsync on the directory (and ext4 does notdo that, apparently).

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19,but I believe all versions since spread checkpoints were introduced arevulnerable).

FWIW this has nothing to do with storage reliability - you may have gooddrives, RAID controller with BBU, reliable SSDs or whatever, and you'restill not safe. This issue is at the filesystem level, not storage.

I've done the same tests on xfs and that seems to be safe - I've beenunable to reproduce the issue, so either the issue is not there or it'smore difficult to hit it. I haven't tried on other file systems, becauseext4 and xfs cover vast majority of deployments (at least on Linux), andthus issue on ext4 is serious enough I believe.

It's possible to make ext3/ext4 safe with respect to this issue by usingfull journaling (data=journal) instead of the default (data=ordered)mode. However this comes at a significant performance cost and prettymuch no one is using it with PostgreSQL because data=ordered is believedto be safe.

It's also possible to mitigate this by setting wal_sync_method=fsync,but I don't think I've ever seen that change at a customer. This alsocomes with a significant performance penalty, comparable to settingdata=journal. This has the advantage that this can be done withoutrestarting the database (SIGHUP is enough).


So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it'sproperly fsynced in the WAL segments, it's just the rename that gotlost. So it's possible to survive this without losing data by manuallyrenaming the segments, but this must happen before starting the clusterbecause the automatic recovery comes and discards all the data etc.

I think this issue might also result in various other issues, not justdata loss. For example, I wouldn't be surprised by data corruption dueto flushing some of the changes in data files to disk (due to contentionfor shared buffers and reaching vm.dirty_bytes) and then losing thematching WAL segment. Also, while I have only seen 1 to 3 segmentsgetting lost, it might be possible that more segments can get lost,possibly making the recovery impossible. And of course, this might causeproblems with WAL archiving due to archiving the same

segment twice (before and after crash).

Attached is a proposed fix for this (xlog-fsync.patch), and I'm prettysure this needs to be backpatched to all backbranches. I've alsoattached a patch that adds pg_current_xlog_flush_location() function,which proved to be quite useful when debugging this issue.

I'd also like to propose adding "last segment" to pg_controldata, nextto the last checkpoint / restartpoint. We don't need to write this onevery commit, once per segment (on the first write) is enough. Thiswould make investigating the issue much easier, and it'd also make itpossible to terminate the recovery with an error if the last foundsegment does not match the expectation (instead of just assuming we'vefound all segments, leading to data loss).

Another useful change would be to allow pg_xlogdump to print segmentseven if the contents does not match the filename. Currently it'simpossible to even look at the contents in that case, so renaming theexisting segments is mostly guess work (find segments whrere pg_xlogdumpfails, try renaming to next segments).

And finally, I've done a quick review of all places that might sufferthe same issue - some are not really interesting as the stuff isephemeral anyway (like pgstat for example), but there are ~15 placesthat may need this fix:


 * src/backend/access/transam/timeline.c (2 matches)
 * src/backend/access/transam/xlog.c (9 matches)
 * src/backend/access/transam/xlogarchive.c (3 matches)
 * src/backend/postmaster/pgarch.c (1 match)

Some of these places might be actually safe because a fsync happenssomewhere immediately after the rename (e.g. in a caller), but I guessbetter safe than sorry.

I plan to do more power failure testing soon, with more complex testscenarios. I suspect there might be other similar issues (e.g. when werename a file before a checkpoint and don't fsync the directory - thenthe rename won't be replayed and will be lost).


regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f17f834..b47c852 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3282,6 +3282,8 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 #endif
 
+	fsync_fname("pg_xlog", true);
+
 	if (use_lock)
 		LWLockRelease(ControlFileLock);

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NFILES 64

/* fdatasync is default on Linux */
#define SYNC(a) fdatasync(a)

static void prepare_files(int nfiles);
static void rewrite_file(int fd);

static void
rewrite_file(int fd)
{
        int i;
        char data[1024];

        /* 1kB of zeroes */
        memset(data, 0, 1024);

        /* write 16MB of data (zeroes) to the file */
        for (i = 0; i < 16 * 1024; i++)
        {
                if (write(fd, data, 1024) != 1024)
                {
                        printf("write failed");
                        exit(6);
                }
        }
}

static void
prepare_files(int nfiles)
{
        int i, j;
        char fname[1024];

        for (i = 0; i < nfiles; i++)
        {
                int fd;

                sprintf(fname, "%d", i);

                if ((fd = open(fname, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == 
-1)
                {
                        printf("open failed");
                        exit(7);
                }

                rewrite_file(fd);

                /* fdatasync is default on linux */
                if (SYNC(fd) != 0)
                {
                        printf("sync failed");
                        exit(8);
                }

                if (close(fd) != 0)
                {
                        printf("close failed");
                        exit(9);
                }
        }
}

void main()
{
        int i, f;

        prepare_files(NFILES);
        f = NFILES;

        /* now create a new file by renaming the oldest one, rewrite and sync */
        while (1)
        {
                int fd;
                char fname_old[1024];
                char fname_new[1024];

                sprintf(fname_old, "%d", f-NFILES);
                sprintf(fname_new, "%d", f);

                if (rename(fname_old, fname_new) != 0)
                {
                        printf("rename failed");
                        exit(1);
                }

                if ((fd = open(fname_new, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) 
== -1)
                {
                        printf("open failed");
                        exit(2);
                }

                rewrite_file(fd);

                /* fdatasync is default on linux */
                if (SYNC(fd) != 0)
                {
                        printf("sync failed");
                        exit(3);
                }

                if (close(fd) != 0)
                {
                        printf("close failed");
                        exit(4);
                }

                printf("file %s\n", fname_new);

                /* sleep for 1 second */
                sleep(1);

                /* next file */
                f += 1;
        }

}

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f17f834..dec7721 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10626,6 +10626,19 @@ GetXLogWriteRecPtr(void)
 }
 
 /*
+ * Get latest WAL flush pointer
+ */
+XLogRecPtr
+GetXLogFlushRecPtr(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return LogwrtResult.Flush;
+}
+
+/*
  * Returns the redo pointer of the last checkpoint or restartpoint. This is
  * the oldest point in WAL that we still need, if we have to restart recovery.
  */
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 329bb8c..35c581d 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -195,7 +195,7 @@ pg_current_xlog_location(PG_FUNCTION_ARGS)
 }
 
 /*
- * Report the current WAL insert location (same format as pg_start_backup etc)
+ * Report the current WAL flush location (same format as pg_start_backup etc)
  *
  * This function is mostly for debugging purposes.
  */
@@ -216,6 +216,27 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Report the current WAL insert location (same format as pg_start_backup etc)
+ *
+ * This function is mostly for debugging purposes.
+ */
+Datum
+pg_current_xlog_flush_location(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	current_recptr;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	current_recptr = GetXLogFlushRecPtr();
+
+	PG_RETURN_LSN(current_recptr);
+}
+
+/*
  * Report the last WAL receive location (same format as pg_start_backup etc)
  *
  * This is useful for determining how much of WAL is guaranteed to be received
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 790ca66..985291d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -235,6 +235,7 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern XLogRecPtr GetXLogFlushRecPtr(void);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/access/xlog_fn.h b/src/include/access/xlog_fn.h
index 3ebe966..f4575d7 100644
--- a/src/include/access/xlog_fn.h
+++ b/src/include/access/xlog_fn.h
@@ -19,6 +19,7 @@ extern Datum pg_switch_xlog(PG_FUNCTION_ARGS);
 extern Datum pg_create_restore_point(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
+extern Datum pg_current_xlog_flush_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..ca8fcd4 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3154,6 +3154,8 @@ DATA(insert OID = 2849 ( pg_current_xlog_location	PGNSP PGUID 12 1 0 0 0 f f f f
 DESCR("current xlog write location");
 DATA(insert OID = 2852 ( pg_current_xlog_insert_location	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 3220 "" _null_ _null_ _null_ _null_ _null_ pg_current_xlog_insert_location _null_ _null_ _null_ ));
 DESCR("current xlog insert location");
+DATA(insert OID = 3330 ( pg_current_xlog_flush_location	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 3220 "" _null_ _null_ _null_ _null_ _null_ pg_current_xlog_flush_location _null_ _null_ _null_ ));
+DESCR("current xlog flush location");
 DATA(insert OID = 2850 ( pg_xlogfile_name_offset	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 2249 "3220" "{3220,25,23}" "{i,o,o}" "{wal_location,file_name,file_offset}" _null_ _null_ pg_xlogfile_name_offset _null_ _null_ _null_ ));
 DESCR("xlog filename and byte offset, given an xlog location");
 DATA(insert OID = 2851 ( pg_xlogfile_name			PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 25 "3220" _null_ _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] silent data loss with ext4 / all current versions

Reply via email to