I don't want to take a bunch of time away from the active CF talking about this; I just wanted to pass along some notes:

-The back branch release just happened a few weeks ago. Happy to have this set aside until the CF is over.

-Attached is a working backport of this to 8.4, with the standard git comments in the header. I think this one will backport cleanly with git cherry-pick. The example is provided mainly to prove that; it's not intended as a patch submission.

-It is still possible to get extremely long sync times even with the improvement applied.

Since I had an 8.4 server manifesting this problem where I could simply try this one change, I did that. Before, we had this:

2012-06-17 14:48:13 EDT LOG: checkpoint complete: wrote 90 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=26.531 s, sync=4371.513 s, total=4461.058 s

After the compaction code was in place (I also backported the extra logging here), I got this instead:

2012-06-20 23:10:36 EDT LOG: checkpoint complete: wrote 188 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=31.975 s, sync=3064.270 s, total=3096.263 s; sync files=308, longest=482.200 s, average=9.948 s

So the background writer sync phase still took a long time, due to starvation from client writes. But the backend-side latency impact wasn't nearly as bad. The peak load average didn't jump into the hundreds; the server only got 10 to 20 clients behind.
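
Since the interesting part is easy to lose in the diff below: the whole trick in CompactBgwriterRequestQueue is "an entry can be dropped if a later, identical entry exists", followed by sliding the survivors down in their original order. Here's a minimal standalone sketch of just that rule, using a simplified stand-in request type and a naive O(n^2) scan where the real patch uses a hash table; illustration only, not the patch itself:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the (rnode, forknum, segno) request key. */
typedef struct
{
	int		rnode;
	int		forknum;
	int		segno;
} Request;

/* Compact req[] in place; returns the new entry count. */
static int
compact_requests(Request *req, int n)
{
	bool	skip[16] = {false};	/* sketch assumes n <= 16 */
	int		i, j, kept = 0;

	/* Mark every entry that is followed by a later identical one. */
	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (req[i].rnode == req[j].rnode &&
				req[i].forknum == req[j].forknum &&
				req[i].segno == req[j].segno)
			{
				skip[i] = true;
				break;
			}

	/* Slide the survivors down, preserving their relative order. */
	for (i = 0; i < n; i++)
		if (!skip[i])
			req[kept++] = req[i];
	return kept;
}

int
main(void)
{
	Request	q[] = {{1, 0, 5}, {2, 0, 7}, {1, 0, 5}, {2, 0, 7}, {1, 0, 5}};

	/* Prints "compacted 5 entries to 2"; only the last copy of each survives. */
	printf("compacted 5 entries to %d\n", compact_requests(q, 5));
	return 0;
}

Keeping the *last* copy of each request rather than the first is what keeps the transformation safe when FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC sentinels are in the queue, as the long comment in the patch explains.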

Anyway, the larger discussion around this and related OS tuning is a better topic for pgsql-performance; I'll raise it there once I've sorted that out a bit more clearly.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
From 0c251231cca932b50376f0ca8b088b71b9eb5f84 Mon Sep 17 00:00:00 2001
From: Greg Smith <g...@2ndquadrant.com>
Date: Sun, 24 Apr 2011 20:54:08 -0400
Subject: [PATCH 3/3] Try to avoid running with a full fsync request queue.

When we need to insert a new entry and the queue is full, compact the
entire queue in the hopes of making room for the new entry.  Doing this
on every insertion might worsen contention on BgWriterCommLock, but
when the queue is full, it's far better than allowing the backend to
perform its own fsync, per testing by Greg Smith as reported in
http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php

Original idea from Greg Smith.  Patch by Robert Haas.  Review by Chris
Browne and Greg Smith.

Conflicts:

	src/backend/postmaster/bgwriter.c
---
 src/backend/postmaster/bgwriter.c |  139 ++++++++++++++++++++++++++++++++++---
 src/backend/storage/smgr/md.c     |    8 ++-
 2 files changed, 137 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 831ea94..8a6abff 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -179,6 +179,7 @@ static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
@@ -1058,14 +1059,15 @@ RequestCheckpoint(int flags)
  * use high values for special flags; that's all internal to md.c, which
  * see for details.)
  *
- * If we are unable to pass over the request (at present, this can happen
- * if the shared memory queue is full), we return false.  That forces
- * the backend to do its own fsync.  We hope that will be even more seldom.
- *
- * Note: we presently make no attempt to eliminate duplicate requests
- * in the requests[] queue.  The bgwriter will have to eliminate dups
- * internally anyway, so we may as well avoid holding the lock longer
- * than we have to here.
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The bgwriter
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.  This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
  */
 bool
 ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
@@ -1083,8 +1085,15 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 	/* we count non-bgwriter writes even when the request queue overflows */
 	BgWriterShmem->num_backend_writes++;
 
+	/*
+	 * If the background writer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.  But before
+	 * forcing that to happen, we can try to compact the background writer
+	 * request queue.
+	 */
 	if (BgWriterShmem->bgwriter_pid == 0 ||
-		BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		&& !CompactBgwriterRequestQueue()))
 	{
 		LWLockRelease(BgWriterCommLock);
 		return false;
@@ -1098,6 +1107,118 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 }
 
 /*
+ * CompactBgwriterRequestQueue
+ * 		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Although a full fsync request queue is not common, it can lead to severe
+ * performance problems when it does happen.  So far, this situation has
+ * only been observed to occur when the system is under heavy write load,
+ * and especially during the "sync" phase of a checkpoint.  Without this
+ * logic, each backend begins doing an fsync for every block written, which
+ * gets very expensive and can slow down the whole system.
+ *
+ * Trying to do this every time the queue is full could lose if there
+ * aren't any removable entries.  But that should be vanishingly rare in
+ * practice: there's one queue entry per shared buffer.
+ */
+static bool
+CompactBgwriterRequestQueue(void)
+{
+	struct BgWriterSlotMapping {
+		BgWriterRequest	request;
+		int		slot;
+	};
+
+	int			n,
+				preserve_count;
+	int			num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactBgwriterRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/* 
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing less
+	 * copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
+	 * we do it this way.  It would be possible to be even smarter if we made
+	 * the code below understand the specific semantics of such requests (it
+	 * could blow away preceding entries that would end up being cancelled
+	 * anyhow), but it's not clear that the extra complexity would buy us
+	 * anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool	found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;
+			++num_skipped;
+		}
+		slotmap->slot = n;
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (!num_skipped)
+	{
+		/*
+		 * In PostgreSQL 9.1, this increments a counter in pg_stat_bgwriter.
+		 * Since that's impractical to do in a backport, produce a log message
+		 * when it happens.  This should be very rare, and if it happens that
+		 * is a notable event.  Pick an accordingly high log level.
+		 */
+		ereport(LOG,
+			(errmsg("compacted fsync request queue no smaller, still %d entries",
+				BgWriterShmem->num_requests)));
+
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* We found some duplicates; remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+			(errmsg("compacted fsync request queue from %d entries to %d entries",
+				BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
  * AbsorbFsyncRequests
  *		Retrieve queued fsync requests and pass them to local smgr.
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index a1c0f73..222e48e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -34,7 +34,13 @@
 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB		10
 
-/* special values for the segno arg to RememberFsyncRequest */
+/*
+ * Special values for the segno arg to RememberFsyncRequest.
+ *
+ * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
 #define FORGET_RELATION_FSYNC	(InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC	(InvalidBlockNumber-1)
 #define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-- 
1.7.2.5
