On 05/27/2015 12:26 AM, Jeff Janes wrote:
On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fu...@gmail.com> wrote:

On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
One of the points of max_wal_size and its predecessor is to limit how big
pg_xlog can grow.  But running out of disk space on pg_xlog is no more fun
during archive recovery than it is during normal operations.  So why
shouldn't max_wal_size be active during recovery?

The following message of commit 7181530 explains why.

    In standby mode, respect checkpoint_segments in addition to
    checkpoint_timeout to trigger restartpoints. We used to deliberately only
    do time-based restartpoints, because if checkpoint_segments is small we
    would spend time doing restartpoints more often than really necessary.
    But now that restartpoints are done in bgwriter, they're not as
    disruptive as they used to be. Secondly, because streaming replication
    stores the streamed WAL files in pg_xlog, we want to clean it up more
    often to avoid running out of disk space when checkpoint_timeout is
    large and checkpoint_segments small.

Previously users were more likely to fall into this trouble (i.e., too
frequent occurrence of restartpoints) because the default value of
checkpoint_segments was very small, I guess. But we increased the default of
max_wal_size, so now the risk of that trouble seems to be smaller than
before, and maybe we can allow max_wal_size to trigger restartpoints.

I see.  The old behavior was present for the same reason we decided to split
checkpoint_segments into max_wal_size and min_wal_size.

That is, the default checkpoint_segments was small, and it had to be small
because increasing it would cause more space to be used even when that
extra space was not helpful.

So perhaps we can consider this change a completion of the max_wal_size
work, rather than a new feature?

Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments made sense when we initially did it, but it has gradually become less and less sensible since then, as we got streaming replication and started to keep all restored segments in pg_xlog even in archive recovery.

Implementing that seems to be a trivial change, although I might be
overlooking something subtle (pasted below, also attached):

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (StandbyModeRequested && bgwriterLaunched)
+		if (bgwriterLaunched)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{

This keeps pg_xlog at about 67% of max_wal_size during archive recovery
(because checkpoint_completion_target is accounted for but goes unused).
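
(For context: in 9.5 checkpoint_segments is gone and CheckPointSegments is
derived from max_wal_size. Going from memory, the calculation looks roughly
like the sketch below; don't take the exact shape as gospel. The divisor
covers the previous checkpoint's WAL plus the spread-out portion of the
current one, so if that spread never happens, pg_xlog settles noticeably
below max_wal_size.)

/*
 * Rough sketch, paraphrased from memory, of how CheckPointSegments is
 * derived from max_wal_size in 9.5 (CalculateCheckpointSegments() in
 * xlog.c).
 */
static void
CalculateCheckpointSegments(void)
{
	double		target;

	target = (double) max_wal_size / (2.0 + CheckPointCompletionTarget);

	/* round down, but keep at least one segment */
	CheckPointSegments = (int) target;
	if (CheckPointSegments < 1)
		CheckPointSegments = 1;
}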

Hmm. checkpoint_completion_target is used when determining progress against checkpoint_timeout just fine, but the problem is that if you do just the above, IsCheckpointOnSchedule() still won't consider consumed WAL when it determines whether the restartpoint is "on time". So the error is in the other direction: if you set max_wal_size to a small value, and checkpoint_timeout to a large value, the restartpoint would think that it has plenty of time to complete, and exceed max_wal_size. We need to fix IsCheckpointOnSchedule() to also track progress against max_wal_size during recovery.

I came up with the attached patch as a first attempt. It enables the same logic for deciding whether the checkpoint is on schedule to be used in recovery as well. But there's a little problem (also explained in a comment in the patch):

There is a large gap between a checkpoint's redo-pointer and the checkpoint record itself (determined by checkpoint_completion_target). When we're not in recovery, we set the redo-pointer for the current checkpoint first, then start flushing data, and finally write the checkpoint record. The logic in IsCheckpointOnSchedule() calculates a) how much WAL has been generated since the beginning of the checkpoint, i.e. its redo-pointer, and b) what fraction of shared_buffers has been flushed to disk. But in recovery, we only start the restartpoint after replaying the checkpoint record, so at the beginning of a restartpoint, we're actually already behind schedule by the amount of WAL between the redo-pointer and the record itself.
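
To put rough numbers on that (a toy illustration with assumed values, not
part of the patch):

#include <stdio.h>

/*
 * Toy illustration with made-up numbers: how far "behind schedule" a
 * restartpoint starts by the WAL-based measure, because of the gap between
 * the replayed checkpoint's redo-pointer and the checkpoint record.
 */
int
main(void)
{
	double		checkpoint_segments = 32.0;	/* segments per cycle */
	double		redo_to_record_gap = 16.0;	/* segments; ~ cycle * completion_target */

	/* same shape as the elapsed_xlogs calculation in IsCheckpointOnSchedule() */
	double		initial_deficit = redo_to_record_gap / checkpoint_segments;

	/* prints 0.50: half a cycle "late" before a single buffer is flushed */
	printf("initial deficit: %.2f\n", initial_deficit);
	return 0;
}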

I'm not sure what to do about this. With the attached patch, you get the same leisurely pacing with restartpoints as you get with checkpoints, but you exceed max_wal_size during recovery, by the amount determined by checkpoint_completion_target. Alternatively, we could try to perform restartpoints faster than checkpoints, but then you'll get nasty checkpoint I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a checkpoint. It wouldn't do anything else, but it would be a hint to recovery that there's going to be a checkpoint record later whose redo-pointer will point to that record. We could then start the restartpoint at that record already, before seeing the checkpoint record itself.
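
Just to make that concrete, something like the sketch below is what I have
in mind. XLOG_CHECKPOINT_BEGIN is made up, and the real thing would need
redo and rmgr plumbing, so treat this as hand-waving rather than a design:

/*
 * Hand-waving sketch only: XLOG_CHECKPOINT_BEGIN doesn't exist.  The record
 * carries no payload; its LSN is the hint.  CreateCheckPoint() would write
 * it first, so that the later checkpoint record's redo-pointer points at
 * it, and recovery could start the restartpoint as soon as it replays it,
 * instead of waiting for the checkpoint record itself.
 */
static XLogRecPtr
LogCheckpointBegin(void)
{
	XLogBeginInsert();
	return XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_BEGIN);
}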

I think the attached is better than nothing, but I'll take a look at that beginning-of-checkpoint idea. It might be too big a change to do at this point, but I'd really like to fix this properly for 9.5, since we've changed the way checkpoints are scheduled anyway.

- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..cc90ae6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10961,7 +10961,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (StandbyModeRequested && bgwriterLaunched)
+		if (bgwriterLaunched)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..bc49ee4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -475,10 +475,12 @@ CheckpointerMain(void)
 
 			/*
 			 * Initialize checkpointer-private variables used during
-			 * checkpoint
+			 * checkpoint.
 			 */
 			ckpt_active = true;
-			if (!do_restartpoint)
+			if (do_restartpoint)
+				ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
+			else
 				ckpt_start_recptr = GetInsertRecPtr();
 			ckpt_start_time = now;
 			ckpt_cached_elapsed = 0;
@@ -720,7 +722,7 @@ CheckpointWriteDelay(int flags, double progress)
 
 /*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- *		 in time?
+ *		 (or restartpoint) in time?
  *
  * Compares the current progress against the time/segments elapsed since last
  * checkpoint, and returns true if the progress we've made this far is greater
@@ -757,17 +759,28 @@ IsCheckpointOnSchedule(double progress)
 	 * compares against RedoRecptr, so this is not completely accurate.
 	 * However, it's good enough for our purposes, we're only calculating an
 	 * estimate anyway.
+	 *
+	 * During recovery, we compare last replayed WAL record's location with
+	 * the location computed before calling CreateRestartPoint. That maintains
+	 * the same pacing as we have during checkpoints in normal operation, but
+	 * we might exceed max_wal_size by a fair amount. That's because there can
+	 * be a large gap between a checkpoint's redo-pointer and the checkpoint
+	 * record itself, and we only start the restartpoint after we've seen the
+	 * checkpoint record. (The gap is typically up to CheckPointSegments *
+	 * checkpoint_completion_target where checkpoint_completion_target is the
+	 * value that was in effect when the WAL was generated).
 	 */
-	if (!RecoveryInProgress())
-	{
+	if (RecoveryInProgress())
+		recptr = GetXLogReplayRecPtr(NULL);
+	else
 		recptr = GetInsertRecPtr();
-		elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
+	elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
+	if (progress < elapsed_xlogs)
+	{
+		elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f", progress, elapsed_xlogs);
+		ckpt_cached_elapsed = elapsed_xlogs;
+		return false;
 	}
 
 	/*
@@ -779,11 +792,13 @@ IsCheckpointOnSchedule(double progress)
 
 	if (progress < elapsed_time)
 	{
+		elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
 
 	/* It looks like we're on schedule. */
+	elog(LOG, "on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
 	return true;
 }
 