On 21.06.2013 11:29, KONDO Mitsumasa wrote:
> I took results for my separate patches and for original PG.

> * Result of DBT-2
>               |     TPS |   90%tile | Average |  Maximum
> --------------+---------+-----------+---------+-----------
> original_0.7  | 3474.62 | 18.348328 |   5.739 | 36.977713
> original_1.0  | 3469.03 | 18.637865 |   5.842 | 41.754421
> fsync         | 3525.03 | 13.872711 |   5.382 | 28.062947
> write         | 3465.96 | 19.653667 |   5.804 | 40.664066
> fsync + write | 3564.94 |  16.31922 |     5.1 | 34.530766

> - 'original_*' is PG 9.2.4, with the suffix giving its checkpoint_completion_target.
> - The patched versions all use checkpoint_completion_target = 0.7.
> - 'write' has the write patch applied, and 'fsync' has the fsync patch applied.
> - 'fsync + write' has both patches applied.


> * Investigation of results
> - A large checkpoint_completion_target in original PG, and the write patch,
>   both lead to slow transaction latency in the benchmark, because the
>   slowed-down page writes cause long fsync I/O at the end of the checkpoint.
> - The fsync patch affects the latency of each file's fsync. Fsyncing the
>   files back-to-back causes slow latency, so it is good for latency if the
>   fsync stage of the checkpoint sleeps after a slow fsync I/O.
> - The combination of the fsync and write patches seemed to improve TPS. I
>   think the write patch disturbs transactions doing full-page writes to WAL
>   less than original (plain) PG does.

Hmm, so the write patch doesn't do much, but the fsync patch makes the response times somewhat smoother. I'd suggest that we drop the write patch for now, and focus on the fsyncs.

What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold settings did you use with the fsync patch? It's disabled by default.
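
For reference, my reading of the patch is that enabling it means setting something like checkpointer_fsync_delay_threshold = 1000 and checkpointer_fsync_delay_ratio = 1.0 in postgresql.conf. The names are from the patch; the values are just an example, and I'm guessing from the code below that the threshold is in milliseconds and that a negative value disables the delay.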

This is the interesting part of the patch:

@@ -1171,6 +1174,20 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if( CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+						{
+							pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+						}
 						break;	/* out of retry loop */
 					}
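
To make the numbers concrete: elapsed there is in microseconds, so with checkpointer_fsync_delay_ratio = 1.0 and a threshold of, say, 1000 ms, an fsync that took 3 seconds is followed by a 3 second sleep, while any fsync that finishes in under a second is not throttled at all. (Those are just example settings, not the patch's defaults.)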

I'm not sure it's a good idea to sleep proportionally to the time it took to complete the previous fsync. If you have a 1GB cache in the RAID controller, fsyncing a 1GB segment will fill it up. But since it fits in the cache, the fsync will return immediately. So we proceed to fsync other files, until the cache is full and an fsync blocks. But once we've filled up the cache, it's likely that we're hurting concurrent queries. ISTM it would be better to stay under that threshold, keeping the I/O system busy, but never fill up the cache completely.

This is just a theory, though. I don't have a good grasp on how the OS and a typical RAID controller behaves under these conditions.

I'd suggest that we just sleep for a small fixed amount of time between every fsync, unless we're running behind the checkpoint schedule. And for a first approximation, let's just assume that the fsync phase is e.g. 10% of the whole checkpoint work.
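
To spell out the arithmetic of that 10% assumption: if a checkpoint has, say, 10000 buffers to write and 200 files to fsync, then after writing 5000 buffers we report progress 0.9 * 5000/10000 = 0.45 to CheckpointWriteDelay(), and after the 50th fsync we report 0.9 + 0.1 * 50/200 = 0.925; the delay is skipped whenever we have fallen behind that schedule. (The buffer and file counts are made up, just for illustration.)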

> I will send you a more detailed investigation and results next week. I will
> also take results with pgbench. If you are concerned about other parts of
> the benchmark results or the postgres parameters, please tell me.
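
For the pgbench run, something simple like "pgbench -i -s 100" followed by "pgbench -c 32 -j 4 -T 1800 -l" should be enough; the per-transaction log from -l makes it easy to plot latency over time and spot the checkpoint spikes. (Just a suggestion, scale it to your hardware.)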

Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the assumption that the fsync phase is 10% of the total checkpoint duration. I suspect 100ms is too small to have much effect, but that happens to be what we have currently in CheckpointWriteDelay(). Could you test this patch along with yours? It would also be good to test with different delays (e.g. 100ms, 500ms and 1000ms) and different ratios between the write and fsync phases (e.g. 0.5, 0.7, 0.9), to get an idea of how sensitive the test case is to those settings.
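
To save some digging, I believe these are the only spots that need changing for those variations (an untested pointer, not a proper patch):

	/* in CheckpointWriteDelay(), src/backend/postmaster/checkpointer.c */
	pg_usleep(100000L);		/* the 100 ms nap; use 500000L for 500 ms, 1000000L for 1 s */

	/* in CheckPointBuffers(), as modified by the patch below */
	BufferSync(flags, 0.9);		/* fraction of the checkpoint schedule budgeted for the write phase */
	smgrsync(flags, 0.9);		/* keep this in sync with the value above */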

- Heikki
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a622a18 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -94,7 +94,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double progress_upto);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1207,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
  * remaining flags currently have no effect here.
  */
 static void
-BufferSync(int flags)
+BufferSync(int flags, double progress_upto)
 {
 	int			buf_id;
 	int			num_to_scan;
@@ -1319,7 +1319,7 @@ BufferSync(int flags)
 				/*
 				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, progress_upto * (double) num_written / num_to_write);
 			}
 		}
 
@@ -1825,10 +1825,10 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
-	BufferSync(flags);
+	BufferSync(flags, 0.9);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, 0.9);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..7ceec9c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -235,7 +235,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +974,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
 {
 	static bool mdsync_in_progress = false;
 
@@ -990,6 +990,7 @@ mdsync(void)
 	uint64		elapsed;
 	uint64		longest = 0;
 	uint64		total_elapsed = 0;
+	int			ntoprocess;
 
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
@@ -1052,6 +1053,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	ntoprocess = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
@@ -1171,6 +1173,11 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * Sleep to throttle our I/O rate.
+						 */
+						CheckpointWriteDelay(ckpt_flags, progress_at_begin + (1.0 - progress_at_begin) * (double) processed / ntoprocess);
+
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..ec24007 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
-	void		(*smgr_sync) (void);	/* may be NULL */
+	void		(*smgr_sync) (int ckpt_flags, double progress_at_begin);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
 {
 	int			i;
 
 	for (i = 0; i < NSmgr; i++)
 	{
+		/*
+		 * XXX: If we ever have more than one smgr, the remaining progress
+		 * should somehow be divided among all smgrs.
+		 */
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
 	}
 }
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);