On 21.06.2013 11:29, KONDO Mitsumasa wrote:
> I took results for my separate patches and for original PG.

> * Result of DBT-2
>               |     TPS |   90%tile | Average |  Maximum
> --------------+---------+-----------+---------+-----------
> original_0.7  | 3474.62 | 18.348328 |   5.739 | 36.977713
> original_1.0  | 3469.03 | 18.637865 |   5.842 | 41.754421
> fsync         | 3525.03 | 13.872711 |   5.382 | 28.062947
> write         | 3465.96 | 19.653667 |   5.804 | 40.664066
> fsync + write | 3564.94 |  16.31922 |     5.1 | 34.530766

> - 'original_*' is PG 9.2.4, with the suffix giving its checkpoint_completion_target.
> - The patched versions all use checkpoint_completion_target = 0.7.
> - 'write' has the write patch applied, and 'fsync' has the fsync patch applied.
> - 'fsync + write' has both patches applied.


> * Investigation of results
> - A large checkpoint_completion_target in original PG, and the write patch,
>   both lead to slow transaction latency in the benchmark, because the
>   slowed-down page writes cause long fsync I/O at the end of the checkpoint.
> - The fsync patch affects the latency of each file's fsync. Fsyncing the
>   files back-to-back causes slow latency, so it is good for latency if the
>   fsync stage of the checkpoint sleeps after a slow fsync I/O.
> - The combination of the fsync and write patches seemed to improve TPS. I
>   think the write patch disturbs transactions doing full-page writes to WAL
>   less than original (plain) PG does.

Hmm, so the write patch doesn't do much, but the fsync patch makes the response times somewhat smoother. I'd suggest that we drop the write patch for now, and focus on the fsyncs.

What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold settings did you use with the fsync patch? It's disabled by default.
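
For reference, my reading of the patch is that enabling it means setting something like checkpointer_fsync_delay_threshold = 1000 and checkpointer_fsync_delay_ratio = 1.0 in postgresql.conf. The names are from the patch; the values are just an example, and I'm guessing from the code below that the threshold is in milliseconds and that a negative value disables the delay.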

This is the interesting part of the patch:

@@ -1171,6 +1174,20 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if( CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+						{
+							pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+						}
 						break;	/* out of retry loop */
 					}
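
To make the numbers concrete: elapsed there is in microseconds, so with checkpointer_fsync_delay_ratio = 1.0 and a threshold of, say, 1000 ms, an fsync that took 3 seconds is followed by a 3 second sleep, while any fsync that finishes in under a second is not throttled at all. (Those are just example settings, not the patch's defaults.)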

I'm not sure it's a good idea to sleep proportionally to the time it took to complete the previous fsync. If you have a 1GB cache in the RAID controller, fsyncing a 1GB segment will fill it up. But since it fits in the cache, the fsync will return immediately. So we proceed to fsync other files, until the cache is full and an fsync blocks. But once we've filled up the cache, it's likely that we're hurting concurrent queries. ISTM it would be better to stay under that threshold, keeping the I/O system busy, but never fill up the cache completely.

This is just a theory, though. I don't have a good grasp on how the OS and a typical RAID controller behaves under these conditions.

I'd suggest that we just sleep for a small fixed amount of time between every fsync, unless we're running behind the checkpoint schedule. And for a first approximation, let's just assume that the fsync phase is e.g. 10% of the whole checkpoint work.
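
To spell out the arithmetic of that 10% assumption: if a checkpoint has, say, 10000 buffers to write and 200 files to fsync, then after writing 5000 buffers we report progress 0.9 * 5000/10000 = 0.45 to CheckpointWriteDelay(), and after the 50th fsync we report 0.9 + 0.1 * 50/200 = 0.925; the delay is skipped whenever we have fallen behind that schedule. (The buffer and file counts are made up, just for illustration.)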

> I will send you a more detailed investigation and results next week. I will
> also take results with pgbench. If you are concerned about other parts of
> the benchmark results or the postgres parameters, please tell me.
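
For the pgbench run, something simple like "pgbench -i -s 100" followed by "pgbench -c 32 -j 4 -T 1800 -l" should be enough; the per-transaction log from -l makes it easy to plot latency over time and spot the checkpoint spikes. (Just a suggestion, scale it to your hardware.)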

Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the assumption that the fsync phase is 10% of the total checkpoint duration. I suspect 100ms is too small to have much effect, but that happens to be what we have currently in CheckpointWriteDelay(). Could you test this patch along with yours? It would also be good to test with different delays (e.g. 100ms, 500ms and 1000ms) and different ratios between the write and fsync phases (e.g. 0.5, 0.7, 0.9), to get an idea of how sensitive the test case is to those settings.
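
To save some digging, I believe these are the only spots that need changing for those variations (an untested pointer, not a proper patch):

	/* in CheckpointWriteDelay(), src/backend/postmaster/checkpointer.c */
	pg_usleep(100000L);		/* the 100 ms nap; use 500000L for 500 ms, 1000000L for 1 s */

	/* in CheckPointBuffers(), as modified by the patch below */
	BufferSync(flags, 0.9);		/* fraction of the checkpoint schedule budgeted for the write phase */
	smgrsync(flags, 0.9);		/* keep this in sync with the value above */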

- Heikki
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a622a18 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -94,7 +94,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double progress_upto);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1207,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
  * remaining flags currently have no effect here.
  */
 static void
-BufferSync(int flags)
+BufferSync(int flags, double progress_upto)
 {
 	int			buf_id;
 	int			num_to_scan;
@@ -1319,7 +1319,7 @@ BufferSync(int flags)
 				/*
 				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, progress_upto * (double) num_written / num_to_write);
 			}
 		}
 
@@ -1825,10 +1825,10 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
-	BufferSync(flags);
+	BufferSync(flags, 0.9);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, 0.9);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..7ceec9c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -235,7 +235,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +974,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
 {
 	static bool mdsync_in_progress = false;
 
@@ -990,6 +990,7 @@ mdsync(void)
 	uint64		elapsed;
 	uint64		longest = 0;
 	uint64		total_elapsed = 0;
+	int			ntoprocess;
 
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
@@ -1052,6 +1053,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	ntoprocess = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
@@ -1171,6 +1173,11 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * Sleep to throttle our I/O rate.
+						 */
+						CheckpointWriteDelay(ckpt_flags, progress_at_begin + (1.0 - progress_at_begin) * (double) processed / ntoprocess);
+
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..ec24007 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
-	void		(*smgr_sync) (void);	/* may be NULL */
+	void		(*smgr_sync) (int ckpt_flags, double progress_at_begin);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
 {
 	int			i;
 
 	for (i = 0; i < NSmgr; i++)
 	{
+		/*
+		 * XXX: If we ever have more than one smgr, the remaining progress
+		 * should somehow be divided among all smgrs.
+		 */
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
 	}
 }
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);