Robert Haas wrote:
Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs...

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.

What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now, I'm going to share it before someone announces a deadline has passed or something. (whistling) I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite it along the lines of one of your ideas, and then it's on to a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work reliably for everyone because of the now well-understood ext3 issues--I never replicated that speedup at the time, for example. And this was before the spread checkpoint code went into 8.3; the hope was that sorting wasn't really going to be necessary after that anyway.

Back to today...instead of something complicated, it struck me that it would be helpful just to have a count of exactly how many files were involved in each checkpoint. I could keep the idea of a fixed delay between fsyncs, but auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them. I thought that if the sorted-writes patch were brought back to a functional state, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order, they would have the best possible odds of finding the data already written. So rather than this possible order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly. Instead we'd get this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds that the data has already been written before the associated sync comes along.

Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on that. In the example above, once the list is sorted, it's easy to see how many times the table name changes during a linear scan of the sorted data: 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between each. Simple math, and exactly the sort of tuning I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time.
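To make that delay math concrete, here's a minimal sketch of the auto-tuning I have in mind. It's not in the attached patch, and the names (sync_window_secs, dirty_files) are just made up for illustration:

int
fsync_delay_secs(int sync_window_secs, int dirty_files)
{
	/* nothing to spread out when there's at most one file to sync */
	if (dirty_files <= 1)
		return 0;

	/* spread the fsync calls evenly across the time available */
	return sync_window_secs / dirty_files;
}

With the 3 files above and a 60 second window, that works out to the 20 second delay between fsyncs.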

So I fixed the bitrot on the old sorted-writes patch, which was fun since it comes from before the 8.3 changes. It seemed to work. I then moved the structure it uses to hold the list of buffers to write--the thing that gets sorted--into shared memory. It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && (((((Node*)((header->context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down.

There are good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable piece to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or one I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering how serious the problem is when you run into it.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..c8c0f67 100644
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
***************
*** 20,25 ****
--- 20,26 ----
  
  BufferDesc *BufferDescriptors;
  char	   *BufferBlocks;
+ BufAndTag  *BufferTags;
  int32	   *PrivateRefCount;
  
  
*************** int32	   *PrivateRefCount;
*** 72,79 ****
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs,
! 				foundDescs;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
--- 73,81 ----
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs;
! 	bool		foundDescs;
! 	bool		foundTags;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
*************** InitBufferPool(void)
*** 83,92 ****
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	if (foundDescs || foundBufs)
  	{
! 		/* both should be present or neither */
! 		Assert(foundDescs && foundBufs);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
--- 85,98 ----
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	BufferTags = (BufAndTag *)
! 		ShmemInitStruct("Dirty Buffer Tags",
! 						NBuffers * sizeof(BufAndTag), &foundTags);
! 
! 	if (foundDescs || foundBufs || foundTags)
  	{
! 		/* all should be present or none */
! 		Assert(foundDescs && foundBufs && foundTags);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
*************** BufferShmemSize(void)
*** 171,176 ****
--- 177,185 ----
  	/* size of data pages */
  	size = add_size(size, mul_size(NBuffers, BLCKSZ));
  
+ 	/* size of checkpoint buffer tags */
+ 	size = add_size(size, mul_size(NBuffers, sizeof(BufAndTag)));
+ 
  	/* size of stuff controlled by freelist.c */
  	size = add_size(size, StrategyShmemSize());
  
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..bd779bf 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1158,1181 ----
  	}
  }
  
+ static int
+ bufcmp(const void *a, const void *b)
+ {
+ 	const BufAndTag *lhs = (const BufAndTag *) a;
+ 	const BufAndTag *rhs = (const BufAndTag *) b;
+ 	int		r;
+ 
+ 	/* Sort by relation, then fork, then block number within the fork */
+ 	r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ 	if (r != 0)
+ 		return r;
+ 	if (lhs->tag.forkNum != rhs->tag.forkNum)
+ 		return (lhs->tag.forkNum < rhs->tag.forkNum) ? -1 : 1;
+ 	if (lhs->tag.blockNum != rhs->tag.blockNum)
+ 		return (lhs->tag.blockNum < rhs->tag.blockNum) ? -1 : 1;
+ 	return 0;
+ }
+ 
  /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
*************** static void
*** 1171,1180 ****
  BufferSync(int flags)
  {
  	int			buf_id;
- 	int			num_to_scan;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1189,1202 ----
  BufferSync(int flags)
  {
  	int			buf_id;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
+ 	int			dirty_buf;
+ 	int			dirty_files;
+ 	Oid			last_seen_rel;
+ 	ForkNumber  last_seen_fork;
+ 	BlockNumber last_seen_block;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1238,1245 ----
  		if ((bufHdr->flags & mask) == mask)
  		{
  			bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ 			BufferTags[num_to_write].buf_id = buf_id;
+ 			BufferTags[num_to_write].tag = bufHdr->tag;
  			num_to_write++;
  		}
  
*************** BufferSync(int flags)
*** 1225,1246 ****
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
! 	 * since we might as well dump soon-to-be-recycled buffers first.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
! 	buf_id = StrategySyncStart(NULL, NULL);
! 	num_to_scan = NBuffers;
  	num_written = 0;
! 	while (num_to_scan-- > 0)
! 	{
! 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
--- 1249,1307 ----
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
+ 	/*
+ 	 * Sort the list of buffers to write.  It's then straightforward to
+ 	 * count the approximate number of files involved.  There may be
+ 	 * some small error from buffers that turn out to be skipped below,
+ 	 * but that's acceptable for the purpose the file count serves.
+ 	 */
+ 	qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+ 
+ 	/*
+ 	 * Count the number of unique node/fork combinations, relying on the
+ 	 * sorted order
+ 	 */
+ 
+ 	/* Initialize with the first entry in the dirty buffer list */
+ 	last_seen_rel = BufferTags[0].tag.rnode.relNode;
+ 	last_seen_fork = BufferTags[0].tag.forkNum;
+ 	last_seen_block = BufferTags[0].tag.blockNum;
+ 	dirty_files = 1;
+ 
+ 	for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+ 	{
+ 		if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+ 			(last_seen_fork != BufferTags[dirty_buf].tag.forkNum))
+ 		{
+ 			last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+ 			last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+ 			dirty_files++;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * TODO:  This doesn't yet account for the fact that a relation's blocks
+ 	 * can span multiple segment files.
+ 	 */
+ 
+ 	elog(DEBUG1, "BufferSync found %d buffers to write involving %d files",
+ 		 num_to_write, dirty_files);
+ 
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
  	num_written = 0;
! 	for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
! 	{
! 		volatile BufferDesc *bufHdr;
! 		buf_id = BufferTags[dirty_buf].buf_id;
! 		bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
  				num_written++;
  
  				/*
- 				 * We know there are at most num_to_write buffers with
- 				 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- 				 * num_written reaches num_to_write.
- 				 *
- 				 * Note that num_written doesn't include buffers written by
- 				 * other backends, or by the bgwriter cleaning scan. That
- 				 * means that the estimate of how much progress we've made is
- 				 * conservative, and also that this test will often fail to
- 				 * trigger.  But it seems worth making anyway.
- 				 */
- 				if (num_written >= num_to_write)
- 					break;
- 
- 				/*
  				 * Perform normal bgwriter duties and sleep to throttle our
  				 * I/O rate.
  				 */
--- 1324,1329 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
  									 (double) num_written / num_to_write);
  			}
  		}
- 
- 		if (++buf_id >= NBuffers)
- 			buf_id = 0;
  	}
  
  	/*
--- 1331,1336 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0652bdf..1c9c910 100644
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 167,175 ****
--- 167,185 ----
  #define LockBufHdr(bufHdr)		SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)	SpinLockRelease(&(bufHdr)->buf_hdr_lock)
  
+ /*
+  * Checkpoint-time mapping between buffer ids and the buffer tags of the
+  * dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+ 	int			buf_id;
+ 	BufferTag	tag;
+ } BufAndTag;
  
  /* in buf_init.c */
  extern PGDLLIMPORT BufferDesc *BufferDescriptors;
+ extern PGDLLIMPORT BufAndTag *BufferTags;
  
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;