Robert Haas wrote:
Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs...

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.

What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now, I'm going to share it before someone announces a deadline has passed or something. (whistling) I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite it along the lines of one of your ideas, and then it's on to a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work reliably for everyone because of the now well-understood ext3 issues--I never replicated that speedup at the time, for example. And this was before the spread checkpoint code went into 8.3; the hope was that sorting wasn't really going to be necessary after that anyway.

Back to today...instead of something complicated, it struck me that it would be helpful just to have a count of exactly how many files were involved in each checkpoint. I could keep the idea of a fixed delay between fsyncs, but auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them. I thought that if the sorted-writes patch were brought back to a functional state, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order, they would have the best possible odds of finding the data already written. So rather than this possible order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly. Instead we'd get this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds that the data has already been written before the associated sync comes along.

Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on that. In the example above, once the list is sorted, it's easy to see how many times the table name changes during a linear scan of the sorted data: 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between each. Simple math, and exactly the sort of tuning I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time.
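To make that delay math concrete, here's a minimal sketch of the auto-tuning I have in mind. It's not in the attached patch, and the names (sync_window_secs, dirty_files) are just made up for illustration:

int
fsync_delay_secs(int sync_window_secs, int dirty_files)
{
	/* nothing to spread out when there's at most one file to sync */
	if (dirty_files <= 1)
		return 0;

	/* spread the fsync calls evenly across the time available */
	return sync_window_secs / dirty_files;
}

With the 3 files above and a 60 second window, that works out to the 20 second delay between fsyncs.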

So I fixed the bitrot on the old sorted-writes patch, which was fun since it comes from before the 8.3 changes. It seemed to work. I then moved the structure it uses to hold the list of buffers to write--the thing that gets sorted--into shared memory. It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && (((((Node*)((header->context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down.

There are good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable piece to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or one I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering how serious the problem is when you run into it.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..c8c0f67 100644
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
***************
*** 20,25 ****
--- 20,26 ----
  
  BufferDesc *BufferDescriptors;
  char	   *BufferBlocks;
+ BufAndTag  *BufferTags;
  int32	   *PrivateRefCount;
  
  
*************** int32	   *PrivateRefCount;
*** 72,79 ****
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs,
! 				foundDescs;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
--- 73,81 ----
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs;
! 	bool		foundDescs;
! 	bool		foundTags;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
*************** InitBufferPool(void)
*** 83,92 ****
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	if (foundDescs || foundBufs)
  	{
! 		/* both should be present or neither */
! 		Assert(foundDescs && foundBufs);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
--- 85,98 ----
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	BufferTags = (BufAndTag *)
! 		ShmemInitStruct("Dirty Buffer Tags",
! 						NBuffers * sizeof(BufAndTag), &foundTags);
! 
! 	if (foundDescs || foundBufs || foundTags)
  	{
! 		/* all should be present or none */
! 		Assert(foundDescs && foundBufs && foundTags);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
*************** BufferShmemSize(void)
*** 171,176 ****
--- 177,185 ----
  	/* size of data pages */
  	size = add_size(size, mul_size(NBuffers, BLCKSZ));
  
+ 	/* size of checkpoint buffer tags */
+ 	size = add_size(size, mul_size(NBuffers, sizeof(BufAndTag)));
+ 
  	/* size of stuff controlled by freelist.c */
  	size = add_size(size, StrategyShmemSize());
  
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..bd779bf 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1158,1181 ----
  	}
  }
  
+ static int
+ bufcmp(const void *a, const void *b)
+ {
+ 	const BufAndTag *lhs = (const BufAndTag *) a;
+ 	const BufAndTag *rhs = (const BufAndTag *) b;
+ 	int		r;
+ 
+ 	/* Sort by relation, then fork, then block number within the fork */
+ 	r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ 	if (r != 0)
+ 		return r;
+ 	if (lhs->tag.forkNum != rhs->tag.forkNum)
+ 		return (lhs->tag.forkNum < rhs->tag.forkNum) ? -1 : 1;
+ 	if (lhs->tag.blockNum != rhs->tag.blockNum)
+ 		return (lhs->tag.blockNum < rhs->tag.blockNum) ? -1 : 1;
+ 	return 0;
+ }
+ 
  /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
*************** static void
*** 1171,1180 ****
  BufferSync(int flags)
  {
  	int			buf_id;
- 	int			num_to_scan;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1189,1202 ----
  BufferSync(int flags)
  {
  	int			buf_id;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
+ 	int			dirty_buf;
+ 	int			dirty_files;
+ 	Oid			last_seen_rel;
+ 	ForkNumber  last_seen_fork;
+ 	BlockNumber last_seen_block;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1238,1245 ----
  		if ((bufHdr->flags & mask) == mask)
  		{
  			bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ 			BufferTags[num_to_write].buf_id = buf_id;
+ 			BufferTags[num_to_write].tag = bufHdr->tag;
  			num_to_write++;
  		}
  
*************** BufferSync(int flags)
*** 1225,1246 ****
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
! 	 * since we might as well dump soon-to-be-recycled buffers first.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
! 	buf_id = StrategySyncStart(NULL, NULL);
! 	num_to_scan = NBuffers;
  	num_written = 0;
! 	while (num_to_scan-- > 0)
! 	{
! 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
--- 1249,1307 ----
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
+ 	/*
+ 	 * Sort the list of buffers to write.  It's then straightforward to
+ 	 * count the approximate number of files involved.  There may be
+ 	 * some small error from buffers that turn out to be skipped below,
+ 	 * but that's acceptable for the purpose the file count serves.
+ 	 */
+ 	qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+ 
+ 	/*
+ 	 * Count the number of unique node/fork combinations, relying on the
+ 	 * sorted order
+ 	 */
+ 
+ 	/* Initialize with the first entry in the dirty buffer list */
+ 	last_seen_rel = BufferTags[0].tag.rnode.relNode;
+ 	last_seen_fork = BufferTags[0].tag.forkNum;
+ 	last_seen_block = BufferTags[0].tag.blockNum;
+ 	dirty_files = 1;
+ 
+ 	for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+ 	{
+ 		if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+ 			(last_seen_fork != BufferTags[dirty_buf].tag.forkNum))
+ 		{
+ 			last_seen_rel = BufferTags[dirty_buf].tag.rnode.relNode;
+ 			last_seen_fork = BufferTags[dirty_buf].tag.forkNum;
+ 			dirty_files++;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * TODO:  This doesn't yet account for the fact that a relation's blocks
+ 	 * can span multiple segment files.
+ 	 */
+ 
+ 	elog(DEBUG1, "BufferSync found %d buffers to write involving %d files",
+ 		 num_to_write, dirty_files);
+ 
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
  	num_written = 0;
! 	for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
! 	{
! 		volatile BufferDesc *bufHdr;
! 		buf_id = BufferTags[dirty_buf].buf_id;
! 		bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
  				num_written++;
  
  				/*
- 				 * We know there are at most num_to_write buffers with
- 				 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- 				 * num_written reaches num_to_write.
- 				 *
- 				 * Note that num_written doesn't include buffers written by
- 				 * other backends, or by the bgwriter cleaning scan. That
- 				 * means that the estimate of how much progress we've made is
- 				 * conservative, and also that this test will often fail to
- 				 * trigger.  But it seems worth making anyway.
- 				 */
- 				if (num_written >= num_to_write)
- 					break;
- 
- 				/*
  				 * Perform normal bgwriter duties and sleep to throttle our
  				 * I/O rate.
  				 */
--- 1324,1329 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
  									 (double) num_written / num_to_write);
  			}
  		}
- 
- 		if (++buf_id >= NBuffers)
- 			buf_id = 0;
  	}
  
  	/*
--- 1331,1336 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0652bdf..1c9c910 100644
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 167,175 ****
--- 167,185 ----
  #define LockBufHdr(bufHdr)		SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)	SpinLockRelease(&(bufHdr)->buf_hdr_lock)
  
+ /*
+  * Checkpoint-time mapping between buffer ids and the buffer tags of the
+  * dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+ 	int			buf_id;
+ 	BufferTag	tag;
+ } BufAndTag;
  
  /* in buf_init.c */
  extern PGDLLIMPORT BufferDesc *BufferDescriptors;
+ extern PGDLLIMPORT BufAndTag *BufferTags;
  
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;