Attached is a new version of Simon's "scan-resistant buffer manager" patch. It's not ready for committing yet because of a small issue I found this morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large scans like VACUUM, COPY and seq-scans. Changes to the original patch:

- different sized rings are used for VACUUM and seq scans than for COPY. VACUUM and seq scans use a ring of 32 buffers, while COPY uses a ring of 4096 buffers in the default configuration. See the README changes in the patch for the rationale, and the simplified sketch after this list for the arithmetic.

- for queries with large seq scans, the buffer ring is only used for reads issued by the seq scan itself, not for any other reads in the query. A typical scenario where this matters is a large seq scan with a nested loop join to a smaller table: you don't want to use the buffer ring for the index lookups inside the nested loop.

- for seq scans, buffers that would need a WAL flush to reuse are dropped from the ring. That makes bulk updates behave roughly like they do without the patch, instead of forcing a WAL flush every 32 pages.

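To make the ring sizes and the reuse scheme concrete, here is a heavily simplified, backend-local sketch. The identifiers (ring, ring_cur, ring_init, ring_next_victim) are illustrative only; the real logic, including the refcount/usage_count checks and the fallback to the shared clock sweep, is in freelist.c in the patch. The arithmetic assumes the default 8 kB BLCKSZ and 16 MB WAL segments:

#define BLCKSZ				8192					/* default block size */
#define XLOG_SEG_SIZE		(16 * 1024 * 1024)		/* default WAL segment size */

/* Ring sizes, as defined in the patch */
#define BULKREAD_RING_SIZE	(256 * 1024 / BLCKSZ)			/* = 32 buffers */
#define VACUUM_RING_SIZE	(256 * 1024 / BLCKSZ)			/* = 32 buffers */
#define COPY_RING_SIZE		((XLOG_SEG_SIZE / BLCKSZ) * 2)	/* = 4096, capped at NBuffers / 8 */

#define BUF_ID_NOT_SET	(-1)	/* marks a ring slot that hasn't been filled yet */

static int	ring[BULKREAD_RING_SIZE];	/* backend-local array of buffer ids */
static int	ring_cur = 0;

/*
 * Empty the ring. The patch does the equivalent in InitRing() whenever the
 * active access pattern needs a different ring size.
 */
static void
ring_init(void)
{
	int		i;

	for (i = 0; i < BULKREAD_RING_SIZE; i++)
		ring[i] = BUF_ID_NOT_SET;
	ring_cur = 0;
}

/*
 * Advance the private "clock hand" and return the buffer id to reuse, or
 * BUF_ID_NOT_SET to tell the caller to pick a victim with the normal strategy
 * and fill this slot with it.
 */
static int
ring_next_victim(void)
{
	if (++ring_cur >= BULKREAD_RING_SIZE)
		ring_cur = 0;
	return ring[ring_cur];
}
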
I've spent a lot of time thinking about solutions to that last point. The obvious solution would be not to use the buffer ring for updating scans at all. The difficulty with that is that in heapam.c, where the hint to use the buffer ring is set, we don't know whether the scan is read-only.

I've completed a set of performance tests on a test server. The server has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

 head-copy-bigtable               | 00:10:09.07016
 head-copy-bigtable               | 00:10:20.507357
 head-copy-bigtable               | 00:10:21.857677
 head-copy_nowal-bigtable         | 00:05:18.232956
 head-copy_nowal-bigtable         | 00:03:24.109047
 head-copy_nowal-bigtable         | 00:05:31.019643
 head-select-bigtable             | 00:03:47.102731
 head-select-bigtable             | 00:01:08.314719
 head-select-bigtable             | 00:01:08.238509
 head-select-bigtable             | 00:01:08.208563
 head-select-bigtable             | 00:01:08.28347
 head-select-bigtable             | 00:01:08.308671
 head-vacuum_clean-bigtable       | 00:01:04.227832
 head-vacuum_clean-bigtable       | 00:01:04.232258
 head-vacuum_clean-bigtable       | 00:01:04.294621
 head-vacuum_clean-bigtable       | 00:01:04.280677
 head-vacuum_hintbits-bigtable    | 00:04:01.123924
 head-vacuum_hintbits-bigtable    | 00:03:58.253175
 head-vacuum_hintbits-bigtable    | 00:04:26.318159
 head-vacuum_hintbits-bigtable    | 00:04:37.512965
 patched-copy-bigtable            | 00:09:52.776754
 patched-copy-bigtable            | 00:10:18.185826
 patched-copy-bigtable            | 00:10:16.975482
 patched-copy_nowal-bigtable      | 00:03:14.882366
 patched-copy_nowal-bigtable      | 00:04:01.04648
 patched-copy_nowal-bigtable      | 00:03:56.062272
 patched-select-bigtable          | 00:03:47.704154
 patched-select-bigtable          | 00:01:08.460326
 patched-select-bigtable          | 00:01:10.441544
 patched-select-bigtable          | 00:01:11.916221
 patched-select-bigtable          | 00:01:13.848038
 patched-select-bigtable          | 00:01:10.956133
 patched-vacuum_clean-bigtable    | 00:01:10.315439
 patched-vacuum_clean-bigtable    | 00:01:12.210537
 patched-vacuum_clean-bigtable    | 00:01:15.202114
 patched-vacuum_clean-bigtable    | 00:01:10.712235
 patched-vacuum_hintbits-bigtable | 00:03:42.279201
 patched-vacuum_hintbits-bigtable | 00:04:02.057778
 patched-vacuum_hintbits-bigtable | 00:04:26.805822
 patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect on the 10 GB table, as expected. The select times did go up by a couple of seconds, though, which I didn't expect. One theory is that unused shared_buffers get swapped out during the tests and the bgwriter pulls them back in; I'll set swappiness to 0 and try again at some point.

Results for a 2 GB table:

 copy-medsize-unpatched            | 00:02:18.23246
 copy-medsize-unpatched            | 00:02:22.347194
 copy-medsize-unpatched            | 00:02:23.875874
 copy_nowal-medsize-unpatched      | 00:01:27.606334
 copy_nowal-medsize-unpatched      | 00:01:17.491243
 copy_nowal-medsize-unpatched      | 00:01:31.902719
 select-medsize-unpatched          | 00:00:03.786031
 select-medsize-unpatched          | 00:00:02.678069
 select-medsize-unpatched          | 00:00:02.666103
 select-medsize-unpatched          | 00:00:02.673494
 select-medsize-unpatched          | 00:00:02.669645
 select-medsize-unpatched          | 00:00:02.666278
 vacuum_clean-medsize-unpatched    | 00:00:01.091356
 vacuum_clean-medsize-unpatched    | 00:00:01.923138
 vacuum_clean-medsize-unpatched    | 00:00:01.917213
 vacuum_clean-medsize-unpatched    | 00:00:01.917333
 vacuum_hintbits-medsize-unpatched | 00:00:01.683718
 vacuum_hintbits-medsize-unpatched | 00:00:01.864003
 vacuum_hintbits-medsize-unpatched | 00:00:03.186596
 vacuum_hintbits-medsize-unpatched | 00:00:02.16494
 copy-medsize-patched              | 00:02:35.113501
 copy-medsize-patched              | 00:02:25.269866
 copy-medsize-patched              | 00:02:31.881089
 copy_nowal-medsize-patched        | 00:01:00.254633
 copy_nowal-medsize-patched        | 00:01:04.630687
 copy_nowal-medsize-patched        | 00:01:03.729128
 select-medsize-patched            | 00:00:03.201837
 select-medsize-patched            | 00:00:01.332975
 select-medsize-patched            | 00:00:01.33014
 select-medsize-patched            | 00:00:01.332392
 select-medsize-patched            | 00:00:01.333498
 select-medsize-patched            | 00:00:01.332692
 vacuum_clean-medsize-patched      | 00:00:01.140189
 vacuum_clean-medsize-patched      | 00:00:01.062762
 vacuum_clean-medsize-patched      | 00:00:01.062402
 vacuum_clean-medsize-patched      | 00:00:01.07113
 vacuum_hintbits-medsize-patched   | 00:00:17.865446
 vacuum_hintbits-medsize-patched   | 00:00:15.162064
 vacuum_hintbits-medsize-patched   | 00:00:01.704651
 vacuum_hintbits-medsize-patched   | 00:00:02.671651

This looks good to me, except for some glitch in the last vacuum_hintbits tests. Selects and vacuums benefit significantly, as does non-WAL-logged COPY.

Not shown here, but I ran tests earlier with vacuum on a table that actually had dead tuples to remove. In that test the patched version really shone, reducing the runtime to roughly 1/6th. That was the original motivation for this patch: not having to do a WAL flush on every page in the second phase of vacuum.

Test script attached. To use it:

1. Edit testscript.sh. Change BIGTABLESIZE.
2. Start the postmaster
3. Run the script, giving a test label as the argument. For example: "./testscript.sh bigtable-patched"

Attached is also the patch I used for the tests.

I would appreciate it if people would download the patch and the script and repeat the tests on different hardware. I'm particularly interested in testing on a box with good I/O hardware where selects on unpatched PostgreSQL are bottlenecked by CPU.

Barring any surprises, I'm going to fix the remaining issue and submit a final patch, probably over the weekend.

(*) The issue with this patch is that if the buffer cache is completely filled with dirty buffers that need a WAL flush to evict, the buffer ring code will get into an infinite loop trying to find one that doesn't need a WAL flush. Should be simple to fix.
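
One way to fix it would be to remember, backend-locally, whether the victim buffer actually came from the ring, and to have StrategyRejectBuffer() reject only in that case; a victim freshly chosen by the shared clock sweep would then always be accepted, so the loop in BufferAlloc() terminates even if every buffer needs a WAL flush. A rough sketch of what that could look like in freelist.c (the current_was_in_ring flag is hypothetical and not part of the attached patch; GetBufferFromRing() would set it and the clock sweep path would clear it):

/* Did the current victim buffer come from the ring? (hypothetical) */
static bool current_was_in_ring = false;

bool
StrategyRejectBuffer(volatile BufferDesc *buf)
{
	/* Never reject a victim chosen by the shared clock sweep */
	if (!current_was_in_ring)
		return false;

	/* Drop the ring buffer and ask the caller to pick another victim */
	BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
	return true;
}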

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Attachment: testscript.sh
Description: application/shellscript

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c	8 Apr 2007 01:26:27 -0000	1.232
--- src/backend/access/heap/heapam.c	16 May 2007 11:35:14 -0000
***************
*** 83,88 ****
--- 83,96 ----
  	 */
  	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
  
+ 	/* A scan on a table smaller than shared_buffers is treated like random
+ 	 * access, but bigger scans should use the bulk read replacement policy.
+ 	 */
+ 	if (scan->rs_nblocks > NBuffers)
+ 		scan->rs_accesspattern = AP_BULKREAD;
+ 	else
+ 		scan->rs_accesspattern = AP_NORMAL;
+ 
  	scan->rs_inited = false;
  	scan->rs_ctup.t_data = NULL;
  	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 123,133 ****
--- 131,146 ----
  
  	Assert(page < scan->rs_nblocks);
  
+ 	/* Read the page with the right strategy */
+ 	SetAccessPattern(scan->rs_accesspattern);
+ 
  	scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
  										 scan->rs_rd,
  										 page);
  	scan->rs_cblock = page;
  
+ 	SetAccessPattern(AP_NORMAL);
+ 
  	if (!scan->rs_pageatatime)
  		return;
  
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c	30 Apr 2007 21:01:52 -0000	1.268
--- src/backend/access/transam/xlog.c	15 May 2007 16:23:30 -0000
***************
*** 1668,1673 ****
--- 1668,1700 ----
  }
  
  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+ 	/* Quick exit if already known flushed */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+ 
+ 	/* read LogwrtResult and update local state */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		LogwrtResult = xlogctl->LogwrtResult;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
+ 
+ 	/* check again */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+ 
+ 	return true;
+ }
+ 
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c	27 Apr 2007 22:05:46 -0000	1.283
--- src/backend/commands/copy.c	15 May 2007 17:05:29 -0000
***************
*** 1876,1881 ****
--- 1876,1888 ----
  	nfields = file_has_oids ? (attr_count + 1) : attr_count;
  	field_strings = (char **) palloc(nfields * sizeof(char *));
  
+ 	/* Use the special COPY buffer replacement strategy if WAL-logging
+ 	 * is enabled. If it's not, the pages we're writing are dirty but
+ 	 * don't need a WAL flush to write out, so the BULKREAD strategy
+ 	 * is more suitable.
+ 	 */
+ 	SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
+ 
  	/* Initialize state variables */
  	cstate->fe_eof = false;
  	cstate->eol_type = EOL_UNKNOWN;
***************
*** 2161,2166 ****
--- 2168,2176 ----
  							cstate->filename)));
  	}
  
+ 	/* Reset buffer replacement strategy */
+ 	SetAccessPattern(AP_NORMAL);
+ 
  	/* 
  	 * If we skipped writing WAL, then we need to sync the heap (but not
  	 * indexes since those use WAL anyway)
Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.350
diff -c -r1.350 vacuum.c
*** src/backend/commands/vacuum.c	16 Apr 2007 18:29:50 -0000	1.350
--- src/backend/commands/vacuum.c	15 May 2007 17:06:18 -0000
***************
*** 421,431 ****
  				 * Tell the buffer replacement strategy that vacuum is causing
  				 * the IO
  				 */
! 				StrategyHintVacuum(true);
  
  				analyze_rel(relid, vacstmt);
  
! 				StrategyHintVacuum(false);
  
  				if (use_own_xacts)
  					CommitTransactionCommand();
--- 421,431 ----
  				 * Tell the buffer replacement strategy that vacuum is causing
  				 * the IO
  				 */
! 				SetAccessPattern(AP_VACUUM);
  
  				analyze_rel(relid, vacstmt);
  
! 				SetAccessPattern(AP_NORMAL);
  
  				if (use_own_xacts)
  					CommitTransactionCommand();
***************
*** 442,448 ****
  		/* Make sure cost accounting is turned off after error */
  		VacuumCostActive = false;
  		/* And reset buffer replacement strategy, too */
! 		StrategyHintVacuum(false);
  		PG_RE_THROW();
  	}
  	PG_END_TRY();
--- 442,448 ----
  		/* Make sure cost accounting is turned off after error */
  		VacuumCostActive = false;
  		/* And reset buffer replacement strategy, too */
! 		SetAccessPattern(AP_NORMAL);
  		PG_RE_THROW();
  	}
  	PG_END_TRY();
***************
*** 1088,1094 ****
  	 * Tell the cache replacement strategy that vacuum is causing all
  	 * following IO
  	 */
! 	StrategyHintVacuum(true);
  
  	/*
  	 * Do the actual work --- either FULL or "lazy" vacuum
--- 1088,1094 ----
  	 * Tell the cache replacement strategy that vacuum is causing all
  	 * following IO
  	 */
! 	SetAccessPattern(AP_VACUUM);
  
  	/*
  	 * Do the actual work --- either FULL or "lazy" vacuum
***************
*** 1098,1104 ****
  	else
  		lazy_vacuum_rel(onerel, vacstmt);
  
! 	StrategyHintVacuum(false);
  
  	/* all done with this class, but hold lock until commit */
  	relation_close(onerel, NoLock);
--- 1098,1104 ----
  	else
  		lazy_vacuum_rel(onerel, vacstmt);
  
! 	SetAccessPattern(AP_NORMAL);
  
  	/* all done with this class, but hold lock until commit */
  	relation_close(onerel, NoLock);
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.11
diff -c -r1.11 README
*** src/backend/storage/buffer/README	23 Jul 2006 03:07:58 -0000	1.11
--- src/backend/storage/buffer/README	16 May 2007 11:43:11 -0000
***************
*** 152,159 ****
  a field to show which backend is doing its I/O).
  
  
! Buffer replacement strategy
! ---------------------------
  
  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
--- 152,159 ----
  a field to show which backend is doing its I/O).
  
  
! Normal buffer replacement strategy
! ----------------------------------
  
  There is a "free list" of buffers that are prime candidates for replacement.
  In particular, buffers that are completely free (contain no valid page) are
***************
*** 199,221 ****
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)
  
- A special provision is that while running VACUUM, a backend does not
- increment the usage count on buffers it accesses.  In fact, if ReleaseBuffer
- sees that it is dropping the pin count to zero and the usage count is zero,
- then it appends the buffer to the tail of the free list.  (This implies that
- VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
- this shouldn't create much of a contention problem.)  This provision
- encourages VACUUM to work in a relatively small number of buffers rather
- than blowing out the entire buffer cache.  It is reasonable since a page
- that has been touched only by VACUUM is unlikely to be needed again soon.
- 
- Since VACUUM usually requests many pages very fast, the effect of this is that
- it will get back the very buffers it filled and possibly modified on the next
- call and will therefore do its work in a few shared memory buffers, while
- being able to use whatever it finds in the cache already.  This also implies
- that most of the write traffic caused by a VACUUM will be done by the VACUUM
- itself and not pushed off onto other processes.
  
  
  Background writer's processing
  ------------------------------
--- 199,243 ----
  have to give up and try another buffer.  This however is not a concern
  of the basic select-a-victim-buffer algorithm.)
  
  
+ Buffer ring replacement strategy
+ ---------------------------------
+ 
+ When running a query that needs to access a large number of pages, like VACUUM,
+ COPY, or a large sequential scan, a different strategy is used.  A page that
+ has been touched only by such a scan is unlikely to be needed again soon, so
+ instead of running the normal clock sweep algorithm and blowing out the entire
+ buffer cache, a small ring of buffers is allocated using the normal clock sweep
+ algorithm and those buffers are reused for the whole scan.  This also implies
+ that most of the write traffic caused by such a statement will be done by the
+ backend itself and not pushed off onto other processes.
+ 
+ The size of the ring used depends on the kind of scan:
+ 
+ For sequential scans, a small 256 KB ring is used. That's small enough to fit
+ in L2 cache, which makes transferring pages from OS cache to shared buffer
+ cache efficient. Even less would often be enough, but the ring must be big
+ enough to accommodate all pages in the scan that are pinned concurrently. 
+ 256 KB should also be enough to leave a small cache trail for other backends to
+ join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
+ is removed from the ring and a replacement buffer is chosen using the normal
+ replacement strategy. In a scan that modifies every page in the scan, like a
+ bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
+ ring strategy effectively degrades to the normal strategy.
+ 
+ VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
+ removed from the ring. WAL is flushed instead to allow reuse of the buffers.
+ Before introducing the buffer ring strategy in 8.3, buffers were put to the
+ freelist, which was effectively a buffer ring of 1 buffer.
+ 
+ COPY behaves like VACUUM, but a much larger ring is used. The ring size is
+ chosen to be twice the WAL segment size. This avoids polluting the buffer cache
+ like the clock sweep would do, and using a ring larger than WAL segment size
+ avoids having to do any extra WAL flushes, since a WAL segment will always be
+ filled, forcing a WAL flush, before looping through the buffer ring and bumping
+ into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
+ operations the smaller 256 KB ring is used because WAL flushes are not needed
+ to write the buffers.
  
  Background writer's processing
  ------------------------------
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.218
diff -c -r1.218 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	2 May 2007 23:34:48 -0000	1.218
--- src/backend/storage/buffer/bufmgr.c	16 May 2007 12:34:10 -0000
***************
*** 419,431 ****
  	/* Loop here in case we have to try another victim buffer */
  	for (;;)
  	{
  		/*
  		 * Select a victim buffer.	The buffer is returned with its header
  		 * spinlock still held!  Also the BufFreelistLock is still held, since
  		 * it would be bad to hold the spinlock while possibly waking up other
  		 * processes.
  		 */
! 		buf = StrategyGetBuffer();
  
  		Assert(buf->refcount == 0);
  
--- 419,433 ----
  	/* Loop here in case we have to try another victim buffer */
  	for (;;)
  	{
+ 		bool lock_held;
+ 
  		/*
  		 * Select a victim buffer.	The buffer is returned with its header
  		 * spinlock still held!  Also the BufFreelistLock is still held, since
  		 * it would be bad to hold the spinlock while possibly waking up other
  		 * processes.
  		 */
! 		buf = StrategyGetBuffer(&lock_held);
  
  		Assert(buf->refcount == 0);
  
***************
*** 436,442 ****
  		PinBuffer_Locked(buf);
  
  		/* Now it's safe to release the freelist lock */
! 		LWLockRelease(BufFreelistLock);
  
  		/*
  		 * If the buffer was dirty, try to write it out.  There is a race
--- 438,445 ----
  		PinBuffer_Locked(buf);
  
  		/* Now it's safe to release the freelist lock */
! 		if (lock_held)
! 			LWLockRelease(BufFreelistLock);
  
  		/*
  		 * If the buffer was dirty, try to write it out.  There is a race
***************
*** 464,469 ****
--- 467,489 ----
  			 */
  			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
  			{
+ 				/* In BULKREAD-mode, check if a WAL flush would be needed to
+ 				 * evict this buffer. If so, ask the replacement strategy if
+ 				 * we should go ahead and do it or choose another victim.
+ 				 */
+ 				if (active_access_pattern == AP_BULKREAD)
+ 				{
+ 					if (XLogNeedsFlush(BufferGetLSN(buf)))
+ 					{
+ 						if (StrategyRejectBuffer(buf))
+ 						{
+ 							LWLockRelease(buf->content_lock);
+ 							UnpinBuffer(buf, true, false);
+ 							continue;
+ 						}
+ 					}
+ 				}
+ 
  				FlushBuffer(buf, NULL);
  				LWLockRelease(buf->content_lock);
  			}
***************
*** 925,932 ****
  	PrivateRefCount[b]--;
  	if (PrivateRefCount[b] == 0)
  	{
- 		bool		immed_free_buffer = false;
- 
  		/* I'd better not still hold any locks on the buffer */
  		Assert(!LWLockHeldByMe(buf->content_lock));
  		Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
--- 945,950 ----
***************
*** 940,956 ****
  		/* Update buffer usage info, unless this is an internal access */
  		if (normalAccess)
  		{
! 			if (!strategy_hint_vacuum)
  			{
! 				if (buf->usage_count < BM_MAX_USAGE_COUNT)
! 					buf->usage_count++;
  			}
  			else
! 			{
! 				/* VACUUM accesses don't bump usage count, instead... */
! 				if (buf->refcount == 0 && buf->usage_count == 0)
! 					immed_free_buffer = true;
! 			}
  		}
  
  		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
--- 958,975 ----
  		/* Update buffer usage info, unless this is an internal access */
  		if (normalAccess)
  		{
! 			if (active_access_pattern != AP_NORMAL)
  			{
! 				/* We don't want large one-off scans like vacuum to inflate 
! 				 * the usage_count. We do want to set it to 1, though, to keep
! 				 * other backends from hijacking it from the buffer ring.
! 				 */
! 				if (buf->usage_count == 0)
! 					buf->usage_count = 1;
  			}
  			else
! 			if (buf->usage_count < BM_MAX_USAGE_COUNT)
! 				buf->usage_count++;
  		}
  
  		if ((buf->flags & BM_PIN_COUNT_WAITER) &&
***************
*** 965,978 ****
  		}
  		else
  			UnlockBufHdr(buf);
- 
- 		/*
- 		 * If VACUUM is releasing an otherwise-unused buffer, send it to the
- 		 * freelist for near-term reuse.  We put it at the tail so that it
- 		 * won't be used before any invalid buffers that may exist.
- 		 */
- 		if (immed_free_buffer)
- 			StrategyFreeBuffer(buf, false);
  	}
  }
  
--- 984,989 ----
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.58
diff -c -r1.58 freelist.c
*** src/backend/storage/buffer/freelist.c	5 Jan 2007 22:19:37 -0000	1.58
--- src/backend/storage/buffer/freelist.c	17 May 2007 16:12:56 -0000
***************
*** 18,23 ****
--- 18,25 ----
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  
+ #include "utils/memutils.h"
+ 
  
  /*
   * The shared freelist control information.
***************
*** 39,47 ****
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;
  
! /* Backend-local state about whether currently vacuuming */
! bool		strategy_hint_vacuum = false;
  
  
  /*
   * StrategyGetBuffer
--- 41,53 ----
  /* Pointers to shared state */
  static BufferStrategyControl *StrategyControl = NULL;
  
! /* Currently active access pattern hint. */
! AccessPattern active_access_pattern = AP_NORMAL;
  
+ /* prototypes for internal functions */
+ static volatile BufferDesc *GetBufferFromRing(void);
+ static void PutBufferToRing(volatile BufferDesc *buf);
+ static void InitRing(void);
  
  /*
   * StrategyGetBuffer
***************
*** 51,67 ****
   *	the selected buffer must not currently be pinned by anyone.
   *
   *	To ensure that no one else can pin the buffer before we do, we must
!  *	return the buffer with the buffer header spinlock still held.  That
!  *	means that we return with the BufFreelistLock still held, as well;
!  *	the caller must release that lock once the spinlock is dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(void)
  {
  	volatile BufferDesc *buf;
  	int			trycounter;
  
  	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
  
  	/*
  	 * Try to get a buffer from the freelist.  Note that the freeNext fields
--- 57,89 ----
   *	the selected buffer must not currently be pinned by anyone.
   *
   *	To ensure that no one else can pin the buffer before we do, we must
!  *	return the buffer with the buffer header spinlock still held.  If
!  *	*lock_held is set at return, we return with the BufFreelistLock still
!  *	held, as well;	the caller must release that lock once the spinlock is
!  *	dropped.
   */
  volatile BufferDesc *
! StrategyGetBuffer(bool *lock_held)
  {
  	volatile BufferDesc *buf;
  	int			trycounter;
  
+ 	/* Get a buffer from the ring if we're doing a bulk scan */
+ 	if (active_access_pattern != AP_NORMAL)
+ 	{
+ 		buf = GetBufferFromRing();
+ 		if (buf != NULL)
+ 		{
+ 			*lock_held = false;
+ 			return buf;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * If our selected buffer wasn't available, pick another...
+ 	 */
  	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ 	*lock_held = true;
  
  	/*
  	 * Try to get a buffer from the freelist.  Note that the freeNext fields
***************
*** 86,96 ****
  		 */
  		LockBufHdr(buf);
  		if (buf->refcount == 0 && buf->usage_count == 0)
  			return buf;
  		UnlockBufHdr(buf);
  	}
  
! 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
  	trycounter = NBuffers;
  	for (;;)
  	{
--- 108,122 ----
  		 */
  		LockBufHdr(buf);
  		if (buf->refcount == 0 && buf->usage_count == 0)
+ 		{
+ 			if (active_access_pattern != AP_NORMAL)
+ 				PutBufferToRing(buf);
  			return buf;
+ 		}
  		UnlockBufHdr(buf);
  	}
  
! 	/* Nothing on the freelist, so run the shared "clock sweep" algorithm */
  	trycounter = NBuffers;
  	for (;;)
  	{
***************
*** 105,111 ****
--- 131,141 ----
  		 */
  		LockBufHdr(buf);
  		if (buf->refcount == 0 && buf->usage_count == 0)
+ 		{
+ 			if (active_access_pattern != AP_NORMAL)
+ 				PutBufferToRing(buf);
  			return buf;
+ 		}
  		if (buf->usage_count > 0)
  		{
  			buf->usage_count--;
***************
*** 191,204 ****
  }
  
  /*
!  * StrategyHintVacuum -- tell us whether VACUUM is active
   */
  void
! StrategyHintVacuum(bool vacuum_active)
  {
! 	strategy_hint_vacuum = vacuum_active;
! }
  
  
  /*
   * StrategyShmemSize
--- 221,245 ----
  }
  
  /*
!  * SetAccessPattern -- Sets the active access pattern hint
!  *
!  * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
!  * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
!  * the other strategies, for example in a query with one large sequential scan
!  * nested loop joined to an index scan. Index tuples should be fetched with the
!  * normal strategy and the pages from the seq scan should be read in with the
!  * AP_BULKREAD strategy. The ring won't be affected by such switching, however
!  * switching to an access pattern with different ring size will invalidate the
!  * old ring.
   */
  void
! SetAccessPattern(AccessPattern new_pattern)
  {
! 	active_access_pattern = new_pattern;
  
+ 	if (active_access_pattern != AP_NORMAL)
+ 		InitRing();
+ }
  
  /*
   * StrategyShmemSize
***************
*** 274,276 ****
--- 315,498 ----
  	else
  		Assert(!init);
  }
+ 
+ /* ----------------------------------------------------------------
+  *				Backend-private buffer ring management
+  * ----------------------------------------------------------------
+  */
+ 
+ /*
+  * Ring sizes for different access patterns. See README for the rationale
+  * of these.
+  */
+ #define BULKREAD_RING_SIZE	256 * 1024 / BLCKSZ
+ #define VACUUM_RING_SIZE	256 * 1024 / BLCKSZ
+ #define COPY_RING_SIZE		Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
+ 
+ /*
+  * BufferRing is an array of buffer ids, and RingSize is its size in number of
+  * elements. It's allocated in TopMemoryContext the first time it's needed.
+  */
+ static int *BufferRing = NULL;
+ static int RingSize = 0;
+ 
+ /* Index of the "current" slot in the ring. It's advanced every time a buffer
+  * is handed out from the ring with GetBufferFromRing and it points to the 
+  * last buffer returned from the ring. RingCurSlot + 1 is the next victim
+  * GetBufferFromRing will hand out.
+  */
+ static int RingCurSlot = 0;
+ 
+ /* magic value to mark empty slots in the ring */
+ #define BUF_ID_NOT_SET -1
+ 
+ 
+ /*
+  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
+  *		ring is empty.
+  *
+  * The bufhdr spin lock is held on the returned buffer.
+  */
+ static volatile BufferDesc *
+ GetBufferFromRing(void)
+ {
+ 	volatile BufferDesc *buf;
+ 
+ 	/* ring should be initialized by now */
+ 	Assert(RingSize > 0 && BufferRing != NULL);
+ 
+ 	/* Run private "clock cycle" */
+ 	if (++RingCurSlot >= RingSize)
+ 		RingCurSlot = 0;
+ 
+ 	/*
+ 	 * If that slot hasn't been filled yet, tell the caller to allocate
+ 	 * a new buffer with the normal allocation strategy. He will then
+ 	 * fill this slot by calling PutBufferToRing with the new buffer.
+ 	 */
+ 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+ 		return NULL;
+ 
+ 	buf = &BufferDescriptors[BufferRing[RingCurSlot]];
+ 
+ 	/*
+ 	 * If the buffer is pinned we cannot use it under any circumstances.
+ 	 * If usage_count == 0 then the buffer is fair game. 
+ 	 *
+ 	 * We also choose this buffer if usage_count == 1. Strictly, this
+ 	 * might sometimes be the wrong thing to do, but we rely on the high
+ 	 * probability that it was this process that last touched the buffer.
+ 	 * If it wasn't, we'll choose a suboptimal victim, but  it shouldn't
+ 	 * make any difference in the big scheme of things.
+ 	 *
+ 	 */
+ 	LockBufHdr(buf);
+ 	if (buf->refcount == 0 && buf->usage_count <= 1)
+ 		return buf;
+ 	UnlockBufHdr(buf);
+ 
+ 	return NULL;
+ }
+ 
+ /*
+  * PutBufferToRing -- adds a buffer to the buffer ring
+  *
+  * Caller must hold the buffer header spinlock on the buffer.
+  */
+ static void
+ PutBufferToRing(volatile BufferDesc *buf)
+ {
+ 	/* ring should be initialized by now */
+ 	Assert(RingSize > 0 && BufferRing != NULL);
+ 
+ 	if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
+ 		BufferRing[RingCurSlot] = buf->buf_id;
+ }
+ 
+ /*
+  * Initializes a ring buffer with correct size for the currently
+  * active strategy. Does nothing if the ring already has the right size.
+  */
+ static void
+ InitRing(void)
+ {
+ 	int new_size;
+ 	int old_size = RingSize;
+ 	int i;
+ 	MemoryContext oldcxt;
+ 
+ 	/* Determine new size */
+ 
+ 	switch(active_access_pattern)
+ 	{
+ 		case AP_BULKREAD:
+ 			new_size = BULKREAD_RING_SIZE;
+ 			break;
+ 		case AP_COPY:
+ 			new_size = COPY_RING_SIZE;
+ 			break;
+ 		case AP_VACUUM:
+ 			new_size = VACUUM_RING_SIZE;
+ 			break;
+ 		default:
+ 			elog(ERROR, "unexpected buffer cache strategy %d", 
+ 				 active_access_pattern);
+ 			return; /* keep compiler happy */
+ 	}
+ 
+ 	/*
+ 	 * Seq scans set and reset the strategy on every page, so we better exit
+ 	 * quickly if no change in size is needed.
+ 	 */
+ 	if (new_size == old_size)
+ 		return;
+ 
+ 	/* Allocate array */
+ 
+ 	oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+ 
+ 	if (old_size == 0)
+ 	{
+ 		Assert(BufferRing == NULL);
+ 		BufferRing = palloc(new_size * sizeof(int));
+ 	}
+ 	else
+ 		BufferRing = repalloc(BufferRing, new_size * sizeof(int));
+ 
+ 	MemoryContextSwitchTo(oldcxt);
+ 
+ 	for(i = 0; i < new_size; i++)
+ 		BufferRing[i] = BUF_ID_NOT_SET;
+ 
+ 	RingCurSlot = 0;
+ 	RingSize = new_size;
+ }
+ 
+ /*
+  * Buffer manager calls this function in AP_BULKREAD mode when the
+  * buffer handed to it turns out to need a WAL flush to write out. This
+  * gives the strategy a second chance to choose another victim.
+  *
+  * Returns true if buffer manager should ask for a new victim, and false
+  * if WAL should be flushed and this buffer used.
+  */
+ bool
+ StrategyRejectBuffer(volatile BufferDesc *buf)
+ {
+ 	Assert(RingSize > 0);
+ 
+ 	if (BufferRing[RingCurSlot] == buf->buf_id)
+ 	{
+ 		BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
+ 		return true;
+ 	}
+ 	else
+ 	{
+ 		/* Apparently the buffer didn't come from the ring. We don't want to
+ 		 * mess with how the clock sweep works; in the worst case there are no
+ 		 * buffers in the buffer cache that can be reused without a WAL flush,
+ 		 * and we'd get into an endless loop trying.
+ 		 */
+ 		return false;
+ 	}
+ }
Index: src/include/access/relscan.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
retrieving revision 1.52
diff -c -r1.52 relscan.h
*** src/include/access/relscan.h	20 Jan 2007 18:43:35 -0000	1.52
--- src/include/access/relscan.h	15 May 2007 17:01:31 -0000
***************
*** 28,33 ****
--- 28,34 ----
  	ScanKey		rs_key;			/* array of scan key descriptors */
  	BlockNumber rs_nblocks;		/* number of blocks to scan */
  	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
+ 	AccessPattern rs_accesspattern; /* access pattern to use for reads */
  
  	/* scan current state */
  	bool		rs_inited;		/* false = scan not init'd yet */
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h	5 Jan 2007 22:19:51 -0000	1.76
--- src/include/access/xlog.h	14 May 2007 21:22:40 -0000
***************
*** 151,156 ****
--- 151,157 ----
  
  extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
  extern void XLogFlush(XLogRecPtr RecPtr);
+ extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
  
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.89
diff -c -r1.89 buf_internals.h
*** src/include/storage/buf_internals.h	5 Jan 2007 22:19:57 -0000	1.89
--- src/include/storage/buf_internals.h	15 May 2007 17:07:59 -0000
***************
*** 16,21 ****
--- 16,22 ----
  #define BUFMGR_INTERNALS_H
  
  #include "storage/buf.h"
+ #include "storage/bufmgr.h"
  #include "storage/lwlock.h"
  #include "storage/shmem.h"
  #include "storage/spin.h"
***************
*** 168,174 ****
  extern BufferDesc *LocalBufferDescriptors;
  
  /* in freelist.c */
! extern bool strategy_hint_vacuum;
  
  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
--- 169,175 ----
  extern BufferDesc *LocalBufferDescriptors;
  
  /* in freelist.c */
! extern AccessPattern active_access_pattern;
  
  /* event counters in buf_init.c */
  extern long int ReadBufferCount;
***************
*** 184,195 ****
   */
  
  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(void);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int	StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);
  
  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
--- 185,198 ----
   */
  
  /* freelist.c */
! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
  extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
  extern int	StrategySyncStart(void);
  extern Size StrategyShmemSize(void);
  extern void StrategyInitialize(bool init);
  
+ extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
+ 
  /* buf_table.c */
  extern Size BufTableShmemSize(int size);
  extern void InitBufTable(int size);
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.103
diff -c -r1.103 bufmgr.h
*** src/include/storage/bufmgr.h	2 May 2007 23:18:03 -0000	1.103
--- src/include/storage/bufmgr.h	15 May 2007 17:07:02 -0000
***************
*** 48,53 ****
--- 48,61 ----
  #define BUFFER_LOCK_SHARE		1
  #define BUFFER_LOCK_EXCLUSIVE	2
  
+ typedef enum AccessPattern
+ {
+ 	AP_NORMAL,		/* Normal random access */
+     AP_BULKREAD,	/* Large read-only scan (hint bit updates are ok) */
+     AP_COPY,		/* Large updating scan, like COPY with WAL enabled */
+     AP_VACUUM,		/* VACUUM */
+ } AccessPattern;
+ 
  /*
   * These routines are beaten on quite heavily, hence the macroization.
   */
***************
*** 157,162 ****
  extern void AtProcExit_LocalBuffers(void);
  
  /* in freelist.c */
! extern void StrategyHintVacuum(bool vacuum_active);
  
  #endif
--- 165,170 ----
  extern void AtProcExit_LocalBuffers(void);
  
  /* in freelist.c */
! extern void SetAccessPattern(AccessPattern new_pattern);
  
  #endif