Attached are two patches that try to recast the ideas of Itagaki Takahiro's auto bgwriter_lru_maxpages patch in the direction I think this code needs to move. Epic-length commentary follows.

The original code came from before there was a pg_stat_bgwriter. The first patch (buf-alloc-stats) takes the two most interesting pieces of data the original patch collected, the number of buffers allocated recently and the number that the clients wrote out, and ties all that into the new stats structure. With this patch applied, you can get a feel for things like churn/turnover in the buffer pool that were very hard to quantify before. Also, it makes it easy to measure how well your background writer is doing at writing buffers so the clients don't have to. Applying this would complete one of my personal goals for the 8.3 release, which was having stats to track every type of buffer write.

I split this out because I think it's very useful to have regardless of whether the automatic tuning portion is accepted, and I think these smaller patches make the review easier. The main thing I would recommend someone check is how am_bg_writer is (mis?)used here. I spliced in some of the debugging-only code from the original patch, and I can't tell if the result is a robust enough approach to solving the problem of having every client indirectly report its activity to the background writer. Other than that, I think this code is ready for review and potentially committing.

The second patch (limit-lru) adds on top of that a constraint on the LRU writer so that it doesn't do any more work than it has to. Note that I left verbose debugging code in here because I'm much less confident that this patch is complete.

It predicts upcoming buffer allocations using a 16-period weighted moving average of recent activity, which you can think of as covering the last 3.2 seconds at the default interval. After testing on a few systems, that seemed a decent compromise, smoothing out spikes in both directions. I found the 2X overallocation fudge factor of the original patch much too aggressive, and instead just pick the larger of the most recent allocation amount and the smoothed value. The main thing that throws off the allocation estimate is hitting a checkpoint, which can cause a big spike after the background writer returns to BgBufferSync and notices all the buffers that were allocated during the checkpoint write; the code then tries to find more buffers to recycle than it really needs. Since the checkpoint itself normally leaves a large wake of reusable buffers behind it, I didn't find this to be a serious problem.
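
Since the arithmetic is all integer division, here's a minimal standalone sketch of the smoothing described above. Only the two formula lines mirror the patch; the function name, driver, and sample values are made up for illustration:

    #include <stdio.h>

    static int smoothed_alloc = 0;

    /* Illustrative only: mirrors the estimation math from BgBufferSync */
    static int
    upcoming_alloc_estimate(int recent_alloc)
    {
        /* 16-period weighted moving average: keep 15/16 of the old value
         * and fold in 1/16 of the newest sample.  At the default 200ms
         * bgwriter_delay, 16 periods covers roughly 3.2 seconds. */
        smoothed_alloc = smoothed_alloc * 15 / 16 + recent_alloc / 16;

        /* No 2X fudge factor; just take the larger of the smoothed value
         * and the most recent allocation count */
        return (recent_alloc > smoothed_alloc) ? recent_alloc : smoothed_alloc;
    }

    int
    main(void)
    {
        int samples[] = {100, 100, 800, 100, 100, 100};
        int i;

        /* The checkpoint-style spike (800) is honored immediately via the
         * max, then decays out of the smoothed value over later intervals */
        for (i = 0; i < 6; i++)
            printf("recent=%d estimate=%d\n", samples[i],
                   upcoming_alloc_estimate(samples[i]));
        return 0;
    }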

There's another communication issue here, which is that SyncOneBuffer needs to return more information about the buffer than it currently does once it gets it locked. To tune itself, the background writer needs to know more than just whether the buffer was written. The original patch used a clever trick for this that worked, but I found it confusing. I happen to have a bunch of other background writer tuning code I'm working on, and I had to come up with a more robust way to communicate buffer internals back via this channel. I reused that code here; it's a bitmask setup similar to how flags like BM_DIRTY are used. It's overkill for solving this particular problem, but I think the interface is clean, and it helps support future enhancements in intelligent background writing.
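
To make the return-code convention concrete, here's a tiny standalone example using the same BUF_* values the patch defines (the specific state built up here is hypothetical). The low five bits carry the buffer's usage_count, and BUF_WRITTEN is the highest bit, so testing for a write is a single comparison:

    #include <stdio.h>

    /* Same values as in the patch; BUF_WRITTEN is the highest bit so that
     * "state >= BUF_WRITTEN" is the one-step test for a completed write */
    #define BUF_WRITTEN        0x80
    #define BUF_CLEAN          0x40
    #define BUF_REUSABLE       0x20
    #define BUF_USAGE_COUNT    0x1F    /* low bits hold the usage_count */

    int
    main(void)
    {
        /* Hypothetical result: usage_count 0, reusable, and written out */
        int buffer_state = 0;

        buffer_state |= BUF_REUSABLE;
        buffer_state |= BUF_WRITTEN;

        if (buffer_state >= BUF_WRITTEN)
            printf("buffer was written\n");
        if (buffer_state & BUF_REUSABLE)
            printf("buffer is a replacement candidate\n");
        printf("usage_count was %d\n", buffer_state & BUF_USAGE_COUNT);
        return 0;
    }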

Now we get to the controversial part. The original patch removed the bgwriter_lru_maxpages parameter and updated the documentation accordingly. I didn't do that here. The reason is that after playing around in this area, I'm not yet convinced I can satisfy all the tuning scenarios I'd like to handle that way. I describe this patch as enforcing a constraint instead; it allows you to set the LRU parameters much higher than was reasonable before, without having to be as concerned about the LRU writer wasting resources.

I already brought up some issues in this area on -hackers ( http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php ) but my work hasn't advanced as fast as I'd hoped. I wanted to submit what I've finished anyway, because I think any approach here is going to have to cope with the issues addressed in these two patches, and I'm happy now with how they're solved here. It's only a one-line delete to disable the LRU limiting behavior of the second patch, at which point it's strictly internals code with no expected functional impact, which alternate approaches might build on.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /d3/pgsql/cvs/pgsql-local/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 7 May 2007 01:48:49 -0000       1.1.1.1
--- src/backend/storage/buffer/bufmgr.c 12 May 2007 22:26:56 -0000
***************
*** 67,72 ****
--- 67,79 ----
  /* interval for calling AbsorbFsyncRequests in BufferSync */
  #define WRITES_PER_ABSORB             1000
  
+ /* Return codes describing what SyncOneBuffer found out and did with the
+  * buffer it processed.  The way code here tests for whether a write
+  * was done depends on BUF_WRITTEN being the highest bit value in this set. */
+ #define BUF_WRITTEN                           0x80
+ #define BUF_CLEAN                             0x40
+ #define BUF_REUSABLE                  0x20
+ #define BUF_USAGE_COUNT                       0x1F
  
  /* GUC variables */
  bool          zero_damaged_pages = false;
***************
*** 101,107 ****
  static void PinBuffer_Locked(volatile BufferDesc *buf);
  static void UnpinBuffer(volatile BufferDesc *buf,
                        bool fixOwner, bool normalAccess);
! static bool SyncOneBuffer(int buf_id, bool skip_pinned);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
--- 108,114 ----
  static void PinBuffer_Locked(volatile BufferDesc *buf);
  static void UnpinBuffer(volatile BufferDesc *buf,
                        bool fixOwner, bool normalAccess);
! static int SyncOneBuffer(int buf_id, bool skip_recently_used);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
***************
*** 1007,1013 ****
        absorb_counter = WRITES_PER_ABSORB;
        while (num_to_scan-- > 0)
        {
!               if (SyncOneBuffer(buf_id, false))
                {
                        BgWriterStats.m_buf_written_checkpoints++;
  
--- 1014,1020 ----
        absorb_counter = WRITES_PER_ABSORB;
        while (num_to_scan-- > 0)
        {
!               if (SyncOneBuffer(buf_id, false)>=BUF_WRITTEN)
                {
                        BgWriterStats.m_buf_written_checkpoints++;
  
***************
*** 1040,1047 ****
        int                     buf_id2;
        int                     num_to_scan;
        int                     num_written;
!       int                     recent_alloc;
        int                     num_client_writes;
  
        /* Make sure we can handle the pin inside SyncOneBuffer */
        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1047,1063 ----
        int                     buf_id2;
        int                     num_to_scan;
        int                     num_written;
! 
!       /* Statistics returned by the freelist strategy code */
        int                     num_client_writes;
+       int                     recent_alloc;
+       
+ 
+       /* Used to estimate the upcoming LRU eviction activity */
+       static int      smoothed_alloc = 0;
+       int                     upcoming_alloc_estimate;
+       int                     reusable_buffers;
+       int                     buffer_state;
  
        /* Make sure we can handle the pin inside SyncOneBuffer */
        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 1073,1079 ****
                {
                        if (++buf_id1 >= NBuffers)
                                buf_id1 = 0;
!                       if (SyncOneBuffer(buf_id1, false))
                        {
                                if (++num_written >= bgwriter_all_maxpages)
                                {
--- 1089,1095 ----
                {
                        if (++buf_id1 >= NBuffers)
                                buf_id1 = 0;
!                       if (SyncOneBuffer(buf_id1, false)>=BUF_WRITTEN)
                        {
                                if (++num_written >= bgwriter_all_maxpages)
                                {
***************
*** 1092,1097 ****
--- 1108,1142 ----
        BgWriterStats.m_buf_alloc+=recent_alloc;
        BgWriterStats.m_buf_written_client+=num_client_writes;
  
+       /* Estimate number of buffers to write based on a smoothed weighted
+        * average of previous and recent buffer allocations */
+       smoothed_alloc = smoothed_alloc * 15 / 16 + recent_alloc / 16;
+ 
+       /* Expect we will soon need either the smoothed amount or the recent allocation amount,
+        * whichever is larger */
+       upcoming_alloc_estimate = smoothed_alloc;
+       if (recent_alloc > upcoming_alloc_estimate)
+               upcoming_alloc_estimate = recent_alloc;
+ 
+       /**** DEBUG show the smoothing in action ***/
+       if (1)  
+       {
+               static int count = 0;
+               static int alloc[10];
+               static int smoothed[10];
+               alloc[count % 10]=recent_alloc;
+               smoothed[count % 10]=smoothed_alloc;
+               if (++count % 10 == 9)
+                       {
+                       elog(LOG,"alloc = %d %d %d %d %d %d %d %d %d %d",
+                       alloc[0],alloc[1],alloc[2],alloc[3],alloc[4],
+                       alloc[5],alloc[6],alloc[7],alloc[8],alloc[9]);
+                       elog(LOG,"smoothed = %d %d %d %d %d %d %d %d %d %d",
+                       smoothed[0],smoothed[1],smoothed[2],smoothed[3],smoothed[4],
+                       smoothed[5],smoothed[6],smoothed[7],smoothed[8],smoothed[9]);
+                       }
+       }
+ 
        /*
         * This loop considers only unpinned buffers close to the clock sweep
         * point.
***************
*** 1100,1139 ****
        {
                num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
                num_written = 0;
! 
                while (num_to_scan-- > 0)
                {
!                       if (SyncOneBuffer(buf_id2, true))
                        {
                                if (++num_written >= bgwriter_lru_maxpages)
                                {
                                        BgWriterStats.m_maxwritten_lru++;
                                        break;
                                }
                        }
                        if (++buf_id2 >= NBuffers)
                                buf_id2 = 0;
                }
                BgWriterStats.m_buf_written_lru += num_written;
        }
  }
  
  /*
   * SyncOneBuffer -- process a single buffer during syncing.
   *
!  * If skip_pinned is true, we don't write currently-pinned buffers, nor
   * buffers marked recently used, as these are not replacement candidates.
   *
!  * Returns true if buffer was written, else false.    (This could be in error
!  * if FlushBuffers finds the buffer clean after locking it, but we don't
!  * care all that much.)
   *
   * Note: caller must have done ResourceOwnerEnlargeBuffers.
   */
! static bool
! SyncOneBuffer(int buf_id, bool skip_pinned)
  {
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
  
        /*
         * Check whether buffer needs writing.
--- 1145,1207 ----
        {
                num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
                num_written = 0;
!               reusable_buffers = 0;
                while (num_to_scan-- > 0)
                {
!                       buffer_state=SyncOneBuffer(buf_id2, true);
!                       if (buffer_state>=BUF_WRITTEN)
                        {
+                               reusable_buffers++;
                                if (++num_written >= bgwriter_lru_maxpages)
                                {
                                        BgWriterStats.m_maxwritten_lru++;
                                        break;
                                }
                        }
+                       else if (buffer_state & BUF_REUSABLE) reusable_buffers++;
+ 
                        if (++buf_id2 >= NBuffers)
                                buf_id2 = 0;
+ 
+                       /* Exit when target for upcoming allocations reached */
+                       if (reusable_buffers>=upcoming_alloc_estimate) break;
                }
                BgWriterStats.m_buf_written_lru += num_written;
+ 
+               if (1 && num_written>0) /**** DEBUG Show what happened this pass */
+                       {
+                       elog(LOG,"scanned=%d written=%d client write=%d alloc_est=%d reusable=%d",
+                               (int) ((NBuffers * bgwriter_lru_percent + 99) / 100) - num_to_scan,
+                               num_written,num_client_writes,upcoming_alloc_estimate,reusable_buffers);
+                       }
+ 
        }
  }
  
  /*
   * SyncOneBuffer -- process a single buffer during syncing.
   *
!  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
   * buffers marked recently used, as these are not replacement candidates.
   *
!  * Returns an integer code describing both the state the buffer was
!  * in when examined and what was done with it.  The lower-order bits
!  * are set to the usage_count of the buffer, and the following
!  * bit masks are set accordingly:  BUF_WRITTEN, BUF_CLEAN, BUF_REUSABLE
!  *
!  * (This could be in error if FlushBuffers finds the buffer clean after 
!  * locking it, but we don't care all that much.)
!  * 
!  * The results are ordered such that the simple test for whether a buffer was 
!  * written is to check whether the return code is >=BUF_WRITTEN
   *
   * Note: caller must have done ResourceOwnerEnlargeBuffers.
   */
! static int
! SyncOneBuffer(int buf_id, bool skip_recently_used)
  {
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
+       int buffer_state;
  
        /*
         * Check whether buffer needs writing.
***************
*** 1145,1160 ****
         * upcoming changes and so we are not required to write such dirty buffer.
         */
        LockBufHdr(bufHdr);
        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
        {
                UnlockBufHdr(bufHdr);
!               return false;
        }
!       if (skip_pinned &&
!               (bufHdr->refcount != 0 || bufHdr->usage_count != 0))
        {
                UnlockBufHdr(bufHdr);
!               return false;
        }
  
        /*
--- 1213,1237 ----
         * upcoming changes and so we are not required to write such dirty buffer.
         */
        LockBufHdr(bufHdr);
+ 
+       /* Starting state says this buffer is dirty, not reusable, and unwritten */
+       buffer_state = bufHdr->usage_count;
+ 
        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
+               buffer_state|=BUF_CLEAN;
+ 
+       if (bufHdr->refcount == 0 && bufHdr->usage_count == 0) 
+               buffer_state|=BUF_REUSABLE;
+       else if (skip_recently_used)
        {
                UnlockBufHdr(bufHdr);
!               return buffer_state;
        }
! 
!       if (buffer_state & BUF_CLEAN)
        {
                UnlockBufHdr(bufHdr);
!               return buffer_state;
        }
  
        /*
***************
*** 1169,1175 ****
        LWLockRelease(bufHdr->content_lock);
        UnpinBuffer(bufHdr, true, false /* don't change freelist */ );
  
!       return true;
  }
  
  
--- 1246,1252 ----
        LWLockRelease(bufHdr->content_lock);
        UnpinBuffer(bufHdr, true, false /* don't change freelist */ );
  
!       return buffer_state | BUF_WRITTEN;
  }
  
  

Attachment: buf-alloc-stats.patch
Description: Binary data
