On Fri, 9 Mar 2007, Tom Lane wrote:

It strikes me that the patch would be more useful if it produced a histogram of the observed usage_counts

Don't have something worth releasing yet, but I did code a first rev of this today. The results are quite instructive and it's well worth looking at.

The main server I work on has shared_buffers=60000 for pgbench testing and the background writers turned way up (in hopes of reducing checkpoint times; that's a whole other topic). I run pgbench with s=100 (~1.6GB database). Here's what I get as statistics on the buffer pool after a scan when the server is "happy", from a run with 20 clients:

writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram:
0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Basically, everything that's left dirty is also heavily used; as I noted before, when I inspect with pg_buffercache these are mostly index blocks. I note that I should probably generate a second histogram that compares the clean data. The all scan is writing pages out as soon as they're dirtied, so the LRU background writer is lucky if it can find anything to do. (Note that I don't rely on the LRU writer more because that causes a significant lull in writes after a checkpoint. By the time it gets going again it's harder to catch up, and delaying PostgreSQL writes also aggravates issues with the way Linux caches writes.)
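For anyone wanting to reproduce this kind of report, the histogram is easy to build from a snapshot of per-buffer (isdirty, usage_count) pairs such as pg_buffercache provides. A minimal sketch in Python (the function name and input shape are my own illustration, not the actual patch):

```python
# Illustrative sketch: build a dirty-buffer usage_count histogram from
# pg_buffercache-style rows of (isdirty, usage_count). Usage counts of
# 5 and above are collapsed into a single "5+" bucket, matching the
# report format shown above.
from collections import Counter

def dirty_usage_histogram(buffers):
    """buffers: iterable of (isdirty, usage_count) tuples.
    Returns {bucket_label: percent_of_dirty_buffers}."""
    counts = Counter()
    for isdirty, usage_count in buffers:
        if isdirty:
            counts[min(usage_count, 5)] += 1
    total = sum(counts.values())
    labels = {0: "0", 1: "1", 2: "2", 3: "3", 4: "4", 5: "5+"}
    return {labels[k]: 100.0 * v / total for k, v in counts.items()}

# Example snapshot: dirty buffers skewed toward high usage counts,
# as in the "happy" case above.
sample = [(True, 5), (True, 4), (True, 5), (False, 0), (True, 2)]
```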

What I found really interesting was comparing a "sad" section where performance breaks down. This is from a minute later:

writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram:
0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage counts. The breakdown seems to involve evicting the index blocks to make room for the recently dirty data when it can't be written out fast enough. As the index blocks go poof, things slow further, and into the vicious circle you go. Eventually you can end up blocked on a combination of buffer evictions and disk seeks for uncached data that are fighting with the writes.

The bgwriter change this suggested to me is defining a triage behavior where the all scan switches to acting like the LRU one:

- Each sample period, note what % of the usage_count=0 records are dirty.
- If that number is above a tunable threshold, switch to only writing usage_count=0 records until it isn't anymore.
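The two steps above amount to a small state machine per sample period. A hedged sketch in Python (the threshold value, function name, and input shape are all my own placeholders, not the actual bgwriter code):

```python
# Illustrative sketch of the proposed triage rule: each sample period,
# measure what fraction of usage_count=0 buffers are dirty; above a
# tunable threshold, restrict the all-scan to usage_count=0 buffers
# (i.e. behave like the LRU writer) until the fraction drops back.
TRIAGE_THRESHOLD = 0.05  # first-guess tunable; 2-5% per the text

def pick_writes(buffers):
    """buffers: list of (isdirty, usage_count) tuples.
    Returns (buffers_to_write, triage_mode)."""
    zero = [b for b in buffers if b[1] == 0]
    dirty_zero_frac = (sum(1 for d, _ in zero if d) / len(zero)) if zero else 0.0
    triage_mode = dirty_zero_frac > TRIAGE_THRESHOLD
    if triage_mode:
        # LRU-like triage: only write dirty usage_count=0 buffers
        writes = [b for b in zero if b[0]]
    else:
        # normal all-scan: write every dirty buffer
        writes = [b for b in buffers if b[0]]
    return writes, triage_mode
```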

On my system a first guess for that tunable would be 2-5%, based on the values I see on either side of the sad periods. No doubt some systems would set that much higher; I haven't tested my system at home yet to get a guideline for a more typical PC.

As for why this behavior matters so much to me, I actually have a prototype auto-tuning background writer design that was hung up on this particular problem. It notes how often you write out max_pages, uses that to model the average percentage of the buffer cache you're traversing each pass, then runs a couple of weighted-average feedback loops to aim for a target seconds/sweep.

The problem was that it went berserk when the whole buffer cache was dirty (I hope someone appreciates that I've been calling this "Felix Unger on crack mode" in my notes). I needed some way to prioritize which buffers to concentrate on when that happens, and so far the above has been a good first-cut way to help with that.
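The auto-tuning idea described above reduces to two small pieces: a weighted (exponential) moving average to smooth noisy per-pass observations, and a conversion from a target seconds/sweep into pages scanned per round. A sketch under those assumptions (all names and constants here are illustrative, not from the prototype):

```python
# Illustrative sketch of the auto-tuning feedback loop: smooth the
# observed fraction of the buffer cache traversed per pass with an
# exponentially weighted moving average, and derive how many pages the
# writer must scan each round to finish a full sweep in the target time.
ALPHA = 0.3  # smoothing weight; higher reacts faster, lower is steadier

def ewma(prev, observed, alpha=ALPHA):
    """Weighted-average feedback: blend the newest observation into
    the running estimate."""
    return alpha * observed + (1 - alpha) * prev

def pages_per_round_for(shared_buffers, rounds_per_sec, target_sweep_secs):
    """Pages the writer must scan each round so that a full sweep of
    shared_buffers buffers takes roughly target_sweep_secs."""
    return shared_buffers / (rounds_per_sec * target_sweep_secs)
```

With shared_buffers=60000, ten writer rounds per second, and a 120-second target sweep, this works out to 50 pages per round; the EWMA then keeps that target from whipsawing on a single noisy pass.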

* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
