[HACKERS] Just-in-time Background Writer Patch+Test Results

Greg Smith Wed, 05 Sep 2007 20:36:34 -0700

Tom gets credit for naming the attached patch, which is my latest attempt to finalize what has been called the "Automatic adjustment of bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but that's where it started.


Background on testing
---------------------

I decided to use pgbench for running my tests. The scripting framework to collect all that data and usefully summarize it is now available as pgbench-tools-0.2 at http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but didn't want to hold the rest of this up on that work. That page includes basic information about what my testing environment was and why I felt this was an appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at: http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch (This is my second attempt to send this message, don't know why the earlier one failed; using gzip'd patches for this one and hopefully there won't be a dupe)


Baseline test results
---------------------

The first of the attached patches to apply is the latest buf-alloc-2, which adds counters to pg_stat_bgwriter for everything the background writer is doing. Here's what we get out of the standard 8.3 background writer before and after applying that patch, at various settings:


                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 HEAD nobgwriter                    |   5 |  994 |
 HEAD+buf-alloc-2 nobgwriter        |   6 | 1012 |           0
 HEAD+buf-alloc-2 LRU=0.5%/500      |  16 |  974 |       15.94
 HEAD+buf-alloc-2 LRU=5%/500        |  19 |  983 |       98.47
 HEAD+buf-alloc-2 LRU=10%/500       |   7 |  997 |       99.95

cleaner_pct is the percentage of writes done by the BGW LRU cleaner, relative to a total that also includes the client backend writes. Writes done by checkpoints are not included in this summary computation; it just shows the balance of backend vs. BGW writes.
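As a rough illustration of that computation, here is a minimal sketch in C. The parameter names follow the buffers_clean and buffers_backend counters as an assumption about which counts feed the summary; the function itself is hypothetical and not part of either patch.

```c
#include <stdint.h>

/*
 * Percentage of non-checkpoint buffer writes done by the LRU cleaner.
 * buffers_clean:   writes done by the background writer's LRU cleaner
 * buffers_backend: writes the client backends had to do themselves
 * Checkpoint writes are deliberately left out of the total.
 * Returns 0 when neither side has written anything.
 */
double
cleaner_pct(int64_t buffers_clean, int64_t buffers_backend)
{
    int64_t total = buffers_clean + buffers_backend;

    if (total <= 0)
        return 0.0;
    return 100.0 * (double) buffers_clean / (double) total;
}
```

A value near 100 means the backends almost never had to write a dirty buffer on their own.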

The /500 means bgwriter_lru_maxpages=500, which I already knew was about as many pages as this server ever dirties in a 200ms cycle. Without the buf-alloc-2 patch I don't get statistics on the LRU cleaner; I include that number as a baseline just to suggest that the buf-alloc-2 patch itself isn't pulling down results.

Here we see that in order to get most of the writes to happen via the LRU cleaner rather than having the backends handle them, you'd need to play with the settings until the bgwriter_lru_percent was somewhere between 5% and 10%, and it seems obvious that doing this doesn't improve the TPS results. The margin of error here is big enough that I consider all these basically the same performance. The question then is how to get this high level of writes by the background writer automatically, without having to know what percentage to scan; I wanted to remove bgwriter_lru_percent, while still keeping bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.


First JIT Implementation
------------------------

The method I described in my last message on this topic ( http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented a weighted moving average of how many pages were allocated, and based on feedback from that I improved the code to allow a multiplier factor on top of that. Here's the summary of those results:


                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit cleaner multiplier=1.0/500     |   9 |  981 |        94.3
 jit cleaner multiplier=2.0/500     |   8 | 1005 |       99.78
 jit multiplier=1.0/100             |  10 |  985 |       68.14

That's pretty good. As long as maxpages is set intelligently, it gets most of the writes even with the multiplier of 1.0, and cranking it up to the 2.0 suggested by the original Itagaki Takahiro patch gets nearly all of them. Again, none of this really changes throughput.
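To make the moving-average-plus-multiplier idea concrete, here is a minimal sketch in C. It assumes an exponentially weighted form of the average, and the struct, field, and function names are hypothetical; the patch's own bookkeeping may well differ.

```c
/* Hypothetical estimator state; names are illustrative only. */
typedef struct AllocEstimator
{
    double smoothed_alloc;     /* running average of allocations per cycle */
    int    smoothing_samples;  /* roughly how many cycles the average spans */
    double multiplier;         /* safety factor applied on top of the average */
} AllocEstimator;

/*
 * Fold the allocation count from the cycle that just ended into the
 * running average, then return how many clean buffers to aim for
 * before the next cycle (average times multiplier, truncated).
 */
int
estimate_upcoming_allocs(AllocEstimator *est, int recent_alloc)
{
    est->smoothed_alloc +=
        ((double) recent_alloc - est->smoothed_alloc) / est->smoothing_samples;

    return (int) (est->smoothed_alloc * est->multiplier);
}
```

With a multiplier of 1.0 the cleaner only tries to keep pace with the average; a larger multiplier trades some possibly wasted writes for more headroom when allocations spike.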


Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner submitted in response to my last call for commentary left me concerned. While the JIT approach works fine as long as your system is active, it does absolutely nothing if the system is idle. I noticed that a lot of the writes that were being done by the client backends were after idle periods where the JIT writer just didn't react fast enough during the ramp-up. For example, if the system went from idle for a while to full-speed just as the 200ms sleep started, by the time the BGW woke up again the backends could have needed to write many buffers already themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that there are fewer of them hanging around when a checkpoint shows up or so that reusable pages are already available. The question then is how fast to go about that trickle. Heikki's background writer tests and my own suggest that if you make the rate during quiet periods too high, you'll clog the underlying buffers with some writes that end up being duplicated and lower overall efficiency. But all of those tests had the background writer going at a constant and relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest idea of never looking at the same buffer twice, but to do that slowly when idle and using the JIT rate otherwise. This is sort of a hybrid of the old LRU cleaner behavior (scan a fixed %) at a low speed with the new approach (scan based on allocations, however many of them there are). I started with the old default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the patch at this point) with logic to tack that onto the JIT intelligently and got these results:


                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 min scan=0.5%   |  13 |  882 |         100
 jit multiplier=1.5 min scan=0.5%   |  12 |  871 |         100
 jit multiplier=2.0 min scan=0.5%   |  11 |  910 |         100
 jit multiplier=1.0 min scan=0.25%  |  14 |  982 |       98.34

It's nice to see fully 100% of the buffers written by the cleaner with the hybrid approach; I feel that validates my idea that just a bit more work needs to be done during idle periods to completely fix the issue with it not reacting fast enough during the idle/full speed transition. But look at the drop in TPS. While I'm willing to say a couple of percent change isn't significant in a pgbench result, those <900 results are clearly bad. This is crossing that line where inefficient writes are being done. I'm happier with the result using the smaller min scan=0.25% even though it doesn't quite get every write that way.


Making percentage independent of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan percentage needs to drop too, and my goal was to reduce the number of tunables people need to tinker with. Assuming you're not stopped by the maxpages parameter, with the default delay=200ms a scan that hits 0.5% each time will scan 5*0.5%=2.5% of the buffer cache per second, which means it will take 40 seconds to scan the entire pool. Using 0.25% means 80 seconds between scans. I improved the overall algorithm a bit and decided to set this parameter an alternate way: by how long it should take to creep its way through the entire buffer cache if the JIT code is idle. I decided I liked 120 seconds as the value for that parameter, which is a slower rate than any of the above but still a reasonable one for a typical application. Here's what the results look like using that approach:


                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 scan_whole=120s |  18 |  970 |       99.99
 jit multiplier=1.5 scan_whole=120s |  15 |  995 |       99.93
 jit multiplier=2.0 scan_whole=120s |  17 |  981 |       99.98

Now here are results I'm happy with. The TPS results are almost unchanged from where we started, with minimal inefficient writes, but almost all the writes are being done by the cleaner process. The results appear much less sensitive to what you set the multiplier to. And unless you use an unreasonably low value for maxpages (which will quickly become obvious if you monitor pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a complete scan of the buffer cache within 2 minutes even if there's no system activity. But once that's done, until more buffers are allocated the code won't even look at the buffer cache again (as opposed to the current code, which is always looking at buffers and acquiring locks even if nothing is going on).
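The relationship between a per-cycle scan percentage, the delay, and the time for one full pass can be sketched in C as follows. Both helpers are hypothetical illustrations of the arithmetic, assuming the maxpages limit is never hit; they are not taken from the patch.

```c
/*
 * Seconds needed to cover the whole buffer pool when every cycle
 * scans scan_percent of it and cycles are delay_ms apart.
 */
double
seconds_per_full_scan(double scan_percent, int delay_ms)
{
    double cycles_per_second = 1000.0 / delay_ms;

    return 100.0 / (scan_percent * cycles_per_second);
}

/*
 * Going the other way: the number of buffers to look at per cycle so
 * that an otherwise idle pool of nbuffers is covered in about
 * scan_whole_pool_seconds, whatever the delay is set to.
 */
int
min_scan_buffers(int nbuffers, double scan_whole_pool_seconds, int delay_ms)
{
    double cycles_per_scan = scan_whole_pool_seconds * 1000.0 / delay_ms;

    return (int) (nbuffers / cycles_per_scan);
}
```

Expressing the floor as a time for the whole pool means that lowering the delay shrinks the per-cycle scan automatically, so the idle trickle rate stays the same.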

I think I can safely say there is a level of intelligence going into what the LRU background writer does with this patch that has never been applied to this problem before. There have been a lot of good ideas thrown out in this area, but it took a hybrid approach that included and carefully balanced all of them to actually get results that I felt were usable. What I don't know is whether that will also be true for other testers.


Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want to look at the main code involved without having to apply the patch you can browse the BgBufferSync function in bufmgr.c starting around line 1120 at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

Lots of debugging information about internals is dumped into the logs if you toggle on #define BGW_DEBUG. A gross summary of the two most important things that show what the code is doing is logged at DEBUG1 (but should probably be pushed lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I could do some small rewriting and certainly can document all this further as part of getting this patch moved toward committed, but I'm out of resources to do too much more here. Along with the big question of whether this whole idea is worth following at all as part of 8.3, here are the remaining small questions I feel review feedback would be valuable on related to my specific code:

-The way I'm getting the passes number back from the freelist.c strategy code seems like it will eventually overflow the long I'm using for the intermediate results when I execute statements like this:


strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for strategy_position instead, or if I should just rewrite the code to separate out the passes multiplication--which will make it less elegant to read but should make overflow issues go away.
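Both options can be sketched in C. This is only an illustration of the two alternatives under assumed types: the function names, the cleaner_* parameters, and the choice of an unsigned 32-bit pass counter are hypothetical, not what the patch currently does.

```c
#include <stdint.h>

/*
 * Option 1: compute the absolute position in 64 bits, so the
 * multiplication cannot overflow for any 32-bit pass count.
 */
int64_t
strategy_position64(uint32_t strategy_passes, int strategy_buf_id, int nbuffers)
{
    return (int64_t) strategy_passes * nbuffers + strategy_buf_id;
}

/*
 * Option 2: never form an absolute position. Take the difference in
 * pass counts first (unsigned subtraction wraps cleanly, and the cast
 * to int32_t behaves as expected on common two's-complement
 * compilers), then scale only that small delta.
 */
int64_t
buffers_ahead(uint32_t strategy_passes, int strategy_buf_id,
              uint32_t cleaner_passes, int cleaner_buf_id, int nbuffers)
{
    int32_t passes_delta = (int32_t) (strategy_passes - cleaner_passes);

    return (int64_t) passes_delta * nbuffers
        + (strategy_buf_id - cleaner_buf_id);
}
```

The second form keeps working even after the pass counter itself wraps around, since only the distance between the two positions matters.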

-Heikki didn't like the way I pass information from SyncOneBuffer back to the background writer. The bitmask approach I'm using adds flexibility for writing more intelligent background writers in the future. I have written more complicated ones than any of the approaches mentioned here in the past, using things like the usage_count information returned, but the simpler implementation here ignores that. I could simplify this interface if I had to, but I like what I've done as a solid structure for future coding as it's written right now.
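For readers unfamiliar with the pattern, a bitmask return value of this general kind might look like the C sketch below. The flag names, the CleanerTotals struct, and the simplified sync_result stand-in are all hypothetical; the actual SyncOneBuffer interface in the patch may differ.

```c
#include <stdbool.h>

/* Hypothetical result flags, combined with bitwise OR. */
#define SYNC_BUF_WRITTEN   0x01   /* buffer was dirty and got written out */
#define SYNC_BUF_REUSABLE  0x02   /* buffer is a candidate for reuse */

typedef struct CleanerTotals
{
    int written;    /* buffers the cleaner wrote this cycle */
    int reusable;   /* reusable buffers found this cycle */
} CleanerTotals;

/*
 * Simplified stand-in for the per-buffer sync step: an unpinned buffer
 * with a zero usage count is reusable, and is written if it is dirty.
 */
int
sync_result(bool dirty, bool pinned, int usage_count)
{
    int result = 0;

    if (!pinned && usage_count == 0)
    {
        result |= SYNC_BUF_REUSABLE;
        if (dirty)
            result |= SYNC_BUF_WRITTEN;
    }
    return result;
}

/* The caller tests each bit independently and keeps its own totals. */
void
tally_result(CleanerTotals *totals, int result)
{
    if (result & SYNC_BUF_WRITTEN)
        totals->written++;
    if (result & SYNC_BUF_REUSABLE)
        totals->reusable++;
}
```

New facts about a buffer can later be reported by adding another flag bit, without changing the function signature or existing callers.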


-There are two magic constants in the code:

    int         smoothing_samples = 16;
    float       scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are reasonable numbers for most installations, and high-throughput systems are going to care more about tuning the multiplier GUC than either of these. In the interest of having fewer knobs people can fool with and break, I personally don't feel like these constants need to be exposed for tuning purposes; they don't have a significant impact on how the underlying model works. Determining whether these should be exposed as GUC tunables is certainly an open question though.

-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end systems should get an automatically tuning LRU background writer out of the box in 8.3. This is a big change from the 5 that was used in the older releases. If you keep everything at the defaults this represents a maximum theoretical write rate for the BGW of 4MB/s, which isn't very much relative to modern hardware.
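The 4MB/s figure follows from simple arithmetic, sketched here in C. The helper is hypothetical and assumes the default 8192-byte block size and the default 200ms delay.

```c
/*
 * Upper bound on LRU cleaner write volume in bytes per second: at most
 * lru_maxpages blocks of block_size bytes per cycle, with one cycle
 * every delay_ms milliseconds.
 */
double
max_bgw_write_rate(int lru_maxpages, int block_size, int delay_ms)
{
    return (double) lru_maxpages * block_size * 1000.0 / delay_ms;
}
```

With lru_maxpages=100, block_size=8192, and delay_ms=200 this works out to 4,096,000 bytes per second, or roughly 4MB/s.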


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

jit-cleaner.patch.gz
Description: Binary data

buf-alloc-2.patch.gz
Description: Binary data
