On 01/16/2012 11:00 AM, Robert Haas wrote:
Also, I am still struggling with what the right benchmarking methodology even is to judge whether
any patch in this area "works".  Can you provide more details about
your test setup?

The "test" setup is a production server with a few hundred users at peak workload, reading and writing to the database. Each RAID controller (couple of them with their own tablespaces) has either 512MG or 1GB of battery-backed write cache. The setup that leads to the bad situation happens like this:

-The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB "Dirty:"

-Since we have shared_buffers set to 512MB to try and keep checkpoint storms from being too bad, there might be 300MB of dirty pages involved in the checkpoint. The write phase dumps this all into Linux's cache. There's now closer to 3GB of dirty data there. @64GB of RAM, this is still only 4.7% though--just below the effective lower range for dirty_background_ratio. Linux is perfectly content to let it all sit there.

-Sync phase begins. Between absorption and the new checkpoint writes, there are >300 segments to sync waiting here.

-The first few syncs force data out of Linux's cache and into the BBWC. Some of these return almost instantly. Others block for a moderate number of seconds. That's not necessarily a showstopper, on XFS at least. So long as the checkpointer is not being given all of the I/O in the system, the fact that it's stuck waiting for a sync doesn't mean the server is unresponsive to the needs of other backends. Early data might look like this:

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

[Here 'gap' is a precise measurement of how close the sync pause feature is working, with it set to 3 seconds. This is from an earlier version of this patch. All the timing issues I used to measure went away in the current implementation because it doesn't have to worry about doing background writer LRU work anymore, with the checkpointer split out]

But after a few hundred of these, every downstream cache is filled up. The result is seeing more really ugly sync times, like #164 here:

DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec

At the same time #164 is happening, that 45 second long window, a pile of clients will get stuck where they can't do any I/O. The RAID controller that used to have a useful mix of data is now completely filled with >=512MB of random writes. It's now failing to write as fast as new data is coming in. Eventually that leads to pressure building up in Linux's cache. Now you're in the bad place: dirty_background_ratio is crossed, Linux is now worried about spooling all cached writes to disk as fast as it can, the checkpointer is sync'ing its own important data to disk as fast as it can too, and all caches are inefficient because they're full.

To recreate a scenario like this, I've realized the benchmark needs to have a couple of characteristics:

-It has to focus on transaction latency instead of throughput. We know that doing syncs more often will lower throughput due to reduced reordering etc.

-It cannot run at maximum possible speed all the time. It needs to be the case that the system keeps up with the load during the rest of the time, but the sync phase of checkpoints causes I/O to queue faster than it's draining, thus saturating all caches and then blocking backends. Ideally, "Dirty:" in /proc/meminfo will reach >90% of the dirty_background_ratio trigger line around the same time the sync phase starts.

-There should be a lot of clients doing a mix of work. The way Linux I/O works, the scheduling for readers vs. writers is complicated, and this is one of the few areas where things like CFQ vs. Deadline matter.

I've realized now one reason I never got anywhere with this while running pgbench tests is that pgbench always runs at 100% of capacity. It fills all the caches involved completely as fast as it can, and every checkpoint starts with them already filled to capacity. So when latency gets bad at checkpoint time, no amount of clever reordering will help keep those writes from interfering with other processes. There just isn't any room to work with left.

What I think is needed instead is a write-heavy benchmark with a think time in it, so that we can dial the workload up to, say, 90% of I/O capacity, but that spikes to 100% when checkpoint sync happens. Then rearrangements in syncing that reduces caching pressure should be visible as a latency reduction in client response times. My guess is that dbt-2 can be configured to provide such a workload, and I don't see a way forward here except for me to fully embrace that and start over with it.

Just one random thought: I wonder if it would make sense to cap the
delay after each sync to the time spending performing that sync.  That
would make the tuning of the delay less sensitive to the total number
of files, because we won't unnecessarily wait after each sync when
they're not actually taking any time to complete.

This is one of the attractive ideas in this area that didn't work out so well when tested. The problem is that writes into a battery-backed write cache will show zero latency for some time until the cache is filled...and then you're done. You have to pause anyway, even though it seems write speed is massive, to give the cache some time to drain to disk between syncs that push data toward it. Even though it absorbed your previous write with no delay, that doesn't mean it takes no time to process that write. With proper write caching, that processing is just happening asynchronously.

This is related to another observation, noting what went wrong when we tried deploying my fully auto-tuning sync spread patch onto production. If the sync phase of the checkpoint starts to fall behind, and you've configured for a sync pause, you have to just suck that up and accept you'll finish late[1]. When you do get into the situation where the cache is completely filled, writes will slow dramatically. In the above log example, sync #164 taking 45 seconds means that #165 will surely be considered behind schedule now. If you use that feedback to then reduce the sync pause, feeling that you are behind schedule and cannot afford to pause anymore, now you've degenerated right back to the original troubled behavior: sync calls, as fast as they can be accepted by the OS, no delay between them.

[1] Where I think I'm going to end up with this eventually now is that setting checkpoint_sync_pause is the important tunable. The parameter that then gets auto-tuned is checkpoint_timeout. If you have 300 relations to sync and you have to wait 10 seconds between syncs to get latency down, the server is going to inform you an hour between checkpoints is all you can do here.

Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to