On 02/03/2012 11:41 PM, Jeff Janes wrote:
-The steady stream of backend writes that happen between checkpoints have
filled up most of the OS write cache.  A look at /proc/meminfo shows around
2.5GB "Dirty:"
"backend writes" includes bgwriter writes, right?


Has using a newer kernel with dirty_background_bytes been tried, so it
could be set to a lower level?  If so, how did it do?  Or does it just
refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM issues, too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache.
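
For anyone who wants to watch the same number while testing, here is a minimal Python sketch that pulls the "Dirty:" figure out of /proc/meminfo. The function name and the sample text are made up for illustration; the only thing assumed about the kernel is the usual "Name: value kB" layout of that file.

```python
def dirty_kb(meminfo_text):
    """Return the Dirty: value in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == "Dirty":
            # Lines look like "Dirty:           2621440 kB"
            return int(rest.split()[0])
    return None


# Hypothetical sample, sized to match the ~2.5GB figure quoted above.
SAMPLE = (
    "MemTotal:       67108864 kB\n"
    "Dirty:           2621440 kB\n"
    "Writeback:             0 kB\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # not on Linux; fall back to the sample text
    value = dirty_kb(text)
    if value is None:
        print("no Dirty: line found")
    else:
        print("Dirty: %.2f GB" % (value / 1024.0 / 1024.0))
```

Sampling this once a second during a checkpoint is enough to see whether the write cache is filling or draining.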

If there is 3GB of dirty data spread over >300 segments each segment
is about full-sized (1GB) then on average <1% of each segment is

If that analysis holds, then it seems like there is simply an awful lot
of data that has to be written randomly, no matter how clever the
re-ordering is.  In other words, it is not that a harried or panicked
kernel or RAID controller is failing to do good re-ordering when it has
opportunities to, it is just that you dirty your data too randomly for
substantial reordering to be possible even under ideal conditions.

Averages are deceptive here. This data follows the usual distribution for real-world data, which is that there is a hot chunk of data that receives far more writes than average (particularly index blocks), along with a long tail of segments that are only seeing one or two 8K blocks modified (catalog data, stats, application metadata).
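
The average in the quoted question can be worked out in a few lines of Python. This is only the arithmetic, using 300 segments as the floor for ">300" and the default 8K block size; it says nothing about how the dirty blocks are actually distributed.

```python
GB = 1024 ** 3
BLOCK = 8192                      # default PostgreSQL block size

dirty_bytes = 3 * GB              # figure from the quoted question
segments = 300                    # ">300 segments"; 300 used as a floor
segment_bytes = 1 * GB            # full-sized relation segment

avg_fraction = dirty_bytes / (segments * segment_bytes)
blocks_per_segment = segment_bytes // BLOCK
avg_dirty_blocks = dirty_bytes / BLOCK / segments

print("average dirty fraction per segment: %.2f%%" % (avg_fraction * 100))
print("blocks per segment: %d" % blocks_per_segment)
print("average dirty blocks per segment: %.0f" % avg_dirty_blocks)
```

An average on the order of a thousand dirty blocks per segment hides both ends of the distribution: the long-tail segments with one or two modified blocks sit far below it, and the hot index segments far above.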

Plenty of useful reordering happens here. It's happening in Linux's cache and in the controller's cache. The constant stream of checkpoint syncs doesn't stop that. It does seem to do two bad things though: a) makes some of these bad cache-filled situations more likely, and b) doesn't leave any I/O capacity unused for clients to get some work done. One of the real possibilities I've been considering more lately is that the value we've seen from the pauses during sync isn't so much about optimizing I/O; instead it comes from allowing a brief window of client backend I/O to slip in between the cache-filling checkpoint syncs.

Does the BBWC, once given an fsync command and reporting success,
write out those blocks forthwith, or does it lolly-gag around like the
kernel (under non-fsync) does?  If it is waiting around for
write-combining opportunities that will never actually materialize in
sufficient quantities to make up for the wait, how to get it to stop?

Was the sorted checkpoint with an fsync after every file (real file,
not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working. When a read or a write comes in, it starts moving immediately. When it gets behind, it starts making seek decisions more intelligently based on visibility of the whole queue. But they don't delay doing any work at all the way Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL relation level on server-size systems. There is a lot of sorting already happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server. My own tests on my smaller test server--with a scaled down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a useful technique on top of that. It's never bubbled up to being considered a likely win on the production one as a result.

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
Syncs 3 and 5 kind of surprise me.  It seems like the times should be
more bimodal.  If the cache is already full, why doesn't the system
promptly collapse, like it does later?  And if it is not, why would it
take 12 seconds, or even 2 seconds?  Or is this just evidence that the
gaps you are inserting are partially, but not completely, effective?

Given a mix of completely random I/O, a 24-disk array like this system has is lucky to hit 20MB/s clearing it out. It doesn't take too much of that before even 12 seconds makes sense. The 45 second pauses are the ones demonstrating the controller's cache is completely overwhelmed. It's rare to see caching turn truly bimodal, because the model for it has both a variable ingress and egress rate involved. Even as the checkpoint sync is pushing stuff in, at the same time writes are being evacuated at some speed out the other end.
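
To put a rough number on "doesn't take too much", the sketch below parses the sync lines from the log above and converts the slowest sync time into a data volume at 20MB/s. The regular expression and function names are illustrative, and the conversion assumes the whole sync time is spent draining random writes at that flat rate, which is a simplification.

```python
import re

LOG = """\
DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
"""

PATTERN = re.compile(r"Sync #(\d+) time=([\d.]+) gap=([\d.]+) msec")


def parse_syncs(text):
    """Return a list of (sync_number, time_ms, gap_ms) tuples."""
    return [(int(n), float(t), float(g)) for n, t, g in PATTERN.findall(text)]


def implied_mb(time_ms, mb_per_sec=20.0):
    """Data that could drain in time_ms at an assumed flat random-write rate."""
    return time_ms / 1000.0 * mb_per_sec


syncs = parse_syncs(LOG)
slowest = max(syncs, key=lambda s: s[1])
print("slowest: sync #%d, %.1f s, roughly %.0f MB at 20MB/s"
      % (slowest[0], slowest[1] / 1000.0, implied_mb(slowest[1])))
```

Under that assumption the 12.5 second sync corresponds to something on the order of 250MB of random writes.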

What I/O are they trying to do?  It seems like all your data is in RAM
(if not, I'm surprised you can get queries to run fast enough to
create this much dirty data).  So they probably aren't blocking on
reads which are being interfered with by all the attempted writes.

Reads on infrequently read data. Long tail again; even though caching is close to 100%, the occasional outlier client who wants some rarely accessed page with their personal data on it shows up. Pollute the write caches badly enough, and what happens to reads mixed in there gets very fuzzy. Depends on the exact mechanics of the I/O scheduler used in the kernel version deployed.

The current shared_buffer allocation method (or my misunderstanding of
it) reminds me of the joke about the guy who walks into his kitchen
with a cow-pie in his hand and tells his wife "Look what I almost
stepped in".  If you find a buffer that is usagecount=0 and unpinned,
but dirty, then why is it dirty?  It is likely to be dirty because the
background writer can't keep up.  And if the background writer can't
keep up, it is probably having trouble with writes blocking.  So, for
Pete's sake, don't try to write it out yourself!  If you can't find a
clean, reusable buffer in a reasonable number of attempts, I guess at
some point you need to punt and write one out.  But currently it grabs
the first unpinned usagecount=0 buffer it sees and writes it out if
dirty, without even checking if the next one might be clean.

Don't forget that in the version deployed here, the background writer isn't running during the sync phase. I think the direction you're talking about here circles back to "why doesn't the BGW just put things it finds clean onto the free list?", a direction which would make "nothing on the free list" a noteworthy event suggesting the BGW needs to run more often.
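
The allocation change suggested in the quoted text can be sketched as a toy model in Python. This is not PostgreSQL's buffer manager code: the Buffer tuple, pick_victim, and the max_clean_attempts limit are all hypothetical names, and the usage-count decrement a real clock sweep performs is left out to keep the example short.

```python
from collections import namedtuple

# Hypothetical stand-in for a shared buffer header.
Buffer = namedtuple("Buffer", "buf_id usage_count pinned dirty")


def pick_victim(buffers, start=0, max_clean_attempts=16):
    """Toy victim search: prefer a clean reusable buffer, punt after N tries.

    Returns (buf_id, needs_write), or (None, False) if nothing is reusable.
    """
    first_dirty = None
    attempts = 0
    n = len(buffers)
    for i in range(n):
        buf = buffers[(start + i) % n]
        if buf.pinned or buf.usage_count > 0:
            continue                       # not reusable at all
        if not buf.dirty:
            return buf.buf_id, False       # clean: reuse without any write
        if first_dirty is None:
            first_dirty = buf.buf_id       # remember it in case we must punt
        attempts += 1
        if attempts >= max_clean_attempts:
            break                          # looked long enough, give up
    if first_dirty is not None:
        return first_dirty, True           # punt: write this one out ourselves
    return None, False


pool = [
    Buffer(0, 0, False, True),    # reusable but dirty
    Buffer(1, 2, False, False),   # recently used
    Buffer(2, 0, True, False),    # pinned
    Buffer(3, 0, False, False),   # reusable and clean
]
print(pick_victim(pool))
print(pick_victim(pool, max_clean_attempts=1))
```

Setting max_clean_attempts=1 reproduces the behavior complained about in the quoted text, where the first unpinned usagecount=0 buffer is taken and written out if dirty. A larger limit lets the search skip past dirty candidates when a clean one is nearby.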

One option for pgbench I've contemplated was better latency reporting.
  I don't really want to have to mine very large log files (and just
writing them out can produce IO that competes with the IO you actually
care about, if you don't have a lot of controllers around to isolate

Every time I've measured this, I've found it to be <1% of the total I/O. The single line of data with latency counts, written buffered, is pretty slim compared with the >=8K any write transaction is likely to have touched. The only time I've found the disk writing overhead becoming serious on an absolute scale is when I'm running read-only in-memory benchmarks, where the rate might hit >100K TPS. But by definition, that sort of test has I/O bandwidth to spare, so there it doesn't actually impact results much. Just a fraction of a core doing some sequential writes.

Also, what limits the amount of work that needs to get done?  If you
make a change that decreases throughput but also decreases latency,
then something else has got to give.

The thing that is giving way here is total time taken to execute the checkpoint. There's even a theoretical gain possible from that. It's possible to prove (using the pg_stat_bgwriter counts) that having checkpoints less frequently decreases total I/O, because there are fewer writes of the most popular blocks happening. Right now, when I tune that to decrease total I/O the upper limit is when it starts spiking up latency. This new GUC is trying to allow a different way to increase checkpoint time that seems to do less of that.
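
One way to make that comparison is to take two snapshots of pg_stat_bgwriter and look at the difference. The Python sketch below works on plain dicts so it needs no database connection. The column names are the ones pg_stat_bgwriter exposed around the time of this thread; the function name and the sample numbers are invented for illustration and are not measurements from the server discussed here.

```python
def bgwriter_delta(before, after):
    """Summarize buffer writes between two pg_stat_bgwriter snapshots."""
    d = {k: after[k] - before[k] for k in before}
    checkpoints = d["checkpoints_timed"] + d["checkpoints_req"]
    total = d["buffers_checkpoint"] + d["buffers_clean"] + d["buffers_backend"]
    return {
        "checkpoints": checkpoints,
        "buffers_written": total,
        "buffers_per_checkpoint":
            d["buffers_checkpoint"] / checkpoints if checkpoints else None,
        "checkpoint_share":
            d["buffers_checkpoint"] / total if total else None,
    }


# Hypothetical snapshots taken at the start and end of a test run.
before = {"checkpoints_timed": 10, "checkpoints_req": 2,
          "buffers_checkpoint": 100000, "buffers_clean": 5000,
          "buffers_backend": 40000}
after = {"checkpoints_timed": 14, "checkpoints_req": 2,
         "buffers_checkpoint": 180000, "buffers_clean": 7000,
         "buffers_backend": 60000}

summary = bgwriter_delta(before, after)
for key in sorted(summary):
    print(key, summary[key])
```

Running the same workload for the same length of time at two checkpoint settings and comparing buffers_written is the kind of evidence meant here: if the longer interval writes fewer buffers in total, the popular blocks are being rewritten less often.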

What problems do you see with pgbench?  Can you not reproduce
something similar to the production latency problems, or can you
reproduce them, but things that fix the problem in pgbench don't
translate to production?  Or the other way around, things that work in
production didn't work in pgbench?

I can't simulate something similar enough to the production latency problem. Your comments about doing something like specifying 50 "-f" files or a weighting are in the right area; it might be possible to hack a better simulation with an approach like that. The thing that makes wandering that way even harder than it seems at first is how we split the pgbench work among multiple worker threads.

I'm not using connection pooling for the pgbench simulations I'm doing. There's some of that happening in the production application server.with it.

But I would think that pgbench can be configured to do that as well,
and would probably offer a wider array of other testers.  Of course, if
they have to copy and specify 30 different -f files, maybe getting
dbt-2 to install and run would be easier than that.  My attempts at
getting dbt-5 to work for me do not make me eager to jump from pgbench to
try more other things.

dbt-5 is a work in progress, known to be tricky to get going. dbt-2 is mature enough that it was used for this sort of role in 8.3 development. And it's even used by other database systems for similar testing. It's the closest thing to an open-source standard for write-heavy workloads that we'll find here.

What I'm doing right now is recording a large amount of pgbench data for my test system here, to validate it has the typical problems pgbench runs into. Once that's done I expect to switch to dbt-2 and see whether it's a more useful latency test environment. That plan is working out fine so far; it just hit a couple of weeks of unanticipated delay.

Do we have a theoretical guess on about how fast you should be able to
go, based on the RAID capacity and the speed and density at which you
dirty data?

This is a hard question to answer; it's something I've been thinking about and modeling a lot lately. The problem is that the speed an array writes at depends on how many reads or writes it does during each seek and/or rotation. The array here can do 1GB/s of all sequential I/O, and 15 - 20MB/s on all random I/O. The more efficiently writes are scheduled, the more like sequential I/O the workload becomes. Any attempt to even try to estimate real-world throughput needs the number of concurrent processes as another input, and the complexity of the resulting model is high.
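
As a very crude illustration of why the answer swings so widely, the Python sketch below blends the two rates quoted above by the fraction of bytes that end up written sequentially. This is not the model being described; it is a simple time-weighted mix that deliberately ignores concurrency and caching, and the function name and the 20MB/s default (the top of the quoted random range) are assumptions.

```python
def blended_mb_per_sec(seq_fraction, seq_rate=1024.0, rand_rate=20.0):
    """Throughput when seq_fraction of the bytes move at seq_rate
    and the remainder at rand_rate (both in MB/s)."""
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be between 0 and 1")
    # Time to move one MB is the weighted sum of the per-MB times.
    time_per_mb = seq_fraction / seq_rate + (1.0 - seq_fraction) / rand_rate
    return 1.0 / time_per_mb


for f in (0.0, 0.5, 0.9, 0.99, 1.0):
    print("%3.0f%% sequential: %7.1f MB/s" % (f * 100, blended_mb_per_sec(f)))
```

Even in this toy version the random portion dominates the total time, so small changes in how well writes get scheduled move the estimate a long way.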

Greg Smith   2ndQuadrant usg...@2ndquadrant.com    Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)