Re: [HACKERS] Checkpoint sync pause

Greg Smith Tue, 07 Feb 2012 13:22:54 -0800

On 02/03/2012 11:41 PM, Jeff Janes wrote:

-The steady stream of backend writes that happen between checkpoints have
filled up most of the OS write cache.  A look at /proc/meminfo shows around
2.5GB "Dirty:"

"backend writes" includes bgwriter writes, right?


Right.

Has using a newer kernel with dirty_background_bytes been tried, so it
could be set to a lower level?  If so, how did it do?  Or does it just
refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM issues, too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache.

If there is 3GB of dirty data spread over >300 segments, and each segment
is about full-sized (1GB), then on average <1% of each segment is
dirty?

If that analysis holds, then it seems like there is simply an awful lot
of data that has to be written randomly, no matter how clever the
re-ordering is.  In other words, it is not that a harried or panicked
kernel or RAID controller is failing to do good re-ordering when it has
opportunities to, it is just that you dirty your data too randomly for
substantial reordering to be possible even under ideal conditions.

Averages are deceptive here. This data follows the usual distribution for real-world data, which is that there is a hot chunk of data that receives far more writes than average (particularly index blocks), along with a long tail of segments that are only seeing one or two 8K blocks modified (catalog data, stats, application metadata).
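
For reference, the arithmetic behind the quoted average can be sketched as below. It uses only figures from this thread (about 3GB dirty, segments of 1GB, 8K blocks); 300 segments is used as a round stand-in for ">300", and the variable names are purely illustrative.

```python
# Average dirty data per segment, using the figures quoted in this thread.
GB = 1024 ** 3
BLOCK_SIZE = 8 * 1024          # PostgreSQL 8K block

dirty_bytes = 3 * GB           # ~3GB of dirty data in the OS write cache
segments = 300                 # assumption: round figure for ">300 segments"
segment_size = 1 * GB          # full-sized relation segment

avg_dirty_per_segment = dirty_bytes / segments
avg_dirty_fraction = avg_dirty_per_segment / segment_size
avg_dirty_blocks = avg_dirty_per_segment / BLOCK_SIZE

print(f"average dirty per segment: {avg_dirty_per_segment / 1024**2:.1f} MB")
print(f"average dirty fraction:    {avg_dirty_fraction:.2%}")
print(f"average dirty 8K blocks:   {avg_dirty_blocks:.0f}")
```

The long-tail segments with only one or two modified blocks sit far below that mean, and the hot segments far above it, which is why the average alone says little about how much reordering is possible.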

Plenty of useful reordering happens here. It's happening in Linux's cache and in the controller's cache. The constant stream of checkpoint syncs doesn't stop that. It does seem to do two bad things though: a) makes some of these bad cache-filled situations more likely, and b) doesn't leave any I/O capacity unused for clients to get some work done. One of the real possibilities I've been considering more lately is that the value we've seen from the pauses during sync isn't so much about optimizing I/O; instead it comes from allowing a brief window of client backend I/O to slip in there between the cache-filling checkpoint syncs.

Does the BBWC, once given an fsync command and reporting success,
write out those blocks forthwith, or does it lolly-gag around like the
kernel (under non-fsync) does?  If it is waiting around for
write-combining opportunities that will never actually materialize in
sufficient quantities to make up for the wait, how to get it to stop?

Was the sorted checkpoint with an fsync after every file (real file,
not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working. When a read or a write comes in, it starts moving immediately. When it gets behind, it starts making seek decisions more intelligently based on visibility of the whole queue. But they don't delay doing any work at all the way Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL relation level on server-size systems. There is a lot of sorting already happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server. My own tests on my smaller test server--with a scaled down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a useful technique on top of that. It's never bubbled up to being considered a likely win on the production one as a result.

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
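
A small sketch for summarizing sync lines like the ones above. The regular expression is written against the format shown in this excerpt only, and the 1 second threshold for a "slow" sync is an arbitrary assumption for illustration.

```python
import re

# The ten sync lines quoted above, verbatim.
LOG = """\
DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec
"""

SYNC_RE = re.compile(r"Sync #(\d+) time=([\d.]+) gap=([\d.]+) msec")

def parse_syncs(text):
    """Return a list of (sync_number, time_ms, gap_ms) tuples."""
    return [(int(n), float(t), float(g)) for n, t, g in SYNC_RE.findall(text)]

syncs = parse_syncs(LOG)
SLOW_MS = 1000.0  # assumption: call anything over 1 second "slow"
slow = [(n, t) for n, t, _ in syncs if t > SLOW_MS]
total_sync_ms = sum(t for _, t, _ in syncs)
total_gap_ms = sum(g for _, _, g in syncs)

print("slow syncs:", slow)
print(f"total sync time: {total_sync_ms / 1000:.1f} s")
print(f"total gap time:  {total_gap_ms / 1000:.1f} s")
```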

Syncs 3 and 5 kind of surprise me.  It seems like the times should be
more bimodal.  If the cache is already full, why doesn't the system
promptly collapse, like it does later?  And if it is not, why would it
take 12 seconds, or even 2 seconds?  Or is this just evidence that the
gaps you are inserting are partially, but not completely, effective?

Given a mix of completely random I/O, a 24 disk array like this system has is lucky to hit 20MB/s clearing it out. It doesn't take too much of that before even 12 seconds makes sense. The 45 second pauses are the ones demonstrating the controller's cache is completely overwhelmed. It's rare to see caching turn truly bimodal, because the model for it has both a variable ingress and egress rate involved. Even as the checkpoint sync is pushing stuff in, at the same time writes are being evacuated at some speed out the other end.
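
As a rough back-of-the-envelope sketch, this is how much random-write backlog a given sync time corresponds to at about 20MB/s. The fixed drain rate is a simplifying assumption; as noted above, both the ingress and egress rates really vary. The function name is hypothetical.

```python
# Backlog implied by a sync time if the array drains random writes at a fixed rate.
RANDOM_WRITE_MB_PER_S = 20.0   # "lucky to hit 20MB/s" on completely random I/O

def backlog_mb(sync_seconds, rate_mb_per_s=RANDOM_WRITE_MB_PER_S):
    """Data (in MB) drained during a sync of the given length at a fixed rate."""
    return sync_seconds * rate_mb_per_s

for label, seconds in [("Sync #5", 2.119), ("Sync #3", 12.574), ("45 second pause", 45.0)]:
    print(f"{label}: ~{backlog_mb(seconds):.0f} MB of random writes")
```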

What I/O are they trying to do?  It seems like all your data is in RAM
(if not, I'm surprised you can get queries to run fast enough to
create this much dirty data).  So they probably aren't blocking on
reads which are being interfered with by all the attempted writes.

Reads on infrequently read data. Long tail again; even though caching is close to 100%, the occasional outlier client who wants some rarely accessed page with their personal data on it shows up. Pollute the write caches badly enough, and what happens to reads mixed into there gets very fuzzy. Depends on the exact mechanics of the I/O scheduler used in the kernel version deployed.

The current shared_buffer allocation method (or my misunderstanding of
it) reminds me of the joke about the guy who walks into his kitchen
with a cow-pie in his hand and tells his wife "Look what I almost
stepped in".  If you find a buffer that is usagecount=0 and unpinned,
but dirty, then why is it dirty?  It is likely to be dirty because the
background writer can't keep up.  And if the background writer can't
keep up, it is probably having trouble with writes blocking.  So, for
Pete's sake, don't try to write it out yourself!  If you can't find a
clean, reusable buffer in a reasonable number of attempts, I guess at
some point you need to punt and write one out.  But currently it grabs
the first unpinned usagecount=0 buffer it sees and writes it out if
dirty, without even checking if the next one might be clean.

Don't forget that in the version deployed here, the background writer isn't running during the sync phase. I think the direction you're talking about here circles back to "why doesn't the BGW just put things it finds clean onto the free list?", a direction which would make "nothing on the free list" a noteworthy event suggesting the BGW needs to run more often.

One option for pgbench I've contemplated was better latency reporting.
I don't really want to have to mine very large log files (and just
writing them out can produce IO that competes with the IO you actually
care about, if you don't have a lot of controllers around to isolate
everything).

Every time I've measured this, I've found it to be <1% of the total I/O. The single line of data with latency counts, written buffered, is pretty slim compared with the >=8K any write transaction is likely to have touched. The only time I've found the disk writing overhead becoming serious on an absolute scale is when I'm running read-only in-memory benchmarks, where the rate might hit >100K TPS. But by definition, that sort of test has I/O bandwidth to spare, so there it doesn't actually impact results much. Just a fraction of a core doing some sequential writes.
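
A minimal sketch of why the log overhead stays small. The sample line is an assumed, illustrative per-transaction latency log line (six space-separated integers), not one taken from these tests, and the function name is hypothetical.

```python
# Compare one latency log line against the data a write transaction touches.
SAMPLE_LOG_LINE = "0 199 2241 0 1175850568 995598\n"  # assumption: illustrative log line
BLOCK_SIZE = 8 * 1024                                  # at least one 8K block per write

def log_overhead_fraction(log_line, blocks_touched=1):
    """Fraction of written bytes that is log output rather than table data."""
    log_bytes = len(log_line.encode("ascii"))
    data_bytes = blocks_touched * BLOCK_SIZE
    return log_bytes / (log_bytes + data_bytes)

print(f"log share of write volume: {log_overhead_fraction(SAMPLE_LOG_LINE):.2%}")
```

Transactions that touch more than one block, plus their WAL, push the log's share lower still.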

Also, what limits the amount of work that needs to get done?  If you
make a change that decreases throughput but also decreases latency,
then something else has got to give.

The thing that is giving way here is total time taken to execute the checkpoint. There's even a theoretical gain possible from that. It's possible to prove (using the pg_stat_bgwriter counts) that having checkpoints less frequently decreases total I/O, because there are fewer writes of the most popular blocks happening. Right now, when I tune that to decrease total I/O the upper limit is when it starts spiking up latency. This new GUC is trying to allow a different way to increase checkpoint time that seems to do less of that.
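
One way to make that comparison concrete is to diff two snapshots of pg_stat_bgwriter taken around a test run, then compare runs at different checkpoint spacings. This is only a sketch: the column names are the view's counters, but the helper function and every number in the two snapshots are hypothetical.

```python
def write_summary(before, after, elapsed_seconds, block_size=8192):
    """Summarize buffer writes between two pg_stat_bgwriter snapshots."""
    delta = {key: after[key] - before[key] for key in before}
    checkpoints = delta["checkpoints_timed"] + delta["checkpoints_req"]
    buffers = (delta["buffers_checkpoint"]
               + delta["buffers_clean"]
               + delta["buffers_backend"])
    return {
        "checkpoints": checkpoints,
        "buffers_written": buffers,
        "mb_per_second": buffers * block_size / (1024 * 1024) / elapsed_seconds,
    }

# Hypothetical snapshots, 600 seconds apart; not measurements from this server.
before = {"checkpoints_timed": 10, "checkpoints_req": 2,
          "buffers_checkpoint": 100000, "buffers_clean": 20000,
          "buffers_backend": 50000}
after = {"checkpoints_timed": 12, "checkpoints_req": 2,
         "buffers_checkpoint": 164000, "buffers_clean": 20000,
         "buffers_backend": 114000}

print(write_summary(before, after, elapsed_seconds=600))
```

For the same workload and run length, a lower buffers_written total with fewer checkpoints would be the evidence of popular blocks being written less often.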

What problems do you see with pgbench?  Can you not reproduce
something similar to the production latency problems, or can you
reproduce them, but things that fix the problem in pgbench don't
translate to production?  Or the other way around, things that work in
production didn't work in pgbench?

I can't simulate something similar enough to the production latency problem. Your comments about doing something like specifying 50 "-f" files or a weighting are in the right area; it might be possible to hack a better simulation with an approach like that. The thing that makes wandering that way even harder than it seems at first is how we split the pgbench work among multiple worker threads.

I'm not using connection pooling for the pgbench simulations I'm doing. There's some of that happening in the production application server.

But I would think that pgbench can be configured to do that as well,
and would probably offer a wider array of other testers.  Of course, if
they have to copy and specify 30 different -f files, maybe getting
dbt-2 to install and run would be easier than that.  My attempts at
getting dbt-5 to work for me do not make me eager to jump from pgbench to
try more other things.

dbt-5 is a work in progress, known to be tricky to get going. dbt-2 is mature enough that it was used for this sort of role in 8.3 development. And it's even used by other database systems for similar testing. It's the closest thing to an open-source standard for write-heavy workloads as we'll find here.

What I'm doing right now is recording a large amount of pgbench data for my test system here, to validate it has the typical problems pgbench runs into. Once that's done I expect to switch to dbt-2 and see whether it's a more useful latency test environment. That plan is working out fine so far, it just hit a couple of weeks of unanticipated delay.

Do we have a theoretical guess on about how fast you should be able to
go, based on the RAID capacity and the speed and density at which you
dirty data?

This is a hard question to answer; it's something I've been thinking about and modeling a lot lately. The problem is that the speed an array writes at depends on how many reads or writes it does during each seek and/or rotation. The array here can do 1GB/s of all sequential I/O, and 15 - 20MB/s on all random I/O. The more efficiently writes are scheduled, the more like sequential I/O the workload becomes. Any attempt to even try to estimate real-world throughput needs the number of concurrent processes as another input, and the complexity of the resulting model is high.
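
To give a feel for the range involved, here is a deliberately oversimplified sketch: it treats the workload as a fixed mix of purely sequential and purely random bytes and ignores concurrency entirely, which is exactly the input said above to be needed for a real estimate. The function name and the blending formula are assumptions for illustration, not the model being worked on.

```python
def blended_throughput_mb_s(sequential_fraction, seq_mb_s=1024.0, rand_mb_s=20.0):
    """Throughput when a fraction of bytes moves at the sequential rate
    and the rest at the random rate (time-weighted, i.e. harmonic, blend)."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    seconds_per_mb = (sequential_fraction / seq_mb_s
                      + (1.0 - sequential_fraction) / rand_mb_s)
    return 1.0 / seconds_per_mb

for fraction in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(f"{fraction:4.0%} sequential -> {blended_throughput_mb_s(fraction):7.1f} MB/s")
```

Even in this toy version the random portion dominates: the blend stays close to the random rate until nearly all of the I/O has become sequential.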


--
Greg Smith   2ndQuadrant [email protected]    Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
