On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.
That's exactly right. When a checkpoint finishes, the OS write cache is
clean. That means that in many cases the full-page writes aren't even
hitting disk; they just pile up in OS dirty memory, often sitting there
until the next checkpoint's fsyncs start. That's
why I never wandered down the road of changing FPW behavior. I have
never seen a benchmark workload hit a write bottleneck until long after
the big burst of FPW pages is over.
I could easily believe that there are low-memory systems where the FPW
write pressure becomes a problem earlier, and slim VMs are a plausible
place for this behavior to show up.
I'm a big fan of instrumenting the code around a performance change
before touching anything, as a companion patch that might make sense to
commit on its own. In the case of a change to FPW spacing, I'd want to
see some diagnostic output in something like pg_stat_bgwriter that
tracks how many FPW pages are being written. A
pgstat_bgwriter.full_page_writes counter would be perfect here; that
data could then be graphed over time as the benchmark runs.
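As a rough sketch of the kind of counter being suggested here: the struct and field names below are hypothetical (pg_stat_bgwriter has no full_page_writes column today), and the call site is a toy stand-in for wherever XLogInsert decides to attach a backup block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stats block; this is only a sketch of the proposal,
 * not an actual PostgreSQL structure. */
typedef struct BgWriterStats
{
    long buf_written_checkpoints;
    long full_page_writes;      /* proposed: count of FPW images emitted */
} BgWriterStats;

static BgWriterStats pgstat_bgwriter;

/* Toy stand-in for the WAL-insert path: bump the counter whenever a
 * buffer modification ends up emitting a full-page image. */
static void
record_buffer_modification(bool emitted_fpw)
{
    if (emitted_fpw)
        pgstat_bgwriter.full_page_writes++;
}
```

Sampling that counter once a second during a benchmark run would show exactly when the post-checkpoint FPW burst happens and how long it lasts.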
Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smooth out WAL write I/O, but it would be interesting to try.
There, too, I think the right way to proceed is to instrument that area
first.
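For clarity, the relaxation Heikki describes can be sketched as a pair of predicates. This is illustrative C, not PostgreSQL code; the function names and the flushed_by_this_checkpoint flag are inventions for the example, and real backup-block decisions involve more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Current conservative rule: a full-page image is needed the first
 * time a page is modified after a checkpoint begins, i.e. when the
 * page's last-modified LSN precedes the checkpoint's redo pointer. */
static bool
fpw_needed_current(XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    return page_lsn < redo_ptr;
}

/* Proposed relaxation (sketch): if the checkpointer has not yet
 * flushed this buffer during the ongoing checkpoint, it will write
 * and fsync the current contents later anyway, so the full-page
 * image could be skipped. */
static bool
fpw_needed_relaxed(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                   bool flushed_by_this_checkpoint)
{
    return page_lsn < redo_ptr && flushed_by_this_checkpoint;
}
```

Under the relaxed rule, the big burst of FPWs right after the redo pointer advances would shrink to only those pages the checkpointer has already passed over.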
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.
I updated and re-reviewed that in 2011:
and commented on why I think the improvement was difficult to reproduce
back then. The improvement didn't reproduce for me either. It would
take a really amazing bit of data to get me to believe write-sorting
code is worthwhile after that. On large systems capable of dirtying
enough blocks to cause a problem, the operating system and RAID
controllers are already sorting blocks. And *that* sorting also
considers concurrent read requests, which are a lot more important to an
efficient schedule than anything the checkpoint process knows about.
The database doesn't have nearly enough information yet to compete
against OS-level write scheduling.
The downside of my patch is a longer checkpoint: checkpoint time
increased by about 10-20%. But it still completes correctly on schedule
within checkpoint_timeout. Please see the checkpoint results
(http://goo.gl/NsbC6).
For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.
Heikki has nailed the problem with the submitted dbt-2 results here. If
you spread checkpoints out more, you cannot fairly compare the resulting
TPS or latency numbers anymore.
Simple example: a 20-minute test. Server A does a checkpoint every
5 minutes. Server B has modified parameters or server code such that
checkpoints happen every 6 minutes. If you run both to completion, A
will have hit 4 checkpoints that flush the buffer cache, B only 3. Of
course B will seem faster. It didn't do as much work.
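The arithmetic behind that example is trivial but worth pinning down, since it's the whole fairness argument:

```c
#include <assert.h>

/* Number of checkpoints that complete during a benchmark run, given
 * the run length and the checkpoint interval, both in minutes. */
static int
checkpoints_during_test(int test_minutes, int interval_minutes)
{
    return test_minutes / interval_minutes;
}
```

With a 20-minute run, a 5-minute interval yields 4 checkpoints and a 6-minute interval yields 3, so server B pays one fewer round of buffer-cache flushing during the measured window.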
pgbench_tools measures the number of checkpoints during the test, as
well as the buffer count statistics. If those numbers are very
different between two tests, I have to throw them out as unfair. A lot
of things that seem promising turn out to have this sort of problem.
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com