On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.

That's exactly right. When a checkpoint finishes, the OS write cache is clean. That means that in many cases the full-page writes aren't even hitting disk. They just pile up in the OS dirty memory, often sitting there all the way until the next checkpoint's fsyncs start. That's why I never wandered down the road of changing FPW behavior. I have never seen a benchmark workload hit a write bottleneck until long after the big burst of FPW pages is over.

I could easily believe that there are low-memory systems where the FPW write pressure becomes a problem earlier. And slim VMs make sense as the place where this behavior would first show up.

I'm a big fan of instrumenting the code around a performance change before touching anything, as a companion patch that might make sense to commit on its own. In the case of a change to FPW spacing, I'd want to see some diagnostic output, in something like pg_stat_bgwriter, that tracks how many full-page writes are happening. A pg_stat_bgwriter.full_page_writes counter would be perfect here; that data could then be graphed over time as the benchmark runs.
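To make that concrete, here is a minimal sketch of the kind of counter I have in mind. This is standalone illustration code only; the struct, field, and function names are invented and do not exist in PostgreSQL:

    /*
     * Hypothetical sketch: counting full-page images so they could be
     * exposed through something like pg_stat_bgwriter.  All names here
     * are invented for illustration.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* Counters a backend would accumulate and ship to the stats collector. */
    typedef struct FpwStats
    {
        uint64_t    wal_records;        /* WAL records inserted */
        uint64_t    full_page_writes;   /* backup blocks (FPIs) attached */
    } FpwStats;

    static FpwStats stats;

    /* Would be called wherever a backup block gets attached to a WAL record. */
    static void
    count_wal_insert(int n_backup_blocks)
    {
        stats.wal_records++;
        stats.full_page_writes += n_backup_blocks;
    }

    int
    main(void)
    {
        /* Simulate the burst of inserts right after a checkpoint starts. */
        for (int i = 0; i < 1000; i++)
            count_wal_insert(i < 800 ? 1 : 0);  /* most early records carry an FPI */

        printf("records=%llu fpw=%llu\n",
               (unsigned long long) stats.wal_records,
               (unsigned long long) stats.full_page_writes);
        return 0;
    }

Graphing that counter against time, next to the checkpoint start and end markers, would show exactly how front-loaded the FPI burst really is on a given workload.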

Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smooth out WAL write I/O, but it would be interesting to try.

There I also think the right way to proceed is instrumenting that area first.
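To make the idea concrete, the rule change being described amounts to roughly the following. This is a standalone sketch with invented type and function names, not actual PostgreSQL code; the "flushed by this checkpoint" flag is exactly the bookkeeping that doesn't exist today and that instrumentation would need to justify:

    /*
     * Hypothetical sketch of the FPI decision being discussed.  The
     * type and function names are invented for illustration.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    /*
     * Current, conservative rule: take a full-page image the first time
     * a page is modified after a checkpoint starts, i.e. whenever the
     * page's LSN is not newer than the checkpoint's redo pointer.
     */
    static bool
    fpi_needed_today(XLogRecPtr page_lsn, XLogRecPtr redo_ptr, bool full_page_writes)
    {
        return full_page_writes && page_lsn <= redo_ptr;
    }

    /*
     * Relaxed rule Heikki describes: if the checkpointer has not yet
     * written this buffer during the current checkpoint, it will write
     * and fsync it before the checkpoint completes, so the FPI could be
     * skipped until then.
     */
    static bool
    fpi_needed_relaxed(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                       bool full_page_writes, bool flushed_this_checkpoint)
    {
        if (!full_page_writes || page_lsn > redo_ptr)
            return false;
        return flushed_this_checkpoint;
    }

    int
    main(void)
    {
        XLogRecPtr redo = 1000;

        /* Page last touched before the redo point, not yet written by the checkpointer. */
        printf("today:   %d\n", fpi_needed_today(900, redo, true));
        printf("relaxed: %d\n", fpi_needed_relaxed(900, redo, true, false));
        return 0;
    }

How much WAL this would actually save, and when during the checkpoint cycle, is the sort of question the counter above would answer before changing any behavior.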

A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.

I updated and re-reviewed that in 2011: http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com and commented on why I think the improvement was difficult to reproduce back then. The improvement didn't show up for me either. It would take a really amazing bit of data to get me to believe write sorting code is worthwhile after that. On large systems capable of dirtying enough blocks to cause a problem, the operating system and RAID controllers are already sorting blocks. And *that* sorting also considers concurrent read requests, which are a lot more important to an efficient schedule than anything the checkpoint process knows about. The database doesn't have nearly enough information yet to compete against OS-level sorting.
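For reference, the sorting in question amounts to roughly the following; this is a simplified, standalone sketch with invented struct and function names, not the actual patch:

    /*
     * Hypothetical illustration of checkpoint write sorting, in the
     * spirit of the Itagaki patch: order the dirty buffers by file and
     * block so the writes reach the kernel roughly sequentially.  The
     * struct and function names are invented for this sketch.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct DirtyBuf
    {
        uint32_t    tablespace;
        uint32_t    relfile;
        uint32_t    forknum;
        uint32_t    blocknum;
    } DirtyBuf;

    static int
    dirtybuf_cmp(const void *a, const void *b)
    {
        const DirtyBuf *x = a;
        const DirtyBuf *y = b;

        if (x->tablespace != y->tablespace)
            return x->tablespace < y->tablespace ? -1 : 1;
        if (x->relfile != y->relfile)
            return x->relfile < y->relfile ? -1 : 1;
        if (x->forknum != y->forknum)
            return x->forknum < y->forknum ? -1 : 1;
        if (x->blocknum != y->blocknum)
            return x->blocknum < y->blocknum ? -1 : 1;
        return 0;
    }

    /* The checkpointer would sort its entire to-write list once per checkpoint. */
    static void
    sort_checkpoint_writes(DirtyBuf *bufs, size_t n)
    {
        qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp);
    }

    int
    main(void)
    {
        DirtyBuf bufs[] = {
            {1663, 16384, 0, 42},
            {1663, 16384, 0, 7},
            {1663, 12345, 0, 3},
        };

        sort_checkpoint_writes(bufs, 3);

        for (size_t i = 0; i < 3; i++)
            printf("rel %u block %u\n",
                   (unsigned) bufs[i].relfile, (unsigned) bufs[i].blocknum);
        return 0;
    }

The "large array" mentioned above is the bufs list here: one entry per dirty buffer, allocated right when you're trying to checkpoint, which is where the out-of-memory concern comes from.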


The bad point of my patch is a longer checkpoint. Checkpoint time increased by about 10% - 20%. But it still completes correctly on schedule, within checkpoint_timeout. Please see the checkpoint results (http://goo.gl/NsbC6).

For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.

Heikki has nailed the problem with the submitted dbt-2 results here. If you spread checkpoints out more, you cannot fairly compare the resulting TPS or latency numbers anymore.

Simple example: 20 minute long test. Server A does a checkpoint every 5 minutes. Server B has modified parameters or server code such that checkpoints happen every 6 minutes. If you run both to completion, A will have hit 4 checkpoints that flush the buffer cache, B only 3. Of course B will seem faster. It didn't do as much work.

pgbench_tools measures the number of checkpoints during the test, as well as the buffer count statistics. If those numbers are very different between two tests, I have to throw them out as unfair. A lot of things that seem promising turn out to have this sort of problem.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

