On 6/16/13 10:27 AM, Heikki Linnakangas wrote:
Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.
That's exactly right. When a checkpoint finishes, the OS write cache is
clean. That means that in many cases the full-page writes aren't even
hitting disk; they just pile up in OS dirty memory, often sitting there
until the next checkpoint's fsyncs start. That's
why I never wandered down the road of changing FPW behavior. I have
never seen a benchmark workload hit a write bottleneck until long after
the big burst of FPW pages is over.
I could easily believe that there are low-memory systems where the FPW
write pressure becomes a problem earlier, and slim VMs are a plausible
place for this behavior to show up.
I'm a big fan of instrumenting the code around a performance change
before touching anything, as a companion patch that might make sense to
commit on its own. In the case of a change to FPW spacing, I'd want to
see some diagnostic output in something like pg_stat_bgwriter that
tracks how many FPW pages are being written. A
pgstat_bgwriter.full_page_writes counter would be perfect here; that
data could then be graphed over time as the benchmark runs.
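As a rough sketch of the kind of counter being suggested here: the struct and field names below are hypothetical (pg_stat_bgwriter has no full_page_writes column today), and the call site is a toy stand-in for wherever XLogInsert decides to attach a backup block.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stats block; this is only a sketch of the proposal,
 * not an actual PostgreSQL structure. */
typedef struct BgWriterStats
{
    long buf_written_checkpoints;
    long full_page_writes;      /* proposed: count of FPW images emitted */
} BgWriterStats;

static BgWriterStats pgstat_bgwriter;

/* Toy stand-in for the WAL-insert path: bump the counter whenever a
 * buffer modification ends up emitting a full-page image. */
static void
record_buffer_modification(bool emitted_fpw)
{
    if (emitted_fpw)
        pgstat_bgwriter.full_page_writes++;
}
```

Sampling that counter once a second during a benchmark run would show exactly when the post-checkpoint FPW burst happens and how long it lasts.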
Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smooth out WAL write I/O, but it would be interesting to try.
There, too, I think the right way to proceed is to instrument that area
first.
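For clarity, the relaxation Heikki describes can be sketched as a pair of predicates. This is illustrative C, not PostgreSQL code; the function names and the flushed_by_this_checkpoint flag are inventions for the example, and real backup-block decisions involve more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Current conservative rule: a full-page image is needed the first
 * time a page is modified after a checkpoint begins, i.e. when the
 * page's last-modified LSN precedes the checkpoint's redo pointer. */
static bool
fpw_needed_current(XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    return page_lsn < redo_ptr;
}

/* Proposed relaxation (sketch): if the checkpointer has not yet
 * flushed this buffer during the ongoing checkpoint, it will write
 * and fsync the current contents later anyway, so the full-page
 * image could be skipped. */
static bool
fpw_needed_relaxed(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                   bool flushed_by_this_checkpoint)
{
    return page_lsn < redo_ptr && flushed_by_this_checkpoint;
}
```

Under the relaxed rule, the big burst of FPWs right after the redo pointer advances would shrink to only those pages the checkpointer has already passed over.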
A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.
I updated and re-reviewed that in 2011:
and commented on why I think the improvement was difficult to reproduce
back then. The improvement didn't reproduce for me either. It would
take a really amazing bit of data to get me to believe write-sorting
code is worthwhile after that. On large systems capable of dirtying
enough blocks to cause a problem, the operating system and RAID
controllers are already sorting blocks. And *that* sorting also
considers concurrent read requests, which are a lot more important to an
efficient schedule than anything the checkpoint process knows about.
The database doesn't have nearly enough information yet to compete
against OS-level write scheduling.
The downside of my patch is a longer checkpoint: checkpoint time
increased by about 10-20%. But it still completes correctly on schedule
within checkpoint_timeout. Please see the checkpoint results
(http://goo.gl/NsbC6).
For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.
Heikki has nailed the problem with the submitted dbt-2 results here. If
you spread checkpoints out more, you cannot fairly compare the resulting
TPS or latency numbers anymore.
Simple example: a 20-minute test. Server A does a checkpoint every
5 minutes. Server B has modified parameters or server code such that
checkpoints happen every 6 minutes. If you run both to completion, A
will have hit 4 checkpoints that flush the buffer cache, B only 3. Of
course B will seem faster. It didn't do as much work.
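The arithmetic behind that example is trivial but worth pinning down, since it's the whole fairness argument:

```c
#include <assert.h>

/* Number of checkpoints that complete during a benchmark run, given
 * the run length and the checkpoint interval, both in minutes. */
static int
checkpoints_during_test(int test_minutes, int interval_minutes)
{
    return test_minutes / interval_minutes;
}
```

With a 20-minute run, a 5-minute interval yields 4 checkpoints and a 6-minute interval yields 3, so server B pays one fewer round of buffer-cache flushing during the measured window.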
pgbench_tools measures the number of checkpoints during the test, as
well as the buffer count statistics. If those numbers are very
different between two tests, I have to throw them out as unfair. A lot
of things that seem promising turn out to have this sort of problem.
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com