I'm again looking at the way the GUC variables work in the load
distributed checkpoints patch. We've discussed them a lot already, but I
don't think they're quite right yet.
I like the way the write-phase is controlled in general. Writes are
throttled so that we spend the specified percentage of checkpoint
interval doing the writes. But we always write at a specified minimum
rate to avoid spreading out the writes unnecessarily when there's little
work to do.
The original patch uses bgwriter_all_max_pages to set the minimum rate.
I think we should have a separate variable, checkpoint_write_min_rate,
in KB/s, instead.
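As a rough sketch of the pacing described above (function and parameter names are mine, purely illustrative, not from the patch), the write rate would be the larger of the paced rate and the minimum-rate floor:

```python
# Illustrative sketch of write-phase pacing; names are hypothetical.

BLCKSZ = 8192  # PostgreSQL block size in bytes


def write_phase_rate(buffers_to_write: int,
                     checkpoint_timeout_s: float,
                     write_percent: float,
                     min_rate_kb_s: float) -> float:
    """Return the write rate (buffers/second) for the checkpoint writer.

    Writes are spread over write_percent of the checkpoint interval,
    but never issued slower than min_rate_kb_s.
    """
    budget_s = checkpoint_timeout_s * write_percent / 100.0
    paced_rate = buffers_to_write / budget_s      # rate that just fills the budget
    floor_rate = min_rate_kb_s * 1024.0 / BLCKSZ  # minimum rate in buffers/s
    return max(paced_rate, floor_rate)


# With checkpoint_timeout = 300 s, write_percent = 50 and a 1000 KB/s floor:
# 30000 dirty buffers -> 200 buffers/s (pacing dominates),
# 1000 dirty buffers  -> 125 buffers/s (the KB/s floor takes over).
```

The point of the floor is visible in the second case: with little work to do, we write at the minimum rate and finish early rather than dribbling writes across the whole interval.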
The nap phase is trickier. The purpose of the sleep between the writes
and the fsyncs is to give the OS a chance to flush the pages to disk at
its own pace, hopefully limiting the effect on concurrent activity. The sleep
shouldn't last too long, because any concurrent activity can be dirtying
and writing more pages, and we might end up fsyncing more than necessary
which is bad for performance. The optimal delay depends on many factors,
but I believe it's somewhere between 0-30 seconds in any reasonable system.
In the current patch, the duration of the sleep between the write and
sync phases is controlled as a percentage of checkpoint interval. Given
that the optimal delay is in the range of seconds, and
checkpoint_timeout can be up to 60 minutes, the useful values of that
percentage would be very small, like 0.5% or even less. Furthermore, the
optimal value doesn't depend that much on the checkpoint interval, it's
more dependent on your OS and memory configuration.
We should therefore give the delay as a number of seconds instead of as
a percentage of checkpoint interval.
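To make the mismatch concrete, here is a small illustrative calculation (hypothetical helper name) of what percentage a fixed nap corresponds to at different checkpoint intervals:

```python
def nap_percent_needed(nap_seconds: float, checkpoint_timeout_s: float) -> float:
    """Percentage of the checkpoint interval that a fixed nap represents."""
    return 100.0 * nap_seconds / checkpoint_timeout_s


# A 2-second nap is ~0.056% of a 60-minute checkpoint_timeout,
# but ~0.67% of a 5-minute one -- the "right" percentage shifts by more
# than an order of magnitude even though the desired nap is the same.
```

That is exactly why an absolute number of seconds is the more natural unit here.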
The sync phase is also tricky. As with the nap phase, we don't want to spend too
much time fsyncing, because concurrent activity will write more dirty
pages and we might just end up doing more work.
And we don't know how much work an fsync performs. The patch uses the
file size as a measure of that, but as we discussed, that doesn't
necessarily have anything to do with reality: fsyncing a 1GB file with
one dirty block isn't any more expensive than fsyncing a small file with
a single dirty block.
Another problem is the granularity of an fsync. If we fsync a 1GB file
that's full of dirty pages, we can't limit the effect on other activity.
The best we can do is to sleep between fsyncs, but sleeping more than a
few seconds is hardly going to be useful, no matter how bad an I/O storm
each fsync causes.
Because of the above, I'm thinking we should ditch the
checkpoint_sync_percentage variable, in favor of:
checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay # max. sleep between fsyncs, in milliseconds
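A minimal sketch of how those two settings could interact (hypothetical helper, not from the patch): spread the fsyncs evenly over the period, but cap each individual sleep at the delay limit:

```python
def fsync_sleep_ms(nfiles: int, fsync_period_s: float, fsync_delay_ms: float) -> float:
    """Sleep between consecutive fsyncs in the sync phase.

    Spreads nfiles fsyncs evenly over fsync_period_s, but never sleeps
    longer than fsync_delay_ms between any two of them.
    """
    if nfiles <= 1:
        return 0.0  # nothing to spread out
    even_ms = fsync_period_s * 1000.0 / (nfiles - 1)
    return min(even_ms, fsync_delay_ms)


# With the suggested defaults (30 s period, 500 ms cap):
# 20 files  -> even spacing would be ~1579 ms, so the 500 ms cap applies;
# 100 files -> ~303 ms between fsyncs, under the cap.
```

With few files the cap kicks in and the sync phase simply finishes early, which matches the observation above that sleeping longer than a few seconds per fsync wouldn't help anyway.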
In all phases, the normal bgwriter activities are performed:
lru-cleaning and switching xlog segments if archive_timeout expires. If
a new checkpoint request arrives while the previous one is still in
progress, we skip all the delays and finish the previous checkpoint as
soon as possible.
GUC summary and suggested default values
checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty
buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500 # max. delay between fsyncs, in milliseconds
I don't like adding that many GUC variables, but I don't really see a
way to tune them automatically. Maybe we could just hard-code the last
one, it doesn't seem that critical, but that still leaves us 4 variables.