I'm again looking at the way the GUC variables work in the load distributed checkpoints patch. We've discussed them a lot already, but I still don't think they're quite right.

I like the way the write-phase is controlled in general. Writes are throttled so that we spend the specified percentage of checkpoint interval doing the writes. But we always write at a specified minimum rate to avoid spreading out the writes unnecessarily when there's little work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead.
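To make the pacing rule concrete, here is a rough sketch of how the write rate could be derived. This is not code from the patch: the function name, its arguments, and the "write flat out once the time budget is gone" behaviour are assumptions for illustration, and the 8 KB page size is just the PostgreSQL default block size.

```python
BLCKSZ_KB = 8  # assumed page size in KB (the PostgreSQL default)


def pages_per_second(pages_left, seconds_left, min_rate_kbs):
    """Write rate for the rest of the write phase, in pages per second.

    pages_left    dirty pages still to be written by this checkpoint
    seconds_left  time remaining in the write-phase budget
    min_rate_kbs  floor on the write rate, in KB/s (checkpoint_write_min_rate)
    """
    min_pages = min_rate_kbs / BLCKSZ_KB
    if seconds_left <= 0:
        # Assumption: once the budget is used up, stop throttling.
        return float("inf")
    # Rate that would finish exactly when the budget runs out.
    paced = pages_left / seconds_left
    # Never go slower than the configured minimum.
    return max(paced, min_pages)


# 150 s budget (e.g. 50% of a 300 s interval), 1000 KB/s floor.
busy = pages_per_second(60000, 150, 1000)  # pacing dominates
idle = pages_per_second(1500, 150, 1000)   # floor dominates
```

With little work to do, the floor takes over and the writes finish well before the budget is spent, which is the point of having a minimum rate.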

Nap phase
This is trickier. The purpose of the sleep between writes and fsyncs is to give the OS a chance to flush the pages to disk at its own pace, hopefully limiting the effect on concurrent activity. The sleep shouldn't last too long, because any concurrent activity can be dirtying and writing more pages, and we might end up fsyncing more than necessary, which is bad for performance. The optimal delay depends on many factors, but I believe it's somewhere between 0 and 30 seconds in any reasonable system.

In the current patch, the duration of the sleep between the write and sync phases is controlled as a percentage of checkpoint interval. Given that the optimal delay is in the range of seconds, and checkpoint_timeout can be up to 60 minutes, the useful values of that percentage would be very small, like 0.5% or even less. Furthermore, the optimal value doesn't depend that much on the checkpoint interval; it's more dependent on your OS and memory configuration.

We should therefore give the delay as a number of seconds instead of as a percentage of checkpoint interval.
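The arithmetic behind that is easy to check. The two helpers below are only illustrative (the names are made up here); they convert between a nap given in seconds and one given as a percentage of the checkpoint interval.

```python
def nap_percent(nap_seconds, checkpoint_timeout_seconds):
    """Nap length expressed as a percentage of the checkpoint interval."""
    return 100.0 * nap_seconds / checkpoint_timeout_seconds


def nap_seconds(percent, checkpoint_timeout_seconds):
    """Nap length in seconds for a given percentage setting."""
    return percent * checkpoint_timeout_seconds / 100.0


# Even the longest plausible nap is under 1% of a 60-minute interval.
longest = nap_percent(30, 3600)

# The same 0.5% setting means very different naps at different intervals.
short_interval = nap_seconds(0.5, 300)   # 5-minute checkpoint_timeout
long_interval = nap_seconds(0.5, 3600)   # 60-minute checkpoint_timeout
```

A 30-second nap is roughly 0.83% of a 60-minute interval, and a fixed 0.5% setting gives 1.5 seconds at a 5-minute interval but 18 seconds at 60 minutes, even though nothing about the OS or the memory configuration has changed.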

Sync phase
This is also tricky. As with the nap phase, we don't want to spend too much time fsyncing, because concurrent activity will write more dirty pages and we might just end up doing more work.

And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed, that doesn't necessarily have anything to do with reality. fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file that's full of dirty pages, we can't limit the effect on other activity. The best we can do is to sleep between fsyncs, but sleeping more than a few seconds is hardly going to be useful, no matter how bad an I/O storm each fsync causes.

Because of the above, I'm thinking we should ditch the checkpoint_sync_percentage variable, in favor of:
checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay  # max. sleep between fsyncs, in milliseconds
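One way the two settings could interact, as a sketch only: spread the fsyncs evenly over the period, but never sleep longer than the configured maximum. How the two variables combine is an assumption here, the function name is hypothetical, and the time spent inside the fsync calls themselves is ignored.

```python
def fsync_sleep_ms(files_left, seconds_left, max_delay_ms):
    """Sleep to take before the next fsync, in milliseconds.

    files_left    files that still need an fsync
    seconds_left  time remaining in the sync phase (checkpoint_fsync_period)
    max_delay_ms  cap on a single sleep (checkpoint_fsync_delay)
    """
    if files_left <= 0 or seconds_left <= 0:
        # Nothing left to pace, or the phase has overrun: don't sleep.
        return 0.0
    # Sleep that would spread the remaining fsyncs over the remaining time.
    even_ms = 1000.0 * seconds_left / files_left
    # Cap it, since a long sleep between fsyncs is unlikely to help.
    return min(even_ms, max_delay_ms)


few_files = fsync_sleep_ms(20, 30, 500)    # capped by the maximum delay
many_files = fsync_sleep_ms(200, 30, 500)  # evenly spread over the period
```

With only a few files the cap applies and the sync phase ends early; with many files the sleeps shrink so that the phase still fits in the period.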

In all phases, the normal bgwriter activities are performed: lru-cleaning and switching xlog segments if archive_timeout expires. If a new checkpoint request arrives while the previous one is still in progress, we skip all the delays and finish the previous checkpoint as soon as possible.

GUC summary and suggested default values
checkpoint_write_percent = 50           # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000        # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2             # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30            # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500            # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. Maybe we could just hard-code the last one, since it doesn't seem that critical, but that still leaves us with 4 variables.


  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
