ITAGAKI Takahiro wrote:
Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

- Progress of checkpoint is controlled not only based on checkpoint_timeout
  but also checkpoint_segments. -- Now it works better with large
  checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the calculations. We might want to call GetCheckpointProgress something else, though. It doesn't return the amount of progress made, but rather the amount of progress we should've made up to that point or we're in danger of not completing the checkpoint in time.
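
To make the naming point concrete, here is a minimal Python sketch of the kind of value such a function returns. The name required_progress and its parameters are hypothetical, not the patch's code, and it assumes the target is simply the larger of the elapsed-time fraction and the consumed-segments fraction.

```python
def required_progress(elapsed_secs, checkpoint_timeout_secs,
                      segments_used, checkpoint_segments):
    """Fraction (0.0 to 1.0) of the checkpoint that should be done by now.

    Hypothetical illustration: this is a deadline-driven target, not a
    measurement of how much work has actually been completed.
    """
    by_time = elapsed_secs / checkpoint_timeout_secs
    by_segments = segments_used / checkpoint_segments
    # Whichever limit is closer to being hit sets the pace.
    return min(1.0, max(by_time, by_segments))
```

With a large checkpoint_timeout and a small checkpoint_segments, the segment term dominates, which matches the behavior described in the quoted change note.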

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase, however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed, where enough is defined by checkpoint_nap_percent. However, if we're already past checkpoint_write_percent at the beginning of the nap, I think we should clamp the nap time so that we don't run out of time before the next checkpoint because of sleeping.
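
One way the clamp could look, as a rough Python sketch. The function nap_end is hypothetical, progress is expressed as a fraction of the checkpoint interval, and reserving checkpoint_sync_percent for the sync phase is an assumption about what "not running out of time" should mean here.

```python
def nap_end(progress_at_nap_start, nap_percent, sync_percent):
    """Progress fraction at which the nap should end (hypothetical sketch)."""
    # Where the nap would end if we always slept the full nap share.
    unclamped = progress_at_nap_start + nap_percent / 100.0
    # Latest point that still leaves the sync phase its configured share.
    latest = 1.0 - sync_percent / 100.0
    # Never end before we started: a late write phase means no nap at all.
    return max(progress_at_nap_start, min(unclamped, latest))
```

If the write phase finished on schedule the clamp has no effect; if it overran, the nap shrinks, possibly to nothing.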

In the sync phase, we sleep between each fsync until enough time/segments have passed, assuming that the time to fsync is proportional to the file length. I'm not sure that's a very good assumption. We might have one huge file with only very little changed data, for example a logging table that is just occasionally appended to. If we begin by fsyncing that, it'll take a very short time to finish, and we'll then sleep for a long time. If we then have another large file to fsync, but that one has all pages dirty, we risk running out of time because of the unnecessarily long sleep. The segmentation of relations limits the risk of that, though, by limiting the max. file size, and I don't really have any better suggestions.
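
The concern can be illustrated with a small Python sketch of length-proportional budgeting. The function sync_sleep_plan is hypothetical and only models the assumption described above; it is not the patch's code.

```python
def sync_sleep_plan(file_sizes, sync_budget_secs):
    """Split the sync budget across files in proportion to file length.

    Hypothetical model: each file gets a time slot sized by its length,
    regardless of how many of its pages are actually dirty.
    """
    total = sum(file_sizes)
    if total == 0:
        return []
    return [sync_budget_secs * size / total for size in file_sizes]


# Two files of equal length get equal slots, even if the first is a
# mostly clean logging table and the second has every page dirty.
slots = sync_sleep_plan([1024, 1024], 60)
```

In this model the clean file's fsync returns almost immediately and the rest of its slot is spent sleeping, which is time the fully dirty file could have used.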

In the write phase, bgwriter_all_maxpages is also factored into the sleeps. On each iteration, we write bgwriter_all_maxpages pages and then we sleep for bgwriter_delay msecs. checkpoint_write_percent only controls the maximum amount of time we try to spend in the write phase: we skip the sleeps if we're exceeding checkpoint_write_percent, but the phase can very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum* number of pages to write between sleeps. If it's not set, we use WRITES_PER_ABSORB, which is hardcoded to 1000.
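
The write-phase pacing described above can be sketched as a small Python simulation. The function simulate_write_phase and its parameters are hypothetical; it treats the writes themselves as taking no time, which is a simplification, and it expresses the write budget in milliseconds rather than as a percentage.

```python
def simulate_write_phase(dirty_pages, pages_per_round, delay_ms,
                         write_budget_ms):
    """Return (elapsed_ms, sleeps_taken) for a simulated write phase.

    Hypothetical model: each round writes up to pages_per_round pages,
    then sleeps delay_ms unless the write budget is already used up.
    """
    elapsed_ms = 0
    sleeps = 0
    remaining = dirty_pages
    while remaining > 0:
        remaining -= min(pages_per_round, remaining)
        # Sleep only if work remains and we are still within budget.
        if remaining > 0 and elapsed_ms < write_budget_ms:
            elapsed_ms += delay_ms
            sleeps += 1
    return elapsed_ms, sleeps
```

With only 10 dirty pages the phase ends after a single round with no sleep at all, which is the "finish earlier" case; once the budget is exhausted the remaining rounds run back to back.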

The approach of writing min. N pages per iteration seems sound to me. By setting N we can control the maximum impact of a checkpoint under normal circumstances. If there's very little work to do, it doesn't make sense to stretch the write of say 10 buffers across a 15 min period; it's indeed better to finish the checkpoint earlier. It's similar to vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it doesn't feel right; we should at least name it differently. The default of 1000 is a bit high as well: with the default bgwriter_delay that adds up to 39MB/s. That's ok for a decent I/O subsystem, but the default really should be something that will still leave room for other I/O on a small single-disk server.
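
The 39MB/s figure follows from the default 8 kB block size and the default bgwriter_delay of 200 ms. A short Python check of the arithmetic (the helper name write_rate_mb_per_sec is made up for this illustration, and the time taken by the writes themselves is ignored, so this is an upper bound):

```python
def write_rate_mb_per_sec(pages_per_round, delay_ms, block_size=8192):
    """Peak write rate in MB/s when sleeping delay_ms between rounds."""
    rounds_per_sec = 1000.0 / delay_ms
    bytes_per_sec = pages_per_round * block_size * rounds_per_sec
    return bytes_per_sec / (1024 * 1024)


# 1000 pages * 8192 bytes * 5 rounds/s = 40,960,000 bytes/s
rate = write_rate_mb_per_sec(1000, 200)
```

Dropping N to 100 pages per round would cap the same calculation at roughly 3.9 MB/s.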

Should we try doing something similar for the sync phase? If there's only 2 small files to fsync, there's no point sleeping for 5 minutes between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they exceed 100%?
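
Such a check would be simple. As a hypothetical Python sketch (the function name and message text are invented here, and it assumes the three settings are meant to be summed):

```python
def percent_settings_warning(write_percent, nap_percent, sync_percent):
    """Return a warning string if the three shares exceed 100%, else None."""
    total = write_percent + nap_percent + sync_percent
    if total > 100:
        return ("WARNING: checkpoint_write_percent + checkpoint_nap_percent"
                " + checkpoint_sync_percent = %d%%, which exceeds 100%%"
                % total)
    return None
```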

  Heikki Linnakangas

