[HACKERS] Controlling Load Distributed Checkpoints

Heikki Linnakangas Wed, 06 Jun 2007 06:23:13 -0700

I'm again looking at the way the GUC variables work in the load distributed checkpoints patch. We've discussed them a lot already, but I still don't think they're quite right.


Write-phase
-----------

I like the way the write phase is controlled in general. Writes are throttled so that we spend the specified percentage of the checkpoint interval doing the writes. But we always write at a specified minimum rate, to avoid spreading out the writes unnecessarily when there's little work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead.
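
To make the throttling rule concrete, here is a rough sketch of how the target write rate could be derived. This is not code from the patch: the function name, the 8 KB page size and the example numbers are assumptions made for illustration only.

```python
BLOCK_KB = 8  # assumed page size in KB (illustrative, not taken from the patch)


def checkpoint_write_rate_kbps(dirty_pages, checkpoint_interval_s,
                               write_percent, write_min_rate_kbps):
    """Hypothetical helper: target write rate in KB/s for the write phase.

    The writes are spread over write_percent of the checkpoint interval,
    but the rate never drops below write_min_rate_kbps.
    """
    write_window_s = checkpoint_interval_s * write_percent / 100.0
    spread_rate = dirty_pages * BLOCK_KB / write_window_s
    return max(spread_rate, float(write_min_rate_kbps))


# Little work to do: 1000 dirty pages over a 150 s window would need only
# about 53 KB/s, so the minimum rate of 1000 KB/s applies instead.
light = checkpoint_write_rate_kbps(1000, 300, 50, 1000)

# Lots of work: 100000 dirty pages over the same window need about
# 5333 KB/s, which is above the minimum, so the spread rate applies.
heavy = checkpoint_write_rate_kbps(100000, 300, 50, 1000)
```

In the light case the write phase finishes well before the 50% mark, which is the point of having a minimum rate.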


Nap phase
---------

This is trickier. The purpose of the sleep between writes and fsyncs is to give the OS a chance to flush the pages to disk at its own pace, hopefully limiting the effect on concurrent activity. The sleep shouldn't last too long, because any concurrent activity can be dirtying and writing more pages, and we might end up fsyncing more than necessary, which is bad for performance. The optimal delay depends on many factors, but I believe it's somewhere between 0 and 30 seconds in any reasonable system.

In the current patch, the duration of the sleep between the write and sync phases is controlled as a percentage of the checkpoint interval. Given that the optimal delay is in the range of seconds, and checkpoint_timeout can be up to 60 minutes, the useful values of that percentage would be very small, like 0.5% or even less. Furthermore, the optimal value doesn't depend that much on the checkpoint interval; it's more dependent on your OS and memory configuration.

We should therefore give the delay as a number of seconds instead of as a percentage of the checkpoint interval.
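
The arithmetic behind that argument can be checked with a small sketch. The helper names are hypothetical, and the 5 minute checkpoint_timeout is just an example value chosen for comparison.

```python
def nap_as_percent(nap_seconds, checkpoint_timeout_s):
    """Express a nap of nap_seconds as a percentage of the checkpoint interval."""
    return 100.0 * nap_seconds / checkpoint_timeout_s


def nap_seconds_from_percent(percent, checkpoint_timeout_s):
    """Length in seconds of a nap given as a percentage of the interval."""
    return percent * checkpoint_timeout_s / 100.0


# A 2 second nap with checkpoint_timeout = 60 minutes is a tiny percentage.
tiny = nap_as_percent(2, 60 * 60)  # roughly 0.056 %

# The same 0.5% setting means very different naps depending on the timeout.
nap_60min = nap_seconds_from_percent(0.5, 60 * 60)  # 18 seconds
nap_5min = nap_seconds_from_percent(0.5, 5 * 60)    # 1.5 seconds
```

A percentage setting would have to be retuned every time checkpoint_timeout changes, even though the best nap length did not change.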


Sync phase
----------

This is also tricky. As with the nap phase, we don't want to spend too much time fsyncing, because concurrent activity will write more dirty pages and we might just end up doing more work.

And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed, that doesn't necessarily have anything to do with reality. Fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file that's full of dirty pages, we can't limit the effect on other activity. The best we can do is to sleep between fsyncs, but sleeping more than a few seconds is hardly going to be useful, no matter how bad an I/O storm each fsync causes.

Because of the above, I'm thinking we should ditch the checkpoint_sync_percentage variable, in favor of:

checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay  # max. sleep between fsyncs, in milliseconds
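
How the two settings interact isn't spelled out above. One possible reading, and it is only an assumption, is that the fsyncs are spread evenly over checkpoint_fsync_period, with the sleep between two fsyncs capped at checkpoint_fsync_delay. A sketch of that reading, using a hypothetical helper name:

```python
def fsync_sleep_ms(files_to_sync, fsync_period_s, fsync_delay_ms):
    """Hypothetical helper: sleep between consecutive fsyncs, in milliseconds.

    Assumed rule (not confirmed by the patch): spread the fsyncs evenly
    over fsync_period_s, but never sleep longer than fsync_delay_ms.
    """
    if files_to_sync <= 0:
        return 0.0
    even_spread_ms = fsync_period_s * 1000.0 / files_to_sync
    return min(even_spread_ms, float(fsync_delay_ms))


# Few files: an even spread would mean 3000 ms between fsyncs, so the
# 500 ms cap applies and the phase ends well before the 30 s period.
few = fsync_sleep_ms(10, 30, 500)

# Many files: an even spread gives 100 ms, which is below the cap.
many = fsync_sleep_ms(300, 30, 500)
```

Under this reading the period acts as an upper bound on the total sleep time in the sync phase, and the delay keeps the sleeps short when only a few files need syncing.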

In all phases, the normal bgwriter activities are performed: LRU cleaning, and switching xlog segments if archive_timeout expires. If a new checkpoint request arrives while the previous one is still in progress, we skip all the delays and finish the previous checkpoint as soon as possible.



GUC summary and suggested default values
----------------------------------------

checkpoint_write_percent = 50           # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000        # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2             # delay between write and sync phase, in seconds

checkpoint_fsync_period = 30            # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500            # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. Maybe we could just hard-code the last one, since it doesn't seem that critical, but that still leaves us with 4 variables.


Thoughts?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
