Re: [PATCHES] Load distributed checkpoint V4

Heikki Linnakangas Thu, 19 Apr 2007 04:04:53 -0700

ITAGAKI Takahiro wrote:

Here is an updated version of LDC patch (V4).


Thanks! I'll start testing.

- Progress of checkpoint is controlled not only based on checkpoint_timeout
  but also checkpoint_segments. -- Now it works better with large
  checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the calculations. We might want to call GetCheckpointProgress something else, though. It doesn't return the amount of progress made, but rather the amount of progress we should've made up to that point or we're in danger of not completing the checkpoint in time.
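To make the "progress we should've made" idea concrete, here is a minimal Python sketch. It is not taken from the patch: the function name and the choice of taking the larger of the time-based and segment-based fractions are assumptions made only for illustration.

```python
def required_progress(elapsed_secs, checkpoint_timeout,
                      segments_used, checkpoint_segments):
    """Fraction (0.0 to 1.0) of the checkpoint that should be done by now.

    Hypothetical model: whichever limit (time or WAL segments) is closer
    to triggering the next checkpoint dictates the required progress.
    """
    by_time = elapsed_secs / checkpoint_timeout
    by_segments = segments_used / checkpoint_segments
    return min(1.0, max(by_time, by_segments))
```

With a large checkpoint_timeout and a small checkpoint_segments, the segment term dominates: 30 s into a 300 s timeout with 1 of 3 segments consumed gives a required progress of about one third, not one tenth.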

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase, however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed, where enough is defined by checkpoint_nap_percent. However, if we're already past checkpoint_write_percent at the beginning of the nap, I think we should clamp the nap time so that we don't run out of time until the next checkpoint because of sleeping.
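One possible reading of that clamp, sketched in Python. The function name and the exact rule (never nap into the share reserved for the sync phase) are assumptions for illustration, not something the patch does today.

```python
def nap_end(progress_now, nap_percent, sync_percent):
    """Progress fraction at which the nap should end.

    progress_now is the fraction of the checkpoint interval already used
    when the nap starts. Hypothetical rule: nap for nap_percent of the
    interval, but stop early enough to leave sync_percent for the fsyncs.
    """
    unclamped = progress_now + nap_percent / 100.0
    latest = 1.0 - sync_percent / 100.0
    return max(progress_now, min(unclamped, latest))
```

If the write phase overran and the nap starts at 75% with nap = 10 and sync = 20, the nap ends at 80% instead of 85%. If it starts at 90%, the nap is skipped entirely.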

In the sync phase, we sleep between each fsync until enough time/segments have passed, assuming that the time to fsync is proportional to the file length. I'm not sure that's a very good assumption. We might have one huge file with only very little changed data, for example a logging table that is just occasionally appended to. If we begin by fsyncing that, it'll take a very short time to finish, and we'll then sleep for a long time. If we then have another large file to fsync, but that one has all pages dirty, we risk running out of time because of the unnecessarily long sleep. The segmentation of relations limits the risk of that, though, by limiting the max. file size, and I don't really have any better suggestions.
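The length-proportional pacing can be sketched as follows. This is an illustrative Python model with a made-up function name, not the patch's code: after each fsync, we sleep until the returned share of the sync budget has been used up.

```python
def sync_targets(file_lengths):
    """Share of the sync budget that should be used up after each fsync,
    assuming fsync time is proportional to file length."""
    total = sum(file_lengths)
    done = 0
    targets = []
    for length in file_lengths:
        done += length
        targets.append(done / total)
    return targets
```

For two files of equal length the targets are [0.5, 1.0]. If the first file is nearly clean, its fsync returns almost immediately and we still sleep away half the budget, leaving the all-dirty second file only the other half.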

In the write phase, bgwriter_all_maxpages is also factored into the sleeps. On each iteration, we write bgwriter_all_maxpages pages and then we sleep for bgwriter_delay msecs. checkpoint_write_percent only controls the maximum amount of time we try to spend in the write phase; we skip the sleeps if we're exceeding checkpoint_write_percent, but it can very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum* amount of pages to write between sleeps. If it's not set, we use WRITES_PER_ABSORB, which is hardcoded to 1000.
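A toy simulation of that loop, in Python. The names are hypothetical and the model counts only sleep time against the write budget (the time spent in the writes themselves is ignored), which is a simplifying assumption.

```python
def write_phase(dirty_pages, min_pages_per_round, delay_ms, budget_ms):
    """Simulate the write phase; return (total_sleep_ms, rounds).

    Each round writes at least min_pages_per_round pages, then sleeps for
    delay_ms unless the write budget is already used up.
    """
    slept = 0
    rounds = 0
    while dirty_pages > 0:
        dirty_pages -= min(dirty_pages, min_pages_per_round)
        rounds += 1
        # Sleep only if work remains and we are still within budget.
        if dirty_pages > 0 and slept < budget_ms:
            slept += delay_ms
    return slept, rounds
```

With 10 dirty pages and a minimum of 1000 pages per round, the phase finishes in one round with no sleep at all, well before the budget. With 5000 dirty pages and a 400 ms budget, the first two rounds sleep and the remaining ones run back to back.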

The approach of writing min. N pages per iteration seems sound to me. By setting N we can control the maximum impact of a checkpoint under normal circumstances. If there's very little work to do, it doesn't make sense to stretch the write of say 10 buffers across a 15 min period; it's indeed better to finish the checkpoint earlier. It's similar to vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it doesn't feel right; we should at least name it differently. The default of 1000 is a bit high as well; with the default bgwriter_delay that adds up to 39MB/s. That's ok for a decent I/O subsystem, but the default really should be something that will still leave room for other I/O on a small single-disk server.
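The 39MB/s figure can be reproduced with a few lines of Python, assuming the stock 8 kB block size and the default bgwriter_delay of 200 ms (neither value is spelled out above), and ignoring the time the writes themselves take.

```python
def write_rate_mib_per_sec(pages_per_round, delay_ms, block_size=8192):
    """Upper bound on checkpoint write rate in MiB/s when pages_per_round
    pages are written between sleeps of delay_ms milliseconds."""
    bytes_per_sec = pages_per_round * block_size * 1000 / delay_ms
    return bytes_per_sec / (1024 * 1024)

rate = write_rate_mib_per_sec(1000, 200)  # 39.0625
```

Dropping N to 100 under the same assumptions would cap the rate at roughly 3.9 MiB/s.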

Should we try doing something similar for the sync phase? If there's only 2 small files to fsync, there's no point sleeping for 5 minutes between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they exceed 100%?
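Such a check could look roughly like this. It is a Python sketch of the idea only; the function name and message wording are made up, and a real implementation would live in the server's GUC handling.

```python
def percent_budget_warning(write_percent, nap_percent, sync_percent):
    """Return a warning string if the three phases together claim more
    than 100% of the checkpoint interval, otherwise None."""
    total = write_percent + nap_percent + sync_percent
    if total > 100:
        return ("checkpoint_write_percent + checkpoint_nap_percent + "
                f"checkpoint_sync_percent = {total}, which exceeds 100")
    return None
```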


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
