Re: [PATCHES] Load Distributed Checkpoints, take 3

Heikki Linnakangas Tue, 26 Jun 2007 07:00:05 -0700

Tom Lane wrote:

Greg Smith <[EMAIL PROTECTED]> writes:

The way transitions between completely idle and all-out bursts happen
were one problematic area I struggled with. Since the LRU point doesn't
move during the idle parts, and the lingering buffers have a
usage_count>0, the LRU scan won't touch them; the only way to clear out
a bunch of dirty buffers leftover from the last burst is with the
all-scan.


One thing that might be worth changing is that right now, BgBufferSync
starts over from the current clock-sweep point on each call --- that is,
each bgwriter cycle.  So it can't really be made to write very many
buffers without excessive CPU work.  Maybe we should redefine it to have
some static state carried across bgwriter cycles, such that it would
write at most N dirty buffers per call, but scan through X percent of
the buffers, possibly across several calls, before returning to the (by
now probably advanced) clock-sweep point.  This would allow a larger
value of X to be used than is currently practical.  You might wish to
recheck the clock sweep point on each iteration just to make sure the
scan hasn't fallen behind it, but otherwise I don't see any downside.
The scenario where somebody re-dirties a buffer that was cleaned by the
bgwriter scan isn't a problem, because that buffer will also have had its
usage_count increased and thereby not be a candidate for replacement.
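
A minimal sketch of the scan described above, as a toy Python model rather than the actual bufmgr code. The names `Buffer`, `BgScanState`, `bg_buffer_sync`, `max_writes` (N) and `scan_fraction` (X) are all hypothetical, and the clock-sweep point is modeled as a monotonically increasing step count so that "fallen behind" is a simple comparison. usage_count is not modeled.

```python
class Buffer:
    """Toy stand-in for a shared buffer; only the dirty flag is modeled."""
    def __init__(self, dirty=False):
        self.dirty = dirty


class BgScanState:
    """State carried across bgwriter cycles (hypothetical)."""
    def __init__(self):
        self.scan_pos = 0    # monotonic position where the next call resumes
        self.remaining = 0   # buffers left to examine in the current pass


def bg_buffer_sync(buffers, state, clock_pos, max_writes, scan_fraction):
    """Write at most max_writes dirty buffers, resuming the previous pass.

    clock_pos is the total number of clock-sweep steps taken so far
    (an assumption of this model); the buffer index is clock_pos % n.
    Returns the number of buffers written in this call.
    """
    n = len(buffers)
    if state.remaining <= 0:
        # Previous pass finished: start a new one at the current
        # clock-sweep point, covering scan_fraction of the pool.
        state.scan_pos = clock_pos
        state.remaining = int(n * scan_fraction)
    if state.scan_pos < clock_pos:
        # The clock hand overtook the scan; skip ahead to it.
        state.scan_pos = clock_pos

    written = 0
    while state.remaining > 0 and written < max_writes:
        buf = buffers[state.scan_pos % n]
        if buf.dirty:
            buf.dirty = False   # stands in for the actual write
            written += 1
        state.scan_pos += 1
        state.remaining -= 1
    return written
```

With ten dirty buffers, `max_writes=2` and `scan_fraction=0.5`, three consecutive calls finish one five-buffer pass, and only the fourth call goes back to wherever the clock hand is by then.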

Something along those lines could be useful. I've thought of that
before, but it never occurred to me that if a page in front of the clock
hand is re-dirtied, it's no longer a candidate for replacement anyway...

I'm going to leave the all- and lru- bgwriter scans alone for now to get
this LDC patch finished. We still have the bgwriter autotuning patch in
the queue. Let's think about this in the context of that patch.

As a general comment on this subject, a lot of the work in LDC presumes
you have an accurate notion of how close the next checkpoint is.


Yeah; this is one reason I was interested in carrying some write-speed
state across checkpoints instead of having the calculation start from
scratch each time.  That wouldn't help systems that sit idle a long time
and suddenly go nuts, but it seems to me that smoothing the write rate
across more than one checkpoint could help if the fluctuations occur
over a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by
checkpoint_segments; time tends to move forward at a constant rate.

I'm not worried about small fluctuations or bursts. As I argued earlier,
the OS still buffers the writes and should give some extra smoothing of
the physical writes. I believe bursts of say 50-100 MB would easily fit
in OS cache, as long as there's enough gap between them. I haven't
tested that, though.


Here's a proposal for an algorithm to smooth bigger bursts:

The basic design is the same as before. We keep track of elapsed time
and elapsed xlogs, and based on them we estimate how much progress in
flushing the buffers we should've made by now, and then we catch up
until we reach that. The estimate for time is the same. The estimate for
xlogs gets more complicated:


Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from previous checkpoint cycle.
For example, 1.25 means that the checkpoint was triggered by
checkpoint_segments, and we had spent 1/1.25 = 80% of
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean
that checkpoint was triggered by checkpoint_timeout, and we had spent
25% of checkpoint_segments by then.

Rn = elapsed segments / elapsed time this far from current in-progress
checkpoint.

t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though
could be > 1 if next checkpoint is already due)

s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 -
1.0, though could again be > 1 if next checkpoint is already due)

R = estimate for WAL segment consumption rate, expressed in units of
checkpoint_segments / checkpoint_timeout
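
To make the two Ro examples in the definitions concrete, here is a small Python sketch. The helper name `describe_ro` is hypothetical, and treating Ro == 1.0 as a checkpoint_segments-triggered checkpoint is an arbitrary tie-break chosen for the example.

```python
def describe_ro(ro):
    """Interpret Ro = s / t as measured at the end of the previous cycle.

    Returns (trigger, fraction), where fraction is how much of the
    *other* limit had been used up when the checkpoint was triggered.
    """
    if ro >= 1.0:
        # s reached 1.0 first, so t = 1 / Ro of checkpoint_timeout had passed.
        return ("checkpoint_segments", 1.0 / ro)
    # t reached 1.0 first, so s = Ro of checkpoint_segments had been used.
    return ("checkpoint_timeout", ro)


print(describe_ro(1.25))  # ('checkpoint_segments', 0.8)
print(describe_ro(0.25))  # ('checkpoint_timeout', 0.25)
```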


R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight
to the state carried over from previous checkpoint. As we go forward,
more weight is given to the rate calculated from current cycle.


From R, we extrapolate how much progress we should've done by now:

Target progress = R * t
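
A minimal Python sketch of this estimate, following the weighting described in the prose (previous-cycle rate dominates early, current-cycle rate dominates late). The function names are hypothetical. Two details are assumptions not spelled out above: at t = 0 there is no current-cycle rate yet, so the sketch falls back to Ro, and the weight is clamped to 1.0 once t exceeds 1.

```python
def estimate_rate(ro, s, t):
    """Blend the previous-cycle rate Ro with the current-cycle rate Rn = s / t.

    s and t are fractions of checkpoint_segments and checkpoint_timeout.
    """
    if t <= 0.0:
        return ro              # assumption: nothing observed yet, use Ro
    rn = s / t
    w = min(t, 1.0)            # assumption: clamp the weight when t > 1
    return ro * (1.0 - w) + rn * w


def target_progress(ro, s, t):
    """Fraction of the checkpoint's buffer flushing that should be done by now."""
    return estimate_rate(ro, s, t) * t


# Previous cycle ran at Ro = 1.0; a burst has already used half of
# checkpoint_segments a quarter of the way through checkpoint_timeout.
print(estimate_rate(1.0, 0.5, 0.25))    # 1.25
print(target_progress(1.0, 0.5, 0.25))  # 0.3125
```

In this example the raw current-cycle rate is 2.0, but the blended estimate is only 1.25, so the burst raises the target progress gradually instead of all at once.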

This would require saving just one number from previous cycle (Rn), and
there is no requirement to call the estimation function at steady time
intervals, for example. It gives pretty steady I/O rate even if there's
big bursts in WAL activity, but still reacts to changes in the rate.

I'm not convinced this is worth the effort, though. First of all, this
is only a problem if you use checkpoint_segments to control your
checkpoints, so you can lower checkpoint_timeout to do more work during
the idle periods. Secondly, with the optimization of not flushing
buffers during checkpoint that were dirtied after the start of
checkpoint, the LRU-sweep will also contribute to flushing the buffers
and finishing the checkpoint. We don't count them towards the progress
made ATM, but we probably should. Lastly, distributing the writes even a
little bit is going to be smoother than the current behavior anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

