Tom Lane wrote:
Greg Smith <[EMAIL PROTECTED]> writes:
The way transitions between completely idle and all-out bursts happen was
one problematic area I struggled with. Since the LRU point doesn't move
during the idle parts, and the lingering buffers have a usage_count>0, the
LRU scan won't touch them; the only way to clear out a bunch of dirty
buffers left over from the last burst is with the all-scan.
One thing that might be worth changing is that right now, BgBufferSync
starts over from the current clock-sweep point on each call --- that is,
each bgwriter cycle. So it can't really be made to write very many
buffers without excessive CPU work. Maybe we should redefine it to have
some static state carried across bgwriter cycles, such that it would
write at most N dirty buffers per call, but scan through X percent of
the buffers, possibly across several calls, before returning to the (by
now probably advanced) clock-sweep point. This would allow a larger
value of X to be used than is currently practical. You might wish to
recheck the clock sweep point on each iteration just to make sure the
scan hasn't fallen behind it, but otherwise I don't see any downside.
The scenario where somebody re-dirties a buffer that was cleaned by the
bgwriter scan isn't a problem, because that buffer will also have had its
usage_count increased and thereby not be a candidate for replacement.
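To make the idea of carrying scan state across bgwriter cycles concrete, here is a rough sketch in Python. It is not PostgreSQL code: the names CarriedScan, scan_fraction and max_writes are made up for illustration, the buffer pool is a plain list, and a "write" just clears a flag. It only shows the bookkeeping: write at most N buffers per call, cover X percent of the pool over several calls, and restart at the clock-sweep point once that is done or if the clock hand has overtaken the scan.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    dirty: bool = False


class CarriedScan:
    """Hypothetical stand-in for a BgBufferSync that keeps state across calls."""

    def __init__(self, buffers, scan_fraction, max_writes):
        self.buffers = buffers
        self.scan_limit = int(len(buffers) * scan_fraction)  # X percent of the pool
        self.max_writes = max_writes                          # N writes per call
        self.next_pos = 0   # where the next call resumes
        self.scanned = 0    # buffers examined since the last restart

    def cycle(self, clock_hand):
        """Run one bgwriter cycle; return the indexes of the buffers written."""
        n = len(self.buffers)
        # Distance from the clock hand forward to our scan position. A value
        # larger than the scan window means the hand has passed us.
        ahead = (self.next_pos - clock_hand) % n
        if self.scanned >= self.scan_limit or ahead > self.scan_limit:
            self.next_pos = clock_hand
            self.scanned = 0
        written = []
        while self.scanned < self.scan_limit and len(written) < self.max_writes:
            buf = self.buffers[self.next_pos]
            if buf.dirty:
                buf.dirty = False  # stands in for the actual write
                written.append(self.next_pos)
            self.next_pos = (self.next_pos + 1) % n
            self.scanned += 1
        return written
```

With a 10-buffer pool, scan_fraction=0.5 and max_writes=2, it takes three calls to cover the five buffers in front of the clock hand, and only then does the scan jump back to wherever the hand is.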
Something along those lines could be useful. I've thought of that
before, but it never occurred to me that if a page in front of the clock
hand is re-dirtied, it's no longer a candidate for replacement anyway...
I'm going to leave the all- and lru- bgwriter scans alone for now to get
this LDC patch finished. We still have the bgwriter autotuning patch in
the queue. Let's think about this in the context of that patch.
As a general comment on this subject, a lot of the work in LDC presumes
you have an accurate notion of how close the next checkpoint is.
Yeah; this is one reason I was interested in carrying some write-speed
state across checkpoints instead of having the calculation start from
scratch each time. That wouldn't help systems that sit idle a long time
and suddenly go nuts, but it seems to me that smoothing the write rate
across more than one checkpoint could help if the fluctuations occur
over a timescale of a few checkpoints.
Hmm. This problem only applies to checkpoints triggered by
checkpoint_segments; time tends to move forward at a constant rate.
I'm not worried about small fluctuations or bursts. As I argued earlier,
the OS still buffers the writes and should give some extra smoothing of
the physical writes. I believe bursts of say 50-100 MB would easily fit
in OS cache, as long as there's enough gap between them. I haven't
tested that, though.
Here's a proposal for an algorithm to smooth bigger bursts:
The basic design is the same as before. We keep track of elapsed time
and elapsed xlogs, and based on them we estimate how much progress in
flushing the buffers we should've made by now, and then we catch up
until we reach that. The estimate for time is the same. The estimate for
xlogs gets more complicated:
Let's have a few definitions first:
Ro = elapsed segments / elapsed time, from previous checkpoint cycle.
For example, 1.25 means that the checkpoint was triggered by
checkpoint_segments, and we had spent 1/1.25 = 80% of
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean
that checkpoint was triggered by checkpoint_timeout, and we had spent
25% of checkpoint_segments by then.
Rn = elapsed segments / elapsed time so far, from the current in-progress
checkpoint cycle.
t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though
could be > 1 if next checkpoint is already due)
s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 -
1.0, though could again be > 1 if next checkpoint is already due)
R = estimate for WAL segment consumption rate, as checkpoint_segments /
R = Ro * (1 - t) + Rn * t
In other words, at the beginning of the checkpoint, we give more weight
to the state carried over from previous checkpoint. As we go forward,
more weight is given to the rate calculated from current cycle.
From R, we extrapolate how much progress we should've done by now:
Target progress = R * t
This would require saving just one number from the previous cycle (its
final Rn, which becomes Ro for the next cycle), and there is no
requirement to call the estimation function at steady time intervals,
for example. It gives a pretty steady I/O rate even if there are big
bursts in WAL activity, but still reacts to changes in the rate.
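A minimal sketch of the estimate in Python, following the description above: the previous cycle's rate carries the weight at the start of the cycle and the current cycle's rate takes over as t grows, i.e. Ro is weighted by (1 - t) and Rn by t. The function names are made up for illustration, and clamping the weight once t exceeds 1 is my assumption; the proposal doesn't say what to do when the checkpoint is overdue.

```python
def cycle_rate(s, t):
    """Rate s / t, with s and t as fractions of checkpoint_segments and
    checkpoint_timeout. At the end of a cycle this is the number to save as Ro."""
    return s / t


def target_progress(t, s, prev_rate):
    """Xlog-based target progress R * t for the in-progress checkpoint.

    t: elapsed time as a fraction of checkpoint_timeout
    s: elapsed xlog segments as a fraction of checkpoint_segments
    prev_rate: Ro, saved from the previous checkpoint cycle
    """
    if t <= 0.0:
        return 0.0                 # nothing elapsed yet, nothing due
    cur_rate = cycle_rate(s, t)    # Rn
    w = min(t, 1.0)                # assumption: clamp the weight when overdue
    rate = prev_rate * (1.0 - w) + cur_rate * w   # R
    return rate * t
```

For instance, with Ro = 1.0, a burst that consumes half of checkpoint_segments in the first 10% of checkpoint_timeout gives Rn = 5.0 but R = 1.4, so the target is 0.14 rather than the 0.5 you'd get from the raw segment count. cycle_rate(1.0, 0.8) and cycle_rate(0.25, 1.0) reproduce the 1.25 and 0.25 examples from the definition of Ro.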
I'm not convinced this is worth the effort, though. First of all, this
is only a problem if you use checkpoint_segments to control your
checkpoints, so you can lower checkpoint_timeout to do more work during
the idle periods. Secondly, with the optimization of not flushing
buffers during checkpoint that were dirtied after the start of
checkpoint, the LRU-sweep will also contribute to flushing the buffers
and finishing the checkpoint. We don't count them towards the progress
made ATM, but we probably should. Lastly, distributing the writes even a
little bit is going to be smoother than the current behavior anyway.