Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
>> Heikki Linnakangas wrote:
>>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>>> minutes there, and the graphs look very smooth. That suggests that
>>> spreading the writes over a longer time wouldn't make a difference, but
>>> smoothing the rush at the beginning of checkpoint might. I'm going to
>>> try the algorithm I posted, that uses the WAL consumption rate from
>>> previous checkpoint interval in the calculations.
>>
>> One thing that concerns me is that checkpoint smoothing happening just
>> after the checkpoint is causing I/O at the same time that
>> full_page_writes is causing additional I/O.
>
> I'm tempted to just apply some sort of nonlinear correction to the
> WAL-based progress measurement. Squaring it would be cheap but is
> probably too extreme. Carrying over info from the previous cycle
> doesn't seem like it would help much; rather, the point is exactly
> that we *don't* want a constant write speed during the checkpoint.

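
To make the nonlinear correction concrete, here is a minimal sketch. The
function name and the "strength" knob are assumptions for illustration only,
not anything in the patch; it simply blends between the raw WAL-based
progress fraction and its square, so that early WAL consumption counts for
less:

```c
/*
 * Hypothetical helper, not taken from the PostgreSQL source.
 *
 * wal_fraction: fraction of the checkpoint's WAL budget consumed so far,
 *               expected to be in [0, 1].
 * strength:     0.0 leaves the measurement unchanged (linear),
 *               1.0 squares it; values in between are milder corrections.
 */
double
corrected_wal_progress(double wal_fraction, double strength)
{
    /* Clamp the inputs so the result always stays within [0, 1]. */
    if (wal_fraction < 0.0)
        wal_fraction = 0.0;
    if (wal_fraction > 1.0)
        wal_fraction = 1.0;
    if (strength < 0.0)
        strength = 0.0;
    if (strength > 1.0)
        strength = 1.0;

    /* (1 - s) * x + s * x^2, written with a single multiplication by x. */
    return wal_fraction * ((1.0 - strength) + strength * wal_fraction);
}
```

With strength 1.0, half of the WAL budget consumed is reported as a quarter
of the way through; with strength 0.5 it is reported as 0.375. Whether any
such curve matches the real full_page_writes burst is exactly the open
question.
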
While thinking about this, I made an observation on full_page_writes.
Currently, we perform a full page write whenever LSN < RedoRecPtr. If
we're clever, we can skip or defer some of the full page writes:
The rule is that when we replay, we need to always replay a full page
image before we apply any regular WAL records on the page. When we begin
a checkpoint, there are two possible outcomes: we crash before the new
checkpoint is finished, and we replay starting from the previous redo
ptr, or we finish the checkpoint successfully, and we replay starting
from the new redo ptr (or we don't crash and don't need to recover).
To be able to recover from the previous redo ptr, we don't need to write
a full page image if we have already written one since the previous redo
ptr.

To be able to recover from the new redo ptr, we don't need to write a
full page image if we haven't flushed the page yet. It will be written
and fsync'd by the time the checkpoint finishes.
IOW, we can skip the full page image for any page that has already had one
taken since the previous checkpoint and that we haven't flushed yet
during the current checkpoint.
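
As a rough sketch of that rule (the struct, its field names and both
function names are made up for illustration; the real bookkeeping would live
in buffer and page header flags and is the hard part, which this ignores):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* stand-in for a WAL position */

/* Hypothetical per-page state, relative to the checkpoint in progress. */
typedef struct PageState
{
    Lsn  page_lsn;              /* LSN of the last WAL record applied to the page */
    bool fpi_since_prev_redo;   /* full page image logged since the previous redo ptr */
    bool flushed_in_this_ckpt;  /* page already written out by the current checkpoint */
} PageState;

/* Current behaviour: take a full page image whenever LSN < RedoRecPtr. */
bool
needs_fpi_current(const PageState *page, Lsn redo_rec_ptr)
{
    return page->page_lsn < redo_rec_ptr;
}

/*
 * Proposed behaviour: same trigger, but skip the image when replay is safe
 * from both the previous redo ptr (an image already exists since then) and
 * the new redo ptr (the checkpoint will still write and fsync the page).
 */
bool
needs_fpi_proposed(const PageState *page, Lsn redo_rec_ptr)
{
    if (!needs_fpi_current(page, redo_rec_ptr))
        return false;
    if (page->fpi_since_prev_redo && !page->flushed_in_this_ckpt)
        return false;
    return true;
}
```

A page that the checkpoint has already flushed still needs its image under
both rules; only the not-yet-flushed pages get the reprieve, which is what
spreads the images out over the checkpoint.
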
This might reduce the overall WAL I/O a little bit, but more
importantly, it spreads the impact of taking full page images over the
checkpoint duration. That's a good thing on its own, but it also makes
it unnecessary to compensate for the full_page_writes rush in the
checkpoint throttling code.

I'm still trying to get my head around the bookkeeping required to get
that right; I think it's possible using the new BM_CHECKPOINT_NEEDED
flag and a new flag in the page header to mark pages for which we skipped
taking the full page image when they were last modified.

For 8.3, we should probably just do some simple compensation in the
checkpoint throttling code, if we want to do anything at all. But this
is something to think about in the future.