Re: [PATCHES] Load Distributed Checkpoints, final patch

Heikki Linnakangas Tue, 03 Jul 2007 01:42:50 -0700

Tom Lane wrote:

Bruce Momjian <[EMAIL PROTECTED]> writes:
Heikki Linnakangas wrote:
For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
minutes there, and the graphs look very smooth. That suggests that
spreading the writes over a longer time wouldn't make a difference, but
smoothing the rush at the beginning of checkpoint might. I'm going to
try the algorithm I posted, that uses the WAL consumption rate from
previous checkpoint interval in the calculations.
One thing that concerns me is that checkpoint smoothing happening just
after the checkpoint is causing I/O at the same time that
full_page_writes is causing additional I/O.
I'm tempted to just apply some sort of nonlinear correction to the
WAL-based progress measurement.  Squaring it would be cheap but is
probably too extreme.  Carrying over info from the previous cycle
doesn't seem like it would help much; rather, the point is exactly
that we *don't* want a constant write speed during the checkpoint.
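
As a rough illustration of the nonlinear correction discussed in the
quoted text, here is a minimal sketch. The function names, the WAL
budget parameter and the default exponent of 1.5 are assumptions made
for the example, not anything in the patch; the thread only says that
squaring (exponent 2) would be cheap but probably too extreme.

```python
def wal_progress(wal_consumed: float, wal_budget: float) -> float:
    """Fraction of the WAL budget used since the checkpoint began, in [0, 1].

    Both arguments are hypothetical inputs in the same unit (e.g. bytes).
    """
    if wal_budget <= 0:
        raise ValueError("wal_budget must be positive")
    return min(max(wal_consumed / wal_budget, 0.0), 1.0)


def corrected_progress(progress: float, exponent: float = 1.5) -> float:
    """Apply a nonlinear correction to a progress fraction in [0, 1].

    exponent == 1.0 leaves the measurement linear; exponent == 2.0 is
    squaring. Any exponent above 1 lowers the target early in the
    checkpoint, when full page writes inflate WAL consumption, and
    still reaches 1.0 at the end.
    """
    return progress ** exponent
```

With a quarter of the WAL budget consumed, the linear target is 0.25,
the squared target is 0.0625, and an exponent of 1.5 gives 0.125, so
the checkpoint is asked to write less during the initial rush.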

While thinking about this, I made an observation on full_page_writes.
Currently, we perform a full page write whenever LSN < RedoRecPtr. If
we're clever, we can skip or defer some of the full page writes:

The rule is that when we replay, we need to always replay a full page
image before we apply any regular WAL records on the page. When we begin
a checkpoint, there are two possible outcomes: we crash before the new
checkpoint is finished, and we replay starting from the previous redo
ptr, or we finish the checkpoint successfully, and we replay starting
from the new redo ptr (or we don't crash and don't need to recover).
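
The replay rule can be written as a small check over a WAL stream. This
is only a sketch under assumptions: the record layout (lsn, page_id,
is_full_page_image) and the function name are made up for the example,
and it models the strict form of the rule, ignoring records that replay
would skip because the on-disk page already carries a newer LSN.

```python
def replay_is_safe(records, redo_ptr):
    """Return True if, for every page touched at or after redo_ptr,
    the first record replayed for that page is a full page image.

    records: iterable of (lsn, page_id, is_full_page_image) tuples in
    LSN order (a hypothetical layout used only for this sketch).
    """
    pages_with_image = set()
    for lsn, page_id, is_fpi in records:
        if lsn < redo_ptr:
            continue  # replay starts at redo_ptr; earlier records are not read
        if page_id in pages_with_image:
            continue  # an image was already replayed for this page
        if not is_fpi:
            return False  # regular record applied before any full page image
        pages_with_image.add(page_id)
    return True
```

Running the check once with the previous redo ptr and once with the new
one corresponds to the two crash outcomes described above.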

To be able to recover from the previous redo ptr, we don't need to write
a full page image if we have already written one since the previous redo
ptr.

To be able to recover from the new redo ptr, we don't need to write a
full page image, if we haven't flushed the page yet. It will be written
and fsync'd by the time the checkpoint finishes.

IOW, we can skip full page images of pages that we have already taken a
full page image of since the previous checkpoint, and that we haven't
flushed yet during the current checkpoint.
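
To make the two conditions concrete, here is a minimal model of the
decision. It is a sketch, not the real buffer manager code: the
PageState fields and function names are hypothetical, LSNs are plain
integers, and it leaves out the extra bookkeeping needed to remember
that an image was skipped once the page has been flushed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageState:
    lsn: int                     # LSN of the last WAL record for the page
    last_fpi_lsn: Optional[int]  # LSN of the latest full page image, if any
    flushed_in_checkpoint: bool  # True once the running checkpoint wrote it


def needs_fpi_current(page: PageState, redo_ptr: int) -> bool:
    """Current rule from the text: full page write whenever LSN < RedoRecPtr."""
    return page.lsn < redo_ptr


def can_skip_fpi(page: PageState, prev_redo_ptr: int) -> bool:
    """Both conditions must hold while a checkpoint is in progress:
    an image exists since the previous redo ptr (covers a crash before
    the checkpoint finishes), and the page is not flushed yet (the
    checkpoint will write and fsync it before it completes).
    """
    has_image = (page.last_fpi_lsn is not None
                 and page.last_fpi_lsn >= prev_redo_ptr)
    return has_image and not page.flushed_in_checkpoint


def needs_fpi_proposed(page: PageState, prev_redo_ptr: int,
                       new_redo_ptr: int) -> bool:
    """Proposed rule: the current rule, minus the skippable cases."""
    if not needs_fpi_current(page, new_redo_ptr):
        return False
    return not can_skip_fpi(page, prev_redo_ptr)
```

For example, with the previous redo ptr at 100 and the new one at 200, a
page with lsn=150 and an image at 120 needs no new image while it is
still unflushed, but does need one once the checkpoint has flushed it.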

This might reduce the overall WAL I/O a little bit, but more
importantly, it spreads the impact of taking full page images over the
checkpoint duration. That's a good thing on its own, but it also makes
it unnecessary to compensate for the full_page_writes rush in the
checkpoint smoothing.

I'm still trying to get my head around the bookkeeping required to get
that right; I think it's possible using the new BM_CHECKPOINT_NEEDED
flag and a new flag in the page header to mark pages for which we
skipped taking the full page image when they were last modified.

For 8.3, we should probably just do some simple compensation in the
checkpoint throttling code, if we want to do anything at all. But this
is something to think about in the future.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
