I just had an epiphany, I think.

As I wrote in the LDC discussion,
http://archives.postgresql.org/pgsql-patches/2007-06/msg00294.php
if the bgwriter's LRU-cleaning scan has advanced ahead of freelist.c's
clock sweep pointer, then any buffers between them are either clean,
or are pinned and/or have usage_count > 0 (in which case the bgwriter
wouldn't bother to clean them, and freelist.c wouldn't consider them
candidates for re-use).  And *this invariant is not destroyed by the
activities of other backends*.  A backend cannot dirty a page without
raising its usage_count above zero, and there are no race conditions,
because a buffer is pinned throughout the transition states.

This means that there is absolutely no point in having the bgwriter
re-start its LRU scan from the clock sweep position each time, as
it currently does.  Any pages it revisits are not going to need
cleaning.  We might as well have it progress forward from where it
stopped before.

In fact, the notion of the bgwriter's cleaning scan being "in front of"
the clock sweep is entirely backward.  It should try to be behind the
sweep, ie, so far ahead that it's lapped the clock sweep and is trailing
along right behind it, cleaning buffers immediately after their
usage_count falls to zero.  All the rest of the buffer arena is either
clean or has positive usage_count.

This means that we don't need the bgwriter_lru_percent parameter at all;
all we need is the lru_maxpages limit on how much I/O to initiate per
wakeup.  On each wakeup, the bgwriter always cleans until either it's
dumped lru_maxpages buffers, or it's caught up with the clock sweep.

There is a risk that if the clock sweep manages to lap the bgwriter,
the bgwriter would stop upon "catching up", when in reality there are
dirty pages everywhere.  This is easily prevented though, if we add
to the shared BufferStrategyControl struct a counter that is incremented
each time the clock sweep wraps around to buffer zero.  (Essentially
this counter stores the high-order bits of the sweep counter.)  The
bgwriter can then recognize having been lapped by comparing that counter
to its own similar counter.  If it does get lapped, it should advance
its work pointer to the current sweep pointer and try to get ahead
again.  (There's no point in continuing to clean pages behind the sweep
when those just ahead of it are dirty.)

This idea changes the terms of discussion for Itagaki-san's
automatic-adjustment-of-lru_maxpages patch.  I'm not sure we'd still
need it at all, as lru_maxpages would now be just an upper bound on the
desired I/O rate, rather than the target itself.  If we do still need
such a patch, it probably needs to look a lot different than it does
now.

Comments?

                        regards, tom lane
