On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> This essentially removes BgWriterDelay, but it's still mentioned in
> BgBufferSync(). Looking further, I see that with the patch applied,
> BgBufferSync() is still present in the source code but is no longer called
> from anywhere. Please don't submit patches that render things unused
> without actually removing them; it makes it much harder to see what you've
> changed. I realize you probably left it that way for testing purposes, but
> you need to clean such things up before submitting. Likewise, if you've
> rendered GUCs or statistics counters unused, you need to rip them out, so
> that the scope of the changes you've made is clear to reviewers.
I have kept it just for the reason that, if the basic approach
sounds reasonable and is accepted, I will clean it up. Sorry for
the inconvenience; I didn't realize it could be annoying for a
reviewer. I will remove all such code from the patch in the next version.
> A comparison of BgBufferSync() with
> BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least
> one behavior that some people (at least, me) will care about, which is the
> guarantee that the background writer will scan the entire buffer pool at
> least every couple of minutes.
Okay, I will take care of this based on the conclusion of
the other points in this mail.
> This is important because it guarantees that dirty data doesn't sit in
> memory forever. When the system becomes busy again after a long idle
> period, users will expect the system to have used the idle time to flush
> dirty buffers to disk. This also improves data recovery prospects if, for
> example, somebody loses their pg_xlog directory - there may be dirty
> buffers whose contents are lost, of course, but they won't be months old.
>> b. New stats for number of buffers on freelist has been added, some
>> old one's like maxwritten_clean can be removed as new logic for
>> syncing buffers and moving them to free list doesn't use them.
>> However I think it's better to remove them once the new logic is
>> accepted. Added some new logs for info related to free list under
> If I'm reading this right, the new statistic is an incrementing counter
> where, every time you update it, you add the number of buffers currently on
> the freelist. That makes no sense.
I think that by using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need the clock sweep, assuming the low and high water mark
thresholds are fixed. However, there can be cases where it is not
easy to estimate that number.
> I think what you should be counting is the number of allocations that are
> being satisfied from the free-list. Then, by comparing the rate at which
> that value is incrementing to the rate at which buffers_alloc is
> incrementing, somebody can figure out what percentage of allocations are
> requiring a clock-sweep run. Actually, I think it's better to flip it
> around: count the number of allocations that require an individual backend
> to run the clock sweep (vs. being satisfied from the free-list); call it,
> say, buffers_backend_clocksweep. We can then try to tune the patch to make
> that number as small as possible under varying workloads.
This can give us a clear idea for tuning the patch; however, we would
need to maintain three counters for it in the code (recent_alloc, which
is needed for the current bgwriter logic, plus the two you suggest). Do
you want to retain such counters in the code, or are they just for debug info?
>> d. Autotune the low and high threshold for freelist for various
> I think we need to come up with some kind of formula here rather than
> just a list of hard-coded constants.
That was my initial intention as well, and I tried basing the
thresholds on the number of shared buffers, e.g. keeping the threshold
values as a percentage of shared_buffers, but nothing could satisfy the
different kinds of workloads. The current values I have chosen are based
on experiments with various workloads at different thresholds; I have
shown the lwlock_stats data for various loads at the current
thresholds upthread. Another way could be to make them config
knobs and use the values given by the user when provided, else
go with the fixed values.
There are other instances in the code as well (one I remember
offhand is in pglz_compress) where we use fixed values based on
> And it definitely needs some comments explaining the logic behind the
Agreed; I shall improve them in the next version of the patch.
> Aside from those specific remarks, I think the elephant in the room is
> the question of whether it really makes sense to have one process which is
> responsible both for populating the free list and for writing buffers to
> disk. One problem, which I alluded to above under point (1), is that we
> might sometimes want to ensure that dirty buffers are written out to disk
> without decrementing usage counts or adding anything to the free list.
> This is a potentially solvable problem, though, because we can figure out
> the number of buffers that we need to scan for freelist population and the
> number that we need to scan for minimum buffer pool cleaning (one cycle
> every 2 minutes). Once we've met the first goal, any further buffers we
> run into under the second goal get cleaned if appropriate but their usage
> counts don't get pushed down nor do they get added to the freelist. Once
> we meet the second goal, we can go back to sleep.
> But the other problem, which I think is likely unsolvable, is that
> writing a dirty page can take a long time on a busy system (multiple
> seconds) and the freelist can be emptied much, much quicker than that
> (milliseconds). Although your benchmark results show great speed-ups on
> read-only workloads, we're not really going to get the benefit consistently
> on read-write workloads -- unless of course the background writer fails to
> actually write anything, which should be viewed as a bug, not a feature --
> because the freelist will often be empty while the background writer is
> blocked on I/O.
> I'm wondering if it would be a whole lot simpler and better to introduce
> a new background process, maybe with a name like bgreclaim.
That will certainly help in retaining the current behaviour of
bgwriter and makes the idea cleaner. I will modify the patch
to use a new background process unless somebody thinks
> That process wouldn't write dirty buffers.
If we go with this approach, one thing we need to decide
is what to do in case a buffer with usage_count zero is *dirty*,
as I don't think it is a good idea to put it on the freelist. A few
options to handle such a case are:
a. Skip such a buffer; the downside is that if we have to skip a lot
of buffers for this reason, then having a separate process
such as bgreclaim will be less advantageous.
b. Skip the buffer and notify bgwriter to flush buffers. The
notification can be sent either as soon as we encounter one
such buffer or only after a few of them (in which case we need to
decide on a useful number). With this option there is a chance that
bgwriter decides not to flush the buffer(s), which ideally should not
happen, because I think bgwriter considers the number of recent
allocations when deciding how far to scan for dirty buffers to flush.
c. Have some mechanism by which bgreclaim can ask bgwriter
to flush specific buffers. If we had such a mechanism,
it could later be used by backends as well, if required.
d. Keep the logic as per the current patch and improve it so that it
retains the one-cycle-per-two-minutes behaviour suggested above
by you, on the basis that in any case it is better than the current code.
I don't think option (d) is the best way to handle this scenario; however, I
kept it in case nothing else sounds reasonable. Option (c) might involve a
lot of work, which I am not sure is justifiable for the case at hand,
though it could be useful for other things. Option (a) should be okay
for most cases, but I think option (b) would be better.
> Instead, it would just run the clock sweep (i.e. the last loop inside
> StrategyGetBuffer) and put the buffers onto the free list.
Don't we need to do more than just the last loop inside
StrategyGetBuffer()? The clock sweep there is responsible for getting
one buffer with usage_count = 0, whereas we need to keep running the
loop until it finds and moves enough such buffers to populate the
freelist up to its high water mark.
> Then, we could leave the bgwriter logic more or less intact. It certainly
> needs improvement, but that could be another patch.
> Incidentally, while I generally think your changes to the locking regimen
> in StrategyGetBuffer() are going in the right direction, they need
> significant cleanup. Your patch adds two new spinlocks, freelist_lck and
> victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
> you've now got StrategyGetBuffer() running with no lock at all when
> accessing some things that used to be protected by BufFreelistLock;
> specifically, you're doing StrategyControl->numBufferAllocs++ and
> SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
I kept them outside the spinlock because, as per the patch, the only
call site that sets StrategyControl->bgwriterLatch is StrategyGetBuffer(),
and StrategyControl->numBufferAllocs is used just for statistics purposes
(which I thought might be okay even if it is not perfectly accurate),
whereas without the patch it is used by bgwriter for purposes other than
stats as well. However, it certainly needs to be protected for the
separate-bgreclaim-process idea, or for retaining the current bgwriter
behaviour.
> I think you should get rid of BufFreelistLock completely and just decide
> that freelist_lck will protect the freeNext links, plus
> everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
> will protect nextVictimBuffer and nothing else.
> Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
> the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
> save the values of StrategyControl->bgwriterLatch; pop a buffer off the
> freelist if there is one, saving its identity. Release the spinlock.
> Then, set the bgwriterLatch if needed. In the first loop, first check
> whether the buffer we previously popped from the freelist is pinned or has
> a non-zero usage count and return it if not, holding the buffer header
> lock. Otherwise, reacquire the spinlock just long enough to pop a new
> potential victim and then loop around.
I shall take care of doing it this way in the next version of the patch.
> Under this locking strategy, StrategyNotifyBgWriter would use
> freelist_lck. Right now, the patch removes the only caller, and should
> therefore remove the function as well, but if we go with the new-process
> idea listed above that part would get reverted, and then you'd need to make
> it use the correct spinlock. You should also go through this patch and
> remove all the commented-out bits and pieces that you haven't cleaned up;
> those are distracting and unhelpful.
Thank you for review.