Hello,

In conversations [1] recently about considering how best to adapt the code to 
become NUMA-aware Andres commented, "FWIW, I've started to wonder if we 
shouldn't just get rid of the freelist entirely" and because I'm a glutton for 
punishment (and I think this idea has some merit) I took him up on this task.

In freelist.c the function StrategyGetBuffer() currently tries first to use the 
BufferAccessStrategy, if present, via GetBufferFromRing().  Failing that, the 
second step is to check the "freelist" as defined by 
StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains any 
available buffers, and finally it will "Use the clock-sweep algorithm to find a 
free buffer."  The freelist was intended to be "a list of buffers that are 
prime candidates for replacement" but I question the value of that versus the 
overhead of managing it.  Without the list some operations are likely (I plan 
to measure this) faster due, as Anders points out in [1], "just using clock 
sweep actually makes things like DROP TABLE perform better because it doesn't 
need to maintain the freelist anymore."  It may be the case that with very 
large NBuffers where most are in use that this approach is slower (I plan to 
measure this too), but in those cases I'd imagine the freelist is likely empty 
and so the code will use the clock-sweep algorithm anyway, so I'm not sure 
there is a performance penalty at all.

This change does remove the have_free_buffer() function used by the 
contrib/pg_prewarm module.  On the surface this doesn't seem to cause any 
issues, but honestly I've not thought too deeply on this one.

v2-0001 Eliminate the freelist from the buffer manager and depend on 
clock-sweep.


Once removed [2] and tests passing [3] I took a long hard look at the 
buffer_strategy_lock that used to serialize concurrent access to members of 
BufferStrategyControl and I couldn't find a good reason to keep it around.  
Let's review what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps 
around.  StrategySyncStart() provides this to the bgwriter which in turn uses 
it to compute a strategic location at which to start scanning for pages to 
evict.  There's an interesting comment that indicates both a "requirement" and 
an equivocal "but that's highly unlikely and wouldn't be particularly harmful" 
statement conflicting with itself.  I tried to find a reason that 
nextVictimBuffers could overflow or that the use of the completePasses value 
could somehow cause harm if off by one or more in the bgwriter and either I 
missed it (please tell me) or there isn't one.  However, it does make sense to 
change completePasses into an atomic value so that it is consistent across 
backends and in the bgwriter.

bgwprocno: when not -1 is the PID of the allocation notification latch 
(ProcGlobal->allProcs[bgwprocno].procLatch).  This is a "power savings" feature 
where the goal is to signal the bgwriter "When a backend starts using buffers 
again, it will wake us up by setting our latch."  Here the code reads, "Since 
we don't want to rely on a spinlock for this we force a read from shared memory 
once, and then set the latch based on that value." and uses INT_ACCESS_ONCE() 
to read the value and set the latch.  The function StrategyNotifyBgWriter() is 
where bgwprocno is set, I see no reason to use atomic or other synchronization 
here.

And that's all there is to it now that the freelist is gone.  As a result, IMO 
it seems unnecessary to require a spin lock for access to BufferStrategyControl.

v2-0002 Remove the buffer_strategy_lock.


This attached patch is also a branch on GitHub [2] if that is of interest or 
helpful, it passes check world [3] (I use the GitHub PR to trigger CirrusCI 
tests, not as the way to convey the change set).

I also made a few minor changes such that we're consistently referring to 
"clock-sweep" (not "clock sweep" or "clocksweep"), I'm not wedded to those but 
consistency isn't a bad thing, right?

As an aside, we're really implementing "generalized CLOCK" [4][5] which uses 
counters rather a single bit as pointed out [6] in the CLOCK-Pro [7] paper, but 
I digress.

I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on 
attempting what I saw Michael do with flamegraphs before/after/delta in a 
follow up to this.  If people feel this has merit I'll add it to CF/2.

thank you for your time considering this idea,

-greg

[1] 
https://postgr.es/m/ndvygkpdx44pmi4xbkf52gfrl77cohpefr42tipvd5dgiaeuyd%40fe2og2kxyjnc
[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v2 
[3] https://github.com/gburd/postgres/pull/4/checks
[4] V. F. Nicola, A. Dan, and D. M. Dias, "Analysis of the Generalized Clock 
Buffer Replacement Scheme for Database Transaction Processing", Proceeding of 
1992 ACM SIGMETRICS Conference, June 1992, pp. 35-46.
[5] A. J. Smith, "Sequentiality and Prefetching in Database Systems", ACM 
Trans. on Database Systems, Vol. 3, No. 3, 1978, pp. 223-247.
[6] "In a generalized CLOCK version called GCLOCK [25,17], a counter is 
associated with each page rather than a single bit. Its counter will be 
incremented if a page is hit. The cycling clock hand sweeps over the pages 
decrementing their counters until a page whose counter is zero is found for 
replacement."
[7] CLOCK-Pro: An Effective Improvement of the CLOCK Replacement 
https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang.pdf

Attachment: v2-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patch
Description: Binary data

Attachment: v2-0002-Remove-the-buffer_strategy_lock.patch
Description: Binary data

Reply via email to