bufmgr: Implement buffer content locks independently of lwlocks

Until now buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate efficient implementation of
locks. However, the time for a dedicated buffer content lock implementation
has come:

1) Hint bits are currently set while holding only a share lock. This leads to
   having to copy pages while they are being written out if checksums are
   enabled, which is not cheap. We would like to add AIO writes, however once
   many buffers can be written out at the same time, it gets a lot more
   expensive to copy them, particularly because that copy needs to reside in
   shared buffers (for worker mode to have access to the buffer).

   In addition, modifying buffers while they are being written out can cause
   issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not
   like that, due to filesystem internal checksums getting corrupted.

   The solution to this is to require a new share-exclusive lock-level to set
   hint bits and to write out buffers, making those operations mutually
   exclusive. We could introduce such a lock-level into the generic lwlock
   implementation, however it does not look like there would be other users,
   and it does add some overhead into important code paths.

2) For AIO writes we need to be able to race-freely check whether a buffer is
   undergoing IO and whether an exclusive lock on the page can be acquired. That
   is rather hard to do efficiently when the buffer state and the lock state
   are separate atomic variables. This is a major hindrance to allowing writes
   to be done asynchronously.

3) Buffer locks are by far the most frequently taken locks. Optimizing them
   specifically for their use case is worth the effort. E.g. by merging
   content locks into buffer locks we will be able to release a buffer lock
   and pin in one atomic operation.

4) There are more complicated optimizations, like long-lived "super pinned &
   locked" pages, that cannot realistically be implemented with the generic
   lwlock implementation.

Therefore implement content locks inside bufmgr.c. The lockstate is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:

1) An additional lock-level share-exclusive has been added. This lock-level
   conflicts with exclusive locks and itself, but not share locks.

2) Error recovery for content locks is implemented as part of the already
   existing private-refcount tracking mechanism in combination with resowners,
   instead of a bespoke mechanism as the case for lwlocks. This means we do
   not need to add dedicated error-recovery code paths to release all content
   locks (like done with LWLockReleaseAll() for lwlocks).

3) The lock state is embedded in BufferDesc.state instead of having its own
   struct.

4) The wakeup logic is a tad more complicated due to needing to support the
   additional lock-level

This commit unfortunately introduces some code that is very similar to the
code in lwlock.c, however the code is not equivalent enough to easily merge
it. The future wins that this commit makes possible seem worth the cost.

As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.

It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better
than duplicating the relevant infrastructure.

Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b740). The old BufferContent lwlock
tranche has been removed.

Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Greg Burd <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: 
https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/fcb9c977aa5f1eefe7444e423e833ff64a5d1d8f

Modified Files
--------------
src/backend/storage/buffer/buf_init.c           |   5 +-
src/backend/storage/buffer/bufmgr.c             | 913 ++++++++++++++++++++++--
src/backend/utils/activity/wait_event_names.txt |   4 +-
src/include/storage/buf_internals.h             |  73 +-
src/include/storage/bufmgr.h                    |  32 +-
src/include/storage/lwlocklist.h                |   1 -
src/include/storage/proc.h                      |   8 +-
7 files changed, 931 insertions(+), 105 deletions(-)

Reply via email to