Require share-exclusive lock to set hint bits and to flush

At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb9, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic to implement AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It also just is plain odd to allow modifications of buffers that are just
share locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa5) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: 
https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: 
https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/82467f627bd478569de04f4a3f1993098e80c812

Modified Files
--------------
src/backend/access/gist/gistget.c           |  20 +-
src/backend/access/hash/hashutil.c          |  14 +-
src/backend/access/heap/heapam_visibility.c | 130 +++++++--
src/backend/access/nbtree/nbtinsert.c       |  31 ++-
src/backend/access/nbtree/nbtutils.c        |  14 +-
src/backend/access/transam/xloginsert.c     |  11 +-
src/backend/storage/buffer/README           |  66 +++--
src/backend/storage/buffer/bufmgr.c         | 396 +++++++++++++++++++++-------
src/backend/storage/freespace/freespace.c   |  14 +-
src/backend/storage/freespace/fsmpage.c     |  11 +-
src/include/storage/bufmgr.h                |   4 +
src/tools/pgindent/typedefs.list            |   1 +
12 files changed, 528 insertions(+), 184 deletions(-)

Reply via email to