Require share-exclusive lock to set hint bits and to flush At the moment hint bits can be set with just a share lock on a page (and, until 45f658dacb9, in one case even without any lock). Because of this we need to copy pages while writing them out, as otherwise the checksum could be corrupted.
The need to copy the page is problematic to implement AIO writes: 1) Instead of just needing a single buffer for a copied page we need one for each page that's potentially undergoing I/O 2) To be able to use the "worker" AIO implementation the copied page needs to reside in shared memory It also causes problems for using unbuffered/direct-IO, independent of AIO: Some filesystems, raid implementations, ... do not tolerate the data being written out to change during the write. E.g. they may compute internal checksums that can be invalidated by concurrent modifications, leading e.g. to filesystem errors (as the case with btrfs). It also just is plain odd to allow modifications of buffers that are just share locked. To address these issues, this commit changes the rules so that modifications to pages are not allowed anymore while holding a share lock. Instead the new share-exclusive lock (introduced in fcb9c977aa5) allows at most one backend to modify a buffer while other backends have the same page share locked. An existing share-lock can be upgraded to a share-exclusive lock, if there are no conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits() and BufferSetHintBits16() have been introduced. To prevent hint bits from being set while the buffer is being written out, writing out buffers now requires a share-exclusive lock. The use of share-exclusive to gate setting hint bits means that from now on only one backend can set hint bits at a time. To allow multiple backends to set hint bits would require more complicated locking: For setting hint bits we'd need to store the count of backends currently setting hint bits and we would need another lock-level for I/O conflicting with the lock-level to set hint bits. Given that the share-exclusive lock for setting hint bits is only held for a short time, that backends would often just set the same hint bits and that the cost of occasionally not setting hint bits in hotly accessed pages is fairly low, this seems like an acceptable tradeoff. The biggest change to adapt to this is in heapam. To avoid performance regressions for sequential scans that need to set a lot of hint bits, we need to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint bits on a page have been set. Conversely, to avoid regressions in cases where we can't set hint bits in bulk (because we're looking only at individual tuples), use BufferSetHintBits16() when setting hint bits without batching. Several other places also need to be adapted, but those changes are comparatively simpler. After this we do not need to copy buffers to write them out anymore. That change is done separately however. Reviewed-by: Melanie Plageman <[email protected]> Reviewed-by: Heikki Linnakangas <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/82467f627bd478569de04f4a3f1993098e80c812 Modified Files -------------- src/backend/access/gist/gistget.c | 20 +- src/backend/access/hash/hashutil.c | 14 +- src/backend/access/heap/heapam_visibility.c | 130 +++++++-- src/backend/access/nbtree/nbtinsert.c | 31 ++- src/backend/access/nbtree/nbtutils.c | 14 +- src/backend/access/transam/xloginsert.c | 11 +- src/backend/storage/buffer/README | 66 +++-- src/backend/storage/buffer/bufmgr.c | 396 +++++++++++++++++++++------- src/backend/storage/freespace/freespace.c | 14 +- src/backend/storage/freespace/fsmpage.c | 11 +- src/include/storage/bufmgr.h | 4 + src/tools/pgindent/typedefs.list | 1 + 12 files changed, 528 insertions(+), 184 deletions(-)
