On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinn...@iki.fi> wrote: > LGTM. I think this 0001 patch is ready for commit, independently of the > rest of the patches.
Done. > In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h: > > > +/* Filename components */ > > +#define PG_TEMP_FILES_DIR "pgsql_tmp" > > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp" > > + > > These seem out of place, we already have them in common/file_utils.h. Yeah, they moved from there in f39b2658 and I messed up the rebase. Fixed. > Other than that, > v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and > v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look > good to me. One thing I wasn't 100% happy with was the treatment of ENOSPC. A few callers believe that short writes set errno: they error out with a message including %m. We have historically set errno = ENOSPC inside FileWrite() if the write size was unexpectedly small AND the kernel didn't set errno to a non-zero value (having set it to zero ourselves earlier). In FileWriteV(), I didn't want to do that because it is expensive to compute the total write size from the vector array and we managed to measure an effect due to that in some workloads. Note that the smgr patch actually handles short writes by continuing, instead of raising an error. Short writes do already occur in the wild on various systems for various rare technical reasons other than ENOSPC I have heard (imagine transient failure to acquire some temporary memory that the kernel chooses not to wait for, stuff like that, though certainly many people and programs believe they should not happen[1]), and it seems like a good idea to actually handle them as our write sizes increase and the probability of short writes might presumably increase. With the previous version of the patch, we'd have to change a couple of other callers not to believe that short writes are errors and set errno (callers are inconsistent on this point). I don't really love that we have "fake" system errors but I also want to stay focused here, so in this new version V3 I tried a new approach: I realised I can just always set errno without needing the total size, so that (undocumented) aspect of the interface doesn't change. The point being that it doesn't matter if you clobber errno with a bogus value when the write was non-short. Thoughts? [1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
From 7808d7241088f8dc02d7d828dc4ee0227e8e2a01 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Wed, 19 Jul 2023 12:32:51 +1200 Subject: [PATCH v3 1/7] Provide vectored variants of FileRead() and FileWrite(). FileReadV() and FileWriteV() adapt preadv() and pwritev() for PostgreSQL's virtual file descriptors. The traditional FileRead() and FileWrite() functions are implemented in terms of the new functions. Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi> Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com --- src/backend/storage/file/fd.c | 33 +++++++++++++++++++++++---------- src/include/storage/fd.h | 30 ++++++++++++++++++++++++++++-- 2 files changed, 51 insertions(+), 12 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index f691ba0932..aa89c755b3 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) } int -FileRead(File file, void *buffer, size_t amount, off_t offset, - uint32 wait_event_info) +FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset, + uint32 wait_event_info) { int returnCode; Vfd *vfdP; @@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset, retry: pgstat_report_wait_start(wait_event_info); - returnCode = pg_pread(vfdP->fd, buffer, amount, offset); + returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset); pgstat_report_wait_end(); if (returnCode < 0) @@ -2166,8 +2166,8 @@ retry: } int -FileWrite(File file, const void *buffer, size_t amount, off_t offset, - uint32 wait_event_info) +FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset, + uint32 wait_event_info) { int returnCode; Vfd *vfdP; @@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset, */ if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT)) { - off_t past_write = offset + amount; + size_t size = 0; + off_t past_write; + + /* Compute the total transfer size. */ + for (int i = 0; i < iovcnt; ++i) + size += iov[i].iov_len; + + past_write = offset + size; if (past_write > vfdP->fileSize) { @@ -2213,11 +2220,17 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset, retry: errno = 0; pgstat_report_wait_start(wait_event_info); - returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset); + returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset); pgstat_report_wait_end(); - /* if write didn't set errno, assume problem is no disk space */ - if (returnCode != amount && errno == 0) + /* + * Some callers expect short writes to set errno, even though that isn't + * the behavior of the underlying system calls. We don't want to waste + * CPU cycles adding up the total write size here, so we'll just set it to + * ENOSPC even on success, if the syscall didn't set it. On successful + * non-short write, no caller should consult errno. + */ + if (errno == 0) errno = ENOSPC; if (returnCode >= 0) @@ -2227,7 +2240,7 @@ retry: */ if (vfdP->fdstate & FD_TEMP_FILE_LIMIT) { - off_t past_write = offset + amount; + off_t past_write = offset + returnCode; if (past_write > vfdP->fileSize) { diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index d9d5d9da5f..a1d6056d9d 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -43,6 +43,8 @@ #ifndef FD_H #define FD_H +#include "port/pg_iovec.h" + #include <dirent.h> #include <fcntl.h> @@ -105,8 +107,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil extern File OpenTemporaryFile(bool interXact); extern void FileClose(File file); extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info); -extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info); -extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info); +extern int FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info); +extern int FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info); extern int FileSync(File file, uint32 wait_event_info); extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info); extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info); @@ -189,4 +191,28 @@ extern int durable_unlink(const char *fname, int elevel); extern void SyncDataDirectory(void); extern int data_sync_elevel(int elevel); +static inline int +FileRead(File file, void *buffer, size_t amount, off_t offset, + uint32 wait_event_info) +{ + struct iovec iov = { + .iov_base = buffer, + .iov_len = amount + }; + + return FileReadV(file, &iov, 1, offset, wait_event_info); +} + +static inline int +FileWrite(File file, const void *buffer, size_t amount, off_t offset, + uint32 wait_event_info) +{ + struct iovec iov = { + .iov_base = unconstify(void *, buffer), + .iov_len = amount + }; + + return FileWriteV(file, &iov, 1, offset, wait_event_info); +} + #endif /* FD_H */ -- 2.42.1
From 5673ae46baa2cf8fa006f04462b9a60004873c65 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Wed, 19 Jul 2023 14:50:41 +1200 Subject: [PATCH v3 2/7] Provide vectored variants of smgrread() and smgrwrite(). smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and thence the OS's vector I/O routines, after dealing with segmentation. These functions allow for contiguous regions of relation files to be transferred into and out of non-neighboring buffers with a single system call. The traditional smgrread() and smgrwrite() functions are implemented in terms of the new functions. The md.c implementation automatically handles short transfers by retrying. They were probably always possible even when we were only doing 8kB reads and write, but with larger transfers they might become more likely. Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi> Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com --- src/backend/storage/smgr/md.c | 353 +++++++++++++++++++++++--------- src/backend/storage/smgr/smgr.c | 37 ++-- src/include/storage/fd.h | 2 +- src/include/storage/md.h | 9 +- src/include/storage/smgr.h | 25 ++- 5 files changed, 304 insertions(+), 122 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index fdecbad170..d1383220bd 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -452,7 +452,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo) /* * mdextend() -- Add a block to the specified relation. * - * The semantics are nearly the same as mdwrite(): write at the + * The semantics are nearly the same as mdwritev(): write at the * specified position. However, this is to be used for the case of * extending a relation (i.e., blocknum is at or beyond the current * EOF). Note that we assume writing a block beyond current EOF @@ -737,136 +737,297 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) } /* - * mdread() -- Read the specified block from a relation. + * Convert an array of buffer address into an array of iovec objects, and + * return the number that were required. 'iov' must have enough space for up + * to PG_IOV_MAX elements. */ -void -mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - void *buffer) +static int +buffers_to_iov(struct iovec *iov, void **buffers, int nblocks) { - off_t seekpos; - int nbytes; - MdfdVec *v; + struct iovec *iovp; + int iovcnt; - /* If this build supports direct I/O, the buffer must be I/O aligned. */ - if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ) - Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); + Assert(nblocks >= 1); + Assert(nblocks <= PG_IOV_MAX); - TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum, - reln->smgr_rlocator.locator.spcOid, - reln->smgr_rlocator.locator.dbOid, - reln->smgr_rlocator.locator.relNumber, - reln->smgr_rlocator.backend); + /* If this build supports direct I/O, buffers must be I/O aligned. */ + for (int i = 0; i < nblocks; ++i) + { + if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ) + Assert((uintptr_t) buffers[i] == + TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i])); + } - v = _mdfd_getseg(reln, forknum, blocknum, false, - EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); + /* Start the first iovec off with the first buffer. */ + iovp = &iov[0]; + iovp->iov_base = buffers[0]; + iovp->iov_len = BLCKSZ; + iovcnt = 1; - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + /* Merge in the rest. */ + for (int i = 1; i < nblocks; ++i) + { + void *buffer = buffers[i]; - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + if (((char *) iovp->iov_base + iovp->iov_len) == buffer) + { + /* Contiguous with the last iovec. */ + iovp->iov_len += BLCKSZ; + } + else + { + /* Need a new iovec. */ + iovp++; + iovp->iov_base = buffer; + iovp->iov_len = BLCKSZ; + iovcnt++; + } + } + + return iovcnt; +} - nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ); +/* + * To continue a vector I/O after a short transfer, we need a way to remove + * the whole or partial iovecs that have already been transferred. We assume + * that partial transfers with direct I/O will happen on suitable alignment + * boundaries, so that we don't have to make further adjustments for that. + */ +static int +adjust_iovec_for_partial_transfer(struct iovec *iov, + int iovcnt, + size_t transferred) +{ + struct iovec *keep = iov; - TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum, - reln->smgr_rlocator.locator.spcOid, - reln->smgr_rlocator.locator.dbOid, - reln->smgr_rlocator.locator.relNumber, - reln->smgr_rlocator.backend, - nbytes, - BLCKSZ); + Assert(iovcnt > 0); - if (nbytes != BLCKSZ) + /* Remove wholly transferred iovecs. */ + while (keep->iov_len <= transferred) { - if (nbytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read block %u in file \"%s\": %m", - blocknum, FilePathName(v->mdfd_vfd)))); + transferred -= keep->iov_len; + keep++; + iovcnt--; + } + if (keep != iov) + memmove(iov, keep, sizeof(*keep) * iovcnt); - /* - * Short read: we are at or past EOF, or we read a partial block at - * EOF. Normally this is an error; upper levels should never try to - * read a nonexistent block. However, if zero_damaged_pages is ON or - * we are InRecovery, we should instead return zeroes without - * complaining. This allows, for example, the case of trying to - * update a block that was later truncated away. - */ - if (zero_damaged_pages || InRecovery) - MemSet(buffer, 0, BLCKSZ); - else - ereport(ERROR, - (errcode(ERRCODE_DATA_CORRUPTED), - errmsg("could not read block %u in file \"%s\": read only %d of %d bytes", - blocknum, FilePathName(v->mdfd_vfd), - nbytes, BLCKSZ))); + /* Shouldn't be called with 'transferred' >= total size. */ + Assert(iovcnt > 0); + + /* Adjust leading iovec, which may have been partially transferred. */ + Assert(iov->iov_len > transferred); + iov->iov_base = (char *) iov->iov_base + transferred; + iov->iov_len -= transferred; + + return iovcnt; +} + +/* + * mdreadv() -- Read the specified blocks from a relation. + */ +void +mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + void **buffers, BlockNumber nblocks) +{ + while (nblocks > 0) + { + struct iovec iov[PG_IOV_MAX]; + int iovcnt; + off_t seekpos; + int nbytes; + MdfdVec *v; + BlockNumber nblocks_this_segment; + size_t transferred_this_segment; + size_t size_this_segment; + + v = _mdfd_getseg(reln, forknum, blocknum, false, + EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); + + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + + Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + + nblocks_this_segment = + Min(nblocks, + RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE))); + nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov)); + + iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment); + size_this_segment = nblocks_this_segment * BLCKSZ; + transferred_this_segment = 0; + + /* Inner loop to continue after a short reads. */ + for (;;) + { + TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum, + reln->smgr_rlocator.locator.spcOid, + reln->smgr_rlocator.locator.dbOid, + reln->smgr_rlocator.locator.relNumber, + reln->smgr_rlocator.backend); + nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos, + WAIT_EVENT_DATA_FILE_READ); + TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum, + reln->smgr_rlocator.locator.spcOid, + reln->smgr_rlocator.locator.dbOid, + reln->smgr_rlocator.locator.relNumber, + reln->smgr_rlocator.backend, + nbytes, + BLCKSZ); + + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read %u blocks at block %u in file \"%s\": %m", + nblocks_this_segment, blocknum, + FilePathName(v->mdfd_vfd)))); + + if (nbytes == 0) + { + /* + * We are at or past EOF, or we read a partial block at EOF. + * Normally this is an error; upper levels should never try to + * read a nonexistent block. However, if zero_damaged_pages + * is ON or we are InRecovery, we should instead return zeroes + * without complaining. This allows, for example, the case of + * trying to update a block that was later truncated away. + */ + if (zero_damaged_pages || InRecovery) + { + for (BlockNumber i = transferred_this_segment / BLCKSZ; + i < nblocks_this_segment; + ++i) + memset(buffers[i], 0, BLCKSZ); + break; + } + else + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("could not read %u blocks at block %u in file \"%s\": read only %zu of %zu bytes", + nblocks_this_segment, + blocknum, + FilePathName(v->mdfd_vfd), + transferred_this_segment, + size_this_segment))); + } + + /* One loop should usually be enough. */ + transferred_this_segment += nbytes; + Assert(transferred_this_segment <= size_this_segment); + if (transferred_this_segment == size_this_segment) + break; + + /* Adjust position and vectors after a short transfer. */ + seekpos += nbytes; + iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes); + } + + nblocks -= nblocks_this_segment; + buffers += nblocks_this_segment; + blocknum += nblocks_this_segment; } } /* - * mdwrite() -- Write the supplied block at the appropriate location. + * mdwritev() -- Write the supplied blocks at the appropriate location. * * This is to be used only for updating already-existing blocks of a * relation (ie, those before the current EOF). To extend a relation, * use mdextend(). */ void -mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - const void *buffer, bool skipFsync) +mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + const void **buffers, BlockNumber nblocks, bool skipFsync) { - off_t seekpos; - int nbytes; - MdfdVec *v; - - /* If this build supports direct I/O, the buffer must be I/O aligned. */ - if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ) - Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); - /* This assert is too expensive to have on normally ... */ #ifdef CHECK_WRITE_VS_EXTEND Assert(blocknum < mdnblocks(reln, forknum)); #endif - TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum, - reln->smgr_rlocator.locator.spcOid, - reln->smgr_rlocator.locator.dbOid, - reln->smgr_rlocator.locator.relNumber, - reln->smgr_rlocator.backend); + while (nblocks > 0) + { + struct iovec iov[PG_IOV_MAX]; + int iovcnt; + off_t seekpos; + int nbytes; + MdfdVec *v; + BlockNumber nblocks_this_segment; + size_t transferred_this_segment; + size_t size_this_segment; - v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, - EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); + v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, + EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); - nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE); + nblocks_this_segment = + Min(nblocks, + RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE))); + nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov)); - TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum, - reln->smgr_rlocator.locator.spcOid, - reln->smgr_rlocator.locator.dbOid, - reln->smgr_rlocator.locator.relNumber, - reln->smgr_rlocator.backend, - nbytes, - BLCKSZ); + iovcnt = buffers_to_iov(iov, (void **) buffers, nblocks_this_segment); + size_this_segment = nblocks_this_segment * BLCKSZ; + transferred_this_segment = 0; - if (nbytes != BLCKSZ) - { - if (nbytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write block %u in file \"%s\": %m", - blocknum, FilePathName(v->mdfd_vfd)))); - /* short write: complain appropriately */ - ereport(ERROR, - (errcode(ERRCODE_DISK_FULL), - errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes", - blocknum, - FilePathName(v->mdfd_vfd), - nbytes, BLCKSZ), - errhint("Check free disk space."))); - } + /* Inner loop to continue after a short writes. */ + for (;;) + { + TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum, + reln->smgr_rlocator.locator.spcOid, + reln->smgr_rlocator.locator.dbOid, + reln->smgr_rlocator.locator.relNumber, + reln->smgr_rlocator.backend); + nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos, + WAIT_EVENT_DATA_FILE_WRITE); + TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum, + reln->smgr_rlocator.locator.spcOid, + reln->smgr_rlocator.locator.dbOid, + reln->smgr_rlocator.locator.relNumber, + reln->smgr_rlocator.backend, + nbytes, + BLCKSZ); + + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write %u blocks at block %u in file \"%s\": %m", + nblocks_this_segment, blocknum, + FilePathName(v->mdfd_vfd)))); - if (!skipFsync && !SmgrIsTemp(reln)) - register_dirty_segment(reln, forknum, v); + if (nbytes == 0) + { + /* short write: complain appropriately */ + ereport(ERROR, + (errcode(ERRCODE_DISK_FULL), + errmsg("could not write %u blocks at block %u in file \"%s\": wrote only %zu of %zu bytes", + nblocks_this_segment, + blocknum, + FilePathName(v->mdfd_vfd), + transferred_this_segment, + size_this_segment), + errhint("Check free disk space."))); + } + + /* One loop should usually be enough. */ + transferred_this_segment += nbytes; + Assert(transferred_this_segment <= size_this_segment); + if (transferred_this_segment == size_this_segment) + break; + + /* Adjust position and vectors after a short transfer. */ + seekpos += nbytes; + iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes); + } + + if (!skipFsync && !SmgrIsTemp(reln)) + register_dirty_segment(reln, forknum, v); + + nblocks -= nblocks_this_segment; + buffers += nblocks_this_segment; + blocknum += nblocks_this_segment; + } } /* diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 5d0f3d515c..b46ac3cb09 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -55,10 +55,13 @@ typedef struct f_smgr BlockNumber blocknum, int nblocks, bool skipFsync); bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum); - void (*smgr_read) (SMgrRelation reln, ForkNumber forknum, - BlockNumber blocknum, void *buffer); - void (*smgr_write) (SMgrRelation reln, ForkNumber forknum, - BlockNumber blocknum, const void *buffer, bool skipFsync); + void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, + void **buffers, BlockNumber nblocks); + void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, + const void **buffers, BlockNumber nblocks, + bool skipFsync); void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, BlockNumber nblocks); BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum); @@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = { .smgr_extend = mdextend, .smgr_zeroextend = mdzeroextend, .smgr_prefetch = mdprefetch, - .smgr_read = mdread, - .smgr_write = mdwrite, + .smgr_readv = mdreadv, + .smgr_writev = mdwritev, .smgr_writeback = mdwriteback, .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, @@ -551,22 +554,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) } /* - * smgrread() -- read a particular block from a relation into the supplied - * buffer. + * smgrreadv() -- read a particular block range from a relation into the + * supplied buffers. * * This routine is called from the buffer manager in order to * instantiate pages in the shared buffer cache. All storage managers * return pages in the format that POSTGRES expects. */ void -smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - void *buffer) +smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + void **buffers, BlockNumber nblocks) { - smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer); + smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers, + nblocks); } /* - * smgrwrite() -- Write the supplied buffer out. + * smgrwritev() -- Write the supplied buffers out. * * This is to be used only for updating already-existing blocks of a * relation (ie, those before the current EOF). To extend a relation, @@ -581,14 +585,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, * do not require fsync. */ void -smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - const void *buffer, bool skipFsync) +smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + const void **buffers, BlockNumber nblocks, bool skipFsync) { - smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum, - buffer, skipFsync); + smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum, + buffers, nblocks, skipFsync); } - /* * smgrwriteback() -- Trigger kernel writeback for the supplied range of * blocks. diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index a1d6056d9d..7025af618a 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -208,7 +208,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info) { struct iovec iov = { - .iov_base = unconstify(void *, buffer), + .iov_base = (void *) buffer, .iov_len = amount }; diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 941879ee6a..0f73d9af49 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks, bool skipFsync); extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum); -extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - void *buffer); -extern void mdwrite(SMgrRelation reln, ForkNumber forknum, - BlockNumber blocknum, const void *buffer, bool skipFsync); +extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + void **buffers, BlockNumber nblocks); +extern void mdwritev(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, + const void **buffers, BlockNumber nblocks, bool skipFsync); extern void mdwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, BlockNumber nblocks); extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a9a179aaba..ae93910132 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks, bool skipFsync); extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum); -extern void smgrread(SMgrRelation reln, ForkNumber forknum, - BlockNumber blocknum, void *buffer); -extern void smgrwrite(SMgrRelation reln, ForkNumber forknum, - BlockNumber blocknum, const void *buffer, bool skipFsync); +extern void smgrreadv(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, + void **buffer, BlockNumber nblocks); +extern void smgrwritev(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum, + const void **buffer, BlockNumber nblocks, + bool skipFsync); extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, BlockNumber nblocks); extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum); @@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum); extern void AtEOXact_SMgr(void); extern bool ProcessBarrierSmgrRelease(void); +static inline void +smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + void *buffer) +{ + smgrreadv(reln, forknum, blocknum, &buffer, 1); +} + +static inline void +smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, + const void *buffer, bool skipFsync) +{ + smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync); +} + #endif /* SMGR_H */ -- 2.42.1