On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> LGTM. I think this 0001 patch is ready for commit, independently of the
> rest of the patches.

Done.

> In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
>
> > +/* Filename components */
> > +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> > +
>
> These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase.  Fixed.

> Other than that,
> v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
> v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
> good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
callers believe that short writes set errno: they error out with a
message including %m.  We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier).  In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error.  Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point).  I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change.  The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short.  Thoughts?

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
From 7808d7241088f8dc02d7d828dc4ee0227e8e2a01 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v3 1/7] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 src/backend/storage/file/fd.c | 33 +++++++++++++++++++++++----------
 src/include/storage/fd.h      | 30 ++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f691ba0932..aa89c755b3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,8 +2166,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2213,11 +2220,17 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
-	/* if write didn't set errno, assume problem is no disk space */
-	if (returnCode != amount && errno == 0)
+	/*
+	 * Some callers expect short writes to set errno, even though that isn't
+	 * the behavior of the underlying system calls.  We don't want to waste
+	 * CPU cycles adding up the total write size here, so we'll just set it to
+	 * ENOSPC even on success, if the syscall didn't set it.  On successful
+	 * non-short write, no caller should consult errno.
+	 */
+	if (errno == 0)
 		errno = ENOSPC;
 
 	if (returnCode >= 0)
@@ -2227,7 +2240,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			off_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index d9d5d9da5f..a1d6056d9d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -43,6 +43,8 @@
 #ifndef FD_H
 #define FD_H
 
+#include "port/pg_iovec.h"
+
 #include <dirent.h>
 #include <fcntl.h>
 
@@ -105,8 +107,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
 extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
@@ -189,4 +191,28 @@ extern int	durable_unlink(const char *fname, int elevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+static inline int
+FileRead(File file, void *buffer, size_t amount, off_t offset,
+		 uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = buffer,
+		.iov_len = amount
+	};
+
+	return FileReadV(file, &iov, 1, offset, wait_event_info);
+}
+
+static inline int
+FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+		  uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = unconstify(void *, buffer),
+		.iov_len = amount
+	};
+
+	return FileWriteV(file, &iov, 1, offset, wait_event_info);
+}
+
 #endif							/* FD_H */
-- 
2.42.1

From 5673ae46baa2cf8fa006f04462b9a60004873c65 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 19 Jul 2023 14:50:41 +1200
Subject: [PATCH v3 2/7] Provide vectored variants of smgrread() and
 smgrwrite().

smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and
thence the OS's vector I/O routines, after dealing with segmentation.
These functions allow for contiguous regions of relation files to be
transferred into and out of non-neighboring buffers with a single system
call.  The traditional smgrread() and smgrwrite() functions are
implemented in terms of the new functions.

The md.c implementation automatically handles short transfers by
retrying.  They were probably always possible even when we were only
doing 8kB reads and write, but with larger transfers they might become
more likely.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6ut5tum2g...@mail.gmail.com
---
 src/backend/storage/smgr/md.c   | 353 +++++++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c |  37 ++--
 src/include/storage/fd.h        |   2 +-
 src/include/storage/md.h        |   9 +-
 src/include/storage/smgr.h      |  25 ++-
 5 files changed, 304 insertions(+), 122 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..d1383220bd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -452,7 +452,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 /*
  * mdextend() -- Add a block to the specified relation.
  *
- * The semantics are nearly the same as mdwrite(): write at the
+ * The semantics are nearly the same as mdwritev(): write at the
  * specified position.  However, this is to be used for the case of
  * extending a relation (i.e., blocknum is at or beyond the current
  * EOF).  Note that we assume writing a block beyond current EOF
@@ -737,136 +737,297 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * mdread() -- Read the specified block from a relation.
+ * Convert an array of buffer address into an array of iovec objects, and
+ * return the number that were required.  'iov' must have enough space for up
+ * to PG_IOV_MAX elements.
  */
-void
-mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   void *buffer)
+static int
+buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
+	struct iovec *iovp;
+	int			iovcnt;
 
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+	Assert(nblocks >= 1);
+	Assert(nblocks <= PG_IOV_MAX);
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend);
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	/* Merge in the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			/* Contiguous with the last iovec. */
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			/* Need a new iovec. */
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
+
+	return iovcnt;
+}
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+/*
+ * To continue a vector I/O after a short transfer, we need a way to remove
+ * the whole or partial iovecs that have already been transferred.  We assume
+ * that partial transfers with direct I/O will happen on suitable alignment
+ * boundaries, so that we don't have to make further adjustments for that.
+ */
+static int
+adjust_iovec_for_partial_transfer(struct iovec *iov,
+								  int iovcnt,
+								  size_t transferred)
+{
+	struct iovec *keep = iov;
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber,
-									   reln->smgr_rlocator.backend,
-									   nbytes,
-									   BLCKSZ);
+	Assert(iovcnt > 0);
 
-	if (nbytes != BLCKSZ)
+	/* Remove wholly transferred iovecs. */
+	while (keep->iov_len <= transferred)
 	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
+		transferred -= keep->iov_len;
+		keep++;
+		iovcnt--;
+	}
+	if (keep != iov)
+		memmove(iov, keep, sizeof(*keep) * iovcnt);
 
-		/*
-		 * Short read: we are at or past EOF, or we read a partial block at
-		 * EOF.  Normally this is an error; upper levels should never try to
-		 * read a nonexistent block.  However, if zero_damaged_pages is ON or
-		 * we are InRecovery, we should instead return zeroes without
-		 * complaining.  This allows, for example, the case of trying to
-		 * update a block that was later truncated away.
-		 */
-		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
-							blocknum, FilePathName(v->mdfd_vfd),
-							nbytes, BLCKSZ)));
+	/* Shouldn't be called with 'transferred' >= total size. */
+	Assert(iovcnt > 0);
+
+	/* Adjust leading iovec, which may have been partially transferred. */
+	Assert(iov->iov_len > transferred);
+	iov->iov_base = (char *) iov->iov_base + transferred;
+	iov->iov_len -= transferred;
+
+	return iovcnt;
+}
+
+/*
+ * mdreadv() -- Read the specified blocks from a relation.
+ */
+void
+mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+
+		iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		/* Inner loop to continue after a short reads. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend);
+			nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   reln->smgr_rlocator.locator.spcOid,
+											   reln->smgr_rlocator.locator.dbOid,
+											   reln->smgr_rlocator.locator.relNumber,
+											   reln->smgr_rlocator.backend,
+											   nbytes,
+											   BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * We are at or past EOF, or we read a partial block at EOF.
+				 * Normally this is an error; upper levels should never try to
+				 * read a nonexistent block.  However, if zero_damaged_pages
+				 * is ON or we are InRecovery, we should instead return zeroes
+				 * without complaining.  This allows, for example, the case of
+				 * trying to update a block that was later truncated away.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read %u blocks at block %u in file \"%s\": read only %zu of %zu bytes",
+									nblocks_this_segment,
+									blocknum,
+									FilePathName(v->mdfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
 	}
 }
 
 /*
- * mdwrite() -- Write the supplied block at the appropriate location.
+ * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
 void
-mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		const void *buffer, bool skipFsync)
+mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rlocator.locator.spcOid,
-										 reln->smgr_rlocator.locator.dbOid,
-										 reln->smgr_rlocator.locator.relNumber,
-										 reln->smgr_rlocator.backend);
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend,
-										nbytes,
-										BLCKSZ);
+		iovcnt = buffers_to_iov(iov, (void **) buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
-	if (nbytes != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
-						blocknum,
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ),
-				 errhint("Check free disk space.")));
-	}
+		/* Inner loop to continue after a short writes. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 reln->smgr_rlocator.locator.spcOid,
+												 reln->smgr_rlocator.locator.dbOid,
+												 reln->smgr_rlocator.locator.relNumber,
+												 reln->smgr_rlocator.backend);
+			nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend,
+												nbytes,
+												BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+			if (nbytes == 0)
+			{
+				/* short write: complain appropriately */
+				ereport(ERROR,
+						(errcode(ERRCODE_DISK_FULL),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": wrote only %zu of %zu bytes",
+								nblocks_this_segment,
+								blocknum,
+								FilePathName(v->mdfd_vfd),
+								transferred_this_segment,
+								size_this_segment),
+						 errhint("Check free disk space.")));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..b46ac3cb09 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -55,10 +55,13 @@ typedef struct f_smgr
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = {
 		.smgr_extend = mdextend,
 		.smgr_zeroextend = mdzeroextend,
 		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
@@ -551,22 +554,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * smgrread() -- read a particular block from a relation into the supplied
- *				 buffer.
+ * smgrreadv() -- read a particular block range from a relation into the
+ *				 supplied buffers.
  *
  * This routine is called from the buffer manager in order to
  * instantiate pages in the shared buffer cache.  All storage managers
  * return pages in the format that POSTGRES expects.
  */
 void
-smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 void *buffer)
+smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
+	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
+										nblocks);
 }
 
 /*
- * smgrwrite() -- Write the supplied buffer out.
+ * smgrwritev() -- Write the supplied buffers out.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
@@ -581,14 +585,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * do not require fsync.
  */
 void
-smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  const void *buffer, bool skipFsync)
+smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
-										buffer, skipFsync);
+	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
+										 buffers, nblocks, skipFsync);
 }
 
-
 /*
  * smgrwriteback() -- Trigger kernel writeback for the supplied range of
  *					   blocks.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a1d6056d9d..7025af618a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -208,7 +208,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
-		.iov_base = unconstify(void *, buffer),
+		.iov_base = (void *) buffer,
 		.iov_len = amount
 	};
 
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..0f73d9af49 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-				   void *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-					BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..ae93910132 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
-extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, void *buffer);
-extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-					  BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
+					  BlockNumber blocknum,
+					  void **buffer, BlockNumber nblocks);
+extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum,
+					   const void **buffer, BlockNumber nblocks,
+					   bool skipFsync);
 extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+static inline void
+smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 void *buffer)
+{
+	smgrreadv(reln, forknum, blocknum, &buffer, 1);
+}
+
+static inline void
+smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  const void *buffer, bool skipFsync)
+{
+	smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
+}
+
 #endif							/* SMGR_H */
-- 
2.42.1

Reply via email to