"$SUBJECT" may sound like a Zen kōan, but buffile.c does a few useful
things other than buffering (I/O time tracking, inter-process file
exchange through file sets, segmentation, etc.).  A couple of places
such as logtape.c, sharedtuplestore.c and gistbuildbuffers.c do
block-oriented I/O in temporary files, but unintentionally also pay
the price of buffile.c's buffering scheme for no good reason, namely:

 * extra memory for the buffer
 * extra memcpy() to/from that buffer
 * I/O chopped up into BLCKSZ-sized system calls
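To make that list concrete, here's a toy cost model (names invented here, nothing from buffile.c): for a block-aligned write of a given size, the buffered path memcpy()s every byte and issues one system call per BLCKSZ, while a pass-through path could issue a single pwrite() and copy nothing.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Toy cost model: syscalls issued and bytes memcpy()'d to perform one
 * block-aligned write of "size" bytes (single segment assumed). */
typedef struct
{
	size_t		syscalls;
	size_t		copied;
} io_cost;

/* Staging through a BLCKSZ buffer: copy every byte, one write per block. */
static io_cost
buffered_cost(size_t size)
{
	io_cost		c = {size / BLCKSZ, size};

	return c;
}

/* Pass-through: a single pwrite() straight from the caller's buffer. */
static io_cost
direct_cost(size_t size)
{
	io_cost		c = {1, 0};

	(void) size;				/* size doesn't change the syscall count */
	return c;
}
```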

Melanie and I were bouncing some ideas around about the hash join
batch memory usage problem, and I thought it might be good to write
down these thoughts about lower level buffer management stuff in a new
thread.  They don't address the hard algorithmic issues in that topic
(namely: at some point, doubling the number of partition files will
take more buffer space than can be saved by halving the size of the
hash table, parallel or not), but we probably should remove a couple
of the constants that make the parallel case worse.  As a side benefit
it would be nice to also fix some obvious silly efficiency problems
for all users of BufFiles.

First, here is a toy patch to try to skip the buffering layer in
BufFile when possible, without changing any interfaces.  Other ways
would of course be possible, including a new separate module with a
different name.  The approach taken here is to defer allocation of
BufFile::buffer, and use a fast path that just passes I/O straight
through to the OS without touching the sides, until the first
non-block-aligned I/O is seen, at which point we switch to the less
efficient mode.  Don't take this code too seriously; I'm not sure yet
what to think about short reads/writes.  It's just a cheap demo to
accompany this problem statement.
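The gating condition itself is trivial; here's a stand-alone sketch of it (struct and field names are illustrative stand-ins, not buffile.c's real layout): the fast path applies only while no internal buffer has ever been needed and both the current offset and the request size are block multiples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Illustrative stand-in for the relevant BufFile fields; the real
 * struct has many more members. */
struct stream
{
	char	   *buffer;			/* NULL until first unaligned I/O seen */
	off_t		curOffset;		/* offset of next read/write */
};

/* True while I/O can be passed straight through to the OS. */
static bool
can_use_fast_path(const struct stream *s, size_t size)
{
	return s->buffer == NULL &&
		s->curOffset % BLCKSZ == 0 &&
		size % BLCKSZ == 0;
}
```

Once a single unaligned request arrives, the buffer is allocated and the stream permanently falls back to the buffered mode.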

We could go further (and this one is probably more important).
SharedTuplestore works in 32KB chunks, which are the "parallel grain"
for distributing data to workers that read the data back later
(usually the workers try to read from different files to avoid
stepping on each others' toes, but when there aren't enough files left
they gang up and read from one file in parallel, and we need some way
to scatter the data to different workers; these chunks are analogous
to the heap table blocks handed out by Parallel Seq Scan).  With the
bufferless-BufFile optimisation, data is written out directly from
sharedtuplestore.c's buffer to the OS in a 32KB pwrite(), which is a
nice improvement, but that's 32KB of buffer space that non-parallel HJ
doesn't have.  We could fix that by making the chunk size dynamic, and
turning it down as low as 8KB by some heuristic when the number of
partitions shoots up to silly numbers.  (I guess the buffer size and the
chunk size don't have to be strictly the same, but it'd probably make
the code simpler...)  With those changes, PHJ would use strictly no
more buffer memory than HJ, unless npartitions is fairly low, in which
case it should be OK to use more (and we could even consider making the
buffers larger sometimes for bigger I/Os).
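For example, one possible heuristic (invented here for illustration, not in the attached patch) would halve the chunk size until npartitions buffers fit within a memory budget, bottoming out at one block:

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192
#define STS_CHUNK_SIZE (4 * BLCKSZ)	/* today's fixed 32KB chunk */

/* Halve the per-partition chunk size until npartitions buffers fit in
 * "budget" bytes, but never go below one block. */
static size_t
choose_chunk_size(size_t npartitions, size_t budget)
{
	size_t		chunk = STS_CHUNK_SIZE;

	while (chunk > BLCKSZ && chunk * npartitions > budget)
		chunk /= 2;
	return chunk;
}
```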

(Obviously it's just terrible in every way when we have very high
numbers of partitions, and this just mitigates one symptom.  For
example, once we get to our max fds limit, we're re-opening and
re-closing files at high speed, which is bonkers, and the kernel has
an identical buffer space management problem on its side of the wall.
In the past we've talked about something a bit more logtape.c-like
(multiple streams in one file with our own extent management) which
might help a bit.  But... it feels like what we really need is a
higher level algorithmic solution that lets us cap the number of
partitions, as chronicled elsewhere.)

We could go further again.  The above concerns the phase where we're
writing data out to many partitions, which requires us to have
in-memory state for all of them.  But during the reading back phase,
we read from just a single partition at a time, so we don't have the
memory explosion problem.  For dumb implementation reasons
(essentially because it was styled on the older pre-parallel HJ code),
SharedTuplestore reads one tuple at a time from buffile.c with
BufFileRead(), so while reading it *needs* BufFile's internal
buffering, which means 8KB pread() calls, no optimisation for you.  We
could change it to read in whole 32KB chunks at a time (or dynamic
size as mentioned already), via the above faster bufferless path.  The
next benefit would be that, with whole chunks in *its* buffer, it
could emit MinimalTuples using a pointer directly into that buffer,
skipping another memcpy() (unless the tuple is too big and involves
overflow chunks, a rare case).  (Perhaps there could even be some
clever way to read whole chunks directly into the hash table's memory
chunks, instead of copying tuples in one-by-one; I haven't examined
that.)
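As a sketch of the pointer-emitting idea (with an invented chunk layout, not sharedtuplestore.c's real format): once a whole chunk has been read in with one big pread(), a reader can hand back pointers into the chunk buffer instead of memcpy()ing each tuple out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invented layout: each tuple is a uint32 length followed by payload. */
struct chunk_reader
{
	const char *chunk;			/* whole chunk, read with one big pread() */
	size_t		size;			/* valid bytes in chunk */
	size_t		pos;			/* read cursor */
};

/* Return a pointer to the next tuple's payload, or NULL at end of
 * chunk; no per-tuple copy is made. */
static const char *
next_tuple(struct chunk_reader *r, uint32_t *len)
{
	const char *p;

	if (r->pos + sizeof(uint32_t) > r->size)
		return NULL;
	memcpy(len, r->chunk + r->pos, sizeof(uint32_t));
	if (*len == 0 || r->pos + sizeof(uint32_t) + *len > r->size)
		return NULL;
	p = r->chunk + r->pos + sizeof(uint32_t);
	r->pos += sizeof(uint32_t) + *len;
	return p;
}
```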

Another closely related topic that came up in other recent I/O work:
How can we arrange for all of these buffers to be allocated more
efficiently?  In fact we know how many partitions there are at a
higher level, and we'd ideally make one
palloc_aligned(PG_IO_ALIGN_SIZE) for all of them, instead of many
individual smaller pallocs.  Aside from wasted CPU cycles, individual
pallocs have invisible space overheads at both libc and palloc level,
and non-I/O-aligned buffers have invisible costs at I/O system call
level (the kernel must hold/pin/something an extra neighbouring VM
page while copying in/out of user space, if not nicely aligned).
However, palloc_aligned()'s strategy is to over-allocate by 4KB and
then round the pointer up, so we'd ideally like to do that just once
for a whole array, not once for each BufFile (you can see this problem
in the attached patch, which does that for individual BufFile
objects, but it'd make the problem Jehan-Guillaume wrote about
slightly worse, which is why I didn't do that in commit faeedbcef
"Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.").  This seems
to suggest a different design where the client code can optionally
provide its own buffer space, perhaps something as simple as
BufFileSetBuffer(), and we'd need to make one big allocation and pass
in pointers to its elements for the raw batch files (for HJ), and come
up with something similar for SharedTuplestore's internal chunk buffer
(for PHJ).
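Here's a stand-alone sketch of that single-allocation idea (using C11 aligned_alloc() since palloc_aligned() isn't available outside the server, and BufFileSetBuffer() itself remains hypothetical): one aligned region is carved into per-file BLCKSZ slices, each of which stays I/O-aligned for free, paying the alignment padding only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLCKSZ 8192
#define IO_ALIGN 4096			/* stand-in for PG_IO_ALIGN_SIZE */

/* One aligned region big enough for nbuffers BLCKSZ-sized buffers.
 * aligned_alloc requires size to be a multiple of the alignment,
 * which holds here because BLCKSZ is a multiple of IO_ALIGN. */
static char *
alloc_buffer_array(size_t nbuffers)
{
	return aligned_alloc(IO_ALIGN, nbuffers * BLCKSZ);
}

/* Since BLCKSZ is a multiple of IO_ALIGN, every slice stays aligned. */
static char *
buffer_for_file(char *region, size_t i)
{
	return region + i * BLCKSZ;
}
```

The same carving would work for SharedTuplestore's chunk buffers, just with the dynamic chunk size in place of BLCKSZ.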
From 7bad3ebe32f2f319fd76c613c8bb04a7f50e3c4f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 21 Apr 2023 15:58:02 +1200
Subject: [PATCH 1/2] Skip useless buffering in BufFile.

Some callers do small reads and writes (tuples), and benefit from
BufFile's internal BLCKSZ buffer.  Other callers such as logtape.c always
do perfectly BLCKSZ-sized reads and writes, so they might as well skip
a level of memcpy() through yet another buffer.  And yet other callers
such as sharedtuplestore.c always do perfectly aligned reads and writes
for many blocks at a time, so we might as well skip that extra buffering
AND the unwanted conversion into multiple system calls.

Those callers still have a reason to want to use BufFile despite the
apparent contradiction in its name: it deals with temporary file
management, file segments, IO timing, stats etc.

The creation of the internal buffer is deferred until we see the first
I/O operation that actually needs it, which should be never for
strictly block-oriented callers.  When it is created, the buffer is now
IO-aligned, which may provide a small performance boost.

XXX But that IO-aligned created-on-demand buffer now wastes extra space,
so we probably need another solution for the cases that create many
BufFiles and rely on buffile.c creating the buffer.  See later
commits...
---
 src/backend/storage/file/buffile.c | 157 +++++++++++++++++++++++++----
 1 file changed, 139 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 84ead85942..34418a03b7 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -80,13 +80,12 @@ struct BufFile
 	FileSet    *fileset;		/* space for fileset based segment files */
 	const char *name;			/* name of fileset based BufFile */
 
-	/*
-	 * resowner is the ResourceOwner to use for underlying temp files.  (We
-	 * don't need to remember the memory context we're using explicitly,
-	 * because after creation we only repalloc our arrays larger.)
-	 */
+	/* resowner is the ResourceOwner to use for underlying temp files */
 	ResourceOwner resowner;
 
+	/* memory context in which to allocate the buffer, on first use */
+	MemoryContext context;
+
 	/*
 	 * "current pos" is position of start of buffer within the logical file.
 	 * Position as seen by user of BufFile is (curFile, curOffset + pos).
@@ -96,12 +95,7 @@ struct BufFile
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
 
-	/*
-	 * XXX Should ideally us PGIOAlignedBlock, but might need a way to avoid
-	 * wasting per-file alignment padding when some users create many
-	 * files.
-	 */
-	PGAlignedBlock buffer;
+	PGIOAlignedBlock *buffer;	/* allocated on first use */
 };
 
 static BufFile *makeBufFileCommon(int nfiles);
@@ -124,10 +118,12 @@ makeBufFileCommon(int nfiles)
 	file->isInterXact = false;
 	file->dirty = false;
 	file->resowner = CurrentResourceOwner;
+	file->context = CurrentMemoryContext;
 	file->curFile = 0;
 	file->curOffset = 0;
 	file->pos = 0;
 	file->nbytes = 0;
+	file->buffer = NULL;
 
 	return file;
 }
@@ -400,7 +396,8 @@ BufFileExportFileSet(BufFile *file)
 	/* It's probably a bug if someone calls this twice. */
 	Assert(!file->readOnly);
 
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 	file->readOnly = true;
 }
 
@@ -415,11 +412,14 @@ BufFileClose(BufFile *file)
 	int			i;
 
 	/* flush any unwritten data */
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 	/* close and delete the underlying file(s) */
 	for (i = 0; i < file->numFiles; i++)
 		FileClose(file->files[i]);
 	/* release the buffer space */
+	if (file->buffer)
+		pfree(file->buffer);
 	pfree(file->files);
 	pfree(file);
 }
@@ -459,7 +459,7 @@ BufFileLoadBuffer(BufFile *file)
 	 * Read whatever we can get, up to a full bufferload.
 	 */
 	file->nbytes = FileRead(thisfile,
-							file->buffer.data,
-							sizeof(file->buffer),
+							file->buffer->data,
+							sizeof(*file->buffer),
 							file->curOffset,
 							WAIT_EVENT_BUFFILE_READ);
@@ -536,7 +536,7 @@ BufFileDumpBuffer(BufFile *file)
 			INSTR_TIME_SET_ZERO(io_start);
 
 		bytestowrite = FileWrite(thisfile,
-								 file->buffer.data + wpos,
+								 file->buffer->data + wpos,
 								 bytestowrite,
 								 file->curOffset,
 								 WAIT_EVENT_BUFFILE_WRITE);
@@ -597,6 +597,67 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
 	size_t		nread = 0;
 	size_t		nthistime;
 
+	/* Fast path for unbuffered large suitably aligned reads. */
+	if (!file->buffer &&
+		(file->curOffset % BLCKSZ) == 0 &&
+		(size % BLCKSZ) == 0)
+	{
+elog(LOG, "fast path read! size = %zu", size);
+		while (size > 0)
+		{
+			instr_time	io_start;
+			instr_time	io_time;
+			ssize_t		rc;
+
+			/* Move to next segment if necessary. */
+			if (file->curOffset >= MAX_PHYSICAL_FILESIZE)
+			{
+				if (file->curFile + 1 >= file->numFiles)
+					return nread;
+
+				file->curFile++;
+				file->curOffset = 0;
+			}
+
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_start);
+			else
+				INSTR_TIME_SET_ZERO(io_start);
+
+			/* Read as much as we can from this segment at once. */
+			rc = FileRead(file->files[file->curFile],
+						  ptr,
+						  Min(MAX_PHYSICAL_FILESIZE - file->curOffset, size),
+						  file->curOffset,
+						  WAIT_EVENT_BUFFILE_READ);
+			if (rc <= 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read file \"%s\": %m",
+								FilePathName(file->files[file->curFile]))));
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_time);
+				INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+			}
+
+			/* Advance buffer, offset and stats. */
+			nread += rc;
+			ptr = (char *) ptr + rc;
+			size -= rc;
+			file->curOffset += rc;
+			pgBufferUsage.temp_blks_read += rc / BLCKSZ;
+		}
+		return nread;
+	}
+
+	if (!file->buffer)
+		file->buffer = MemoryContextAllocAligned(file->context,
+												 sizeof(*file->buffer),
+												 PG_IO_ALIGN_SIZE,
+												 MCXT_ALLOC_ZERO);
+
 	BufFileFlush(file);
 
 	while (size > 0)
@@ -617,7 +678,7 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
 			nthistime = size;
 		Assert(nthistime > 0);
 
-		memcpy(ptr, file->buffer.data + file->pos, nthistime);
+		memcpy(ptr, file->buffer->data + file->pos, nthistime);
 
 		file->pos += nthistime;
 		ptr = (char *) ptr + nthistime;
@@ -680,6 +741,65 @@ BufFileWrite(BufFile *file, const void *ptr, size_t size)
 
 	Assert(!file->readOnly);
 
+	/* Fast path for unbuffered large suitably aligned writes. */
+	if (!file->buffer &&
+		(file->curOffset % BLCKSZ) == 0 &&
+		(size % BLCKSZ) == 0)
+	{
+elog(LOG, "fast path write! size = %zu", size);
+		while (size > 0)
+		{
+			instr_time	io_start;
+			instr_time	io_time;
+			ssize_t		rc;
+
+			/* Make a new segment if necessary. */
+			if (file->curOffset >= MAX_PHYSICAL_FILESIZE)
+			{
+				while (file->curFile + 1 >= file->numFiles)
+					extendBufFile(file);
+				file->curFile++;
+				file->curOffset = 0;
+			}
+
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_start);
+			else
+				INSTR_TIME_SET_ZERO(io_start);
+
+			/* Write as much as we can into this segment at once. */
+			rc = FileWrite(file->files[file->curFile],
+						   ptr,
+						   Min(MAX_PHYSICAL_FILESIZE - file->curOffset, size),
+						   file->curOffset,
+						   WAIT_EVENT_BUFFILE_WRITE);
+			if (rc <= 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m",
+								FilePathName(file->files[file->curFile]))));
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_time);
+				INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+			}
+
+			/* Advance buffer, offset and stats. */
+			ptr = (char *) ptr + rc;
+			size -= rc;
+			file->curOffset += rc;
+			pgBufferUsage.temp_blks_written += rc / BLCKSZ;
+		}
+		return;
+	}
+
+	if (!file->buffer)
+		file->buffer = MemoryContextAllocAligned(file->context,
+												 sizeof(*file->buffer),
+												 PG_IO_ALIGN_SIZE,
+												 MCXT_ALLOC_ZERO);
+
 	while (size > 0)
 	{
 		if (file->pos >= BLCKSZ)
@@ -701,7 +821,7 @@ BufFileWrite(BufFile *file, const void *ptr, size_t size)
 			nthistime = size;
 		Assert(nthistime > 0);
 
-		memcpy(file->buffer.data + file->pos, ptr, nthistime);
+		memcpy(file->buffer->data + file->pos, ptr, nthistime);
 
 		file->dirty = true;
 		file->pos += nthistime;
@@ -800,7 +920,8 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 		return 0;
 	}
 	/* Otherwise, must reposition buffer, so flush any dirty data */
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 
 	/*
 	 * At this point and no sooner, check for seek past last segment. The
-- 
2.40.0
