"$SUBJECT" may sound like a Zen kōan, but buffile.c does a few useful things other than buffering (I/O time tracking, inter-process file exchange through file sets, segmentation, etc). A couple of places such as logtape.c, sharedtuplestore.c and gistbuildbuffers.c do block-oriented I/O in temporary files, but unintentionally also pay the price of buffile.c's buffering scheme for no good reason, namely:

* extra memory for the buffer
* extra memcpy() to/from that buffer
* I/O chopped up into BLCKSZ-sized system calls

Melanie and I were bouncing some ideas around about the hash join batch memory usage problem, and I thought it might be good to write down these thoughts about lower-level buffer management in a new thread. They don't address the hard algorithmic issues in that topic (namely: at some point, doubling the number of partition files takes more buffer space than can be saved by halving the size of the hash table, parallel or not), but we probably should remove a couple of the constants that make the parallel case worse. As a side benefit, it would be nice to also fix some obvious silly efficiency problems for all users of BufFiles.

First, here is a toy patch that tries to skip the buffering layer in BufFile when possible, without changing any interfaces. Other approaches would of course be possible, including a new separate module with a different name. The approach taken here is to defer allocation of BufFile::buffer, and use a fast path that just passes I/O straight through to the OS without touching the sides, until the first non-block-aligned I/O is seen, at which point we switch to the less efficient mode. Don't take this code too seriously, and I'm not sure what to think about short reads/writes yet; it's just a cheap demo to accompany this problem statement.

We could go further (and this one is probably more important). SharedTuplestore works in 32KB chunks, which are the "parallel grain" for distributing data to workers that read the data back later (usually the workers try to read from different files to avoid stepping on each other's toes, but when there aren't enough files left they gang up and read from one file in parallel, and we need some way to scatter the data to different workers; these chunks are analogous to the heap table blocks handed out by Parallel Seq Scan).
With the bufferless-BufFile optimisation, data is written out directly from sharedtuplestore.c's buffer to the OS in a 32KB pwrite(), which is a nice improvement, but that's 32KB of buffer space that non-parallel HJ doesn't have. We could fix that by making the chunk size dynamic, turning it down as low as 8KB by some heuristics when the number of partitions shoots up to silly numbers. (I guess the buffer size and the chunk size don't strictly have to be the same, but it'd probably make the code simpler...) With those changes, PHJ would use strictly no more buffer memory than HJ, unless npartitions is fairly low, when it should be OK to do so (and we could even consider making it larger sometimes for bigger I/Os).

(Obviously it's just terrible in every way when we have very high numbers of partitions, and this just mitigates one symptom. For example, once we hit our max fds limit, we're re-opening and re-closing files at high speed, which is bonkers, and the kernel has an identical buffer space management problem on its side of the wall. In the past we've talked about something a bit more logtape.c-like (multiple streams in one file with our own extent management), which might help a bit. But... it feels like what we really need is a higher-level algorithmic solution that lets us cap the number of partitions, as chronicled elsewhere.)

We could go further again. The above concerns the phase where we're writing data out to many partitions, which requires us to have in-memory state for all of them. But during the read-back phase, we read from just a single partition at a time, so we don't have the memory explosion problem. For dumb implementation reasons (essentially because it was styled on the older pre-parallel HJ code), SharedTuplestore reads one tuple at a time from buffile.c with BufFileRead(), so while reading it *needs* BufFile's internal buffering, which means 8KB pread() calls, no optimisation for you.
We could change it to read whole 32KB chunks at a time (or a dynamic size, as mentioned already) via the faster bufferless path described above. The next benefit would be that, with whole chunks in *its* buffer, it could emit MinimalTuples using a pointer directly into that buffer, skipping another memcpy() (unless the tuple is too big and involves overflow chunks, a rare case). (Perhaps there could even be some clever way to read whole chunks directly into the hash table's memory chunks at once, instead of copying tuples in one by one; not examined.)

Another closely related topic that came up in other recent I/O work: how can we arrange for all of these buffers to be allocated more efficiently? We know at a higher level how many partitions there are, and we'd ideally make one palloc_aligned(PG_IO_ALIGN_SIZE) allocation for all of them, instead of many individual smaller pallocs. Aside from wasted CPU cycles, individual pallocs have invisible space overheads at both the libc and palloc level, and non-I/O-aligned buffers have invisible costs at the I/O system call level (the kernel must hold/pin an extra neighbouring VM page while copying in/out of user space, if the buffer is not nicely aligned). However, palloc_aligned()'s strategy is to over-allocate by 4KB and then round the pointer up, so we'd ideally like to do that just once for a whole array, not once for each BufFile (you can see this problem in the attached patch, which does that for individual BufFile objects, but it'd make the problem Jehan-Guillaume wrote about slightly worse, which is why I didn't do that in commit faeedbcef "Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.").
This seems to suggest a different design where the client code can optionally provide its own buffer space, perhaps via something as simple as BufFileSetBuffer(). We'd make one big allocation and pass in pointers to its elements for the raw batch files (for HJ), and come up with something similar for SharedTuplestore's internal chunk buffer (for PHJ).
From 7bad3ebe32f2f319fd76c613c8bb04a7f50e3c4f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 21 Apr 2023 15:58:02 +1200
Subject: [PATCH 1/2] Skip useless buffering in BufFile.

Some callers do small reads and writes (tuples), and benefit from
BufFile's internal BLCKSZ buffer.  Other callers such as logtape.c
always do perfectly BLCKSZ-oriented reads and writes, so they might as
well skip a level of memcpy() through yet another buffer.  And yet
other callers such as sharedtuplestore.c always do perfectly aligned
reads and writes for many blocks at a time, so we might as well skip
that extra buffering AND the unwanted conversion into multiple system
calls.

Those callers still have a reason to want to use BufFile despite the
apparent contradiction in its name: it deals with temporary file
management, file segments, IO timing, stats etc.

The creation of the internal buffer is deferred until we see the first
I/O operation that actually needs it, which should be never for
strictly block-oriented callers.  When it is created, the buffer is
now IO-aligned, which may provide a small performance boost.

XXX But that IO-aligned created-on-demand buffer now wastes extra
space, so we probably need another solution for the cases that create
many BufFiles and rely on buffile.c creating the buffer.  See later
commits...
---
 src/backend/storage/file/buffile.c | 157 +++++++++++++++++++++++++----
 1 file changed, 139 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 84ead85942..34418a03b7 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -80,13 +80,12 @@ struct BufFile
 	FileSet    *fileset;		/* space for fileset based segment files */
 	const char *name;			/* name of fileset based BufFile */
 
-	/*
-	 * resowner is the ResourceOwner to use for underlying temp files.  (We
-	 * don't need to remember the memory context we're using explicitly,
-	 * because after creation we only repalloc our arrays larger.)
-	 */
+	/* resowner is the ResourceOwner to use for underlying temp files */
 	ResourceOwner resowner;
 
+	/* context to use for buffer */
+	MemoryContext context;		/* context to use for buffer */
+
 	/*
 	 * "current pos" is position of start of buffer within the logical file.
 	 * Position as seen by user of BufFile is (curFile, curOffset + pos).
@@ -96,12 +95,7 @@ struct BufFile
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
 
-	/*
-	 * XXX Should ideally us PGIOAlignedBlock, but might need a way to avoid
-	 * wasting per-file alignment padding when some users create many
-	 * files.
-	 */
-	PGAlignedBlock buffer;
+	PGIOAlignedBlock *buffer;	/* allocated on first use */
 };
 
 static BufFile *makeBufFileCommon(int nfiles);
@@ -124,10 +118,12 @@ makeBufFileCommon(int nfiles)
 	file->isInterXact = false;
 	file->dirty = false;
 	file->resowner = CurrentResourceOwner;
+	file->context = CurrentMemoryContext;
 	file->curFile = 0;
 	file->curOffset = 0;
 	file->pos = 0;
 	file->nbytes = 0;
+	file->buffer = NULL;
 
 	return file;
 }
@@ -400,7 +396,8 @@ BufFileExportFileSet(BufFile *file)
 	/* It's probably a bug if someone calls this twice. */
 	Assert(!file->readOnly);
 
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 	file->readOnly = true;
 }
 
@@ -415,11 +412,14 @@ BufFileClose(BufFile *file)
 	int			i;
 
 	/* flush any unwritten data */
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 	/* close and delete the underlying file(s) */
 	for (i = 0; i < file->numFiles; i++)
 		FileClose(file->files[i]);
 	/* release the buffer space */
+	if (file->buffer)
+		pfree(file->buffer);
 	pfree(file->files);
 	pfree(file);
 }
@@ -459,7 +459,7 @@ BufFileLoadBuffer(BufFile *file)
 	 * Read whatever we can get, up to a full bufferload.
 	 */
 	file->nbytes = FileRead(thisfile,
-							file->buffer.data,
-							sizeof(file->buffer),
+							file->buffer->data,
+							sizeof(*file->buffer),
 							file->curOffset,
 							WAIT_EVENT_BUFFILE_READ);
@@ -536,7 +536,7 @@ BufFileDumpBuffer(BufFile *file)
 			INSTR_TIME_SET_ZERO(io_start);
 
 		bytestowrite = FileWrite(thisfile,
-								 file->buffer.data + wpos,
+								 file->buffer->data + wpos,
 								 bytestowrite,
 								 file->curOffset,
 								 WAIT_EVENT_BUFFILE_WRITE);
@@ -597,6 +597,67 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
 	size_t		nread = 0;
 	size_t		nthistime;
 
+	/* Fast path for unbuffered large suitably aligned reads. */
+	if (!file->buffer &&
+		(file->curOffset % BLCKSZ) == 0 &&
+		(size % BLCKSZ) == 0)
+	{
+elog(LOG, "fast path read! size = %zu", size);
+		while (size > 0)
+		{
+			instr_time	io_start;
+			instr_time	io_time;
+			ssize_t		rc;
+
+			/* Move to next segment if necessary. */
+			if (file->curOffset >= MAX_PHYSICAL_FILESIZE)
+			{
+				if (file->curFile + 1 >= file->numFiles)
+					return nread;
+
+				file->curFile++;
+				file->curOffset = 0;
+			}
+
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_start);
+			else
+				INSTR_TIME_SET_ZERO(io_start);
+
+			/* Read as much as we can from this segment at once. */
+			rc = FileRead(file->files[file->curFile],
+						  ptr,
+						  Min(MAX_PHYSICAL_FILESIZE - file->curOffset, size),
+						  file->curOffset,
+						  WAIT_EVENT_BUFFILE_READ);
+			if (rc <= 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read file \"%s\": %m",
+								FilePathName(file->files[file->curFile]))));
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_time);
+				INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+			}
+
+			/* Advance buffer, offset and stats. */
+			nread += rc;
+			ptr = (char *) ptr + rc;
+			size -= rc;
+			file->curOffset += rc;
+			pgBufferUsage.temp_blks_read += rc / BLCKSZ;
+		}
+		return nread;
+	}
+
+	if (!file->buffer)
+		file->buffer = MemoryContextAllocAligned(file->context,
+												 sizeof(*file->buffer),
+												 PG_IO_ALIGN_SIZE,
+												 MCXT_ALLOC_ZERO);
+
 	BufFileFlush(file);
 
 	while (size > 0)
@@ -617,7 +678,7 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
 			nthistime = size;
 		Assert(nthistime > 0);
 
-		memcpy(ptr, file->buffer.data + file->pos, nthistime);
+		memcpy(ptr, file->buffer->data + file->pos, nthistime);
 
 		file->pos += nthistime;
 		ptr = (char *) ptr + nthistime;
@@ -680,6 +741,65 @@ BufFileWrite(BufFile *file, const void *ptr, size_t size)
 
 	Assert(!file->readOnly);
 
+	/* Fast path for unbuffered large suitably aligned writes. */
+	if (!file->buffer &&
+		(file->curOffset % BLCKSZ) == 0 &&
+		(size % BLCKSZ) == 0)
+	{
+elog(LOG, "fast path write! size = %zu", size);
+		while (size > 0)
+		{
+			instr_time	io_start;
+			instr_time	io_time;
+			ssize_t		rc;
+
+			/* Make a new segment if necessary. */
+			if (file->curOffset >= MAX_PHYSICAL_FILESIZE)
+			{
+				while (file->curFile + 1 >= file->numFiles)
+					extendBufFile(file);
+				file->curFile++;
+				file->curOffset = 0;
+			}
+
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_start);
+			else
+				INSTR_TIME_SET_ZERO(io_start);
+
+			/* Write as much as we can into this segment at once. */
+			rc = FileWrite(file->files[file->curFile],
+						   ptr,
+						   Min(MAX_PHYSICAL_FILESIZE - file->curOffset, size),
+						   file->curOffset,
+						   WAIT_EVENT_BUFFILE_WRITE);
+			if (rc <= 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m",
+								FilePathName(file->files[file->curFile]))));
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_time);
+				INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+			}
+
+			/* Advance buffer, offset and stats. */
+			ptr = (char *) ptr + rc;
+			size -= rc;
+			file->curOffset += rc;
+			pgBufferUsage.temp_blks_written += rc / BLCKSZ;
+		}
+		return;
+	}
+
+	if (!file->buffer)
+		file->buffer = MemoryContextAllocAligned(file->context,
+												 sizeof(*file->buffer),
+												 PG_IO_ALIGN_SIZE,
+												 MCXT_ALLOC_ZERO);
+
 	while (size > 0)
 	{
 		if (file->pos >= BLCKSZ)
@@ -701,7 +821,7 @@ BufFileWrite(BufFile *file, const void *ptr, size_t size)
 			nthistime = size;
 		Assert(nthistime > 0);
 
-		memcpy(file->buffer.data + file->pos, ptr, nthistime);
+		memcpy(file->buffer->data + file->pos, ptr, nthistime);
 		file->dirty = true;
 
 		file->pos += nthistime;
@@ -800,7 +920,8 @@ BufFileSeek(BufFile *file, int fileno, off_t offset, int whence)
 			return 0;
 		}
 
 	/* Otherwise, must reposition buffer, so flush any dirty data */
-	BufFileFlush(file);
+	if (file->buffer)
+		BufFileFlush(file);
 
 	/*
 	 * At this point and no sooner, check for seek past last segment.  The
-- 
2.40.0