On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> On 19/11/2023 02:04, Andres Freund wrote:
> > On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> >> The new facility makes it easier to optimize bulk loading, as the
> >> logic for buffering, WAL-logging, and syncing the relation only needs
> >> to be implemented once. It's also less error-prone: We have had a
> >> number of bugs in how a relation is fsync'd - or not - at the end of a
> >> bulk loading operation. By centralizing that logic to one place, we
> >> only need to write it correctly once.
> >
> > One thing I'd like to use the centralized handling for is to track such
> > writes in pg_stat_io. I don't mean as part of the initial patch, just
> > that that's another reason I like the facility.
>
> Oh, I didn't realize they're not counted at the moment.
>
> >> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> >> +
> >>     nblocks = smgrnblocks(src, forkNum);
> >>
> >>     for (blkno = 0; blkno < nblocks; blkno++)
> >>     {
> >> +        Page        page;
> >> +
> >>         /* If we got a cancel signal during the copy of the data, quit */
> >>         CHECK_FOR_INTERRUPTS();
> >>
> >> -        smgrread(src, forkNum, blkno, buf.data);
> >> +        page = bulkw_alloc_buf(bulkw);
> >> +        smgrread(src, forkNum, blkno, page);
> >>
> >>         if (!PageIsVerifiedExtended(page, blkno,
> >>                                     PIV_LOG_WARNING | PIV_REPORT_STAT))
> >> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
> >>          * page this is, so we have to log the full page including any unused
> >>          * space.
> >>          */
> >> -        if (use_wal)
> >> -            log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> >> -
> >> -        PageSetChecksumInplace(page, blkno);
> >> -
> >> -        /*
> >> -         * Now write the page. We say skipFsync = true because there's no
> >> -         * need for smgr to schedule an fsync for this write; we'll do it
> >> -         * ourselves below.
> >> -         */
> >> -        smgrextend(dst, forkNum, blkno, buf.data, true);
> >> +        bulkw_write(bulkw, blkno, page, false);
> >
> > I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> > change, it looks like it'll just leak memory. It also might be taken to
> > be valid until freed, which I don't think is the case?
>
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd, so you could call pfree on it.
>
> However, I'd like to not expose that detail to the callers. I'm
> imagining that in the future we might optimize further, by having a
> larger buffer, e.g. 1 MB, and carving the 8 kB blocks from that. Then,
> opportunistically, if you fill the buffers sequentially, bulk_write.c
> could do one smgrextend() to write the whole 1 MB chunk.
>
> I renamed it to bulkw_get_buf() now, and made it return a new
> BulkWriteBuffer typedef instead of a plain Page. The point of the new
> typedef is to distinguish a buffer returned by bulkw_get_buf() from a
> Page or char[BLCKSZ] that you might palloc on your own. That indeed
> revealed some latent bugs in gistbuild.c where I had mixed up buffers
> returned by bulkw_alloc_buf() and palloc'd buffers.
>
> (The previous version of this patch called a different struct
> BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't
> be confused!)
>
> I think this helps a little, but I'm still not very happy with it. I'll
> give it some more thought after sleeping over it, but in the meanwhile,
> I'm all ears for suggestions.
>
> >> +/*-------------------------------------------------------------------------
> >> + *
> >> + * bulk_write.c
> >> + *      Efficiently and reliably populate a new relation
> >> + *
> >> + * The assumption is that no other backends access the relation while
> >> + * we are loading it, so we can take some shortcuts. Alternatively, you
> >> + * can use the buffer manager as usual, if performance is not critical,
> >> + * but you must not mix operations through the buffer manager and the
> >> + * bulk loading interface at the same time.
> >
> > From "Alternatively" onward this is somewhat confusing.
>
> Rewrote that to just "Do not mix operations through the regular buffer
> manager and the bulk loading interface!"
>
> >> + * One tricky point is that because we bypass the buffer manager, we
> >> + * need to register the relation for fsyncing at the next checkpoint
> >> + * ourselves, and make sure that the relation is correctly fsync by us
> >> + * or the checkpointer even if a checkpoint happens concurrently.
> >
> > "fsync'ed" or such? Otherwise this reads awkwardly for me.
>
> Yep, fixed.
>
> >> +typedef struct BulkWriteBuffer
> >> +{
> >> +    Page        page;
> >> +    BlockNumber blkno;
> >> +    bool        page_std;
> >> +    int16       order;
> >> +} BulkWriteBuffer;
> >> +
> >
> > The name makes it sound like this struct itself contains a buffer - but
> > it's just pointing to one. *BufferRef or such maybe?
> >
> > I was wondering how you dealt with the alignment of buffers given the
> > struct definition, which is what led me to look at this...
>
> I renamed this to PendingWrite, and the field that holds these to
> "pending_writes". Think of it as a queue of writes that haven't been
> performed yet.
>
> >> +/*
> >> + * Bulk writer state for one relation fork.
> >> + */
> >> +typedef struct BulkWriteState
> >> +{
> >> +    /* Information about the target relation we're writing */
> >> +    SMgrRelation smgr;
> >
> > Isn't there a danger of this becoming a dangling pointer? At least until
> > https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> > is merged?
>
> Good point. I just added a FIXME comment to remind about that, hoping
> that that patch gets merged soon. If not, I'll come up with a different
> fix.
>
> >> +    ForkNumber  forknum;
> >> +    bool        use_wal;
> >> +
> >> +    /* We keep several pages buffered, and WAL-log them in batches */
> >> +    int         nbuffered;
> >> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> >> +
> >> +    /* Current size of the relation */
> >> +    BlockNumber pages_written;
> >> +
> >> +    /* The RedoRecPtr at the time that the bulk operation started */
> >> +    XLogRecPtr  start_RedoRecPtr;
> >> +
> >> +    Page        zeropage;       /* workspace for filling zeroes */
> >
> > We really should just have one such page in shared memory somewhere...
> > For WAL writes as well.
> >
> > But until then - why do you allocate the page? Seems like we could just
> > use a static global variable?
>
> I made it a static global variable for now. (The palloc way was copied
> over from nbtsort.c)
>
> >> +/*
> >> + * Write all buffered pages to disk.
> >> + */
> >> +static void
> >> +bulkw_flush(BulkWriteState *bulkw)
> >> +{
> >> +    int         nbuffered = bulkw->nbuffered;
> >> +    BulkWriteBuffer *buffers = bulkw->buffers;
> >> +
> >> +    if (nbuffered == 0)
> >> +        return;
> >> +
> >> +    if (nbuffered > 1)
> >> +    {
> >> +        int         o;
> >> +
> >> +        qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> >> +
> >> +        /*
> >> +         * Eliminate duplicates, keeping the last write of each block.
> >> +         * (buffer_cmp uses 'order' as the last sort key)
> >> +         */
> >
> > Huh - which use cases would actually cause duplicate writes?
>
> Hmm, nothing anymore I guess.
> Many AMs used to write zero pages as a placeholder and come back to
> fill them in later, but now that bulk_write.c handles that, it's not
> needed anymore.
>
> Removed that, and replaced it with an assertion in buffer_cmp() that
> there are no duplicates.
There are a few compilation errors reported by CFBot at [1]; the patch
needs to be rebased:

[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675]                  from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member named ‘btps_page’
[02:38:12.675]  1309 |  Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
[02:38:12.675]       |                                                        ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition of macro ‘Assert’
[02:38:12.675]   864 |  if (!(condition)) \
[02:38:12.675]       |         ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion of macro ‘TYPEALIGN_DOWN’
[02:38:12.675]   812 | #define MAXALIGN_DOWN(LEN) TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675]       |                            ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675]   165 |   (MAXALIGN_DOWN((PageGetPageSize(page) - \

[1] - https://cirrus-ci.com/task/5299954164432896

Regards,
Vignesh