On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> On 19/11/2023 02:04, Andres Freund wrote:
> > On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> >> The new facility makes it easier to optimize bulk loading, as the
> >> logic for buffering, WAL-logging, and syncing the relation only needs
> >> to be implemented once. It's also less error-prone: We have had a
> >> number of bugs in how a relation is fsync'd - or not - at the end of a
> >> bulk loading operation. By centralizing that logic to one place, we
> >> only need to write it correctly once.
> >
> > One thing I'd like to use the centralized handling for is to track such
> > writes in pg_stat_io. I don't mean as part of the initial patch, just
> > that that's another reason I like the facility.
>
> Oh, I didn't realize they're not counted at the moment.
>
> >> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> >> +
> >>     nblocks = smgrnblocks(src, forkNum);
> >>
> >>     for (blkno = 0; blkno < nblocks; blkno++)
> >>     {
> >> +        Page        page;
> >> +
> >>         /* If we got a cancel signal during the copy of the data, quit */
> >>         CHECK_FOR_INTERRUPTS();
> >>
> >> -        smgrread(src, forkNum, blkno, buf.data);
> >> +        page = bulkw_alloc_buf(bulkw);
> >> +        smgrread(src, forkNum, blkno, page);
> >>
> >>         if (!PageIsVerifiedExtended(page, blkno,
> >>                                     PIV_LOG_WARNING | PIV_REPORT_STAT))
> >> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
> >>          * page this is, so we have to log the full page including any unused
> >>          * space.
> >>          */
> >> -        if (use_wal)
> >> -            log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> >> -
> >> -        PageSetChecksumInplace(page, blkno);
> >> -
> >> -        /*
> >> -         * Now write the page. We say skipFsync = true because there's no
> >> -         * need for smgr to schedule an fsync for this write; we'll do it
> >> -         * ourselves below.
> >> -         */
> >> -        smgrextend(dst, forkNum, blkno, buf.data, true);
> >> +        bulkw_write(bulkw, blkno, page, false);
> >
> > I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> > change, it looks like it'll just leak memory. It also might be taken to
> > be valid until freed, which I don't think is the case?
>
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd, so you could call pfree on it.
>
> However, I'd like to not expose that detail to the callers. I'm
> imagining that in the future we might optimize further, by having a
> larger buffer, e.g. 1 MB, and carving the 8 kB blocks from that. Then,
> opportunistically, if you fill the buffers sequentially, bulk_write.c
> could do one smgrextend() to write the whole 1 MB chunk.
>
> I renamed it to bulkw_get_buf() now, and made it return a new
> BulkWriteBuffer typedef instead of a plain Page. The point of the new
> typedef is to distinguish a buffer returned by bulkw_get_buf() from a
> Page or char[BLCKSZ] that you might palloc on your own. That indeed
> revealed some latent bugs in gistbuild.c where I had mixed up buffers
> returned by bulkw_alloc_buf() and palloc'd buffers.
>
> (The previous version of this patch called a different struct
> BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't
> be confused!)
>
> I think this helps a little, but I'm still not very happy with it. I'll
> give it some more thought after sleeping over it, but in the meanwhile,
> I'm all ears for suggestions.
>
> >> +/*-------------------------------------------------------------------------
> >> + *
> >> + * bulk_write.c
> >> + *      Efficiently and reliably populate a new relation
> >> + *
> >> + * The assumption is that no other backends access the relation while
> >> + * we are loading it, so we can take some shortcuts. Alternatively, you
> >> + * can use the buffer manager as usual, if performance is not critical,
> >> + * but you must not mix operations through the buffer manager and the
> >> + * bulk loading interface at the same time.
> >
> > From "Alternatively" onward this is somewhat confusing.
>
> Rewrote that to just "Do not mix operations through the regular buffer
> manager and the bulk loading interface!"
>
> >> + * One tricky point is that because we bypass the buffer manager, we
> >> + * need to register the relation for fsyncing at the next checkpoint
> >> + * ourselves, and make sure that the relation is correctly fsync by us
> >> + * or the checkpointer even if a checkpoint happens concurrently.
> >
> > "fsync'ed" or such? Otherwise this reads awkwardly for me.
>
> Yep, fixed.
>
> >> +typedef struct BulkWriteBuffer
> >> +{
> >> +    Page        page;
> >> +    BlockNumber blkno;
> >> +    bool        page_std;
> >> +    int16       order;
> >> +} BulkWriteBuffer;
> >> +
> >
> > The name makes it sound like this struct itself contains a buffer - but
> > it's just pointing to one. *BufferRef or such maybe?
> >
> > I was wondering how you dealt with the alignment of buffers given the
> > struct definition, which is what led me to look at this...
>
> I renamed this to PendingWrite, and the field that holds these to
> "pending_writes". Think of it as a queue of writes that haven't been
> performed yet.
>
> >> +/*
> >> + * Bulk writer state for one relation fork.
> >> + */
> >> +typedef struct BulkWriteState
> >> +{
> >> +    /* Information about the target relation we're writing */
> >> +    SMgrRelation smgr;
> >
> > Isn't there a danger of this becoming a dangling pointer? At least until
> > https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> > is merged?
>
> Good point. I just added a FIXME comment to remind about that, hoping
> that that patch gets merged soon. If not, I'll come up with a different
> fix.
>
> >> +    ForkNumber  forknum;
> >> +    bool        use_wal;
> >> +
> >> +    /* We keep several pages buffered, and WAL-log them in batches */
> >> +    int         nbuffered;
> >> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> >> +
> >> +    /* Current size of the relation */
> >> +    BlockNumber pages_written;
> >> +
> >> +    /* The RedoRecPtr at the time that the bulk operation started */
> >> +    XLogRecPtr  start_RedoRecPtr;
> >> +
> >> +    Page        zeropage;       /* workspace for filling zeroes */
> >
> > We really should just have one such page in shared memory somewhere...
> > For WAL writes as well.
> >
> > But until then - why do you allocate the page? Seems like we could just
> > use a static global variable?
>
> I made it a static global variable for now. (The palloc way was copied
> over from nbtsort.c)
>
> >> +/*
> >> + * Write all buffered pages to disk.
> >> + */
> >> +static void
> >> +bulkw_flush(BulkWriteState *bulkw)
> >> +{
> >> +    int         nbuffered = bulkw->nbuffered;
> >> +    BulkWriteBuffer *buffers = bulkw->buffers;
> >> +
> >> +    if (nbuffered == 0)
> >> +        return;
> >> +
> >> +    if (nbuffered > 1)
> >> +    {
> >> +        int         o;
> >> +
> >> +        qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> >> +
> >> +        /*
> >> +         * Eliminate duplicates, keeping the last write of each block.
> >> +         * (buffer_cmp uses 'order' as the last sort key)
> >> +         */
> >
> > Huh - which use cases would actually cause duplicate writes?
>
> Hmm, nothing anymore I guess.
> Many AMs used to write zero pages as a placeholder and come back to
> fill them in later, but now that bulk_write.c handles that, it's not
> needed anymore.
>
> Removed that, and replaced it with an assertion in buffer_cmp() that
> there are no duplicates.
There are a few compilation errors reported by CFBot at [1]; the patch
needs to be rebased:

[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675]                  from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member named ‘btps_page’
[02:38:12.675]  1309 |  Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
[02:38:12.675]       |                                                        ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition of macro ‘Assert’
[02:38:12.675]   864 |  if (!(condition)) \
[02:38:12.675]       |         ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion of macro ‘TYPEALIGN_DOWN’
[02:38:12.675]   812 | #define MAXALIGN_DOWN(LEN) TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675]       |                            ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675]   165 |   (MAXALIGN_DOWN((PageGetPageSize(page) - \

[1] - https://cirrus-ci.com/task/5299954164432896

Regards,
Vignesh