[HACKERS] WAL format and API changes (9.5)

Heikki Linnakangas Thu, 03 Apr 2014 07:16:20 -0700

I'd like to make some changes to the WAL format in 9.5. I want to annotate each WAL record with the blocks that it modifies. Every WAL record already includes that information, but it's done in an ad hoc way, differently in every rmgr. The RelFileNode and block number are currently part of the WAL payload, and it's the REDO routine's responsibility to extract them. I want to include that information in a common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing which blocks a WAL record modifies. One such tool is pg_rewind; it currently has to understand every WAL record the backend writes. There's also a tool out there called pg_readahead, which does prefetching of blocks accessed by WAL records, to speed up PITR. I don't think that tool has been actively maintained, but at least part of the reason for that is probably that it's a pain to maintain when it has to understand the details of every WAL record type.

It'd also be nice for contrib/pg_xlogdump and the backend code itself. The boilerplate code in all WAL redo routines, and in the code that writes WAL records, could be simplified.


So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in the WAL-logged operation are explicitly registered, by calling a new XLogRegisterBuffer function. Currently, buffers that need full-page images are registered by including them in the XLogRecData chain, but with the new system, you call the XLogRegisterBuffer() function instead. And you call that function for every buffer involved, even if no full-page image needs to be taken, e.g. because the page is going to be recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of the modified pages in the WAL payload. That information is automatically included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the buffers appear in XLogRecData entries. With the new API, the blocks are numbered explicitly. This is more convenient when a WAL record sometimes modifies a buffer and sometimes not. For example, a B-tree split needs to modify four pages: the original page, the new page, the right sibling (unless it's the rightmost page) and, if it's an internal page, the page at the lower level whose split the insertion completes. So there are two pages that are sometimes missing from the record. With the new API, you can nevertheless always register e.g. the original page as buffer 0, the new page as 1, and the right sibling as 2, even if some of them are actually missing. SP-GiST contains even more complicated examples of that.
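To illustrate the explicit numbering, here is a minimal standalone sketch of a table of block reference slots in which some IDs are simply left unregistered. Everything in it (MockBuffer, BlockRefTable, the blockref_* functions, the limit of four slots) is a hypothetical stand-in for illustration, not backend code and not the proposed API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCK_REFS 4        /* hypothetical limit, for this sketch only */

/* Stand-in for the backend's Buffer handle; not the real type. */
typedef int MockBuffer;

typedef struct
{
    bool        in_use;         /* was this ID registered for the record? */
    MockBuffer  buffer;
    bool        buffer_std;     /* page uses the "standard" layout? */
} BlockRefSlot;

typedef struct
{
    BlockRefSlot slots[MAX_BLOCK_REFS];
} BlockRefTable;

/* Forget all registrations, as would happen between two WAL records. */
static void
blockref_reset(BlockRefTable *t)
{
    memset(t, 0, sizeof(*t));
}

/* Register a buffer under an explicit ID; IDs may be registered in any
 * order and gaps are allowed. */
static void
blockref_register(BlockRefTable *t, int blockref_id,
                  MockBuffer buffer, bool buffer_std)
{
    assert(blockref_id >= 0 && blockref_id < MAX_BLOCK_REFS);
    assert(!t->slots[blockref_id].in_use);

    t->slots[blockref_id].in_use = true;
    t->slots[blockref_id].buffer = buffer;
    t->slots[blockref_id].buffer_std = buffer_std;
}

/* Redo-side question: does this record reference a block with this ID? */
static bool
blockref_is_registered(const BlockRefTable *t, int blockref_id)
{
    assert(blockref_id >= 0 && blockref_id < MAX_BLOCK_REFS);
    return t->slots[blockref_id].in_use;
}
```

With a table like this, a split of the rightmost page would register IDs 0 and 1 and skip 2, and the redo side could still refer to each page by a fixed ID.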


The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used in the redo routine to open/restore the same block.

buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also registering buffers that don't need full-page images, and perhaps also a function for registering a page that *always* needs a full-page image, regardless of the LSN. A few existing WAL record types just WAL-log the whole page, so those ad-hoc full-page images could be replaced with this.


With these changes, a typical WAL insertion would look like this:

        /* register the buffer with the WAL record, with ID 0 */
        XLogRegisterBuffer(0, buf, true);

        rdata[0].data = (char *) &xlrec;
        rdata[0].len = sizeof(BlahRecord);
        rdata[0].buffer_id = -1; /* -1 means the data is always included */
        rdata[0].next = &(rdata[1]);

        rdata[1].data = (char *) mydata;
        rdata[1].len = mydatalen;
        rdata[1].buffer_id = 0; /* 0 refers to the buffer registered above */
        rdata[1].next = NULL;

        ...
        recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

        PageSetLSN(buf, recptr);

(While we're at it, perhaps we should let XLogInsert set the LSN of all the registered buffers, to reduce the amount of boilerplate code).

(Instead of using a new XLogRegisterBuffer() function to register the buffers, perhaps they should be passed to XLogInsert as a separate list or array. I'm not wedded to the details...)
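As a rough illustration of what the buffer_id field in the insertion example controls, the standalone sketch below walks a chain of data entries and adds up the payload that would end up in the record. It assumes the same rule that applies to buffer-associated XLogRecData entries today: data tied to a buffer is left out when a full-page image of that buffer is taken. SketchRecData, sketch_payload_len and the takes_fpi array are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BUFFERS 4    /* hypothetical limit, for this sketch only */

/* Simplified stand-in for an rdata chain entry. */
typedef struct SketchRecData
{
    const char *data;
    size_t      len;
    int         buffer_id;      /* -1: always included; otherwise the ID of
                                 * a registered buffer */
    struct SketchRecData *next;
} SketchRecData;

/*
 * Sum the payload bytes that would be written.  takes_fpi[i] says whether
 * a full-page image is taken for the buffer registered with ID i; data
 * tied to such a buffer is skipped, since the image already carries it.
 */
static size_t
sketch_payload_len(const SketchRecData *rdata, const bool *takes_fpi)
{
    size_t      total = 0;

    for (; rdata != NULL; rdata = rdata->next)
    {
        if (rdata->buffer_id >= 0)
        {
            assert(rdata->buffer_id < SKETCH_MAX_BUFFERS);
            if (takes_fpi[rdata->buffer_id])
                continue;
        }
        total += rdata->len;
    }
    return total;
}
```

In the example above, rdata[0] would always count towards the payload, while rdata[1] would count only when no full-page image of buffer 0 is taken.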


Redo
----

There are four different states a block referenced by a typical WAL record can be in:

1. The old page does not exist at all (because the relation was truncated later).
2. The old page exists, but its LSN is >= the current WAL record's, so it doesn't need replaying.
3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

        /* If we have a full-page image, restore it and we're done */
        if (HasBackupBlock(record, 0))
        {
                (void) RestoreBackupBlock(lsn, record, 0, false, false);
                return;
        }
        buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
        /* If the page was truncated away, we're done */
        if (!BufferIsValid(buffer))
                return;

        page = (Page) BufferGetPage(buffer);

        /* Has this record already been replayed? */
        if (lsn <= PageGetLSN(page))
        {
                UnlockReleaseBuffer(buffer);
                return;
        }

        /* Modify the page */
        ...
        
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, which returns a return code that indicates which of the four cases we're dealing with. A typical redo function looks like this:


        if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
        {
                page = (Page) BufferGetPage(buffer);

                /* Modify the page */
                ...

                PageSetLSN(page, lsn);
                MarkBufferDirty(buffer);
        }
        if (BufferIsValid(buffer))
                UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference specified in the XLogRegisterBuffer call, when the WAL record was created.
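The four-way decision that such a function would make can be sketched in isolation. The code below is a standalone model, not the proposed implementation: only BLK_REPLAY is a name from this proposal, while the other three SKETCH_* return codes, MockLSN, MockPage and classify_block are hypothetical names chosen for illustration. The checks are made in the same order as in the existing boilerplate shown earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MockLSN;       /* stand-in for XLogRecPtr */

typedef enum
{
    BLK_REPLAY,                 /* page exists and is older: apply changes */
    SKETCH_BLK_DONE,            /* hypothetical: page already up to date */
    SKETCH_BLK_RESTORED,        /* hypothetical: full-page image restored */
    SKETCH_BLK_NOTFOUND         /* hypothetical: page was truncated away */
} BlockOpenResult;

/* Just enough page state for the decision. */
typedef struct
{
    bool        exists;
    MockLSN     lsn;
} MockPage;

static BlockOpenResult
classify_block(const MockPage *page, MockLSN record_lsn,
               bool has_full_page_image)
{
    /* A full-page image wins: it is restored regardless of the old page. */
    if (has_full_page_image)
        return SKETCH_BLK_RESTORED;

    /* The relation was truncated later; there is nothing to replay on. */
    if (!page->exists)
        return SKETCH_BLK_NOTFOUND;

    /* Same test as "lsn <= PageGetLSN(page)" in the existing code. */
    if (record_lsn <= page->lsn)
        return SKETCH_BLK_DONE;

    return BLK_REPLAY;
}
```

A caller that only cares about replaying compares the result against BLK_REPLAY, as in the redo example above, and treats the other three codes as "nothing more to do to the page".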


WAL format
----------

The registered block references need to be included in the WAL record. We already do that for backup blocks, so a naive implementation would be to just include a BkpBlock struct for all the block references, even those that don't need a full-page image. That would be rather bulky, though, so that needs some optimization. Shouldn't be difficult to omit duplicated/unnecessary information, and add a flags field indicating which fields are present. Overall, I don't expect there to be any big difference in the amount of WAL generated by a typical application.
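One way the "omit duplicated information, plus a flags field" idea could look is sketched below. This is purely an illustration under my own assumptions, not a proposed on-disk format: the layout (one ID byte, one flags byte, an optional relation identifier, then the block number), the REF_SAME_REL flag, and all Mock* and *_block_ref names are hypothetical. The only optimization shown is dropping the relation identifier when it equals that of the previous block reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for RelFileNode: three 32-bit identifiers. */
typedef struct
{
    uint32_t    spc;
    uint32_t    db;
    uint32_t    rel;
} MockRelFileNode;

typedef struct
{
    uint8_t         id;         /* block reference ID */
    MockRelFileNode rnode;
    uint32_t        blkno;
} MockBlockRef;

#define REF_SAME_REL 0x01       /* relation omitted: same as previous ref */

static int
same_rel(const MockRelFileNode *a, const MockRelFileNode *b)
{
    return a->spc == b->spc && a->db == b->db && a->rel == b->rel;
}

/* Append one reference to out; returns the number of bytes written. */
static size_t
encode_block_ref(uint8_t *out, const MockBlockRef *ref,
                 const MockBlockRef *prev)
{
    size_t      off = 0;
    uint8_t     flags = 0;

    if (prev != NULL && same_rel(&prev->rnode, &ref->rnode))
        flags |= REF_SAME_REL;

    out[off++] = ref->id;
    out[off++] = flags;
    if (!(flags & REF_SAME_REL))
    {
        memcpy(out + off, &ref->rnode, sizeof(MockRelFileNode));
        off += sizeof(MockRelFileNode);
    }
    memcpy(out + off, &ref->blkno, sizeof(uint32_t));
    off += sizeof(uint32_t);
    return off;
}

/* Read one reference back; returns the number of bytes consumed. */
static size_t
decode_block_ref(const uint8_t *in, MockBlockRef *ref,
                 const MockBlockRef *prev)
{
    size_t      off = 0;
    uint8_t     flags;

    ref->id = in[off++];
    flags = in[off++];
    if (flags & REF_SAME_REL)
    {
        assert(prev != NULL);
        ref->rnode = prev->rnode;   /* inherit from the previous reference */
    }
    else
    {
        memcpy(&ref->rnode, in + off, sizeof(MockRelFileNode));
        off += sizeof(MockRelFileNode);
    }
    memcpy(&ref->blkno, in + off, sizeof(uint32_t));
    off += sizeof(uint32_t);
    return off;
}
```

In this sketch, a record touching two blocks of the same relation spends the relation identifier only once; the second reference costs the ID, the flags byte and the block number.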


- Heikki

