I'd like to make some changes to the WAL format in 9.5. I want to annotate each WAL record with the blocks that it modifies. Every WAL record already includes that information, but it's done in an ad hoc way, differently in every rmgr. The RelFileNode and block number are currently part of the WAL payload, and it's the REDO routine's responsibility to extract them. I want to include that information in a common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing which blocks a WAL record modifies. One such tool is pg_rewind; it currently has to understand every WAL record the backend writes. There's also a tool out there called pg_readahead, which does prefetching of blocks accessed by WAL records, to speed up PITR. I don't think that tool has been actively maintained, but at least part of the reason for that is probably that it's a pain to maintain when it has to understand the details of every WAL record type.


It'd also be nice for contrib/pg_xlogdump and the backend code itself. The boilerplate code in all WAL redo routines, and in the code that writes WAL records, could be simplified.

So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in the WAL-logged operation are explicitly registered, by calling a new XLogRegisterBuffer() function. Currently, buffers that need full-page images are registered by including them in the XLogRecData chain; with the new system, you call XLogRegisterBuffer() instead. You call that function for every buffer involved, even if no full-page image needs to be taken, e.g. because the page is going to be recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of the modified pages in the WAL payload. That information is automatically included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the buffers appear in XLogRecData entries. With the new API, the blocks are numbered explicitly. This is more convenient when a WAL record sometimes modifies a buffer and sometimes not. For example, a B-tree split needs to modify four pages: the original page, the new page, the right sibling (unless it's the rightmost page) and if it's an internal page, the page at the lower level whose split the insertion completes. So there are two pages that are sometimes missing from the record. With the new API, you can nevertheless always register e.g. original page as buffer 0, new page as 1, right sibling as 2, even if some of them are actually missing. SP-GiST contains even more complicated examples of that.

The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used in the redo routine to open/restore the same block.
buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also registering buffers that don't need full-page images, and perhaps also a function for registering a page that *always* needs a full-page image, regardless of the LSN. A few existing WAL record types just WAL-log the whole page, so those ad-hoc full-page images could be replaced with this.
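
To make the explicit numbering concrete, here's a toy, stand-alone model of the registration step. It is only a sketch: ToyBuffer, toy_refs, toy_register_buffer() and toy_register_split() are made-up names that mimic the shape of the proposed XLogRegisterBuffer(), and using ID 3 for the lower-level page is my assumption for illustration. Nothing here touches real buffers or WAL.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
typedef int ToyBuffer;
#define TOY_INVALID_BUFFER 0
#define TOY_MAX_BLOCK_REFS 4

typedef struct ToyBlockRef
{
    bool      in_use;      /* was this ID registered for the current record? */
    ToyBuffer buffer;
    bool      buffer_std;
} ToyBlockRef;

static ToyBlockRef toy_refs[TOY_MAX_BLOCK_REFS];

/* Forget all registrations, as would happen between WAL records. */
static void
toy_reset_refs(void)
{
    for (int i = 0; i < TOY_MAX_BLOCK_REFS; i++)
        toy_refs[i].in_use = false;
}

/* Same argument list as the proposed function, but it only fills a table. */
static void
toy_register_buffer(int blockref_id, ToyBuffer buffer, bool buffer_std)
{
    assert(blockref_id >= 0 && blockref_id < TOY_MAX_BLOCK_REFS);
    assert(!toy_refs[blockref_id].in_use);   /* each ID is used at most once */

    toy_refs[blockref_id].in_use = true;
    toy_refs[blockref_id].buffer = buffer;
    toy_refs[blockref_id].buffer_std = buffer_std;
}

/* Fixed IDs for a B-tree split; ID 3 is an assumption made for this sketch. */
enum { SPLIT_ORIG = 0, SPLIT_NEW = 1, SPLIT_RIGHT = 2, SPLIT_CHILD = 3 };

/*
 * Register the pages of a split. 'right' and 'child' may be
 * TOY_INVALID_BUFFER, in which case their IDs simply stay unused.
 * Returns the number of block references registered.
 */
static int
toy_register_split(ToyBuffer orig, ToyBuffer newpage,
                   ToyBuffer right, ToyBuffer child)
{
    int nregistered = 2;

    toy_reset_refs();
    toy_register_buffer(SPLIT_ORIG, orig, true);
    toy_register_buffer(SPLIT_NEW, newpage, true);
    if (right != TOY_INVALID_BUFFER)
    {
        toy_register_buffer(SPLIT_RIGHT, right, true);
        nregistered++;
    }
    if (child != TOY_INVALID_BUFFER)
    {
        toy_register_buffer(SPLIT_CHILD, child, true);
        nregistered++;
    }
    return nregistered;
}
```

The point of the sketch is that a missing page leaves a gap in the table instead of shifting the IDs of the pages that follow it, so the redo side can always ask for the same ID.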

With these changes, a typical WAL insertion would look like this:

        /* register the buffer with the WAL record, with ID 0 */
        XLogRegisterBuffer(0, buf, true);

        rdata[0].data = (char *) &xlrec;
        rdata[0].len = sizeof(BlahRecord);
        rdata[0].buffer_id = -1; /* -1 means the data is always included */
        rdata[0].next = &(rdata[1]);

        rdata[1].data = (char *) mydata;
        rdata[1].len = mydatalen;
        rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */
        rdata[1].next = NULL;

        ...
        recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

        PageSetLSN(BufferGetPage(buf), recptr);


(While we're at it, perhaps we should let XLogInsert set the LSN of all the registered buffers, to reduce the amount of boilerplate code).

(Instead of using a new XLogRegisterBuffer() function to register the buffers, perhaps they should be passed to XLogInsert as a separate list or array. I'm not wedded to the details...)

Redo
----

There are four different states a block referenced by a typical WAL record can be in:

1. The old page does not exist at all (because the relation was truncated later)
2. The old page exists, but has an LSN higher than current WAL record, so it doesn't need replaying.
3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

        /* If we have a full-page image, restore it and we're done */
        if (HasBackupBlock(record, 0))
        {
                (void) RestoreBackupBlock(lsn, record, 0, false, false);
                return;
        }
        buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
        /* If the page was truncated away, we're done */
        if (!BufferIsValid(buffer))
                return;

        page = (Page) BufferGetPage(buffer);

        /* Has this record already been replayed? */
        if (lsn <= PageGetLSN(page))
        {
                UnlockReleaseBuffer(buffer);
                return;
        }

        /* Modify the page */
        ...
        
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, whose return code indicates which of the four cases we're dealing with. A typical redo function would then look like this:

        if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
        {
                page = BufferGetPage(buffer);

                /* Modify the page */
                ...

                PageSetLSN(page, lsn);
                MarkBufferDirty(buffer);
        }
        if (BufferIsValid(buffer))
                UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference specified in the XLogRegisterBuffer call, when the WAL record was created.
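
For illustration, the four-way decision could be modelled like this. This is a toy sketch, not the proposed implementation: only BLK_REPLAY is named above, so the other result codes, ToyPage, toy_open_buffer() and toy_redo() are hypothetical names, and the "page" is just an exists flag plus an LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLSN;

/* One result code per state; only the replay case is named in the proposal. */
typedef enum ToyBlkResult
{
    TOY_BLK_NOTFOUND,   /* 1. the page no longer exists */
    TOY_BLK_DONE,       /* 2. page LSN is already at or past the record */
    TOY_BLK_REPLAY,     /* 3. the caller must apply the change */
    TOY_BLK_RESTORED    /* 4. a full-page image was restored */
} ToyBlkResult;

typedef struct ToyPage
{
    bool   exists;
    ToyLSN lsn;
} ToyPage;

/* Decide which state the block is in, in the same order as the old boilerplate. */
static ToyBlkResult
toy_open_buffer(ToyPage *page, ToyLSN record_lsn, bool has_full_page_image)
{
    if (has_full_page_image)
    {
        /* Restoring the image recreates the page and stamps it with the LSN. */
        page->exists = true;
        page->lsn = record_lsn;
        return TOY_BLK_RESTORED;
    }
    if (!page->exists)
        return TOY_BLK_NOTFOUND;
    if (record_lsn <= page->lsn)
        return TOY_BLK_DONE;
    return TOY_BLK_REPLAY;
}

/* A redo routine only has to act on the replay case. Returns true if it did. */
static bool
toy_redo(ToyPage *page, ToyLSN record_lsn, bool has_full_page_image)
{
    if (toy_open_buffer(page, record_lsn, has_full_page_image) == TOY_BLK_REPLAY)
    {
        /* "Modify the page" would go here. */
        page->lsn = record_lsn;
        return true;
    }
    return false;
}
```

Calling toy_redo() twice with the same record LSN applies the change only once, which is the property the LSN check in the existing boilerplate provides.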

WAL format
----------

The registered block references need to be included in the WAL record. We already do that for backup blocks, so a naive implementation would be to just include a BkpBlock struct for all the block references, even those that don't need a full-page image. That would be rather bulky, though, so that needs some optimization. Shouldn't be difficult to omit duplicated/unnecessary information, and add a flags field indicating which fields are present. Overall, I don't expect there to be any big difference in the amount of WAL generated by a typical application.
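
As a rough idea of what "a flags field indicating which fields are present" might look like, here is a toy encoder and decoder. It is purely an assumption for illustration, not a proposed on-disk layout: ToyRelId, ToyWalBlockRef, TOY_REF_SAME_REL and both functions are invented, and the only optimization shown is omitting the relation when it repeats the previous reference's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy relation identifier: three 32-bit IDs, loosely modelled on RelFileNode. */
typedef struct ToyRelId
{
    uint32_t spc;
    uint32_t db;
    uint32_t rel;
} ToyRelId;

typedef struct ToyWalBlockRef
{
    ToyRelId rel;
    uint32_t blkno;
} ToyWalBlockRef;

/* Flag bit: the relation is omitted because it equals the previous ref's. */
#define TOY_REF_SAME_REL 0x01

static int
toy_same_rel(const ToyRelId *a, const ToyRelId *b)
{
    return a->spc == b->spc && a->db == b->db && a->rel == b->rel;
}

/*
 * Encode nrefs references into 'out' and return the number of bytes written.
 * 'out' must have room for nrefs * (1 + sizeof(ToyRelId) + 4) bytes.
 */
static size_t
toy_encode_refs(const ToyWalBlockRef *refs, int nrefs, uint8_t *out)
{
    size_t off = 0;

    for (int i = 0; i < nrefs; i++)
    {
        uint8_t flags = 0;

        if (i > 0 && toy_same_rel(&refs[i].rel, &refs[i - 1].rel))
            flags |= TOY_REF_SAME_REL;
        out[off++] = flags;

        if (!(flags & TOY_REF_SAME_REL))
        {
            memcpy(out + off, &refs[i].rel, sizeof(ToyRelId));
            off += sizeof(ToyRelId);
        }
        memcpy(out + off, &refs[i].blkno, sizeof(uint32_t));
        off += sizeof(uint32_t);
    }
    return off;
}

/* Decode up to maxrefs references; returns how many were read. */
static int
toy_decode_refs(const uint8_t *in, size_t len, ToyWalBlockRef *refs, int maxrefs)
{
    size_t off = 0;
    int    n = 0;

    while (off < len && n < maxrefs)
    {
        uint8_t flags = in[off++];

        if (flags & TOY_REF_SAME_REL)
        {
            assert(n > 0);                  /* first ref always carries a relation */
            refs[n].rel = refs[n - 1].rel;
        }
        else
        {
            memcpy(&refs[n].rel, in + off, sizeof(ToyRelId));
            off += sizeof(ToyRelId);
        }
        memcpy(&refs[n].blkno, in + off, sizeof(uint32_t));
        off += sizeof(uint32_t);
        n++;
    }
    return n;
}
```

In this toy layout, three references to the same relation take 27 bytes (17 for the first, 5 for each of the others) rather than 51 if the relation were repeated every time. A real format would have more fields to consider, so treat the numbers as a property of the sketch only.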

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)