On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 06/27/2014 09:12 AM, Michael Paquier wrote: > >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and >> XLogGetBlockRefIds are missing in src/backend/access/transam/README. >> > > Added descriptions of XLogBeginInsert and XLogHasBlockRef. IÍ„'m not sure > what to do with the latter two. They're not really intended for use by redo > routines, but for things like pg_xlogdump that want to extract information > about any record type. That's definitely not part of the redo section or WAL construction section. It could be a new section in the README describing how to get the list of blocks touched by a WAL record. I'd rather have those APIs documented somewhere than nowhere, and the README would be well-suited for that (as well as in-code comments) as those APIs are aimed at developers. > 3) There are a couple of code blocks marked as FIXME and BROKEN. There are >> as well some NOT_USED blocks. >> > > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They > mostly stand for code that used to print block numbers or relfilenodes, > which doesn't make much sense to do in an ad hoc way in rmgr-specific desc > routines anymore. Will need to go through them all an decide what's the > important information to print for each record type. > > The few NOT_USED blocks are around code in redo routines that extract some > information from the WAL record, that isn't needed anymore. We could remove > the fields from the WAL records altogether, but might be useful to keep for > debugging purposes. They could be removed and be kept as a part of the git history... That's not really a vital point btw. > > 6) XLogRegisterData creates a XLogRecData entry and appends it in one of >> the static XLogRecData entries if buffer_id >= 0 or to >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the >> WAL record). XLogRegisterXLogRecData does something similar, but with an >> entry of XLogRecData already built. What do you think about removing >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the >> interface simpler, and XLogRecData could be kept private in xlog.c. This >> would need more work especially on gin side where I am seeing for example >> constructLeafRecompressWALData directly building a XLogRecData entry. >> > > Another big change in the attached patch version is that > XLogRegisterData() now *copies* the data to a working area, instead of > merely adding a XLogRecData entry to the chain. This simplifies things in > xlog.c, now that the XLogRecData concept is not visible to the rest of the > system. This is a subtle semantic change for the callers: you can no longer > register a struct first, and fill the fields afterwards. > > That adds a lot of memcpys to the critical path, which could be bad for > performance. In practice, I think it's OK, or even a win, because a typical > WAL record payload is very small. But I haven't measured that. If it > becomes an issue, we could try to optimize, and link larger data blocks to > the rdata chain, but memcpy small small chunks of data. There are a few WAL > record types that don't fit in a small statically allocated buffer, so I > provided a new function, XLogEnsureSpace(), that can be called before > XLogBegin() to allocate a large-enough working area. > I just ran the following test on my laptop with your latest patch: - Patched version: =# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a; SELECT 10000000 Time: 32913.419 ms =# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a; SELECT 10000000 Time: 32261.045 ms - master (d97e98e): =# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a; SELECT 10000000 Time: 29651.394 ms =# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a; SELECT 10000000 Time: 29734.887 ms 10% difference... That was just a quick test though. These changes required changing a couple of places where WAL records are > created: > > * ginPlaceToPage. This function has a strange split of responsibility > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. > ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what > kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes > some data to the WAL record. Then ginPlaceToPage adds some more, and > finally calls XLogInsert(). I simplified the SPLIT case so that we now just > create full page images of both page halves. We were already logging all > the data on both page halves, but we were doing it in a "smarter" way, > knowing what kind of pages they were. For example, for an entry tree page, > we included an IndexTuple struct for every tuple on the page, but didn't > include the item pointers. A straight full page image takes more space, but > not much, so I think in any real application it's going to be lost in > noise. (again, haven't measured it though) Thanks, the code looks cleaner this way. Having looked a bit at the way ginPlaceToPage works, it seems a bit sensible to refactor dataPlaceToPageLeaf and dataPlaceToPageInternal to make the WAL operations smarter than they are now. I am not saying that this is impossible, just not that straight-forward, especially considering that this code is like that for a long time. I actually tried to run some tests on the WAL difference it generates, but got discouraged by the email Fujii-san sent reporting the crash in gin code path. > 10) Shouldn't the call to XLogBeginInsert come first here? >> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState >> *state, >> { >> XLogRecPtr recptr; >> >> - recptr = XLogInsert(RM_SPGIST_ID, >> XLOG_SPGIST_ADD_NODE, rdata); >> + XLogRegisterBuffer(0, saveCurrent.buffer, >> REGBUF_STANDARD); /* orig page */ >> + XLogRegisterBuffer(1, current->buffer, >> REGBUF_STANDARD); /* new page */ >> + if (xlrec.parentBlk == 2) >> + XLogRegisterBuffer(2, parent->buffer, >> REGBUF_STANDARD); >> + >> + XLogBeginInsert(); >> + XLogRegisterData(-1, (char *) &xlrec, >> sizeof(xlrec)); >> + XLogRegisterData(-1, (char *) newInnerTuple, >> newInnerTuple->size); >> + >> + recptr = XLogInsert(RM_SPGIST_ID, >> XLOG_SPGIST_ADD_NODE); >> > > Yeah. Apparently we have no regression test coverage of this codepath, or > it would've triggered an assertion. Fixed now, just by looking at the code, > but there's still no test coverage of this so it's likely still broken in > other ways. Before this is committed, I think we'll need to go through the > coverage report and make sure all the XLogInsert() codepaths (and their > redo routines) are covered. Hm. This is not really related to this patch as it means that even current code is not tested properly. I wouldn't mind taking a closer look at the test suite and come up new tests improving coverage of WAL replay and record construction... > Also, I wanted to run the WAL replay facility to compare before and after >> buffer images, but code crashed before... I am still planning to do so >> once >> this code gets more stable though. >> > > I did that earlier, and did it now again. I got a single difference from a > sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to > investigate it later. > OK, thanks. I haven't looked at the patch in more details: 1) Noticed a small type => s/reprsent/represent/g 2) This time make installcheck passed for both master and standby, even on contrib stuff... 3) I noticed a bug in gin redo code path when trying to use the WAL replay facility though so I couldn't make a proper run with it. -- Michael