Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)
        }
    }
 
-   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",
                            long_options, &option_index)) != -1)
    {
        switch (c)

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you
  don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from
  XLogRegisterBuffer() the way you've done it ? If we ever actually get
  a record with a large numbers of blocks touched this issentially is
  O(touched_buffers*data_entries).

* At least in Assert mode XLogRegisterBuffer() should ensure we're not
  registering the same buffer twice. It's a pretty easy to make mistake
  to do it twice for some kind of actions (think UPDATE), and the
  consequences will be far from obvious afaics.

* There's lots of functions like XLogRecHasBlockRef() that dig through
  the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
    XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
    oldblk = newblk;

  I think doing that repeatedly is quite a bad idea. We should parse the
  record once and then use it in a sensible format. Not do it in pieces,
  over and over again. It's not like we ignore backup blocks - so doing
  this lazily doesn't make sense to me.
  Especially as ValidXLogRecord() *already* has parsed the whole damn
  thing once.

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void*
  instead of char*? Right now there's lots of pointless casts in callers
  needed that don't add any safety. The one additional line that's then
  needed in these functions seems well worth it.

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
  refer to the relation, but not to the block number. These still log
  their rnode manually. Shouldn't we somehow deal with those in a
  similar way explicit block references are dealt with?

* You've added an assert in RestoreBackupBlockContents() that ensures
  the page isn't new. That wasn't there before, instead there was a
  special case for it. I can't immediately think how that previously
  could happen, but I do wonder why it was there.

* The following comment in +RestoreBackupBlock probably is two lines to
  early
+   /* Found it, apply the update */
+   if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+       return InvalidBuffer;
  This new InvalidBuffer case probably could use a comment in either
  XLogReadBufferForRedoExtended or here.

* Most of the commentary above RestoreBackupBlock() probably doesn't
  belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just
  into flags_fork? Right now it makes at least me think that it's fork
  specific flags, which really isn't the actual meaning.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
  compression patch, to keep the bkp block data after *all* the
  "headers". That'd make it easier to just compress the data.

* +InitXLogRecordConstruct() seems like a remainder of the
  xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
  other slru stuff) doesn't include a reference to the page. Isn't that
  bad? Shouldn't we make XLogRegisterBlock() usable for that case?
  Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

* Why did you duplicate the identical cases in btree_desc()? I guess due
  to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be
  completely wrong, but won't the output of e.g. updates be harder to
  read now because it's not clear which of the references old/new
  buffers are? Not that it matters that much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for
  me.

* I think heap_xlog_update is buggy for wal_level=logical because it
  computes the length of the tuple using
  tuplen = recdataend - recdata;
  But the old primary key/old tuple value might be stored there as
  well. Afaics that code has to continue to use xl_heap_header_len.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
  to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);
 or have I misunderstood how that works?

* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
         * XLOG record's LSN, we mustn't mark the page all-visible, because
         * the subsequent update won't be replayed to clear the flag.
         */
-       page = BufferGetPage(buffer);
-       PageSetAllVisible(page);
-       MarkBufferDirty(buffer);
+           page = BufferGetPage(buffer);
+           PageSetAllVisible(page);
+           MarkBufferDirty(buffer);
    }

* Somewhat long line:
        XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? 
REGBUF_STANDARD : (REGBUF_STANDARD | REGBUF_NO_IMAGE));

Man. This page is friggin huge. Now I'm tired ;)


This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd
  be very surprised if there weren't bugs hidden somewhere. I've lost
  the energy to read them all now.
* A good amount of consideration about the increase in WAL volume. I
  think there's some cases where that's not inconsiderable.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to