Re: [HACKERS] [PATCH] Logical decoding support for sequence advances

Petr Jelinek Mon, 29 Feb 2016 13:35:42 -0800


On 29/02/16 03:23, Craig Ringer wrote:

On 17 December 2015 at 10:08, Craig Ringer <cr...@2ndquadrant.com
<mailto:cr...@2ndquadrant.com>> wrote:

    On 15 December 2015 at 20:17, Andres Freund <and...@anarazel.de
    <mailto:and...@anarazel.de>> wrote:


        I think this is quite the wrong approach. You're calling the logical
        decoding callback directly from decode.c, circumventing
        reorderbuffer.c. While you don't need the actual reordering, you
        *do*
        need the snapshot integration.


    Yeah. I thought I could get away without it because I could use
    Form_pg_sequence.sequence_name to bypass any catalog lookups at all,
    but per upthread that field should be ignored and can't be used.

        As you since noticed you can't just
        look into the sequence tuple and be happy with that, you need to do
        pg_class lookups et al. Furhtermore callbacks can validly expect
        to have
        a snapshot set.


    Good point. Just because I don't need one doesn't mean others won't,
    and all the other callbacks do.

    I'll have a read over reorderbuffer.c and see if I can integrate
    this properly.

        At the very least what you need to do is to SetupHistoricSnapshot(),
        then lookup the actual identity of the sequence using
        RelidByRelfilenode() and only then you can call the callback.


    Yeah. That means it's safe for the callback to do a syscache lookup
    for the sequence name by oid.


I've revisited this patch in an attempt to get a corrected version ready
for the next CF.

Turns out it's not that simple.


Sequences advances aren't part of a transaction (though they always
happen during one) and proceed whether the xact that happened to trigger
them commits or rolls back. So it doesn't make sense to add them to a
reorder buffer for an xact. We want to send it immediately, not
associate it with some arbitrary xact that just happened to use the last
value in the cache that might not commit for ages.


But then when do we send it? If sent at the moment of decoding the
advance from WAL then it'll always be sent to the client between the
commit of one xact and the begin of another, because it'll be sent while
we're reading xlog and populating reorder buffers. For advance of an
already-created sequence advance that's what we want, but CREATE
SEQUENCE, ALTER SEQUENCE etc also log RM_SEQ_ID XLOG_SEQ_LOG operations.
The xact that created the sequence isn't committed yet so sending the
advance to the downstream will lead to attempting to advance a sequence
that doesn't yet exist. Similarly ALTER SEQUENCE emits a new
XLOG_SEQ_LOG with the updated Form_pg_sequence. All fine for physical
replication but rather more challenging for logical replication since
sometimes we want the sequence change to be associated with a particular
xact and sometimes not.


So the reorder buffer has to keep track of sequences. It must check to
see if a catalog change is a sequence creation and if so mark the
sequence as pending, keeping a copy of the Form_pg_sequence that's
updated to the latest version as the xact proceeds and writes updates to
the sequence. On commit the sequence advance is replayed at the end of
the xact using the snapshot of the newly committed xact; on rollback
it's discarded since the sequence never became visible to anyone else.
We can safely assert that that sequence will not be updated by any other
xact until this one commits.


In reorderbuffer.c, there's a test for relation->rd_rel->relkind ==
RELKIND_SEQUENCE that skips changes to sequence relations. We'll want to
replicate sequence catalog updates as DDL via event triggers and deparse
so that's OK, but I think we'll need to make a note of sequence creation
here to mark new sequences as uncommitted.


If we see a sequence change that isn't for an uncommitted newly-created
sequence we make an immediate call through the reorder buffer to send
the sequence update info to the client. That's OK even if it's something
like ALTER SEQUENCE ... INCREMENT 10 RENAME TO othername; . The ALTER
... RENAME part gets sent with the xact that did it when/if it commits
since it's part of the pg_class tuple for the sequence; the INCREMENT
gets sent immediately since it's part of the Form_pg_sequence. That
matches what happens locally, and it means there's no need to keep track
of every sequence, only new ones.


On commit or rollback of an xact that creates a sequence we pop the
sequence oid from the ordered list of uncommitted sequence oids that
must be kept by the decoding session.


If we land up decoding the same WAL again we might send sequence updates
that temporarily move a sequence backwards then advance it again when we
replay the newer updates. That's the same as hot standby and seems fine.
You obviously won't be able to safely get new values from sequences
that're replicated from an upstream on the downstream anyway - and I
anticipate that logical decoding receivers will probably want to use
seqam etc later to make those sequences read-only until a failover event.


Sound reasonable?

I wonder if it would be acceptable to create new info flag for RM_SEQ_IDthat would behave just like XLOG_SEQ_LOG but would be used only for thenontransactional updates (nextval) so that decoding could easilydifferentiate between transactional and non-transactional update ofsequence and then just either call the callback immediately or add thechange to reorder buffer based on that. The redo code could just havesimple OR expression to behave same with both of the info flags.

Seems like simpler solution than building all the tracking code on thedecoding side to me.



--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Logical decoding support for sequence advances

Reply via email to