Thanks for this write-up.

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> 1) The ability to monitor how far replication has progressed in a
>    crash-safe manner, to allow it to restart at the right point after
>    errors/crashes.
> 2) Efficiently identify the origin of individual changes and
>    transactions. In multimaster and some cascading scenarios it is
>    necessary to do so to avoid sending out replayed changes again.
> 3) Allow replayed changes to be filtered out efficiently during
>    logical decoding. It's currently possible to filter changes from
>    inside the output plugin, but it's much more efficient to filter
>    them out before decoding.

I agree with these goals.

Let me try to summarize the information requirements for each of these
things.  For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message.  For #2, you need to know, when decoding each
change, what the origin node was.  And for #3, you need to know, when
decoding each change, whether it is of local origin.  The information
requirements for #3 are a subset of those for #2.

> A rather efficient solution for 1) is to attach the 'origin node' and
> the remote commit LSN to every local commit record produced by
> replay. That makes it possible to keep a shared memory "table"
> (remote_node, local_lsn, remote_lsn).

This seems OK to me, modulo some arguing about what the origin node
information ought to look like.  People who are not using logical
replication can use the compact form of the commit record in most
cases, and people who are using logical replication can pay for it.
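Just to make sure I'm following, here's roughly the structure I imagine for that shared-memory table. This is a plain C sketch with invented names (ReplicationProgress, advance_progress, the fixed-size array, the id-0-means-empty convention) rather than anything from an actual patch:

```c
#include <assert.h>
#include <stdint.h>

/* Invented stand-ins for PostgreSQL's types, for a self-contained sketch. */
typedef uint64_t XLogRecPtr;    /* an LSN */
typedef uint16_t RepNodeId;     /* the proposed short 16-bit origin id */

/* One slot per tracked upstream node, updated whenever replay of a
 * commit record carrying origin information completes. */
typedef struct ReplicationProgress
{
    RepNodeId   remote_node;    /* which upstream this slot tracks; 0 = unused here */
    XLogRecPtr  local_lsn;      /* LSN of the local commit record */
    XLogRecPtr  remote_lsn;     /* the origin's commit LSN, from the record */
} ReplicationProgress;

#define MAX_TRACKED_NODES 16
static ReplicationProgress progress[MAX_TRACKED_NODES];

/* Hypothetical hook called during replay of an origin-bearing commit
 * record: find (or claim) the slot for this node and advance it. */
static void
advance_progress(RepNodeId node, XLogRecPtr local_lsn, XLogRecPtr remote_lsn)
{
    for (int i = 0; i < MAX_TRACKED_NODES; i++)
    {
        if (progress[i].remote_node == node || progress[i].remote_node == 0)
        {
            progress[i].remote_node = node;
            progress[i].local_lsn = local_lsn;
            progress[i].remote_lsn = remote_lsn;
            return;
        }
    }
}
```

After crash recovery, replaying those commit records rebuilds this table, which is what makes the per-standby restart point crash-safe.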

> Similarly, the problem of identifying the origin of changes during
> decoding can be solved nicely by attaching the origin node to every
> change/transaction. At first it might seem sufficient to do so at the
> transaction level, but for cascading scenarios it's very useful to be
> able to have changes from different source transactions combined into
> a larger one.

I think this is a lot more problematic.  I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out.  But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".
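To sketch what I mean (invented names throughout, not a worked-out design): the record would carry just an XID/origin pair, the decoder would remember the assignment when it sees the record, and then consult that mapping for every subsequent change of that XID instead of reading an origin id off each DML record:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint16_t RepOriginId;

/* Hypothetical WAL record: "xid will be decoding for origin until
 * further notice".  One of these per remote transaction, rather than
 * extra bytes on every insert/update/delete. */
typedef struct xl_origin_assign
{
    TransactionId   xid;        /* transaction whose changes follow */
    RepOriginId     origin;     /* short origin id, e.g. 567 */
} xl_origin_assign;

#define MAX_INPROGRESS 64
static xl_origin_assign assignments[MAX_INPROGRESS];
static int n_assignments;

/* Decoder side: remember the assignment when the record is decoded. */
static void
remember_origin(TransactionId xid, RepOriginId origin)
{
    assignments[n_assignments].xid = xid;
    assignments[n_assignments].origin = origin;
    n_assignments++;
}

/* Consulted for each change; returns 0 ("local") when no assignment
 * was logged for the XID, so locally originated work needs no record. */
static RepOriginId
origin_of(TransactionId xid)
{
    for (int i = 0; i < n_assignments; i++)
        if (assignments[i].xid == xid)
            return assignments[i].origin;
    return 0;
}
```

Since the assignment record appears before the transaction's changes, a filter could still chuck the WAL for unwanted origins up front, without reassembling anything.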

> == Replication Identifiers ==
>
> The above explains the need to have identifiers for nodes that are as
> small as possible. Two years back I'd suggested that we rely on the
> user to manually assign 16bit ids to individual nodes. Not very
> surprisingly that was shot down because a) 16bit numbers are not
> descriptive and b) a per-node identifier is problematic because it
> prohibits replication inside the same cluster.
>
> So, what I've proposed since is to have two different forms of
> identifiers. A long one that's as descriptive as
> $replication_solution wants, and a short one (16bit in my case)
> that's *only meaningful within one node*. The long, user-facing
> identifier is locally mapped to the short one.
>
> In the first version I proposed these long identifiers had a fixed
> form, including the system identifier, timeline id, database id, and a
> freeform name. That wasn't well received and I agree that that's too
> narrow. I think it should be a freeform text of arbitrary length.
>
> Note that it depends on the replication solution whether these
> external identifiers need to be coordinated across systems or not. I
> think it's *good* if we don't propose a solution for that - different
> replication solutions will have different requirements.

I'm pretty fuzzy on how this actually works.  Like, the short form
here is just getting injected into WAL by the apply process.  How does
it figure out what value to inject?  What if it injects a value that
doesn't have a short-to-long mapping?  What's the point of the
short-to-long mappings in the first place?  Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?
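For reference, here's the naive model I have in my head of that local mapping; every name here is invented, and this may well not match what you intend:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A local catalog mapping a freeform external name (coordinated, or
 * not, by the replication solution) to a node-local 16-bit id, which
 * is what would actually be written into WAL. */
#define MAX_ORIGINS 32
#define NAME_LEN    64

static char     origin_names[MAX_ORIGINS][NAME_LEN];
static uint16_t n_origins;      /* id 0 reserved for "local origin" */

/* Look up the short id for an external name, creating a mapping on
 * first use.  Ids are only meaningful within this node. */
static uint16_t
origin_get_id(const char *external_name)
{
    for (uint16_t i = 0; i < n_origins; i++)
        if (strcmp(origin_names[i], external_name) == 0)
            return (uint16_t) (i + 1);
    assert(n_origins < MAX_ORIGINS);
    strncpy(origin_names[n_origins], external_name, NAME_LEN - 1);
    return ++n_origins;
}

/* Reverse mapping, for showing the descriptive name to users. */
static const char *
origin_get_name(uint16_t id)
{
    return (id >= 1 && id <= n_origins) ? origin_names[id - 1] : NULL;
}
```

If that's roughly right, then my question stands: the apply process presumably calls something like origin_get_id on its configured name at startup and stamps the result into WAL, but I don't see what stops it from injecting a short id that has no mapping.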

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

