On 7 April 2016 at 23:32, Robert Haas <robertmh...@gmail.com> wrote:

> > Yeah. I understand the reasons for that decision. Per an earlier reply I
> > think we can avoid making them WAL-logged so they can be used on standbys
> > and still achieve usable failover support on physical replicas.
> I think at one point we may have discussed doing this via additional
> side-channel protocol messages.  Is that what you are thinking about
> now, or something else?

Essentially, yes.

The way I'd like to do it in 9.6+1 is:

- Require that the replica(s) use streaming replication with a replication
slot to connect to the master

- Extend the feedback protocol to allow the replica to push its required
catalog_xmin up to the master so it doesn't vacuum away catalog tuples
still needed by a replica. (There's no need for it to push the restart_lsn,
and it's fine for the master to throw away WAL still needed by a replica).

- Track the replica's catalog_xmin on the replica's slot. So it'll be a
slot used for physical replication that also has a catalog_xmin set.

- Allow applications to create their own slots on read-replicas (for apps
that want to decode from a standby).

- For transparent failover, sync slot state from master to replica: a
helper on the master writes it to a table, and a helper on the standby
applies it.

Aware apps could use slot creation on a replica to handle failover
themselves, but only if they know about and can connect to all the failover
candidate replica(s), and they have to maintain and advance a slot on each.
Potentially fragile. So this is mostly good for supporting decoding from a
standby, rather than failover.

I really want easy, practical failover that doesn't require every app to
re-implement logic to keep track of replicas itself, etc. For that I'd
have a bgworker that runs on the replica make a direct libpq connection
to the master and snapshot the state of its slots plus its xlog insert
position. The worker would wait until the replica's replay reaches/passes
that xlog position before applying the slot state copied from the master to
the replica. (Adding a "GET SLOT STATE" or whatever to the walsender
interface would make this less ugly). This basically emulates what failover
slots did, but lazily: with no hook to capture slot state saves, we have to
poll the master. With no ability to insert the changes into WAL and run a
custom redo function on the replica we have to manually ensure they're
applied at the right time. Unlike with failover slots it's possible for the
slot on the replica to be behind where it was on the master at the same LSN
- but that's OK because we're protecting against catalog vacuum, per above.
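
The wait-then-apply logic the bgworker needs is simple to state precisely. A sketch in Python, with callables standing in for the walreceiver/bgworker plumbing (all the names here are illustrative, not actual PostgreSQL APIs):

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN string like '16/B374D848' to an integer."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def apply_slot_state_when_safe(snapshot, get_replay_lsn, apply_slots, poll):
    """Apply a snapshot of the master's slot state once the replica's
    replay position has reached the master's xlog insert position taken
    at snapshot time. 'snapshot' is {'insert_lsn': str, 'slots': [...]};
    the other arguments stand in for bgworker plumbing."""
    target = lsn_to_int(snapshot['insert_lsn'])
    while lsn_to_int(get_replay_lsn()) < target:
        poll()  # in a real bgworker: WaitLatch() with a timeout, then recheck
    apply_slots(snapshot['slots'])
```

Waiting for replay to pass the snapshot's insert position is what guarantees the copied slot state never refers to WAL or catalog state the replica hasn't replayed yet.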

(I'd really like to have a slot flag that lets us disallow replay from such
copied slots on the replica, marking them as usable only if not in
recovery.)

The only part of this that isn't possible on 9.6 is having the replica push
the catalog_xmin up to the master over feedback. But we can emulate that
with a bgworker on the replica that maintains an additional dummy logical
slot on the master. It replays the dummy slot to the lowest confirmed_lsn
of any slot on the replica. Somewhat inefficient, but will work.
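
The rule for advancing that dummy slot is just "the minimum confirmed_flush position across the replica's slots". A sketch of that computation (slot representation here is assumed, not pg's actual catalog shape):

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN string like '0/3000060' to an integer."""
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def dummy_slot_target(replica_slots):
    """Given the replica's logical slots as dicts with a
    'confirmed_flush_lsn' key, return the LSN the dummy slot on the
    master should be replayed to: the minimum across all slots, so the
    master keeps catalog tuples needed by every slot on the replica."""
    laggard = min(replica_slots,
                  key=lambda s: lsn_to_int(s['confirmed_flush_lsn']))
    return laggard['confirmed_flush_lsn']
```

Advancing the dummy slot past that point would let the master vacuum catalog rows the slowest replica-side slot still needs, which is exactly the failure this workaround exists to prevent.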

If it sounds like a bit of a pile of hacks, that's because the failover
support part is. But unlike failover slots it will bring us closer to being
able to do logical decoding from a replica, which is nice. It can be made a
lot less ugly if the walsender can be enlisted to report the master's slot
state, so we don't have to use normal libpq. (The reason I
wouldn't use a bgworker on the master to write it to a table then another
worker to apply changes from that table on the replica is mainly that then
we can't have failover support for cascading replicas, which can't write to
tables.)

> Well, we can agree to disagree on this.  I don't think that it's all
> that difficult to figure out how to change your schema in a
> step-by-step way that allows logical replication to keep working while
> the nodes are out of sync, but you don't have to agree and that's
> fine.  I'm not objecting to eventually adding that feature to core.  I
> do think it's a bad idea to be polishing that sort of thing before
> getting some more basic facility into core.

That much I agree on - I certainly don't want to block this on DDL
replication.

> While I acknowledge that a logical output plugin has applications
> beyond replication, I think replication is the preeminent one by a
> considerable margin.  Some people want it for other purposes, and we
> should facilitate that.  But that number of people is dwarfed by the
> number who would use a seamless logical replication facility if we had
> one available.  And so that's the thing I think we should focus on
> making work.


Really I think we'll want a separate json output plugin for most of those
other uses anyway, though some of the facilities in pglogical_output will
need to be extracted, added into logical decoding itself, and shared.

> If I were doing that, I think I would attack it from a considerably
> different direction than what has so far been proposed.  I would try
> to put the stuff in core, not contrib, and I would arrange to control
> it using DDL, not function calls.  For version one, I would cut all of
> the stuff that allows data to be sent in any format other than text,
> and just use in/outfuncs all the time.

I'm very hesitant to try to do this with new DDL. Partly for complexity,
partly because I'd really like to be able to carry a backport for 9.4 / 9.5
/ 9.6 so people can use it within the next couple of years.

> I do generally think that logical decoding relies too much on trying
> to set up situations where it will never fail, and I've said from the
> beginning that it should make more provision to cope with failure
> rather than just trying to avoid it.  If logical decoding never
> breaks, great.  But the approach I would favor is to set things up so
> that it automatically reclones if there is a replication break, and
> then as an optimization project, try to eliminate those cases one by
> one.

I really can't agree there.

For one thing, sometimes those clones are *massive*. How do you tell
someone who's replicating a 10 TiB database that they've got to let the
whole thing re-sync, and by the way all replication will completely halt
until it does?
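
To put a rough number on that (the transfer rate is an assumption for illustration, and real re-clones also pay index build and catch-up costs on top):

```python
# Back-of-envelope: how long does a full re-clone of 10 TiB take
# at an optimistic sustained 200 MiB/s?
size_bytes = 10 * 1024**4          # 10 TiB
rate = 200 * 1024**2               # assumed 200 MiB/s sustained
hours = size_bytes / rate / 3600   # ~14.6 hours of copying
```

And for that entire window, replication to the downstream is halted.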

It's bad enough with physical standby, though at least there rsync helps a
bit and pg_rewind has made a huge difference. Let's not create the same
problem again in logical replication.

Then, as Petr points out, there are applications where you can't re-clone,
at least not directly. You're using the decoding stream with a transform
downstream to insert incoming data into fact tables. You're feeding it into
a messaging system.  You're merging data from multiple upstreams into a
single downstream. Many of the interesting, exciting things logical
replication makes possible, things that simply aren't possible with
physical replication, really need it not to randomly break.

Also, from an operational experience point of view, BDR has places where it
does just break if you do something wrong. Experience with this has left me
absolutely adamant that we need not to have such booby-traps in core
logical replication, at least in the medium term. It doesn't matter how
many times you tell users "don't do that" ... they'll do it. Then get angry
when it breaks. Not to mention how hard it can be to discover why something
broke. You have to look at the logs. Obvious to you or me, but I spend a
lot of time answering questions about BDR and pglogical to the effect of
"not working? nothing happening? LOOK AT THE LOG FILES."

I think you're really over-estimating the average user when it comes to
analysing and understanding the consequence of specific schema changes,
etc. Sure, I think it's fine not to support some things we do support for
physical, but as much as possible we should stop the user doing those when
logical replication is enabled, and where that's not possible it needs to
be really, really obvious what broke and why.

> That isn't really my perception of how things have gone, but I admit I
> may not have been following it as closely as I should have done.  I'd
> be happy to talk with you about this in person if you're going to be
> at PGCONF.US or PGCon in Ottawa; or there's always Skype.  I don't see
> any reason why we can't all put our heads together and agree on a
> direction that we can all live with.

Yeah. I'd like to see us able to work on a shared dev tree rather than
mailing patches around, at least.

I'm not going to make it to PgConf.US. I don't know about Ottawa yet, but I
doubt it given that I did go to Brussels. Perth is absurdly far from almost
everywhere, unfortunately.

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
