I didn't follow all of that, but I wonder whether the answer is simply that
you should create a new slot when you restore from backup?
On 3 Aug 2016 14:39, "Craig Ringer" <cr...@2ndquadrant.com> wrote:
> Hi all
> I think we have a bit of a problem with the behaviour specified for
> logical slots, one that makes it hard for an outdated snapshot or backup
> of a logical-slot-using downstream to know it's missing a chunk of data
> that's been consumed from a slot. That's not great, since slots are
> supposed to ensure a continuous, gapless data stream.
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's
> request and start replay at the confirmed_flush_lsn. That's by design and
> fine normally, since we know the gap LSNs contained no transactions of
> interest to the downstream.
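
A minimal sketch of the start-position rule described above, in Python rather than the real walsender code. The function names are made up for illustration; the only fact taken from PostgreSQL is the "hi/lo" hexadecimal text form of an LSN.

```python
def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN text form to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def effective_start(requested: str, confirmed_flush: str) -> int:
    """Model of the current behaviour: a request older than the slot's
    confirmed_flush_lsn is silently moved forward to confirmed_flush_lsn."""
    return max(parse_lsn(requested), parse_lsn(confirmed_flush))


if __name__ == "__main__":
    # The client asks for 0/1000 but the slot has already confirmed 0/3000.
    start = effective_start("0/1000", "0/3000")
    print(f"replay starts at {start >> 32:X}/{start & 0xFFFFFFFF:X}")
```

The client gets no error and no warning in this model, which matches the "silently ignore" behaviour in the quoted paragraph.
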
> But it's *bad* if the downstream is actually a copy of the original
> downstream that's been started from a restored backup/snapshot. In that
> case the downstream won't know that some other client, probably a newer
> instance of itself, consumed rows it should've seen. It'll merrily
> continue replaying and not know it isn't consistent.
> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream
> does a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes
> on the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the
> confirmed_flush_lsn and restart_lsn in the replication slot on the upstream
> so we can free WAL that's not needed and move the catalog_xmin up. The
> replication origin on the downstream falls behind the confirmed_flush_lsn
> on the upstream.
> This means that if the downstream exits/crashes before receiving some new
> row, its replication origin will tell it that it last replayed some LSN
> older than what it really did, and older than what the server retains.
> Logical decoding doesn't allow the client to "jump backwards" and replay
> anything older than the confirmed_flush_lsn. Since we "know" that the gap cannot
> contain rows of interest, otherwise we'd have updated the replication
> origin, we just skip and start replay at the confirmed_flush_lsn.
> That means that if the downstream is restored from a backup it has no way
> of knowing it can't rejoin and become consistent because it can't tell the
> difference between "everything's fine, replication origin intentionally
> behind confirmed_flush_lsn due to activity not of interest" and "we've
> missed data consumed from this slot by some other peer and should refuse to
> continue replay".
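
To make the ambiguity concrete, here is a toy model (not PostgreSQL code; every class and function name is an assumption for illustration, and LSNs are plain integers). It plays out both cases and shows that the reconnecting downstream observes exactly the same thing in each.

```python
from dataclasses import dataclass, replace


@dataclass
class Slot:
    confirmed_flush_lsn: int = 0  # upstream side


@dataclass
class Downstream:
    origin_lsn: int = 0  # last LSN durably recorded in the local replication origin


def apply_row(slot, node, lsn):
    """A row of interest: the origin and the slot both advance."""
    node.origin_lsn = lsn
    slot.confirmed_flush_lsn = lsn


def ack_keepalive(slot, node, lsn):
    """Uninteresting WAL: the slot advances, the origin is deliberately left alone."""
    slot.confirmed_flush_lsn = lsn


def observed_on_reconnect(slot, node):
    """All the downstream can see: its own origin and where replay actually starts."""
    return node.origin_lsn, max(node.origin_lsn, slot.confirmed_flush_lsn)


def harmless_gap():
    slot, node = Slot(), Downstream()
    apply_row(slot, node, 100)
    ack_keepalive(slot, node, 300)  # nothing of interest between 100 and 300
    return observed_on_reconnect(slot, node)


def restored_backup():
    slot, node = Slot(), Downstream()
    apply_row(slot, node, 100)
    backup = replace(node)          # snapshot of the downstream taken here
    apply_row(slot, node, 200)      # rows the backup never saw
    apply_row(slot, node, 300)
    return observed_on_reconnect(slot, backup)


if __name__ == "__main__":
    print(harmless_gap(), restored_backup())
```

In this model both calls return the same pair, an origin at 100 with replay starting at 300, even though the restored backup has lost the rows at 200 and 300.
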
> The simplest fix would be to require downstreams to flush their
> replication origin when they get a keepalive message, before they send
> a reply with confirmation. That could be somewhat painful for
> performance, but can be alleviated somewhat by waiting for the downstream
> postgres to get around to doing a flush anyway and only forcing it if we're
> getting close to the walsender timeout. That's pretty much what BDR and
> pglogical do when applying transactions to avoid having to do a disk flush
> for each and every applied xact. Then we change START_REPLICATION ...
> LOGICAL so it ERRORs if you ask for a too-old LSN rather than silently
> ignoring it.
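
A rough sketch of the two halves of that proposal, again as a Python model rather than a patch. `TooOldLSN`, `start_replication_strict`, `must_force_flush` and the `safety_fraction` parameter are all hypothetical; only `wal_sender_timeout` corresponds to a real setting.

```python
class TooOldLSN(Exception):
    """Raised when a client asks to start before the slot's confirmed position."""


def start_replication_strict(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    """Proposed behaviour: refuse a too-old start point instead of clamping it."""
    if requested_lsn < confirmed_flush_lsn:
        raise TooOldLSN(
            f"requested {requested_lsn} is older than "
            f"confirmed_flush_lsn {confirmed_flush_lsn}"
        )
    return requested_lsn


def must_force_flush(seconds_since_last_reply: float,
                     wal_sender_timeout: float,
                     safety_fraction: float = 0.5) -> bool:
    """Downstream policy: wait for a natural flush, and force one only when
    the walsender timeout is getting close (safety_fraction is an assumption)."""
    return seconds_since_last_reply >= wal_sender_timeout * safety_fraction


if __name__ == "__main__":
    print(must_force_flush(10, 60))  # plenty of time left, keep waiting
    print(must_force_flush(45, 60))  # close to the timeout, flush now
    try:
        start_replication_strict(100, 300)
    except TooOldLSN as exc:
        print("ERROR:", exc)
```

The strict check is only safe once every downstream keeps its origin at or ahead of what it has confirmed, which is why the flush rule has to come first.
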
> This problem can also bite you if you restore a copy of a downstream (say,
> to look at since-deleted data) while the original happens to be
> disconnected for some reason. The copy connects to the upstream and
> consumes some data from the slot. Then when the original comes back online
> it has no idea there's a gap in its data stream.
> This came up when investigating issues with people using snapshot-based
> BDR and pglogical backup/restore. It's a real-world problem that can result
> in silent data inconsistency.
> Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd
> still allow a downstream to find out if it's missed data?
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services