I didn't follow all of that, but I wonder if it isn't just that when you restore from backup you should be creating a new slot?
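Something along those lines might look like the following; the slot name and the test_decoding plugin are only illustrative:

    -- On the upstream, after restoring the downstream from backup:
    -- the old slot's confirmed_flush_lsn may be ahead of what the
    -- restored downstream actually replayed, so discard the slot
    SELECT pg_drop_replication_slot('downstream_slot');

    -- and start over with a fresh slot; the downstream then has to
    -- resynchronise its initial state rather than silently skipping
    -- whatever the old slot already handed out to another client
    SELECT pg_create_logical_replication_slot('downstream_slot', 'test_decoding');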
On 3 Aug 2016 14:39, "Craig Ringer" <cr...@2ndquadrant.com> wrote:
> Hi all
>
> I think we have a bit of a problem with the behaviour specified for
> logical slots, one that makes it hard for an outdated snapshot or
> backup of a logical-slot-using downstream to know it's missing a chunk
> of data that's been consumed from a slot. That's not great since slots
> are supposed to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN
> older than the slot's confirmed_flush_lsn, we silently ignore the
> client's request and start replay at the confirmed_flush_lsn. That's
> by design and fine normally, since we know the gap LSNs contained no
> transactions of interest to the downstream.
>
> But it's *bad* if the downstream is actually a copy of the original
> downstream that's been started from a restored backup/snapshot. In
> that case the downstream won't know that some other client, probably a
> newer instance of itself, consumed rows it should've seen. It'll
> merrily continue replaying and not know it isn't consistent.
>
> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity
> isn't of interest to it and doesn't result in replicated rows. When
> the upstream does a bunch of writes to another database or otherwise
> produces WAL not of interest to the downstream, we send the downstream
> keepalive messages that include the upstream's current xlog position,
> and the client replies to acknowledge it's seen the new LSN. But, so
> that we can avoid disk flushes on the downstream, we permit it to skip
> advancing its replication origin in response to those keepalives. We
> continue to advance the confirmed_flush_lsn and restart_lsn in the
> replication slot on the upstream so we can free WAL that's not needed
> and move the catalog_xmin up. The replication origin on the downstream
> falls behind the confirmed_flush_lsn on the upstream.
>
> This means that if the downstream exits/crashes before receiving some
> new row, its replication origin will tell it that it last replayed
> some LSN older than what it really did, and older than what the server
> retains. Logical decoding doesn't allow the client to "jump backwards"
> and replay anything older than the confirmed_flush_lsn. Since we
> "know" that the gap cannot contain rows of interest, otherwise we'd
> have updated the replication origin, we just skip and start replay at
> the confirmed_flush_lsn.
>
> That means that if the downstream is restored from a backup it has no
> way of knowing it can't rejoin and become consistent, because it can't
> tell the difference between "everything's fine, replication origin
> intentionally behind confirmed_flush_lsn due to activity not of
> interest" and "we've missed data consumed from this slot by some other
> peer and should refuse to continue replay".
>
> The simplest fix would be to require downstreams to flush their
> replication origin when they get a hot standby feedback message,
> before they send a reply with confirmation. That could be somewhat
> painful for performance, but can be alleviated somewhat by waiting for
> the downstream postgres to get around to doing a flush anyway and only
> forcing it if we're getting close to the walsender timeout. That's
> pretty much what BDR and pglogical do when applying transactions to
> avoid having to do a disk flush for each and every applied xact. Then
> we change START_REPLICATION ... LOGICAL so it ERRORs if you ask for a
> too-old LSN rather than silently ignoring it.
>
> This problem can also bite you if you restore a copy of a downstream
> (say, to look at since-deleted data) while the original happens to be
> disconnected for some reason. The copy connects to the upstream and
> consumes some data from the slot. Then when the original comes back
> online it has no idea there's a gap in its time stream.
>
> This came up when investigating issues with people using
> snapshot-based BDR and pglogical backup/restore. It's a real-world
> problem that can result in silent data inconsistency.
>
> Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd
> still allow a downstream to find out if it's missed data?
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
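For what it's worth, a restored downstream can at least see the size of
the discrepancy before reconnecting by comparing the two positions,
though as described above it can't tell a benign gap from real data
loss; the slot and origin names here are illustrative:

    -- On the upstream: where the slot believes the downstream got to
    SELECT confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'downstream_slot';

    -- On the restored downstream: where its replication origin
    -- believes it got to
    SELECT remote_lsn
    FROM pg_replication_origin_status
    WHERE external_id = 'upstream_origin';

If confirmed_flush_lsn is ahead of remote_lsn, only something like the
proposed flush-before-confirm rule would make that difference mean
"data was consumed elsewhere" rather than "keepalives advanced the
slot".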