On 2016-08-05 00:12:41 -0400, Robert Haas wrote:
> > The cause is an optimisation intended to allow the downstream to avoid
> > having to do local writes and flushes when the upstream's activity isn't of
> > interest to it and doesn't result in replicated rows. When the upstream does
> > a bunch of writes to another database or otherwise produces WAL not of
> > interest to the downstream we send the downstream keepalive messages that
> > include the upstream's current xlog position and the client replies to
> > acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> > the downstream, we permit it to skip advancing its replication origin in
> > response to those keepalives. We continue to advance the confirmed_flush_lsn
> > and restart_lsn in the replication slot on the upstream so we can free WAL
> > that's not needed and move the catalog_xmin up. The replication origin on
> > the downstream falls behind the confirmed_flush_lsn on the upstream.
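The failure mode described above can be sketched as a toy simulation (hypothetical classes, not PostgreSQL code): the upstream advances confirmed_flush_lsn on every keepalive ack, while the downstream acknowledges without durably flushing its replication origin, so after a crash the persisted origin is behind what the upstream already confirmed and freed.

```python
# Hypothetical sketch of the optimization described above, NOT actual
# PostgreSQL code. The upstream advances confirmed_flush_lsn on each
# acked keepalive; the downstream skips the disk flush of its
# replication origin, so the durable origin can fall behind.

class Upstream:
    def __init__(self):
        self.confirmed_flush_lsn = 0

    def keepalive(self, current_lsn, downstream):
        # Send the current WAL position; downstream acks having seen it.
        acked = downstream.handle_keepalive(current_lsn)
        # Trust the ack: free WAL, advance confirmed_flush_lsn.
        self.confirmed_flush_lsn = max(self.confirmed_flush_lsn, acked)

class Downstream:
    def __init__(self):
        self.origin_in_memory = 0   # volatile progress
        self.origin_on_disk = 0     # durably flushed replication origin

    def handle_keepalive(self, lsn):
        # Acknowledge the LSN without flushing the origin to disk.
        self.origin_in_memory = lsn
        return lsn

    def crash_and_restart(self):
        # After a crash only the flushed origin survives.
        self.origin_in_memory = self.origin_on_disk

up, down = Upstream(), Downstream()
for lsn in (100, 200, 300):      # upstream activity not of interest
    up.keepalive(lsn, down)

down.crash_and_restart()
# The downstream will now ask to restart from LSN 0, but the upstream's
# slot is already at confirmed_flush_lsn 300 and may have freed the WAL
# in between.
print(down.origin_in_memory, up.confirmed_flush_lsn)  # 0 300
```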
> This seems entirely too clever.  The upstream could safely remember
> that if the downstream asks for WAL position X it's safe to begin
> streaming from WAL position Y because nothing in the middle is
> interesting, but it can hardly decide to unilaterally ignore the
> request position.
> > The simplest fix would be to require downstreams to flush their replication
> > origin when they get a hot standby feedback message, before they send a
> > reply with confirmation. That could be somewhat painful for performance, but
> > can be alleviated somewhat by waiting for the downstream postgres to get
> > around to doing a flush anyway and only forcing it if we're getting close to
> > the walsender timeout. That's pretty much what BDR and pglogical do when
> > applying transactions to avoid having to do a disk flush for each and every
> > applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> > you ask for a too-old LSN rather than silently ignoring it.
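The fix proposed above could look roughly like this (hypothetical names and thresholds, not the pglogical/BDR code): the downstream defers the origin flush until it is getting close to the walsender timeout, so most keepalives still avoid a disk write, but any LSN it confirms has been durably flushed first; START_REPLICATION then errors on a too-old request instead of silently skipping ahead.

```python
# Rough sketch under assumed names and constants -- WALSENDER_TIMEOUT
# and FLUSH_DEADLINE_FRACTION are illustrative, not real GUCs.

WALSENDER_TIMEOUT = 60.0        # seconds, assumed
FLUSH_DEADLINE_FRACTION = 0.5   # flush well before the timeout hits

class Downstream:
    def __init__(self):
        self.origin_on_disk = 0
        self.pending_lsn = 0
        self.last_reply_time = 0.0

    def handle_keepalive(self, lsn, now):
        self.pending_lsn = max(self.pending_lsn, lsn)
        # Only force a flush-and-reply when we risk the walsender
        # timeout; otherwise wait for the next natural flush.
        if now - self.last_reply_time >= WALSENDER_TIMEOUT * FLUSH_DEADLINE_FRACTION:
            self.origin_on_disk = self.pending_lsn  # durable flush first
            self.last_reply_time = now
            return self.origin_on_disk              # confirmed in reply
        return None                                 # no reply yet

def start_replication(slot_confirmed_lsn, requested_lsn):
    # With origins flushed before confirmation, a request older than the
    # slot's confirmed LSN indicates a bug, so raise instead of
    # silently ignoring the requested position.
    if requested_lsn < slot_confirmed_lsn:
        raise ValueError("requested LSN %d older than confirmed LSN %d"
                         % (requested_lsn, slot_confirmed_lsn))
    return requested_lsn
```

The point is only the ordering: nothing is confirmed to the upstream that has not already hit the downstream's disk, which is what makes an out-of-range START_REPLICATION request safe to treat as an error.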
> That's basically just proposing to revert this broken optimization,
> IIUC, and instead just try not to flush too often on the standby.

The effect of the optimization is *massive* if you are replicating a
less active database, or a less active subset of a database, in a
cluster with lots of other activity. I don't think that can just be
disregarded to protect against something with plenty of other failure
modes.


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)