On 17 August 2016 at 05:18, Andres Freund <and...@anarazel.de> wrote:

> On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> > Right. Though if we flush lazily I'm surprised the effect is that big,
> > you're the one who did the work and knows the significance of it.
> It will be. Either you're increasing bloat (by not increasing the
> slot's wal position / catalog xmin), or you're adding frequent syncs on
> an idle connection.

My thinking is that we should be able to do it lazily, like we do already
with feedback during apply of changes.  The problem is that right now we
can't tell the difference between confirmed_flush_lsn advances in response
to keepalives when there's no interesting upstream activity, and advances
when the client replays and confirms real activity of interest. So we can
add a new field in logical slots that tracks the last confirmed_flush_lsn
update that occurred as a result of an actual write to the client rather
than keepalive responses. No new resource retention is required, no new
client messages, no new protocol fields. Just one new field in a logical

* Add a new field, say last_write_lsn, in slots. A logical slot updates
this whenever an output plugin sends something to the client in response to
a callback. last_write_lsn is not advanced along with confirmed_flush_lsn
when we just skip over data that's not of interest like writes to other
DBs or changes that are filtered out by the output plugin, only when the
output plugin actually sends something to the client.

* A candidate_last_write_lsn type mechanism is needed to ensure we don't
flush out advances of last_write_lsn before we've got client feedback to
confirm it flushed the changes resulting from the output plugin writes. The
same sort of logic as used for candidate_restart_lsn & restart_lsn will
work fine, but we don't have to make sure it's flushed like we do with
restart_lsn, we can just dirty the slot and wait for the next slot
checkpoint - it's pretty harmless if candidate_last_write_lsn is older than
reality, it just adds a small window where we won't detect lost changes.

* Clients like BDR and pglogical already send feedback lazily. They track
the server's flush position and send feedback for an upstream lsn once
they know the corresponding downstream writes and associated replication
origin advances have been flushed to disk. (As you know, having written
it). Behaviour during normal apply doesn't need to change. Neither does
behaviour during idle; clients don't have to advance their replication
origin in response to server keepalives, though they may do so lazily.

*  When a client starts a new decoding session we
check last_write_lsn against the client-requested LSN from the client's
replication origin. We ERROR if last_write_lsn is newer than
the LSN requested by the client, indicating that the client is trying to
replay changes it or someone else using the same slot has already seen and

*  catalog_xmin advances and WAL removal are NOT limited by last_write_lsn,
we can freely remove WAL after last_write_lsn and vacuum catalogs. On
reconnect we continue to skip to confirmed_flush_lsn if asked for an older
LSN, just like we currently do. The difference is that now we know we're
skipping data that wasn't of interest to the client so it didn't result in
eager client side replication origin advances.
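
To make the bookkeeping above concrete, here's a rough standalone sketch in
C. It is not a patch against the real slot code: SketchSlot, the sketch_*
functions and the plain uint64 LSN type are all made up for illustration,
and the single-candidate handling is a simplification of what the
candidate_restart_lsn logic does. I've also assumed last_write_lsn is set to
the LSN of the confirmed write rather than to the feedback position, since
the older of the two is the harmless choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t LSN;               /* stand-in for XLogRecPtr */
#define INVALID_LSN ((LSN) 0)

/* Hypothetical subset of a logical slot's state, not the real struct. */
typedef struct SketchSlot
{
    LSN  confirmed_flush_lsn;
    LSN  last_write_lsn;            /* proposed new field */
    LSN  candidate_last_write_lsn;  /* newest plugin write not yet confirmed */
    bool dirty;                     /* proposed field changed; written out at
                                     * the next slot checkpoint, no fsync now */
} SketchSlot;

/* The output plugin actually sent something to the client, ending at lsn. */
static void
sketch_note_plugin_write(SketchSlot *slot, LSN lsn)
{
    assert(slot != NULL);
    if (lsn > slot->candidate_last_write_lsn)
        slot->candidate_last_write_lsn = lsn;
}

/* The client reports that everything up to flushed_lsn is safely flushed. */
static void
sketch_handle_feedback(SketchSlot *slot, LSN flushed_lsn)
{
    assert(slot != NULL);

    /* confirmed_flush_lsn advances for keepalives and real writes alike. */
    if (flushed_lsn > slot->confirmed_flush_lsn)
        slot->confirmed_flush_lsn = flushed_lsn;

    /*
     * last_write_lsn only moves once the client has confirmed a real write.
     * Simplification: with one candidate we wait until the newest write is
     * covered, so the field can lag behind reality but never runs ahead.
     */
    if (slot->candidate_last_write_lsn != INVALID_LSN &&
        flushed_lsn >= slot->candidate_last_write_lsn)
    {
        slot->last_write_lsn = slot->candidate_last_write_lsn;
        slot->candidate_last_write_lsn = INVALID_LSN;
        slot->dirty = true;         /* just dirty the slot, don't flush */
    }
}
```

With this, a keepalive-only feedback message moves confirmed_flush_lsn and
leaves last_write_lsn alone, which is the distinction we can't make today.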

Think of last_write_lsn as "the value of confirmed_flush_lsn last time the
client actually flushed something interesting". We can safely skip from any
value >= last_write_lsn to the current slot confirmed_flush_lsn if asked to start
replay at any LSN within that range. We CANNOT safely skip from
< last_write_lsn to confirmed_flush_lsn since we know the client would miss
data it already received and confirmed but seems to have forgotten due to
lying fsync(), restore from snapshot backup, etc.
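
The rule in the previous paragraph boils down to a three-way decision at the
start of a decoding session. A minimal sketch, again with invented names
(StartDecision, sketch_check_start) and a plain uint64 in place of
XLogRecPtr; treating a zero requested LSN as "start wherever the slot is" is
my assumption for the sketch, mirroring how an unspecified start position
behaves today:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t LSN;               /* stand-in for XLogRecPtr */
#define INVALID_LSN ((LSN) 0)

typedef enum StartDecision
{
    START_AT_REQUESTED,             /* nothing to skip */
    START_SKIP_TO_CONFIRMED,        /* skipped range held nothing of interest */
    START_ERROR_LOST_CHANGES        /* client forgot changes it had confirmed */
} StartDecision;

/* Decide what to do with the start LSN a client asks for (hypothetical). */
static StartDecision
sketch_check_start(LSN requested, LSN last_write_lsn, LSN confirmed_flush_lsn)
{
    /* In this sketch last_write_lsn never runs ahead of confirmed_flush_lsn. */
    assert(last_write_lsn <= confirmed_flush_lsn);

    /* Assumption: no position requested means "use the slot's position". */
    if (requested == INVALID_LSN)
        return START_SKIP_TO_CONFIRMED;

    /* The client is behind a write it already confirmed: refuse. */
    if (requested < last_write_lsn)
        return START_ERROR_LOST_CHANGES;

    /* Only keepalive-driven advances lie between requested and confirmed. */
    if (requested < confirmed_flush_lsn)
        return START_SKIP_TO_CONFIRMED;

    return START_AT_REQUESTED;
}
```

For example, with last_write_lsn = 300 and confirmed_flush_lsn = 400, a
request for 350 is silently fast-forwarded to 400, while a request for 250
is an ERROR.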

We'd need more flushes on the upstream only if we were going to try to
guarantee that we'd detect all lost changes from a client, since
last_write_lsn would need flushing in response to every client feedback
message during apply (but not idle). Even then the client could've flushed
more changes we haven't got feedback for yet, so it's not really possible
to totally prevent the problem. I don't think total prevention is too
interesting though. A window since the last slot checkpoint where we don't
detect problems *if* the server has also crashed and restarted isn't too
bad and is a lot better than the current situation.

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
