[HACKERS] Measuring replay lag

Thomas Munro Wed, 26 Oct 2016 03:35:42 -0700

Hi hackers,

Here is a new version of my patch to add a replay_lag column to the
pg_stat_replication view (originally proposed as part of a larger
patch set for 9.6[1]), like this:

postgres=# select application_name, replay_lag from pg_stat_replication;
┌──────────────────┬─────────────────┐
│ application_name │   replay_lag    │
├──────────────────┼─────────────────┤
│ replica1         │ 00:00:00.595382 │
│ replica2         │ 00:00:00.598448 │
│ replica3         │ 00:00:00.541597 │
│ replica4         │ 00:00:00.551227 │
└──────────────────┴─────────────────┘
(4 rows)

It works by taking advantage of the { time, end-of-WAL } samples that
sending servers already include in message headers to standbys.  That
seems to provide a pretty good proxy for when the WAL was written, if
you ignore messages where the LSN hasn't advanced.  The patch
introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
control how often the standby should record these timestamped LSNs
into a small circular buffer.  When its recovery process eventually
replays a timestamped LSN, the timestamp is sent back to the upstream
server in a new reply message field.  The value visible in
pg_stat_replication.replay_lag can then be updated by comparing with
the current time.

Compared to the usual techniques people use to estimate replay lag,
this approach has the following advantages:

1.  The lag is measured in time, not LSN difference.
2.  The lag time is computed using two observations of a single
server's clock, so there is no clock skew.
3.  The lag is updated even between commits (during large data loads etc).

In the previous version I was effectively showing the ping time
between the servers during idle times when the standby was fully
caught up because there was nothing happening.  I decided that was not
useful information and that it's more newsworthy and interesting to
see the estimated replay lag for the most recent real replayed
activity, so I changed that.

In the last thread[1], Robert Haas wrote:
> Well, one problem with this is that you can't put a loop inside of a
> spinlock-protected critical section.

Fixed.

> In general, I think this is a pretty reasonable way of attacking this
> problem, but I'd say it's significantly under-commented.  Where should
> someone go to get a general overview of this mechanism?  The answer is
> not "at place XXX within the patch".  (I think it might merit some
> more extensive documentation, too, although I'm not exactly sure what
> that should look like.)

I have added lots of comments.

> When you overflow the buffer, you could thin in out in a smarter way,
> like by throwing away every other entry instead of the oldest one.  I
> guess you'd need to be careful how you coded that, though, because
> replaying an entry with a timestamp invalidates some of the saved
> entries without formally throwing them out.

Done, by overwriting the newest sample rather than the oldest if the
buffer is full.  That seems to give pretty reasonable degradation,
effectively lowering the sampling rate, without any complicated buffer
or rate management code.

> Conceivably, 0002 could be split into two patches, one of which
> computes "stupid replay lag" considering only records that naturally
> carry timestamps, and a second adding the circular buffer to handle
> the case where much time passes without finding such a record.

I contemplated this but decided that it'd be best to use ONLY samples
from walsender headers, and never use the time stamps from commit
records for this.  If we use times from commit records, then a
cascading sending server will not be able to compute the difference in
time without introducing clock skew (not to mention the difficulty of
combining timestamps from two sources if we try to do both).  I
figured that it's better to have value that shows a cascading
sender->standby->cascading sender round trip time that is free of
clock skew, than a master->cascading sender->standby->cascading sender
incomplete round trip that includes clock skew.

By the same reasoning I decided against introducing a new periodic WAL
record or field from the master to hold extra time stamps in between
commits to do this, in favour of the buffered transient timestamp
approach I took in this patch.  That said, I can see there are
arguments for doing it with extra periodic WAL timestamps, if people
don't think it'd be too invasive to mess with the WAL for this, and
don't care about cascading standbys giving skewed readings.  One
advantage would be that persistent WAL timestamps would still be able
to provide lag estimates if a standby has been down for a while and
was catching up, and this approach can't until it's caught up due to
lack of buffered transient timestamps.  Thoughts?

I plan to post a new "causal reads" patch at some point which will
depend on this, but in any case I think this is a useful feature on
its own.  I'd be grateful for any feedback, flames, better ideas etc.
Thanks for reading.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

replay-lag-v12.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Measuring replay lag

Reply via email to