On Wed, Jan 4, 2017 at 1:06 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 21 December 2016 at 21:14, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> I thought about that too, but I couldn't figure out how to make the
>> sampling work.  If the primary is choosing (LSN, time) pairs to store
>> in a buffer, and the standby is sending replies at times of its
>> choosing (when wal_receiver_status_interval has been exceeded), then
>> you can't accurately measure anything.
>
> Skipping adding the line delay to this was very specifically excluded
> by Tom, so that clock disparity between servers is not included.
>
> If the balance of opinion is in favour of including a measure of
> complete roundtrip time then I'm OK with that.
I deliberately included the network round trip for two reasons:

1.  The three lag numbers tell you how long syncrep would take to
return control at the three levels remote_write, on, remote_apply.

2.  The time arithmetic is all done on the primary side using two
observations of its single system clock, avoiding any discussion of
clock differences between servers.

You can always subtract half the ping time from these numbers later if
you really want to (replay_lag - (write_lag / 2) may be a cheap proxy
for a lag time that doesn't include the return network leg, and still
doesn't introduce clock difference error).  I am strongly of the
opinion that time measurements made by a single observer are better
data to start from.

>> You could fix that by making the standby send a reply *every time* it
>> applies some WAL (like it does for transactions committing with
>> synchronous_commit = remote_apply, though that is only for commit
>> records), but then we'd be generating a lot of recovery->walreceiver
>> communication and standby->primary network traffic, even for people
>> who don't otherwise need it.  It seems unacceptable.
>
> I don't see why that would be unacceptable. If we do it for
> remote_apply, why not also do it for other modes? Whatever the
> reasoning was for remote_apply should work for other modes. I should
> add it was originally designed to be that way by me, so must have been
> changed later.

You can achieve that with this patch by setting
replication_lag_sample_interval to 0.  The patch streams
(time-right-now, end-of-wal) pairs to the standby in every outgoing
message, and then sees how long it takes for those timestamps to be
fed back to it.  The standby feeds them back immediately as soon as it
writes, flushes and applies those WAL positions.
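To make the single-clock scheme concrete, here is a minimal sketch in
Python (an illustrative model only, not the patch's C code; the names
StandbyBuffer, LagSample, lag and all the timestamps are invented for
this example).  The point it demonstrates: the standby only queues and
echoes the primary's (LSN, time) pairs, so both timestamps in the
subtraction come from the primary's clock and clock disparity between
servers never enters the result.

```python
import collections

# A sample the primary attaches to an outgoing message: a WAL position
# plus a timestamp read from the primary's own clock.
LagSample = collections.namedtuple("LagSample", ["lsn", "sent_at"])

class StandbyBuffer:
    """Standby-side state: samples covering WAL that hasn't yet been
    processed (the patch keeps one such buffer each for write, flush
    and apply)."""
    def __init__(self):
        self.pending = collections.deque()

    def receive(self, sample):
        self.pending.append(sample)

    def advance(self, lsn):
        """WAL up to `lsn` has been processed: pop the matching samples
        and echo them back to the primary unmodified -- the standby's
        clock is never consulted."""
        echoed = []
        while self.pending and self.pending[0].lsn <= lsn:
            echoed.append(self.pending.popleft())
        return echoed

def lag(now_on_primary, echoed_sample):
    """Two observations of the primary's single clock; includes the
    return network leg, as discussed above."""
    return now_on_primary - echoed_sample.sent_at

# Primary sends samples at t=100.0 and t=100.2 (primary clock time).
buf = StandbyBuffer()
buf.receive(LagSample(lsn=16, sent_at=100.0))
buf.receive(LagSample(lsn=32, sent_at=100.2))

# Standby applies WAL up to LSN 16 and echoes the first sample; the
# reply reaches the primary at t=100.7 on the primary's clock.
echoed = buf.advance(16)
replay_lag = lag(100.7, echoed[0])      # 0.7s, round trip included

# The cheap proxy from the text: subtract half the write-level round
# trip to approximate lag without the return network leg.
write_lag = 0.1                         # hypothetical earlier measurement
approx_replay_lag = replay_lag - write_lag / 2
```

The deque keeps samples in LSN order, so a single pass from the front
suffices when a position advances; samples for not-yet-applied WAL
simply stay queued until a later advance covers them.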
I figured it would be silly if every message from the primary caused
the standby to generate 3 replies just for a monitoring feature, so I
introduced the GUC replication_lag_sample_interval to rate-limit that.
I don't think there's much point in setting it lower than 1s: how
often will you look at pg_stat_replication?

>> That's why I thought that the standby should have the (LSN, time)
>> buffer: it decides which samples to record in its buffer, using LSN
>> and time provided by the sending server, and then it can send replies
>> at exactly the right times.  The LSNs don't have to be commit
>> records, they're just arbitrary points in the WAL stream which we
>> attach timestamps to.  IPC and network overhead is minimised, and
>> accuracy is maximised.
>
> I'm dubious of keeping standby-side state, but I will review the patch.

Thanks!  The only standby-side state is the three buffers of (LSN,
time) pairs that haven't been written/flushed/applied yet.  I don't
see how that can be avoided, except by inserting extra periodic
timestamps into the WAL itself, which has already been rejected.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers