On 12/12/2011 08:45 AM, Robert Haas wrote:
> But I'm skeptical that anything that we only update once per
> checkpoint cycle will help much in calculating an accurate lag value.

I'm sure there is no upper bound on how much WAL lag you can build up between commit/abort records either; they can be far less frequent than checkpoints. All it takes is a multi-hour COPY with no other commits to completely hose a lag number derived from those records, and that is not an unusual situation at all. Overnight ETL or materialized-view-style reporting roll-ups, scheduled specifically for when no one is normally at the office, are the first things that spring to mind.
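To make that concrete: the way people usually approximate lag from those records today is something like the query below, run on the standby (a minimal sketch; pg_last_xact_replay_timestamp() is the existing 9.1 function). During that multi-hour COPY the result climbs toward hours even when replay is fully caught up:

    -- Time since the last commit/abort record was replayed.  With no
    -- commits flowing, this measures master idle time, not real lag.
    SELECT now() - pg_last_xact_replay_timestamp() AS apparent_lag;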

Anyway, I wasn't suggesting checkpoints as anything other than a worst-case behavior. We can always push out more frequent updates to reduce the error, and in what I expect to be the most common case the WAL send/receive stuff will usually do much better. I see the XID vs. WAL position UI issues as being fundamentally unsolvable, and that really bothers me; if it didn't, I'd have run screaming away from this thread by now.

> It also strikes me that anything that is based on augmenting the
> walsender/walreceiver protocol leaves anyone who is using WAL
> shipping out in the cold.  I'm not clear from the comments you or
> Simon have made how important you think that use case still is.

There are a number of reasons why we might want more timestamps streamed into the WAL; this might be one. We'd just need one to pop out as part of the archive_timeout segment switch to, in theory, make it possible for these people to be happy. I think Simon was hoping to avoid WAL timestamps; I wouldn't bet too much on that myself. The obvious implementation problem here is that the logical place to put the timestamp is right at the end of the WAL file, just before it's closed for archiving. But that position isn't seen until you've at least started processing the file, which you are clearly not doing fast enough if lag exists.
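For reference, the existing knob those people lean on looks like this (values are illustrative examples only, not recommendations); whatever timestamp popped out at the switch would be at most archive_timeout stale:

    # postgresql.conf on the master -- illustrative values only
    archive_mode = on
    archive_command = 'cp %p /mnt/server/archivedir/%f'
    archive_timeout = 60    # force a segment switch at least once a minute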

As far as who's still important here, two observations. Note that the pg_last_xact_insert_timestamp approach can fail to satisfy WAL-shipping people whose archives go to a separate network, where it's impractical to connect to both servers with libpq. I have some customers who like putting a one-way WAL wall (sorry) between production and the standby server, with log shipping being the only route between them; that's one reason why they might still be doing this instead of using streaming. There's really no good way to make these people happy and provide time lag monitoring inside the database.
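To spell that out: computing a time lag with the function proposed in this thread takes one query per server, roughly as below (a sketch; pg_last_xact_insert_timestamp() is the proposed function, not something that exists today):

    -- On the master: timestamp of the last commit/abort record inserted.
    SELECT pg_last_xact_insert_timestamp();

    -- On the standby: timestamp of the last commit/abort record replayed.
    SELECT pg_last_xact_replay_timestamp();

    -- Lag is the difference between the two, so the monitoring tool needs
    -- a libpq connection to both sides -- exactly what the WAL wall forbids.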

I was actually the last person I recall suggesting extra monitoring aimed mainly at WAL shipping environments: http://archives.postgresql.org/pgsql-hackers/2010-01/msg01522.php  I had some pg_standby changes I was also working on back then, almost two years ago. I never circled back to any of it due to having zero demand once 9.0 shipped; the requests I had been regularly getting about this all dried up. While I'm all for keeping new features working for everyone when it doesn't hold progress back, it's not unreasonable to recognize that we can't support every monitoring option through all of the weird ways WAL files can move around. pg_stat_replication isn't very helpful for 9.0+ WAL shippers either, yet they still go on doing their thing.

In the other direction, for people who will immediately adopt the latest hotness, cascading replication is a whole new layer of use case concerns on top of the ones considered so far. Now you're talking two layers of connections users have to navigate through to compute master->cascaded standby lag. Cascade the walsender timestamps instead, which seems pretty simple to do, and then people can just ask their local standby.
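Today that would mean stitching together something like the following (a sketch, assuming the 9.2-era pg_stat_replication column names):

    -- On the master: only the intermediate standby is visible here.
    SELECT application_name, replay_location FROM pg_stat_replication;

    -- On the intermediate standby: only the cascaded standby is visible.
    SELECT application_name, replay_location FROM pg_stat_replication;

    -- Two connections, combined by hand.  Cascaded walsender timestamps
    -- would reduce this to one query against the local standby.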

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

