Hi all,
I wondered if any experts can help me out?

I currently monitor Postgresql's replication status by looking at two metrics.
First I check to see if the current slave xlog replay is equal to the master 
-- if so, it's up to date.
If it's not equal, then I look at pg_last_xact_replay_timestamp().

I can compare this with now() to get a duration, which I believe should 
indicate how far behind the slave is tracking.

Most of the time this works quite well -- the slave might fall behind by some 
seconds under heavy load, but that's fine.

However, occasionally this replay timestamp will report times many hours or 
days behind! This goes on for a few minutes, then suddenly recovers.


My best guess for what is going on is:
 - There has been no activity for hours or days, and so the oldest replayed 
transaction on the slave is genuinely quite old.
 - Something has happened on the master that causes its 
pg_current_xlog_location() to be updated, but not in a way that is sent to the 
slave until the end of a long-running transaction.


Could anyone suggest how to do this in a manner that avoids the problem?


It's annoying because when it happens, because for 5-10 minutes monitoring 
alerts get fired off about catastrophic amounts of lag on the read-only slave!

Cheers
Toby


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to