"Alvaro Herrera" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Greg Smith <[EMAIL PROTECTED]> writes:
>> > ... If they'd have noticed it while the server was up, perhaps because the
>> > "last checkpoint" value hadn't changed in a long time (which seems like it
>> > might be available via stats even if, as you say, the background writer is
>> > out of its mind at that point), they could have done such a kill and
>> > collected some actual useful information here. That's the theory at
>> > least.
>> Well, mebbe, but that still seems to require a lot of custom monitoring
>> infrastructure that is not present in this patch, and furthermore that
>> this patch doesn't especially aid the development of.
> These kinds of things can be monitored externally very easily, say by
> Nagios, when the values are available via the database. If you have to
> trawl the logs, it's quite a bit harder to do.
I can see Tom's reluctance: the fear is that this is just the first of hundreds
of dials which will have to be monitored, so a single function to handle this
one dial seems short-sighted. I would think it should eventually be part of the
Postgres SNMP MIB.
But I would say, from my experience on the not-really-a-sysadmin side, that the
time of the last checkpoint is probably the *most* important thing to be
monitoring. Effectively it measures how stale your data on disk is potentially
becoming, by showing how much recovery would be necessary after a crash.
> I'm not sure about the right values to export -- last checkpoint start
> time is the most obvious idea, but I would also suggest exporting last
> checkpoint end, or NULL if the checkpoint is ongoing; and also previous-
> to-last checkpoint start and end.
I'm surprised y'all want the time of the last checkpoint *start*, though. It
seems to me that only the last checkpoint *end* is actually interesting. A
checkpoint which has started but not yet ended is not really a checkpoint at
all. It's nothing.
In the future there could be multiple checkpoints which have "started" but not
finished. Or we could add support for lazy checkpoints in which case there
could be an infinite number of "potential" checkpoints which haven't finished.
Worse, the time the last checkpoint finished isn't what's interesting either.
What you want to know is when the record which the last finished checkpoint
*checkpointed up to* was inserted; that is, the time of the record that the
last checkpoint record *points to* (its redo pointer).
That represents the "guarantee" that the database is making to the sysadmin
about data integrity. Everything before that time is guaranteed to have
reached the data files already. Everything after it may or may not be in the
data files and has to be checked against the WAL.
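The guarantee described above can be sketched as a simple partition over WAL record insert times. This is illustrative only: PostgreSQL actually tracks the boundary as an LSN (the checkpoint's redo pointer), not as a wall-clock time, but time is how a sysadmin reasons about staleness:

```python
def recovery_window(record_times, redo_time):
    """Split WAL record insert times into those guaranteed to have
    reached the data files (at or before the redo record's time) and
    those that may exist only in the WAL and must be replayed.
    Names and the time-based boundary are illustrative assumptions."""
    flushed = [t for t in record_times if t <= redo_time]
    pending = [t for t in record_times if t > redo_time]
    return flushed, pending
```

The size of the `pending` side is precisely "how much recovery will be necessary": it only shrinks when a new checkpoint completes and advances the redo pointer.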
Sent via pgsql-patches mailing list (firstname.lastname@example.org)