On 03/03/2018 09:00 PM, Peter Eisentraut wrote:
I find this premise a bit dubious.  Why have a log file if it's too big
to find anything in it?  Server crashes aren't the only thing people are
interested in.  So we'll need a function for "last $anything".

Thank you for your interest in this topic.
Well, on heavy loaded machine log file will be big, because you usually want to log every query for later analysis, and, because postgres is dumping everything in a single file, searching for some specific error will be slow. Log file is certainly needed for digging the details of a crash, but pg_shmem_init_time is more about online monitoring.

On 03/03/2018 09:43 PM, Justin Pryzby wrote:

I think one can tell if it's crashed recently by comparing start time of parent
postmaster and its main children (I'm going to go put this in place for myself
now).

ts=# SELECT backend_type, backend_start, pg_postmaster_start_time(), 
backend_start-pg_postmaster_start_time() FROM pg_stat_activity ORDER BY 
backend_start LIMIT 1;
     backend_type     |         backend_start         |   
pg_postmaster_start_time    |    ?column?
---------------------+-------------------------------+-------------------------------+-----------------
  autovacuum launcher | 2018-03-02 00:21:11.604468-03 | 2018-03-02 
00:12:46.757642-03 | 00:08:24.846826

Thank you, though we must take into account the fact that autovacuum may not be enabled so it`s better to use checkpointer or bgwriter. It`s working solution, yes, but it looks more like workaround and requires from user an understanding of postgres internals.


On 03/04/2018 06:56 PM, Tomas Vondra wrote:

Can you please explain why pg_postmaster_start_time can't be used for
this purpose? It seems like a pretty good match, considering it's meant
to show server start time.

Because a backend crash do not reset pg_postmaster_start_time.

On 03/04/2018 07:02 PM, Tomas Vondra wrote:
On 02/28/2018 01:11 PM, Anastasia Lubennikova wrote:
Attached patch introduces a new function pg_shmem_init_time(),
which returns the time shared memory was last (re)initialized.
It is created for use by monitoring tools to track backend crashes.

Currently, if the 'restart_after_crash' option is on, postgres will
just restart. And the only way to know that it happened is to
regularly parse logfile or monitor it, catching restart messages.
This approach is really inconvenient for users, who have gigabytes of
logs.

This new function can be periodiacally called by a monitoring agent,
and, if /shmem_init_time/ doesn't match /pg_postmaster_start_time,/
we know that server crashed-restarted, and also know the exact time,
when.

I don't think it really solves the problem, though. For example if the
whole VM reboots (which can be a matter of seconds), this check will say
"shmem_init_time == pg_postmaster_start_time" and you've not detected
anything.

IMHO pg_postmaster_start_time is the right way to monitor uptime, and
the right way to detect spurious restarts is to remember the last value
you've seen and compare it to the current one.

Yes, for the described case with VM restart pg_postmaster_start_time works fine. pg_postmaster_start_time is essential for almost every case and pg_shmem_init_time only expand his usefulness, not diminish.

On 03/04/2018 07:09 PM, Tom Lane wrote:
It evidently depends on how you want to define "server uptime".  If you
get backend crashes often enough, you might feel a need to define it
as "time since last crash".  Although I would think that if that's
happening regularly in a production environment, you have a problem
you need to fix, not just measure.

Absolutely. And for that you need to know ASAP that backend crash happened in the first place.

On 03/04/2018 07:09 PM, Tom Lane wrote:
My own thought about that is that if you are trying to measure
backend crashes, just knowing the time of the latest one is little help.
You want to know how often they're happening.  So this gets back to the
question of why the postmaster log isn't a useful source of that
information.

For that purpose pg_shmem_init_time also can be used.
For example we can build a chart based on values from "select (now() - pg_shmem_init_time());" taken, for example, every 10 seconds. Values around 0 =< x =<10 will signal about memory reinitialization which is usually a byproduct of a backend crash.

On 03/04/2018 07:09 PM, Tom Lane wrote:
  I think that if we're to do anything in this area,
improving the usefulness of the log would be more important than
providing the proposed function.

The separation of maintenance messages and query messaged into different log files is sorely needed.
This way server errors can be identified fast and in convenient way.
But as I mentioned earlier, pg_shmem_init_time() is about online monitoring.


On 03/04/2018 07:02 PM, Tomas Vondra wrote:
Actually, after looking at the code a bit, I think that test would not
really work anyway, because those two timestamps come from two separate
GetCurrentTimestamp calls, so you get this:

    test=# select pg_shmem_init_time(), pg_postmaster_start_time();
          pg_shmem_init_time       |   pg_postmaster_start_time
-------------------------------+-------------------------------
     2018-03-04 17:10:59.017058+01 | 2018-03-04 17:10:59.022071+01
    (1 row)

    test=# select pg_shmem_init_time() = pg_postmaster_start_time();
     ?column?
    ----------
     f
    (1 row)

So there would have to be some sort of "fuzz" parameter when comparing
the values, etc.

The difference measured in milliseconds is expected. After all, shared memory is initialized after postmaster start.




--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Reply via email to