I'd like to ask about possible causes of a problem, the slow shutdown of
postmaster when a backend crashes, and whether PostgreSQL should be fixed.
Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4. Yes, the PostgreSQL
version is rather old but there's no relevant bug fix in later 9.2.x releases.
One backend process (postgres) for an application session crashed due to a
segmentation fault and dumped a core file. The cause is a bug in
pg_dbms_stats. Another note is that restart_after_crash is off to make
The problem here is that postmaster took as long as 15 seconds to terminate
after it had detected a crashed backend. The messages were output as follows:
LOG: server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: DELETE...(snip)
LOG: terminating any other active server processes
From 20:12:35.013 to 20:12:39.074, the following message was output 80 times.
FATAL: the database system is in recovery mode
The custom monitoring system detected the death of postmaster as a result of
running "pg_ctl status".
That's it. You may say the following message should also have been emitted,
but it was not. This is because we commented out the ereport() call in
quickdie() in tcop/postgres.c. That ereport() call can hang depending on the
timing, which is fixed in 9.4.
WARNING: terminating connection because of crash of another server process
The customer insists that PostgreSQL takes longer to shut down than expected,
which risks exceeding their allowed failover time.
There's no apparent evidence to indicate the cause, but I could guess a few
reasons. Do you think these are correct and that PostgreSQL should be fixed? (I
1) postmaster should close the listening ports earlier
As cited above, for 4 seconds, postmaster created 80 dead-end child processes
which just output "FATAL: the database system is in recovery mode". This
indicates that postmaster is busy handling re-connection requests from
disconnected applications, preventing postmaster from reaping dead children as
fast as possible. This is a waste because postmaster will only shut down.
I think the listening ports should be closed in HandleChildCrash() when the
condition "(RecoveryError || !restart_after_crash)" is true.
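To make the idea concrete, here is a minimal standalone sketch, not a patch
against postmaster.c. The names listen_socks, recovery_error,
restart_after_crash_setting, close_listen_socks() and on_child_crash() are
hypothetical stand-ins for the postmaster's own state and for
HandleChildCrash(); only the condition is taken from the proposal above.

```c
#include <stdbool.h>
#include <unistd.h>

#define MAX_LISTEN   64     /* hypothetical size of the listen socket array */
#define INVALID_SOCK (-1)   /* marks an unused slot */

/* Hypothetical stand-ins for postmaster state; not the real variables. */
int  listen_socks[MAX_LISTEN];
bool recovery_error = false;
bool restart_after_crash_setting = false;

/* Mark every slot as unused. */
void
init_listen_socks(void)
{
    for (int i = 0; i < MAX_LISTEN; i++)
        listen_socks[i] = INVALID_SOCK;
}

/* Close every open listening socket; returns how many were closed. */
int
close_listen_socks(void)
{
    int closed = 0;

    for (int i = 0; i < MAX_LISTEN; i++)
    {
        if (listen_socks[i] != INVALID_SOCK)
        {
            close(listen_socks[i]);
            listen_socks[i] = INVALID_SOCK;
            closed++;
        }
    }
    return closed;
}

/*
 * Sketch of the extra step in the crash handler: if the postmaster is only
 * going to shut down, stop accepting reconnection attempts right away.
 */
int
on_child_crash(void)
{
    if (recovery_error || !restart_after_crash_setting)
        return close_listen_socks();
    return 0;
}
```

With the ports closed, reconnecting clients would get a connection error at
once instead of causing a dead-end child to be forked for each attempt.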
2) make stats collector terminate immediately
The stats collector seems to write the permanent stats file even when it
receives SIGQUIT. But that's useless because the stats file is reset during
recovery. And Tom claimed that writing the stats file can take a long time:
3) Anything else?
While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports
open but doesn't call select()/accept(). As a result, incoming connection
requests are accumulated in the listen queue of the sockets. Does the OS have
any bug that slows process termination when the listen queue is not empty?
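The accumulation itself is easy to reproduce outside PostgreSQL. The sketch
below assumes Linux and a Unix-domain socket; open_listener() and
connect_client() are hypothetical helper names. It only shows that connect()
completes while the listener never calls accept(), so the connection sits in
the listen queue. It does not answer whether that slows process termination.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a Unix-domain socket address from a filesystem path. */
static void
fill_addr(struct sockaddr_un *addr, const char *path)
{
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strncpy(addr->sun_path, path, sizeof(addr->sun_path) - 1);
}

/* Create a listening socket at path; the caller never calls accept(). */
int
open_listener(const char *path, int backlog)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    fill_addr(&addr, path);
    unlink(path);               /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/* Connect to the listener; returns the client fd, or -1 on failure. */
int
connect_client(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    fill_addr(&addr, path);
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling connect_client() a few times against a listener opened with a larger
backlog succeeds each time, even though nothing ever accepts the connections.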
Sent via pgsql-hackers mailing list (email@example.com)