[HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

Tsunakawa, Takayuki Mon, 19 Sep 2016 23:21:29 -0700

Hello,

Please let me ask you about possible causes of a certain problem, slow shutdown 
of postmaster when a backend crashes, and whether to fix PostgreSQL.


Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4.  Yes, the PostgreSQL 
version is rather old but there's no relevant bug fix in later 9.2.x releases.


PROBLEM
==============================

One backend process (postgres) for an application session crashed due to a 
segmentation fault and dumped a core file.  The cause is a bug of 
pg_dbms_stats.  Another note is that restart_after_crash is off to make 
failover happen.

The problem here is that postmaster took as long as 15 seconds to terminate 
after it had detected a crashed backend.  The messages were output as follows:

20:12：35.004に
LOG:  server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: DELETE...(snip)
LOG:  terminating any other active server processes

>From 20：12：35.013 to 20：12：39.074, the following message was output 80 times.

FATAL:  the database system is in recovery mode

20:12:50
The custom monitoring system detected the death of postmaster as a result of 
running "pg_ctl status".

That's it.  You may say the following message should also have been emitted, 
but there's not.  This is because we commented out the ereport() call in 
quickdie() in tcop.c.  That ereport() call can hang depending on the timing, 
which is fixed in 9.4.

WARNING:  terminating connection because of crash of another server process

The customer insists that PostgreSQL takes longer to shut down than expected, 
which risks exceeding their allowed failover time.


CAUSE
==============================

There's no apparent evidence to indicate the cause, but I could guess a few 
reasons.  What do you think these are correct and should fix PostgreSQL? (I 
think so)

1) postmaster should close the listening ports earlier
As cited above, for 4 seconds, postmaster created 80 dead-end child processes 
which just output "FATAL:  the database system is in recovery mode".  This 
indicates that postmaster is busy handling re-connection requests from 
disconnected applications, preventing postmaster from reaping dead children as 
fast as possible.  This is a waste because postmaster will only shut down.

I think the listening ports should be closed in HandleChildCrash() when the 
condition "(RecoveryError || !restart_after_crash)" is true.


        2) make stats collector terminate immediately
stats collector seems to write the permanent stats file even when it receives 
SIGQUIT.  But it's useless because the stat file is reset during recovery.  And 
Tom claimed that writing stats file can take long:

https://www.postgresql.org/message-id/11800.1455135...@sss.pgh.pa.us


3) Anything else?
While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports 
open but doesn't call select()/accept().  As a result, incoming connection 
requests are accumulated in the listen queue of the sockets.  Does the OS have 
any bug to slow the process termination when the listen queue is not empty?


Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

Reply via email to