[HACKERS] Pending query cancel defeats SIGQUIT

Noah Misch Tue, 10 Sep 2013 18:32:51 -0700

I've been doing an excess of immediate shutdowns lately, and that has turned
up bugs old and new.  This one goes back to 8.4 or earlier.  If a query cancel
is pending when a backend receives SIGQUIT, the cancel takes precedence and
the backend survives:


[local] test=# select nmtest_spin(false);
Cancel request sent
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
ERROR:  canceling statement due to user request
Time: 4932.257 ms
[local] test=# select 1;
 ?column?
----------
        1
(1 row) 

The errfinish() pertaining to that WARNING issues CHECK_FOR_INTERRUPTS(), and
the query cancel pending since before the SIGQUIT arrived then takes effect.
This is less bad on 9.4, because the postmaster will SIGKILL the backend after
5s.  On older releases, the backend persists indefinitely.
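
A toy model of that sequence, outside the server, may make the ordering
easier to follow.  Every toy_* name below is hypothetical; the two globals
only mimic the roles of QueryCancelPending and InterruptHoldoffCount, and the
longjmp() stands in for the error recovery that returns a canceled backend to
its main loop.  It is a sketch of the mechanism under those assumptions, not
the server code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are PostgreSQL symbols. */
static jmp_buf toy_error_context;       /* where a "canceled" query resumes */
static volatile bool toy_cancel_pending;    /* role of QueryCancelPending */
static volatile int toy_holdoff_count;      /* role of InterruptHoldoffCount */

/* Services a pending cancel unless interrupts are being held. */
static void
toy_check_for_interrupts(void)
{
        if (toy_holdoff_count > 0)
                return;                 /* held: leave the cancel pending */
        if (toy_cancel_pending)
        {
                toy_cancel_pending = false;
                longjmp(toy_error_context, 1);  /* "ERROR: canceling statement" */
        }
}

/* Emitting the WARNING checks for interrupts, as described above. */
static void
toy_emit_warning(void)
{
        toy_check_for_interrupts();
}

/*
 * Models a SIGQUIT arriving while a cancel is already pending.  Returns true
 * if the toy backend reaches its exit path, false if the cancel diverted it.
 */
bool
toy_quickdie(bool hold_interrupts)
{
        toy_cancel_pending = true;      /* cancel arrived before SIGQUIT */
        toy_holdoff_count = 0;

        if (setjmp(toy_error_context) != 0)
                return false;           /* backend survived as a mere cancel */

        if (hold_interrupts)
                toy_holdoff_count++;    /* role of HOLD_INTERRUPTS() */

        toy_emit_warning();
        return true;                    /* would proceed to exit here */
}

/* Runs both variants of the model. */
void
toy_demo(void)
{
        assert(!toy_quickdie(false));   /* cancel wins without the holdoff */
        assert(toy_quickdie(true));     /* holdoff lets the exit proceed */
}
```

In this model the only difference between the surviving and the exiting case
is whether the holdoff counter is raised before the warning is emitted.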

Let's fix this by holding interrupts for the duration of quickdie(); see
attached patch.  Surely we don't want any other kind of backend demise taking
precedence over quickdie().  Unfortunately, this patch does not fully prevent
that.  If ImmediateInterruptOK==true, a SIGINT could arrive and longjmp()
between the start of quickdie() and its PG_SETMASK() call.  The only
decently-portable way I know to close that race is to name SIGINT/SIGTERM in
the SIGQUIT handler's sa_mask.  In any event, that's a *far* narrower race and
is a general problem shared by most of our signal use.  I'm content to not fix
it in this patch, but I propose adding it as a TODO.
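
For the TODO, one possible shape of the sa_mask approach is sketched below as
a stand-alone fragment.  The demo_* names are hypothetical, and the handler
merely records the signal mask it observes rather than doing any of
quickdie()'s work; how this would fit into the server's existing signal setup
is not addressed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* 1 if SIGINT and SIGTERM were both blocked inside the handler, else 0. */
volatile sig_atomic_t demo_cancel_blocked = -1;

/* Hypothetical stand-in for quickdie(): only inspects the current mask. */
static void
demo_sigquit_handler(int signo)
{
        sigset_t cur;

        (void) signo;
        /* With a NULL set, sigprocmask() just reports the current mask. */
        if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
                demo_cancel_blocked = (sigismember(&cur, SIGINT) == 1 &&
                                       sigismember(&cur, SIGTERM) == 1);
}

/*
 * Install the SIGQUIT handler with SIGINT and SIGTERM named in sa_mask, so
 * the kernel blocks them atomically on handler entry and no window exists
 * before the handler's first statement.
 */
int
demo_install_sigquit_handler(void)
{
        struct sigaction act;

        act.sa_handler = demo_sigquit_handler;
        act.sa_flags = 0;
        sigemptyset(&act.sa_mask);
        sigaddset(&act.sa_mask, SIGINT);
        sigaddset(&act.sa_mask, SIGTERM);
        return sigaction(SIGQUIT, &act, NULL);
}

/* Usage: install the handler, deliver SIGQUIT to ourselves, report result. */
int
demo_run(void)
{
        int rc = demo_install_sigquit_handler();

        assert(rc == 0);
        (void) rc;
        raise(SIGQUIT);         /* handler runs before raise() returns */
        return demo_cancel_blocked;
}
```

Because the mask is applied as part of signal delivery, the handler does not
depend on reaching an explicit mask-setting call before a SIGINT can arrive.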

Here is the source code for the nmtest_spin() function used above:

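/* Assumes a standalone loadable module; adjust if added to an existing one. */
#include "postgres.h"

#include <signal.h>
#include <unistd.h>

#include "fmgr.h"

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(nmtest_spin);

/* Spin forever; when the argument is true, block SIGQUIT first. */
Datum
nmtest_spin(PG_FUNCTION_ARGS)
{
        bool no_sigquit = PG_GETARG_BOOL(0);

        if (no_sigquit)
        {
                sigset_t mask;

                sigemptyset(&mask);
                sigaddset(&mask, SIGQUIT);
                sigprocmask(SIG_BLOCK, &mask, NULL);
        }

        /* No CHECK_FOR_INTERRUPTS() here, so a cancel request stays pending. */
        for (;;)
                sleep(1);
}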

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e56dbfb..1eaf287 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2525,6 +2525,13 @@ quickdie(SIGNAL_ARGS)
        PG_SETMASK(&BlockSig);
 
        /*
+        * Prevent interrupts while exiting; though we just blocked signals that
+        * would queue new interrupts, one may have been pending.  We don't want a
+        * quickdie() downgraded to a mere query cancel.
+        */
+       HOLD_INTERRUPTS();
+
+       /*
         * If we're aborting out of client auth, don't risk trying to send
         * anything to the client; we will likely violate the protocol, not to
         * mention that we may have interrupted the guts of OpenSSL or some

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
