Fujii Masao wrote:
On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).

This problem remains in current HEAD. I mean, immediate shutdown
may be unable to kill the startup process because system() which
executes restore_command ignores SIGQUIT while waiting.
When I tried immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

Should the following code be added to RestoreArchivedFile()?

----
if (WTERMSIG(rc) == SIGQUIT)
       exit(2);
----

I don't see how that helps, as we already have this in there:

        signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

        ereport(signaled ? FATAL : DEBUG2,
                (errmsg("could not restore file \"%s\" from archive: return code %d",
                                xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with SIGQUIT.
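
To make the "signaled" test above concrete, here is a minimal standalone sketch that applies the same expression to statuses returned by system(). The helper names are hypothetical and are not PostgreSQL functions. The "> 125" part relies on the usual shell convention that 126 and 127 mean the command could not be run and 128+n means it was killed by signal n.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not a PostgreSQL function): applies the same test
 * as the quoted snippet to a wait status returned by system().
 */
static bool
command_looks_signaled(int rc)
{
	return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}

/* Run a shell command and classify its result with the test above. */
static bool
run_and_classify(const char *command)
{
	int			rc = system(command);

	return command_looks_signaled(rc);
}
```

Under these assumptions, run_and_classify("exit 1") is false (an ordinary failure, reported at DEBUG2 in the quoted code), while a command that exits with 126 or higher, or a shell that is itself killed by a signal, is classified as signaled.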

I think the real problem here is that pg_standby traps SIGQUIT. The startup process doesn't receive the SIGQUIT because it's in system(), and pg_standby doesn't propagate it to the startup process either because it traps it.
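
The reason the startup process never sees the signal can be shown outside PostgreSQL. POSIX specifies that system() ignores SIGINT and SIGQUIT in the caller while it waits for the command. The sketch below is only an illustration with hypothetical names: the shell child sends SIGQUIT to its parent (via $PPID) while the parent is inside system(), and the parent's handler never runs.

```c
#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t sigquit_count = 0;

/* Handler that only counts deliveries of SIGQUIT. */
static void
count_sigquit(int signo)
{
	(void) signo;
	sigquit_count++;
}

/*
 * Installs the counting handler, then lets the shell child send SIGQUIT
 * to this process while it is waiting inside system().  Returns how many
 * times the handler ran; the previous handler is restored afterwards.
 */
static int
sigquits_delivered_during_system(void)
{
	void		(*old_handler) (int) = signal(SIGQUIT, count_sigquit);
	int			rc;
	int			seen;

	sigquit_count = 0;
	rc = system("kill -QUIT $PPID");
	(void) rc;					/* the command's status is not of interest */
	seen = sigquit_count;
	signal(SIGQUIT, old_handler);
	return seen;
}
```

On a POSIX-conforming system() the function returns 0: the signal is discarded rather than left pending, so the caller learns about it only if the command's exit status reflects it.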

I think we should simply remove the signal handler for SIGQUIT from pg_standby. Or will that lead to a core dump by default? In that case, we need pg_standby to exit(128) or similar, so that RestoreArchivedFile understands that the command was killed by a signal.
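
A rough sketch of the "exit(128) or similar" variant follows. It is not pg_standby's actual code; the handler, the constant, and the test harness function are hypothetical. The handler uses _exit() because exit() is not async-signal-safe.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed exit code, following the "exit(128) or similar" suggestion. */
#define SIGNALED_EXIT_CODE 128

/* SIGQUIT handler for the restore command: leave at once with that code. */
static void
quit_with_status(int signo)
{
	(void) signo;
	_exit(SIGNALED_EXIT_CODE);	/* async-signal-safe, unlike exit() */
}

/*
 * Forks a stand-in for the restore command that traps SIGQUIT as above
 * and then sends itself SIGQUIT.  Returns the wait status seen by the
 * parent, or -1 if fork() or waitpid() failed.
 */
static int
status_of_child_hit_by_sigquit(void)
{
	int			status;
	pid_t		pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
	{
		signal(SIGQUIT, quit_with_status);
		raise(SIGQUIT);
		_exit(0);				/* not reached if the handler ran */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return status;
}
```

The child exits normally with status 128, so there is no core dump, and a caller applying the "WEXITSTATUS(rc) > 125" test would still treat the command as killed by a signal.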

Another approach is to check that the postmaster is still alive, like we do in walwriter and bgwriter:

                /*
                 * Emergency bailout if postmaster has died.  This is to avoid the
                 * necessity for manual cleanup of all postmaster children.
                 */
                if (!PostmasterIsAlive(true))
                        exit(1);

However, I'm afraid there's a race condition with that. If we do that right after system(), postmaster might've signaled us but not exited yet. We could check that in the main loop, but if we wrongly interpret the exit of the recovery command as a "file not found - go ahead and start up", the damage might be done by the time we notice that the postmaster is gone.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)