Re: [HACKERS] Hot standby, recovery infra

Heikki Linnakangas Thu, 26 Feb 2009 10:39:10 -0800

Fujii Masao wrote:

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <si...@2ndquadrant.com> wrote:

That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).


This problem remains in current HEAD. I mean, immediate shutdown
may be unable to kill the startup process because system() which
executes restore_command ignores SIGQUIT while waiting.
When I tried immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.


Yeah, we need to fix that.

The following code should be added into RestoreArchivedFile()?

----
if (WTERMSIG(rc) == SIGQUIT)
       exit(2);
----


I don't see how that helps, as we already have this in there:

        signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

        ereport(signaled ? FATAL : DEBUG2,
                (errmsg("could not restore file \"%s\" from archive: return code %d",
                                xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with SIGQUIT.
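
To make the status test above concrete, here is a minimal standalone sketch of the same classification applied to the return value of system(). The function names status_means_signaled() and command_was_signaled() are hypothetical and are not PostgreSQL code; only the WIFSIGNALED / WEXITSTATUS > 125 expression comes from the snippet quoted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not PostgreSQL code): decide whether a status
 * returned by system() should be treated as "killed by a signal".
 * Same expression as in the quoted snippet: either the shell itself was
 * terminated by a signal, or it exited with a code above 125 (shells
 * conventionally report 128 + N for a child killed by signal N).
 */
bool
status_means_signaled(int rc)
{
    return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}

/* Hypothetical wrapper: run a shell command and classify its result. */
bool
command_was_signaled(const char *cmd)
{
    int rc = system(cmd);

    assert(rc != -1);       /* system() itself must not have failed */
    return status_means_signaled(rc);
}
```

With this, command_was_signaled("exit 1") is false (an ordinary failure, such as a missing file), while command_was_signaled("exit 130") is true because the exit code is above 125.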

I think the real problem here is that pg_standby traps SIGQUIT. The startup process doesn't receive the SIGQUIT because it's in system(), and pg_standby doesn't propagate it to the startup process either because it traps it.

I think we should simply remove the signal handler for SIGQUIT from pg_standby. Or will that lead to core dump by default? In that case, we need pg_standby to exit(128) or similar, so that RestoreArchivedFile understands that the command was killed by a signal.
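
A rough sketch of the second option, outside of pg_standby: a SIGQUIT handler that exits with 128 plus the signal number instead of swallowing the signal, so that the caller sees an exit code above 125. The names exit_on_sigquit() and status_of_child_hit_by_sigquit() are made up for illustration, and the fork-based harness exists only to show the resulting wait status; this is not the actual pg_standby code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler: exit with a shell-style "killed by signal" code. */
static void
exit_on_sigquit(int signo)
{
    _exit(128 + signo);     /* _exit() is async-signal-safe; exit() is not */
}

/*
 * Fork a child that installs the handler and sends itself SIGQUIT.
 * Returns the child's wait status, or -1 if fork() or waitpid() failed.
 */
int
status_of_child_hit_by_sigquit(void)
{
    int         status;
    pid_t       pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        sigset_t    set;

        /* Make sure SIGQUIT is deliverable, whatever mask we inherited. */
        sigemptyset(&set);
        sigaddset(&set, SIGQUIT);
        sigprocmask(SIG_UNBLOCK, &set, NULL);

        signal(SIGQUIT, exit_on_sigquit);
        raise(SIGQUIT);
        _exit(0);           /* not reached if the handler ran */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

The child here exits normally (WIFEXITED is true) with status 128 + SIGQUIT, which the existing "WEXITSTATUS(rc) > 125" test would treat as signaled, and no core file is produced.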

Another approach is to check that the postmaster is still alive, like we do in walwriter and bgwriter:


                /*
                 * Emergency bailout if postmaster has died.  This is to avoid the
                 * necessity for manual cleanup of all postmaster children.
                 */
                if (!PostmasterIsAlive(true))
                        exit(1);

However, I'm afraid there's a race condition with that. If we do that right after system(), postmaster might've signaled us but not exited yet. We could check that in the main loop, but if we wrongly interpret the exit of the recovery command as a "file not found - go ahead and start up", the damage might be done by the time we notice that the postmaster is gone.
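
For illustration only, a generic liveness probe of this kind can be written with kill(pid, 0). The helper process_is_alive() below is hypothetical, and I'm not claiming this is how PostmasterIsAlive() is implemented; it is just a sketch of why such a check cannot close the window described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether a process with the given pid exists.
 * Signal 0 performs the existence and permission checks without actually
 * delivering a signal.
 */
bool
process_is_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;

    /* EPERM means the process exists but we may not signal it. */
    return errno == EPERM;
}
```

A parent that has already sent its shutdown signal but has not yet exited still passes this check, so a probe done right after system() returns can report "alive" even though shutdown is under way.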


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

