Decibel! <[EMAIL PROTECTED]> writes:
> On Tue, Aug 07, 2007 at 06:46:05PM -0400, Tom Lane wrote:
>> I'm tempted to stop accepting connections once we start the shutdown
>> checkpoint (ie, tell the bgwriter to shut down). This would at least
>> limit the width of the window in which incoming connections get ignored.
>> We could also close the sockets at that point, so that clients get an
>> immediate failure instead of hanging for a while --- then it will look
>> to them like the postmaster is already gone.
> Well, we're effectively shut down at that point... I don't see much
> benefit to sending a "We're in the middle of shutting down" message over
> "couldn't connect to the server", which is what the client would have
> gotten anyway in very short order.

After sleeping on it, I see that doesn't work either. We have exactly
the same problem during crash recovery: if there's a sufficiently
constant flow of connection requests, the postmaster will never decide
that all the old backends are gone and it can initiate recovery. (This
is if anything a more plausible scenario than the shutdown case, since
the clients will surely be trying to reconnect, whereas in a commanded
shutdown you'd mostly expect the clients to be gone already.) And we
can't just throw away the sockets if we intend to restart.

The only way I can see to fix it is to have the postmaster just stop
doing accept()s for a short time while the "doomed" children drain out
of the system. If we don't do this until all the "live" children are
gone then it should just be a short interval, since we know the "doomed"
children will exit as soon as they get a few cycles to run.
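To make the proposal concrete, here is a rough sketch of the decision the
postmaster's main loop would make. It is only an illustration, not actual
postmaster code: PmPhase, PmState, pm_should_accept and pm_can_advance are
all made-up names, and the real state machine has more phases than this.

```c
#include <stdbool.h>

/* Hypothetical phases; the real postmaster has more states than this. */
typedef enum { PM_RUNNING, PM_WAIT_CHILDREN } PmPhase;

typedef struct {
    PmPhase phase;
    int live_children;    /* children that were admitted normally */
    int doomed_children;  /* children forked only to reject their client */
} PmState;

/* Should the main loop call accept() on the listen sockets right now? */
bool pm_should_accept(const PmState *s)
{
    if (s->phase == PM_RUNNING)
        return true;
    /*
     * Waiting for children to exit.  While live children remain, keep
     * accepting so that clients get a prompt rejection.  Once only doomed
     * children are left, pause accept() so they can drain out.
     */
    return s->live_children > 0;
}

/* Is it OK to move on to the next phase (shutdown checkpoint or recovery)? */
bool pm_can_advance(const PmState *s)
{
    return s->phase == PM_WAIT_CHILDREN
        && s->live_children == 0
        && s->doomed_children == 0;
}
```

The point of the sketch is that accept() is paused only in the window where
live_children is zero and doomed_children is not, which should be brief.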
BTW, the crash-recovery case also destroys the option of just ignoring
doomed children while deciding whether it's OK to move on to the next
phase. Because doomed children will still be attached to the old shmem
segment, they would prevent it from being deleted immediately when we
IPC_RMID it. That could mean that our subsequent attempt to create a
new one fails due to SHMALL violation, and thus that we fail to recover
automatically after a crash.
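The deferred-deletion behavior can be seen with a small standalone test.
This is a sketch under assumptions: it uses a single process attached to its
own segment as a stand-in for a doomed child, the function name
segment_survives_rmid is made up, and it relies on the usual SysV behavior
that a segment marked with IPC_RMID stays mapped until the last shmdt().

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Returns 1 if the segment was still usable after IPC_RMID,
 * 0 if the data did not read back, -1 if SysV shmem is unavailable.
 */
int segment_survives_rmid(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    char *p = shmat(id, NULL, 0);
    if (p == (void *) -1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    /* Mark for deletion while we are still attached. */
    if (shmctl(id, IPC_RMID, NULL) < 0) {
        shmdt(p);
        return -1;
    }

    /* The memory is still there, and still counts against system limits. */
    strcpy(p, "still here");
    int ok = (strcmp(p, "still here") == 0);

    /* Only this detach lets the kernel actually free the segment. */
    shmdt(p);
    return ok;
}
```

Until that final shmdt(), the old segment's pages presumably still count
toward SHMALL, which is why creating the replacement segment can fail.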
Can anyone think of a better adjective than "doomed"? Basically we're
going to remember whether we sent the child a canAcceptConnections state
different from CAC_OK.
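A minimal sketch of that bookkeeping follows. Only canAcceptConnections and
CAC_OK come from the text above; the other enum values, the ChildRecord
struct, its "doomed" field, and both function names are placeholders for
illustration and are not meant to match the real backend list.

```c
#include <stdbool.h>

/* Only CAC_OK is named in this mail; the other values are illustrative. */
typedef enum { CAC_OK, CAC_SHUTDOWN, CAC_RECOVERY } CAC_state;

/* Hypothetical per-child record kept by the postmaster. */
typedef struct {
    int  pid;
    bool doomed;  /* true if the child was sent something other than CAC_OK */
} ChildRecord;

/* Remember at fork time what canAcceptConnections state the child got. */
ChildRecord record_child(int pid, CAC_state sent)
{
    ChildRecord c;
    c.pid = pid;
    c.doomed = (sent != CAC_OK);
    return c;
}

/* Tally live vs. doomed children so the postmaster can tell them apart. */
void count_children(const ChildRecord *kids, int n, int *live, int *doomed)
{
    *live = 0;
    *doomed = 0;
    for (int i = 0; i < n; i++) {
        if (kids[i].doomed)
            (*doomed)++;
        else
            (*live)++;
    }
}
```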
regards, tom lane