On Jul4, 2011, at 23:11 , Peter Geoghegan wrote: > On 4 July 2011 17:36, Florian Pflug <f...@phlo.org> wrote: >> On Jul4, 2011, at 17:53 , Heikki Linnakangas wrote: >>>> Under Linux, select() may report a socket file descriptor as "ready >>>> for >>>> reading", while nevertheless a subsequent read blocks. This could >>>> for >>>> example happen when data has arrived but upon examination has >>>> wrong >>>> checksum and is discarded. There may be other circumstances in >>>> which a >>>> file descriptor is spuriously reported as ready. Thus it may be >>>> safer >>>> to use O_NONBLOCK on sockets that should not block. >>> >>> So in theory, on Linux you might WaitLatch might sometimes incorrectly >>> return WL_POSTMASTER_DEATH. None of the callers check for >>> WL_POSTMASTER_DEATH return code, they call PostmasterIsAlive() before >>> assuming the postmaster has died, so that won't affect correctness at the >>> moment. I doubt that scenario can even happen in our case, select() on a >>> pipe that is never written to. But maybe we should add add an assertion to >>> WaitLatch to assert that if select() reports that the postmaster pipe has >>> been closed, PostmasterIsAlive() also returns false. >> >> The correct solution would be to read() from the pipe after select() >> returns, and only return WL_POSTMASTER_DEATCH if the read doesn't return >> EAGAIN. To prevent that read() from blocking if the read event was indeed >> spurious, O_NONBLOCK must be set on the pipe but that patch does that >> already. > > Let's have some perspective on this. We're talking about a highly > doubtful chance that latches may wake when they shouldn't.
Yeah, as long as there's just a spurious wake up, sure. However, having WaitLatch() indicate a postmaster death in that case seems dangerous. Some caller, sooner or later, is bound to get it wrong, i.e. forget to re-check PostmasterIsAlive. > Maybe we should restore the return value of WaitLatch to its previous > format (so it doesn't return a bitmask)? That way, we don't report > that the Postmaster died, and therefore clients are required to call > PostmasterIsAlive() to be sure. That'd solve the issue too. > In any case, I'm in favour of the assertion. I don't really see the value in that assertion. It'd cause spurious assertion failures in the case of spurious events reported by select(). If we do expect such event, we should close the hole instead of asserting. If we don't, then what's the point of the assert. BTW, do we currently retry the select() on EINTR (meaning a signal has arrived)? If we don't, that'd be an additional source of spurious returns from select. >> Btw, with the death-watch / life-sign / whatever infrastructure in place, >> shouldn't PostmasterIsAlive() be using that instead of getppid() / kill(0)? > > Hmm, maybe. That seems like a separate issue though, that can be > addressed with another patch. It does have the considerable > disadvantage of making Heikki's proposed assertion failure useless. Is > the implementation of PostmasterIsAlive() really a problem at the > moment? I'm not sure that there is currently a guarantee that PostmasterIsAlive will returns false immediately after select() indicates postmaster death. If e.g. the postmaster's parent is still running (which happens for example if you launch postgres via daemontools), the re-parenting of backends to init might not happen until the postmaster zombie has been vanquished by its parent's call of waitpid(). It's not entirely inconceivable for getppid() to then return the (dead) postmaster's pid until that waitpid() call has occurred. But agreed, this is probably best handled by a separate patch. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers