On Tue, 17 Jun 2003, Tom Lane wrote:
> Charles Hornberger <[EMAIL PROTECTED]> writes:
> > Other things I perhaps ought to mention: Trying to stop the postmaster
> > using pg_ctl fails (unsurprisingly, since pg_ctl relies on
> > /var/pgsql/data/postmaster.pid, which contains a nonexistent PID); I
> > haven't tried to start a new postmaster yet, because the old backends
> > are hanging around.
>
> In theory a new postmaster would detect the old backends and refuse to
> start anyway. I don't trust that interlock unreservedly though. (But
> please test it while you have the opportunity...)

Unfortunately, our system administrator solved this before I got a chance
to test more. I don't know how he went about restarting the server,
although whatever he did doesn't appear to have hurt anything. Would
it be useful to know exactly what steps he took?
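
For anyone who lands in the same state (a postmaster.pid that names a
PID which no longer exists), the check can be scripted. This is only a
minimal sketch in Python, not something taken from pg_ctl itself. It
assumes the first line of postmaster.pid holds the postmaster's PID; the
function names are made up for illustration.

```python
import errno
import os


def read_pid(pid_file):
    """Return the PID stored on the first line of pid_file.

    Assumption: the first line of the file is the PID as a decimal number.
    """
    with open(pid_file) as f:
        return int(f.readline().strip())


def pid_is_alive(pid):
    """Return True if some process with this PID currently exists."""
    try:
        # Signal 0 delivers nothing; the kernel only performs the
        # existence and permission checks.
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # no such process
        if e.errno == errno.EPERM:
            return True  # process exists but belongs to another user
        raise
    return True


def pid_file_is_stale(pid_file):
    """Return True if pid_file names a PID that no longer exists."""
    return not pid_is_alive(read_pid(pid_file))
```

Calling pid_file_is_stale("/var/pgsql/data/postmaster.pid") would report
the situation described above. Note that a live PID only shows that some
process has that number, not that it is a postmaster, since PIDs get
reused.
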
> > Nor have I attempted to restart the web server, which might allow the
> > hanging-around backends to die by closing the old connections it's
> > holding to them. I'm tempted to go ahead and do this, though I'm not
> > sure whether I ought to until I've diagnosed what's going on right now.
>
> You will need to close all the existing connections before the new
> postmaster can be started. I'd recommend doing so sooner instead of
> later, because with no postmaster you aren't getting any checkpoints
> done, and your WAL space is going to start ballooning.
>
> As far as diagnosing the problem goes: if you have a postmaster log
> file, look to see if the postmaster wrote an ERROR or FATAL message
> before it exited. (Finding it among all the backend-level messages
> might be painful though.) Also look in the directory the postmaster
> was started in to see if there's a core file. Save away any evidence
> you can find before trying to start a new postmaster.

Interestingly, there are no messages in the log file, and I can't find a
core file -- in short, there's no evidence whatsoever, at least none that
I can find. (Though I am probably a pretty rotten detective.)
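
For readers who do have a log to dig through, the search suggested above
can be sketched roughly as follows. This is a hedged example in Python,
assuming that the severity shows up in each log line as "ERROR:" or
"FATAL:"; the function name is hypothetical, and the sketch makes no
attempt to tell postmaster messages apart from backend-level ones.

```python
import re

# Assumption: severity is written as "ERROR:" or "FATAL:" in the line.
SEVERITY = re.compile(r"\b(ERROR|FATAL):")


def find_severe_lines(lines):
    """Return (line_number, text) pairs for lines with ERROR or FATAL."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SEVERITY.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan a log file on disk and return its ERROR/FATAL lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as f:
        return find_severe_lines(f)
```

The last few entries returned by scan_log() are the ones closest to the
time the postmaster went away, so they are the natural place to start
reading.
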

However, I think I know the cause, though I haven't tested whether it
really makes the postmaster die. A few hours before I noticed that the
postmaster was dead, one of the sysadmins made a typo that caused an
NFS mount to become unavailable -- the very NFS mount that holds the
postgres executable (all our Solaris boxes share the same executables).
So the theory is that the postmaster tried to fork() a process using an
executable that was no longer reachable, and died as a result. Does this
make any sense?

-Charlie

> Because the postmaster doesn't actually do much, crashes are pretty
> unusual. I'm interested in whatever you can find.
>
> regards, tom lane