Alvaro Herrera <alvhe...@commandprompt.com> writes:
> Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
>> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
>>> I noticed while doing some tests that the checkpointer process does not
>>> recover very nicely after a backend crashes under postmaster -T (after
>>> all processes have been kill -CONTd, of course, and postmaster told to
>>> shutdown via Ctrl-C on its console). For some reason it seems to get
>>> stuck on a loop doing sleep(0.5s) In other case I caught it trying to
>>> do a checkpoint, but it was progressing a single page each time and then
>>> sleeping. In that condition, the checkpoint took a very long time to
>>> finish.

>> Is this still a problem as of HEAD? I think I've fixed some issues in
>> the checkpointer's outer loop logic, but not sure if what you saw is
>> still there.

> Yep, it's still there as far as I can tell. A backtrace from the
> checkpointer shows it's waiting on the latch.

I'm confused about what you did here and whether this isn't just pilot
error. If you run with -T then the postmaster will just SIGSTOP the
remaining child processes, but then it will sit and wait for them to
die, since the state machine expects them to react as though they'd been
sent SIGQUIT. If you SIGCONT any of them then that process will resume,
totally ignorant that it's supposed to die. So "kill -CONTd, of course"
makes no sense to me. I tried killing one child with -KILL, then
sending SIGINT to the postmaster, then killing the remaining
already-stopped children, and the postmaster did exit as expected after
the last child died.

So I don't see any bug here. And, after closer inspection, your
previous proposed patch is quite bogus because checkpointer is not
supposed to stop yet when the other processes are being terminated
normally.

Possibly it'd be useful to teach the postmaster more thoroughly about
SIGSTOP and have a way for it to really kill the remaining children
after you've finished investigating their state. But frankly this is
the first time I've heard of anybody using that feature at all; I always
thought it was a vestigial hangover from days when the kernel was too
stupid to write separate core dump files for each backend. I'd rather
remove SendStop than add more complexity there.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
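
[Editorial sketch, not part of the original message.] The signal behavior
described in the reply can be reproduced outside PostgreSQL with a plain
child process. The sketch below is only an illustration under stated
assumptions: it uses a throwaway Python child rather than a postmaster or
backend, the function name observe_stop_cont_kill is hypothetical, and it
assumes a Unix system where os.WUNTRACED and os.WCONTINUED are available.
It shows that a SIGSTOPped child stays alive (so a parent waiting for it to
exit would keep waiting), that SIGCONT simply resumes it, and that SIGKILL
does terminate a stopped child.

```python
import os
import signal
import subprocess
import sys


def observe_stop_cont_kill():
    """Stop, continue, re-stop, and kill a child; report what was observed."""
    # Stand-in for a child process; it just sleeps (not a PostgreSQL backend).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    pid = child.pid
    try:
        # SIGSTOP suspends the child; it does not make it exit.
        os.kill(pid, signal.SIGSTOP)
        _, status = os.waitpid(pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        alive_while_stopped = child.poll() is None

        # SIGCONT resumes the child, which carries on as if nothing happened.
        os.kill(pid, signal.SIGCONT)
        _, status = os.waitpid(pid, os.WCONTINUED)
        resumed = os.WIFCONTINUED(status)
        alive_after_cont = child.poll() is None

        # Stop it again, then SIGKILL: a stopped process can still be killed.
        os.kill(pid, signal.SIGSTOP)
        os.waitpid(pid, os.WUNTRACED)
        os.kill(pid, signal.SIGKILL)
        child.wait(timeout=10)

        return {
            "stopped": stopped,
            "alive_while_stopped": alive_while_stopped,
            "resumed": resumed,
            "alive_after_cont": alive_after_cont,
            "killed_by_sigkill": child.returncode == -signal.SIGKILL,
        }
    finally:
        # Clean up if anything above failed before the child was reaped.
        if child.poll() is None:
            child.kill()
            child.wait()


if __name__ == "__main__":
    for name, value in observe_stop_cont_kill().items():
        print(f"{name}: {value}")
```

This mirrors the manual sequence in the message only loosely: the real
postmaster state machine, its handling of SIGINT, and the checkpointer's
shutdown ordering are not modeled here.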