Re: [HACKERS] Idea for improving buildfarm robustness

Andrew Dunstan Tue, 29 Sep 2015 12:09:03 -0700


On 09/29/2015 02:48 PM, Tom Lane wrote:

A problem the buildfarm has had for a long time is that if for some reason
the scripts fail to stop a test postmaster, the postmaster process will
hang around and cause subsequent runs to fail because of socket conflicts.
This seems to have gotten a lot worse lately due to the influx of very
slow buildfarm machines, but the risk has always been there.

I've been thinking about teaching the buildfarm script to "kill -9"
any postmasters left around at the end of the run, but that's fairly
problematic: how do you find such processes (since "ps" output isn't
hugely portable, especially not to Windows), and how do you tell them
apart from postmasters not started by the script?  So the idea was on
hold.

But today I thought of another way: suppose that we teach the postmaster
to commit hara-kiri if the $PGDATA directory goes away.  Since the
buildfarm script definitely does remove all the temporary data directories
it creates, this ought to get the job done.

An easy way to do that would be to have it check every so often if
pg_control can still be read.  We should not have it fail on ENFILE or
EMFILE, since that would create a new failure hazard under heavy load,
but ENOENT or similar would be reasonable grounds for deciding that
something is horribly broken.  (At least on Windows, failing on EPERM
doesn't seem wise either, since we've seen antivirus products randomly
causing such errors.)
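
As an illustration of the error classification described above, here is a minimal POSIX sketch. It is not PostgreSQL source: the function names are hypothetical, and treating ENOTDIR as one of the "or similar" cases is an assumption. The only fact taken from PostgreSQL itself is that pg_control lives under global/ in the data directory.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: does this errno, returned while opening pg_control,
 * suggest that the data directory has been removed?
 */
static bool
errno_means_datadir_gone(int err)
{
    switch (err)
    {
        case ENOENT:            /* file or a parent directory is missing */
        case ENOTDIR:           /* assumption: counts as "or similar" */
            return true;
        default:
            /* ENFILE, EMFILE, EPERM, etc. may be transient: keep running */
            return false;
    }
}

/* Hypothetical check: try to open $PGDATA/global/pg_control read-only. */
static bool
datadir_looks_gone(const char *datadir)
{
    char        path[1024];
    int         n;
    int         fd;

    assert(datadir != NULL);

    n = snprintf(path, sizeof(path), "%s/global/pg_control", datadir);
    if (n < 0 || (size_t) n >= sizeof(path))
        return false;           /* path did not fit: do not guess */

    fd = open(path, O_RDONLY);
    if (fd >= 0)
    {
        close(fd);
        return false;           /* still readable, all is well */
    }
    return errno_means_datadir_gone(errno);
}
```

The default branch errs on the side of staying up, which matches the concern about creating a new failure hazard under heavy load.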

I wouldn't want to do this every time through the postmaster's main loop,
but we could do this once an hour for no added cost by adding the check
where it does TouchSocketLockFiles; or once every few minutes if we
carried a separate variable like last_touch_time.  Once an hour would be
plenty to fix the buildfarm's problem, I should think.
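
The "separate variable like last_touch_time" option could look roughly like the sketch below. The variable name, the function name, and the five-minute interval are all hypothetical; the message only says "every few minutes".

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical interval standing in for "once every few minutes". */
#define DATADIR_CHECK_INTERVAL_SECS 300

/* Hypothetical counterpart to last_touch_time. */
static time_t last_datadir_check = 0;

/*
 * Return true at most once per interval. The caller would run the
 * pg_control check only when this returns true, so the main loop pays
 * for one integer comparison on every other iteration.
 */
static bool
datadir_check_due(time_t now)
{
    if (now - last_datadir_check < DATADIR_CHECK_INTERVAL_SECS)
        return false;
    last_datadir_check = now;
    return true;
}
```

In a main loop this would be called as `if (datadir_check_due(time(NULL))) ...`, next to wherever the existing periodic work is done.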

Another question is what exactly "commit hara-kiri" should consist of.
We could just abort() or _exit(1) and leave it to child processes to
notice that the postmaster is gone, or we could make an effort to clean
up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
kill all the children and exit.  The children are probably having
problems of their own if the data directory's gone, so forcing
termination might be best to keep them from getting stuck.
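
For concreteness, "kill all the children and exit" can be sketched with plain POSIX calls as below. This is not how the postmaster tracks or signals its children; the array of pids and both function names are hypothetical, and the idle child exists only to make the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical: send SIGQUIT to every tracked child, then reap each one.
 * Returns the number of children reaped. A caller treating this as a
 * SIGQUIT-style shutdown would exit right afterwards.
 */
static int
quit_children(const pid_t *children, int nchildren)
{
    int         reaped = 0;
    int         i;

    for (i = 0; i < nchildren; i++)
        (void) kill(children[i], SIGQUIT);

    for (i = 0; i < nchildren; i++)
    {
        int         status;
        pid_t       r;

        do
        {
            r = waitpid(children[i], &status, 0);
        } while (r < 0 && errno == EINTR);  /* retry if interrupted */

        if (r == children[i])
            reaped++;
    }
    return reaped;
}

/* Demo only: fork a child that sleeps until it is signalled. */
static pid_t
spawn_idle_child(void)
{
    pid_t       pid = fork();

    if (pid == 0)
    {
        for (;;)
            pause();
    }
    return pid;
}
```

Signalling every child before waiting on any of them means one stuck child does not delay the signal reaching the others.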

Also, perhaps we'd only enable this behavior in --enable-cassert builds,
to avoid any risk of a postmaster incorrectly choosing to suicide in a
production scenario.  Or maybe that's overly conservative.

Thoughts?

It's a fine idea. This is much more likely to be robust than any
buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best
way to go. axolotl doesn't, and it's one of those that regularly has
speed problems. Maybe a not-very-well-publicized GUC, or an environment
setting? Or maybe just enable this anyway. If the data directory is gone
what's the point in keeping the postmaster around? Shutting it down
doesn't seem likely to cause any damage.



cheers

andrew


