> > Buildfarm member orangutan has failed chronically on both of the branches
> > for which it still reports, HEAD and REL9_1_STABLE, for over two years.  The
> > postmaster appears to jam during isolation-check.  Dave, orangutan currently
> > has one such jammed postmaster for each branch.  Could you gather some
> > information about the running processes?
> What's particularly odd is that orangutan seems to be running an only
> slightly out-of-date OS X release, which is hardly an unusual
> configuration.  My own laptop gets through isolation-check just fine.
> Seems like there must be something nonstandard about orangutan's
> software ... but what?

Agreed.  The difference is durable across OS X releases, because orangutan
showed similar symptoms under 10.7.3.  Dave assisted me off-list with data
collection and experimentation.  Ultimately, --enable-nls was the key
distinction, the absence of which spares the other OS X buildfarm animals.

The explanation for ECONNREFUSED was more pedestrian than the reasons I had
guessed.  There were no jammed postmasters running as of the above writing.
Rather, the postmasters were gone, but the socket directory entries remained.
That happens when the postmaster suffers a "kill -9", a SIGSEGV, an assertion
failure, or a similar abrupt exit.

When I reproduced the problem, CountChildren() was attempting to walk a
corrupt BackendList.  Sometimes, the list had an entry such that e->next == e;
such an entry sends CountChildren() into an infinite loop.  Other times,
testing "if (bp->dead_end)" prompted a segfault.  That explains orangutan
sometimes failing quickly and other times hanging for hours.  Every crash
showed at least two threads running in the postmaster.  Multiple threads bring
trouble in the form of undefined behavior for fork() without exec() and for
sigprocmask().  The postmaster uses sigprocmask() to block most signals when
doing something nontrivial; since handlers then cannot interrupt that work or
one another, it can afford to do nontrivial work inside signal handlers.  A
sequence of 74 buildfarm runs caught 27 cases of a secondary thread running a
signal handler, 14 cases of two signal handlers running at once, and one
user-visible postmaster failure.

libintl replaces setlocale().  Its setlocale(LC_x, "") uses OS-specific APIs
to determine the default locale when $LANG and similar environment variables
are empty, as they are during "make check NO_LOCALE=1".  On OS X, it calls[1]
CFLocaleCopyCurrent(), which in turn spins up a thread.  See the end of this
message for the postmaster thread stacks active upon hitting a breakpoint set
at _dispatch_mgr_thread.

I see two options for fixing this in pg_perm_setlocale(LC_x, ""):

1. Fork, call setlocale(LC_x, "") in the child, pass back the effective locale
   name through a pipe, and pass that name to setlocale() in the original
   process.  The short-lived child will get the extra threads, and the
   postmaster will remain clean.

2. On OS X, check for relevant environment variables.  Finding none, set
   LC_x=C before calling setlocale(LC_x, "").  A variation is to raise
   ereport(FATAL) if sufficient environment variables aren't in place.  Either
   way ensures the libintl setlocale() will never call CFLocaleCopyCurrent().
   This is simpler than (1), but it entails a behavior change: "LANG= initdb"
   will use LANG=C or fail rather than use the OS X user account locale.
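
A minimal sketch of option (2), under stated assumptions: the helper names
env_is_set() and setlocale_env_or_c() are hypothetical, the lookup order
LC_ALL, then LC_x, then LANG is assumed to match what setlocale(LC_x, "")
consults, and setenv() stands in for whatever environment-setting routine
the real code would use.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* True if the variable exists and is non-empty. */
static int
env_is_set(const char *name)
{
    const char *val = getenv(name);

    return val != NULL && val[0] != '\0';
}

/*
 * Hypothetical helper: behaves like setlocale(category, ""), except that
 * when none of LC_ALL, the category's own variable (envname, for example
 * "LC_COLLATE"), or LANG is set, it first forces envname=C.  The intent is
 * that setlocale() always finds an answer in the environment and never
 * needs an OS-specific fallback.
 */
char *
setlocale_env_or_c(int category, const char *envname)
{
    assert(envname != NULL);

    if (!env_is_set("LC_ALL") && !env_is_set(envname) && !env_is_set("LANG"))
    {
        if (setenv(envname, "C", 1) != 0)
            return NULL;
    }

    return setlocale(category, "");
}
```

The ereport(FATAL) variation would replace the setenv() branch with an error
exit; either way the caller never reaches setlocale(LC_x, "") with an empty
environment.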

I'm skeptical of the value of looking up locale information using other OS X
facilities when the usual environment variables are inconclusive, but I see no
clear cause to reverse that decision now.  I lean toward (1).
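
Option (1) could be sketched roughly as follows.  This is an illustration
under assumptions, not a proposed patch: probe_default_locale() and
setlocale_via_child() are hypothetical names, the 256-byte name buffer is an
arbitrary choice, and error reporting is reduced to return codes.

```c
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run setlocale(category, "") in a short-lived child
 * and copy the effective locale name into buf.  Returns 0 on success, -1 on
 * any failure.  A name that fills the buffer is rejected, since it cannot
 * be told apart from a truncated one.
 */
int
probe_default_locale(int category, char *buf, size_t buflen)
{
    int         fds[2];
    pid_t       pid;
    size_t      used = 0;
    ssize_t     n = 0;
    int         status;

    assert(buf != NULL && buflen > 0);

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0)
    {
        /* Child: any threads setlocale() spins up die with this process. */
        const char *name;
        size_t      len, off = 0;

        close(fds[0]);
        name = setlocale(category, "");
        if (name == NULL)
            _exit(EXIT_FAILURE);
        len = strlen(name);
        while (off < len)
        {
            ssize_t     w = write(fds[1], name + off, len - off);

            if (w <= 0)
                _exit(EXIT_FAILURE);
            off += (size_t) w;
        }
        _exit(EXIT_SUCCESS);
    }

    /* Parent: read the name until EOF, then reap the child. */
    close(fds[1]);
    while (used < buflen - 1)
    {
        n = read(fds[0], buf + used, buflen - 1 - used);
        if (n <= 0)
            break;
        used += (size_t) n;
    }
    close(fds[0]);
    buf[used] = '\0';

    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }

    if (n != 0 || used == 0)    /* read error, full buffer, or no name */
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}

/* Hypothetical wrapper: apply the child's answer in the original process. */
char *
setlocale_via_child(int category)
{
    char        name[256];

    if (probe_default_locale(category, name, sizeof(name)) != 0)
        return NULL;
    return setlocale(category, name);
}
```

Because the parent passes an explicit name to setlocale(), it never takes
the default-locale lookup path itself.  The sketch presumes the fork happens
before the original process has acquired any extra threads.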



