[HACKERS] orangutan seizes up during isolation-check

Noah Misch Mon, 01 Sep 2014 18:36:06 -0700

Buildfarm member orangutan has failed chronically on both of the branches for
which it still reports, HEAD and REL9_1_STABLE, for over two years.  The
postmaster appears to jam during isolation-check.  Dave, orangutan currently
has one such jammed postmaster for each branch.  Could you gather some
information about the running processes?  Specifically, it would be helpful to
see the output of "ps -el" and a stack trace for each running PostgreSQL
process.  (If there are enough PostgreSQL processes to make stack traces
tedious to acquire, it will be almost as good to have traces for each
postmaster and one autovacuum worker per postmaster.)  Thanks.  Best not to
kill the processes yet, in case we need more information.



The rest of this message is just a dump of my observations from the data already
available.  The jammed postmasters fail to complete fast shutdown requests.
Beyond that, the symptoms are different on HEAD versus 9.1.  The 2014-07-09
run is representative for HEAD.  multiple-row-versions.spec failed like this
after having run for almost 21 hours:

--- 1,2 ----
  Parsed test spec with 4 sessions
! Connection 2 to database failed: 
\ No newline at end of file

I don't know what would cause PQconnectdb() to hang for 21 hours before
failing with a blank error message.  Note that the hang duration and the spec
in which the hang falls vary from failure to failure.  All subsequent specs
then fail like this:

--- 1,4 ----
  Parsed test spec with 2 sessions
! Connection 0 to database failed: could not connect to server: Connection refused
!       Is the server running locally and accepting
!       connections on Unix domain socket "/tmp/.s.PGSQL.5678"?

One can get ECONNREFUSED from a Unix-domain socket when the listen() backlog
is full.  At this point, we've made only two connection attempts since the
last successful one and only about 40 attempts since the last postmaster startup.
I have no good theories remaining at the moment.  The postmaster log ends in
1211 copies of this message:

WARNING:  worker took too long to start; canceled.

At the default autovacuum_naptime=1min, that represents 20:11:00 of autovacuum
launch failures.  The postmaster had been running about 20:55:42 by the time
we collected that log, suggesting that autovacuum was healthy until 40-45
minutes into the doomed PQconnectdb() call.  I'm hypothesizing that the
postmaster ceased serving autovacuum launcher requests.  A jammed postmaster
tends to explain both the ECONNREFUSED symptom and the autovacuum symptom.
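
The timing arithmetic above can be checked with a short sketch.  This is an
illustration only, not part of the original investigation: the variable and
function names are hypothetical, and it assumes the launcher logs exactly one
WARNING per autovacuum_naptime interval.

```python
from datetime import timedelta

# Figures quoted in the message above.
NAPTIME = timedelta(minutes=1)      # default autovacuum_naptime
WARNING_COUNT = 1211                # copies of the WARNING at the end of the log
UPTIME = timedelta(hours=20, minutes=55, seconds=42)  # postmaster uptime at log collection


def failure_window(count, naptime):
    """Duration covered by `count` warnings, assuming one warning per naptime."""
    return count * naptime


failing = failure_window(WARNING_COUNT, NAPTIME)
healthy = UPTIME - failing          # time before the first launch failure

print(failing)   # 20:11:00
print(healthy)   # 0:44:42
```

The roughly 45 minutes of healthy operation is what places the onset of the
autovacuum failures at 40-45 minutes after postmaster startup.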



In REL9_1_STABLE, isolation-check completes, but the StopDb-C:2 step that
follows isolation-check fails to stop the server.  (If you go back far enough
in the history, suites other than isolation-check occasionally jam the
server.)  The server log ends like this:

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down

That suggests a postmaster stuck in PM_WAIT_BACKENDS.  The process data should
illuminate this situation.

