Hello,

A while back someone showed me a program blocking in libpq 9.2 on Solaris 11, inside sigwait() called by pq_reset_sigpipe(). It had happened a couple of times before, during a period of instability and crashes with a particular DB (a commercial PostgreSQL derivative, but the client was using regular libpq). This was a very busy multithreaded client in which each thread had its own connection.

My theory is that if two connections accessed by different threads get shut down at around the same time, there is a race: each thread fails to write to its socket, sees errno == EPIPE, and then sees a pending SIGPIPE with sigpending(), but only one thread returns from sigwait() due to signal merging.
We never saw the problem again after we made the following change:

--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -450,7 +450,6 @@ void
 pq_reset_sigpipe(sigset_t *osigset, bool sigpipe_pending, bool got_epipe)
 {
     int         save_errno = SOCK_ERRNO;
-    int         signo;
     sigset_t    sigset;

     /* Clear SIGPIPE only if none was pending */
@@ -460,11 +459,13 @@ pq_reset_sigpipe(sigset_t *osigset, bool sigpipe_pending, bool got_epipe)
             sigismember(&sigset, SIGPIPE))
         {
             sigset_t    sigpipe_sigset;
+            siginfo_t   siginfo;
+            struct timespec timeout = { 0, 0 };

             sigemptyset(&sigpipe_sigset);
             sigaddset(&sigpipe_sigset, SIGPIPE);

-            sigwait(&sigpipe_sigset, &signo);
+            sigtimedwait(&sigpipe_sigset, &siginfo, &timeout);
         }
     }

Does this make any sense?

Best regards,
Thomas Munro