2018-02-23 14:20 GMT-03:00 Andres Freund <and...@anarazel.de>:
> On 2018-02-23 13:33:18 -0300, Andre Oliveira Freitas wrote:
> > Since it's been happening for a few weeks now, every time it freezes we
> > take a gcore dump and check it in gdb... and after a lot of hair pulling
> > and learning about the innards of the VoIP software we see that most often
> > the software is stuck in this call trace:
> > #0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523)
> > at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
> > #1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> > #2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> > #3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> So it's just receiving data from the network. Have you verified whether
> the connection is actually stable? Any chance it's just waiting for the
> network to time out? Might be worth configuring TCP timeouts, to make
> sure it's unrelated to that.
> What is the server showing as activity while the client is waiting?
> Could you show the corresponding pg_stat_activity row?
14:11:56;;2018-02-20 14:24:15;2018-02-20 14:24:15;;;idle;;;COMMIT
14:11:57;;2018-02-20 14:24:57;2018-02-20 14:24:57;;;idle;;;COMMIT
14:16:02;;2018-02-20 14:16:31;2018-02-20 14:16:31;;;idle;;;insert into
sip_authentication (nonce,expires,profile_name,hostname, last_nc)
values('363d02f6-cb9a-4791-9e05-d18473a18812', 1519147649, 'internal',
14:22:13;;2018-02-20 14:25:09;2018-02-20 14:25:09;;;idle;;;select
command from aliases where alias='show status'
The problematic connection is the third one. Judging by its
query_start, that insert appears to be the last query the connection
executed before it got stuck. There are no outstanding locks on any
of the tables the VoIP software normally uses.
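
On the TCP timeout suggestion: libpq accepts the connection parameters
keepalives, keepalives_idle, keepalives_interval and keepalives_count,
which should be the simplest way to try this. As a rough sketch of what
those settings correspond to at the socket level on Linux, something
like the following could be used. The helper name enable_keepalive and
the values 30/10/3 in the usage note are assumptions for illustration,
not anything taken from libpq or from the VoIP software.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of libpq or the VoIP software.
 * Enables TCP keepalive on fd using the Linux-specific options
 * TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_keepalive(int fd, int idle_s, int interval_s, int count)
{
    int on = 1;

    assert(fd >= 0);

    /* Turn keepalive probing on for this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    /* Seconds the connection must be idle before the first probe. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    /* Seconds between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) < 0)
        return -1;
    /* Unanswered probes allowed before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

With, say, enable_keepalive(fd, 30, 10, 3), a dead peer would be
noticed after roughly 30 + 3 * 10 seconds of silence instead of the
much longer kernel defaults.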
> > The software shares a database connection between threads, and controls its
> > access through a mutex, so once one thread that acquires the mutex gets
> > stuck in the location above, all other threads start piling up behind the
> > mutex, and that's apparently the reason the software stops responding for
> > most of its functions (while other functions that do not depend on the
> > database work normally).
> Hm, have you compiled libpq with threading support? Or use a
> distribution that compiles it with that? While I don't see an obvious
> connection to that stacktrace it seems worthwhile to verify.
> A mutex protecting an individual connection, while also necessary if
> connections are shared, doesn't achieve the same.
I'm using the libpq that comes with Debian; however, I can install the
library from the official repository to be sure. I assume the one from
the official repo has it enabled.
> > I wonder if anyone has any tip on what to look for next...
> Any chance you're occasionally forking and then interacting with the
> connection in the forked process?
As far as I know, no. The software forks at the beginning, but from
then on it's just threads.
> Andres Freund
If it is of any help, here is the link to the implementation that
It is a function to check if the database connection is up before
running a query. I'm not a maintainer nor an expert in pg, but we
reviewed the implementation and it seems OK.
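
To illustrate the kind of check involved, here is a minimal sketch of
waiting for a socket to become readable with a timeout before reading
from it, so that a thread holding the mutex cannot sit in recv
indefinitely. It works on a plain file descriptor; with libpq the
descriptor would come from PQsocket(conn). The names wait_readable and
recv_with_timeout and the -2 timeout code are assumptions for this
example, and this is not the VoIP software's actual implementation.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait until fd has something to read.
 * Returns 1 if poll() reports an event on fd (data, hangup or error),
 * 0 if timeout_ms elapsed first, -1 if poll() itself failed. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts poll(). Note that each retry
     * restarts the full timeout, which is acceptable for a sketch. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Hypothetical helper: read at most n bytes, giving up after timeout_ms.
 * Returns the byte count from recv() (0 means the peer closed),
 * -1 on error, or -2 if nothing arrived within the timeout. */
ssize_t recv_with_timeout(int fd, void *buf, size_t n, int timeout_ms)
{
    int ready = wait_readable(fd, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return -2;
    /* flags is 0 here, i.e. a plain read of whatever is available. */
    return recv(fd, buf, n, 0);
}
```

On a timeout the caller can release the mutex and report the connection
as suspect, rather than leaving every other thread queued behind it.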
One thing that is bothering me by looking at the gdb backtraces is
that recv always seems to be receiving a non-zero value in flags, even
though libpq seems to pass zero. I don't know if it's of any relevance.
Thanks in advance,