Hi Pieter,
On Mon, Jun 11, 2018 at 10:48:25PM +0200, PiBa-NL wrote:
> Hi List,
>
> I've got no clue how I got into this state ;) and maybe there is nothing
> wrong.. (well, I did resume a VM that was suspended for half a day..)
>
> Still, I thought it might be worth reporting, or perhaps it's solved already as
> there are a few fixes for threads after the 6-6 snapshot that I built with..
> Sometimes all that some people need is half an idea to find a problem... So
> maybe there is something that needs fixing??
This one is not known yet, to the best of my knowledge, or at least
not reported yet.
> (gdb) info threads
> Id Target Id Frame
> * 1 LWP 100660 of process 56253 0x00000000005b0202 in thread_sync_barrier
> (barrier=0x8bc690 <thread_enter_sync.barrier>) at src/hathreads.c:109
> 2 LWP 101036 of process 56253 0x000000000050874a in process_chk_conn
> (t=0x8025187c0, context=0x802482610, state=33) at src/checks.c:2112
> 3 LWP 101037 of process 56253 0x000000000050b58e in
> enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600,
> msg=0x7fffdfdfc770 "Health check for server Test-SNI_ipvANY/srv451-4
> failed, reason: Layer4 connection problem, info: \"General socket error
> (Network is unreachable)\", check duration: 0ms, status: 0/2 DOWN") at
> src/checks.c:3396
This is quite odd. Either someone left a function without releasing the
server's lock or the email queue's lock (suspicious but possible),
or both locks just happen to be the same one. And at first glance, seeing
that process_chk_conn() is called with a context of 0x802482610, which is
only 16 bytes above enqueue_one_email_alert()'s queue parameter, this
last possibility suddenly seems quite likely.
So my guess is that we take the server's lock in process_chk_conn(),
and that we go down through this call chain:
chk_report_conn_err()
-> set_server_check_status()
-> send_email_alert()
-> enqueue_email_alert()
-> enqueue_one_email_alert()
And here we spin on a lock which very likely is the same one, even though
I have no idea why. I suspect the way the queue or its lock is retrieved
there is incorrect, which would explain the situation as a recursive
lock, but I'm afraid I can't say for sure since I don't know this part
of the code :-(
Note, it could also be that the queue's lock was never initialized,
which would make the inner lock block there with the server's lock held,
causing the second thread to spin and the third one to wait for the
barrier. I'll have to ask for some help on this part. I'm pretty
confident that 1.8 is affected as well.
Thanks,
Willy