Hi Pieter,

On Sun, May 20, 2018 at 02:07:43AM +0200, PiBa-NL wrote:
> Hi List,
> 
> With 1.8.8 ran into this, tried latest 1.9dev snapshot seems to have the
> same issue..
> 
> Running with 3 threads, a template for 8 servers, and only 2 ip's in the dns
> response, neither of which is actually 'up' one responds with 'L4TOUT in
> 1004ms' the other with 'L4CON in 0ms' on stats page.. DNS comes from a local
> unbound dns-server on the same host as the haproxy process. after a little
> while (10 minutes+-) haproxy go's to 300% usage on all the 3 spin-locks and
> doesn't respond anymore.
> 
> There was no actual traffic passing through this machine. I am watching the
> stats page.
> Running with 1 thread the issue does not seem to appear.
> 
> I think 2 threads are deadlocking each other.?. and then later the 3rd joins
> the waiting game.
> I've added logging between a lot of most of the lock/unlock functions.. and
> it seems that the 2 succeeded locks below from lines 332 and 333 in attached
> logfile are the last ones where threads 1 and 0 are doing anything..
> 
> For example added logging like this everywhere ,the thread-id, the
> servername, and the function its inside of is logged with printf
> statements..:
> 
> static inline void health_adjust(struct server *s, short status)
> {
>     HA_SPIN_LOCK(SERVER_LOCK, &s->lock);
>     printf("tid:%u LOCKED srv: %s health_adjust\n", tid, s->id);
>     /* return now if observing nor health check is not enabled */
>     if (!s->observe || !s->check.task) {
>         printf("tid:%u UNLOCK srv: %s health_adjust a\n", tid, s->id);
>         HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock);
>         return;
>     }
> 
>     __health_adjust(s, status);
>     printf("tid:%u UNLOCK srv: %s health_adjust b\n", tid, s->id);
>     HA_SPIN_UNLOCK(SERVER_LOCK, &s->lock);
> }
> 
> This leads to the following calls that are seemingly never unlocked the tid:
> stops appearing, and actual callstacks go several functions deeper that try
> to gain another lock....:
> 
> tid:1 LOCKED srv: alias_srv1 process_chk_conn
> tid:0 LOCKED srv: alias_srv8 event_srv_chk_w
> 
> Then a little later the tid:2 also hangs.. probably it was time to perform
> another check or dns lookup or so...
> 
> (gdb) info threads
>   Id   Target Id         Frame
>   1    LWP 100730 of process 15238 0x00000000004ab6bc in
> snr_check_ip_callback (srv=0x8022d0400, ip=0x80288e394,
>     ip_family=0x7fffffffe207 "\002\224\343\210\002\b") at src/server.c:3781
>   2    LWP 100924 of process 15238 0x00000000004ab6b4 in
> snr_check_ip_callback (srv=0x8022cb000, ip=0x80288e394,
>     ip_family=0x7fffdfffda97 "\002\224\343\210\002\b") at src/server.c:3781
> * 3    LWP 100925 of process 15238 0x000000000051ef21 in dns_resolve_recv
> (dgram=0x8022483c0) at src/dns.c:1646
> 
> More gdb 'bt full' output of all 3 threads is at the bottom of the attached
> logfile.
> 
> Hope someone can try and fix this :) Config and haproxy -vv also added
> below.
> 
> Thanks
> PiBa-NL (Pieter)
> 

As usual, you did most of the work :)
I think I understand what is going on, and it's ugly as hell.
Does the attached patch fix it for you ? It's been generated from master,
but will probably apply against 1.8 as well.

Thanks !

Olivier
>From b938f86e1fe51e95adc73f9e583dd225f5ecf88d Mon Sep 17 00:00:00 2001
From: Olivier Houchard <ohouch...@haproxy.com>
Date: Tue, 22 May 2018 18:40:07 +0200
Subject: [PATCH] BUG/MEDIUM: dns: Delay the attempt to run a DNS resolution on
 check failure.

When checks fail, the code tries to run a dns resolution, in case the IP
changed.
The old way of doing that was to check, in case the last dns resolution
hadn't expired yet, if there were an applicable IP, which should be useless,
because it has already be done when the resolution was first done, or to
run a new resolution.
Both are a locking nightmare, and lead to deadlocks, so instead, just wake the
resolvers task, that should do the trick.

This should be backported to 1.8.
---
 src/dns.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/dns.c b/src/dns.c
index e42950212..e2d46a658 100644
--- a/src/dns.c
+++ b/src/dns.c
@@ -348,11 +348,10 @@ void dns_trigger_resolution(struct dns_requester *req)
        /* The resolution must not be triggered yet. Use the cached response, if
         * valid */
        exp = tick_add(res->last_resolution, resolvers->hold.valid);
-       if (res->status == RSLV_STATUS_VALID &&
-           tick_isset(res->last_resolution) && !tick_is_expired(exp, now_ms))
-               req->requester_cb(req, NULL);
-       else
-               dns_run_resolution(res);
+       if (resolvers->t && (res->status != RSLV_STATUS_VALID ||
+           !tick_isset(res->last_resolution) || tick_is_expired(exp, now_ms)))
+               task_wakeup(resolvers->t, TASK_WOKEN_OTHER);
+
 }
 
 
-- 
2.14.3

Reply via email to