Re: HAProxy 1.5 incorrectly marks servers as DOWN

Juho Mäkinen Mon, 08 Sep 2014 23:29:13 -0700

I'm more and more suspecting that this is a new bug introduced on the 1.5.x
branch. We don't have these issues on our 1.4 deployment and it seems that
the check.c file has undergone a lot of changes and refactoring between
1.4.18 and 1.5.4.


I started to look how the check subsystem works but the learning curve is
quite high on this one. I'm hoping that maybe Willy (it seems that your
name is all over the changelog on the check.c) might have some clues which
I could then pursue further. The biggest problem is that this bug seems to
be nondeterministic, but on the good side is that I can run modified
haproxy binarines in my environment so that I could trace this further.

On Mon, Sep 8, 2014 at 11:30 AM, Juho Mäkinen <[email protected]> wrote:

>
> On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis <
> [email protected]> wrote:
>
>> On 04/09/2014 08:55 πμ, Juho Mäkinen wrote:
>> > I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
>> > problem where haproxy marks some backend servers as being DOWN with a
>> > message "L4TOUT in 2000ms".
>> Are you sure that you haven't reached any sort of limits on your backend
>> servers? Number of open files and etc...
>
>
> Quite sure because I can always use curl from the haproxy machine to the
> backend machine and I get the response to the check command always without
> any delays.
>
> Are you sure that backend servers return a response with HTTP status 200
>> on healtchecks?
>>
>
> Yes. I also ran strace on a single haproxy process when the haproxy marked
> multiple backends as being down. Here's an example output:
>
> 08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500),
> sin_addr=inet_addr("172.16.6.102")}, 16) = -1 EINPROGRESS (Operation now in
> progress)
> 08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN
> (Resource temporarily unavailable)
> 08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> 08:06:07.303167 sendto(30, "GET /check HTTP/1.0\r\n\r\n", 23,
> MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
> 08:06:07.304522 recvfrom(30, "HTTP/1.1 200 OK\r\nX-Powered-By:
> Express\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods:
> GET, HEAD, POST, PUT, DELE"..., 16384, 0, NULL, NULL) = 503
> 08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0},
> 8) = 0
> 08:06:07.304666 close(30)               = 0
>
> So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I
> analysed around 20 different checks via the strace to the same backend
> (which is marked down by haproxy) and none of them was over one second.
>
> Here's an example from haproxy logging what happens when the problem
> starts:
>
> Sep  8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615] https
> comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 - - ----
> 1577/1577/3/1/0 0/0 "GET /mobile HTTP/1.1"
> Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280]
> https~ comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200 40370
> - - ---- 2769/2769/6/0/0 0/0 "GET /mobile HTTP/1.1"Sep  8 07:22:25
> localhost haproxy[24284]: [08/Sep/2014:07:22:25.090] https~
> comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - - SH--
> 2733/2733/7/0/0 0/0 "GET /mobile HTTP/1.1"
> Sep  8 07:22:25 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 231ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24281]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 217ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 137ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 393ms, status: 2/3 UP.Sep
>  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661] https
> comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - - SD--
> 2718/2718/5/0/0 0/0 "GET /mobile HTTP/1.1"
> Sep  8 07:22:27 localhost haproxy[24278]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
> Sep  8 07:22:27 localhost haproxy[24279]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
> Sep  8 07:22:28 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 2ms, status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24283]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 2ms, status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24278]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24279]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24281]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24281]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:29 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24280]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:29 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24284]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 14 active and 0 backup
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:30 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
> problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
> Sep  8 07:22:30 localhost haproxy[24282]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>
> So first there's a Socket error, info: "Connection reset by peer",
> followed with Layer4 connection problem, info: "Connection refused" until
> the server is marked DOWN.
>
>
>
>> You have only 1 stats server while you have 7 processes. You need to
>> enable for each process a stats socket, here is an example from a 24
>> processes
>>
>
> All individual haproxy processes marks the same backend servers being down
> once the problem starts. Another machine has also similar problem, but the
> backend severs which are marked down are not the same. Note that I have
> three backends which all share the same servers (comet, comet-priority and
> comet-getcampaigns). The different backends might have different servers
> marked as DOWN. Example screenshot:
> http://koti.kapsi.fi/garo/nobackup/haproxy-different-servers-down.png
> (the brown servers are marked down manually via the admin socket)
>
> As a side question: The documentation was a bit unclear. If I have nbproc
> > 1 and I use the admin socket to turn servers administrative status down
> or up, do I need to do it to separated admin sockets per haproxy process,
> or can I just use one admin socket?
>
>

Re: HAProxy 1.5 incorrectly marks servers as DOWN

Reply via email to