On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis <[email protected]>
wrote:
> On 04/09/2014 08:55 πμ, Juho Mäkinen wrote:
> > I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
> > problem where haproxy marks some backend servers as being DOWN with a
> > message "L4TOUT in 2000ms".
> Are you sure that you haven't reached any sort of limits on your backend
> servers? Number of open files and etc...
Quite sure because I can always use curl from the haproxy machine to the
backend machine and I get the response to the check command always without
any delays.
Are you sure that backend servers return a response with HTTP status 200
> on healtchecks?
>
Yes. I also ran strace on a single haproxy process when the haproxy marked
multiple backends as being down. Here's an example output:
08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500),
sin_addr=inet_addr("172.16.6.102")}, 16) = -1 EINPROGRESS (Operation now in
progress)
08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
08:06:07.303167 sendto(30, "GET /check HTTP/1.0\r\n\r\n", 23,
MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
08:06:07.304522 recvfrom(30, "HTTP/1.1 200 OK\r\nX-Powered-By:
Express\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods:
GET, HEAD, POST, PUT, DELE"..., 16384, 0, NULL, NULL) = 503
08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0},
8) = 0
08:06:07.304666 close(30) = 0
So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I
analysed around 20 different checks via the strace to the same backend
(which is marked down by haproxy) and none of them was over one second.
Here's an example from haproxy logging what happens when the problem starts:
Sep 8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615] https
comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 - - ----
1577/1577/3/1/0 0/0 "GET /mobile HTTP/1.1"
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280] https~
comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200 40370 - -
---- 2769/2769/6/0/0 0/0 "GET /mobile HTTP/1.1"Sep 8 07:22:25 localhost
haproxy[24284]: [08/Sep/2014:07:22:25.090] https~
comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - - SH--
2733/2733/7/0/0 0/0 "GET /mobile HTTP/1.1"
Sep 8 07:22:25 localhost haproxy[24280]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
info: "Connection reset by peer", check duration: 231ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24281]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
info: "Connection reset by peer", check duration: 217ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24282]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
info: "Connection reset by peer", check duration: 137ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24284]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
info: "Connection reset by peer", check duration: 393ms, status: 2/3 UP.Sep
8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661] https
comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - - SD--
2718/2718/5/0/0 0/0 "GET /mobile HTTP/1.1"
Sep 8 07:22:27 localhost haproxy[24278]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:27 localhost haproxy[24279]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:28 localhost haproxy[24280]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 2ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24284]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24282]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24283]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 2ms, status: 1/3 UP.
Sep 8 07:22:29 localhost haproxy[24278]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:29 localhost haproxy[24279]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 1/3 UP.
Sep 8 07:22:29 localhost haproxy[24281]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
Sep 8 07:22:29 localhost haproxy[24281]: Server
comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Sep 8 07:22:29 localhost haproxy[24280]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
Sep 8 07:22:29 localhost haproxy[24280]: Server
comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Sep 8 07:22:29 localhost haproxy[24284]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
Sep 8 07:22:29 localhost haproxy[24284]: Server
comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 14 active and 0 backup
servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Sep 8 07:22:30 localhost haproxy[24282]: Health check for server
comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection
problem, info: "Connection refused", check duration: 0ms, status: 0/2 DOWN.
Sep 8 07:22:30 localhost haproxy[24282]: Server
comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0 backup
servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
So first there's a Socket error, info: "Connection reset by peer", followed
with Layer4 connection problem, info: "Connection refused" until the server
is marked DOWN.
> You have only 1 stats server while you have 7 processes. You need to
> enable for each process a stats socket, here is an example from a 24
> processes
>
All individual haproxy processes marks the same backend servers being down
once the problem starts. Another machine has also similar problem, but the
backend severs which are marked down are not the same. Note that I have
three backends which all share the same servers (comet, comet-priority and
comet-getcampaigns). The different backends might have different servers
marked as DOWN. Example screenshot:
http://koti.kapsi.fi/garo/nobackup/haproxy-different-servers-down.png (the
brown servers are marked down manually via the admin socket)
As a side question: The documentation was a bit unclear. If I have nbproc >
1 and I use the admin socket to turn servers administrative status down or
up, do I need to do it to separated admin sockets per haproxy process, or
can I just use one admin socket?