Two-tiered haproxy setup and managing queues and back pressure
We have a setup that requires two haproxy tiers, where the first forwards connections to the second. What I want to know is the theory of how (and why) I should tune my maxconn, backlog and timeout settings to handle queues, overloads and back pressure in situations where my backends are overloaded.

Think of an array of virtual machines called A, whose size is between 10 and 20, and another array B whose size is 100-200. A client on an A server wants to connect to a random worker in array B. Each machine in B runs a haproxy with a single listen clause whose backends point to the workers on the same host. This means that all connections to a worker on a B host go through the haproxy on that host. The A servers run haproxies with a listen clause that has one backend entry per server in the B array. Clients on the A servers connect to localhost, so they reach the haproxy on machine A, which routes the request to a suitable server B haproxy, which in turn routes it to a worker on that node.

This works, but I'm not sure how I should tune my configuration so that if any server in the B array gets overloaded, the haproxies in array A will avoid that server. I'm thinking that I should use the "retries" setting in haproxy A, so that if it can't connect to the first selected server B it will try another, but I'm not sure how I should configure haproxy B so that this happens. If I set both the maxconn and backlog settings low enough in B, will this achieve it, and what actually goes on in terms of SYN, SYN+ACK, kernel backlog queues and haproxy frontend queues? I'm pretty sure I need to lab this out so that I can use wireshark to really look at what is going on, but the lab setup is non-trivial and I could use some good theory on how this should work. - Garo
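To make the question concrete, here is a sketch of the direction I'm imagining. The directive names (maxconn, backlog, timeout queue, retries, option redispatch) are real haproxy settings, but every server name, address and number below is made up for illustration; this is not a tested configuration:

```haproxy
# haproxy on a B host (hypothetical names/ports): keep the connection
# budget small so that overload is signalled back to tier A instead of
# being absorbed locally.
listen workers
    bind :8080
    maxconn 100       # at most 100 concurrent connections handled here
    backlog 20        # small kernel accept queue; excess SYNs go unanswered
    timeout queue 2s  # fail queued requests fast instead of parking them
    server worker1 127.0.0.1:9001 maxconn 10
    server worker2 127.0.0.1:9002 maxconn 10

# haproxy on an A host: if the chosen B server cannot be connected to,
# retry, and allow the retry to be redispatched to a different B server.
listen tier-b
    bind 127.0.0.1:8080
    balance roundrobin
    retries 3
    option redispatch
    timeout connect 2s
    server b1 10.0.1.1:8080 check
    server b2 10.0.1.2:8080 check
```

The intended behaviour, as I understand the theory: with per-server maxconn set, haproxy B queues excess requests (bounded by timeout queue) rather than piling them onto a busy worker; once its frontend maxconn is reached it stops accepting, new handshakes fill the kernel backlog, and when that is full further SYNs from tier A get no SYN+ACK, so tier A's connect timeout fires and retries/redispatch move the connection to another B host. Whether this is the right tuning in practice is exactly what I'd like confirmed.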
Re: Can I add one new server without stop the haproxy process?
You can't. The socket admin interface only allows you to disable existing servers and then re-enable them; you can't add a completely new server. However, you can reload haproxy in a way that minimises, and on some platforms eliminates, dropped connections. Your init script probably already does this with its reload command. For example, this blog post describes it: https://medium.com/@Drew_Stokes/actual-zero-downtime-with-haproxy-18318578fde6 but I've also heard that on modern Linux kernels even the iptables SYN trick isn't needed, though I can't confirm that. On Mon, Sep 15, 2014 at 3:02 AM, Zebra max...@unitedstack.com wrote: Hi,all How can I add one new server without stop the haproxy process? Looking forward to your reply! Thanks, Zebra
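For reference, the graceful reload that init scripts typically perform is built on haproxy's own -sf flag. A sketch of the usual invocation (the config and pid file paths are assumptions; adjust to your setup):

```shell
# Validate the new configuration first; reloading with a broken config
# would otherwise leave you without a working haproxy.
haproxy -c -f /etc/haproxy/haproxy.cfg

# Start a new haproxy process and pass it the old pid(s) via -sf: the
# old process stops listening and finishes serving its existing
# connections while the new process takes over the listeners.
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)
```

This is the mechanism the blog post above builds on; the iptables SYN-dropping trick it mentions only covers the tiny window between the old process closing its sockets and the new one binding them.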
Re: HAProxy 1.5 incorrectly marks servers as DOWN
I'm more and more suspecting that this is a new bug introduced in the 1.5.x branch. We don't have these issues in our 1.4 deployment, and it seems that the check.c file has undergone a lot of changes and refactoring between 1.4.18 and 1.5.4. I started to look at how the check subsystem works, but the learning curve is quite steep on this one. I'm hoping that maybe Willy (it seems that your name is all over the changelog for check.c) might have some clues which I could then pursue further. The biggest problem is that this bug seems to be nondeterministic, but on the good side, I can run modified haproxy binaries in my environment so that I can trace this further. On Mon, Sep 8, 2014 at 11:30 AM, Juho Mäkinen j...@unity3d.com wrote: On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis pavlos.paris...@gmail.com wrote: On 04/09/2014 08:55 πμ, Juho Mäkinen wrote: I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious problem where haproxy marks some backend servers as being DOWN with the message L4TOUT in 2000ms. Are you sure that you haven't reached any sort of limits on your backend servers? Number of open files, etc.? Quite sure, because I can always use curl from the haproxy machine to the backend machine and I always get the response to the check command without any delays. Are you sure that the backend servers return a response with HTTP status 200 on health checks? Yes. I also ran strace on a single haproxy process when haproxy marked multiple backends as being down.
Here's an example output:

08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500), sin_addr=inet_addr(172.16.6.102)}, 16) = -1 EINPROGRESS (Operation now in progress)
08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
08:06:07.303167 sendto(30, GET /check HTTP/1.0\r\n\r\n, 23, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
08:06:07.304522 recvfrom(30, HTTP/1.1 200 OK\r\nX-Powered-By: Express\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods: GET, HEAD, POST, PUT, DELE..., 16384, 0, NULL, NULL) = 503
08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
08:06:07.304666 close(30) = 0

So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I analysed around 20 different checks via strace to the same backend (which is marked down by haproxy) and none of them took over one second.

Here's an example from the haproxy logs of what happens when the problem starts:

Sep 8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615] https comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 - - 1577/1577/3/1/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280] https~ comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200 40370 - - 2769/2769/6/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.090] https~ comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - - SH-- 2733/2733/7/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 231ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24281]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 217ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24282]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 137ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 393ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661] https comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - - SD-- 2718/2718/5/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:27 localhost haproxy[24278]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:27 localhost haproxy[24279]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:28 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 2ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet
Re: HAProxy 1.5 incorrectly marks servers as DOWN
Thanks Pavlos for your help. Fortunately (and embarrassingly for me) the mistake was not anywhere near haproxy: instead, my haproxy configuration template system had a bug which mixed up backend names and IP addresses. Because of this, haproxy showed different names for the servers which were actually down, and that threw me way off when I investigated this issue, blinded by the actual problem which was always so near in sight. :( haproxy shows the server name in the log when it reports health check statuses. Example: Health check for server comet/comet-172.16.4.209:3500 succeeded, reason: Layer7 check passed, code: 200, info: OK, check duration: 2ms, status: 3/3 UP. This could be improved by also showing the actual IP and port in the log. Suggestion: Health check for server comet/comet-172.16.4.209:3500 (172.16.4.209:3500) succeeded, reason: Layer7 check passed, code: 200, info: OK, check duration: 2ms, status: 3/3 UP. As a side question, the documentation was a bit unclear: if I have nbproc > 1 and I use the admin socket to change servers' administrative status down or up, do I need to do it on a separate admin socket per haproxy process, or can I just use one admin socket? You need a different socket. Each process can only be managed by a dedicated stats socket. There isn't any kind of aggregation where you issue a command to one stats socket and the command is pushed to all processes. The next release will address this kind of issue. Thank you, good to know! - Garo
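To illustrate the per-process handling described above: with multiple processes you typically declare one stats socket per process in the global section, e.g. "stats socket /var/run/haproxy-1.sock process 1" (one line per process; these paths are made up for the example), and then send the same command to each socket in turn, for instance with socat:

```shell
# Hypothetical socket paths; the server name is taken from the thread
# above. Each haproxy process only sees commands sent to its own socket,
# so the command must be repeated for every socket.
for s in /var/run/haproxy-*.sock; do
    echo "disable server comet/comet-172.16.4.209:3500" | socat stdio "$s"
done
```

This is a sketch of the commonly used pattern, not a tested setup; check the stats socket section of the configuration manual for the exact bind options your version supports.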
Re: HAProxy 1.5 incorrectly marks servers as DOWN
On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis pavlos.paris...@gmail.com wrote: On 04/09/2014 08:55 πμ, Juho Mäkinen wrote: I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious problem where haproxy marks some backend servers as being DOWN with the message L4TOUT in 2000ms. Are you sure that you haven't reached any sort of limits on your backend servers? Number of open files, etc.? Quite sure, because I can always use curl from the haproxy machine to the backend machine and I always get the response to the check command without any delays. Are you sure that the backend servers return a response with HTTP status 200 on health checks? Yes. I also ran strace on a single haproxy process when haproxy marked multiple backends as being down. Here's an example output:

08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500), sin_addr=inet_addr(172.16.6.102)}, 16) = -1 EINPROGRESS (Operation now in progress)
08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
08:06:07.303167 sendto(30, GET /check HTTP/1.0\r\n\r\n, 23, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
08:06:07.304522 recvfrom(30, HTTP/1.1 200 OK\r\nX-Powered-By: Express\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods: GET, HEAD, POST, PUT, DELE..., 16384, 0, NULL, NULL) = 503
08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
08:06:07.304666 close(30) = 0

So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I analysed around 20 different checks via strace to the same backend (which is marked down by haproxy) and none of them took over one second.
Here's an example from the haproxy logs of what happens when the problem starts:

Sep 8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615] https comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 - - 1577/1577/3/1/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280] https~ comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200 40370 - - 2769/2769/6/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.090] https~ comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - - SH-- 2733/2733/7/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:25 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 231ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24281]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 217ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24282]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 137ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 393ms, status: 2/3 UP.
Sep 8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661] https comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - - SD-- 2718/2718/5/0/0 0/0 GET /mobile HTTP/1.1
Sep 8 07:22:27 localhost haproxy[24278]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:27 localhost haproxy[24279]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep 8 07:22:28 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 2ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24282]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:28 localhost haproxy[24283]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 2ms, status: 1/3 UP.
Sep 8 07:22:29 localhost haproxy[24278]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep 8 07:22:29 localhost haproxy[24279]: Health check
Re: Spam to this list?
On Fri, Sep 5, 2014 at 1:17 PM, Lukas Tribus luky...@hotmail.com wrote: Restricting the list to subscribed users (subonlypost) is not a good thing either May I ask why this is not a good thing? I see no valid reason why non-subscribed members should be allowed to post. The subscription process already checks that the sender's email address is valid, so it should be a decent way to remove most of the spam with very little configuration and maintenance. I think all the other lists I'm subscribed to require that you be subscribed in order to post. - Garo
HAProxy 1.5 incorrectly marks servers as DOWN
I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious problem where haproxy marks some backend servers as being DOWN with the message "L4TOUT in 2000ms". Sometimes the message also has a star: "* L4TOUT in 2000ms" (I didn't find what the star means in the docs). Also, the reported timeout varies between 2000ms and 2003ms. This does not happen to every backend and it doesn't happen immediately: after a restart every backend is green, and a few backends start to get marked DOWN after about 30 minutes or so. I'm also running two instances on two different servers and they both suffer from the same problem, but the DOWN servers aren't the same. So server A might be marked DOWN on haproxy-1 and server B marked DOWN on haproxy-2 (or vice versa). This seems to happen regardless of how much traffic I run into the haproxies. I can always ssh into the haproxies and run curl against the check url, and it always works, so this problem seems to be inside haproxy. My haproxy config is kind of long so I copied it here: http://koti.kapsi.fi/garo/nobackup/haproxy.cfg (I've sanitised it a bit, but only the hostnames). I've run the logging with verbose debugging to check if that gives any clues on the health check issue, but the logs did not reveal anything to my eye. I could gather a new log sample on the health checks, but the haproxies are now receiving production traffic, so the log volume would be too much to gather at the moment. I've also gathered some tcpdump traffic on the hosts marked DOWN, and strangely it seems that the hosts are receiving queries. It could be that one (or more) processes (I'm using nbproc 7 on my 8-core AWS c3.2xlarge instance) haven't marked the host down. Trying to refresh the stats URI doesn't seem to indicate this, but it's hard to be sure, as the probability of going through all seven different processes fast enough is low. All clues and debugging ideas are greatly appreciated.