Two tiered haproxy setup and managing queues and back pressure

2017-02-15 Thread Juho Mäkinen
We have a setup which requires two haproxy tiers, where the first tier
forwards connections to the second. What I want to know is the theory of how
(and why) I should tune my maxconn, backlog and timeout settings to handle
queues, overload and back pressure in situations where my backends are
overloaded.

Think of an array of virtual machines called A, whose size is between 10 and
20, and another array B, whose size is between 100 and 200. Each server in A
runs a client which wants to connect to a random worker in array B.

Each machine in B runs a haproxy with a single listen section whose backend
servers point to workers on the same host. This means that all connections
to a worker on a host in array B go through the haproxy on that host.

The A servers run haproxies with a listen section that has one backend
server entry for each server in the B array. Clients on the A servers connect
to localhost, so they reach the local haproxy on machine A, which routes the
request to a suitable haproxy on a B server, which in turn routes it to a
worker on that node.
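
To make the tier A side concrete, here is a minimal sketch that renders such a
listen section with one server line per B host. The section name, addresses,
ports, the roundrobin balancing and the 1s connect timeout are placeholders I
made up for illustration, not our real values. I included "option redispatch"
because, as I understand the docs, it is what allows a retry to go to a
different server.

```python
def render_tier_a_config(b_hosts, listen_port=8000, b_port=8000, retries=3):
    """Render a haproxy 'listen' section with one server line per B host.

    All names, ports and timeouts are illustrative assumptions.
    """
    lines = [
        "listen to_tier_b",                      # hypothetical section name
        f"    bind 127.0.0.1:{listen_port}",     # clients on A connect to localhost
        "    balance roundrobin",                # assumption: any balancing would do
        f"    retries {retries}",                # connection attempts before giving up
        "    option redispatch",                 # let a retry pick another server
        "    timeout connect 1s",                # assumption: fail fast on a stuck connect
    ]
    for i, host in enumerate(b_hosts, start=1):
        # One server entry per B machine; 'check' enables health checks.
        lines.append(f"    server b{i} {host}:{b_port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_tier_a_config(["10.0.1.1", "10.0.1.2", "10.0.1.3"]))
```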

This works, but I'm not sure how I should tune my configuration so that if
any server in the B array gets overloaded, the haproxies in array A avoid
that server. I'm thinking that I should use the "retries" setting in haproxy
A so that if it can't connect to the first selected B server it tries
another. But I'm not sure how I should configure haproxy B to make this
happen. If I set both the maxconn and backlog settings low enough in B, will
that cause it, and what is actually going on in terms of SYN, SYN+ACK, kernel
backlog queues and haproxy frontend queues?

I'm pretty sure I need to lab this out so that I can use wireshark to
really see what is going on, but the lab setup is non-trivial and I could
use some good theory on how this should work.

 - Garo





Re: Can I add one new server without stop the haproxy process?

2014-09-14 Thread Juho Mäkinen
You can't. The socket admin interface allows you to only disable existing
servers and then re-enable them, but you can't add a completely new server.
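
As a minimal sketch of what the admin socket does allow, the following sends a
"disable server" or "enable server" command over the stats socket. It assumes
the socket was configured with admin level; the socket path, backend and server
names in the usage comment are hypothetical.

```python
import socket


def server_command(action, backend, server):
    """Build a 'disable server' / 'enable server' command line."""
    if action not in ("disable", "enable"):
        raise ValueError("action must be 'disable' or 'enable'")
    return f"{action} server {backend}/{server}\n"


def send_command(socket_path, command):
    """Send one command to a haproxy stats socket and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # haproxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


# Usage (hypothetical path and names):
#   send_command("/var/run/haproxy.sock",
#                server_command("disable", "web", "web1"))
```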

However, you can reload haproxy in a way that minimises, and on some
platforms eliminates, dropped connections. Your init script probably
already does this with its reload command.
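
As far as I know, such a reload boils down to starting a new haproxy process
with the -sf option and the PIDs of the old processes, so that the old ones
stop listening and finish their existing connections. A minimal sketch of
building that command line follows; the paths are placeholders, and whether you
also need -D depends on whether your config already enables daemon mode.

```python
def reload_argv(config_path, pidfile_path, old_pids):
    """Build the argv for a soft reload of haproxy.

    -f  config file to load
    -p  pidfile the new process writes its PIDs to
    -sf PIDs of the old processes that should finish softly
    """
    argv = ["haproxy", "-f", config_path, "-p", pidfile_path]
    if old_pids:
        argv += ["-sf"] + [str(pid) for pid in old_pids]
    return argv


def read_pids(pidfile_text):
    """Parse pidfile contents (one PID per line) into a list of ints."""
    return [int(tok) for tok in pidfile_text.split()]


if __name__ == "__main__":
    # Placeholder paths and PIDs; pass the result to subprocess.run() to use it.
    print(reload_argv("/etc/haproxy/haproxy.cfg", "/var/run/haproxy.pid",
                      read_pids("1234\n1235\n")))
```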

For example, this blog post describes one approach:
https://medium.com/@Drew_Stokes/actual-zero-downtime-with-haproxy-18318578fde6
I've also heard that on modern Linux kernels even the iptables SYN trick
isn't needed, but I can't confirm that.

On Mon, Sep 15, 2014 at 3:02 AM, Zebra max...@unitedstack.com wrote:

 Hi,all

   How can I add one new server without stop the haproxy process?

   Looking forward to your reply!


 Thanks,
 Zebra



Re: HAProxy 1.5 incorrectly marks servers as DOWN

2014-09-09 Thread Juho Mäkinen
I increasingly suspect that this is a new bug introduced in the 1.5.x
branch. We don't have these issues on our 1.4 deployment, and the check.c
file seems to have undergone a lot of changes and refactoring between
1.4.18 and 1.5.4.

I started to look at how the check subsystem works, but the learning curve is
quite steep on this one. I'm hoping that Willy (your name seems to be all
over the changelog for check.c) might have some clues which I could then
pursue further. The biggest problem is that this bug seems to be
nondeterministic, but on the good side I can run modified haproxy binaries
in my environment, so I could trace this further.


Re: HAProxy 1.5 incorrectly marks servers as DOWN

2014-09-09 Thread Juho Mäkinen
Thanks, Pavlos, for your help. Fortunately (and embarrassingly for me) the
mistake was nowhere near haproxy: my haproxy configuration template system
had a bug which mixed up the backend server names and IP addresses. Because
of this, haproxy showed the wrong names for the servers which were actually
down. That threw me way off when I investigated the issue, and I was blind
to the actual problem, which was right in front of me the whole time.
:(

haproxy shows the server name in the log when it reports health check
statuses. Example:
Health check for server comet/comet-172.16.4.209:3500 succeeded, reason: Layer7 check passed, code: 200, info: OK, check duration: 2ms, status: 3/3 UP.

This could be improved by also showing the actual IP and port in the log.
Suggestion:
Health check for server comet/comet-172.16.4.209:3500 (172.16.4.209:3500) succeeded, reason: Layer7 check passed, code: 200, info: OK, check duration: 2ms, status: 3/3 UP.
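
Until the log carries the real address, a check like the following could have
caught my mistake earlier. It is only a sketch: it parses the server name from
a health check line and compares it with the address the server is really
configured with. It relies on my own naming convention (the name ends with
ip:port), and the function names are made up for this example.

```python
import re

# Matches the start of a haproxy health check log message.
HEALTH_RE = re.compile(
    r"Health check for server (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<result>succeeded|failed), reason: (?P<reason>[^,]+),"
)


def parse_health_check(line):
    """Return backend, server, result and reason, or None if no match."""
    match = HEALTH_RE.search(line)
    return match.groupdict() if match else None


def label_matches_address(server_label, real_address):
    """True if the server name ends with the address it really points to."""
    return server_label.endswith(real_address)


if __name__ == "__main__":
    line = ("Health check for server comet/comet-172.16.4.209:3500 succeeded, "
            "reason: Layer7 check passed, code: 200, info: OK, "
            "check duration: 2ms, status: 3/3 UP.")
    info = parse_health_check(line)
    print(info["server"], label_matches_address(info["server"], "172.16.4.209:3500"))
```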

 As a side question: the documentation was a bit unclear. If I have
  nbproc > 1 and I use the admin socket to turn a server's administrative
  status down or up, do I need to do it on a separate admin socket for each
  haproxy process, or can I just use one admin socket?
 

 You need a different socket. Each process can only be managed by a
 dedicated stats socket. There isn't any kind of aggregation where you
 issue a command to 1 stats socket and this command is pushed to all
 processes. Next release will address this kind of issues.


Thank you, good to know!
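
So in practice I need to repeat every command on every process. A minimal
sketch of that fan-out, assuming each process has been given its own stats
socket; the socket path template and the server name are hypothetical.

```python
import socket


def socket_paths(nbproc, template="/var/run/haproxy-{n}.sock"):
    """One stats socket path per haproxy process (path template is made up)."""
    return [template.format(n=n) for n in range(1, nbproc + 1)]


def send_command(path, command):
    """Send one command to a single stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(command.encode("ascii"))
        reply = b""
        while True:
            data = sock.recv(4096)
            if not data:
                break
            reply += data
    return reply.decode("ascii", errors="replace")


def broadcast(paths, command, send=send_command):
    """Issue the same command on every socket; return replies keyed by path."""
    return {path: send(path, command) for path in paths}


# Usage with nbproc 7 (hypothetical names):
#   broadcast(socket_paths(7), "disable server comet/comet-1\n")
```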

 - Garo


Re: HAProxy 1.5 incorrectly marks servers as DOWN

2014-09-08 Thread Juho Mäkinen
On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis pavlos.paris...@gmail.com
wrote:

 On 04/09/2014 08:55 AM, Juho Mäkinen wrote:
  I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
  problem where haproxy marks some backend servers as being DOWN with a
  message L4TOUT in 2000ms.

 Are you sure that you haven't reached any sort of limits on your backend
 servers? Number of open files, etc.


Quite sure because I can always use curl from the haproxy machine to the
backend machine and I get the response to the check command always without
any delays.

 Are you sure that backend servers return a response with HTTP status 200
 on healthchecks?


Yes. I also ran strace on a single haproxy process when the haproxy marked
multiple backends as being down. Here's an example output:

08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500), sin_addr=inet_addr("172.16.6.102")}, 16) = -1 EINPROGRESS (Operation now in progress)
08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
08:06:07.303167 sendto(30, "GET /check HTTP/1.0\r\n\r\n", 23, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
08:06:07.304522 recvfrom(30, "HTTP/1.1 200 OK\r\nX-Powered-By: Express\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Methods: GET, HEAD, POST, PUT, DELE"..., 16384, 0, NULL, NULL) = 503
08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
08:06:07.304666 close(30)   = 0

So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I
analysed around 20 different checks via strace against the same backend
(which is marked down by haproxy) and none of them took over one second.
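
The 1.9 ms figure comes from the strace timestamps above, from connect() at
08:06:07.302582 to the recvfrom() that returns the response at
08:06:07.304522. A small sketch of that arithmetic (the helper name is mine):

```python
from datetime import datetime


def elapsed_ms(start, end):
    """Milliseconds between two strace-style HH:MM:SS.microseconds stamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() * 1000.0


if __name__ == "__main__":
    # connect() -> recvfrom() with the HTTP 200 response
    print(round(elapsed_ms("08:06:07.302582", "08:06:07.304522"), 2))
    # connect() -> close(), the whole check
    print(round(elapsed_ms("08:06:07.302582", "08:06:07.304666"), 3))
```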

Here's an example from the haproxy log of what happens when the problem starts:

Sep  8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615] https comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 - - 1577/1577/3/1/0 0/0 GET /mobile HTTP/1.1
Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280] https~ comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200 40370 - - 2769/2769/6/0/0 0/0 GET /mobile HTTP/1.1
Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.090] https~ comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - - SH-- 2733/2733/7/0/0 0/0 GET /mobile HTTP/1.1
Sep  8 07:22:25 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 231ms, status: 2/3 UP.
Sep  8 07:22:25 localhost haproxy[24281]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 217ms, status: 2/3 UP.
Sep  8 07:22:25 localhost haproxy[24282]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 137ms, status: 2/3 UP.
Sep  8 07:22:25 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error, info: Connection reset by peer, check duration: 393ms, status: 2/3 UP.
Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661] https comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - - SD-- 2718/2718/5/0/0 0/0 GET /mobile HTTP/1.1
Sep  8 07:22:27 localhost haproxy[24278]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep  8 07:22:27 localhost haproxy[24279]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 0ms, status: 2/3 UP.
Sep  8 07:22:28 localhost haproxy[24280]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 2ms, status: 1/3 UP.
Sep  8 07:22:28 localhost haproxy[24284]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep  8 07:22:28 localhost haproxy[24282]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep  8 07:22:28 localhost haproxy[24283]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 2ms, status: 1/3 UP.
Sep  8 07:22:29 localhost haproxy[24278]: Health check for server comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4 connection problem, info: Connection refused, check duration: 1ms, status: 1/3 UP.
Sep  8 07:22:29 localhost haproxy[24279]: Health check

Re: Spam to this list?

2014-09-05 Thread Juho Mäkinen
On Fri, Sep 5, 2014 at 1:17 PM, Lukas Tribus luky...@hotmail.com wrote:

 Restricting the list to subscribed user (subonlypost) is not a good
 thing either


May I ask why this is not a good thing? I see no valid reason why
non-subscribed users should be allowed to post. The subscription process
already checks that the sender's email address is valid, so it should be a
decent way to remove most of the spam with very little configuration and
maintenance.

I think all the other lists I've subscribed to require you to be subscribed
before you can post.

 - Garo


HAProxy 1.5 incorrectly marks servers as DOWN

2014-09-04 Thread Juho Mäkinen
I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
problem where haproxy marks some backend servers as being DOWN with the
message "L4TOUT in 2000ms". Sometimes the message also has a star: "*
L4TOUT in 2000ms" (I couldn't find what the star means in the docs). Also,
the reported timeout varies between 2000ms and 2003ms.

This does not happen to every backend and it doesn't happen immediately.
After a restart every backend is green, and a few backends start to get
marked DOWN after about 30 minutes or so. I'm also running two instances on
two different servers and they both suffer from the same problem, but the
DOWN servers aren't the same. So server A might be marked DOWN on haproxy-1
and server B marked DOWN on haproxy-2 (or vice versa).

This seems to happen regardless of how much traffic I send through the
haproxies. I can always ssh into the haproxies and run curl against the
check URL and it always works, so this problem seems to be inside haproxy.

My haproxy config is rather long, so I copied it here:
http://koti.kapsi.fi/garo/nobackup/haproxy.cfg (I've sanitised it a bit,
but only the hostnames).

I've run the logging with verbose debugging to check if that gives any
clues about the health check issue, but the logs did not reveal anything to
my eye. I can gather a new log sample of the health checks, but the
haproxies are now receiving production traffic, so the log volume would be
too large to gather at the moment.

I've also captured some tcpdump traffic to the hosts marked DOWN and,
strangely, it seems that those hosts are still receiving queries. It could
be that one (or more) of the processes (I'm using nbproc 7 on my 8 core AWS
c3.2xlarge instance) hasn't marked the host down. Refreshing the stats URI
doesn't seem to indicate this, but it's hard to be sure, as the probability
of going through all seven different processes fast enough is low.
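
To put a rough number on that: if I assume each refresh of the stats page lands
on one of the seven processes uniformly at random and independently (a
simplification, since the kernel does not necessarily spread accepts evenly),
this is the classic coupon collector problem. A small sketch:

```python
from fractions import Fraction
from math import comb


def expected_refreshes(n):
    """Expected number of refreshes until all n processes were seen once.

    Coupon collector: n * (1 + 1/2 + ... + 1/n), assuming uniform picks.
    """
    return float(n * sum(Fraction(1, k) for k in range(1, n + 1)))


def prob_all_seen(n, refreshes):
    """Probability that `refreshes` uniform picks cover all n processes.

    Computed with inclusion-exclusion over the set of missed processes.
    """
    return sum(
        (-1) ** k * comb(n, k) * (1 - k / n) ** refreshes
        for k in range(n + 1)
    )


if __name__ == "__main__":
    print(expected_refreshes(7))          # average refreshes needed for 7 processes
    print(round(prob_all_seen(7, 7), 4))  # chance that 7 refreshes hit all 7
    print(round(prob_all_seen(7, 30), 4)) # chance after 30 refreshes
```

Under that assumption it takes about 18 refreshes on average to see every
process at least once, and seven refreshes in a row almost never cover all
seven.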

All clues and debugging ideas are greatly appreciated.