Re: Sudden queueing to backends

2020-03-14 Thread Willy Tarreau
Hi Sander,

On Tue, Mar 10, 2020 at 10:28:38AM +0100, Sander Klein wrote:
> Hi All,
> 
> I'm looking into a strange issue I'm having and I'm starting to think it is
> HAProxy related.
> 
> I have a setup with HAProxy serving multiple frontends and multiple backends
> which are Nginx servers with PHP-FPM. Sometimes, all of a sudden, the maxconn
> limit is hit and connections get queued to a backend server and I do not
> have a clue why. The backend is not overloaded, no traffic is flowing,
> Nginx/PHP-FPM picks up other connections, like the health checks from HAProxy
> or our monitoring server, PHP-FPM is not doing anything so there are no long
> running processes, Nginx is doing nothing, but it does not receive any new
> connections from HAProxy. Sometimes this lasts for 1 second, but it can also
> last for as much as 30 seconds.
> 
> It does not happen on all backend servers at once, just randomly at one
> server. So if I have defined a backend with 2 servers, it happens to only
> one at a time.

There are two possible explanations for this. The first one is that the
backend server sometimes slows down for whatever reason on the requests
it is processing, resulting in maxconn being reached on haproxy. This
commonly happens on applications where one request is much more expensive
than most others. I've seen some sites use a dedicated backend with a much
lower maxconn for a search button, for example, because they knew this
search request could take multiple seconds; if enough of them happen at
the same time, the maxconn is reached and extra requests get queued even
though they could have been handled.
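
For example, a minimal sketch of such a split might look like this (the
backend names, addresses and maxconn values below are only placeholders
to adapt to your setup):

    frontend fe_main
        # route the known-expensive search requests to a dedicated backend
        acl is_search path_beg /search
        use_backend bk_search if is_search
        default_backend bk_app

    backend bk_app
        server web1 10.0.0.1:80 check maxconn 50
        server web2 10.0.0.2:80 check maxconn 50

    backend bk_search
        # much lower limit, so expensive requests queue here instead of
        # exhausting the regular slots
        server web1 10.0.0.1:80 check maxconn 5
        server web2 10.0.0.2:80 check maxconn 5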

The other possibility would be that some requests produce huge responses
that take a lot of time to be consumed by the clients. And until one of
these responses finishes being delivered, no slot becomes available again
on the server.

> I'm running HAProxy 2.0.13 on Debian Buster in a VM. I've tested with 'no
> option http-use-htx' and with HAProxy 2.1.3, and I see the problem on both.
> Backends are Nginx with PHP-FPM, only using HTTP/1.1 over port 80, also
> VMs.
> 
> Today I disabled H2 on the frontends and now the problem seems to have
> disappeared. So it seems to be related to that part. But I'm not sure.
> How should I go about debugging this?

The best way to do it is to issue "show sess all", "show info", "show stats"
and "show fd" on the CLI when this happens. This will indicate whether there
are many connections still active to the server in trouble, and/or whether
many connections come from a single source address, for example.
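
For example, with the stats socket declared in your global section, these
commands can be sent with socat:

    echo "show sess all" | socat stdio /var/run/haproxy.stat
    echo "show fd" | socat stdio /var/run/haproxy.stat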

One thing that could happen with H2 is that it's trivial to send many
requests at once (100 by default), so if someone wants to have fun with
your site once they've found an expensive request, sending 100 of these
doesn't take more than a single TCP segment. Or, if you have heavy pages,
a single H2 connection can request many objects at once, and if the
network link to the client is lossy, these can take a while to deliver
over a single connection. And since H2 is much faster than H1 on reliable
networks but much worse on lossy ones, that could be an explanation. Do
you have lots of static files ? If so, it might make sense to deliver
them from dedicated servers that are not subject to the very low maxconn.
And if such objects are small, you could also enable some caching to
reduce the number of connections to the servers when fetching them.
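
If you want to experiment with this, two knobs may be worth a try: the
per-connection H2 stream limit, and the built-in cache for small static
objects. A rough sketch (the cache name, sizes and server address are
only placeholders):

    global
        # allow fewer concurrent requests per H2 connection (default: 100)
        tune.h2.max-concurrent-streams 20

    cache static_cache
        total-max-size 64        # total cache size, in megabytes
        max-object-size 102400   # largest cacheable object, in bytes
        max-age 60               # validity period, in seconds

    backend bk_static
        http-request cache-use static_cache
        http-response cache-store static_cache
        server web1 10.0.0.1:80 check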

> The config looks a bit like this (very redacted and very, very much
> shortened):
(...)

Looks pretty fine at first glance. I'm seeing "prefer-last-server"; it
*might* contribute to the problem if it's caused by sudden spikes, as it
encourages requests from the same client to go to the same server, but
that's not necessarily the case.
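
If you want to rule it out, a simple experiment is to comment it out in
the affected backend and compare (backend and server names below are
placeholders):

    backend bk_app
        balance roundrobin
        # option prefer-last-server   # temporarily disabled for the test
        server web1 10.0.0.1:80 check maxconn 50
        server web2 10.0.0.2:80 check maxconn 50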

Willy



Sudden queueing to backends

2020-03-10 Thread Sander Klein

Hi All,

I'm looking into a strange issue I'm having and I'm starting to think it 
is HAProxy related.


I have a setup with HAProxy serving multiple frontends and multiple 
backends which are Nginx servers with PHP-FPM. Sometimes, all of a 
sudden, the maxconn limit is hit and connections get queued to a backend 
server and I do not have a clue why. The backend is not overloaded, no 
traffic is flowing, Nginx/PHP-FPM picks up other connections, like the 
health checks from HAProxy or our monitoring server, PHP-FPM is not 
doing anything so there are no long running processes, Nginx is doing 
nothing, but it does not receive any new connections from HAProxy. 
Sometimes this lasts for 1 second, but it can also last for as much as 
30 seconds.


It does not happen on all backend servers at once, just randomly at one 
server. So if I have defined a backend with 2 servers, it happens to 
only one at a time.


I'm running HAProxy 2.0.13 on Debian Buster in a VM. I've tested with 
'no option http-use-htx' and with HAProxy 2.1.3, and I see the problem 
on both. Backends are Nginx with PHP-FPM, only using HTTP/1.1 over port 
80, also VMs.


Today I disabled H2 on the frontends and now the problem seems to have 
disappeared. So it seems to be related to that part. But I'm not sure. 
How should I go about debugging this?


The config looks a bit like this (very redacted and very, very much 
shortened):


global
master-worker
log /dev/log local0
log /dev/log local1 notice

daemon
user haproxy
group   haproxy
maxconn 32768
spread-checks   3
nbproc  1
nbthread 4
stats socket /var/run/haproxy.stat mode 666 level admin

# ciphers generator (https://mozilla.github.io/server-side-tls/ssl-config-generator/)

ssl-server-verify   none

ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
ssl-default-bind-ciphersuites TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256

ssl-default-bind-options no-tls-tickets ssl-min-ver TLSv1.2
ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256

ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

ssl-dh-param-file /etc/haproxy/ffdhe3072.pem

defaults
log global
timeout check   2s
timeout client  60s
timeout connect 10s
timeout http-keep-alive 4s
timeout http-request 15s
timeout queue   30s
timeout server  60s
timeout tarpit  120s

errorfile 400 /etc/haproxy/errors.loc/400.http
errorfile 403 /etc/haproxy/errors.loc/403.http
errorfile 500 /etc/haproxy/errors.loc/500.http
errorfile 502 /etc/haproxy/errors.loc/502.http
errorfile 503 /etc/haproxy/errors.loc/503.http
errorfile 504 /etc/haproxy/errors.loc/504.http

option http-use-htx

listen admin
bind x.x.x.x:8080 ssl crt /some/path/ strict-sni alpn h2,http/1.1
mode http
stats enable
stats uri   /haproxy?stats
stats auth  username:password
stats admin if TRUE
stats refresh 5s

frontend cluster1-in
# LB itself
bind x.x.x.x:80 transparent
bind aaa:aaa:aaa::a:80 transparent
bind x.x.x.x:443 transparent ssl crt /some/path/
bind aaa:aaa:aaa::a:443 transparent ssl crt /some/path/

# Mass hosting VIP
bind y.y.y.y:80 transparent
bind aab:aab:aab::a:80 transparent

bind y.y.y.y:443 transparent ssl crt /some/cert.pem crt /another/cert.pem crt /some/path/ strict-sni alpn h2,http/1.1
bind aab:aab:aab::a:443 transparent ssl crt /some/cert.pem crt /another/cert.pem crt /some/path/ strict-sni alpn h2,http/1.1


mode http
maxconn 8192

option httplog
option dontlog-normal
option http-ignore-probes
option forwardfor

capture request header Host len 64
capture request header User-Agent   len 16
capture request header Content-Length   len 10
capture request header Referer  len 256
capture response header Content-Length  len 10

#
# Some security stuff starts here
#
acl name src -f /some/file.txt

http-request deny if name
http-request del-header Proxy