Hi Kai,

On Wed, Feb 7, 2018 at 2:45 AM, Kai Timmer <k...@staffbase.com> wrote:
> Hi,
> sorry, I totally forgot them. Here are some logs from a time when
> everything was fine:
>
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:48861 [31/Jan/2018:00:59:57.728] https-in~ eyo.backend/eyobackend01 0/0/1/33/34 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:63720 [31/Jan/2018:00:59:57.766] https-in~ eyo.backend/eyobackend04 0/0/1/21/22 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:42904 [31/Jan/2018:00:59:57.831] https-in~ eyo.backend/eyobackend05 0/0/1/10/11 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:51460 [31/Jan/2018:00:59:57.896] https-in~ eyo.backend/eyobackend06 0/0/0/13/13 304 170 - - ---- 514/514/2/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:58558 [31/Jan/2018:00:59:57.901] https-in~ eyo.backend/eyobackend02 0/0/0/16/16 304 170 - - ---- 514/514/1/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:33093 [31/Jan/2018:00:59:57.881] https-in~ eyo.backend/eyobackend03 0/0/0/45/48 200 13264 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:11282 [31/Jan/2018:00:59:57.994] https-in~ eyo.backend/eyobackend05 0/0/0/6/6 202 582 - - ---- 514/514/2/0/0 0/0 "POST /api/receipts;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:3527 [31/Jan/2018:00:59:57.977] https-in~ eyo.backend/eyobackend01 0/0/0/26/26 304 170 - - ---- 514/514/1/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:11550 [31/Jan/2018:00:59:57.994] https-in~ eyo.backend/eyobackend04 0/0/1/9/10 202 582 - - ---- 514/514/0/0/0 0/0 "POST /api/receipts;wesessid=xxx HTTP/1.1"
>
> Here are some from when we had the problem:
>
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 0/0/1081/165/1247 304 170 - - ---- 5547/5547/173/31/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:55249 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend03 0/0/259/97/395 200 11670 - - ---- 5547/5547/172/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:19541 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend01 0/0/1081/165/1247 304 170 - - ---- 5547/5547/171/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:16853 [31/Jan/2018:09:59:51.728] https-in~ eyo.backend/eyobackend03 0/0/31/86/167 200 5624 - - ---- 5547/5547/170/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:54980 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend02 0/0/259/115/395 200 10583 - - ---- 5547/5547/169/28/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:16784 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend06 0/0/259/136/396 304 170 - - ---- 5547/5547/168/29/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:63618 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend02 0/0/259/106/396 200 9874 - - ---- 5547/5547/167/28/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:56552 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 0/0/1080/168/1248 304 170 - - ---- 5547/5547/166/30/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:51860 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend02 0/0/1080/171/1251 304 170 - - ---- 5547/5547/165/27/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:22137 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend03 0/0/1080/171/1251 304 170 - - ---- 5547/5547/164/25/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
>
> Also I wanted to share the limits on the haproxy process, which look fine
> to me, but you never know:
>
> Limit                     Soft Limit   Hard Limit   Units
> Max cpu time              unlimited    unlimited    seconds
> Max file size             unlimited    unlimited    bytes
> Max data size             unlimited    unlimited    bytes
> Max stack size            8388608      unlimited    bytes
> Max core file size        0            unlimited    bytes
> Max resident set          unlimited    unlimited    bytes
> Max processes             64000        64000        processes
> Max open files            65536        65536        files
> Max locked memory         65536        65536        bytes
> Max address space         unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks
> Max pending signals       31401        31401        signals
> Max msgqueue size         819200       819200       bytes
> Max nice priority         0            0
> Max realtime priority     0            0
> Max realtime timeout      unlimited    unlimited    us
>
> Regards,
> Kai
>
> Igor Cicimov <ig...@encompasscorporation.com> wrote on Mon, 5 Feb 2018 at 22:34:
>
>> On 6 Feb 2018 4:38 am, "Kai Timmer" <k...@staffbase.com> wrote:
>>
>> Hello,
>> I recently tried to update from v1.6.14 to v1.8.3 but experienced a lot
>> of problems with it.
>>
>> I do hope that I made a mistake in my configuration that works in 1.6 but
>> blows my system up in 1.8. So I'm going to describe my setup/workload
>> and hope that someone here might be able to help me.
>>
>> I run 2 haproxy servers in front of multiple Jetty-based backend systems.
>> The main source of data for those backend servers is a MongoDB. Both of
>> those haproxy servers receive traffic and use keepalived for failover
>> scenarios.
>>
>> Before I updated to v1.8 the average time to connect was about 0.5 ms.
>> After the update it kept like that when there was almost no traffic
>> (night times). But as soon as I got some traffic, the time to connect
>> went up to 500-600 ms.
>>
>> The problem occurred when we had around 2000 requests per minute.
>>
>> Restarting the haproxy process helped for a while, but after a few
>> minutes the problem was back again.
>>
>> One other thing I'd like to share is the fact that I see a peak in
>> MongoDB write lock times when updating haproxy. Probably a side effect
>> of something else going on, but still, it correlates with the new
>> haproxy version.
>>
>> As soon as I downgraded back to v1.6.14 the problem went away.
>>
>> This is the haproxy.cfg that I use (same config for 1.6.14 and 1.8.3):
>> ##########################################################################
>> global
>>     log 127.0.0.1 local1 info
>>     chroot /var/lib/haproxy
>>     pidfile /var/run/haproxy.pid
>>     ulimit-n 65536
>>     user haproxy
>>     group haproxy
>>     daemon
>>     ca-base /etc/ssl/certs
>>     crt-base /etc/ssl/private
>>     tune.ssl.default-dh-param 2048
>>     ssl-default-bind-options no-sslv3 no-tls-tickets
>>     ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:!DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
>>     stats socket /var/run/haproxy.sock level admin
>>     stats timeout 30s
>>     maxconn 22000
>>
>> defaults
>>     mode http
>>     log global
>>     option dontlognull
>>     option httplog
>>     option log-health-checks
>>     timeout connect 3600000ms
>>     timeout client 3600000ms
>>     timeout server 3600000ms
>>     timeout check 10s
>>     balance roundrobin
>>     maxconn 20000
>>     fullconn 2000
>>
>> resolvers ourdns
>>     nameserver dns1 10.13.138.250:53
>>     nameserver dns2 10.13.138.251:53
>>     resolve_retries 3
>>     timeout retry 1s
>>     hold valid 30s
>>
>> #---------------------------------------------------------------------
>> # frontend HTTP
>> #---------------------------------------------------------------------
>> frontend http-in
>>     option http-server-close
>>     bind *:80
>>     redirect scheme https code 301 if !{ ssl_fc }
>>
>> #---------------------------------------------------------------------
>> # frontend HTTPS
>> #---------------------------------------------------------------------
>> frontend https-in
>>     option forwardfor
>>     http-request set-header X-Forwarded-Port %[dst_port]
>>     http-request set-header X-Forwarded-Proto https
>>     rspadd Strict-Transport-Security:\ max-age=31536000;\ includeSubdomains;\ preload
>>     bind *:443 ssl no-sslv3 crt /etc/ssl/ourssl/ourcert.com.pem crt /etc/ssl/oldcert/oldcert.net.pem
>>
>>     acl host_backend hdr_dom(host) -i de backend
>>     use_backend our.backend if host_backend
>>
>>     # deny all requests to /metrics
>>     acl restricted_metrics path_beg,url_dec -i /metrics
>>     http-request deny if restricted_metrics
>>
>>     default_backend eyo.backend
>>     rspidel ^Server:.*$
>>     rspidel ^X-Powered-By:.*$
>>
>> #---------------------------------------------------------------------
>> # backend
>> #---------------------------------------------------------------------
>> backend our.backend
>>     description Backend (Java, Port 8080)
>>     option http-server-close
>>     option http-pretend-keepalive
>>     timeout server 2700000ms
>>
>>     option httpchk GET /health HTTP/1.0\r\nAuthorization:\ Basic\ foobar
>>     http-check expect status 200
>>     default-server inter 5s fall 2 rise 2
>>     server backend01 backend01.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend02 backend02.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend03 backend03.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend04 backend04.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend05 backend05.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend06 backend06.prod:8080 resolvers ourdns check slowstart 60s # regular server
>> ##########################################################################
>>
>> Any help on what might be going on here is highly appreciated.
>>
>> Regards,
>> Kai
>>
>> Maybe the logs would be useful here from the time of the "incident".

I would say this is more of a breakdown of the config and the logs than concrete help, but that's what I can offer atm :-)

    global   maxconn 22000  <- max. conn per process
    defaults maxconn 20000  <- max. conn per frontend
             fullconn 2000  <- when the dynamic conn. limits start to take effect

So this, together with the system limits you provided, looks sane. The global maxconn applies to the process, which means to the sum of the frontend connections. The defaults/frontend/listen maxconn applies per frontend, and the servers' maxconns are per server. However, fullconn is only useful if both minconn and maxconn are configured on the server lines. If you intended to use this value, you are missing those in your config.
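To make that concrete, here is a sketch of how the server lines would need to look for fullconn to have an effect. The minconn/maxconn numbers are purely illustrative; you would pick values based on what each Jetty instance can actually sustain:

```
backend our.backend
    fullconn 2000
    # With both minconn and maxconn set, each server's effective
    # connection limit scales dynamically between minconn and maxconn
    # depending on how close the backend's total load is to fullconn;
    # excess requests queue in haproxy instead of piling onto the servers.
    server backend01 backend01.prod:8080 resolvers ourdns check slowstart 60s minconn 50 maxconn 300
    server backend02 backend02.prod:8080 resolvers ourdns check slowstart 60s minconn 50 maxconn 300
```

Without minconn on the server line there is no dynamic limit at all, which is the situation in your current config.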
Now let's look at one of the log lines from the healthy time period:

Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:42904 [31/Jan/2018:00:59:57.831] https-in~ eyo.backend/eyobackend05 *0/0/1/10/11* 304 170 - - ---- *514/514/0/0/0* 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"

compared to a line from when the problem started:

Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 *0/0/1081/165/1247* 304 170 - - ---- *5547/5547/173/31/0* 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"

You have a more than 10 times higher number of connections on the frontend. What could possibly be causing such a high number of incoming connections? The timer block *0/0/1081/165/1247* means the request took 1.2 seconds in total to get its 304 reply, of which 1081 ms (the third timer, the connect time Tc) was spent just establishing the connection to the server, which is really high and matches the 500+ ms connect times you reported. The *5547/5547/173/31/0* block breaks down into its individual parameters:

    actconn  = 5547
    feconn   = 5547
    beconn   = 173
    srv_conn = 31
    retries  = 0

with their interpretation from the documentation:

  - "actconn" is the total number of concurrent connections on the process when
    the session was logged. It is useful to detect when some per-process system
    limits have been reached. For instance, if actconn is close to 512 when
    multiple connection errors occur, chances are high that the system limits
    the process to use a maximum of 1024 file descriptors and that all of them
    are used. See section 3 "Global parameters" to find how to tune the system.

  - "feconn" is the total number of concurrent connections on the frontend when
    the session was logged. It is useful to estimate the amount of resource
    required to sustain high loads, and to detect when the frontend's "maxconn"
    has been reached. Most often when this value increases by huge jumps, it is
    because there is congestion on the backend servers, but sometimes it can be
    caused by a denial of service attack.

  - "beconn" is the total number of concurrent connections handled by the
    backend when the session was logged.
    It includes the total number of concurrent connections active on servers
    as well as the number of connections pending in queues. It is useful to
    estimate the amount of additional servers needed to support high loads for
    a given application. Most often when this value increases by huge jumps,
    it is because there is congestion on the backend servers, but sometimes it
    can be caused by a denial of service attack.

  - "srv_conn" is the total number of concurrent connections still active on
    the server when the session was logged. It can never exceed the server's
    configured "maxconn" parameter. If this value is very often close or equal
    to the server's "maxconn", it means that traffic regulation is involved a
    lot, meaning that either the server's maxconn value is too low, or that
    there aren't enough servers to process the load with an optimal response
    time. When only one of the server's "srv_conn" is high, it usually means
    that this server has some trouble causing the connections to take longer
    to be processed than on other servers.

  - "retries" is the number of connection retries experienced by this session
    when trying to connect to the server. It must normally be zero, unless a
    server is being stopped at the same moment the connection was attempted.
    Frequent retries generally indicate either a network problem between
    haproxy and the server, or a misconfigured system backlog on the server
    preventing new connections from being queued. This field may optionally be
    prefixed with a '+' sign, indicating that the session has experienced a
    redispatch after the maximal retry count has been reached on the initial
    server. In this case, the server name appearing in the log is the one the
    connection was redispatched to, and not the first one, though both may
    sometimes be the same in case of hashing for instance. So as a general
    rule of thumb, when a '+' is present in front of the retry count, this
    count should not be attributed to the logged server.
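If you want to eyeball these fields across a whole log file rather than one line at a time, a small throwaway script can pull out the timer and counter blocks. This is just my own sketch (the regex assumes the default "option httplog" format shown above, and all names in it are made up, not part of HAProxy):

```python
import re

# Matches the timer block, status, bytes, the cookie/termination-state
# columns, and the connection-counter block of a default httplog line.
LOG_RE = re.compile(
    r'(?P<timers>\d+/\d+/\d+/\d+/\d+) '   # Tq/Tw/Tc/Tr/Tt
    r'(?P<status>\d+) (?P<bytes>\d+) '    # HTTP status, bytes read
    r'\S+ \S+ \S+ '                       # captured cookies + termination state
    r'(?P<conns>\d+/\d+/\d+/\d+/\d+)'     # actconn/feconn/beconn/srv_conn/retries
)

def parse_counters(line):
    """Return the interesting timer and counter fields of one log line."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    tq, tw, tc, tr, tt = (int(x) for x in m.group('timers').split('/'))
    act, fe, be, srv, retries = (int(x) for x in m.group('conns').split('/'))
    return {'Tc': tc, 'Tt': tt, 'status': int(m.group('status')),
            'actconn': act, 'feconn': fe, 'beconn': be,
            'srv_conn': srv, 'retries': retries}

# One of the problem-period lines from above:
line = ('Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 '
        '[31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 '
        '0/0/1081/165/1247 304 170 - - ---- 5547/5547/173/31/0 0/0 '
        '"GET /api/menu;wesessid=xxx HTTP/1.1"')
print(parse_counters(line))
```

Feeding your whole log through something like this and plotting Tc against actconn over time would show exactly when the connect times start tracking the connection pile-up.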
All of the above suggests your backend servers are simply congested. What are the expected limits for each of your backend servers? What's the max. number of simultaneous connections each of them can serve? Have they also been tuned at the system level to support those expectations? And what do you see in their logs when this kind of issue happens?

Otherwise, and that's why I call this more of an analysis than actual help, I have no idea why you would see this effect in 1.8 only and not in 1.6 too. It might be related to the way fullconn, minconn and maxconn (and their default values) are handled in the various versions, and it's just a matter of tuning them properly. Or maybe the difference is in how a (possible) firewall handles the connections to haproxy when you change to 1.8. Hope someone else can jump in and help.