Hi Kai,

On Wed, Feb 7, 2018 at 2:45 AM, Kai Timmer <k...@staffbase.com> wrote:
> Hi,
> sorry, I totally forgot them. Here are some logs from a time when
> everything was fine:
>
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:48861 [31/Jan/2018:00:59:57.728] https-in~ eyo.backend/eyobackend01 0/0/1/33/34 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:63720 [31/Jan/2018:00:59:57.766] https-in~ eyo.backend/eyobackend04 0/0/1/21/22 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:42904 [31/Jan/2018:00:59:57.831] https-in~ eyo.backend/eyobackend05 0/0/1/10/11 304 170 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:51460 [31/Jan/2018:00:59:57.896] https-in~ eyo.backend/eyobackend06 0/0/0/13/13 304 170 - - ---- 514/514/2/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:58558 [31/Jan/2018:00:59:57.901] https-in~ eyo.backend/eyobackend02 0/0/0/16/16 304 170 - - ---- 514/514/1/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:33093 [31/Jan/2018:00:59:57.881] https-in~ eyo.backend/eyobackend03 0/0/0/45/48 200 13264 - - ---- 514/514/0/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:11282 [31/Jan/2018:00:59:57.994] https-in~ eyo.backend/eyobackend05 0/0/0/6/6 202 582 - - ---- 514/514/2/0/0 0/0 "POST /api/receipts;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:3527 [31/Jan/2018:00:59:57.977] https-in~ eyo.backend/eyobackend01 0/0/0/26/26 304 170 - - ---- 514/514/1/0/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 00:59:58 haproxy01.prod haproxy[53367]: x.x.x.x:11550 [31/Jan/2018:00:59:57.994] https-in~ eyo.backend/eyobackend04 0/0/1/9/10 202 582 - - ---- 514/514/0/0/0 0/0 "POST /api/receipts;wesessid=xxx HTTP/1.1"
>
> Here are some from when we had the problem:
>
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 0/0/1081/165/1247 304 170 - - ---- 5547/5547/173/31/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:55249 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend03 0/0/259/97/395 200 11670 - - ---- 5547/5547/172/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:19541 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend01 0/0/1081/165/1247 304 170 - - ---- 5547/5547/171/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:16853 [31/Jan/2018:09:59:51.728] https-in~ eyo.backend/eyobackend03 0/0/31/86/167 200 5624 - - ---- 5547/5547/170/26/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:54980 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend02 0/0/259/115/395 200 10583 - - ---- 5547/5547/169/28/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:16784 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend06 0/0/259/136/396 304 170 - - ---- 5547/5547/168/29/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:63618 [31/Jan/2018:09:59:51.500] https-in~ eyo.backend/eyobackend02 0/0/259/106/396 200 9874 - - ---- 5547/5547/167/28/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:56552 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 0/0/1080/168/1248 304 170 - - ---- 5547/5547/166/30/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:51860 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend02 0/0/1080/171/1251 304 170 - - ---- 5547/5547/165/27/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
> Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:22137 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend03 0/0/1080/171/1251 304 170 - - ---- 5547/5547/164/25/0 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"
>
> Also I wanted to share the limits on the haproxy process, which look fine
> to me, but you never know:
>
> Limit                     Soft Limit   Hard Limit   Units
> Max cpu time              unlimited    unlimited    seconds
> Max file size             unlimited    unlimited    bytes
> Max data size             unlimited    unlimited    bytes
> Max stack size            8388608      unlimited    bytes
> Max core file size        0            unlimited    bytes
> Max resident set          unlimited    unlimited    bytes
> Max processes             64000        64000        processes
> Max open files            65536        65536        files
> Max locked memory         65536        65536        bytes
> Max address space         unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks
> Max pending signals       31401        31401        signals
> Max msgqueue size         819200       819200       bytes
> Max nice priority         0            0
> Max realtime priority     0            0
> Max realtime timeout      unlimited    unlimited    us
>
> Regards,
> Kai
>
> Igor Cicimov <ig...@encompasscorporation.com> wrote on Mon, 5 Feb 2018 at 22:34:
>
>> On 6 Feb 2018 4:38 am, "Kai Timmer" <k...@staffbase.com> wrote:
>>
>> Hello,
>> I recently tried to update from v1.6.14 to v1.8.3 but experienced a lot
>> of problems with it.
>>
>> I do hope that I made a mistake in my configuration that works in 1.6 but
>> blows my system up in 1.8. So I'm going to describe my setup/workload
>> and hope that someone here might be able to help me.
>>
>> I run 2 haproxy servers in front of multiple Jetty-based backend systems.
>> The main source of data for those backend servers is a MongoDB. Both of
>> those haproxy servers receive traffic and use keepalived for failover
>> scenarios.
>>
>> Before I updated to v1.8 the average time to connect was about 0.5 ms.
>> After the update it kept like that when there was almost no traffic
>> (night times). But as soon as I got some traffic, the time to connect
>> went up to 500-600 ms.
>>
>> The problem occurred when we had around 2000 requests per minute.
>>
>> Restarting the haproxy process helped for a while, but after a few
>> minutes the problem was back again.
>>
>> One other thing I'd like to share is the fact that I see a peak in
>> MongoDB write lock times when updating haproxy. Probably a side effect
>> of something else going on, but still, it correlates with the new
>> haproxy version.
>>
>> As soon as I downgraded back to v1.6.14 the problem went away.
>>
>> This is the haproxy.cfg that I use (same config for 1.6.14 and 1.8.3):
>> ##########################################################################
>> global
>>     log 127.0.0.1 local1 info
>>     chroot /var/lib/haproxy
>>     pidfile /var/run/haproxy.pid
>>     ulimit-n 65536
>>     user haproxy
>>     group haproxy
>>     daemon
>>     ca-base /etc/ssl/certs
>>     crt-base /etc/ssl/private
>>     tune.ssl.default-dh-param 2048
>>     ssl-default-bind-options no-sslv3 no-tls-tickets
>>     ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:!DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
>>     stats socket /var/run/haproxy.sock level admin
>>     stats timeout 30s
>>     maxconn 22000
>>
>> defaults
>>     mode http
>>     log global
>>     option dontlognull
>>     option httplog
>>     option log-health-checks
>>     timeout connect 3600000ms
>>     timeout client 3600000ms
>>     timeout server 3600000ms
>>     timeout check 10s
>>     balance roundrobin
>>     maxconn 20000
>>     fullconn 2000
>>
>> resolvers ourdns
>>     nameserver dns1 10.13.138.250:53
>>     nameserver dns2 10.13.138.251:53
>>     resolve_retries 3
>>     timeout retry 1s
>>     hold valid 30s
>>
>> #---------------------------------------------------------------------
>> # frontend HTTP
>> #---------------------------------------------------------------------
>> frontend http-in
>>     option http-server-close
>>     bind *:80
>>     redirect scheme https code 301 if !{ ssl_fc }
>>
>> #---------------------------------------------------------------------
>> # frontend HTTPS
>> #---------------------------------------------------------------------
>> frontend https-in
>>     option forwardfor
>>     http-request set-header X-Forwarded-Port %[dst_port]
>>     http-request set-header X-Forwarded-Proto https
>>     rspadd Strict-Transport-Security:\ max-age=31536000;\ includeSubdomains;\ preload
>>     bind *:443 ssl no-sslv3 crt /etc/ssl/ourssl/ourcert.com.pem crt /etc/ssl/oldcert/oldcert.net.pem
>>
>>     acl host_backend hdr_dom(host) -i de backend
>>     use_backend our.backend if host_backend
>>
>>     # deny all requests to /metrics
>>     acl restricted_metrics path_beg,url_dec -i /metrics
>>     http-request deny if restricted_metrics
>>
>>     default_backend eyo.backend
>>     rspidel ^Server:.*$
>>     rspidel ^X-Powered-By:.*$
>>
>> #---------------------------------------------------------------------
>> # backend
>> #---------------------------------------------------------------------
>> backend our.backend
>>     description Backend (Java, Port 8080)
>>     option http-server-close
>>     option http-pretend-keepalive
>>     timeout server 2700000ms
>>
>>     option httpchk GET /health HTTP/1.0\r\nAuthorization:\ Basic\ foobar
>>     http-check expect status 200
>>     default-server inter 5s fall 2 rise 2
>>     server backend01 backend01.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend02 backend02.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend03 backend03.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend04 backend04.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend05 backend05.prod:8080 resolvers ourdns check slowstart 60s # regular server
>>     server backend06 backend06.prod:8080 resolvers ourdns check slowstart 60s # regular server
>> ##########################################################################
>>
>> Any help on what might be going on here is highly appreciated.
>>
>> Regards,
>> Kai
>>
>> Maybe the logs would be useful here from the time of the "incident".

I would say this is more of a breakdown of the config and the logs than concrete help, but that's what I can offer atm :-)

    global   maxconn 22000  <- max. conn per process
    defaults maxconn 20000  <- max. conn per frontend
             fullconn 2000  <- when the dynamic conn. limits start to take effect

So this, together with the system limits you provided, looks sane. The global maxconn applies to the process, which means to the sum of the frontend connections. The defaults/frontend/listen maxconn applies per frontend, and the servers' maxconns are per server. However, fullconn is only useful if both minconn and maxconn are configured on the server lines. If you intended to use this value, you are missing those in your config.
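To make that concrete, here is a sketch of how the server lines would need to look for fullconn to have an effect. The minconn/maxconn numbers are purely illustrative; you would pick values based on what each Jetty instance can actually sustain:

```
backend our.backend
    fullconn 2000
    # With both minconn and maxconn set, each server's effective
    # connection limit scales dynamically between minconn and maxconn
    # depending on how close the backend's total load is to fullconn;
    # excess requests queue in haproxy instead of piling onto the servers.
    server backend01 backend01.prod:8080 resolvers ourdns check slowstart 60s minconn 50 maxconn 300
    server backend02 backend02.prod:8080 resolvers ourdns check slowstart 60s minconn 50 maxconn 300
```

Without minconn on the server line there is no dynamic limit at all, which is the situation in your current config.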
Now let's look at one of the log lines from the healthy time period:

Jan 31 00:59:57 haproxy01.prod haproxy[53367]: x.x.x.x:42904 [31/Jan/2018:00:59:57.831] https-in~ eyo.backend/eyobackend05 *0/0/1/10/11* 304 170 - - ---- *514/514/0/0/0* 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"

compared to a line from when the problem started:

Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 [31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 *0/0/1081/165/1247* 304 170 - - ---- *5547/5547/173/31/0* 0/0 "GET /api/menu;wesessid=xxx HTTP/1.1"

You have a more than 10 times higher number of connections on the frontend. What could possibly be causing such a high number of incoming connections? The timer block *0/0/1081/165/1247* means the request took 1.2 seconds in total to get its 304 reply, of which 1081 ms (the third timer, the connect time Tc) was spent just establishing the connection to the server, which is really high and matches the 500+ ms connect times you reported. The *5547/5547/173/31/0* block breaks down into its individual parameters:

    actconn  = 5547
    feconn   = 5547
    beconn   = 173
    srv_conn = 31
    retries  = 0

with their interpretation from the documentation:

  - "actconn" is the total number of concurrent connections on the process when
    the session was logged. It is useful to detect when some per-process system
    limits have been reached. For instance, if actconn is close to 512 when
    multiple connection errors occur, chances are high that the system limits
    the process to use a maximum of 1024 file descriptors and that all of them
    are used. See section 3 "Global parameters" to find how to tune the system.

  - "feconn" is the total number of concurrent connections on the frontend when
    the session was logged. It is useful to estimate the amount of resource
    required to sustain high loads, and to detect when the frontend's "maxconn"
    has been reached. Most often when this value increases by huge jumps, it is
    because there is congestion on the backend servers, but sometimes it can be
    caused by a denial of service attack.

  - "beconn" is the total number of concurrent connections handled by the
    backend when the session was logged.
    It includes the total number of concurrent connections active on servers
    as well as the number of connections pending in queues. It is useful to
    estimate the amount of additional servers needed to support high loads for
    a given application. Most often when this value increases by huge jumps,
    it is because there is congestion on the backend servers, but sometimes it
    can be caused by a denial of service attack.

  - "srv_conn" is the total number of concurrent connections still active on
    the server when the session was logged. It can never exceed the server's
    configured "maxconn" parameter. If this value is very often close or equal
    to the server's "maxconn", it means that traffic regulation is involved a
    lot, meaning that either the server's maxconn value is too low, or that
    there aren't enough servers to process the load with an optimal response
    time. When only one of the server's "srv_conn" is high, it usually means
    that this server has some trouble causing the connections to take longer
    to be processed than on other servers.

  - "retries" is the number of connection retries experienced by this session
    when trying to connect to the server. It must normally be zero, unless a
    server is being stopped at the same moment the connection was attempted.
    Frequent retries generally indicate either a network problem between
    haproxy and the server, or a misconfigured system backlog on the server
    preventing new connections from being queued. This field may optionally be
    prefixed with a '+' sign, indicating that the session has experienced a
    redispatch after the maximal retry count has been reached on the initial
    server. In this case, the server name appearing in the log is the one the
    connection was redispatched to, and not the first one, though both may
    sometimes be the same in case of hashing for instance. So as a general
    rule of thumb, when a '+' is present in front of the retry count, this
    count should not be attributed to the logged server.
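If you want to eyeball these fields across a whole log file rather than one line at a time, a small throwaway script can pull out the timer and counter blocks. This is just my own sketch (the regex assumes the default "option httplog" format shown above, and all names in it are made up, not part of HAProxy):

```python
import re

# Matches the timer block, status, bytes, the cookie/termination-state
# columns, and the connection-counter block of a default httplog line.
LOG_RE = re.compile(
    r'(?P<timers>\d+/\d+/\d+/\d+/\d+) '   # Tq/Tw/Tc/Tr/Tt
    r'(?P<status>\d+) (?P<bytes>\d+) '    # HTTP status, bytes read
    r'\S+ \S+ \S+ '                       # captured cookies + termination state
    r'(?P<conns>\d+/\d+/\d+/\d+/\d+)'     # actconn/feconn/beconn/srv_conn/retries
)

def parse_counters(line):
    """Return the interesting timer and counter fields of one log line."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    tq, tw, tc, tr, tt = (int(x) for x in m.group('timers').split('/'))
    act, fe, be, srv, retries = (int(x) for x in m.group('conns').split('/'))
    return {'Tc': tc, 'Tt': tt, 'status': int(m.group('status')),
            'actconn': act, 'feconn': fe, 'beconn': be,
            'srv_conn': srv, 'retries': retries}

# One of the problem-period lines from above:
line = ('Jan 31 09:59:51 haproxy01.prod haproxy[53367]: x.x.x.x:4916 '
        '[31/Jan/2018:09:59:50.648] https-in~ eyo.backend/eyobackend05 '
        '0/0/1081/165/1247 304 170 - - ---- 5547/5547/173/31/0 0/0 '
        '"GET /api/menu;wesessid=xxx HTTP/1.1"')
print(parse_counters(line))
```

Feeding your whole log through something like this and plotting Tc against actconn over time would show exactly when the connect times start tracking the connection pile-up.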
All of the above suggests your backend servers are simply congested. What are the expected limits for each of your backend servers? What's the max. number of simultaneous connections each of them can serve? Have they also been tuned at the system level to support those expectations? And what do you see in their logs when this kind of issue happens?

Otherwise, and that's why I call this more of an analysis than actual help, I have no idea why you would see this effect in 1.8 only and not in 1.6 too. It might be related to the way fullconn, minconn and maxconn (and their default values) are handled in the various versions, and it's just a matter of tuning them properly. Or maybe the difference is in how a (possible) firewall handles the connections to haproxy when you change to 1.8. Hope someone else can jump in and help.