Hi,
We have a haproxy setup consisting of a pair of nodes with keepalived, which 
use the proxy protocol to pass requests (round-robin) to a second pair of 
haproxy nodes. The first pair mainly terminates SSL and serves as a highly 
available entry point, while the second pair does all the logic around routing 
to the correct application.
Yesterday at 21:30:35 CET, the active node of the first pair suddenly 
accumulated thousands of orphaned sockets. Historically there have been around 
300 orphans at any given time on this machine. A few seconds later, CPU usage 
shot up to 100%, from around 30%. At that point, most requests started timing 
out. Just before the incident we were handling 400-450 req/s. The second pair 
saw no increase in load during these problems.
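For reference (and in case someone wants to compare), the orphan count and the 
limit it is checked against can be read with stock CentOS 7 tools like these; 
this is only an illustration, nothing specific to our setup:

    cat /proc/net/sockstat            # "TCP: inuse ... orphan ... tw ..."
    ss -s                             # summary, including the orphaned count
    sysctl net.ipv4.tcp_max_orphans   # limit behind "too many orphaned sockets"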
As far as we can tell so far (investigations are ongoing), there was no change 
anywhere in the environment for several hours preceding this sudden activity. 
We are lucky enough that a large part of the incoming traffic is diagnostic 
data, which we can live without for a while, and when we shut down that 
application the situation returned to normal. Before doing that, we tried 
failing over to the other node, restarting haproxy, and a full reboot of the 
node. None of these helped; the problem returned in less than a minute.

We had a lot of "kernel: [ 1342.944691] TCP: too many orphaned sockets" in our 
logs, as well as a few of these:
kernel: [2352025.865855] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
kernel: [2352068.014861] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies.  Check SNMP counters.
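For completeness, the counters that kernel message points at, and the related 
limits, can be read with standard tools along these lines (again just an 
illustration, not our exact commands):

    nstat -az TcpExtSyncookiesSent TcpExtSyncookiesRecv TcpExtSyncookiesFailed
    nstat -az TcpExtListenOverflows TcpExtListenDrops
    sysctl net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies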

According to the Arbor appliance our hosting provider runs, however, there was 
no SYN attack. The servers are fronted by Akamai, which by default retries 
twice on connection failures or timeouts, so this may have amplified our 
problems. We will be turning that feature off.

So, some questions:
1. Does it seem reasonable that the orphaned sockets could cause this 
behaviour, or are they just a symptom?
2. What causes the orphaned sockets? Could haproxy start misbehaving when it is 
starved for resources?
3. We were speculating that it could somehow be related to keepalives not being 
terminated properly; is there any merit to that thought?

The nodes are virtual machines running CentOS 7 and haproxy 1.6.7, single core 
with 2 GB of memory. As I mentioned, they have been handling this load without 
any hiccups for several months, but we are still considering increasing the 
specs. Would a few more cores have helped, or would it just have taken a few 
more seconds to chew through them as well?
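If we do add cores, our (untested) reading of the 1.6 docs is that haproxy 
would only use them with something along these lines in the global section; 
corrections welcome if this is the wrong direction:

global
    nbproc 2
    # pin each process to its own core
    cpu-map 1 0
    cpu-map 2 1
    # note: with more than one process, the stats socket and counters
    # are per-process

(Just a sketch, not something we are running today.)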

Below is the haproxy configuration.

Thanks in advance for any insights! Best regards,
Carl Pettersson

global
    log 127.0.0.1 local0
    maxconn 4000
    log-tag haproxy-gateway
    server-state-file /var/lib/haproxy/gateway.state
    stats socket /var/run/haproxy-gateway.sock
    stats timeout 2m
    user haproxy
    group haproxy

    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3
    ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK
    tune.ssl.cachesize 100000
    tune.ssl.lifetime 600
    tune.ssl.maxrecord 1460

defaults
    log global
    mode http
    option httplog
    option dontlognull
    load-server-state-from-file global
    option http-server-close
    option forwardfor
    # Redispatch if backend is down
    option redispatch
    retries 3
    # timeouts
    timeout http-request 5s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    # Set X-Request-Id on all incoming requests. Magic format taken from docs
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-Request-ID
    # Set log format to the same as default, adding the request id on the end
    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r\ %ID

listen http
    bind 10.0.1.21:80
    default_backend backend_pair
    capture request header Host len 64
    capture request header X-Forwarded-For len 64
    capture request header User-Agent len 200
    capture request header True-Client-IP len 32
listen https
    bind 10.0.1.21:443 ssl crt /etc/haproxy/ssl/gateway/
    http-request set-header X-Forwarded-Proto https
    default_backend backend_pair
    capture request header Host len 64
    capture request header X-Forwarded-For len 64
    capture request header User-Agent len 200
    capture request header True-Client-IP len 32
backend backend_pair
    option tcp-check
    server 10_0_1_24_8082 10.0.1.24:8082 check send-proxy
    server 10_0_1_27_8082 10.0.1.27:8082 check send-proxy
