Greetings,

Similar to this user:
https://www.mail-archive.com/[email protected]/msg27698.html

I recently upgraded our proxy VMs from Debian 8/Jessie (kernel 3.16.0) to Devuan 2/ASCII (Debian Stretch without systemd, kernel 4.9.0). I know running haproxy on a VM is often discouraged, but we have done so for years with great success. Right now I'm stress testing the new build on ONE proxy VM doing 861 req/s and 2.26 Gbps of outbound traffic (70k pps in, 90k pps out), with quite a bit of capacity to spare. It can be done with some tweaking, but nothing much beyond what would have to be done on hardware anyway.

Our VM hosts have one Xeon E5-2687W v4 processor (12 cores, 24 logical), 256 GB RAM, and dual Intel 10G adapters: one for external traffic, one for internal. The proxy VMs are configured with 8 cores, 8 GB RAM, and two virtio adapters, both with multi-queue set to 2 (which gives me two receive queues per adapter). We're running Proxmox 5.

haproxy is a custom build of 1.8.14, built with:

    make TARGET=linux2628 USE_PCRE=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_ZLIB=1 USE_FUTEX=1

I have each receive queue pinned to a different processor (0-3). haproxy is configured with nbproc 4 and pinned to processors 4-7. iptables with connection tracking is enabled (I couldn't see ANY performance benefit from using a stateless firewall instead). I can get near wire speed between VM hosts as well as between VM guests on the local network.

The problem we saw right away was that with any amount of traffic flowing through these new proxy builds, single-stream throughput was severely reduced. Without load, I could pull down a file at 200+ Mbps with a single stream. With load, that would drop to 10-15 Mbps, if that. This meant that 1080p videos would endlessly buffer and large images would load like they did in the 90s on dial-up. Not good. After a bunch of trial and error, I narrowed the issue down to the network layer itself.
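For completeness, the nbproc pinning looks roughly like this in haproxy.cfg. This is a sketch rather than a copy of our actual config, using 1.8's cpu-map directive to map process numbers 1-4 onto CPUs 4-7:

```
global
    nbproc 4
    cpu-map 1 4
    cpu-map 2 5
    cpu-map 3 6
    cpu-map 4 7
```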
The only thing I could find that may have pointed to what was going on was this:

    # netstat -s | grep buffer
    16889843 packets pruned from receive queue because of socket buffer overrun
    7626 packets dropped from out-of-order queue because of socket buffer overrun
    3912652 packets collapsed in receive queue due to low socket buffer

These values were incrementing a lot faster than on the old build. My research on this pointed to the r/wmem settings, which I've never adjusted before because most recommendations seem to be to leave them alone, and I could never determine that we actually needed to adjust them. Here are the sysctl settings we've been using for years:

    vm.swappiness=10
    net.ipv4.tcp_tw_reuse=1
    net.ipv4.ip_local_port_range=1024 65535
    net.core.somaxconn=10240
    net.core.netdev_max_backlog=10240
    net.ipv4.conf.all.rp_filter=1
    net.ipv4.tcp_max_syn_backlog=10240
    net.ipv4.tcp_synack_retries=3
    net.ipv4.tcp_syncookies=1
    net.netfilter.nf_conntrack_max=4194304

After doing a TON of research, starting from here:

http://fasterdata.es.net/host-tuning/linux/
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations

I settled on the following:

    # allow testing with buffers up to 128MB
    net.core.rmem_max = 134217728
    net.core.wmem_max = 134217728
    # increase Linux autotuning TCP buffer limit to 64MB
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 65536 67108864

These are recommended "for a host with a 10G NIC optimized for network paths up to 200ms RTT, and for friendliness to single and parallel stream tools...", which seemed to fit our situation. However, these settings didn't make any difference.

The next thing I tried was adjusting net.ipv4.tcp_mem. This is the one setting almost everyone says to leave alone, because the kernel defaults are good enough.
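One detail worth keeping in mind when reading tcp_mem: unlike tcp_rmem/tcp_wmem, it is denominated in pages, not bytes, so the numbers look smaller than they are. A quick back-of-the-envelope conversion, assuming the usual 4 KiB page size (the default and replacement values here are the ones discussed below):

```shell
# tcp_mem is in pages, not bytes. Assuming 4 KiB pages, convert the
# Stretch defaults (94401 125868 188802) and my value (16777216):
page=4096                                                    # getconf PAGESIZE on most x86 systems
echo "default pressure: $(( 125868 * page / 1048576 )) MiB"  # ~491 MiB
echo "default max:      $(( 188802 * page / 1048576 )) MiB"  # ~737 MiB
echo "new max:          $(( 16777216 * page / 1073741824 )) GiB"  # 64 GiB
```

In other words, the stock hard limit for all TCP socket memory combined is well under 1 GB, while 16777216 pages works out to 64 GiB, which on an 8 GB VM is effectively uncapped. I'm not claiming that's the whole story, just that the units make the defaults smaller than they first appear.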
Well, adjusting this one setting is what seems to have fixed the issue for us. Here are the default values the kernel sets on Devuan / Stretch:

    net.ipv4.tcp_mem = 94401 125868 188802

On Jessie:

    net.ipv4.tcp_mem = 92394 123194 184788

And here is what I set it to:

    net.ipv4.tcp_mem = 16777216 16777216 16777216

I can reproduce the low-throughput issue by changing tcp_mem back to the defaults. I'm not even sure the other settings are necessary (still testing that).

Can anyone shed some light on why adjusting tcp_mem fixed this? Are the other settings needed / appropriate? I'm not fond of deploying anything into production with settings I've copied from the internet without fully understanding what I'm doing. Most posts on this just quote the kernel docs verbatim, and since almost everyone says "do NOT adjust tcp_mem", there isn't much documentation out there on when you SHOULD adjust it.

All I know is that with tcp_mem changed, I can run an iperf test and get over 1 Gbps even with site traffic above 2 Gbps (we have 3 Gbps available). File downloads are now snappier and get up to speed faster than before.

If anyone has some input on this, I'd really appreciate it. I'd love to understand these settings better and what the ramifications of changing them are. Thanks!

--
Brendon Colby
Senior DevOps Engineer
Newgrounds.com

