Hi Maxime,

On Thu, Jul 03, 2014 at 07:00:52PM +0200, Maxime Brugidou wrote:
> Hi all,
>
> We have been trying to get better performance from our simple haproxy
> setup since we encountered issues when getting over 15k sessions/sec.
>
> Our simple test setup to reproduce the issue:
> * 1 server with 2 Xeon CPUs (E5, core i7) with 6 cores each and
>   hyperthreading on (24 cores total)
> * 1 Intel I350 Gigabit ethernet interface
> * CentOS 6.5, kernel 2.6.32-431.11.2.el6.centos.plus.x86_64
> * 1 haproxy 1.5.1 recompiled (1.4 from the distribution had the same
>   issues anyway)
> * 1 nginx on the same host with one simple worker serving a very
>   small static file (< 10 bytes)
> * 4 client servers running siege with a simple HTTP GET / at
>   concurrency 500 each, in the same VLAN
>
> We only get a session rate of ~16k per second, which seems very low.
> The interesting symptom is that we get 100% CPU for the haproxy
> process with a very low connection rate.
>
> The CPU usage is split this way: 29.3%us, 46.3%sy, 0.0%ni, 0.0%id,
> 0.0%wa, 0.0%hi, 24.3%si, 0.0%st
>
> The very high si and sy seem strange; we focused on soft IRQ config
> after that, to no avail.
I'm not surprised. Without pinning, the system will move haproxy to the
same CPU as the one delivering the interrupts, leaving very little room
for the two to coexist. With such workloads it's normal to have high CPU
usage in softirq, as packet processing takes a significant amount of
time, more so with some drivers than others. The igb driver (the i350 is
igb, right?) is not among the cheapest per packet, so I'm not surprised.

> We tried all the tweaks we could think about:
> * playing with sysctl (we don't have conntrack, we don't have
>   iptables, tcp settings seem OK)

OK, great.

> * we tried recompiling the igb driver to get all the options, and
>   increasing the rx-usecs setting with ethtool; this reduces the
>   number of hard IRQs (but probably increases latency) but does not
>   impact CPU usage

Past some point, you'll get the driver to enter polling mode and reach
100% CPU. You can also try to increase the number of Tx descriptors to
avoid seeing the queue start/stop too often (which is expensive as
well). But NICs tend to be less efficient with many Tx descriptors.

> * we looked at the process with strace/perf without much success in
>   getting any interesting info
>
> The most interesting part:
> * we did the trick of setting the smp_affinity of our eth0 interface
>   to cpu 0 and pinning haproxy on cpu 1 with taskset, BUT the soft
>   interrupt load stays on cpu 1 (with haproxy). This is not what is
>   documented for the linux kernel; we dug into the RPS and RFS
>   network features but they are not activated in CentOS 6 by default,
>   so they should not interfere.

I'm pretty sure that CPU 1 is the first thread of the first core of the
second socket. So you're in the absolute worst situation, where all the
traffic has to transit through memory and cache lines are doing
ping-pong between the two sockets. The best thing to do is to totally
stop using the second socket for now.
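For reference, the pinning could look something like the following. This
is only a sketch: the interface name (eth0), the IRQ numbers (24, 25)
and the choice of CPU 2 are assumptions that vary per machine, so check
/proc/interrupts and the topology files under /sys for yours:

```shell
# List the IRQs used by eth0 (igb typically has one per queue).
grep eth0 /proc/interrupts

# Pin all eth0 queue IRQs to CPU 0 (affinity mask 1); repeat for each
# IRQ number found above. 24 and 25 are made-up example numbers.
echo 1 > /proc/irq/24/smp_affinity
echo 1 > /proc/irq/25/smp_affinity

# Pin haproxy to CPU 2 -- assuming CPU 2 is another core on the SAME
# socket as CPU 0, which is what the /sys topology files tell you.
taskset -cp 2 $(pidof haproxy)

# Larger Tx ring to reduce queue start/stop cycles (driver-dependent;
# see the caveat above about many Tx descriptors).
ethtool -G eth0 tx 1024
```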
Please just verify with this:

    grep '' /sys/devices/system/cpu/cpu*/topology/phy*

Second, disable hyperthreading to ensure you're not running haproxy on
one thread of a core and the network on the other thread of the same
core. You'll be able to re-enable it once you figure out what the
problem is, but there's no reason to waste time on these parasites for
now.

> I think that we have been looking way too deep into the problem and
> the solution must be right in front of us.
>
> Does anyone have ideas?

Could you check your network card's traffic (ideally on the switch) in
terms of bit rate and packet rate in each direction? At 15k hits/s it
depends a lot on the object size, especially when running on gigabit
NICs, which are easily overloaded.

How many concurrent connections are you running with during your tests?
It's easy to reach 100% CPU with no aggregated work if you have too few
concurrent requests, simply because you have a lot of small idle periods
which are not usable for anything else.

A few comments below:

> $ cat /etc/haproxy/haproxy.conf
> global
>     log /dev/log local1 notice
>     maxconn 8192
>     user haproxy
>     group haproxy
>
> defaults
>     log global

Logging to /dev/log generally means a lot of losses (very tiny network
buffers on UNIX sockets). So it's likely that haproxy is also sending a
lot of alerts that are dropped because you daemonized. That can waste a
significant amount of CPU. Also, I tend to say that logging alone costs
about 20% of the request rate. But we do better than you on a Pentium-M
1.8 GHz with logs enabled. You should try disabling logs first.

>     mode http
>     retries 0
>     timeout client 5s
>     timeout connect 1s
>     timeout server 5s
>     option dontlognull
>     option http-server-close
>     option httpchk GET /admin/status
>     option httplog
>     option redispatch
>     option splice-response

Splicing will be of no help for small objects. You can try to see if
you're network-bound by adding "option tcp-smart-connect". It saves one
packet during the connection setup.
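For the logging test, a hedged example of what the global section could
look like (the UDP address is an assumption; adjust to your setup):

```
global
    # For the benchmark, simply removing the "log" line here (and the
    # "log global" line in defaults) is the cleanest way to rule
    # logging out. Otherwise, UDP to a local syslog listener tends to
    # lose less than the tiny /dev/log UNIX socket buffer:
    log 127.0.0.1:514 local1 notice
    maxconn 8192
    user haproxy
    group haproxy
```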
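If you can't easily read counters on the switch, a rough packet rate can
be measured on the host itself. The helper below is a hypothetical
sketch (not an haproxy tool): it just samples the rx/tx packet counters
in /proc/net/dev twice, one second apart:

```shell
# pps: print rx/tx packet rates for an interface (default eth0) over 1s.
pps() {
    ifc=${1:-eth0}
    # Fields after the interface name: rx bytes, packets, ... (packets
    # is field 3 once the leading spaces are stripped); tx packets is
    # field 11.
    s1=$(awk -F'[: ]+' -v i="$ifc" '{ sub(/^ +/, "") } $1 == i { print $3, $11 }' /proc/net/dev)
    sleep 1
    s2=$(awk -F'[: ]+' -v i="$ifc" '{ sub(/^ +/, "") } $1 == i { print $3, $11 }' /proc/net/dev)
    set -- $s1 $s2
    echo "rx $(( $3 - $1 )) pkt/s, tx $(( $4 - $2 )) pkt/s"
}

pps lo
```

Run it on eth0 while siege is hammering the box; comparing the packet
rate against what a gigabit link can carry for your object size tells
you quickly whether you're NIC-bound or CPU-bound.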
If the performance rises, then you're limited by something on the
network path (stack included).

>     balance roundrobin
>
> frontend main
>     bind 0.0.0.0:80
>     maxconn 8192
>     use_backend test
>
> backend test
>     server nginx 127.0.0.1:8081 maxconn 8192

Regards,
Willy

