Hi Robert,

On Mon, May 18, 2015 at 05:17:29PM -0700, Robert Brooks wrote:
> Hi,
>
> I am looking to get more performance out of a host running haproxy-1.5.12.
>
> The host is running Ubuntu 12.04 (kernel 3.13.0-46-generic) with haproxy
> binaries from Vince Bernat's PPA (haproxy_1.5.12-1ppa1~precise_amd64.deb).
>
> The hardware is an HP DL360, with a 4-core Intel Xeon E5-2609 CPU @ 2.40GHz
> and 8GB RAM.
>
> Hatop shows roughly 7000 requests/sec, and top shows
>
> Cpu0 : 20.9%us, 29.4%sy,  0.0%ni, 49.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1 :  0.0%us,  0.0%sy,  0.0%ni, 84.4%id,  0.0%wa,  0.0%hi, 15.6%si,  0.0%st
> Cpu2 :  2.3%us,  1.3%sy,  0.0%ni, 85.8%id,  0.0%wa,  0.0%hi, 10.6%si,  0.0%st
> Cpu3 :  0.0%us,  0.0%sy,  0.0%ni, 87.1%id,  0.0%wa,  0.0%hi, 12.9%si,  0.0%st
> Mem:  8134648k total,  1411080k used,  6723568k free,   278336k buffers
> Swap: 8352252k total,        0k used,  8352252k free,   421120k cached
>
> HAProxy is a single process mapped to cpu0; NIC interrupts are on cpus 1-3.
That's visible :-)

> The HTTP responses are small (0-2kB), as the biggest source of traffic is
> RTB traffic, which generates a lot of small, quick responses. Keep-alives
> are in use and "option http-keep-alive" is configured, which has reduced the
> system load, as the cost of rapidly initiating TCP sessions was having an
> impact on performance.
>
> I have tried enabling TCP splicing. I am not sure if it would be helpful in
> this case; however, haproxy seems reluctant to use it even with "option
> splice-request" and "option splice-response" in the defaults section of the
> config. I guess this is possibly due to "[OPTIM] stream_sock: don't use
> splice on too small payloads"?

It's useless at such sizes. As a rule of thumb, splicing will not be used at
all for anything that completely fits in a buffer, since haproxy tries to read
a whole response at once and needs to parse the HTTP headers anyway. In
general, splice() system calls break even against plain data copies somewhere
around 4-16kB, depending on the architecture, memory speed, setup and many
other factors. I find splice very useful on moderately large objects (>64kB)
at high bit rates (10-60 Gbps). On most gigabit NICs it's inefficient, as most
such NICs have limited capabilities that make them less efficient at
performing multiple DMA accesses for a single packet (e.g. no hardware
scatter-gather to enable GSO/TSO).

> strace shows the following breakdown of system call time:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  19.83    0.048752           1     36092      4709 recvfrom
>  18.02    0.044313           2     18898         8 sendto
>  10.79    0.026518           3      7701      7701 connect
>  10.20    0.025085           1     19320           epoll_ctl
>  10.07    0.024772           1     23321           setsockopt
>   8.55    0.021028           2     11419       140 accept4
>   8.34    0.020505           2     10174           close
>   5.98    0.014711           2      7701           socket
>   3.52    0.008648           2      4768       257 shutdown
>   3.30    0.008118           1      7701           fcntl
>   0.91    0.002246           1      1588           brk
>   0.47    0.001159           8       149           epoll_wait
>   0.01    0.000023           1        23           getsockopt
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.245878                148855     12815 total
>
> Happy to share a redacted config. Are there any general recommendations for
> a workload like this? Do the numbers look sane?

Nothing apparently wrong here, though the load could look a bit high for that
traffic (a single-core Atom N270 can do that at 100% CPU). You should run a
benchmark to find your system's limits. CPU load in general is not linear,
since most operations get aggregated when the connection rate increases.

Still, you're running at approx 50% CPU with a lot of softirq, so I suspect
that conntrack is enabled on the system with default settings and not tuned
for performance, which may explain why the numbers look a bit high. But
nothing to worry about.

Willy
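The splice rule of thumb in the reply can be expressed as a small decision function. This is a hypothetical sketch, not HAProxy's actual logic. It assumes a 16384-byte buffer (HAProxy's default tune.bufsize) and uses the upper end of the quoted 4-16kB break-even range. It treats only the bytes left after the first buffer has been read and parsed as candidates for splicing.

```python
# Hypothetical model of the splice rule of thumb; NOT HAProxy's actual logic.
BUFFER_SIZE = 16384        # assumption: HAProxy's default tune.bufsize
SPLICE_BREAK_EVEN = 16384  # assumption: upper end of the 4-16kB break-even range


def splice_worthwhile(payload_bytes: int,
                      buffer_size: int = BUFFER_SIZE,
                      break_even: int = SPLICE_BREAK_EVEN) -> bool:
    """Guess whether splicing part of a response could beat plain copying."""
    # The first buffer is always read normally so the HTTP headers can be
    # parsed; only what does not fit in it is a candidate for splice().
    remainder = payload_bytes - buffer_size
    if remainder <= 0:
        return False  # the whole response fits in one buffer
    # Below the break-even point, splice() syscall overhead outweighs copying.
    return remainder >= break_even


if __name__ == "__main__":
    for size in (2048, 20 * 1024, 64 * 1024):
        print(size, splice_worthwhile(size))
    # 2048 False   (typical RTB response)
    # 20480 False  (only 4kB left after the first buffer)
    # 65536 True
```

Under these assumptions, none of the 0-2kB RTB responses would ever be spliced, which matches the behaviour Robert observed.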
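To make the strace summary easier to interpret, the sketch below re-derives its totals from the quoted counts. It then divides the total number of system calls by the number of epoll_wait() calls. The result, roughly 1000 calls per wakeup, illustrates the aggregation mentioned in the reply: each poll returns many ready events at once. Reading the connect() errors as EINPROGRESS from non-blocking connects is an assumption, because the strace -c summary alone does not show errno values.

```python
# Counts copied from the `strace -c` summary above: syscall -> (calls, errors).
STRACE_SUMMARY = {
    "recvfrom": (36092, 4709),
    "sendto": (18898, 8),
    "connect": (7701, 7701),
    "epoll_ctl": (19320, 0),
    "setsockopt": (23321, 0),
    "accept4": (11419, 140),
    "close": (10174, 0),
    "socket": (7701, 0),
    "shutdown": (4768, 257),
    "fcntl": (7701, 0),
    "brk": (1588, 0),
    "epoll_wait": (149, 0),
    "getsockopt": (23, 0),
}


def totals(summary):
    """Return (total calls, total errors) across all syscalls."""
    calls = sum(c for c, _ in summary.values())
    errors = sum(e for _, e in summary.values())
    return calls, errors


def calls_per_wakeup(summary):
    """Average number of syscalls issued per epoll_wait() wakeup."""
    calls, _ = totals(summary)
    return calls / summary["epoll_wait"][0]


if __name__ == "__main__":
    calls, errors = totals(STRACE_SUMMARY)
    print(calls, errors)                              # 148855 12815
    print(f"{calls_per_wakeup(STRACE_SUMMARY):.0f}")  # 999
    # socket(), fcntl() and connect() all appear 7701 times, consistent with
    # one of each per server-side connection. Every connect() "fails", most
    # likely with EINPROGRESS, the normal result of a non-blocking connect.
```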
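One way to check the conntrack suspicion is to look for the netfilter sysctl files nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter, which exist only while the conntrack module is loaded. The helper below is a hypothetical sketch that only reports how many entries are tracked against the table size. It does not tune anything.

```python
from pathlib import Path

# Standard netfilter sysctl directory; the nf_conntrack_* files only exist
# while the conntrack module is loaded.
NETFILTER_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(base=NETFILTER_DIR):
    """Return (tracked entries, table size), or None if conntrack is absent."""
    try:
        count = int((base / "nf_conntrack_count").read_text())
        limit = int((base / "nf_conntrack_max").read_text())
    except OSError:
        # Missing files (module not loaded) or unreadable /proc entries.
        return None
    return count, limit


if __name__ == "__main__":
    usage = conntrack_usage()
    if usage is None:
        print("conntrack does not appear to be loaded")
    else:
        count, limit = usage
        print(f"conntrack is active: {count} of {limit} entries in use")
```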

