Hi,

On Fri, Sep 21, 2012 at 06:43:10PM +0800, Godbach wrote:
> Hi, all
>
> I used TCP mode to test the splice performance in haproxy. The traffic
> is HTTP, and the throughput I measured is as follows:
>
> | objsize->   | 64B    | 1kB     | 8kB    | 1MB     |
> |-------------+--------+---------+--------+---------|
> | tcp-nosp    | 350M   | 1.46G   | 1.9G   | 2.0G    |
> | tcp-splice  | 285M   | 1.28G   | 1.8G   | 1.99G   |
>
> * tcp-nosp: haproxy in TCP mode, without splice
> * tcp-splice: haproxy in TCP mode, splice enabled for both request
>   and response
>
> I have tested different HTTP object sizes: 64B, 1kB, 8kB and 1MB. I
> expected the throughput improvement to become more obvious as the
> object size grows, but the test results were the opposite: the
> throughput of haproxy without splice was better than with splice in
> my test.
First, it's important to be aware that TCP splicing performance varies
a lot according to:
  - the NIC
  - the NIC settings
  - the kernel version
  - the socket buffer and pipe sizes

> The environment and configuration are as follows:
> 1. Kernel: linux-2.6.38.8, x86_64

If you're interested, big improvements were made in kernel 3.5. TCP
splicing was first introduced in 2.6.25, but some issues with skb ref
counting and unacked segments caused occasional data corruption, which
led to zero-copy being disabled in the next version. It was completely
reworked in 3.5: splice() between TCP sockets is really zero-copy now,
and this finally solved the issue of splice() showing worse performance
than recv()+send() on small data, which is what you're observing right
now.

Between 2.6.25 and 3.5, a number of kernels would not always loop on
all incoming packets during a splice() call. I remember splice() moving
only 1460 or 2920 bytes at a time on some NICs, which showed disastrous
performance, much worse than recv/send. I don't remember exactly which
kernel got rid of that, but quite frankly, if you're doing performance
testing only, do it on 3.5 to get rid of most known issues.

> 2. NIC: Intel 82599EB 10-Gigabit, two NICs, one for the client and
>    the other for the server.
>    Settings for the two NICs:
>      rx-checksumming: on
>      tx-checksumming: on
>      scatter-gather: on
>      tcp-segmentation-offload: on
>      udp-fragmentation-offload: off
>      generic-segmentation-offload: on
>      generic-receive-offload: on
>      large-receive-offload: on
>      rx-vlan-offload: off
>      tx-vlan-offload: off
>      ntuple-filters: off
>      receive-hashing: on

Your settings look good. LRO is one of the most important ones and it
is enabled. You need to know that the 82599 is the hardest NIC to tune
ever. Basically you can hardly reach line rate and low latency at the
same time.
Either you configure it for low latency and you're lucky if you reach
3 Gbps, or you configure it for throughput and you can easily
experience several-millisecond delays in small packet delivery. I
suspect you're in the first case.

> 3. CPU: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, 4 cores
> 4. Memory: 16GB
> 5. haproxy-1.5-dev7
>    Config file:
>      global
>          node test
>          pidfile /tmp/test.haproxy.pid
>          stats socket /tmp/test.haproxy.socket level admin
>          maxconn 1048576

In more recent haproxy versions, there is a new tuning option in the
global section: "tune.pipesize". It is used to change the kernel's pipe
size so that splice() is performed on larger amounts of data at once.
By default, a pipe is 64kB (16 pages), and when data are received on a
NIC such as yours with many concurrent streams, you often have only one
packet per page. Pipes are generally filled to 75% of their size, and
if you strace the process, you'll see splice() calls of around 17kB
(only 12 segments at once), which is much smaller than what recv/send
would move with a copy. By increasing the pipe size, you can
significantly increase this: I've seen splice() calls of more than
400 kB at once on a small device with large pipes. Pipes are shared
between multiple connections, so you can allocate large ones; very few
of them remain in use during transfers (check this on the stats page).
Setting "tune.pipesize 524288" on a 10-gig machine does not seem stupid
to me at all.

I also suggest that you upgrade to dev11 or dev12 to benefit from some
splice improvements (eg: not calling recv() again after an EAGAIN is
received).

> The following is the CPU usage while haproxy is running with splice
> and 8kB HTTP objects:

It would be nice to check with strace -c how many times splice()
returns data and how many times recv+send are used instead. Indeed, as
indicated above, when splice() fails in your version, recv() is
attempted again. At 10G speed, more data can arrive between two
syscalls, which will often cause recv() to succeed.
This means that a significant percentage of your objects might actually
be transferred using recv+send instead of splice, since 8kB fits in a
buffer.

> top - 18:28:06 up 15 days, 2:02, 5 users, load average: 0.07, 0.07, 0.09
> Tasks: 102 total, 2 running, 99 sleeping, 1 stopped, 0 zombie
> Cpu0 : 3.7%us, 34.7%sy, 0.0%ni, 9.3%id, 0.0%wa, 0.0%hi, 52.3%si, 0.0%st

Here you have a problem, a very common one: the system delivers
interrupts to the CPU which is doing the work. The net effect is that
these 52% spent in softirq (the driver and the TCP stack) are not
usable by haproxy. You should force your NIC to deliver its interrupts
to one core and have haproxy running on another one (eg: force haproxy
to CPU1 using taskset).

> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 16423324k total, 2823352k used, 13599972k free, 279300k buffers
> Swap: 16730108k total, 0k used, 16730108k free, 1357740k cached
>
> PID   USER PR NI VIRT RES  SHR S %CPU %MEM TIME+   COMMAND
> 27086 root 20 0  277m 166m 888 R 36.3 1.0  0:08.36 haproxy
>
> I am wondering whether this throughput is normal or not, and whether
> there is anything I can improve to test splice for haproxy.

2 Gbps seems very low, unless the test is limited by your client and
server, which actually might be the case given that some idle time
remains. Also, it is possible that the NIC is adding some latency when
delivering ACKs, which limits transfer rates. In order to limit this,
you should also increase your tcp_rmem and tcp_wmem sizes. You can set
them both to "4k 256k 16M" for example. It is very important to ensure
that wmem is always at least as large as rmem when using splice,
otherwise you cannot completely flush the pipes and many of them stay
allocated. These buffers do not really eat much memory when using
splice, since only pointers are moved.
Hoping this helps,
Willy

