Hi,

On Fri, Sep 21, 2012 at 06:43:10PM +0800, Godbach wrote:
> Hi, all
> I used TCP mode to test splice performance in haproxy. The traffic
> is HTTP; the throughput I measured is as follows:
> 
> | objsize->  | 64B  | 1kB   | 8kB  | 1MB   |
> |------------|------|-------|------|-------|
> | tcp-nosp   | 350M | 1.46G | 1.9G | 2.0G  |
> | tcp-splice | 285M | 1.28G | 1.8G | 1.99G |
>   * tcp-nosp: haproxy in TCP mode, without splice
>   * tcp-splice: haproxy in TCP mode, with splice enabled for both
> request and response.
> 
> I have tested different HTTP object sizes: 64B, 1kB, 8kB, 1MB. I
> expected the throughput to improve noticeably as the object size
> grows larger, but the test results seemed to be the opposite: the
> throughput of haproxy without splice is better than with splice in
> my test.

First, it's important to be aware that TCP splicing performance varies a
lot according to :
  - NIC
  - NIC settings
  - kernel version
  - socket buffer size and pipe size

> The environment and configuration as follows:
> 1. Kernel:  linux-2.6.38.8, x86_64

If you're interested, big improvements were made in kernel 3.5. TCP splicing
was first introduced in 2.6.25, but issues with skb ref counting and unacked
segments caused occasional data corruption, which led to zero-copy being
disabled in the next version. It was completely reworked in 3.5: splice()
between TCP sockets is really zero-copy now, and this finally solved the
issue of splice() showing worse performance than recv()+send() on small data,
which is what you're observing right now.

Between 2.6.25 and 3.5, there were a number of kernels which would not always
loop on all incoming packets during a splice() call. I have memories of
splice() moving 1460 or 2920 bytes at a time on some NICs, which showed
disastrous performance, much worse than recv/send. I don't remember exactly
what kernel got rid of that, but quite frankly if you're doing performance
testing only, do it on 3.5 to get rid of most known issues.

> 2. NIC: Intel 82599EB 10-Gigabit, two NICs: one for client, and the
> other for server.
>   Settings for two NICs
>     rx-checksumming: on
>     tx-checksumming: on
>     scatter-gather: on
>     tcp-segmentation-offload: on
>     udp-fragmentation-offload: off
>     generic-segmentation-offload: on
>     generic-receive-offload: on
>     large-receive-offload: on
>     rx-vlan-offload: off
>     tx-vlan-offload: off
>     ntuple-filters: off
>     receive-hashing: on
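For reference, these settings can be checked and applied with ethtool. This
is only a sketch: "eth0" is an assumed interface name, substitute your own
82599 ports.

```shell
#!/bin/sh
# Sketch: verify and apply the offload settings listed above.
# Interface name is an assumption; check "ip link" for yours.
IFACE=eth0

# Show the current offload state to compare against the list above
ethtool -k "$IFACE"

# LRO/GRO matter most for splice throughput: they coalesce many MSS-sized
# segments into one large skb, so each splice() call moves more data at once
ethtool -K "$IFACE" lro on
ethtool -K "$IFACE" gro on
ethtool -K "$IFACE" tso on
ethtool -K "$IFACE" gso on
ethtool -K "$IFACE" sg on
```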

Your settings look good. LRO is one of the most important ones and it is
enabled. You need to know that 82599 is the hardest NIC to tune ever.
Basically you can hardly reach line rate and low latency at the same time.
Either you configure it for low latency and you're lucky if you reach 3 Gbps,
or you configure it for throughput and you can easily experience several
milliseconds delays in small packet delivery. I suspect you're in the first
case.

> 3. CPU: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, 4 cores
> 4. Memory: 16GB
> 5. haproxy-1.5-dev7
>   Config file:
>     global
>             node test
>             pidfile /tmp/test.haproxy.pid
>             stats socket /tmp/test.haproxy.socket level admin
>             maxconn 1048576

In more recent haproxy versions, you have a new tuning in the global
section : "tune.pipesize". It is used to change the kernel's pipe
size so that splice() is performed on larger data at once. By default,
a pipe is 64kB (16 pages) and when data are received on a NIC such as
you have above with many concurrent streams, you often only have one
packet per page. Pipes are generally filled to 75% of their sizes and
if you strace the process, you'll see splice() calls of around 17kB
(only 12 segments at once), which is much smaller than what recv/send
would do with a copy. By increasing the pipe size, you can significantly
increase this. I've seen splice() of more than 400 kB at once on a small
device with large pipes. Pipes are shared between multiple connections,
so you can allocate large ones, very few of them are kept during transfers
(check for this on the stats page). Setting "tune.pipesize 524288" on a
10-gig machine does not seem stupid to me at all.
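For instance, your global section above would only need one extra line
(524288 bytes = 128 pages of 4kB; the value is a suggestion to experiment
with, not a measured optimum):

```
    global
            node test
            pidfile /tmp/test.haproxy.pid
            stats socket /tmp/test.haproxy.socket level admin
            maxconn 1048576
            tune.pipesize 524288   # 512kB pipes instead of the default 64kB
```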

I also suggest that you upgrade to dev11 or dev12 to benefit from some
splice improvements (eg: do not call recv after an EAGAIN is received).

> The following is CPU usage while haporxy is running with splice and
> HTTP 8kB object:

It would be nice to check with strace -c how many times splice returns
data and how many times recv+send are used instead.
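Something like this against the running process would do (the pid 27086 is
taken from your top output below; adjust it to the current run):

```shell
#!/bin/sh
# Attach to the running haproxy during a test, counting syscalls
# instead of printing each one; interrupt with Ctrl-C to get the summary
strace -c -p 27086

# Or trace only the data-path syscalls to see individual transfer sizes.
# On x86_64, libc's recv()/send() show up as recvfrom()/sendto().
strace -p 27086 -e trace=splice,recvfrom,sendto
```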

Indeed, as indicated above, when splice() fails with EAGAIN, recv() is
attempted again. At 10G speed, between two syscalls you can get more data,
which will often cause recv() to succeed. This means that a significant
percentage of your objects might actually be transferred using recv+send
instead of splice, since an 8kB object fits in a buffer.

> top - 18:28:06 up 15 days,  2:02,  5 users,  load average: 0.07, 0.07, 0.09
> Tasks: 102 total,   2 running,  99 sleeping,   1 stopped,   0 zombie
> Cpu0  :  3.7%us, 34.7%sy,  0.0%ni,  9.3%id,  0.0%wa,  0.0%hi, 52.3%si,  0.0%st

Here you have a problem, a very common one. The system delivers interrupts to
the CPU which is doing the work. The net effect is that these 52% spent in
softirq (the driver and TCP stack) are not usable by haproxy. You should
force your NIC to deliver interrupts to one core and have haproxy running
on another one (eg: force haproxy to CPU1 using taskset).
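A rough sketch of this separation (the IRQ number and haproxy pid here are
assumptions; check /proc/interrupts and your own pid):

```shell
#!/bin/sh
# Sketch: pin NIC interrupts to CPU0 and haproxy to CPU1 so that softirq
# time does not steal cycles from the proxy process.

# Stop irqbalance first, otherwise it will undo the affinity set below
# (on some distributions this is a service to stop instead)
killall irqbalance 2>/dev/null

# Find the NIC's IRQ lines (interface name assumed)
grep eth0 /proc/interrupts

# Bind one queue's IRQ (say 45) to CPU0 (bitmask 1)
echo 1 > /proc/irq/45/smp_affinity

# Move the running haproxy (pid assumed) to CPU1 (bitmask 2)
taskset -p 2 27086
```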

> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  16423324k total,  2823352k used, 13599972k free,   279300k buffers
> Swap: 16730108k total,        0k used, 16730108k free,  1357740k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 27086 root      20   0  277m 166m  888 R 36.3  1.0   0:08.36 haproxy
> 
> I am wondering whether the throughput is normal or not, and whether
> there is anything I can improve to test splice for haproxy.

2 Gbps seems very low, unless the test is limited by your client and server,
which actually might be the case given that some idle time remains. Also, it
is possible that the NIC is adding some latency when delivering ACKs, which
limits transfer rates.

In order to limit this, you should also increase your tcp_rmem and tcp_wmem
sizes. You can set them both to 4k 256k 16M for example. It is very important
that you ensure the wmem is always at least as large as the rmem when using
splice, otherwise you can't completely flush the pipes and many of them stay
allocated. These buffers do not really eat much memory when using splice
since only pointers are moved.
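Concretely, that tuning would look like this (run as root; the default
value of 256kB is just the middle of the suggested "4k 256k 16M" range):

```shell
#!/bin/sh
# Sketch: widen the TCP buffer autotuning ranges as suggested above.
# Values are "min default max": 4kB min, 256kB default, 16MB max.
# wmem is kept as large as rmem so the pipes can always be flushed.
sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 262144 16777216"
```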

Hoping this helps,
Willy

