Hi Godbach,

On Tue, Sep 25, 2012 at 06:30:25PM +0800, Godbach wrote:
> Hi, Willy
>
> I have tested haproxy's performance again today.
>
> My tester has eight gigabit ports: four gigabit ports aggregated
> to emulate clients, and the other four gigabit ports aggregated to
> emulate servers. So the max throughput is expected to be 4Gbps.
>
> Whether splice is enabled or disabled (disabled with the -dS option
> or at compile time), the throughput is about 2.8Gbps under the
> following conditions:
> 1) HTTP object size: 1MB
> 2) max concurrent sessions: 10,000
> 3) one HTTP transaction on each connection.
> Enabling splice did not improve the throughput.
>
> The following settings were applied according to your suggestions:
> 1. kernel version: 3.5.0
> 2. haproxy version: 1.5-dev12
> 3. haproxy config added: tune.pipesize 524288
> 4. sysctl:
>    net.ipv4.tcp_rmem = 4096 262144 16745216
>    net.ipv4.tcp_wmem = 4096 262144 16745216
> 5. haproxy runs on core 0, and network interrupts are sent to core 1.
> 6. LRO is enabled
>    Offload parameters for eth1 (eth3):
>    rx-checksumming: on
>    tx-checksumming: on
>    scatter-gather: on
>    tcp-segmentation-offload: on
>    udp-fragmentation-offload: off
>    generic-segmentation-offload: on
>    generic-receive-offload: on
>    large-receive-offload: on
>    rx-vlan-offload: off
>    tx-vlan-offload: off
>    ntuple-filters: off
>    receive-hashing: on
>
> The following are CPU usage and interrupts on different cores:
>
> top - 17:12:30 up 23:10,  3 users,  load average: 0.62, 0.60, 0.53
> Tasks:  99 total,   3 running,  96 sleeping,   0 stopped,   0 zombie
> Cpu0  :  4.8%us, 30.3%sy,  0.0%ni, 63.8%id,  0.0%wa,  0.3%hi,  0.7%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  7.0%id,  0.0%wa,  2.3%hi, 90.6%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
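As a back-of-the-envelope check on the packet rate implied by these numbers, the Python sketch below converts payload throughput into full-size TCP segments per second. It assumes a 1500-byte MTU, IPv4 without options and TCP timestamps enabled (1448 payload bytes per segment), and treats the 2.8Gbps figure as payload throughput; none of this is stated in the test description. With LRO/GRO enabled the kernel stack may see fewer, larger packets than the NIC actually receives.

```python
import math

# Assumed, not stated in the test description: 1500-byte MTU, IPv4 without
# options, and TCP timestamps enabled (12 bytes of TCP options per segment).
MTU = 1500
MSS = MTU - 20 - 20 - 12  # IPv4 header + TCP header + timestamp option -> 1448 bytes


def segments_per_second(payload_gbps: float, mss: int = MSS) -> float:
    """Full-size TCP segments per second needed to carry payload_gbps of data."""
    return payload_gbps * 1e9 / (mss * 8)


def segments_per_object(object_bytes: int, mss: int = MSS) -> int:
    """Minimum number of data segments needed to carry one HTTP object body."""
    return math.ceil(object_bytes / mss)


print(f"{segments_per_second(2.8):,.0f} segments/s")  # prints "241,713 segments/s"
```

Under these assumptions, 2.8Gbps is roughly 242,000 data segments per second received from the servers and the same number sent to the clients, before counting ACKs. A "1MB" object needs 691 segments if it means 1,000,000 bytes, or 725 if it means 1,048,576 bytes.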
As you can see, most of the time is spent in softirq (network stack + driver):
core 1 is at 90.6% si. Do you know how many packets were processed per second?
The interrupt rate needs to be checked too. You can get some advanced stats
using ethtool -S on each device (just one on the input path will already be
enough).

If you can't make the interrupt processing consume less CPU, you can try to
spread your interrupts over multiple cores, provided that you're not using the
same core as haproxy. You need to saturate either the network or enough CPUs,
but right now, given that haproxy runs at roughly 35% (4.8% us + 30.3% sy on
core 0), there is some room for improvement.

Regards,
Willy
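A minimal Python sketch of the measurements and interrupt spreading suggested above, assuming Linux's sysfs/procfs layout (/sys/class/net/<dev>/statistics/, /proc/interrupts, /proc/irq/<N>/smp_affinity). The helper names are hypothetical. It reads the generic sysfs packet counters rather than parsing ethtool -S, whose counter names vary by driver. It only computes rates and hex affinity masks; writing the masks to /proc/irq/<N>/smp_affinity (as root) is left to the reader. Spreading only helps if each NIC exposes several queues with their own IRQs; "receive-hashing: on" hints at that, but it is an assumption.

```python
from pathlib import Path


def read_counter(dev: str, name: str) -> int:
    """Read a NIC counter such as rx_packets from sysfs (Linux only)."""
    return int(Path(f"/sys/class/net/{dev}/statistics/{name}").read_text())


def per_second(before: int, after: int, seconds: float) -> float:
    """Rate between two samples of a monotonically increasing counter."""
    return (after - before) / seconds


def irq_totals(proc_interrupts: str, match: str) -> dict[str, int]:
    """Sum per-CPU counts of each IRQ whose description contains `match`.

    `proc_interrupts` is the text of /proc/interrupts: a header row of CPU
    names, then rows like ' 45:  10  200  PCI-MSI-edge  eth1-TxRx-0'.
    """
    lines = proc_interrupts.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals: dict[str, int] = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip blank lines and short rows such as 'ERR:' that have one count.
        if len(fields) <= ncpu or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + ncpu]
        if not all(c.isdigit() for c in counts):
            continue
        description = " ".join(fields[1 + ncpu:])
        if match in description:
            totals[fields[0][:-1]] = sum(int(c) for c in counts)
    return totals


def spread_affinity(irqs: list[str], cpus: list[int], haproxy_cpu: int) -> dict[str, str]:
    """Assign IRQs round-robin to every CPU except haproxy's.

    Returns one hex bitmask per IRQ for /proc/irq/<N>/smp_affinity.
    Assumes at most 32 CPUs (larger masks use comma-separated groups).
    """
    usable = [c for c in cpus if c != haproxy_cpu]
    if not usable:
        raise ValueError("no CPU left for interrupts besides haproxy's")
    return {irq: format(1 << usable[i % len(usable)], "x") for i, irq in enumerate(irqs)}
```

For example, sample read_counter("eth1", "rx_packets") and irq_totals(Path("/proc/interrupts").read_text(), "eth1") twice, one second apart, and pass each pair to per_second(). With hypothetical IRQs 45 to 47 on a 4-core box where haproxy is bound to core 0, spread_affinity(["45", "46", "47"], [0, 1, 2, 3], haproxy_cpu=0) yields masks "2", "4" and "8", i.e. cores 1, 2 and 3.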

