On Tue, Sep 18, 2012 at 12:55:02PM +0200, Dick Snippe wrote:
FYI:
> For our production platform I will try some experiments with decreased
> txqueuelen, binding (web)server instances to specific cores and boot
> a server with kernel 3.5 + fq_codel to see what works best in practice.
After quite a bit of testing + some real-world experience on our
production platform, it turns out that tweaking Receive Packet Steering
(rps) in combination with fq_codel can have a huge impact.
My theory (based on the results below) is that when a server is sending
out large volumes of data, the return traffic (ACKs) can become so
large (150,000 packets/second) that the driver switches to polling.
When rps is not active, this polling apparently causes a drop in
throughput and all tx queues fill up, resulting in much higher latency.
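A quick way to see whether the driver has switched to polling is to watch
the per-queue interrupt counters; when it stays in polling mode the
interrupt rate on the queue vectors drops sharply. The ethX name and the
"ethX-TxRx-N" vector naming below are just placeholders for illustration:
  watch -n1 -d 'grep ethX-TxRx- /proc/interrupts'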
Below are our results.
Our test setup consisted of 4 servers (IBM HS22 blades, 96 GB RAM, 2x
quad-core Westmere E5620, dual 82599EB 10-Gigabit NICs, running a vanilla
kernel.org 3.5.4 kernel with the stock ixgbe 3.9.15-k driver):
host1: runs the dltest webserver, serving a 100 Mbyte test file
host2+3: act as clients using the ab test program:
ab -n 100000 -c 500 http://dltest.omroep.nl/100m
host4: "observation server", measuring ping latency to host1:
sudo ping -c 1000 -i 0.001 -q dltest.omroep.nl
All 4 servers are directly connected through a Cisco Nexus 4001I
10GB Switch. The test servers are in the same blade enclosure and
the test traffic never leaves the blade enclosure.
With default settings and a small number of flows (ab -c 10)
we can obtain line speed easily and ping latency is low (<1ms).
The same goes for iperf.
However, with a larger number of flows (both clients doing ab -c 500,
i.e. a total of 1000 concurrent flows), throughput on the web server
drops dramatically to 1-2 Gbit/s and ping latency rises to ~100ms.
We tested a number of different combinations:
fq_codel: on/off
smp_affinity: hint/000f/00ff/0f0f/ffff
bql: 100000/1000000/max
rps: 0000/000f/00f0/00ff/ffff/hint
(hint for smp_affinity means reading /proc/irq/X/affinity_hint and writing
that value to /proc/irq/X/smp_affinity. This basically binds queue1
to cpu1, queue2 to cpu2, etc. hint for rps means reading the smp affinity
hint for all tx queues and using that to set the rps mask for the
matching cpu; a sketch follows below.)
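As a rough illustration, the "hint" settings can be applied with a couple
of small shell loops; the ethX interface name, the "ethX-TxRx-N" vector
naming and the rx queue numbering are assumptions for illustration (run
as root):
  # smp_affinity "hint": copy each queue vector's affinity hint into its irq
  for irq in $(grep ethX- /proc/interrupts | awk -F: '{print $1}'); do
      cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
  done
  # rps "hint": give rx queue N the same cpu mask as its queue vector
  i=0
  for irq in $(grep ethX-TxRx- /proc/interrupts | awk -F: '{print $1}'); do
      cat /proc/irq/$irq/affinity_hint > /sys/class/net/ethX/queues/rx-$i/rps_cpus
      i=$((i+1))
  done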
Relevant commands:
fq_codel: sudo tc qdisc add dev ethX root fq_codel
aff: echo $val > /proc/irq/X/smp_affinity
bql: echo $val > /sys/class/net/ethX/queues/tx-Y/byte_queue_limits/limit_max
rps: echo $val > /sys/class/net/ethX/queues/rx-Y/rps_cpus
A test consisted of running ab while measuring ping latency (ping -c
1000 -i 0.001 -q, returns min/avg/max/mdev rtt) at the observation
server and bandwidth (sar -n DEV 1 111111, returns kbyte/sec) on the
web server.
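For reference, the numbers in the table below can be pulled out of those
tools roughly like this; ethX, the sar sample count and the sar field
position are assumptions and may differ between ping/sysstat versions:
  # observation server: grab the rtt min/avg/max/mdev summary line
  sudo ping -c 1000 -i 0.001 -q dltest.omroep.nl | grep '^rtt'
  # web server: average tx bandwidth [kbyte/sec] on ethX as reported by sar
  sar -n DEV 1 120 | awk '/^Average:.*ethX/ {print $6}'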
During testing, not all results were 100% reproducible. This probably
has to do with NAPI; sometimes the driver switches to polling
mode, sometimes it doesn't. When the driver switches to polling mode
(as seen by a decrease in interrupts/sec on the rx queues), throughput
can drop dramatically when rps is not active. Unfortunately I was not
able to force the driver into polling or interrupt mode during testing,
so the results marked with [*] should be taken with a grain of salt,
because the driver might flip into polling mode where throughput is
significantly lower.
The results are as follows:
codel  bql      aff   rps   b/w[kbyte/sec]  latency[ms] (min/avg/max/mdev)
off    max      hint  0000  147849.07       0.098/90.264/362.198/104.473
off    max      000f  0000  1201655.30      23.004/30.942/41.554/4.874
off    1000000  000f  0000  1201604.00      11.304/13.983/16.675/0.896
off    100000   000f  0000  809498.39       17.187/20.158/22.572/0.845
off    3028     000f  0000  81827.24        81.256/86.261/90.519/1.812
on     max      hint  0000  299866.93       0.175/2.812/16.654/3.192
on     max      0001  0000  1160509.10      0.047/0.736/1.625/0.279
on     max      0003  0000  1199052.43      0.045/1.807/24.913/3.423
on     max      0007  0000  1201782.16      0.038/1.578/21.759/2.898
on     max      000f  0000  1201782.16      0.172/0.365/0.796/0.108
on     max      00ff  0000  1201782.16      0.180/1.019/16.321/2.351
on     max      0f0f  0000  1201782.16      0.177/0.925/12.550/2.029
on     max      ffff  0000  1201782.16      0.182/1.303/15.290/2.551
on     100000   000f  0000  823541.00       0.042/0.230/4.134/0.158
on     1000000  000f  0000  1201655.30[*]   0.178/0.365/0.808/0.110
on     max      hint  000f  1201540.48      0.221/1.550/20.786/2.702
on     max      hint  00f0  1201540.48      0.241/0.780/1.260/0.264
on     max      hint  00ff  954511.04       0.195/2.364/27.699/4.041
on     max      hint  ffff  313311.60       0.196/6.522/51.672/8.782
on     max      fff0  000f  1201540.48      0.195/2.364/27.699/4.041
on     max      000f  000f  1201540.48      0.212/1.317/15.166/2.203
off    max      hint  000f  1201540.48      67.948/90.301/111.427/9.628
off    max      hint  hint  368576.08       0.193/18.972/94.248/17.813
Our conclusion is that fq_codel is needed to keep latency low and
rps=f0 is needed to prevent the bandwidth from dropping significantly.
smp_affinity can be set to "hint" in order to get optimal performance
when in interrupt mode. Since with these settings the tx queues apparently
never build up, bql can be kept at its default (max).
Our current production settings are therefore:
fq_codel: on
smp_affinity: hint
bql: max
rps: f0 (4 cores are powerful enough to drive 10G traffic)
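For reference, a minimal sketch of applying these settings at boot; ethX
is a placeholder interface name, smp_affinity is set from the affinity
hints as in the earlier sketch, bql is simply left alone, and the script
is assumed to run as root:
  # fq_codel as the root qdisc
  tc qdisc add dev ethX root fq_codel
  # rps: mask f0 (cpus 4-7) on every rx queue
  for q in /sys/class/net/ethX/queues/rx-*; do
      echo f0 > $q/rps_cpus
  done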
With these settings we've seen sustained real-world traffic of >6 Gbit/s
throughput per server, with latency <10ms.
All in all I'm really impressed by the quality of the hardware and the
ixgbe driver. After some initial tweaking we can obtain line speed and
have low latency using only moderate amounts of CPU.
So, kudos to the kernel developers. Great work guys!
--
Dick Snippe, internetbeheerder \ fight war
[email protected], +31 35 677 3555 \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A