On Tue, Sep 18, 2012 at 12:55:02PM +0200, Dick Snippe wrote:
FYI:
> For our production platform I will try some experiments with decreased
> txqueuelen, binding (web)server instances to specific cores and boot
> a server with kernel 3.5 + fq_codel to see what works best in practice.

After quite a bit of testing plus some real-world experience on our
production platform it turns out that tweaking Receive Packet Steering
(rps) in combination with fq_codel can have a huge impact.

My theory (based on the results below) is that when a server is sending
out large volumes of data, the return traffic (ACKs) can become so large
(150,000 packets/second) that the driver switches to polling. When rps is
not active this polling apparently causes a drop in throughput and all
tx queues fill up, resulting in much larger latency.

Below are our results.

Our test setup consisted of 4 servers (IBM HS22 blades, 96 Gbyte RAM,
2x quad core Westmere E5620, dual 82599EB 10-Gigabit NICs, running a
vanilla kernel.org 3.5.4 kernel with the stock ixgbe 3.9.15-k driver):

host1:   runs the dltest webserver, serving a 100 Mbyte test file
host2+3: act as clients using the ab test program:
         ab -n 100000 -c 500 http://dltest.omroep.nl/100m
host4:   "observation server", for measuring ping latency to host1:
         sudo ping -c 1000 -i 0.001 -q dltest.omroep.nl

All 4 servers are directly connected through a Cisco Nexus 4001I 10Gb
switch. The test servers are in the same blade enclosure and the test
traffic never leaves the blade enclosure.

With default settings and a small number of flows (ab -c 10) we can
obtain line speed easily and ping latency is low (<1ms). The same goes
for iperf. However, with a larger number of flows (both clients doing
ab -c 500, i.e. a total of 1000 concurrent flows) throughput on the
webserver drops dramatically to 1-2 Gbit and ping latency rises to
~100ms.

We tested a number of different combinations:

fq_codel:     on/off
smp_affinity: hint/000f/00ff/0f0f/ffff
bql:          100000/1000000/max
rps:          0000/000f/00f0/00ff/ffff/hint

("hint" in smp_affinity means reading /proc/irq/X/affinity_hint and
writing that value to /proc/irq/X/smp_affinity. This basically binds
queue1 to cpu1, queue2 to cpu2, etc. "hint" in rps means reading the
smp affinity hint for all tx queues and using that to set the rps mask
for the matching cpu.)

Relevant commands:

fq_codel: sudo tc qdisc add dev ethX root fq_codel
aff:      echo $val > /proc/irq/X/smp_affinity
bql:      echo $val > /sys/class/net/ethX/queues/tx-Y/byte_queue_limits/limit_max
rps:      echo $val > /sys/class/net/ethX/queues/rx-Y/rps_cpus

A test consisted of running ab while measuring ping latency
(ping -c 1000 -i 0.001 -q, which returns min/avg/max/mdev rtt) at the
observation server and bandwidth (sar -n DEV 1 111111, which returns
kbyte/sec) on the web server.

During testing not all results were 100% reproducible. Probably this has
to do with NAPI; sometimes the driver switches to polling mode, sometimes
it doesn't. When the driver switches to polling mode (as seen by a
decrease in interrupts/sec on the rx queues) throughput can drop
dramatically when rps is not active. Unfortunately I was not able to
force the driver into polling or interrupt mode during testing, so the
results marked with [*] should be taken with a grain of salt, because
the driver might flip into polling mode where throughput is
significantly lower.
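For concreteness, here is a minimal shell sketch of how the "hint"
setting can be applied (run as root; the interface name eth0 and the
ixgbe vector naming "eth0-TxRx-<n>" in /proc/interrupts are assumptions
and may differ on other setups):

  iface=eth0
  # For every TxRx vector: copy the driver's affinity hint into
  # smp_affinity and mirror the same cpu mask into rps_cpus of the
  # corresponding rx queue.
  grep "$iface-TxRx-" /proc/interrupts | while read -r line; do
      irq=$(echo "$line" | cut -d: -f1 | tr -d ' ')
      q=$(echo "$line" | sed "s/.*$iface-TxRx-//")
      hint=$(cat /proc/irq/$irq/affinity_hint)
      echo "$hint" > /proc/irq/$irq/smp_affinity
      echo "$hint" > /sys/class/net/$iface/queues/rx-$q/rps_cpus
  done
  # fq_codel as root qdisc; bql limit_max is left at its default (max)
  tc qdisc add dev $iface root fq_codel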
The results are as follows:

codel  bql      aff   rps   b/w [kbyte/sec]  latency [ms] (min/avg/max/mdev)
off    max      hint  0000   147849.07       0.098/90.264/362.198/104.473
off    max      000f  0000  1201655.30       23.004/30.942/41.554/4.874
off    1000000  000f  0000  1201604.00       11.304/13.983/16.675/0.896
off    100000   000f  0000   809498.39       17.187/20.158/22.572/0.845
off    3028     000f  0000    81827.24       81.256/86.261/90.519/1.812
on     max      hint  0000   299866.93       0.175/2.812/16.654/3.192
on     max      0001  0000  1160509.10       0.047/0.736/1.625/0.279
on     max      0003  0000  1199052.43       0.045/1.807/24.913/3.423
on     max      0007  0000  1201782.16       0.038/1.578/21.759/2.898
on     max      000f  0000  1201782.16       0.172/0.365/0.796/0.108
on     max      00ff  0000  1201782.16       0.180/1.019/16.321/2.351
on     max      0f0f  0000  1201782.16       0.177/0.925/12.550/2.029
on     max      ffff  0000  1201782.16       0.182/1.303/15.290/2.551
on     100000   000f  0000   823541.00       0.042/0.230/4.134/0.158
on     1000000  000f  0000  1201655.30[*]    0.178/0.365/0.808/0.110
on     max      hint  000f  1201540.48       0.221/1.550/20.786/2.702
on     max      hint  00f0  1201540.48       0.241/0.780/1.260/0.264
on     max      hint  00ff   954511.04       0.195/2.364/27.699/4.041
on     max      hint  ffff   313311.60       0.196/6.522/51.672/8.782
on     max      fff0  000f  1201540.48       0.195/2.364/27.699/4.041
on     max      000f  000f  1201540.48       0.212/1.317/15.166/2.203
off    max      hint  000f  1201540.48       67.948/90.301/111.427/9.628
off    max      hint  hint   368576.08       0.193/18.972/94.248/17.813

Our conclusion is that fq_codel is needed to keep latency low and
rps=f0 is needed to prevent the bandwidth from dropping significantly.
smp_affinity can be set to "hint" in order to have optimal performance
when in interrupt mode. Since with these settings the tx queues
apparently never build up, bql can be kept at its default (max).

Our current production settings are therefore:

fq_codel:     on
smp_affinity: hint
bql:          max
rps:          f0 (4 cores are powerful enough to drive 10G traffic)

(A minimal script sketch applying these settings is appended below.)

With these settings we've seen sustained real-world traffic of >6 Gbit
throughput per server, with latency <10ms.

All in all I'm really impressed by the quality of the hardware and the
ixgbe driver. After some initial tweaking we can obtain line speed and
have low latency using only moderate amounts of CPU. So, kudos to the
kernel developers. Great work guys!

-- 
Dick Snippe, internetbeheerder       \ fight war
beh...@omroep.nl, +31 35 677 3555     \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A
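A minimal sketch of the production settings above (again assuming the
interface name eth0; smp_affinity is set from the affinity hints as in
the earlier sketch):

  iface=eth0
  # rps mask f0: steer receive packet processing to cpus 4-7 on every
  # rx queue
  for rxq in /sys/class/net/$iface/queues/rx-*; do
      echo f0 > $rxq/rps_cpus
  done
  # fq_codel as root qdisc; bql limit_max keeps its default (max)
  tc qdisc add dev $iface root fq_codel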