On Tue, Sep 18, 2012 at 12:55:02PM +0200, Dick Snippe wrote:

FYI:

> For our production platform I will try some experiments with decreased
> txqueuelen, binding (web)server instances to specific cores and boot
> a server with kernel 3.5 + fq_codel to see what works best in practice.

After quite a bit of testing and some real-world experience on our
production platform, it turns out that tweaking Receive Packet Steering
(rps) in combination with fq_codel can have a huge impact.

My theory (based on the results below) is that when a server is sending
out large volumes of data, the return traffic (ACKs) can become so
heavy (150,000 packets/second) that the driver switches to polling.
When rps is not active, this polling apparently causes a drop in
throughput, and all tx queues fill up, resulting in much higher latency.
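
(As a rough sanity check, assuming a 1500-byte MTU and delayed ACKs,
i.e. one ACK per two full-sized segments: 150,000 ACKs/second
corresponds to roughly 150000 * 2 * 1500 * 8 = 3.6 Gbit/s of outbound
data, so an ACK rate of that order is plausible once a server is
pushing multiple gigabits.)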

Below are our results.

Our test setup consisted of 4 servers (IBM HS22 blades, 96 Gbyte RAM,
2x quad-core Westmere E5620, dual 82599EB 10-Gigabit NICs, running a
vanilla kernel.org 3.5.4 kernel with the stock ixgbe 3.9.15-k driver).

host1: runs the dltest web server, serving a 100 Mbyte test file
host2+3: act as clients running the ab test program:
        ab -n 100000 -c 500 http://dltest.omroep.nl/100m
host4: "observation server", measuring ping latency to host1:
        sudo ping -c 1000 -i 0.001 -q dltest.omroep.nl

All 4 servers are directly connected through a Cisco Nexus 4001I
10Gb switch. The test servers are in the same blade enclosure and
the test traffic never leaves the enclosure.

With default settings and a small number of flows (ab -c 10)
we can obtain line speed easily and ping latency is low (<1 ms).
The same goes for iperf.
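
For reference, a comparable low-flow baseline can be reproduced with
something like the following (the exact flags here are an assumption,
not our recorded invocation):

        # 10 parallel TCP streams for 30 seconds, run on a client
        iperf -c dltest.omroep.nl -P 10 -t 30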

However, with a larger number of flows (both clients doing ab -c 500,
i.e. a total of 1000 concurrent flows) throughput on the web server
drops dramatically to 1-2 Gbit/s and ping latency rises to ~100 ms.

We tested a number of different combinations:
fq_codel: on/off
smp_affinity: hint/000f/00ff/0f0f/ffff
bql: 100000/1000000/max
rps: 0000/000f/00f0/00ff/ffff/hint

(hint in smp_affinity means reading /proc/irq/X/affinity_hint and
writing it back to /proc/irq/X/smp_affinity; this basically binds
queue1 to cpu1, queue2 to cpu2, etc. hint in rps means reading the smp
affinity hint for all tx queues and using that to set the rps mask for
the matching cpu. A sketch of the affinity_hint loop follows below.)
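
A minimal sketch of that hint loop, assuming the NIC's per-queue IRQs
show up as ethX-... lines in /proc/interrupts (run as root):

        # copy the driver's suggested CPU mask into each queue IRQ
        for irq in $(awk '/ethX/ {sub(":","",$1); print $1}' /proc/interrupts); do
                cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
        done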

Relevant commands:
fq_codel: sudo tc qdisc add dev ethX root fq_codel
aff: echo $val > /proc/irq/X/smp_affinity
bql: echo $val > /sys/class/net/ethX/queues/tx-Y/byte_queue_limits/limit_max
rps: echo $val > /sys/class/net/ethX/queues/rx-Y/rps_cpus
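
(The affinity and rps values are hexadecimal CPU bitmasks: bit n
selects cpuN, so 000f means cpu0-cpu3, 00f0 means cpu4-cpu7 and ffff
means all 16 logical CPUs on these machines.)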

A test consisted of running ab while measuring ping latency (ping -c
1000 -i 0.001 -q, returns min/avg/max/mdev rtt) at the observation
server and bandwidth (sar -n DEV 1 111111, returns kbyte/sec) on the
web server.

During testing, not all results were 100% reproducible. This probably
has to do with NAPI: sometimes the driver switches to polling mode,
sometimes it doesn't. When the driver switches to polling mode (visible
as a decrease in interrupts/sec on the rx queues) throughput can drop
dramatically when rps is not active. Unfortunately I was not able to
force the driver into polling or interrupt mode during testing, so the
results marked with [*] should be taken with a grain of salt: the
driver might flip into polling mode, where throughput is significantly
lower.
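
To spot the switch to polling, watching the per-queue interrupt
counters on the web server is enough; something like:

        # -d highlights counters that changed since the last sample
        watch -d -n1 'grep ethX /proc/interrupts'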

The results are as follows:

codel   bql     aff     rps     b/w[kbyte/sec]  latency[ms] (min/avg/max/mdev)

off     max     hint    0000     147849.07      0.098/90.264/362.198/104.473
off     max     000f    0000    1201655.30      23.004/30.942/41.554/4.874
off     1000000 000f    0000    1201604.00      11.304/13.983/16.675/0.896
off     100000  000f    0000     809498.39      17.187/20.158/22.572/0.845
off     3028    000f    0000      81827.24      81.256/86.261/90.519/1.812

on      max     hint    0000     299866.93      0.175/2.812/16.654/3.192
on      max     0001    0000    1160509.10      0.047/0.736/1.625/0.279
on      max     0003    0000    1199052.43      0.045/1.807/24.913/3.423
on      max     0007    0000    1201782.16      0.038/1.578/21.759/2.898
on      max     000f    0000    1201782.16      0.172/0.365/0.796/0.108
on      max     00ff    0000    1201782.16      0.180/1.019/16.321/2.351
on      max     0f0f    0000    1201782.16      0.177/0.925/12.550/2.029
on      max     ffff    0000    1201782.16      0.182/1.303/15.290/2.551
on      100000  000f    0000     823541.00      0.042/0.230/4.134/0.158
on      1000000 000f    0000    1201655.30[*]   0.178/0.365/0.808/0.110

on      max     hint    000f    1201540.48      0.221/1.550/20.786/2.702
on      max     hint    00f0    1201540.48      0.241/0.780/1.260/0.264
on      max     hint    00ff     954511.04      0.195/2.364/27.699/4.041
on      max     hint    ffff     313311.60      0.196/6.522/51.672/8.782

on      max     fff0    000f    1201540.48      0.195/2.364/27.699/4.041
on      max     000f    000f    1201540.48      0.212/1.317/15.166/2.203

off     max     hint    000f    1201540.48      67.948/90.301/111.427/9.628
off     max     hint    hint     368576.08      0.193/18.972/94.248/17.813

Our conclusion is that fq_codel is needed to keep latency low, and
rps=f0 is needed to prevent the bandwidth from dropping significantly.
smp_affinity can be set to "hint" in order to get optimal performance
when in interrupt mode. Since with these settings the tx queues
apparently never build up, bql can be kept at its default (max).

Our current production settings are therefore (a sketch for applying
them follows the list):
fq_codel: on
smp_affinity: hint
bql: max
rps: f0 (4 cores are powerful enough to drive 10G traffic)
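
Put together, the settings above amount to something like this at boot
(a sketch, assuming ethX is the 10G interface and the IRQ naming from
the hint loop earlier; run as root):

        # fq_codel on the egress qdisc to keep latency down
        tc qdisc add dev ethX root fq_codel
        # smp_affinity: bind each queue IRQ to the CPU the driver suggests
        for irq in $(awk '/ethX/ {sub(":","",$1); print $1}' /proc/interrupts); do
                cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
        done
        # rps: steer receive processing (mostly ACKs) to cpu4-cpu7
        for rxq in /sys/class/net/ethX/queues/rx-*; do
                echo f0 > $rxq/rps_cpus
        done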

With these settings we've seen sustained real-world traffic of
>6 Gbit/s per server, with latency <10 ms.

All in all I'm really impressed by the quality of the hardware and the
ixgbe driver. After some initial tweaking we can obtain line speed and
have low latency using only moderate amounts of CPU.
So, kudos to the kernel developers. Great work guys!

-- 
Dick Snippe, internetbeheerder     \ fight war
beh...@omroep.nl, +31 35 677 3555   \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A
