On Thu, Oct 04, 2012 at 04:04:51PM +0200, Eric Dumazet wrote:

> On Thu, 2012-10-04 at 15:40 +0200, Dick Snippe wrote:
> > On Tue, Sep 18, 2012 at 12:55:02PM +0200, Dick Snippe wrote:
> > 
> > FYI:
> > 
> > > For our production platform I will try some experiments with decreased
> > > txqueuelen, binding (web)server instances to specific cores, and booting
> > > a server with kernel 3.5 + fq_codel to see what works best in practice.
> > 
> > After quite a bit of testing + some real-world experience on our
> > production platform it turns out that tweaking Receive Packet Steering
> > (rps) in combination with fq_codel can have a huge impact.
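> > 
> > For reference, the kind of knobs involved look roughly like this (the
> > device name and CPU mask are just placeholders; the right mask depends
> > on which cores should handle rx for each queue):
> > 
> >     # spread rx protocol processing of eth0's queues over CPUs 0-7 (mask 0xff)
> >     for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
> >         echo ff > $q
> >     done
> >     # simplest form: replace the default root qdisc with fq_codel
> >     tc qdisc replace dev eth0 root fq_codel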
> > 
> > My theory (based on the results below) is that when a server is sending
> > out large volumes of data, the return traffic (ACKs) can become so
> > large (~150,000 packets/second) that the driver switches to polling.
> > When RPS is not active, this polling apparently causes a drop in
> > throughput, and all tx queues fill up, resulting in much larger latency.
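> > 
> > That theory can be checked without special tooling, roughly like this:
> > 
> >     # watch the backlog on the tx qdisc(s) grow
> >     watch -n1 tc -s qdisc show dev eth0
> >     # 3rd column per CPU is time_squeeze: NAPI poll ran out of budget
> >     cat /proc/net/softnet_stat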
> > 
> > Below are our results.
> > 
> > Our test setup consisted of 4 servers (IBM HS22 blades, 96 GByte RAM,
> > 2x quad-core Westmere E5620, dual 82599EB 10-Gigabit NICs, running a
> > vanilla kernel.org 3.5.4 kernel with the stock ixgbe 3.9.15-k driver):
> > 
> > host1: runs the dltest webserver, serving a 100 MByte test file
> > host2+3: act as clients, using the ab test program:
> >     ab -n 100000 -c 500 http://dltest.omroep.nl/100m
> > host4: "observation server", measuring ping latency to host1:
> >     sudo ping -c 1000 -i 0.001 -q dltest.omroep.nl
> > 
> > All 4 servers are directly connected through a Cisco Nexus 4001I
> > 10GB Switch.  The test servers are in the same blade enclosure and
> > the test traffic never leaves the blade enclosure.
> > 
> > With default settings and a small number of flows (ab -c 10)
> > we can easily obtain line speed and ping latency is low (<1 ms).
> > The same goes for iperf.
> > 
> > However, with a larger number of flows (both clients doing ab -c 500,
> > i.e. a total of 1000 concurrent flows), throughput on the webserver
> > drops dramatically to 1-2 Gbit/s and ping latency rises to ~100 ms.
> 
> I wonder if receivers are using GRO?

yes:
$ sudo ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
ntuple-filters: off
receive-hashing: on

> If yes, the number of ACKs they are sending back should be limited to one
> ACK per GRO packet, instead of one ACK every 2 MSS.

Using sar -n DEV I saw roughly 150,000 rxpck/s on the sending webserver,
with roughly 9,000 rxkB/s, i.e. ~60 bytes/packet. I assume that these are
pure ACKs, but I don't know exactly what sar counts and how that relates
to GRO.
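
If it helps, a rough way to confirm these really are header-only ACKs would
be to count small inbound segments towards the web port (eth0 and port 80 as
in the test above; the 80-byte cutoff is just a generous bound for an ACK
with TCP options):

    # time how long it takes to see 100k small frames headed for port 80
    time sudo tcpdump -ni eth0 -c 100000 'tcp dst port 80 and less 80' > /dev/null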

> Also, I was considering adding GRO support for pure TCP ACKs, at least
> for local traffic (not forwarding workloads).
> 
> It would be nice if you could post a "perf top" output of the sender,
> because dropping to 1-2Gbit sounds really really bad...

I'll have to look into that, as we don't usually build perf with our
kernels, and "make" in the tools/perf directory fails, probably because
our version of bison is too old.
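
(If I get to it, the plan would be something like installing newer bison/flex
plus the usual perf build deps and rebuilding just the tool; package names
below are a guess for a Red Hat-style box:

    sudo yum install bison flex elfutils-libelf-devel elfutils-devel
    make -C tools/perf
    sudo ./tools/perf/perf top

and then grabbing "perf top" output while the ab test is running.)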

-- 
Dick Snippe, internetbeheerder     \ fight war
beh...@omroep.nl, +31 35 677 3555   \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A
