Hello,
We've recently upgraded some hosted (physical) servers from 100Mbps links
to 1Gbps links. For the sake of simplicity, I'll say there are two servers
in Los Angeles and two in London.
Before the upgrade we could get ~96Mbps between all locations over a single
TCP stream. We'd reach that speed pretty much straight away (slow start
completed within a second or two). We're using TCP cubic with 8MB max
send/recv windows. This was true even for the London <-> LA links.
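For anyone reproducing the setup, the rough sketch below just prints the sysctls the test relies on. The /proc paths are the standard Linux ones; the values noted in the comments are what we expect to see, not measurements.

    # Rough sketch: dump the TCP settings our test depends on.
    SYSCTLS = [
        "net/ipv4/tcp_congestion_control",  # expect "cubic"
        "net/ipv4/tcp_rmem",                # min/default/max recv buffer (max 8388608)
        "net/ipv4/tcp_wmem",                # min/default/max send buffer (max 8388608)
        "net/core/rmem_max",
        "net/core/wmem_max",
        "net/ipv4/tcp_window_scaling",      # must be 1 for windows over 64KB
    ]
    for name in SYSCTLS:
        with open("/proc/sys/" + name) as f:
            print("%-35s %s" % (name.replace("/", "."), f.read().strip()))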
After the upgrade we can get ~1Gbps between local servers with the same
test case. However, over the WAN (~100ms RTT) we're now struggling to get
30-40Mbps over TCP. Throughput will occasionally reach 90Mbps, but is
unstable and soon drops down again. The ramp up to 90Mbps (when it does
happen) takes around 60 seconds. There is no problem with UDP traffic - we
can hit rates well over 100Mbps with almost no loss. To be clear, I'm not
expecting to hit 1Gbps between London and LA with 8MB TCP buffers - but I
am expecting to hit at least ~96Mbps like I could when the servers were
connected at 100Mbps.
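For completeness, here's the back-of-the-envelope window arithmetic behind that expectation (nothing measured, just the usual bandwidth-delay-product sums with our assumed ~100ms RTT and 8MB window), which is why I don't think the buffers are the limit:

    # Bandwidth-delay product sanity check; RTT and window are our assumed figures.
    rtt = 0.100                    # seconds, London <-> LA
    window = 8 * 1024 * 1024       # bytes, configured max send/recv window

    ceiling = window * 8 / rtt     # bits/sec one stream can keep in flight
    print("8MB window supports up to ~%.0f Mbps" % (ceiling / 1e6))      # ~671 Mbps

    needed = 96e6 * rtt / 8        # bytes of window needed to sustain 96 Mbps
    print("96 Mbps at 100ms needs only ~%.1f MB of window" % (needed / 2.0**20))  # ~1.1 MB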
As soon as we downgrade the sender's port speed to 100Mbps, we're back up
to full speed (~96Mbps) immediately with TCP. The network operator assures
me there's no QoS policy or traffic policing on their kit, and that if
there were, it would also affect traffic between adjacent nodes.
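My working theory on why the sender's port speed matters (purely back-of-the-envelope, not confirmed on their kit): at 1Gbps the NIC sends each flight of segments roughly ten times faster, so any shallow buffer in front of a slower or contended hop has to absorb the excess, whereas at 100Mbps the sender is effectively self-paced. Roughly:

    # Assumed figures for illustration only: 1460-byte segments, ~256-packet burst.
    mss = 1460
    burst = 256 * mss                  # ~365 KB sent back-to-back at line rate
    for link in (1e9, 100e6):          # sender port at 1 Gbps vs 100 Mbps
        t = burst * 8 / link
        print("%4d Mbps sender: %d KB burst arrives in %.1f ms"
              % (link / 1e6, burst // 1024, t * 1000))
    # If a downstream hop drains at ~100 Mbps, the 1 Gbps burst leaves ~90% of
    # itself sitting in that hop's queue; small buffers turn that into tail drops.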
We're using Xeon E3 and X56xx servers, with 82574L NICs. They're running
CentOS 6.4 (64-bit).
I've tried the following (all unsuccessfully):
- Disabling/enabling TOE features
- Applying the EEPROM patch for losses when entering power saving mode
- Upgrading from the stock 2.1.4 driver to the 2.3.2 driver
- Upgrading the Kernel to 3.9.3
- Many other things (txqueuelen, TX ring sizes, increasing/reducing the TCP
window maxima, etc.); a sketch for snapshotting these settings follows this list
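If it helps anyone compare boxes, this is the rough snapshot script I've been using to diff the relevant NIC/stack settings between the LA and London machines. "eth0" is an assumption; substitute whichever interface is actually carrying the traffic.

    import subprocess

    IFACE = "eth0"   # assumption; use the interface under test
    for cmd in (
        ["ethtool", "-k", IFACE],                              # offload features
        ["ethtool", "-g", IFACE],                              # RX/TX ring sizes
        ["cat", "/sys/class/net/%s/tx_queue_len" % IFACE],     # txqueuelen
        ["sysctl", "net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"],  # window maxima
    ):
        print("$ " + " ".join(cmd))
        subprocess.call(cmd)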
Packet captures show that losses are clearly occurring, which is preventing
TCP from ramping up properly. The graph at
http://www.imagebam.com/image/1f8443255679356 shows the sender side traffic
profile - you can see the bursty nature of the losses and its effect on
TCP. It _looks_ like buffer/QoS behaviour, but I'm not familiar enough with
Cisco switching/routing kit to ask the hosting provider sensible questions
about this.
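To put a number on how little loss it takes to do this: the Mathis et al. approximation for Reno-style TCP (cubic recovers a bit faster, so treat this as a rough, pessimistic estimate; the MSS and RTT below are assumptions) suggests a loss rate well under 0.01% is enough to pin a single 100ms-RTT stream in the tens of Mbps:

    # Mathis approximation: rate ~= (MSS / RTT) * (C / sqrt(p)), C ~= 1.22 for Reno.
    mss = 1460       # bytes, assumed
    rtt = 0.100      # seconds, assumed
    C = 1.22

    def loss_for_rate(mbps):
        """Loss probability p that caps a single stream at the given rate."""
        rate = mbps * 1e6 / 8                  # bytes per second
        return (C * mss / (rtt * rate)) ** 2

    for mbps in (35, 96):
        print("%3d Mbps ceiling implies p ~= %.4f%%" % (mbps, 100 * loss_for_rate(mbps)))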
To be clear, this doesn't just affect this one hosting provider - it seems
to be common to all of our boxes. The issue only occurs when the sender is
connected at 1Gbps, the RTT is reasonably high (> ~60ms), and we use TCP.
By posting here I'm certainly not trying to suggest that the e1000e driver
is at fault... I'm just running out of ideas and could really use some
expert suggestions on where to look next!
Thanks in advance,
Sam