On Fri, Sep 14, 2012 at 10:40:53AM -0700, Alexander Duyck wrote:
> On 09/14/2012 05:22 AM, Dick Snippe wrote:
> > On Wed, Sep 12, 2012 at 05:10:44PM -0700, Jesse Brandeburg wrote:
> >
> >> On Wed, 12 Sep 2012 22:47:55 +0200
> >> Dick Snippe <dick.sni...@tech.omroep.nl> wrote:
> >>
> >>> On Wed, Sep 12, 2012 at 04:05:02PM +0000, Brandeburg, Jesse wrote:
> >>>
> >>>> Hi Dick, we need to know exactly what you are expecting to happen
> >>>> here.
> >>> I'm surprised by the large increase in latency (from <1ms to >100ms).
> >>> In our production environment we see this phenomenon even on "moderate"
> >>> load, transmitting 1.5-2Gbit.
> >> I believe you could be (I'd equivocate more if I could) seeing a bit
> >> of the "bufferbloat" effect from the large queues available by
> >> default on the 10G interface.
> >>
> >> Can you try running with smaller transmit descriptor rings?
> >> ethtool -G ethx tx 128
> > Not much difference:
> > |1000 packets transmitted, 1000 received, 0% packet loss, time 7386ms
> > |rtt min/avg/max/mdev = 48.522/76.642/93.488/6.404 ms, pipe 14
> > |Transfer rate: 168162.02 [Kbytes/sec] received
> >
> > However, if I retry with "ifconfig ethx txqueuelen 10" latency
> > (not throughput) looks better:
> > |1000 packets transmitted, 987 received, 1% packet loss, time 5905ms
> > |rtt min/avg/max/mdev = 0.443/17.018/42.106/8.075 ms, pipe 7
> > |Transfer rate: 132776.78 [Kbytes/sec] received
>
> That is to be expected. The txqueuelen would have a much larger impact
> than the Tx ring size since the qdisc can hold significantly more
> packets. You may want to look into enabling Byte Queue Limits (BQL) for
> control over the amount of data that is allowed to sit pending on the
> ring. That, in conjunction with a small txqueuelen, should help to
> reduce overall latency.

I was just looking into BQL; if I understand correctly, activating BQL
means writing a max value to
/sys/class/net/ethx/queues/tx-*/byte_queue_limits/limit_max. Am I right?

I can get some good results by tweaking both ifconfig txqueuelen and
byte_queue_limits/limit_max to rather extreme (small) values. With
txqueuelen 10 and limit_max=1024 I get 0.2 msec ping latency and almost
9 Gbit network throughput.
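For the record, this is roughly what I did to get those numbers (eth1 is
the 10G interface on this box; the byte_queue_limits entries are only
present when the running kernel has BQL support):

$ sudo ifconfig eth1 txqueuelen 10
$ echo 1024 | sudo tee /sys/class/net/eth1/queues/tx-*/byte_queue_limits/limit_max

i.e. keep the qdisc queue very short and cap the amount of data each Tx
queue may have pending on the ring at 1024 bytes.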
However, I have no idea what is going to happen when these settings are
applied to real-world conditions where we want high throughput for
internet-facing traffic and low latency for internal traffic (notably
memcached and NFS). Still, this looks promising, because bandwidth is
almost a factor of 10 better and latency almost a factor of 1000 better!

On this particular hardware we've got 2x 10G + 2x 1G NICs. Currently in
our production environment the 1G NICs aren't being used and all traffic
(both high-volume web serving traffic to the internet and internal NFS
and memcached traffic) is done over the 10G NICs (active/passive bond
with 802.1q VLANs). I could separate the flows: high-volume internet
traffic over a 10G bond and low-latency internal traffic over a 1G bond.
That would probably work for now, but it costs us an extra pair of 1G
switches and NFS traffic would be limited to 1G. Maybe I should look into
Transmit Packet Steering (XPS) to do the separation in software: 15
queues for volume output, 1 queue for low-latency traffic; however, I
haven't yet found out how to direct traffic to the right queue.

> Are you running a mixed 10G/1G network? If so, do you know what kind of
> buffering may be going on in your switch between the two fabrics? Are
> your tests being run between two 10G systems, or are you crossing over
> between 10G and 1G to conduct some of your tests? The reason I ask is
> that I have seen similar issues in the past when 1Gbps and 100Mbps were
> combined, resulting in latency increasing significantly any time one of
> the 100Mbps links was saturated.

All my 10G testing so far has been on strictly 10G networks, basically
between 2 servers in a blade enclosure with only one 10G switch between
them, i.e. the traffic doesn't even leave the blade enclosure.

> One interesting data point might be to test the latency with the 1G
> ports and 10G ports isolated from each other to see if this may be an
> issue of buffer bloat between the two traffic rates introducing a delay.

There is no mixing between 10G and 1G taking place. The 1G tests were
done on 1G NICs connected by a 1G switch. In both cases the setup was
NIC1 <-> switch <-> NIC2.

> >>>> If that helps then we know that we need to pursue ways to get
> >>>> your high priority traffic onto its own queue, which btw is why the
> >>>> single thread iperf works. Ping goes to a different queue (by luck)
> >>>> and gets out sooner due to not being behind other traffic.
> >>> Interestingly, multi-threaded iperf (iperf -P 50) manages to do +/-
> >>> 7.5Gbit while ping latency is still around 0.1 - 0.3 ms.
> >> That's only interesting if you're using all 16 queues; were you?
> > I'm not sure. How can I check how many queues I'm using?
>
> You can verify how many queues you are using by viewing ethtool -S
> results for the interface while passing traffic. Any of the Tx queues
> that have incrementing packet counts are in use.

Yes, thanks. It turns out that with iperf -P 50 not all queues are being
used, so it makes sense that ping latency stays low in that case.
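For the record, I watched the per-queue counters with something along
these lines while iperf was running (eth1 is the interface under test;
the tx_queue_*_packets stat names are what this ixgbe version exposes,
other drivers may name them differently):

$ watch -d -n 1 "ethtool -S eth1 | grep -E 'tx_queue_[0-9]+_packets'"

watch -d highlights the counters that changed since the previous refresh,
so the queues actually carrying traffic stand out.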
> >> Lastly, I'm headed out on vacation tonight and won't be available
> >> for a while. I hope that someone else on my team will continue to
> >> work with you to debug what is going on.
> > Have a nice vacation!
> > If someone else could help me with this issue, that would be great.
>
> As you can probably tell from the fact that I am replying, I will step
> in while Jesse is out to help you resolve this.

This is much appreciated!

> >> Maybe someone here can reproduce the issue and we will make much more
> >> progress. Any testing details like kernel version, driver version,
> >> etc. will be helpful.
> > $ uname -r
> > 3.5.3-2POi-x86_64 (we compile our own kernels, this is a vanilla
> > kernel.org kernel; /proc/config.gz attached)
> > $ sudo ethtool -i eth1
> > driver: ixgbe
> > version: 3.9.15-k
> > firmware-version: 0x613e0001
> > bus-info: 0000:15:00.1
> So the driver is just the stock ixgbe driver included with the 3.5.3
> kernel then?

Correct.

> If so, that makes it a bit easier to debug since we know exactly what
> code we are working with if this does turn out to be a driver issue.

If needed I can test other kernel/driver/whatever versions if that makes
debugging easier for you. If I understand correctly, so far the driver is
operating as intended; it's just that my assumptions (low latency + high
throughput, a.k.a. "have your cake and eat it too") are overly
optimistic(?)

-- 
Dick Snippe, internetbeheerder                    \ fight war
beh...@omroep.nl, +31 35 677 3555                  \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A