On Fri, Sep 14, 2012 at 01:18:05PM -0700, Alexander Duyck wrote:

> On 09/14/2012 11:40 AM, Dick Snippe wrote:
> > On Fri, Sep 14, 2012 at 10:40:53AM -0700, Alexander Duyck wrote:
> >
> >> On 09/14/2012 05:22 AM, Dick Snippe wrote:
> >>> On Wed, Sep 12, 2012 at 05:10:44PM -0700, Jesse Brandeburg wrote:
> >>>
> >>>> On Wed, 12 Sep 2012 22:47:55 +0200
> >>>> Dick Snippe <dick.sni...@tech.omroep.nl> wrote:
> >>>>
> >>>>> On Wed, Sep 12, 2012 at 04:05:02PM +0000, Brandeburg, Jesse wrote:
> >>>>>
> >>>>>> Hi Dick, we need to know exactly what you are expecting to happen
> >>>>>> here.
> >>>>> I'm surprised by the large increase in latency (from <1ms to >100ms).
> >>>>> In our production environment we see this phenomenon even on "moderate"
> >>>>> load, transmitting 1.5-2Gbit.
> >>>> I believe maybe you could be (I'd equivocate more if I could) seeing a
> >>>> bit of the "bufferbloat" effect maybe from the large queues available by
> >>>> default on the 10G interface.
> >>>>
> >>>> can you try running with smaller transmit descriptor rings?
> >>>> ethtool -G ethx tx 128
> >>> Not much difference:
> >>> |1000 packets transmitted, 1000 received, 0% packet loss, time 7386ms
> >>> |rtt min/avg/max/mdev = 48.522/76.642/93.488/6.404 ms, pipe 14
> >>> |Transfer rate:          168162.02 [Kbytes/sec] received
> >>>
> >>> However, if I retry with "ifconfig ethx txqueuelen 10" latency
> >>> (not throughput) looks better:
> >>> |1000 packets transmitted, 987 received, 1% packet loss, time 5905ms
> >>> |rtt min/avg/max/mdev = 0.443/17.018/42.106/8.075 ms, pipe 7
> >>> |Transfer rate:          132776.78 [Kbytes/sec] received
> >> That is to be expected.  The txqueuelen would have a much larger impact
> >> than the Tx ring size since the qdisc can hold significantly more
> >> packets.  You may want to look into enabling Byte Queue Limits (BQL) for
> >> control over the amount of data that is allowed to sit pending on the
> >> ring.  That in conjunction with a small txqueuelen should help to reduce
> >> overall latency.
> > I was just looking into BQL; if I understand correctly, activating BQL
> > means writing a max value to
> > /sys/class/net/ethx/queues/tx-*/byte_queue_limits/limit_max
> > Am I right?
> >
> > I can get some good results by tweaking both ifconfig txqueuelen and
> > byte_queue_limits/limit_max to rather extreme (small) values. With
> > txqueuelen 10 and limit_max=1024 I get 0.2msec ping latency and almost 9Gbit
> > network throughput. However, I have no idea what is going to happen when
> > these settings are applied to real-world conditions where we want high
> > throughput for internet-facing traffic and low latency for internal
> > traffic (notably memcache and NFS).
> >
> > However, this looks promising, because bandwidth is almost a factor 10
> > better and latency almost a factor 1000 better!
> 
> Based on everything we have figured out so far it looks like what is
> probably happening is that your traffic is being placed in a round robin
> fashion on the various Tx queues.  I am assuming what is probably going
> on is that your HTTP server that you had in the trace is probably
> running multi-threaded and is probably placing packets on all of the queues.

Yes. The http server is multi-threaded and is placing packets on all
queues.
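
For what it's worth, a quick way to confirm this is to watch the per-queue
counters; this assumes ixgbe exposes them in "ethtool -S" as
tx_queue_N_packets / tx_queue_N_bytes:

  # show how transmitted packets are spread over the Tx queues
  ethtool -S ethx | grep -E 'tx_queue_[0-9]+_(packets|bytes)'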

> You could probably set the limit_max to a significantly larger value
> such as 64K and you would probably see the same level of performance. 

Well, no. Not really. With txqueuelen=10 and limit_max=65536 I get
~1.2Gbit network throughput and ~10msec ping latency.
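
For reference, the exact knobs I'm twiddling boil down to something like
this (a minimal sketch; "ethx" stands for the real interface and every Tx
queue gets the same BQL cap):

  # shrink the qdisc queue to 10 packets
  ifconfig ethx txqueuelen 10
  # cap the bytes BQL lets sit pending on each Tx ring
  for q in /sys/class/net/ethx/queues/tx-*; do
      echo 65536 > $q/byte_queue_limits/limit_max
  done
  # "inflight" shows how many bytes are currently pending per ring
  cat /sys/class/net/ethx/queues/tx-*/byte_queue_limits/inflight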

> What I suspect is happening is that the Tx descriptor rings and Tx
> Qdiscs are being loaded up with TSO requests.  Each request only counts
> as one packet, but is 64K in size.  So if the default txqueuelen is 1000
> that means your ping is getting stuck behind 64K bytes per packet x 1000
> packets per queue x 16 queues.  The net result of all that is a
> theoretical maximum delay of up to 800ms if all of the queues were actually
> filled.

That sounds plausible.
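
As a back-of-the-envelope check (assuming 64K TSO sends, 1000 packets per
qdisc, 16 queues and a 10Gbit/s line rate):

  # bytes queued divided by bytes per second = worst-case drain time
  echo "65536 * 1000 * 16 / (10 * 10^9 / 8)" | bc -l
  # -> ~0.84 seconds, i.e. roughly the 800ms mentioned above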

> > On this particular hardware we've got 2x 10G + 2x 1G nics.
> > Currently in our production environment the 1G nics aren't being used
> > and all traffic (both volume webserving traffic to the internet
> > and internal NFS and memcached traffic) is done over the 10G
> > nics. (active/passive bond with 802.1q vlans) I could separate the
> > flows; do high volume internet traffic over a 10G bond and low latency
> > internal traffic over a 1G bond. That would probably work for now,
> > but costs us an extra pair of 1G switches and NFS traffic would be
> > limited to 1G.
> > Maybe I should look into Transmit Packet Steering (XPS) to do the
> > separation in software; 15 queues for volume output, 1 queue for low
> > latency traffic; however I haven't yet found out how to direct traffic
> > to the right queue.
> 
> I don't think XPS will get you very much.  The ixgbe hardware has a
> built in functionality called ATR which is very similar to XPS.  One
> thing you may want to try would be to actually change the
> configuration on your HTTP server to reduce the number of CPUs in use.
> Currently it seems like you are using all of the CPUs on the system to
> service the HTTP requests.  This is resulting in all of the queues
> performing large sends and building up a significant backlog.  If you
> were to isolate the large send requests from the low latency requests by
> confining HTTP sends to a set of CPUs you might be able to
> significantly improve the transmit performance.

Yes. With txqueuelen=1000, limit_max=max and binding the webserver
_only_ to one core (taskset -p 1 <pid>) I get 10Gbit throughput and
0.259msec ping latency.
Perhaps a more realistic example is to bind the webserver to half the
cores (taskset -p ff <pid>); in that case I get 4Gbit throughput and
still only 0.3msec ping latency.
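
In case it helps anyone else reading along, the binding boils down to
something like this ("httpd" is just a stand-in for whatever the webserver
worker processes are called):

  # pin every webserver worker to CPUs 0-7 (mask 0xff), leaving the
  # other cores free to handle low latency traffic
  for pid in $(pgrep httpd); do
      taskset -p ff $pid
  done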

Incidentally, I also tried fq_codel. Without any other tuning
this does a good job of reducing latency (~5 msec) with reasonable
bandwidth (~2Gbit).
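
For the record, that test was just a matter of swapping the root qdisc
(assuming an iproute2 new enough to know about fq_codel):

  # replace the default root qdisc with fq_codel
  tc qdisc replace dev ethx root fq_codel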

So, out of all the stuff I tried, binding the webserver to specific cores
appears to work best for my _test_-case.

However, the real world is a bit more complicated, I'm afraid.
Some of our webservers are configured as caching proxies. 99% of the data is
served from cache, and there we can tolerate a bit of latency. For the
remaining 1% the proxy needs to make an outgoing http request to an origin
server, and there high latency is a killer.
Also if storage resides on NFS mounted volumes, the NFS requests will
probably also suffer from high latency.
Lastly, monitoring traffic (the load balancer and Nagios checking whether
playout servers are still alive) will probably also still suffer from high
latency.

Anyhow, I'm happy that we know now what is causing the latency and
how to do something about it.

For our production platform I will try some experiments with a decreased
txqueuelen, binding (web)server instances to specific cores, and booting
a server with kernel 3.5 + fq_codel to see what works best in practice.

Thanks for all your help so far!

-- 
Dick Snippe, internetbeheerder     \ fight war
beh...@omroep.nl, +31 35 677 3555   \ not wars
NPO ICT, Sumatralaan 45, 1217 GP Hilversum, NPO Gebouw A
