On 09/14/2012 11:40 AM, Dick Snippe wrote:
> On Fri, Sep 14, 2012 at 10:40:53AM -0700, Alexander Duyck wrote:
>
>> On 09/14/2012 05:22 AM, Dick Snippe wrote:
>>> On Wed, Sep 12, 2012 at 05:10:44PM -0700, Jesse Brandeburg wrote:
>>>
>>>> On Wed, 12 Sep 2012 22:47:55 +0200
>>>> Dick Snippe <dick.sni...@tech.omroep.nl> wrote:
>>>>
>>>>> On Wed, Sep 12, 2012 at 04:05:02PM +0000, Brandeburg, Jesse wrote:
>>>>>
>>>>>> Hi Dick, we need to know exactly what you are expecting to happen
>>>>>> here.
>>>>> I'm surprised by the large increase in latency (from <1ms to >100ms).
>>>>> In our production environment we see this phenomenon even on "moderate"
>>>>> load, transmitting 1.5-2Gbit.
>>>> I believe maybe you could be (I'd equivocate more if I could) seeing a
>>>> bit of the "bufferbloat" effect maybe from the large queues available by
>>>> default on the 10G interface.
>>>>
>>>> can you try running with smaller transmit descriptor rings?
>>>> ethtool -G ethx tx 128
>>> Not much difference:
>>> |1000 packets transmitted, 1000 received, 0% packet loss, time 7386ms
>>> |rtt min/avg/max/mdev = 48.522/76.642/93.488/6.404 ms, pipe 14
>>> |Transfer rate:          168162.02 [Kbytes/sec] received
>>>
>>> However, if I retry with "ifconfig ethx txqueuelen 10" latency
>>> (not throughput) looks better:
>>> |1000 packets transmitted, 987 received, 1% packet loss, time 5905ms
>>> |rtt min/avg/max/mdev = 0.443/17.018/42.106/8.075 ms, pipe 7
>>> |Transfer rate:          132776.78 [Kbytes/sec] received
>> That is to be expected.  The txqueuelen would have a much larger impact
>> than the Tx ring size since the qdisc can hold significantly more
>> packets.  You may want to look into enabling Byte Queue Limits (BQL) for
>> control over the amount of data that is allowed to sit pending on the
>> ring.  That in conjunction with a small txqueuelen should help to reduce
>> overall latency.
> I was just looking into BQL; if I understand correctly, activating BQL
> means writing a max value to
> /sys/class/net/ethx/queues/tx-*/byte_queue_limits/limit_max
> Am I right?
>
> I can get some good results by tweaking both ifconfig txqueuelen and
> byte_queue_limits/limit_max to rather extreme (small) values. With txqueuelen
> 10 and limit_max=1024 I get 0.2msec ping latency and almost 9Gbit
> network throughput. However, I have no idea what is going to happen when
> these settings are applied to real-world conditions where we want high
> throughput for internet-facing traffic and low latency for internal
> traffic (notably memcache and NFS).
>
> However, this looks promising, because bandwidth is almost a factor of 10
> better and latency almost a factor of 1000 better!

Based on everything we have figured out so far, it looks like your
traffic is being placed on the various Tx queues in a round-robin
fashion.  My assumption is that the HTTP server you had in the trace is
running multi-threaded and is placing packets on all of the queues.

You could probably set limit_max to a significantly larger value such
as 64K and still see the same level of performance.  What I suspect is
happening is that the Tx descriptor rings and Tx qdiscs are being
loaded up with TSO requests.  Each request only counts as one packet,
but is 64K in size.  So with the default txqueuelen of 1000, your ping
is getting stuck behind up to 64K bytes per packet x 1000 packets per
queue x 16 queues, or roughly 1GB of queued data.  At 10Gb/s that works
out to a theoretical maximum delay of about 800ms if all of the queues
were actually filled.
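
If you want to experiment with that, something along these lines should
do it (untested; "ethx", the 64K limit, and the txqueuelen value are
just placeholders for your interface and whatever values you settle on):

    # cap the bytes BQL allows to sit pending on each Tx ring (64K here)
    for q in /sys/class/net/ethx/queues/tx-*/byte_queue_limits/limit_max; do
        echo 65536 > "$q"
    done
    # keep the qdisc backlog small as well
    ifconfig ethx txqueuelen 128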

> On this particular hardware we've got 2x 10G + 2x 1G nics.
> Currently in our production environment the 1G nics aren't being used
> and all traffic (both volume webserving traffic to the internet
> and internal NFS and memcached traffic) is done over the 10G
> nics. (active/passive bond with 802.1q vlans) I could separate the
> flows; do high volume internet traffic over a 10G bond and low latency
> internal traffic over a 1G bond. That would probably work for now,
> but costs us an extra pair of 1G switches and NFS traffic would be
> limited to 1G.
> Maybe I should look into Transmit Packet Steering (XPS) to do the
> separation in software; 15 queues for volume output, 1 queue for low
> latency traffic; however I haven't yet found out how to direct traffic
> to the right queue.

I don't think XPS will get you very much.  The ixgbe hardware has
built-in functionality called ATR which is very similar to XPS.  One
thing you may want to try would be to actually change the configuration
on your HTTP server to reduce the number of CPUs in use.  Currently it
seems like you are using all of the CPUs on the system to service the
HTTP requests.  This results in all of the queues performing large
sends and building up a significant backlog.  If you were to isolate
the large send requests from the low latency requests by restricting
the HTTP sends to a subset of the CPUs, you might be able to
significantly improve the transmit performance.
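
As a rough illustration only (the CPU split and the httpd path are made
up for the example, not something I know about your setup), pinning the
web server to a subset of CPUs could look like:

    # run the HTTP server on CPUs 0-11, the idea being that its bulk
    # transmits stay on the Tx queues tied to those CPUs, leaving
    # CPUs 12-15 (and their queues) for the latency-sensitive traffic
    taskset -c 0-11 /usr/sbin/httpd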

>> Are you running a mixed 10G/1G network?  If so do you know what kind of
>> buffering may be going on in your switch between the two fabrics?  Are
>> your tests being run between two 10G systems or are you crossing over
>> between 10G and 1G to conduct some of your test?  The reason why I ask
>> is because I have seen similar issues in the past when 1Gb/s and 100Mb/s
>> were combined, resulting in the latency significantly increasing any time
>> one of the 100Mb/s links was saturated.
> All my 10G testing so far has been on strictly 10G networks.
> Basically between 2 servers in a blade enclosure with only 1 10G switch
> in between them. I.e. the traffic doesn't even leave the blade
> enclosure.
>
>> One interesting data point might be to test the latency with the 1G
>> ports and 10G ports isolated from each other to see if this may be an
>> issue of buffer bloat between the two traffic rates introducing a delay.
> There is no mixing between 10G and 1G taking place. The 1G tests were
> done on 1G nics connected by a 1G switch.
> In both cases the setup was NIC1<->switch<->NIC2

Okay, I just wanted to confirm we have that kind of isolation to rule
out any possible buffer latency.  You may still want to double-check
the switch itself for its internal buffer configuration.

>> If so, that makes it a bit easier to debug since we know
>> exactly what code we are working with if this does turn out to be a
>> driver issue.
> If needed I can test other kernel/driver/whatever versions if that makes
> debugging easier for you.
>
> If I understand correctly, so far the driver is operating as intended;
> it's just that my assumptions (low latency + high throughput, aka "can
> have your cake and eat it too") are overly optimistic (?)

Actually it is possible to have both, at least from the 82599
perspective.  The 82599 supports a feature called DCB which is meant to
allow for traffic prioritization.  However, in order for it to really
work you would need complete end-to-end support, and as such you would
need a DCB-capable switch.
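
If you do end up with DCB-capable switches and want to experiment, one
common way to configure it is lldpad's dcbtool; a minimal sketch,
assuming lldpad is installed, "ethx" is a placeholder, and the switch
side is configured to match (otherwise this won't buy you anything):

    # enable DCB on the interface
    dcbtool sc ethx dcb on
    # enable priority flow control so the latency-sensitive class is
    # not starved behind the bulk traffic
    dcbtool sc ethx pfc e:1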

Thanks,

Alex
