On 09/14/2012 05:22 AM, Dick Snippe wrote:
> On Wed, Sep 12, 2012 at 05:10:44PM -0700, Jesse Brandeburg wrote:
>
>> On Wed, 12 Sep 2012 22:47:55 +0200
>> Dick Snippe <dick.sni...@tech.omroep.nl> wrote:
>>
>>> On Wed, Sep 12, 2012 at 04:05:02PM +0000, Brandeburg, Jesse wrote:
>>>
>>>> Hi Dick, we need to know exactly what you are expecting to happen
>>>> here.
>>> I'm surprised by the large increase in latency (from <1ms to >100ms).
>>> In our production environment we see this phenomenon even on "moderate"
>>> load, transmitting 1.5-2Gbit.
>> I believe maybe you could be (I'd equivocate more if I could) seeing a
>> bit of the "bufferbloat" effect maybe from the large queues available by
>> default on the 10G interface.
>>
>> can you try running with smaller transmit descriptor rings?
>> ethtool -G ethx tx 128
> Not much difference:
> |1000 packets transmitted, 1000 received, 0% packet loss, time 7386ms
> |rtt min/avg/max/mdev = 48.522/76.642/93.488/6.404 ms, pipe 14
> |Transfer rate:          168162.02 [Kbytes/sec] received
>
> However, if I retry with "ifconfig ethx txqueuelen 10" latency
> (not throughput) looks better:
> |1000 packets transmitted, 987 received, 1% packet loss, time 5905ms
> |rtt min/avg/max/mdev = 0.443/17.018/42.106/8.075 ms, pipe 7
> |Transfer rate:          132776.78 [Kbytes/sec] received

That is to be expected.  The txqueuelen has a much larger impact than
the Tx ring size since the qdisc can hold significantly more packets.
You may want to look into enabling Byte Queue Limits (BQL) to control
the amount of data that is allowed to sit pending on the ring.  That,
in conjunction with a small txqueuelen, should help to reduce overall
latency.
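
In case it is useful, here is a rough sketch of how that could be done
from the shell.  BQL is exposed per Tx queue under sysfs; eth1 and a
64KB per-queue cap are just assumptions here, and the right value is
something you would need to tune for your workload:

  # cap the bytes BQL will let sit pending on each Tx ring
  for q in /sys/class/net/eth1/queues/tx-*/byte_queue_limits; do
      echo 65536 > $q/limit_max
  done
  # keep the qdisc shallow as well
  ifconfig eth1 txqueuelen 64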

>> can you separately try running without other offloads?  like:
>> ethtool -K ethx lro off tso off gro off
> Worse:
> |1000 packets transmitted, 1000 received, 0% packet loss, time 7254ms
> |rtt min/avg/max/mdev = 221.309/271.187/320.978/20.035 ms, pipe 44
> |Transfer rate:          117764.62 [Kbytes/sec] received

That result doesn't surprise me too much. Basically it just means we are
filling the rings with packets instead of doing large sends/receives.

>>> This effect on 10G infrastructure appears to be much more pronounced 
>>> compared to 1G. When testing on 1G nics latency also increases, but much
>>> less so; from <1ms to ~10ms. A difference is that the 1G nics are
>>> saturated but the 10G ones are "only" transmitting ~1.5 Gbit.
>> that is a very interesting data point.  Are your 1G nics multi-queue?
> We have both flavours. When testing on older, no-multiqueue nics
> things look good: ( Broadcom Corporation NetXtreme II BCM5708S Gigabit
> Ethernet (rev 12), bnx2 driver)
>
> |--- igor06.omroep.nl ping statistics ---
> |1000 packets transmitted, 1000 received, 0% packet loss, time 998ms
> |rtt min/avg/max/mdev = 0.097/0.144/0.437/0.032 ms
>
> newer multiqueue 1G nics appear to have the same problem:
> (Broadcom Corporation NetXtreme II BCM5709S Gigabit Ethernet (rev 20),
> bnx2 driver)
>
> |--- dltest-intern ping statistics ---
> |1000 packets transmitted, 1000 received, 0% packet loss, time 6503ms
> |rtt min/avg/max/mdev = 82.486/96.982/105.990/5.457 ms, pipe 18

Are you running a mixed 10G/1G network?  If so, do you know what kind
of buffering may be going on in your switch between the two fabrics?
Are your tests being run between two 10G systems, or are you crossing
over between 10G and 1G for some of them?  I ask because I have seen
similar issues in the past when 1Gb/s and 100Mb/s links were combined:
latency increased significantly any time one of the 100Mb/s links was
saturated.

One interesting data point might be to test the latency with the 1G
ports and 10G ports isolated from each other, to see whether bufferbloat
between the two traffic rates is introducing the delay.

>>>> There is a simple test you can do, try to disable TSO using
>>>> ethtool. ethtool -K ethx tso off
>>> I just tried that. The results are very similar.
>> hm, you aren't getting any flow control in your network are you?  (see
>> ethtool -S ethx)
> no, I don't think so:
> $ sudo ethtool -S eth1|grep flow_control
>      tx_flow_control_xon: 0
>      rx_flow_control_xon: 0
>      tx_flow_control_xoff: 0
>      rx_flow_control_xoff: 0

This tells us we shouldn't be doing any buffering in the adapter Tx
FIFO, so odds are any impact from reducing queue lengths and the like
will be minimal.
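
If you want to double check that from the link side, ethtool can also
show whether pause frames are being negotiated at all, and can force
them off for the duration of a test (assuming eth1 again):

  ethtool -a eth1                  # show current pause/flow control state
  ethtool -A eth1 rx off tx off    # disable flow control while testing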

>> and take a look at the other stats while you are there.
> Output is attached, however it looks pretty unsuspicious to me.

I agree there isn't much there.  The only thing I noticed is that
tx_restart_queue is a few thousand; however, that is fairly common for
heavy TSO traffic since the descriptor queue fills and it takes us a
little while to flush it all out onto the wire.
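
If you want to see whether those restarts line up with the latency
spikes, it is cheap to just watch the counters while a test is running,
something like:

  watch -n 1 'ethtool -S eth1 | grep -E "restart|busy"'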

>> it also might be interesting to sniff the ethx interface to see
>> the outbound traffic patterns and delays between ping request/response.
>>
>> start your test
>> start the ping
>> tcpdump -i ethx -s 128 -w snippetx.cap -c 1000
>> bzip2 snippetx.cap
>> <put on pastebin or some other web site and email us link>
> http://download.omroep.nl/gurus/dick/ixgbe/snippet1.cap.gz 
>
> explanation:
> morsa01: host running the webserver, although the actual webserver 
>       uses a different ip address: dltest.omroep.nl
> morsa02: host running ab ("the client")
> morsa03: host running ping to dltest.omroep.nl
>
> Longest ping rtt's in this dump appear to be icmp seq 201, 173 and 181
> each ~120ms:
> 13:38:45.302927 IP morsa03.omroep.nl > dltest1afp.omroep.nl: ICMP echo 
> request, id 55334, seq 173, length 64
> 13:38:45.367699 IP morsa03.omroep.nl > dltest1afp.omroep.nl: ICMP echo 
> request, id 55334, seq 181, length 64
> 13:38:45.424599 IP dltest1afp.omroep.nl > morsa03.omroep.nl: ICMP echo reply, 
> id 55334, seq 173, length 64
> 13:38:45.488640 IP dltest1afp.omroep.nl > morsa03.omroep.nl: ICMP echo reply, 
> id 55334, seq 181, length 64
> 13:38:45.516720 IP morsa03.omroep.nl > dltest1afp.omroep.nl: ICMP echo 
> request, id 55334, seq 201, length 64
> 13:38:45.639179 IP dltest1afp.omroep.nl > morsa03.omroep.nl: ICMP echo reply, 
> id 55334, seq 201, length 64

I looked over the trace and there isn't really much there to examine.

>>>> If that helps then we know that we need to pursue ways to get
>>>> your high priority traffic onto its own queue, which btw is why the
>>>> single thread iperf works. Ping goes to a different queue (by luck)
>>>> and gets out sooner due to not being behind other traffic
>>> Interestingly multi threaded iperf (iperf -P 50) manages to do +/-
>>> 7.5Gbit while ping latency is still around 0.1 - 0.3 ms.
>> That's only interesting if you're using all 16 queues, were you?
> I'm not sure. How can I check how many queues I'm using?

You can verify how many queues you are using by viewing ethtool -S
results for the interface while passing traffic.  Any of the Tx queues
that have incrementing packet counts are in use.
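
For example, something along these lines should make it obvious which
queues are carrying traffic (assuming eth1; ixgbe reports the per-queue
counters as tx_queue_N_packets):

  watch -d 'ethtool -S eth1 | grep "tx_queue_.*_packets"'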

>> There are some games here with the scheduler and NIC irq affinity as
>> well that might be impacting us.  Can you please make sure you killall
>> irqbalance, and run set_irq_affinity.sh ethx ethy.  The goal here is to
>> start eliminating latency causes.
> irqbalance is not running on our servers
> the set_irq_affinity.sh sets the affinity identical to our default setup
> in which we set the affinity according to /proc/irq/XX/affinity_hint

Sounds like that set-up is correct.
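
For reference, a quick way to double check is to compare the hint with
the mask that actually got applied for each vector, e.g. (assuming
eth1):

  for irq in $(awk -F: '/eth1/ {print $1}' /proc/interrupts); do
      echo "irq $irq: hint $(cat /proc/irq/$irq/affinity_hint)," \
           "assigned $(cat /proc/irq/$irq/smp_affinity)"
  done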

>>  I'd also be curious what your
>> interrupts per second per queue are during your workload.
> $ awk '/eth1/ {print $1,$19}' /proc/interrupts
> 83: eth1-TxRx-0
> 84: eth1-TxRx-1
> 85: eth1-TxRx-2
> 86: eth1-TxRx-3
> 87: eth1-TxRx-4
> 88: eth1-TxRx-5
> 89: eth1-TxRx-6
> 90: eth1-TxRx-7
> 91: eth1-TxRx-8
> 92: eth1-TxRx-9
> 93: eth1-TxRx-10
> 94: eth1-TxRx-11
> 95: eth1-TxRx-12
> 96: eth1-TxRx-13
> 97: eth1-TxRx-14
> 98: eth1-TxRx-15
> 99: eth1
>
> $ sar -I 83,84,85,86,97,88,89,90,91,92,93,94,95,96,97,98 1 11111
> 14:09:19         INTR    intr/s
> 14:09:20           83   3431.00
> 14:09:20           84   3387.00
> 14:09:20           85   3392.00
> 14:09:20           86   3352.00
> 14:09:20           88   3403.00
> 14:09:20           89   3380.00
> 14:09:20           90   3408.00
> 14:09:20           91   3418.00
> 14:09:20           92   3380.00
> 14:09:20           93   3388.00
> 14:09:20           94   3379.00
> 14:09:20           95   3435.00
> 14:09:20           96   3412.00
> 14:09:20           97   3359.00
> 14:09:20           98   3429.00

These interrupt rates seem a little low; however, odds are you are
polling, and that may be reducing the total number of interrupts.  The
fact that all 16 vectors are active likely means you are using all 16
queues on the system, though.
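
On a related note, the interrupt moderation settings are worth a look
as well since they also trade CPU for latency; for instance (eth1
again, and the value below is only an example, not a recommendation):

  ethtool -c eth1              # show current coalescing settings
  ethtool -C eth1 rx-usecs 10  # more interrupts, lower per-packet latency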

>> Lastly, I'm headed out on vacation tonight and won't be available for a
>> while.  I hope that someone else on my team will continue to work with
>> you to debug what is going on.
> Have a nice vacation!
> If someone else could help me with this issue, that would be great.

As you can probably tell from the fact that I am replying, I will step
in while Jesse is out to help you resolve this.

>> Maybe someone here can reproduce the issue and we will make much more
>> progress.  Any testing details like kernel version, driver version, etc
>> will be helpful.
> $ uname -r
> 3.5.3-2POi-x86_64             (we compile our own kernels, this is a vanilla
>                               kernel.org kernel; /proc/config.gz attached)
> $ sudo ethtool -i eth1
> driver: ixgbe
> version: 3.9.15-k
> firmware-version: 0x613e0001
> bus-info: 0000:15:00.1

So the driver is just the stock ixgbe driver included with the 3.5.3
kernel then?  If so, that makes it a bit easier to debug since we know
exactly what code we are working with if this does turn out to be a
driver issue.

Thanks,

Alex
