Hello,

I am hitting a strange TCP performance limit while configuring teql layer 3 
link aggregation with 6 x 1 GBit NICs.

I first raised my problem on the LARTC list, since that is where I found the 
concept of teql aggregation. Additional info on my setup may be found there:
http://www.spinics.net/lists/lartc/msg23205.html

During my research I found this old article on e1000-devel, which may come 
close to my problem:
http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html

----------<quote>-----------
> you are running out of bus bandwidth (which is why increasing
> descriptors doesn't help). rx_missed_errors occur when you run out of
> fifo on the adapter itself, indicating the bus can't be attained for
> long enough to keep the data rate up.
----------<quote>-----------
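
For reference, the descriptor rings the quote refers to can be inspected and 
enlarged like this (whether that helps is exactly what the quote doubts; 4096 
is the e1000e maximum ring size, if I remember correctly):

        # show current and maximum RX/TX descriptor ring sizes
        ethtool -g eth6
        # bump the RX ring towards the maximum
        ethtool -G eth6 rx 4096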

Since the symptoms point to a NIC issue, I am continuing the discussion here.

Short recall of the setup:

I am building a Beowulf-style HPC cluster from old server hardware sourced on 
eBay, and I am now trying to squeeze the best out of it.

The cluster nodes are 16x HP 460c G1 in an HP blade enclosure, each equipped 
with 6 x GBit Ethernet ports. I configured a layer 3 link aggregation over 6 
separate VLANs, with IP routes over the physical channels and an additional 
teql layer on top of those routes:

+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
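
Roughly, the per-node setup looks like the sketch below; interface names and 
addresses are only examples, not my exact config:

        # enslave the six physical ports to one teql device
        modprobe sch_teql
        for i in `seq 4 9`; do
            tc qdisc add dev eth$i root teql0
        done
        ip link set dev teql0 up mtu 9000
        # aggregate address on top of the six per-VLAN subnets/routes
        ip addr add 10.10.99.1/24 dev teql0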

Things work fine between the blade nodes, which are equipped with Broadcom 
chips using the tg3 driver. With MTU 9000 jumbo frames, I get iperf transfer 
rates of 5.6 GBit/s, which is > 90 % of the theoretical maximum.
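
That number comes from a plain iperf run over the aggregated device, something 
like the following (hostnames as in my setup, the rest is generic):

        iperf -s                        # on blade-001
        iperf -c blade-001 -t 30        # on blade-002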

A straight implementation of the same scheme between the blades and the 
gateway yields no more than ~2 GBit/s.
So some aggregation happens, but it is far from the 6 GBit/s maximum.
There is no noticeable dependency on the direction of transmission.

The gateway uses two slightly different quad port GBit cards with 82571EB 
chips and the e1000e driver. In the initial configuration, eth2 and eth3 were 
left unconnected; eth4..eth9 were each connected to a different HP Virtual 
Connect Ethernet switching module in the blade enclosure.

ifconfig and wireshark show traffic spread equally over all 6 lines, 
but with an awful lot of retransmits.
Well, maybe wireshark just gets confused by teql and fails to match packets 
because they travel over different interfaces, but that's another issue, not 
the primary one here.
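
Independent of wireshark, the kernel's own TCP counters confirm the 
retransmits; a quick check (assuming a reasonably recent net-tools/iproute2):

        netstat -s | grep -i retrans    # cumulative TCP retransmission counters
        ss -ti                          # per-connection stats (retrans, cwnd, rtt)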

After lots of googling, I pinned the symptom down to this issue:

# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
     rx_missed_errors: 0                        -> eth2
     rx_missed_errors: 0                        -> eth3
     rx_missed_errors: 0                        -> eth4
     rx_missed_errors: 0                        -> eth5
     rx_missed_errors: 29159                    -> eth6
     rx_missed_errors: 28619                    -> eth7
     rx_missed_errors: 9263                     -> eth8
     rx_missed_errors: 23306                    -> eth9
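
Since these counters are cumulative, a crude way to watch where and when they 
grow during an iperf run is something like:

        # sample the missed-frame counters once per second
        while sleep 1; do
            for i in `seq 2 9`; do
                echo -n "eth$i: "; ethtool -S eth$i | grep rx_missed_errors
            done
            date
        done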

==================================

I need 6 links out of the 8 available, so I moved 2 cables from the "buggy" 
NIC to the "healthy" one and kept the link configuration matching.
Now eth8 and eth9 are unused, and eth2 ... eth7 form the aggregated link,

and, lo and behold, we go up from ~2 GBit/s to > 3 GBit/s.
There are still thousands of rx_missed_errors on the "bad" NIC, which now only 
has to carry 2 GBit/s worth of links, and still zero rx_missed_errors on the 
"good" NIC, which now carries 4 GBit/s of active links.

Further googling and tweaking of the memory limits in
        /proc/sys/net/ipv4/tcp_*mem
and
        /proc/sys/net/core/*mem*
made no difference.
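
For completeness, this is the kind of tuning I mean; the actual values I tried 
varied, the ones below are just examples:

        sysctl -w net.core.rmem_max=16777216
        sysctl -w net.core.wmem_max=16777216
        sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
        sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"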

What did help was increasing the "TCP window size" on the iperf server side 
from
        "TCP window size: 85.3 KByte (default)"
to a value between 512K and 2M.

This yields 4.35 Gbits/sec, so we are above 70 % of the theoretical maximum.
However, I neither really understand why this works, nor do I know how to 
carry this window size setting over to applications other than iperf.
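
Concretely, the improvement comes from starting the server with an explicit 
window, e.g.

        iperf -s -w 1M                  # server (receiver) side
        iperf -c cruncher -w 1M -t 30   # client side

As far as I understand it, the equivalent for other applications would be a 
larger default receive buffer (the middle value of net.ipv4.tcp_rmem), or the 
application setting SO_RCVBUF on its sockets itself. Corrections welcome.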

I think the TCP window size is just a workaround for an underlying problem, 
because
- there are still lots of rx_missed_errors for eth6 and eth7
- there are zero rx_missed_errors for eth2..eth5
- the blade-to-blade connection reaches 5.6 GBit/s even without any tweaking, 
  with the small default TCP window size ("85.3 KByte (default)")


Possible causes on my list:

- firmware problem (NICs, mainboard)
- hardware problem (NICs, mainboard)
- conceptual limitation of the hardware design
- driver problem
- kernel / scheduling issue / IRQ / race ... whatever? (see the quick check 
  after this list)
- some really weird hidden tweak parameter
- still the nasty VC blade switch?
  (which already led me to migrate from layer 2 bonding to layer 3 teql)
- anything else????
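
Regarding the IRQ point: what I mean by a quick check is something like the 
following; the IRQ number and CPU mask in the last line are placeholders, to 
be read off /proc/interrupts first:

        # how are the NIC interrupts distributed over the cores?
        grep -E 'eth[2-9]' /proc/interrupts
        # optionally pin one port's IRQ to a single core (placeholder numbers)
        echo 2 > /proc/irq/45/smp_affinity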

==================================
System details ....

        ... extracted from lspci ....

root@cruncher:/cluster/etc/dnsmasq.d# lspci | grep -i ether
07:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
08:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
08:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
0c:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
0d:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
0d:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
0e:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. 
RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 10)
10:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 
PCI Express Gigabit Ethernet Controller (rev 09)

----------------

0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter

07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
        Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port 
Gigabit Server Adapter

So both adapters have the same chipset, the same driver, similar bus 
connectivity, and announce identical PCI bus bandwidth:
        'LnkSta: Speed 2.5GT/s, Width x4'
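
For reference: 2.5 GT/s is PCIe gen 1, so with 8b/10b encoding a x4 link gives 
roughly 8 GBit/s of payload bandwidth per direction per adapter; 4 x 1 GbE 
should fit, but the headroom is not huge. The LnkSta line above is from 
lspci -vv, e.g.:

        lspci -vv -s 07:00.0 | grep -E 'LnkCap|LnkSta'
        lspci -vv -s 0c:00.0 | grep -E 'LnkCap|LnkSta'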

---------------------

'   +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
'   |                               |            \-00.1
'   |                               \-04.0-[08]--+-00.0
'   |                                            \-00.1
'   +-0b.0-[09]--+-00.0
'   |            \-00.1
'   +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
'   |                               |            \-00.1
'   |                               \-01.0-[0d]--+-00.0
'   |                                            \-00.1

----------------------

The gateway mainboard is a SABERTOOTH 990FX R2.0
[AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A)
- consumer grade, but quite recent -
The gateway CPU is an AMD FX-8320 8-core.
debian wheezy, vanilla kernel:
Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux

driver:
root@cruncher..# modinfo e1000e
filename:       
/lib/modules/3.19.0/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
version:        2.3.2-k
license:        GPL
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, <linux.n...@intel.com>
srcversion:     A1AA8F77482AA26B2715149
....


The blade nodes are HP 460c G1 blades
chipset: Intel 5000
- enterprise grade, but quite a few years old by now, I suppose -
CPU: 2 x Xeon E5430 quad core
debian wheezy, debian backports kernel:
Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian 
3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux

----------------

Comparing memory bandwidth with mbw (as a first measure of system bus 
throughput), the gateway outperforms the blades by a factor of two:

root@blade-002:~# mbw -n1 1000
AVG     Method: MEMCPY  Elapsed: 0.61679        MiB: 1000.00000 Copy: 1621.300 
MiB/s
AVG     Method: DUMB    Elapsed: 0.51892        MiB: 1000.00000 Copy: 1927.068 
MiB/s
AVG     Method: MCBLOCK Elapsed: 0.39211        MiB: 1000.00000 Copy: 2550.311 
MiB/s

root@cruncher...#  mbw -n1 1000
AVG     Method: MEMCPY  Elapsed: 0.27301        MiB: 1000.00000 Copy: 3662.923 
MiB/s
AVG     Method: DUMB    Elapsed: 0.19693        MiB: 1000.00000 Copy: 5077.972 
MiB/s
AVG     Method: MCBLOCK Elapsed: 0.19287        MiB: 1000.00000 Copy: 5184.947 
MiB/s

In every CPU/memory/bus related test I have run so far, the gateway 
outperformed the blades, so I'd consider it the "faster" system in general.

-------------------

So, conceptually, I see no reason why, of two nearly identical quad-GBit 
adapters, one should fail so badly on the faster system.

I again compared lspci -vv line by line and found a tiny difference:

Hewlett-Packard Company NC364T.... (the 'bad')
        Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 8000 [size=32]

Intel Corporation PRO/1000 PT ...('the good')
        Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5020 [size=32]

so the "Region 2" memory is 4x larger in the 'bad' NIC.
Any clue whether this may be related? 

Just an uneducated guess:
these sizes are too small for a buffer holding actual packet data.
But if that region were some kind of pointer FIFO into buffer memory, could 
the card with the larger region run out of referenced buffers while the one 
with the smaller region does not????
And if so, could I simply increase this buffer?

How to proceed from "Guess" to "Know" to "Cure"?

Anybody any idea? 
Thank you :-)




Wolfgang Rosner
