Hello, I have run into a strange TCP performance limit while configuring teql layer 3 link aggregation with 6 x 1 GBit NICs.
I started with my problem on the LARTC list, since that is where I found the
concept of teql aggregation. Additional info on my setup may be found there:
http://www.spinics.net/lists/lartc/msg23205.html

During my research, I found this old article on e1000-devel, which may come
close to my problem:
http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html

----------<quote>-----------
> you are running out of bus bandwidth (which is why increasing
> descriptors doesn't help). rx_missed_errors occur when you run out of
> fifo on the adapter itself, indicating the bus can't be attained for
> long enough to keep the data rate up.
----------<quote>-----------

Since the symptoms center on a NIC issue, I am continuing the discussion here.

Short recap of the setup:

I am trying to configure a beowulf-style HPC cluster from old server hardware
sourced on ebay, and now want to squeeze the best out of it. The cluster nodes
are 16x HP 460c G1 in an HP blade enclosure, equipped with 6 x GBit ethernet
ports each. I configured a layer 3 link aggregation on 6 separate vlans, with
IP routes over the physical channels and an additional teql layer on top of
those routes:

+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003

This works fine between the blade nodes, which are equipped with broadcom
chips using the tg3 driver. With mtu 9000 jumbo frames, I get iperf transfer
rates of 5.6 GBit, which is > 90 % of the theoretical maximum.

A straight implementation of the same scheme between the blades and the
gateway yields no more than ~2 GBit. So some aggregation happens, but far
from the 6 GBit maximum. There is no noticeable dependency on the direction
of transmit.
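For reference, the per-host teql configuration follows roughly this pattern
(a minimal sketch of the scheme described above; the interface names match
the diagram, but the addresses are placeholders, not my exact values):

```shell
# load the teql scheduler and enslave the six physical ports to teql0
modprobe sch_teql
for dev in eth4 eth5 eth6 eth7 eth8 eth9; do
    tc qdisc add dev "$dev" root teql0
done

# bring up the aggregate device with jumbo frames and its own address
ip link set dev teql0 mtu 9000 up
ip addr add 10.99.0.254/24 dev teql0     # placeholder address

# the peer's teql address is routed over the aggregate; the per-vlan
# IP routes over the physical channels (not shown) carry the packets
ip route add 10.99.0.1/32 dev teql0      # placeholder peer
```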
The gateway uses two slightly different quad port GBit cards with 82571EB
chips and the e1000e driver. In the initial connection, eth2,3 were left
open; eth4..eth9 were each connected to a different HP Virtual Connect blade
center switching Ethernet Module.

ifconfig and wireshark show traffic coming in equally over all 6 lines - but
with an awful lot of retransmits. Well, maybe wireshark gets confused by teql
and fails to match packets since they go over different interfaces, but that
is another issue, not the primary one here.

After lots of googling, I pinned the symptom down to this:

# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
rx_missed_errors: 0      -> eth2
rx_missed_errors: 0      -> eth3
rx_missed_errors: 0      -> eth4
rx_missed_errors: 0      -> eth5
rx_missed_errors: 29159  -> eth6
rx_missed_errors: 28619  -> eth7
rx_missed_errors: 9263   -> eth8
rx_missed_errors: 23306  -> eth9

==================================

I need 6 links out of the 8 available, so I plugged 2 cables from the "buggy"
NIC over to the "healthy" one - and kept the link config matching. So now
eth8 and eth9 are open, eth2..eth7 form the aggregated link, and - alas - we
get up from ~2 GBit to > 3 GBit.

There are still thousands of rx_missed_errors on the "bad" NIC, which now
only has to carry 2 GBit worth of connections, and still zero
rx_missed_errors on the "good" NIC, which now carries 4 active GBit links.

Further googling and tweaking of the memory limits in
/proc/sys/net/ipv4/tcp_*mem and /proc/sys/net/core/*mem* made no difference.

What did help was to increase the "TCP window size" on the iperf server side
from "TCP window size: 85.3 KByte (default)" to a value between 512K and 2M.
This yields 4.35 Gbits/sec, so we are over 70 % of the theoretical maximum.

However, I neither really understand this, nor do I know how to transfer this
window size setting to applications other than iperf.
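My understanding of why the window size matters: a TCP connection can have at
most one window of unacknowledged data in flight, so throughput is capped at
window / RTT; conversely, to fill a pipe you need window >= rate x RTT (the
bandwidth-delay product). As a back-of-the-envelope check (the 1 ms RTT is an
assumed placeholder, not a measured value from my setup):

```shell
# window needed to keep a 6 GBit/s aggregate busy at an assumed 1 ms RTT
awk 'BEGIN {
    rate_bits = 6e9        # aggregate link rate, bit/s
    rtt_s     = 0.001      # assumed round-trip time, seconds
    printf "needed window: %.0f KByte\n", rate_bits * rtt_s / 8 / 1024
}'
# -> needed window: 732 KByte
```

That lands right in the 512K-2M range that helped with iperf. As far as I
know, iperf -w calls setsockopt(SO_RCVBUF) on its socket, which also switches
off the kernel's receive-window autotuning for that connection; applications
that do not set buffers themselves take their starting point and ceiling from
the net.ipv4.tcp_rmem / tcp_wmem triplets instead.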
I think the TCP window size is just a workaround for an underlying problem,
because:
- there are still lots of rx_missed_errors for eth6 and eth7
- zero rx_missed_errors for eth2..eth5
- the blade-blade connection with 5.6 GBit works even better without any
  tweaking, with the small default TCP window size ("85.3 KByte (default)")

Possible causes on my list:
- firmware problem (NICs, mainboard)
- hardware problem (NICs, mainboard)
- conceptual limitation of the hardware design
- driver problem
- kernel / scheduling issue / IRQ / race... whatever?
- some really weird hidden tweak parameter
- still the nasty VC blade switch? (which already led me to migrate from
  layer 2 bonding to layer 3 teql)
- any more????

==================================

System details, extracted from lspci:

root@cruncher:/cluster/etc/dnsmasq.d# lspci | grep -i ether
07:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
08:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
08:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0c:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0d:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0d:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0e:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 10)
10:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)

----------------

0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter

So both adapters have the same chipset and the same driver, similar bus
connectivity, and announce identical PCIe bus bandwidth:
'LnkSta: Speed 2.5GT/s, Width x4'

---------------------
 +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
 |                               |            \-00.1
 |                               \-04.0-[08]--+-00.0
 |                                            \-00.1
 +-0b.0-[09]--+-00.0
 |            \-00.1
 +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
 |                               |            \-00.1
 |                               \-01.0-[0d]--+-00.0
 |                                            \-00.1
----------------------

The gateway mainboard is a SABERTOOTH 990FX R2.0 ([AMD/ATI] RD890 PCI to PCI
bridge (external gfx1 port A)) - consumer grade, but quite recent. The
gateway CPU is an AMD FX-8320 8 core, running debian wheezy with a vanilla
kernel:

Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux

driver:

root@cruncher..# modinfo e1000e
filename:       /lib/modules/3.19.0/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
version:        2.3.2-k
license:        GPL
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, <linux.n...@intel.com>
srcversion:     A1AA8F77482AA26B2715149
....
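Given the quoted theory that the bus "can't be attained for long enough", one
cheap check I still plan to do is to compare what each slot is capable of
(LnkCap) against what was actually negotiated (LnkSta) on both cards - a
width or speed downgrade on one riser would produce exactly this kind of
asymmetry. A sketch, using one function of each quad card from the listing
above:

```shell
# capable vs. negotiated PCIe link, one function per quad-port card
for dev in 07:00.0 0c:00.0; do
    echo "== $dev =="
    lspci -vv -s "$dev" | grep -E 'LnkCap:|LnkSta:'
done
```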
The blade nodes are HP 460c G1 blades, chipset Intel 5000 - enterprise grade,
but quite a few years old now, I suppose - with 2 x Xeon E5430 quad core
CPUs, running debian wheezy with a debian backport kernel:

Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux

----------------

Comparing memory bandwidth with mbw (as a first measure of system bus
throughput), the gateway outperforms the blades by a factor of two:

root@blade-002:~# mbw -n1 1000
AVG  Method: MEMCPY   Elapsed: 0.61679  MiB: 1000.00000  Copy: 1621.300 MiB/s
AVG  Method: DUMB     Elapsed: 0.51892  MiB: 1000.00000  Copy: 1927.068 MiB/s
AVG  Method: MCBLOCK  Elapsed: 0.39211  MiB: 1000.00000  Copy: 2550.311 MiB/s

root@cruncher...# mbw -n1 1000
AVG  Method: MEMCPY   Elapsed: 0.27301  MiB: 1000.00000  Copy: 3662.923 MiB/s
AVG  Method: DUMB     Elapsed: 0.19693  MiB: 1000.00000  Copy: 5077.972 MiB/s
AVG  Method: MCBLOCK  Elapsed: 0.19287  MiB: 1000.00000  Copy: 5184.947 MiB/s

In all CPU/memory/bus related tests I have run so far, the gateway
outperformed the blades, so I'd consider it the "faster" system in general.

-------------------

So, conceptually, I see no reason why, of two nearly identical quad-GBit
adapters, one should fail so badly on the faster system.

I again compared lspci -vv line by line and found a tiny difference:

Hewlett-Packard Company NC364T.... (the 'bad')
        Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 8000 [size=32]

Intel Corporation PRO/1000 PT ... (the 'good')
        Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5020 [size=32]

So the "Region 1" memory is 4x larger on the 'bad' NIC. Any clue whether this
may be related? Just an uneducated guess: the values are too small for a
buffer containing actual data.
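On the buffer question: as far as I know, the BAR sizes above are fixed by
the device and cannot be grown from software; what can be grown are the RX
descriptor rings. The quoted e1000-devel post says more descriptors won't
help when the bus is the bottleneck, but the driver exports counters that
separate the two cases, so it seems cheap to rule out. A sketch (the
4096-descriptor maximum is what I believe the 82571EB supports - please
correct me if wrong):

```shell
# current vs. hardware-maximum ring sizes
ethtool -g eth6

# try the maximum RX ring on the problem ports
for dev in eth6 eth7; do
    ethtool -G "$dev" rx 4096
done

# rx_no_buffer_count -> driver ran out of descriptors (ring too small)
# rx_missed_errors   -> adapter FIFO overran (bus/DMA starvation)
ethtool -S eth6 | grep -E 'rx_no_buffer_count|rx_missed_errors'
```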
If it were some kind of pointer fifo into some buffer memory, could the
larger one run out of referred buffer while the smaller one does not????
If so, could I simply increase this buffer?

How do I proceed from "Guess" to "Know" to "Cure"?

Anybody any idea?

Thank you :-)

Wolfgang Rosner

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired