I have tried, unsuccessfully, to reproduce this issue internally. Details of my setup below.
1) I have a pair of Dell R210 servers racked (u072 and u073 below), each with a BCM57416 installed:

root@u072:~# lspci | grep BCM57416
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

2) I've matched the firmware version to one that Nivedita reported on a bad system:

root@u072:~# ethtool -i enp1s0f0np0
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

3) Matched the Ubuntu release and kernel version:

root@u072:~# lsb_release -dr
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
root@u072:~# uname -a
Linux u072 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

4) Configured the interfaces into an active-backup bond:

root@u072:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp1s0f1np1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp1s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:61
Slave queue ID: 0

Slave Interface: enp1s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:60
Slave queue ID: 0

5) Ran the provided mtr and netperf test cases with the 1st port selected as active:

root@u072:~# ip l set enp1s0f1np1 down
root@u072:~# ip l set enp1s0f1np1 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f0np0

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
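Incidentally, rather than bouncing the other leg with `ip link set ... down/up`, the active leg of an active-backup bond can be switched directly through the bonding sysfs attribute. A minimal sketch — the `set_active_slave` helper name is mine, and the sysfs root is parameterized only so the function can be exercised without a real bond:

```shell
# Switch the active leg of an active-backup bond via the bonding sysfs
# attribute instead of bouncing the other interface. The third argument
# overrides the sysfs root purely for testing without real hardware.
set_active_slave() {
    bond=$1
    slave=$2
    root=${3:-/sys/class/net}
    # Writing an enslaved interface's name selects it as active;
    # reading the attribute back confirms the switch took effect.
    echo "$slave" > "$root/$bond/bonding/active_slave" &&
    cat "$root/$bond/bonding/active_slave"
}

# e.g.  set_active_slave bond0 enp1s0f0np0
```

This avoids the extra link-flap (and Link Failure Count increments) that the down/up method introduces.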
Start: 2020-02-13T20:48:01+0000
HOST: u072                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2                0.0%    60    0.2   0.2   0.2   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 1        1       10.00    29040.91
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 64       64      10.00    28633.36
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 128      8192    10.00    17469.30
16384  87380

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T20:53:37+0000
HOST: u073                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1                0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    28514.93
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  64       64      10.00    27405.88
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  128      8192    10.00    17342.42
16384  131072

6) Ran the provided mtr and netperf test cases with the 2nd port selected as active:

root@u072:~# ip l set enp1s0f0np0 down
root@u072:~# ip l set enp1s0f0np0 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f1np1

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T21:07:36+0000
HOST: u072                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2                0.0%    60    0.2   0.2   0.1   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 1        1       10.00    28649.85
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 64       64      10.00    27053.55
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  131072 128      8192    10.00    16706.59
16384  87380

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T21:12:54+0000
HOST: u073                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1                0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    27782.73
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  64       64      10.00    26645.73
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  128      8192    10.00    17499.00
16384  131072

As can be seen above, I don't see the same behavior. The big difference in my setup is obviously the host, but I would be surprised if that were a factor, since the issue has been seen on vastly different host hardware configurations above. Are there any other differences between the "bad" system above and mine that I could have missed? I am somewhat concerned about the rx_stat_discards, but suspect they're down below the noise floor for this issue. Could you nevertheless please carefully capture more ethtool stats on the production system?
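Something like the following could reduce a pair of those captures to just the counters that moved (a sketch: the `ethtool_delta` helper name and the file names are placeholders, and it assumes the usual `ethtool -S` output of "name: value" lines):

```shell
# A sketch: print only the counters that changed between two
# `ethtool -S <iface>` captures, with the signed delta for each.
ethtool_delta() {
    awk -F':' '
        NR == FNR { before[$1] = $2; next }            # 1st file: remember values
        ($1 in before) && ($2 + 0) != (before[$1] + 0) {
            printf "%s: %+d\n", $1, $2 - before[$1]    # only counters that moved
        }
    ' "$1" "$2"
}

# Hypothetical workflow, repeated for each bond leg:
#   ethtool -S enp1s0f0np0 > before-f0.txt
#   ... run the mtr/netperf tests ...
#   ethtool -S enp1s0f0np0 > after-f0.txt
#   ethtool_delta before-f0.txt after-f0.txt
```

For the CPU-utilization question further down, `mpstat -P ALL 1` (from sysstat) running alongside the test should be enough to show whether any core is saturating.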
Capture from before and after the test, for each of the bond leg interfaces (4 captures; I'm interested in the deltas that are due to the test). Given this is a production system, what else is running that might have an influence? I don't see drops on the PCIe rings in the stats, so it doesn't look like the host is failing to keep up, but perhaps you could dump CPU utilization during the test too?

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data.

Basically, we are dropping data, as you can see from the benchmark tool as follows:

tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
[INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam
[00:00:00.000007] Creating the usrp device with: ...
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 1472 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
[INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
[INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D00000000000)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
[INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD100000000001)
[INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD100000000001)
[INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0000000000000)
[INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0000000000000)
[INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0000000000000)
[INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0000000000000)
Using Device: Single USRP:
  Device: X-Series Device
  Mboard 0: X310
  RX Channel: 0
    RX DSP: 0
    RX Dboard: A
    RX Subdev: SBX-120 RX
  RX Channel: 1
    RX DSP: 0
    RX Dboard: B
    RX Subdev: SBX-120 RX
  TX Channel: 0
    TX DSP: 0
    TX Dboard: A
    TX Subdev: SBX-120 TX
  TX Channel: 1
    TX DSP: 0
    TX Dboard: B
    TX Subdev: SBX-120 TX
[00:00:04.305374] Setting device timestamp to 0...
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.310990] Testing receive rate 10.000000 Msps on 1 channels
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.318356] Testing transmit rate 10.000000 Msps on 1 channels
[00:00:06.693119] Detected Rx sequence error.
D[00:00:09.402843] Detected Rx sequence error.
DD[00:00:40.927978] Detected Rx sequence error.
D[00:01:44.982243] Detected Rx sequence error.
D[00:02:11.400692] Detected Rx sequence error.
D[00:02:14.805292] Detected Rx sequence error.
D[00:02:41.875596] Detected Rx sequence error.
D[00:03:06.927743] Detected Rx sequence error.
D[00:03:47.967891] Detected Rx sequence error.
D[00:03:58.233659] Detected Rx sequence error.
D[00:03:58.876588] Detected Rx sequence error.
D[00:04:03.139770] Detected Rx sequence error.
D[00:04:45.287465] Detected Rx sequence error.
D[00:04:56.425845] Detected Rx sequence error.
D[00:04:57.929209] Detected Rx sequence error.
[00:05:04.529548] Benchmark complete.

Benchmark rate summary:
  Num received samples:     2995435936
  Num dropped samples:      4622800
  Num overruns detected:    0
  Num transmitted samples:  3008276544
  Num sequence errors (Tx): 0
  Num sequence errors (Rx): 15
  Num underruns detected:   0
  Num late commands:        0
  Num timeouts (Tx):        0
  Num timeouts (Rx):        0

Done!
tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

In this particular case, the nodes are USRP X310s. However, we have the same issue, dropped samples, with N210 nodes connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think it's something to do with the 10G network card, possibly the Ubuntu driver? Note: Dell have said there is no hardware problem with the 10G interfaces.

I have followed the troubleshooting information at this link to try to determine the problem:
https://files.ettus.com/manual/page_usrp_x3x0_config.html

- There is no firewall on that port (disabled).
- I tried setting the CPU frequency governor but got "no or unknown cpufreq driver is active on this CPU".
- I also changed the cable connecting the USRPs to the 10G SR-IOV port to Cat6a, and I get the same issue.

This is from the VM with the connected USRP X310:

tcdforge@x310a:~$ lspci -nn | grep -i ethernet
00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:05.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
tcdforge@x310a:~$

5e:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [14e4:16d8] (rev 01)
	Subsystem: Dell BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [1028:1fea]
	Flags: bus master, fast devsel, latency 0, IRQ 50, NUMA node 0
	Memory at b9a10000 (64-bit, prefetchable) [size=64K]
	Memory at b9100000 (64-bit, prefetchable) [size=1M]
	Memory at b9aa2000 (64-bit, prefetchable) [size=8K]
	Expansion ROM at b9c00000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: bnxt_en
	Kernel modules: bnxt_en

We get this info from the server:

scamallra@rack9:~$ cpupower frequency-info
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  CPUs which run at the same hardware frequency: Not Available
  CPUs which need to have their frequency coordinated by software: Not Available
  maximum transition latency: Cannot determine or is not supported.
Not Available
  available cpufreq governors: Not Available
  Unable to determine current policy
  current CPU frequency: Unable to call hardware
  current CPU frequency: Unable to call to kernel
  boost state support:
    Supported: yes
    Active: yes

lsb_release -rd
Description:	Ubuntu 18.04.3 LTS
Release:	18.04

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: network-manager 1.10.6-2ubuntu1.1
ProcVersionSignature: Ubuntu 4.15.0-70.79-generic 4.15.18
Uname: Linux 4.15.0-70-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
Date: Fri Nov 22 17:39:21 2019
NetworkManager.state:
 [main]
 NetworkingEnabled=true
 WirelessEnabled=true
 WWANEnabled=true
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
RfKill:
 Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: network-manager
UpgradeStatus: No upgrade log present (probably fresh install)
nmcli-con: NAME UUID TYPE TIMESTAMP TIMESTAMP-REAL AUTOCONNECT AUTOCONNECT-PRIORITY READONLY DBUS-PATH ACTIVE DEVICE STATE ACTIVE-PATH SLAVE
nmcli-nm: RUNNING VERSION STATE STARTUP CONNECTIVITY NETWORKING WIFI-HW WIFI WWAN-HW WWAN
 running 1.10.6 connected started unknown enabled enabled enabled enabled enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853638/+subscriptions