I have tried, unsuccessfully, to reproduce this issue internally.
Details of my setup below.

1) I have a pair of Dell R210 servers racked (u072 and u073 below), each
with a BCM57416 installed:

root@u072:~# lspci | grep BCM57416
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

2) I've matched the firmware version to one that Nivedita reported in a
bad system:

root@u072:~# ethtool -i enp1s0f0np0
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version: 
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

3) Matched Ubuntu release and kernel version:

root@u072:~# lsb_release -dr
Description:    Ubuntu 18.04.3 LTS
Release:        18.04

root@u072:~# uname -a
Linux u072 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

4) Configured the interface into an active-backup bond:

root@u072:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp1s0f1np1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp1s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:61
Slave queue ID: 0

Slave Interface: enp1s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:60
Slave queue ID: 0

5) Ran the provided mtr and netperf test cases with the 1st port
selected as active:

root@u072:~# ip l set enp1s0f1np1 down
root@u072:~# ip l set enp1s0f1np1 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f0np0

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T20:48:01+0000
HOST: u072                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2                0.0%    60    0.2   0.2   0.2   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 1        1       10.00    29040.91   
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 64       64      10.00    28633.36   
16384  87380 

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 128      8192    10.00    17469.30   
16384  87380

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T20:53:37+0000
HOST: u073                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1                0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1        1       10.00    28514.93   
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  64       64      10.00    27405.88   
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  128      8192    10.00    17342.42   
16384  131072

6) Ran the provided mtr and netperf test cases with the 2nd port
selected as active:

root@u072:~# ip l set enp1s0f0np0 down
root@u072:~# ip l set enp1s0f0np0 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f1np1

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T21:07:36+0000
HOST: u072                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.2                0.0%    60    0.2   0.2   0.1   0.2   0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 1        1       10.00    28649.85   
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 64       64      10.00    27053.55   
16384  87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  131072 128      8192    10.00    16706.59   
16384  87380 

b) initiated on u073:

root@u073:~# mtr --no-dns --report --report-cycles 60 192.168.1.1
Start: 2020-02-13T21:12:54+0000
HOST: u073                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1                0.0%    60    0.1   0.1   0.1   0.2   0.0

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1        1       10.00    27782.73   
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  64       64      10.00    26645.73   
16384  131072

root@u073:~# netperf -t TCP_RR -H 192.168.1.1 -- -r 128,8192
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  128      8192    10.00    17499.00   
16384  131072

As can be seen above, I don't see the same behavior. The big difference
in my setup is obviously the host, but I would be surprised if that were
a factor, since the issue has been seen on vastly different host hardware
configurations above. Are there any other differences I could have missed
between the "Bad" system above and mine?
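
To make it easier to compare these numbers with results from the affected
system, here is a rough Python sketch that converts the TCP_RR transaction
rates into mean round-trip times and shows the change between the two
ports. It assumes a single outstanding transaction per connection (which
matches "first burst 0" in the output above), so that the mean round trip
is simply the inverse of the rate. The helper names are my own.

```python
def mean_rtt_us(trans_per_sec):
    """Mean round-trip time in microseconds for a TCP_RR run.

    Assumes one transaction in flight at a time, so RTT = 1 / rate.
    """
    return 1_000_000.0 / trans_per_sec


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# TCP_RR rates (transactions/sec) initiated on u072, copied from the runs above.
port1 = {"1,1": 29040.91, "64,64": 28633.36, "128,8192": 17469.30}
port2 = {"1,1": 28649.85, "64,64": 27053.55, "128,8192": 16706.59}

for sizes in port1:
    r1, r2 = port1[sizes], port2[sizes]
    print(
        f"-r {sizes}: port 1 {mean_rtt_us(r1):.1f} us, "
        f"port 2 {mean_rtt_us(r2):.1f} us, "
        f"rate change {pct_change(r1, r2):+.1f}%"
    )
```

Substituting the rates measured on the affected system for either
dictionary should show how far its latency departs from mine.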

I am somewhat concerned about the rx_stat_discards, but I suspect they
are below the noise floor for this issue. Could you nevertheless please
carefully capture more ethtool stats on the production system, from
before and after the test, for each of the bond leg interfaces? That is
4 captures in total; I'm interested in the deltas that are due to the test.
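
In case it helps with collecting those deltas, here is a minimal Python
sketch that diffs two saved `ethtool -S` captures. It assumes each capture
was saved as text with one `name: value` counter per line; the function
names are my own and purely illustrative.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into {counter_name: int}.

    Splits on the last colon so per-ring names such as
    "[0]: rx_ucast_packets" are kept intact. Lines without a numeric
    value (for example the "NIC statistics:" header) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def stat_deltas(before, after):
    """Return only the counters that changed, as after minus before."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def deltas_from_files(before_path, after_path):
    """Convenience wrapper: diff two capture files (paths are placeholders)."""
    with open(before_path) as f:
        before = parse_ethtool_stats(f.read())
    with open(after_path) as f:
        after = parse_ethtool_stats(f.read())
    return stat_deltas(before, after)
```

For example, save `ethtool -S <iface>` to one file before the test and to
another after it, for each bond leg, then call `deltas_from_files()` on
each pair and print the result sorted by counter name.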

Given this is a production system, what else is running that might have
an influence?

I don't see drops on the PCIe rings from the stats, so it doesn't look
like the host is failing to keep up, but perhaps you could dump CPU
utilization during the test too?
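
If no monitoring tool is handy on that system, one option is to save
/proc/stat before and after the test and compute per-CPU utilization from
the two snapshots. The Python sketch below is only an illustration under
that assumption; the function names are hypothetical. Per-CPU figures are
worth having because one saturated core can be hidden by the aggregate
"cpu" line.

```python
def parse_cpu_times(proc_stat_text):
    """Return {cpu_label: (busy_jiffies, total_jiffies)} from /proc/stat text.

    Uses the user..steal columns; guest time is already included in user.
    Idle and iowait are both counted as not busy.
    """
    times = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        values = [int(v) for v in fields[1:9]]
        not_busy = sum(values[3:5])  # idle + iowait
        total = sum(values)
        times[fields[0]] = (total - not_busy, total)
    return times


def utilization(before, after):
    """Percent busy per CPU between two parse_cpu_times() snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        if cpu not in before:
            continue
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result
```

Reading /proc/stat once just before starting the test and once just after
it finishes, then passing both parsed snapshots to `utilization()`, gives
the average load over the test window.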

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears to
  be dropping data.

  Basically, we are dropping data, as you can see from the benchmark
  tool output below:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.000007] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D00000000000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD100000000001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD100000000001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0000000000000)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0000000000000)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0000000000000)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0000000000000)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0
      RX DSP: 0
      RX Dboard: A
      RX Subdev: SBX-120 RX
    RX Channel: 1
      RX DSP: 0
      RX Dboard: B
      RX Subdev: SBX-120 RX
    TX Channel: 0
      TX DSP: 0
      TX Dboard: A
      TX Subdev: SBX-120 TX
    TX Channel: 1
      TX DSP: 0
      TX Dboard: B
      TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.000000 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.000000 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  
  In this particular case description, the nodes are USRP x310s. However,
  we have the same issue with N210 nodes dropping samples connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we have tested them
  with normal 1G network cards and have no dropped samples.

  Personally I think it's something to do with the 10G network card,
  possibly an Ubuntu driver issue?

  Note, Dell have said there is no hardware problem with the 10G
  interfaces.

  I have followed the troubleshooting information at this link to try to
  determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html
  - There is no firewall on that port (disabled).
  - I tried adjusting the CPU frequency settings but got "no or unknown
    cpufreq driver is active on this CPU".
  - I also changed the cable connecting the USRPs to the 10G SRIOV port to
    Cat6a, and I get the same issue.


  
  This is from the VM with connected USRP x310
  tcdforge@x310a:~$ lspci -nn | grep -i ethernet
  00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
  00:05.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
  tcdforge@x310a:~$ 

  5e:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [14e4:16d8] (rev 01)
          Subsystem: Dell BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [1028:1fea]
          Flags: bus master, fast devsel, latency 0, IRQ 50, NUMA node 0
          Memory at b9a10000 (64-bit, prefetchable) [size=64K]
          Memory at b9100000 (64-bit, prefetchable) [size=1M]
          Memory at b9aa2000 (64-bit, prefetchable) [size=8K]
          Expansion ROM at b9c00000 [disabled] [size=512K]
          Capabilities: <access denied>
          Kernel driver in use: bnxt_en
          Kernel modules: bnxt_en

  
  We get this info from the server:
  scamallra@rack9:~$ cpupower frequency-info 
  analyzing CPU 0:
    no or unknown cpufreq driver is active on this CPU
    CPUs which run at the same hardware frequency: Not Available
    CPUs which need to have their frequency coordinated by software: Not Available
    maximum transition latency:  Cannot determine or is not supported.
  Not Available
    available cpufreq governors: Not Available
    Unable to determine current policy
    current CPU frequency: Unable to call hardware
    current CPU frequency:  Unable to call to kernel
    boost state support:
      Supported: yes
      Active: yes


  
  lsb_release -rd
  Description:  Ubuntu 18.04.3 LTS
  Release:      18.04

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: network-manager 1.10.6-2ubuntu1.1
  ProcVersionSignature: Ubuntu 4.15.0-70.79-generic 4.15.18
  Uname: Linux 4.15.0-70-generic x86_64
  ApportVersion: 2.20.9-0ubuntu7.9
  Architecture: amd64
  Date: Fri Nov 22 17:39:21 2019
  NetworkManager.state:
   [main]
   NetworkingEnabled=true
   WirelessEnabled=true
   WWANEnabled=true
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: network-manager
  UpgradeStatus: No upgrade log present (probably fresh install)
  nmcli-con: NAME  UUID  TYPE  TIMESTAMP  TIMESTAMP-REAL  AUTOCONNECT  AUTOCONNECT-PRIORITY  READONLY  DBUS-PATH  ACTIVE  DEVICE  STATE  ACTIVE-PATH  SLAVE
  nmcli-nm:
   RUNNING  VERSION  STATE      STARTUP  CONNECTIVITY  NETWORKING  WIFI-HW  WIFI     WWAN-HW  WWAN
   running  1.10.6   connected  started  unknown       enabled     enabled  enabled  enabled  enabled
