I am not sure we could deterministically provoke the 
issue. At the very least to ensure no other regression
was introduced, I would run it under heavy network load.

The environment in question which saw the issue had 
network load, contention for cpus and several other 
issues occur.

The basic environment is:

1. For any 25Gb NIC/chipset that requires the 4.4 bnxt_en_bpo
   driver, set its 2 ports/interfaces up in bonding mode 
   as follows:

bond-lacp-rate fast
bond-master bond0
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4
mtu 9000 

2. Run any heavy TCP network load test over the systems
   (e.g. iperf, netperf, file transfer, etc.)

3. Theoretically, it would appear that if the number of tx
   ring descriptors were lower, than that would be more
   likely to hit this (not successfully proven by testing
   here), but can lower it and see if that helps:

   # ethtool -G eno49 tx 128  // for example


I am not sure if that helps, Scott. I'll try and smoke
up more specific steps but I cannot guarantee you will
see the issue.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_po driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting
  reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_po driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnx_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_po driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running on the in-tree driver in upstream mainline
  as well as the Ubuntu Linux in-tree driver, although the Broadcom
  driver has a lot of lower level code that is different, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to