Because of earlier NIC flapping observed on systems with the 25Gb Broadcom NIC, the firmware was upgraded to avoid a known FW bug. The original configuration was:
$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: 0000:3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

The FW was upgraded on affected systems to:

$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: 0000:19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Unfortunately, it is not clear which FW version the current bug occurred on. I believe it was the newer one, but I can't confirm that, because it happened in the midst of several reboots.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  The following 25Gb Broadcom NIC error was seen (just once) on Xenial
  running the 4.4.0-141-generic kernel, on an amd64 host under
  moderate-to-heavy network traffic:

  * The bnxt_en_bpo driver froze on a "TX timed out" error and triggered
    the Netdev Watchdog timer under load.

  * From the kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See the attached kern.log excerpt for the full error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:
    "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and the softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver pulled in to
    support newer Broadcom HW (specific boards), while the bnxt_en module
    continues to support the older HW. The current upstream Linux driver
    does not compile easily against the 4.4 kernel (too many changes).

  * This fix, already in upstream and in the bnxt_en driver, is a likely
    solution:
    "bnxt_en: Fix TX timeout during netpoll"
    commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
    The fix has not been applied to the bnxt_en_bpo driver, but a review
    of the code indicates that this driver is susceptible to the bug and
    that applying the fix would be reasonable.

  * There is no easy way to reproduce this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
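
To help with the open question of which firmware was running when the watchdog fired, a minimal bash sketch along these lines could pull the firmware-version field out of saved `ethtool -i` output and count the TX-timeout events in a kernel log. The function names `fw_version` and `tx_timeouts` are hypothetical, and the log patterns only match the two messages quoted in this report; other bnxt_en_bpo messages are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any package) for triaging this report.
# Assumption: `ethtool -i` output has one "key: value" pair per line.

# Print the firmware-version value from `ethtool -i` output read on stdin.
fw_version() {
    sed -n 's/^firmware-version: //p'
}

# Count lines on stdin that match the watchdog / TX-timeout messages quoted
# in this report. grep -c prints 0 when nothing matches; `|| true` keeps
# the function's exit status at 0 in that case.
tx_timeouts() {
    grep -cE 'NETDEV WATCHDOG: .*transmit queue [0-9]+ timed out|TX timeout detected, starting reset task' || true
}
```

For example, `ethtool -i eno2d1 | fw_version` on a live host, or `tx_timeouts < kern.log` against the attached excerpt. Note that `ethtool -i` only reports the firmware running at the moment it is invoked, so on its own it cannot show which version was active before one of the reboots.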