For documentation purposes: on a recent Xenial/4.4 kernel, the kernel error log below is seen when booting with the options that ignore the hardware error/fault (which otherwise panics/reboots the system).

[ 113.658876] bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
[ 123.648066] bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 6
[ 123.730345] bnx2x: [bnx2x_dcbx_stop_hw_tx:443(eno1)]Unable to hold traffic for HW configuration
[ 123.834443] bnx2x: [bnx2x_dcbx_stop_hw_tx:444(eno1)]driver assert
[ 123.907439] bnx2x: [bnx2x_panic_dump:919(eno1)]begin crash dump -----------------
...
[ 123.907662] bnx2x 0000:19:00.0 eno1: bc 7.14.11
[ 123.907666] begin fw dump (mark 0x3c65c8)
[ 123.908033] end of fw dump
[ 123.908048] bnx2x: [bnx2x_mc_assert:751(eno1)]Chip Revision: everest3, FW Version: 7_12_30
[ 123.908049] bnx2x: [bnx2x_panic_dump:1182(eno1)]end crash dump -----------------
[ 128.701944] bnx2x: [bnx2x_func_state_change:6306(eno1)]timeout waiting for previous ramrod completion
[ 128.701946] bnx2x: [bnx2x_dcbx_resume_hw_tx:469(eno1)]Unable to resume traffic after HW configuration
[ 128.701946] bnx2x: [bnx2x_dcbx_resume_hw_tx:470(eno1)]driver assert
[ 128.701948] bnx2x: [bnx2x_panic_dump:919(eno1)]begin crash dump -----------------
...
[ 128.702170] bnx2x 0000:19:00.0 eno1: bc 7.14.11
[ 128.702173] begin fw dump (mark 0x3c65c8)
[ 128.702542] end of fw dump
[ 128.702557] bnx2x: [bnx2x_mc_assert:751(eno1)]Chip Revision: everest3, FW Version: 7_12_30
[ 128.702558] bnx2x: [bnx2x_panic_dump:1182(eno1)]end crash dump -----------------
[ 128.702565] bnx2x: [bnx2x_sp_rtnl_task:10229(eno1)]Indicating link is down due to Tx-timeout
[ 130.704628] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(3)
[ 132.706968] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[8]: txdata->tx_pkt_prod(445) != txdata->tx_pkt_cons(443)
[ 134.710090] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[16]: txdata->tx_pkt_prod(29) != txdata->tx_pkt_cons(25)
...
[ 202.648543] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[7]: txdata->tx_pkt_prod(25) != txdata->tx_pkt_cons(24)
[ 204.792441] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[23]: txdata->tx_pkt_prod(51) != txdata->tx_pkt_cons(46)
[ 204.940151] bnx2x: [bnx2x_del_all_macs:8499(eno1)]Failed to delete MACs: -5
[ 205.023453] bnx2x: [bnx2x_chip_cleanup:9319(eno1)]Failed to schedule DEL commands for UC MACs list: -5
[ 206.351810] bnx2x: [bnx2x_func_stop:9078(eno1)]FUNC_STOP ramrod failed. Running a dry transaction
[ 206.778590] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
[ 206.856735] bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
[ 207.134674] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
[ 207.212785] bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
[ 207.490725] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
...

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

   * The bnx2x driver may cause hardware faults (leading to panic/reboot) and other behaviors such as transmit timeouts, after commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") is introduced.
   * This issue has been observed by a user shortly after starting docker & kubelet, with adapters:
     - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
     - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]
   * If options to ignore hardware faults are used (erst_disable=1 hest_disable=1 ghes.disable=1), the system doesn't panic/reboot. It instead times out on adapter stats, then hits transmit timeouts and prints some adapter firmware dumps, and the network interface is left non-functional.
   * The issue only happened when LLDP was enabled on the network switches, and the crashdump shows the bnx2x driver stuck waiting for the firmware to complete the stop traffic command in LLDP handling. The workaround used is to disable LLDP on the network switches/ports.
   * Analysis of the driver and firmware dumps didn't help significantly towards finding the root cause.
   * Upstream/mainline recently reverted the patch, due to similar problem reports, while looking for the root cause/proper fix.

  [Test Case]

   * No reproducible test case was found outside the user's systems/cluster, where it is enough to start docker & kubelet and wait.
   * The user verified test kernels for Xenial and Bionic: the problem does not happen; build-tested on Disco.

  [Regression Potential]

   * Users who make significant use of non-default traffic class (tc) / class of service (cos) settings might see performance changes (if any at all) in such applications; however, that's unclear now.
   * This is a recent revert upstream (v5.3-rc'ish), so there's a chance things might change in this area.
   * Nonetheless, the patch is authored by the driver vendor, and made its way into stable kernels (e.g., v5.2.8, which made Eoan/19.10 recently).
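
For triaging logs like the one quoted in this message, it can help to tally the bnx2x messages by reporting function and to list the TX queues that never drained. The following is a minimal Python sketch, assuming the dmesg line format shown in this report. The helper names `summarize_bnx2x` and `stuck_tx_queues` are hypothetical and not part of any existing tool, and treating the producer/consumer indices as 16-bit counters that may wrap is an assumption.

```python
import re
from collections import Counter

# Matches driver messages such as:
# "[ 123.648066] bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 6"
MSG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bnx2x:\s+"
    r"\[(?P<func>\w+):(?P<line>\d+)\((?P<iface>[^)]+)\)\](?P<msg>.*)"
)

# Matches the bnx2x_clean_tx_queue timeout text and captures the queue
# index plus the producer and consumer packet indices.
QUEUE_RE = re.compile(
    r"timeout waiting for queue\[(?P<q>\d+)\]: "
    r"txdata->tx_pkt_prod\((?P<prod>\d+)\) != "
    r"txdata->tx_pkt_cons\((?P<cons>\d+)\)"
)


def summarize_bnx2x(lines):
    """Count bnx2x messages per reporting driver function."""
    counts = Counter()
    for line in lines:
        m = MSG_RE.search(line)
        if m:
            counts[m.group("func")] += 1
    return counts


def stuck_tx_queues(lines):
    """Map queue index -> packets still pending (producer minus consumer).

    Assumption: the indices are 16-bit counters, so the difference is
    taken modulo 2**16 to tolerate wrap-around.
    """
    stuck = {}
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            pending = (int(m.group("prod")) - int(m.group("cons"))) & 0xFFFF
            stuck[int(m.group("q"))] = pending
    return stuck


# A few lines copied from the log in this report.
SAMPLE = [
    "[ 123.648066] bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 6",
    "[ 123.907662] bnx2x 0000:19:00.0 eno1: bc 7.14.11",
    "[ 130.704628] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for "
    "queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(3)",
    "[ 132.706968] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for "
    "queue[8]: txdata->tx_pkt_prod(445) != txdata->tx_pkt_cons(443)",
]

if __name__ == "__main__":
    for func, count in summarize_bnx2x(SAMPLE).most_common():
        print(f"{count:4d}  {func}")
    for queue, pending in sorted(stuck_tx_queues(SAMPLE).items()):
        print(f"queue[{queue}]: {pending} packet(s) not completed")
```

In practice the input could come from the output of `dmesg` saved to a file and read line by line. Lines that do not follow the `bnx2x: [function:line(interface)]` pattern, such as the firmware dump lines, are skipped.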

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840789/+subscriptions

