@Kai-Heng Feng
In reply to comment #16
Can you please try the "reboot=" kernel parameter? The value can be "bios",
"acpi", "kbd", "triple", "efi", or "pci".
--> I tried passing the kernel parameter with the values mentioned above, and
observed the fatal error every time.
Console logs from these attempts are attached.
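
When repeating this test, it can help to confirm which "reboot=" value the running kernel actually booted with. A minimal Python sketch follows; the helper name reboot_values is hypothetical, and it only parses a command-line string such as the contents of /proc/cmdline.

```python
def reboot_values(cmdline: str) -> list:
    """Return the value of every reboot= token found on a kernel command line."""
    prefix = "reboot="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


# Made-up command line for illustration; on a live system the string could be
# read with open("/proc/cmdline").read() instead.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro reboot=pci"
print(reboot_values(sample))
```

An empty list means no "reboot=" parameter was passed for that boot.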

** Attachment added: "fatal_issue_1.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+attachment/5572643/+files/fatal_issue_1.txt

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1917471

Title:
  [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which
  causes Bus Fatal Error when rebooting system with BCM5720 NIC

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Focal:
  In Progress
Status in linux source package in Impish:
  Fix Released
Status in linux source package in Jammy:
  Fix Released

Bug description:
  SRU Justification:

  [IMPACT]

  This was reported by a hardware partner. It is seen frequently by
  their internal testing teams and is also reported regularly by
  customers who see these messages in their logs, which is generating
  an unusually high volume of support calls from the field.

  In 5.4, commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced
  upstream and pulled into Ubuntu between 5.4.0-58.64 and 5.4.0-59.65.
  These errors were later discovered upstream and the patch was reverted
  (see [FIX] below).  We carry the revert commit in all subsequent Focal
  HWE kernels starting at 5.12, but the fix was never backported to
  Focal 5.4.

  According to the hardware partner, the following error messages are
  observed when rebooting a machine that uses the BCM5720 chipset, a
  widely used 1GbE controller found on LOMs and OCP NICs as well as
  many PCIe NIC models:

  [  146.429212] shutdown[1]: Rebooting.
  [  146.435151] kvm: exiting hardware virtualization
  [  146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
  [  148.088133] [qede_unload:2236(eno12409)]Link is down
  [  148.183618] qede 0000:31:00.1: Ending qede_remove successfully
  [  148.518541] [qede_unload:2236(eno12399)]Link is down
  [  148.625066] qede 0000:31:00.0: Ending qede_remove successfully
  [  148.762067] ACPI: Preparing to enter system sleep state S5
  [  148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
  [  148.803731] {1}[Hardware Error]: event severity: recoverable
  [  148.810191] {1}[Hardware Error]:  Error 0, type: fatal
  [  148.816088] {1}[Hardware Error]:   section_type: PCIe error
  [  148.822391] {1}[Hardware Error]:   port_type: 0, PCIe end point
  [  148.829026] {1}[Hardware Error]:   version: 3.0
  [  148.834266] {1}[Hardware Error]:   command: 0x0006, status: 0x0010
  [  148.841140] {1}[Hardware Error]:   device_id: 0000:04:00.0
  [  148.847309] {1}[Hardware Error]:   slot: 0
  [  148.852077] {1}[Hardware Error]:   secondary_bus: 0x00
  [  148.857876] {1}[Hardware Error]:   vendor_id: 0x14e4, device_id: 0x165f
  [  148.865145] {1}[Hardware Error]:   class_code: 020000
  [  148.870845] {1}[Hardware Error]:   aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
  [  148.879842] {1}[Hardware Error]:   aer_uncor_severity: 0x000ef030
  [  148.886575] {1}[Hardware Error]:   TLP Header: 40000001 0000030f 90028090 00000000
  [  148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
  [  148.902795] tg3 0000:04:00.0: AER:    [20] UnsupReq               (First)
  [  148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
  [  148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
  [  148.925558] tg3 0000:04:00.0: AER:   TLP Header: 40000001 0000030f 90028090 00000000
  [  148.933984] reboot: Restarting system
  [  148.938319] reboot: machine restart
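
  The AER register values in the log above are bitmasks. As an aid to reading them, here is a rough Python sketch that maps set bits to error names. The bit table is written from the PCI Express AER uncorrectable error register layout as commonly documented and should be checked against the specification; the names AER_UNCOR_BITS and decode_aer_uncor are hypothetical.

```python
# Assumed bit positions in the PCIe AER Uncorrectable Error
# Status / Mask / Severity registers (verify against the PCIe specification).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncor(value: int) -> list:
    """Return (bit, name) pairs for every bit set in a 32-bit register value."""
    return [
        (bit, AER_UNCOR_BITS.get(bit, "unknown/reserved"))
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values copied from the log above.
for label, value in [
    ("aer_uncor_status", 0x00100000),
    ("aer_uncor_mask", 0x00010000),
    ("aer_uncor_severity", 0x000EF030),
]:
    print(label, decode_aer_uncor(value))
```

  Under these bit assignments the status value decodes to bit 20 (Unsupported Request), which agrees with the "[20] UnsupReq" line printed for the tg3 device.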

  The hardware partner did some bisection and observed the following:

  Kernel version   Fatal Error
  5.4.0-42.46      No
  5.4.0-45.49      No
  5.4.0-47.51      No
  5.4.0-48.52      No
  5.4.0-51.56      No
  5.4.0-52.57      No
  5.4.0-53.59      No
  5.4.0-54.60      No
  5.4.0-58.64      No
  5.4.0-59.65      Yes
  5.4.0-60.67      Yes
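
  The regression window can be read directly off the table. As a small illustration, the Python sketch below encodes the partner's results and reports the last good and first bad kernel; the function name regression_window is hypothetical, and it assumes the rows are ordered oldest to newest.

```python
# (kernel version, fatal error observed) pairs copied from the table above.
results = [
    ("5.4.0-42.46", False),
    ("5.4.0-45.49", False),
    ("5.4.0-47.51", False),
    ("5.4.0-48.52", False),
    ("5.4.0-51.56", False),
    ("5.4.0-52.57", False),
    ("5.4.0-53.59", False),
    ("5.4.0-54.60", False),
    ("5.4.0-58.64", False),
    ("5.4.0-59.65", True),
    ("5.4.0-60.67", True),
]


def regression_window(rows):
    """Return (last_good, first_bad) for rows ordered oldest to newest."""
    last_good = None
    for version, fatal in rows:
        if fatal:
            return last_good, version
        last_good = version
    return last_good, None


print(regression_window(results))
```

  The result matches the range quoted in [IMPACT]: the offending commit landed between 5.4.0-58.64 and 5.4.0-59.65.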

  [FIX]
  The fix is to apply this patch from upstream:

  commit 9d3fcb28f9b9750b474811a2964ce022df56336e
  Author: Josef Bacik <jo...@toxicpanda.com>
  Date:   Tue Mar 16 22:17:48 2021 -0400

      Revert "PM: ACPI: reboot: Use S5 for reboot"

      This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

      This patch causes a panic when rebooting my Dell Poweredge r440.  I do
      not have the full panic log as it's lost at that stage of the reboot and
      I do not have a serial console.  Reverting this patch makes my system
      able to reboot again.

  Example:
  https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471

  The hardware partner has preemptively pulled our 5.4 tree, applied the
  fix, tested it in their labs, and confirmed that it resolves the
  issue.

  [TEST CASE]
  Install the patched kernel on a machine that uses a BCM5720 LOM, reboot
  the machine, and verify that the errors no longer appear.
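
  One way to make this check repeatable is to scan a captured console log for the error signatures. The Python sketch below is only illustrative: the two patterns are assumptions taken from the messages quoted in this report, and the name find_reboot_errors is hypothetical.

```python
import re

# Signatures taken from the log in this report; other kernels may word them differently.
ERROR_PATTERNS = [
    re.compile(r"\[Hardware Error\]"),
    re.compile(r"AER: aer_status"),
]


def find_reboot_errors(log_text: str) -> list:
    """Return the log lines that match any of the known error signatures."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in ERROR_PATTERNS)
    ]


# Short excerpt from the log above, used as sample input.
sample_log = """\
[  148.762067] ACPI: Preparing to enter system sleep state S5
[  148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[  148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[  148.933984] reboot: Restarting system
"""

for line in find_reboot_errors(sample_log):
    print(line)
```

  An empty result for the shutdown portion of the console log would indicate that the test case passes.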

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+subscriptions


