Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
YunQiang Su wrote on Fri, Oct 22, 2021 at 10:36 PM:
>
> Claudio Kuenzler wrote on Fri, Oct 22, 2021 at 2:03 PM:
> >
> > The fact that later kernel versions work fine _could_ be because of an
> > hpwdt commit after 5.10:
> > https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
> > I have not tested sid or a newer kernel on our HP machines though.
> > If you've compiled your own kernel and this one works (did you do a
> > multiple reboot test?), maybe there's a difference in the kernel "config"?
> >
> > What happens if you disable the hpwdt module as mentioned in the other bug
> > reports? Does Bullseye with 5.10 and experimental with 5.14 work in this
> > case?
>
> I tested upstream linux and debian-linux with the same config.
> The upstream kernel works fine with it, while debian-linux has this problem.
> I guess it is due to one patch by Debian.

I found the real problem: it is due to intel_iommu being enabled by default.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=934309

--
YunQiang Su
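For readers who want to check whether their own machine boots with the Intel IOMMU requested on, here is a minimal Python sketch. It is not from the thread: the function name is made up for illustration, and it only looks at the build-time default (CONFIG_INTEL_IOMMU_DEFAULT_ON) and the intel_iommu= boot parameter. It says nothing about whether the hardware actually exposes an IOMMU.

```python
def intel_iommu_requested(cmdline: str, kernel_config: str) -> bool:
    """Return True if the Intel IOMMU is requested on, False otherwise.

    cmdline is the kernel command line text, kernel_config is the text of
    the kernel build config. Hypothetical helper, for illustration only.
    """
    # Build-time default: CONFIG_INTEL_IOMMU_DEFAULT_ON=y turns it on
    # unless the command line says otherwise.
    enabled = any(
        line.strip() == "CONFIG_INTEL_IOMMU_DEFAULT_ON=y"
        for line in kernel_config.splitlines()
    )
    # Command-line overrides are applied in order, so the last on/off wins.
    for token in cmdline.split():
        if not token.startswith("intel_iommu="):
            continue
        for option in token.split("=", 1)[1].split(","):
            if option == "on":
                enabled = True
            elif option == "off":
                enabled = False
    return enabled
```

On a running Debian system the two inputs would typically come from /proc/cmdline and /boot/config-<kernel release>. Booting once with intel_iommu=off on the kernel command line is a simple way to test whether the hang goes away.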
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Claudio Kuenzler wrote on Fri, Oct 22, 2021 at 2:03 PM:
>
> The fact that later kernel versions work fine _could_ be because of an hpwdt
> commit after 5.10:
> https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
> I have not tested sid or a newer kernel on our HP machines though.
> If you've compiled your own kernel and this one works (did you do a multiple
> reboot test?), maybe there's a difference in the kernel "config"?
>
> What happens if you disable the hpwdt module as mentioned in the other bug
> reports? Does Bullseye with 5.10 and experimental with 5.14 work in this case?

I tested upstream linux and debian-linux with the same config.
The upstream kernel works fine with it, while debian-linux has this problem.
I guess it is due to one patch by Debian.

--
YunQiang Su
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
The fact that later kernel versions work fine _could_ be because of an hpwdt
commit after 5.10:
https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
I have not tested sid or a newer kernel on our HP machines though.
If you've compiled your own kernel and this one works (did you do a multiple
reboot test?), maybe there's a difference in the kernel "config"?

What happens if you disable the hpwdt module as mentioned in the other bug
reports? Does Bullseye with 5.10 and experimental with 5.14 work in this case?
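The config comparison suggested above can be sketched in a few lines of Python. This is an illustration only, not something used in the thread. The function names are hypothetical, and the parser assumes the usual kernel .config layout ("CONFIG_X=value" lines and "# CONFIG_X is not set" comments).

```python
def parse_kconfig(text: str) -> dict:
    """Map each CONFIG_ symbol in a kernel .config text to its value."""
    options = {}
    not_set = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(not_set):
            # "# CONFIG_FOO is not set" is recorded as disabled ("n").
            options[line[2:-len(not_set)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(a_text: str, b_text: str) -> dict:
    """Return {symbol: (value_in_a, value_in_b)} for symbols that differ.

    A symbol missing from one side shows up with None on that side.
    """
    a, b = parse_kconfig(a_text), parse_kconfig(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Feeding it the config of the self-built kernel and the Debian one (for example the matching files under /boot) lists only the options whose values differ, which is easier to read than a raw textual diff.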
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Claudio Kuenzler wrote on Fri, Oct 22, 2021 at 1:18 PM:
>
> Also look at the following links and compare. Might be related or even the
> same as you are seeing:
>
> https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898336
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995773

I built the kernel myself (5.14.12, the same version as the current Debian sid
one); 5.14.14 was also tested. Neither triggers this problem, and I made sure
that the hpwdt module is loaded.

No idea why Debian's kernel does not work.

> On Thu, Oct 21, 2021 at 10:42 AM Yunqiang Su wrote:
> > [...]
> > The NICs can be detected now, while the machine continues to hang…
> > 4.19.y works fine, while 5.10 and 5.14 do not.
> >
> > I think that we need to dig more.

--
YunQiang Su
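One way to confirm that a module such as hpwdt is loaded is to look it up in /proc/modules. The Python sketch below is an illustration with hypothetical function names, not the check used in the thread. It assumes the standard /proc/modules layout, where the module name is the first field of each line, and it cannot see drivers that are built into the kernel image.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Return the set of module names listed in /proc/modules text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(proc_modules_text: str, name: str) -> bool:
    """Check for a module; /proc/modules spells names with underscores."""
    return name.replace("-", "_") in loaded_modules(proc_modules_text)
```

On a live system the input would be the contents of /proc/modules, for example `is_module_loaded(open("/proc/modules").read(), "hpwdt")`.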
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Also look at the following links and compare. Might be related or even the
same as you are seeing:

https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898336
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995773

On Thu, Oct 21, 2021 at 10:42 AM Yunqiang Su wrote:
> [...]
> The NICs can be detected now, while the machine continues to hang…
> 4.19.y works fine, while 5.10 and 5.14 do not.
>
> I think that we need to dig more.
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
On Fri, 10 Sep 2021 09:40:41 +0800 YunQiang Su wrote:
> Yunqiang Su wrote on Thu, Sep 9, 2021 at 11:11 AM:
> >
> > On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su wrote:
> > > Package: src:linux
> > > Version: 5.10
> > >
> > > After upgrading to bullseye's kernel, the system always hangs after
> > > about 10 min with an error in the IML log:
> > >
> > > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > > 0x0008, 0x8948)
> > >
> > > Kernel 5.14 from experimental also has this problem.
> > > Kernel 4.19 works fine.
> > > Fedora 34 seems to be working well.
> >
> > This is the output of dmesg and lspci from both Fedora 34 and Debian
> > bullseye.
> > I hope they are useful.
>
> Finally, we found the problem:
>
> https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
> https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf
>
> In the first patch:
>    they thought `err' was not used at all and removed it.
> In the second patch:
>    they added it back, but with a wrong value, "-EINVAL".
>
> Better KPI got.

The NICs can be detected now, while the machine continues to hang…
4.19.y works fine, while 5.10 and 5.14 do not.

I think that we need to dig more.
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Yunqiang Su wrote on Thu, Sep 9, 2021 at 11:11 AM:
>
> On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su wrote:
> > Package: src:linux
> > Version: 5.10
> >
> > After upgrading to bullseye's kernel, the system always hangs after
> > about 10 min with an error in the IML log:
> >
> > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > 0x0008, 0x8948)
> >
> > Kernel 5.14 from experimental also has this problem.
> > Kernel 4.19 works fine.
> > Fedora 34 seems to be working well.
>
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope they are useful.

Finally, we found the problem:

https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf

In the first patch:
   they thought `err' was not used at all and removed it.
In the second patch:
   they added it back, but with a wrong value, "-EINVAL".

Better KPI got.

--
YunQiang Su
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
On Thu, 9 Sep 2021 11:11:45 +0800 Yunqiang Su wrote:
>
> On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su wrote:
> > Package: src:linux
> > Version: 5.10
> >
> > After upgrading to bullseye's kernel, the system always hangs after
> > about 10 min with an error in the IML log:
> >
> > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > 0x0008, 0x8948)
> >
> > Kernel 5.14 from experimental also has this problem.
> > Kernel 4.19 works fine.
> > Fedora 34 seems to be working well.
>
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope they are useful.

The problem seems to be caused by the bnx2x driver or its firmware: if I purge
firmware-bnx2x, the OS does not hang (although there is no network connection
then).

I checked the md5sums of the firmware files in Bullseye: they match the Fedora
ones.
Note: the Fedora files are compressed with xz; I compared them after
decompressing.

My hardware requires:
    bnx2x-e2-7.13.15.0.fw
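The firmware comparison described above (an MD5 of the plain file against an MD5 of the xz-compressed file after decompression) can be reproduced with the Python standard library. This is a minimal sketch under the assumption that compressed files carry a .xz suffix. The function name is hypothetical and any paths passed to it are placeholders.

```python
import hashlib
import lzma


def firmware_md5(path) -> str:
    """Return the MD5 hex digest of a firmware file's uncompressed content.

    Files ending in .xz are decompressed on the fly; others are read as-is.
    """
    opener = lzma.open if str(path).endswith(".xz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in 1 MiB chunks so large blobs are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two copies of the same firmware, one plain and one xz-compressed, should then give identical digests, which matches what was observed between Bullseye and Fedora.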
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Attachments:
    dmesg.debian.xz (application/xz)
    dmesg.fedora.xz (application/xz)
    lspci.debian.xz (application/xz)
    lspci.fedora.xz (application/xz)

On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su wrote:
> Package: src:linux
> Version: 5.10
>
> After upgrading to bullseye's kernel, the system always hangs after about
> 10 min with an error in the IML log:
>
> An Unrecoverable System Error (NMI) has occurred (Service Information:
> 0x0008, 0x8948)
>
> Kernel 5.14 from experimental also has this problem.
> Kernel 4.19 works fine.
> Fedora 34 seems to be working well.

This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
I hope they are useful.
Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9
Package: src:linux
Version: 5.10

After upgrading to bullseye's kernel, the system always hangs after about
10 min with an error in the IML log:

An Unrecoverable System Error (NMI) has occurred (Service Information:
0x0008, 0x8948)

Kernel 5.14 from experimental also has this problem.
Kernel 4.19 works fine.
Fedora 34 seems to be working well.

--
YunQiang Su