Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-23 Thread YunQiang Su
On Fri, Oct 22, 2021 at 10:36 PM YunQiang Su wrote:
>
> On Fri, Oct 22, 2021 at 2:03 PM Claudio Kuenzler wrote:
> >
> > The fact that later kernel versions work fine _could_ be because of a
> > hpwdt commit after 5.10:
> > https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
> > I have not tested sid or a newer kernel on our HP machines, though.
> > If you've compiled your own kernel and it works (did you do a
> > multiple-reboot test?), maybe there's a difference in the kernel "config"?
> >
> > What happens if you disable the hpwdt module as mentioned in the other bug 
> > reports? Does Bullseye with 5.10 and experimental with 5.14 work in this 
> > case?
>
> I tested upstream Linux and debian-linux with the same config.
> The upstream kernel works fine, while debian-linux has this problem.
> I guess it is caused by one of Debian's patches.
>

I found the real problem: it is caused by intel_iommu being enabled by default.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=934309
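
As a rough way to check whether a given kernel is configured to switch the Intel IOMMU on, here is a minimal Python sketch. It assumes the build-time default comes from CONFIG_INTEL_IOMMU_DEFAULT_ON and that an intel_iommu=on or intel_iommu=off boot parameter overrides it. The function name and its two text arguments are illustrative and not taken from this report.

```python
import re


def intel_iommu_requested(kernel_config: str, cmdline: str) -> bool:
    """Return True if config plus command line ask for the Intel IOMMU to be on.

    kernel_config: text of the kernel build config
    cmdline: text of the kernel command line
    """
    # Without the driver built in, nothing can enable it.
    if re.search(r"^CONFIG_INTEL_IOMMU=y$", kernel_config, re.M) is None:
        return False

    # Build-time default.
    enabled = re.search(
        r"^CONFIG_INTEL_IOMMU_DEFAULT_ON=y$", kernel_config, re.M
    ) is not None

    # Boot-time override: the last on/off seen on the command line wins.
    for arg in cmdline.split():
        if arg.startswith("intel_iommu="):
            for opt in arg.split("=", 1)[1].split(","):
                if opt == "on":
                    enabled = True
                elif opt == "off":
                    enabled = False
    return enabled
```

The inputs would typically be the contents of /boot/config-<version> and /proc/cmdline. The function only reports what the configuration asks for, not whether DMA remapping is actually active on the hardware.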




-- 
YunQiang Su



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-22 Thread YunQiang Su
On Fri, Oct 22, 2021 at 2:03 PM Claudio Kuenzler wrote:
>
> The fact that later kernel versions work fine _could_ be because of a hpwdt
> commit after 5.10:
> https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
> I have not tested sid or a newer kernel on our HP machines, though.
> If you've compiled your own kernel and it works (did you do a multiple-reboot
> test?), maybe there's a difference in the kernel "config"?
>
> What happens if you disable the hpwdt module as mentioned in the other bug 
> reports? Does Bullseye with 5.10 and experimental with 5.14 work in this case?

I tested upstream Linux and debian-linux with the same config.
The upstream kernel works fine, while debian-linux has this problem.
I guess it is caused by one of Debian's patches.

-- 
YunQiang Su



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-22 Thread Claudio Kuenzler
The fact that later kernel versions work fine _could_ be because of a
hpwdt commit after 5.10:
https://github.com/torvalds/linux/commit/acc195bd2cc48445ea35d00036d8c0afcc4fcc9c#diff-994ee4b010b5c6222ad7a20e160f733401f46894b36fa3e1fb6ffbb48bedb817
I have not tested sid or a newer kernel on our HP machines, though.
If you've compiled your own kernel and it works (did you do a
multiple-reboot test?), maybe there's a difference in the kernel "config"?

What happens if you disable the hpwdt module as mentioned in the other bug
reports? Does Bullseye with 5.10 and experimental with 5.14 work in this
case?
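
To look for a difference between two kernel configs, something along these lines could be used. This is a minimal Python sketch: it assumes the usual format of `CONFIG_FOO=value` lines and `# CONFIG_FOO is not set` comments, and the function names are illustrative.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map each option in a kernel config text to its value ("n" if not set)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_kconfig(a, b):
    """Return {option: (value_in_a, value_in_b)} for every option that differs.

    An option missing from one side is reported with None on that side.
    """
    pa, pb = parse_kconfig(a), parse_kconfig(b)
    return {
        key: (pa.get(key), pb.get(key))
        for key in sorted(pa.keys() | pb.keys())
        if pa.get(key) != pb.get(key)
    }
```

Feeding it the config of the self-compiled kernel and the config of the distribution kernel would list only the options whose values differ.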


Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-21 Thread YunQiang Su
On Fri, Oct 22, 2021 at 1:18 PM Claudio Kuenzler wrote:
>
> Also look at the following links and compare. They might be related to, or
> even the same as, what you are seeing:
>
> https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898336
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995773
>

I built a kernel myself (5.14.12, the same version as the one currently in
Debian sid; I also tested 5.14.14).
It does not trigger this problem,
and I made sure that the hpwdt module was loaded.

I have no idea why Debian's kernel does not work.
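
One way to confirm that a module such as hpwdt is loaded is to look at /proc/modules, where each line starts with the module name. A minimal Python sketch, with an illustrative function name:

```python
def module_loaded(proc_modules: str, name: str) -> bool:
    """Return True if `name` appears as a module in the given /proc/modules text."""
    # The kernel lists module names with underscores, so normalise hyphens.
    wanted = name.replace("-", "_")
    for line in proc_modules.splitlines():
        fields = line.split()
        if fields and fields[0] == wanted:
            return True
    return False
```

It would be called with the contents of /proc/modules, for example `module_loaded(open("/proc/modules").read(), "hpwdt")`.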



-- 
YunQiang Su



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-21 Thread Claudio Kuenzler
Also look at the following links and compare. They might be related to, or
even the same as, what you are seeing:

https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898336
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995773




Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-10-21 Thread Yunqiang Su
On Fri, 10 Sep 2021 09:40:41 +0800 YunQiang Su  wrote:
> On Thu, Sep 9, 2021 at 11:11 AM Yunqiang Su wrote:
> >
> >
> > On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su  wrote:
> > > Package: src:linux
> > > Version: 5.10
> > >
> > > After upgrading to bullseye's kernel, the system always hangs after about
> > > 10 minutes, with this error in the IML log:
> > >
> > > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > > 0x0008, 0x8948)
> > >
> > > Kernel 5.14 from experimental also has this problem.
> > > Kernel 4.19 works fine.
> > > Fedora 34 seems to be working well.
> >
> > This is the output of dmesg and lspci from both Fedora 34 and Debian
> > bullseye.
> > I hope it is useful.
> >
> 
> Finally, we found the problem:
> 
> https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
> https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf
> 
> In the first patch:
>    They thought `err' was not used at all and removed it.
> In the second patch:
>    They added it back, but with a wrong value, "-EINVAL".
> 
> Better KPI got.
> 

The NICs are detected now, but the machine continues to hang.
4.19.y works fine, while 5.10 and 5.14 do not.

I think we need to dig further.




Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-09-09 Thread YunQiang Su
On Thu, Sep 9, 2021 at 11:11 AM Yunqiang Su wrote:
>
>
> On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su  wrote:
> > Package: src:linux
> > Version: 5.10
> >
> > After upgrading to bullseye's kernel, the system always hangs after about
> > 10 minutes, with this error in the IML log:
> >
> > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > 0x0008, 0x8948)
> >
> > Kernel 5.14 from experimental also has this problem.
> > Kernel 4.19 works fine.
> > Fedora 34 seems to be working well.
>
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope it is useful.
>

Finally, we found the problem:

https://github.com/torvalds/linux/commit/8343b1f8b97ac016150c8303f95b63b20b98edf8
https://github.com/torvalds/linux/commit/65161c35554f7135e6656b3df1ce2c500ca0bdcf

In the first patch:
   They thought `err' was not used at all and removed it.
In the second patch:
   They added it back, but with a wrong value, "-EINVAL".

Better KPI got.




-- 
YunQiang Su



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-09-09 Thread suyunqiang
On Thu, 9 Sep 2021 11:11:45 +0800 Yunqiang Su  wrote:
> 
> On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su  wrote:
> > Package: src:linux
> > Version: 5.10
> > 
> > After upgrading to bullseye's kernel, the system always hangs after about
> > 10 minutes, with this error in the IML log:
> > 
> > An Unrecoverable System Error (NMI) has occurred (Service Information:
> > 0x0008, 0x8948)
> > 
> > Kernel 5.14 from experimental also has this problem.
> > Kernel 4.19 works fine.
> > Fedora 34 seems to be working well.
> 
> This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
> I hope it is useful.
> 

The problem seems to be caused by the bnx2x driver or its firmware:
if I purge firmware-bnx2x, the OS no longer hangs (although there is then no
network connection).

I checked the md5sum of the firmware files in bullseye: they match the
Fedora ones.
Note: the Fedora files are compressed with xz; I compared them after
decompressing.

My hardware requires: bnx2x-e2-7.13.15.0.fw
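
The comparison described above (checksum of the raw firmware file against the checksum of the xz-compressed copy after decompression) can be sketched in Python as follows. This is only an illustration under the assumption that the compressed copies are ordinary .xz streams; the function name is hypothetical.

```python
import hashlib
import lzma

# Every .xz stream starts with these six bytes.
XZ_MAGIC = b"\xfd7zXZ\x00"


def firmware_md5(blob: bytes) -> str:
    """Return the MD5 hex digest of a firmware image, decompressing xz first."""
    if blob.startswith(XZ_MAGIC):
        blob = lzma.decompress(blob)
    return hashlib.md5(blob).hexdigest()
```

Reading both copies of bnx2x-e2-7.13.15.0.fw as bytes and comparing the two return values gives the same answer as running md5sum on the decompressed files.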



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-09-08 Thread Yunqiang Su


dmesg.debian.xz
Description: application/xz


dmesg.fedora.xz
Description: application/xz


lspci.debian.xz
Description: application/xz


lspci.fedora.xz
Description: application/xz

On Wed, 8 Sep 2021 20:53:27 +0800 YunQiang Su  wrote:
> Package: src:linux
> Version: 5.10
> 
> After upgrading to bullseye's kernel, the system always hangs after about
> 10 minutes, with this error in the IML log:
> 
> An Unrecoverable System Error (NMI) has occurred (Service Information:
> 0x0008, 0x8948)
> 
> Kernel 5.14 from experimental also has this problem.
> Kernel 4.19 works fine.
> Fedora 34 seems to be working well.

This is the output of dmesg and lspci from both Fedora 34 and Debian bullseye.
I hope it is useful.



Bug#993948: kernel/amd64: system hang on HPE ProLiant BL460c Gen9

2021-09-08 Thread YunQiang Su
Package: src:linux
Version: 5.10

After upgrading to bullseye's kernel, the system always hangs after about
10 minutes, with this error in the IML log:

An Unrecoverable System Error (NMI) has occurred (Service Information:
0x0008, 0x8948)

Kernel 5.14 from experimental also has this problem.
Kernel 4.19 works fine.
Fedora 34 seems to be working well.

--
YunQiang Su