Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-12 Thread Jörg Rödel
On Thu, May 12, 2022 at 09:47:29AM -0400, Alex Deucher wrote:
> Are those new?  Maybe the card is not seated correctly?  Can you try
> another slot?

I can't remember having seen these TLP error messages with older
kernels. 5.17 still works fine with this card.

I will try to put the card into another slot tomorrow.

> As for the null pointer deref in the display code, @Wentland, Harry
> any ideas?  I don't see why that should happen.  Maybe some hotplug
> pin is faulty or the display has input detection and that is causing
> some sort of hotplug interrupt that causes a race somewhere in the
> driver?  Can you make sure the monitor connector is firmly seated on
> the GPU?

The connectors are fine; the displays are connected via miniDP on the
GPU and DP on the display side. On the other hand, my monitors do not
seem to be of the highest quality. Occasionally the resolution is
detected wrongly or the DP signal is lost. I am not sure why; I suspect
there is some interference between the two DP cables. But this has been
a problem for as long as I have had these two monitors, whereas the
NULL-ptr deref only happens with v5.18.

Regards,

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev



Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-12 Thread Alex Deucher
On Thu, May 12, 2022 at 4:35 AM Jörg Rödel wrote:
>
> On Tue, May 10, 2022 at 04:41:57PM -0400, Alex Deucher wrote:
> > Does setting amdgpu.runpm=0 on the kernel command line in grub help?
> > If so, that should be fixed with:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f95af4a9236695caed24fe6401256bb974e8f2a7
>
> Unfortunately, no, this option doesn't help. Tested with v5.18-rc6, full
> dmesg attached.
>
> Any idea what might be causing the BadTLP messages?

Are those new?  Maybe the card is not seated correctly?  Can you try
another slot?

As for the null pointer deref in the display code, @Wentland, Harry
any ideas?  I don't see why that should happen.  Maybe some hotplug
pin is faulty or the display has input detection and that is causing
some sort of hotplug interrupt that causes a race somewhere in the
driver?  Can you make sure the monitor connector is firmly seated on
the GPU?

Alex




Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-12 Thread Jörg Rödel
On Tue, May 10, 2022 at 04:41:57PM -0400, Alex Deucher wrote:
> Does setting amdgpu.runpm=0 on the kernel command line in grub help?
> If so, that should be fixed with:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f95af4a9236695caed24fe6401256bb974e8f2a7

Unfortunately, no, this option doesn't help. Tested with v5.18-rc6, full
dmesg attached.

Any idea what might be causing the BadTLP messages?
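
For reference, the bits flagged in the AER report from this thread can be
decoded from the status word by hand. What follows is a minimal Python
sketch, not kernel code. The bit-to-name table is an assumption based on
the abbreviations the Linux AER driver prints for corrected errors, and
decode_correctable is a hypothetical helper name. The status and mask
values (0x00c0 and 0x2000) are the ones printed in the log.

```python
# Assumed mapping of Corrected Error Status bits to the short names
# printed by the Linux AER driver (illustrative, not copied from the kernel).
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status, mask=0):
    """Return (bit, name, masked) for every bit set in `status`.

    `masked` is True when the same bit is also set in `mask`, meaning the
    device is configured not to report that error type.
    """
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_CORRECTABLE_BITS.get(bit, "Unknown")
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


if __name__ == "__main__":
    # Values taken from the "error status/mask" line in the report.
    for bit, name, masked in decode_correctable(0x00c0, 0x2000):
        suffix = " (masked)" if masked else ""
        print(f"[{bit:2d}] {name}{suffix}")
```

With the values from the report, only bits 6 and 7 are set, which matches
the BadTLP and BadDLLP lines in the log. The mask value has only bit 13
set, so neither of the two reported errors is masked.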

Regards,

Joerg

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

[0.00] Linux version 5.18.0-rc6-vanilla (j...@cap.home.8bytes.org) (gcc 
(SUSE Linux) 11.2.1 20220420 [revision 
691af15031e00227ba6d5935c1d737026cda4129], GNU ld (GNU Binutils; openSUSE 
Tumbleweed) 2.38.20220411-4) #2 SMP PREEMPT_DYNAMIC Mon May 9 09:43:39 CEST 2022
[0.00] Command line: BOOT_IMAGE=/vmlinuz-5.18.0-rc6-vanilla 
root=/dev/mapper/cap_vg-root splash=silent resume=/dev/cap_vg/swap 
mitigations=auto quiet iommu=nopt amdgpu.runpm=0
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, 
using 'compacted' format.
[0.00] signal: max sigframe size: 1776
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009] usable
[0.00] BIOS-e820: [mem 0x000a-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x09cf] usable
[0.00] BIOS-e820: [mem 0x09d0-0x09ff] reserved
[0.00] BIOS-e820: [mem 0x0a00-0x0a1f] usable
[0.00] BIOS-e820: [mem 0x0a20-0x0a20afff] ACPI NVS
[0.00] BIOS-e820: [mem 0x0a20b000-0x0aff] usable
[0.00] BIOS-e820: [mem 0x0b00-0x0b01] reserved
[0.00] BIOS-e820: [mem 0x0b02-0xd0214fff] usable
[0.00] BIOS-e820: [mem 0xd0215000-0xd0236fff] ACPI data
[0.00] BIOS-e820: [mem 0xd0237000-0xd9103fff] usable
[0.00] BIOS-e820: [mem 0xd9104000-0xd9273fff] reserved
[0.00] BIOS-e820: [mem 0xd9274000-0xd9282fff] ACPI data
[0.00] BIOS-e820: [mem 0xd9283000-0xd939bfff] usable
[0.00] BIOS-e820: [mem 0xd939c000-0xd9790fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd9791000-0xda58cfff] reserved
[0.00] BIOS-e820: [mem 0xda58d000-0xdcff] usable
[0.00] BIOS-e820: [mem 0xdd00-0xdfff] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfd10-0xfd1f] reserved
[0.00] BIOS-e820: [mem 0xfea0-0xfea0] reserved
[0.00] BIOS-e820: [mem 0xfeb8-0xfec01fff] reserved
[0.00] BIOS-e820: [mem 0xfec1-0xfec10fff] reserved
[0.00] BIOS-e820: [mem 0xfec3-0xfec30fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed00fff] reserved
[0.00] BIOS-e820: [mem 0xfed4-0xfed44fff] reserved
[0.00] BIOS-e820: [mem 0xfed8-0xfed8] reserved
[0.00] BIOS-e820: [mem 0xfedc2000-0xfedc] reserved
[0.00] BIOS-e820: [mem 0xfedd4000-0xfedd5fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfeef] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00101f37] usable
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0xcfe9b018-0xcfedca57] usable ==> usable
[0.00] e820: update [mem 0xcfe9b018-0xcfedca57] usable ==> usable
[0.00] e820: update [mem 0xcfe59018-0xcfe9aa57] usable ==> usable
[0.00] e820: update [mem 0xcfe59018-0xcfe9aa57] usable ==> usable
[0.00] e820: update [mem 0xcfe47018-0xcfe58057] usable ==> usable
[0.00] e820: update [mem 0xcfe47018-0xcfe58057] usable ==> usable
[0.00] e820: update [mem 0xcfe2a018-0xcfe46c57] usable ==> usable
[0.00] e820: update [mem 0xcfe2a018-0xcfe46c57] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0009] 
usable
[0.00] reserve setup_data: [mem 0x000a-0x000f] 
reserved
[0.00] reserve setup_data: 

Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-10 Thread Alex Deucher
On Tue, May 10, 2022 at 2:17 PM Jörg Rödel wrote:
>
>
> > On 10.05.2022 at 17:31, Alex Deucher wrote:
> >
> > On Tue, May 10, 2022 at 7:12 AM Jörg Rödel wrote:
> >>
> >> Gentle ping. This is a 5.18 regression and I also see it with
> >> 5.18-rc6. Please let me know if you need anything else to debug.
> >>
> >
> > Are you doing anything special when it happens?  I.e., does it happen
> > when the monitor is coming out of DPMS or something like that?
> >
>
> Yes, it usually happens when I return to the machine and press a key on
> the keyboard to get the screens enabled again. It doesn't always happen; it
> seems to depend on how slowly the monitors come out of power-saving mode.
>

Does setting amdgpu.runpm=0 on the kernel command line in grub help?
If so, that should be fixed with:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f95af4a9236695caed24fe6401256bb974e8f2a7
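
To confirm that a test boot really ran with the parameter set, the kernel
command line can be checked after boot. The following is a small Python
sketch under stated assumptions: cmdline_param and read_cmdline are
hypothetical helper names, and the parsing is a simplification that splits
on whitespace only, so it ignores quoted values and does not treat dashes
and underscores in parameter names as equivalent.

```python
def cmdline_param(cmdline, name):
    """Return the last value given for `name` on a kernel command line.

    Returns None when the parameter does not appear as name=value.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # later occurrences override earlier ones
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    print("amdgpu.runpm =", cmdline_param(read_cmdline(), "amdgpu.runpm"))
```

Run against the command line from the attached dmesg, which ends in
"iommu=nopt amdgpu.runpm=0", the helper returns the string "0".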

Alex




Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-10 Thread Jörg Rödel


> On 10.05.2022 at 17:31, Alex Deucher wrote:
> 
> On Tue, May 10, 2022 at 7:12 AM Jörg Rödel wrote:
>> 
>> Gentle ping. This is a 5.18 regression and I also see it with
>> 5.18-rc6. Please let me know if you need anything else to debug.
>> 
> 
> Are you doing anything special when it happens?  I.e., does it happen
> when the monitor is coming out of DPMS or something like that?
> 

Yes, it usually happens when I return to the machine and press a key on
the keyboard to get the screens enabled again. It doesn't always happen; it
seems to depend on how slowly the monitors come out of power-saving mode.

Regards,

Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
 
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-10 Thread Jörg Rödel
Gentle ping. This is a 5.18 regression and I also see it with
5.18-rc6. Please let me know if you need anything else to debug.

Thanks,

Joerg

On Fri, May 06, 2022 at 09:16:12AM -0400, Alex Deucher wrote:
> + some display folks
> 
> On Fri, May 6, 2022 at 6:19 AM Jörg Rödel wrote:
> >
> > Hi,
> >
> > I recently started to experience warnings and NULL-ptr
> > dereferences in the amdgpu driver with kernel 5.18-rc5+. Earlier
> > 5.18-based kernels might be affected as well, but I haven't seen this
> > with 5.17.
> >
> > The kernel was built from the iommu-next branch, based on 5.18-rc5.
> >
> > The messages start with some PCIe error being reported:
> >
> > [20389.984993] pcieport :00:03.1: AER: Multiple Corrected error 
> > received: :0a:00.0
> > [20389.985005] amdgpu :0a:00.0: PCIe Bus Error: severity=Corrected, 
> > type=Data Link Layer, (Receiver ID)
> > [20389.985007] amdgpu :0a:00.0:   device [1002:6995] error 
> > status/mask=00c0/2000
> > [20389.985010] amdgpu :0a:00.0:[ 6] BadTLP
> > [20389.985013] amdgpu :0a:00.0:[ 7] BadDLLP
> >
> > Directly followed by a warning:
> >
> > [81829.087101] [ cut here ]
> > [81829.087105] WARNING: CPU: 4 PID: 644 at 
> > drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce110/dce110_clk_mgr.c:140
> >  dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> > [81829.087461] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) 
> > snd_seq(E) rfcomm(E) af_packet(E) ocrdma(E) ib_uverbs(E) ib_core(E) 
> > nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) 
> > nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) 
> > nft_ct(E) nft_chain_nat(E) nf_tables(E) ebtable_nat(E) ebtable_broute(E) 
> > ip6table_nat(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) 
> > iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) 
> > nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) 
> > ip_set(E) nfnetlink(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) 
> > ip6_tables(E) iptable_filter(E) bpfilter(E) cmac(E) algif_hash(E) 
> > algif_skcipher(E) af_alg(E) bnep(E) dmi_sysfs(E) intel_rapl_msr(E) 
> > intel_rapl_common(E) snd_hda_codec_realtek(E) eeepc_wmi(E) btusb(E) 
> > asus_wmi(E) kvm_amd(E) btrtl(E) snd_hda_codec_generic(E) nls_iso8859_1(E) 
> > battery(E) uvcvideo(E) sparse_keymap(E) ledtrig_audio(E) btbcm(E) video(E) 
> > wmi_bmof(E)
> > [81829.087502]  platform_profile(E) mxm_wmi(E) snd_hda_codec_hdmi(E) 
> > nls_cp437(E) videobuf2_vmalloc(E) btintel(E) asus_wmi_sensors(E) btmtk(E) 
> > vfat(E) snd_hda_intel(E) videobuf2_memops(E) videobuf2_v4l2(E) 
> > snd_intel_dspcfg(E) bluetooth(E) kvm(E) videobuf2_common(E) fat(E) 
> > snd_usb_audio(E) snd_virtuoso(E) irqbypass(E) snd_usbmidi_lib(E) 
> > snd_hda_codec(E) snd_oxygen_lib(E) videodev(E) snd_hwdep(E) 
> > snd_mpu401_uart(E) mc(E) snd_hda_core(E) ecdh_generic(E) snd_rawmidi(E) 
> > snd_pcm(E) snd_seq_device(E) rfkill(E) pcspkr(E) i2c_piix4(E) efi_pstore(E) 
> > k10temp(E) snd_timer(E) ext4(E) igb(E) snd(E) dca(E) soundcore(E) 
> > mbcache(E) be2net(E) jbd2(E) wmi(E) gpio_amdpt(E) gpio_generic(E) 
> > tiny_power_button(E) button(E) acpi_cpufreq(E) fuse(E) configfs(E) 
> > ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) dm_crypt(E) essiv(E) 
> > authenc(E) trusted(E) asn1_encoder(E) tee(E) hid_logitech_hidpp(E) 
> > hid_logitech_dj(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) uas(E) 
> > usb_storage(E) amdgpu(E)
> > [81829.087551]  drm_ttm_helper(E) ttm(E) iommu_v2(E) gpu_sched(E) 
> > i2c_algo_bit(E) crct10dif_pclmul(E) drm_dp_helper(E) crc32_pclmul(E) 
> > crc32c_intel(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
> > sysimgblt(E) fb_sys_fops(E) drm(E) xhci_pci(E) cec(E) 
> > ghash_clmulni_intel(E) xhci_pci_renesas(E) aesni_intel(E) crypto_simd(E) 
> > cryptd(E) sp5100_tco(E) xhci_hcd(E) ccp(E) rc_core(E) nvme(E) usbcore(E) 
> > nvme_core(E) sg(E) br_netfilter(E) bridge(E) stp(E) llc(E) dm_multipath(E) 
> > dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) ledtrig_timer(E) 
> > msr(E) efivarfs(E)
> > [81829.087581] CPU: 4 PID: 644 Comm: kworker/4:1H Tainted: GE   
> >   5.18.0-rc5-iommu-next+ #1 4d1b12f73ec264927e45e8f2e5d1c0c8e280bc7d
> > [81829.087585] Hardware name: System manufacturer System Product Name/PRIME 
> > X470-PRO, BIOS 5406 11/13/2019
> > [81829.087588] Workqueue: events_highpri dm_irq_work_func [amdgpu]
> > [81829.087928] RIP: 0010:dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> > [81829.088274] Code: 31 ff 4d 8d 98 f0 01 00 00 49 8b 0c f8 4c 89 da 31 c0 
> > 48 39 0a 0f 84 e4 00 00 00 83 c0 01 48 81 c2 10 08 00 00 83 f8 06 75 e8 
> > <0f> 0b 31 c0 80 b9 48 03 00 00 00 0f 85 a9 00 00 00 48 8b 50 08 8b
> > [81829.088277] RSP: 0018:b891810c3be0 EFLAGS: 00010246
> > [81829.088280] RAX: 0006 RBX: 9719a6b6 RCX: 
> > 971d08e07800
> > [81829.088282] RDX: 9719a6b63250 RSI: 9719a6b72980 RDI: 
> > 0001
> > [81829.088284] 

Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-10 Thread Alex Deucher
On Tue, May 10, 2022 at 7:12 AM Jörg Rödel wrote:
>
> Gentle ping. This is a 5.18 regression and I also see it with
> 5.18-rc6. Please let me know if you need anything else to debug.
>

Are you doing anything special when it happens?  I.e., does it happen
when the monitor is coming out of DPMS or something like that?

Alex

> Thanks,
>
> Joerg
>
> On Fri, May 06, 2022 at 09:16:12AM -0400, Alex Deucher wrote:
> > + some display folks
> >
> > On Fri, May 6, 2022 at 6:19 AM Jörg Rödel wrote:
> > >
> > > Hi,
> > >
> > > I recently started to experience warnings and NULL-ptr
> > > dereferences in the amdgpu driver with kernel 5.18-rc5+. Earlier
> > > 5.18-based kernels might be affected as well, but I haven't seen this
> > > with 5.17.
> > >
> > > The kernel was built from the iommu-next branch, based on 5.18-rc5.
> > >
> > > The messages start with some PCIe error being reported:
> > >
> > > [20389.984993] pcieport :00:03.1: AER: Multiple Corrected error 
> > > received: :0a:00.0
> > > [20389.985005] amdgpu :0a:00.0: PCIe Bus Error: severity=Corrected, 
> > > type=Data Link Layer, (Receiver ID)
> > > [20389.985007] amdgpu :0a:00.0:   device [1002:6995] error 
> > > status/mask=00c0/2000
> > > [20389.985010] amdgpu :0a:00.0:[ 6] BadTLP
> > > [20389.985013] amdgpu :0a:00.0:[ 7] BadDLLP
> > >
> > > Directly followed by a warning:
> > >
> > > [81829.087101] [ cut here ]
> > > [81829.087105] WARNING: CPU: 4 PID: 644 at 
> > > drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce110/dce110_clk_mgr.c:140
> > >  dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> > > [81829.087461] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) 
> > > snd_seq(E) rfcomm(E) af_packet(E) ocrdma(E) ib_uverbs(E) ib_core(E) 
> > > nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) 
> > > nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) 
> > > nft_ct(E) nft_chain_nat(E) nf_tables(E) ebtable_nat(E) ebtable_broute(E) 
> > > ip6table_nat(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) 
> > > iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) 
> > > nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) 
> > > ip_set(E) nfnetlink(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) 
> > > ip6_tables(E) iptable_filter(E) bpfilter(E) cmac(E) algif_hash(E) 
> > > algif_skcipher(E) af_alg(E) bnep(E) dmi_sysfs(E) intel_rapl_msr(E) 
> > > intel_rapl_common(E) snd_hda_codec_realtek(E) eeepc_wmi(E) btusb(E) 
> > > asus_wmi(E) kvm_amd(E) btrtl(E) snd_hda_codec_generic(E) nls_iso8859_1(E) 
> > > battery(E) uvcvideo(E) sparse_keymap(E) ledtrig_audio(E) btbcm(E) 
> > > video(E) wmi_bmof(E)
> > > [81829.087502]  platform_profile(E) mxm_wmi(E) snd_hda_codec_hdmi(E) 
> > > nls_cp437(E) videobuf2_vmalloc(E) btintel(E) asus_wmi_sensors(E) btmtk(E) 
> > > vfat(E) snd_hda_intel(E) videobuf2_memops(E) videobuf2_v4l2(E) 
> > > snd_intel_dspcfg(E) bluetooth(E) kvm(E) videobuf2_common(E) fat(E) 
> > > snd_usb_audio(E) snd_virtuoso(E) irqbypass(E) snd_usbmidi_lib(E) 
> > > snd_hda_codec(E) snd_oxygen_lib(E) videodev(E) snd_hwdep(E) 
> > > snd_mpu401_uart(E) mc(E) snd_hda_core(E) ecdh_generic(E) snd_rawmidi(E) 
> > > snd_pcm(E) snd_seq_device(E) rfkill(E) pcspkr(E) i2c_piix4(E) 
> > > efi_pstore(E) k10temp(E) snd_timer(E) ext4(E) igb(E) snd(E) dca(E) 
> > > soundcore(E) mbcache(E) be2net(E) jbd2(E) wmi(E) gpio_amdpt(E) 
> > > gpio_generic(E) tiny_power_button(E) button(E) acpi_cpufreq(E) fuse(E) 
> > > configfs(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) dm_crypt(E) 
> > > essiv(E) authenc(E) trusted(E) asn1_encoder(E) tee(E) 
> > > hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) 
> > > sr_mod(E) cdrom(E) uas(E) usb_storage(E) amdgpu(E)
> > > [81829.087551]  drm_ttm_helper(E) ttm(E) iommu_v2(E) gpu_sched(E) 
> > > i2c_algo_bit(E) crct10dif_pclmul(E) drm_dp_helper(E) crc32_pclmul(E) 
> > > crc32c_intel(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
> > > sysimgblt(E) fb_sys_fops(E) drm(E) xhci_pci(E) cec(E) 
> > > ghash_clmulni_intel(E) xhci_pci_renesas(E) aesni_intel(E) crypto_simd(E) 
> > > cryptd(E) sp5100_tco(E) xhci_hcd(E) ccp(E) rc_core(E) nvme(E) usbcore(E) 
> > > nvme_core(E) sg(E) br_netfilter(E) bridge(E) stp(E) llc(E) 
> > > dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
> > > ledtrig_timer(E) msr(E) efivarfs(E)
> > > [81829.087581] CPU: 4 PID: 644 Comm: kworker/4:1H Tainted: GE 
> > > 5.18.0-rc5-iommu-next+ #1 4d1b12f73ec264927e45e8f2e5d1c0c8e280bc7d
> > > [81829.087585] Hardware name: System manufacturer System Product 
> > > Name/PRIME X470-PRO, BIOS 5406 11/13/2019
> > > [81829.087588] Workqueue: events_highpri dm_irq_work_func [amdgpu]
> > > [81829.087928] RIP: 0010:dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> > > [81829.088274] Code: 31 ff 4d 8d 98 f0 01 00 00 49 8b 0c f8 4c 89 da 31 
> > > c0 48 39 0a 0f 84 e4 00 00 00 83 c0 01 48 81 c2 10 

Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-07 Thread Jörg Rödel
On Fri, May 06, 2022 at 08:30:13AM +0200, Jörg Rödel wrote:
> [81829.087101] [ cut here ]
> [81829.087105] WARNING: CPU: 4 PID: 644 at 
> drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce110/dce110_clk_mgr.c:140 
> dce110_fill_display_configs+0x4a/0x150 [amdgpu]

Same just happened with a kernel built from latest upstream, based on
commit fe27d189e3f42e31d3c8223d5daed7285e334c5e. So it's at least not
the iommu changes causing it :)

Please let me know if I can be of any help debugging this further.
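
When comparing traces from different builds, it can help to pull the
symbol, offset and function size out of the WARNING or RIP line. The
following is a minimal Python sketch under stated assumptions:
parse_location is a hypothetical helper name, and the regular expression
only covers the symbol+offset/size [module] form seen in this thread.

```python
import re

# Matches e.g. "dce110_fill_display_configs+0x4a/0x150 [amdgpu]".
# The trailing "[module]" part is optional (built-in code has none).
LOCATION_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_location(text):
    """Extract symbol, offset, size and module from an oops-style line.

    Returns a dict, or None when no symbol+offset/size pattern is found.
    """
    match = LOCATION_RE.search(text)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
    }


if __name__ == "__main__":
    line = "RIP: 0010:dce110_fill_display_configs+0x4a/0x150 [amdgpu]"
    print(parse_location(line))
```

The symbol+offset/size string is also the form that the kernel tree's
scripts/faddr2line helper accepts, together with the object file, to map
the location back to a source line.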

Thanks,

Joerg


Re: [BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-06 Thread Alex Deucher
+ some display folks

On Fri, May 6, 2022 at 6:19 AM Jörg Rödel wrote:
>
> Hi,
>
> I recently started to experience warnings and NULL-ptr
> dereferences in the amdgpu driver with kernel 5.18-rc5+. Earlier
> 5.18-based kernels might be affected as well, but I haven't seen this
> with 5.17.
>
> The kernel was built from the iommu-next branch, based on 5.18-rc5.
>
> The messages start with some PCIe error being reported:
>
> [20389.984993] pcieport :00:03.1: AER: Multiple Corrected error received: 
> :0a:00.0
> [20389.985005] amdgpu :0a:00.0: PCIe Bus Error: severity=Corrected, 
> type=Data Link Layer, (Receiver ID)
> [20389.985007] amdgpu :0a:00.0:   device [1002:6995] error 
> status/mask=00c0/2000
> [20389.985010] amdgpu :0a:00.0:[ 6] BadTLP
> [20389.985013] amdgpu :0a:00.0:[ 7] BadDLLP
>
> Directly followed by a warning:
>
> [81829.087101] [ cut here ]
> [81829.087105] WARNING: CPU: 4 PID: 644 at 
> drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce110/dce110_clk_mgr.c:140 
> dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> [81829.087461] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) 
> rfcomm(E) af_packet(E) ocrdma(E) ib_uverbs(E) ib_core(E) nft_fib_inet(E) 
> nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) 
> nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) 
> nf_tables(E) ebtable_nat(E) ebtable_broute(E) ip6table_nat(E) 
> ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_nat(E) 
> nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) 
> iptable_mangle(E) iptable_raw(E) iptable_security(E) ip_set(E) nfnetlink(E) 
> ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) 
> iptable_filter(E) bpfilter(E) cmac(E) algif_hash(E) algif_skcipher(E) 
> af_alg(E) bnep(E) dmi_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) 
> snd_hda_codec_realtek(E) eeepc_wmi(E) btusb(E) asus_wmi(E) kvm_amd(E) 
> btrtl(E) snd_hda_codec_generic(E) nls_iso8859_1(E) battery(E) uvcvideo(E) 
> sparse_keymap(E) ledtrig_audio(E) btbcm(E) video(E) wmi_bmof(E)
> [81829.087502]  platform_profile(E) mxm_wmi(E) snd_hda_codec_hdmi(E) 
> nls_cp437(E) videobuf2_vmalloc(E) btintel(E) asus_wmi_sensors(E) btmtk(E) 
> vfat(E) snd_hda_intel(E) videobuf2_memops(E) videobuf2_v4l2(E) 
> snd_intel_dspcfg(E) bluetooth(E) kvm(E) videobuf2_common(E) fat(E) 
> snd_usb_audio(E) snd_virtuoso(E) irqbypass(E) snd_usbmidi_lib(E) 
> snd_hda_codec(E) snd_oxygen_lib(E) videodev(E) snd_hwdep(E) 
> snd_mpu401_uart(E) mc(E) snd_hda_core(E) ecdh_generic(E) snd_rawmidi(E) 
> snd_pcm(E) snd_seq_device(E) rfkill(E) pcspkr(E) i2c_piix4(E) efi_pstore(E) 
> k10temp(E) snd_timer(E) ext4(E) igb(E) snd(E) dca(E) soundcore(E) mbcache(E) 
> be2net(E) jbd2(E) wmi(E) gpio_amdpt(E) gpio_generic(E) tiny_power_button(E) 
> button(E) acpi_cpufreq(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) xfs(E) 
> libcrc32c(E) dm_crypt(E) essiv(E) authenc(E) trusted(E) asn1_encoder(E) 
> tee(E) hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) 
> sr_mod(E) cdrom(E) uas(E) usb_storage(E) amdgpu(E)
> [81829.087551]  drm_ttm_helper(E) ttm(E) iommu_v2(E) gpu_sched(E) 
> i2c_algo_bit(E) crct10dif_pclmul(E) drm_dp_helper(E) crc32_pclmul(E) 
> crc32c_intel(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) 
> fb_sys_fops(E) drm(E) xhci_pci(E) cec(E) ghash_clmulni_intel(E) 
> xhci_pci_renesas(E) aesni_intel(E) crypto_simd(E) cryptd(E) sp5100_tco(E) 
> xhci_hcd(E) ccp(E) rc_core(E) nvme(E) usbcore(E) nvme_core(E) sg(E) 
> br_netfilter(E) bridge(E) stp(E) llc(E) dm_multipath(E) dm_mod(E) 
> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) ledtrig_timer(E) msr(E) 
> efivarfs(E)
> [81829.087581] CPU: 4 PID: 644 Comm: kworker/4:1H Tainted: GE 
> 5.18.0-rc5-iommu-next+ #1 4d1b12f73ec264927e45e8f2e5d1c0c8e280bc7d
> [81829.087585] Hardware name: System manufacturer System Product Name/PRIME 
> X470-PRO, BIOS 5406 11/13/2019
> [81829.087588] Workqueue: events_highpri dm_irq_work_func [amdgpu]
> [81829.087928] RIP: 0010:dce110_fill_display_configs+0x4a/0x150 [amdgpu]
> [81829.088274] Code: 31 ff 4d 8d 98 f0 01 00 00 49 8b 0c f8 4c 89 da 31 c0 48 
> 39 0a 0f 84 e4 00 00 00 83 c0 01 48 81 c2 10 08 00 00 83 f8 06 75 e8 <0f> 0b 
> 31 c0 80 b9 48 03 00 00 00 0f 85 a9 00 00 00 48 8b 50 08 8b
> [81829.088277] RSP: 0018:b891810c3be0 EFLAGS: 00010246
> [81829.088280] RAX: 0006 RBX: 9719a6b6 RCX: 
> 971d08e07800
> [81829.088282] RDX: 9719a6b63250 RSI: 9719a6b72980 RDI: 
> 0001
> [81829.088284] RBP: 971a812f R08: 9719a6b6 R09: 
> 
> [81829.088286] R10: 9719a6b72980 R11: 9719a6b601f0 R12: 
> 9719a6b72980
> [81829.088287] R13: 9719a6b6 R14: 0006 R15: 
> 3258
> [81829.088289] FS:  () GS:97285eb0() 
> knlGS:
> [81829.088291] CS:  0010 DS:  ES:  CR0: 

[BUG] Warning and NULL-ptr dereference in amdgpu driver with 5.18

2022-05-06 Thread Jörg Rödel
Hi,

I recently started to experience warnings and NULL-ptr
dereferences in the amdgpu driver with kernel 5.18-rc5+. Earlier
5.18-based kernels might be affected as well, but I haven't seen this
with 5.17.

The kernel was built from the iommu-next branch, based on 5.18-rc5.

The messages start with some PCIe error being reported:

[20389.984993] pcieport :00:03.1: AER: Multiple Corrected error received: 
:0a:00.0
[20389.985005] amdgpu :0a:00.0: PCIe Bus Error: severity=Corrected, 
type=Data Link Layer, (Receiver ID)
[20389.985007] amdgpu :0a:00.0:   device [1002:6995] error 
status/mask=00c0/2000
[20389.985010] amdgpu :0a:00.0:[ 6] BadTLP
[20389.985013] amdgpu :0a:00.0:[ 7] BadDLLP   

Directly followed by a warning:

[81829.087101] [ cut here ]
[81829.087105] WARNING: CPU: 4 PID: 644 at 
drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce110/dce110_clk_mgr.c:140 
dce110_fill_display_configs+0x4a/0x150 [amdgpu]
[81829.087461] Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) 
rfcomm(E) af_packet(E) ocrdma(E) ib_uverbs(E) ib_core(E) nft_fib_inet(E) 
nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) 
nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_tables(E) 
ebtable_nat(E) ebtable_broute(E) ip6table_nat(E) ip6table_mangle(E) 
ip6table_raw(E) ip6table_security(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) 
nf_defrag_ipv6(E) nf_defrag_ipv4(E) iptable_mangle(E) iptable_raw(E) 
iptable_security(E) ip_set(E) nfnetlink(E) ebtable_filter(E) ebtables(E) 
ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) cmac(E) 
algif_hash(E) algif_skcipher(E) af_alg(E) bnep(E) dmi_sysfs(E) 
intel_rapl_msr(E) intel_rapl_common(E) snd_hda_codec_realtek(E) eeepc_wmi(E) 
btusb(E) asus_wmi(E) kvm_amd(E) btrtl(E) snd_hda_codec_generic(E) 
nls_iso8859_1(E) battery(E) uvcvideo(E) sparse_keymap(E) ledtrig_audio(E) 
btbcm(E) video(E) wmi_bmof(E)
[81829.087502]  platform_profile(E) mxm_wmi(E) snd_hda_codec_hdmi(E) 
nls_cp437(E) videobuf2_vmalloc(E) btintel(E) asus_wmi_sensors(E) btmtk(E) 
vfat(E) snd_hda_intel(E) videobuf2_memops(E) videobuf2_v4l2(E) 
snd_intel_dspcfg(E) bluetooth(E) kvm(E) videobuf2_common(E) fat(E) 
snd_usb_audio(E) snd_virtuoso(E) irqbypass(E) snd_usbmidi_lib(E) 
snd_hda_codec(E) snd_oxygen_lib(E) videodev(E) snd_hwdep(E) snd_mpu401_uart(E) 
mc(E) snd_hda_core(E) ecdh_generic(E) snd_rawmidi(E) snd_pcm(E) 
snd_seq_device(E) rfkill(E) pcspkr(E) i2c_piix4(E) efi_pstore(E) k10temp(E) 
snd_timer(E) ext4(E) igb(E) snd(E) dca(E) soundcore(E) mbcache(E) be2net(E) 
jbd2(E) wmi(E) gpio_amdpt(E) gpio_generic(E) tiny_power_button(E) button(E) 
acpi_cpufreq(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) xfs(E) 
libcrc32c(E) dm_crypt(E) essiv(E) authenc(E) trusted(E) asn1_encoder(E) tee(E) 
hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) sr_mod(E) 
cdrom(E) uas(E) usb_storage(E) amdgpu(E)
[81829.087551]  drm_ttm_helper(E) ttm(E) iommu_v2(E) gpu_sched(E) 
i2c_algo_bit(E) crct10dif_pclmul(E) drm_dp_helper(E) crc32_pclmul(E) 
crc32c_intel(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) 
fb_sys_fops(E) drm(E) xhci_pci(E) cec(E) ghash_clmulni_intel(E) 
xhci_pci_renesas(E) aesni_intel(E) crypto_simd(E) cryptd(E) sp5100_tco(E) 
xhci_hcd(E) ccp(E) rc_core(E) nvme(E) usbcore(E) nvme_core(E) sg(E) 
br_netfilter(E) bridge(E) stp(E) llc(E) dm_multipath(E) dm_mod(E) 
scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) ledtrig_timer(E) msr(E) 
efivarfs(E)
[81829.087581] CPU: 4 PID: 644 Comm: kworker/4:1H Tainted: GE 
5.18.0-rc5-iommu-next+ #1 4d1b12f73ec264927e45e8f2e5d1c0c8e280bc7d
[81829.087585] Hardware name: System manufacturer System Product Name/PRIME 
X470-PRO, BIOS 5406 11/13/2019
[81829.087588] Workqueue: events_highpri dm_irq_work_func [amdgpu]
[81829.087928] RIP: 0010:dce110_fill_display_configs+0x4a/0x150 [amdgpu]
[81829.088274] Code: 31 ff 4d 8d 98 f0 01 00 00 49 8b 0c f8 4c 89 da 31 c0 48 
39 0a 0f 84 e4 00 00 00 83 c0 01 48 81 c2 10 08 00 00 83 f8 06 75 e8 <0f> 0b 31 
c0 80 b9 48 03 00 00 00 0f 85 a9 00 00 00 48 8b 50 08 8b
[81829.088277] RSP: 0018:b891810c3be0 EFLAGS: 00010246
[81829.088280] RAX: 0006 RBX: 9719a6b6 RCX: 971d08e07800
[81829.088282] RDX: 9719a6b63250 RSI: 9719a6b72980 RDI: 0001
[81829.088284] RBP: 971a812f R08: 9719a6b6 R09: 
[81829.088286] R10: 9719a6b72980 R11: 9719a6b601f0 R12: 9719a6b72980
[81829.088287] R13: 9719a6b6 R14: 0006 R15: 3258
[81829.088289] FS:  () GS:97285eb0() 
knlGS:
[81829.088291] CS:  0010 DS:  ES:  CR0: 80050033
[81829.088293] CR2: 7fbc4800bb28 CR3: 0002909c CR4: 003506e0
[81829.088295] Call Trace:
[81829.088298]  
[81829.088300]  dce11_pplib_apply_display_requirements+0x129/0x200