Re: amdgpu refcount saturation
On Mon, Dec 19, 2022 at 09:23:05AM +0100, Christian König wrote:
> On 17.12.22 at 12:53, Borislav Petkov wrote:
> > Hi folks,
> >
> > this is with Linus' tree from Wed:
> >
> > 041fae9c105a ("Merge tag 'f2fs-for-6.2-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs")
> >
> > on a CZ laptop:
> >
> > [7.782901] [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x103C:0x807E 0xC4)
> >
> > The splat is kinda messy:
>
> Thanks for the notice, going to take a look today.
>
> Regards,
> Christian.

In case it might help, I have similar crashes with 6.2 merge window
snapshots on a desktop machine with Radeon WX2100

[ 16.045850] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6995 0x1002:0x0B0C 0x00).

The behavior seems pretty deterministic so far, the system boots cleanly,
login into KDE is fine but then it crashes as soon as I start firefox.

Unfortunately, just like Boris, I always seem to have multiple stack traces
tangled together.

Michal

Commit 77856d911a8c:
--
[ 165.210008] [ cut here ]
[ 165.215427] refcount_t: underflow; use-after-free.
[ 165.221026] WARNING: CPU: 14 PID: 1165 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 165.230420] Modules linked in: echainiv esp4 af_packet tun 8021q garp mrp stp llc iscsi_ibft iscsi_boot_sysfs xt_REDIRECT xt_MASQUERADE xt_nat iptable_nat nf_nat deflate sm4_generic sm4_aesni_avx2_x86_64 xt_LOG sm4_aesni_avx_x86_64 nf_log_syslog sm4 twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic xt_conntrack camellia_aesni_avx2 nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic blowfish_generic blowfish_x86_64 vmnet(OE) blowfish_common ppdev parport_pc parport cast5_avx_x86_64 vmw_vsock_vmci_transport cast5_generic cast_common ipt_REJECT nf_reject_ipv4 vsock des_generic libdes sm3_generic rfkill xt_tcpudp sm3_avx_x86_64 sm3 xt_set vmw_vmci cmac xcbc iptable_filter vmmon(OE) rmd160 bpfilter dmi_sysfs ip_set_hash_ip af_key ip_set xfrm_algo nfnetlink msr hwmon_vid dm_crypt essiv authenc trusted asn1_encoder tee amdgpu [ 165.230464] intel_rapl_msr uvcvideo videobuf2_vmalloc iommu_v2 videobuf2_memops drm_buddy i2c_dev videobuf2_v4l2 gpu_sched video intel_rapl_common snd_usb_audio videodev xfs drm_display_helper videobuf2_common snd_usbmidi_lib drm_ttm_helper ttm libcrc32c edac_mce_amd joydev mc irqbypass cec pcspkr wmi_bmof gigabyte_wmi k10temp i2c_piix4 tiny_power_button rc_core igb dca thermal button acpi_cpufreq fuse configfs ip_tables x_tables ext4 mbcache jbd2 hid_generic uas usb_storage usbhid crct10dif_pclmul crc32_pclmul crc32c_intel xhci_pci polyval_clmulni xhci_pci_renesas polyval_generic gf128mul xhci_hcd ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd nvme cryptd usbcore ccp sr_mod sp5100_tco cdrom nvme_core wmi snd_emu10k1 snd_hwdep snd_util_mem snd_ac97_codec ac97_bus snd_pcm snd_timer snd_rawmidi snd_seq_device snd soundcore sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua [ 
165.339552] [ cut here ] [ 165.339552] [ cut here ] [ 165.339553] refcount_t: saturated; leaking memory. [ 165.339557] WARNING: CPU: 18 PID: 6237 at lib/refcount.c:19 refcount_warn_saturate+0x97/0x110 [ 165.339562] Modules linked in: echainiv esp4 af_packet tun 8021q garp mrp stp llc iscsi_ibft iscsi_boot_sysfs xt_REDIRECT xt_MASQUERADE xt_nat iptable_nat nf_nat deflate sm4_generic sm4_aesni_avx2_x86_64 xt_LOG sm4_aesni_avx_x86_64 nf_log_syslog sm4 twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic xt_conntrack camellia_aesni_avx2 nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic blowfish_generic blowfish_x86_64 vmnet(OE) blowfish_common ppdev parport_pc parport cast5_avx_x86_64 vmw_vsock_vmci_transport cast5_generic cast_common ipt_REJECT nf_reject_ipv4 vsock des_generic libdes sm3_generic rfkill xt_tcpudp sm3_avx_x86_64 sm3 xt_set vmw_vmci cmac xcbc iptable_filter vmmon(OE) rmd160 bpfilter dmi_sysfs ip_set_hash_ip af_key ip_set xfrm_algo nfnetlink msr hwmon_vid dm_crypt essiv authenc trusted asn1_encoder tee amdgpu [ 165.339588] intel_rapl_msr uvcvideo videobuf2_vmalloc iommu_v2 videobuf2_memops drm_buddy i2c_dev videobuf2_v4l2 gpu_sched video intel_rapl_common snd_usb_audio videodev xfs drm_display_helper videobuf2_common snd_usbmidi_lib drm_ttm_helper ttm libcrc32c edac_mce_amd joydev mc irqbypass cec pcspkr wmi_bmof gigabyte_wmi k10temp i2c_piix4 tiny_power_button rc_core igb dca thermal button acpi_cpufreq fuse configfs ip_tables x_tables ext4 mbcache jbd2 hid_generic uas usb_storage usbhid
Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency
On Mon, Dec 19, 2022 at 11:47:18AM +0100, Christian König wrote:
> That function consumes the reference.
>
> Signed-off-by: Christian König
> Fixes: aab9cf7b6954 ("drm/amdgpu: use scheduler dependencies for VM updates")

Tested-by: Michal Kubecek

I can still see weird artefacts in some windows (firefox, konsole) but
those are probably unrelated, the refcount errors are gone with this patch.

Michal

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 59cf64216fbb..535cd6569bcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -238,8 +238,10 @@ static int amdgpu_vm_sdma_update(struct amdgpu_vm_update_params *p,
>  	/* Wait for PD/PT moves to be completed */
>  	dma_resv_iter_begin(&cursor, bo->tbo.base.resv, DMA_RESV_USAGE_KERNEL);
>  	dma_resv_for_each_fence_unlocked(&cursor, fence) {
> +		dma_fence_get(fence);
>  		r = drm_sched_job_add_dependency(&p->job->base, fence);
>  		if (r) {
> +			dma_fence_put(fence);
>  			dma_resv_iter_end(&cursor);
>  			return r;
>  		}
> --
> 2.34.1
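For readers following along, the one-line rationale of the patch ("That function consumes the reference") can be illustrated with a small userspace model. This is only a sketch under stated assumptions: toy_fence, toy_job and every toy_* helper are hypothetical stand-ins invented for the illustration, not the kernel's dma_fence or DRM scheduler API, and only the success path is modelled. The model assumes the iterator holds its own reference to the fence it is visiting and drops it when the walk ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted fence (NOT the kernel's struct dma_fence). */
struct toy_fence {
	int refs;
};

static void toy_get(struct toy_fence *f) { f->refs++; }
static void toy_put(struct toy_fence *f) { f->refs--; }

/* Hypothetical job that owns exactly one reference to its single dependency. */
struct toy_job {
	struct toy_fence *dep;
};

/* Consumes the caller's reference: afterwards the job owns it, the caller does not. */
static void toy_job_add_dependency(struct toy_job *job, struct toy_fence *f)
{
	job->dep = f;
}

/* Freeing the job releases the reference it took over. */
static void toy_job_free(struct toy_job *job)
{
	assert(job->dep != NULL);
	toy_put(job->dep);
	job->dep = NULL;
}

/*
 * Visit one fence and return the reference count left once the iterator
 * and the job are both gone. The long-lived owner still holds its
 * reference at that point, so the balanced answer is 1.
 */
static int toy_refs_after_walk(bool take_extra_ref)
{
	struct toy_fence fence = { .refs = 1 };	/* held by the long-lived owner */
	struct toy_job job = { .dep = NULL };

	toy_get(&fence);			/* iterator's reference during the visit */
	if (take_extra_ref)
		toy_get(&fence);		/* the reference the job will consume */
	toy_job_add_dependency(&job, &fence);
	toy_put(&fence);			/* iterator drops its reference at the end */
	toy_job_free(&job);			/* job drops the reference it consumed */

	return fence.refs;
}
```

In this model toy_refs_after_walk(true) leaves one reference for the owner, while toy_refs_after_walk(false) leaves zero even though the owner still believes it holds one. The owner's later put would then underflow the count, which is the kind of imbalance the refcount_t warnings report.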
Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
On Fri, Jun 03, 2022 at 11:49:31AM -0400, Alex Deucher wrote:
> On Thu, Jun 2, 2022 at 10:22 AM Michal Kubecek wrote:
> >
> > On Thu, Jun 02, 2022 at 09:58:22AM -0400, Alex Deucher wrote:
> > > On Fri, May 27, 2022 at 8:58 AM Michal Kubecek wrote:
> > > > On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > > > > Hello,
> > > > >
> > > > > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > > > > 7e284070abe5), I keep getting errors like below. I have not seen them
> > > > > with 5.18 final or older.
> > > > >
> > > > >
> > > > > [ 247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > > > > [ 247.150336] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > > > > [ 247.150339] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
> > > > > [ 247.150340] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > > > > [ 247.150341] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> > > > [...]
> > > > > [ 249.925909] amdgpu :0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x4A00, 0x44D0)
> > > > > [ 250.434986] [drm] Fence fallback timer expired on ring sdma0
> > > > > [ 466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> > > > [...]
> > > > >
> > > > >
> > > > > There does not seem to be any apparent immediate problem with graphics
> > > > > but when running commit babf0bb978e3, there seemed to be a noticeable
> > > > > lag in some operations, e.g. when moving a window or repainting large
> > > > > part of the terminal window in konsole (no idea if it's related).
> > > > >
> > > > > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > > > > I collect to help debugging the issue?
> > > >
> > > > Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> > > > There seem to be later commits depending on it so I did not test
> > > > a revert on top of current mainline.
> > > >
> > > > I should also mention that most commits tested as "bad" during the
> > > > bisect did behave much worse than current mainline (errors starting as
> > > > early as with sddm, visibly damaged screen content, sometimes even
> > > > crashes). But all of them issued messages similar to those above into
> > > > kernel log.
> > >
> > > Can you verify that the kernel you tested has this patch:
> > > https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d
> >
> > Yes, both of them:
> >
> > mike@lion:~/work/git/kernel-upstream> git merge-base --is-ancestor 5be323562c6a babf0bb978e3 && echo yes
> > yes
> >
> > (7e284070abe5 is a later mainline snapshot so it also contains
> > 5be323562c6a)
> >
> > But it's likely that commit 5be323562c6a fixed most of the problem and
> > only some corner case was left as most bisect steps had many more error
> > messages and some even crashed before I was able to even log into KDE.
> > Compared to that, the mainline snapshots show much fewer errors, no
> > distorted picture and no crash; on the other hand, applications like
> > firefox or stellarium seem to trigger the errors quite consistently.
>
> This patch should help:
> https://patchwork.freedesktop.org/patch/488258/

After ~48 hours with this patch, still no apparent issues.

Tested-by: Michal Kubecek

Michal
Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
On Fri, Jun 03, 2022 at 11:49:31AM -0400, Alex Deucher wrote:
> On Thu, Jun 2, 2022 at 10:22 AM Michal Kubecek wrote:
> >
> > On Thu, Jun 02, 2022 at 09:58:22AM -0400, Alex Deucher wrote:
> > > On Fri, May 27, 2022 at 8:58 AM Michal Kubecek wrote:
> > > > On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > > > > Hello,
> > > > >
> > > > > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > > > > 7e284070abe5), I keep getting errors like below. I have not seen them
> > > > > with 5.18 final or older.
> > > > >
> > > > >
> > > > > [ 247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > > > > [ 247.150336] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > > > > [ 247.150339] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
> > > > > [ 247.150340] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > > > > [ 247.150341] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> > > > [...]
> > > > > [ 249.925909] amdgpu :0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x4A00, 0x44D0)
> > > > > [ 250.434986] [drm] Fence fallback timer expired on ring sdma0
> > > > > [ 466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> > > > [...]
> > > > >
> > > > >
> > > > > There does not seem to be any apparent immediate problem with graphics
> > > > > but when running commit babf0bb978e3, there seemed to be a noticeable
> > > > > lag in some operations, e.g. when moving a window or repainting large
> > > > > part of the terminal window in konsole (no idea if it's related).
> > > > >
> > > > > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > > > > I collect to help debugging the issue?
> > > >
> > > > Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> > > > There seem to be later commits depending on it so I did not test
> > > > a revert on top of current mainline.
> > > >
> > > > I should also mention that most commits tested as "bad" during the
> > > > bisect did behave much worse than current mainline (errors starting as
> > > > early as with sddm, visibly damaged screen content, sometimes even
> > > > crashes). But all of them issued messages similar to those above into
> > > > kernel log.
> > >
> > > Can you verify that the kernel you tested has this patch:
> > > https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d
> >
> > Yes, both of them:
> >
> > mike@lion:~/work/git/kernel-upstream> git merge-base --is-ancestor 5be323562c6a babf0bb978e3 && echo yes
> > yes
> >
> > (7e284070abe5 is a later mainline snapshot so it also contains
> > 5be323562c6a)
> >
> > But it's likely that commit 5be323562c6a fixed most of the problem and
> > only some corner case was left as most bisect steps had many more error
> > messages and some even crashed before I was able to even log into KDE.
> > Compared to that, the mainline snapshots show much fewer errors, no
> > distorted picture and no crash; on the other hand, applications like
> > firefox or stellarium seem to trigger the errors quite consistently.
>
> This patch should help:
> https://patchwork.freedesktop.org/patch/488258/

It seems to help, I'm running a kernel built with this patch on top of
mainline commit 50fd82b3a9a9 (current head) and I haven't seen any errors
yet. I'll give it some more time and report back.

Michal
Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
On Thu, Jun 02, 2022 at 09:58:22AM -0400, Alex Deucher wrote:
> On Fri, May 27, 2022 at 8:58 AM Michal Kubecek wrote:
> > On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > > Hello,
> > >
> > > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > > 7e284070abe5), I keep getting errors like below. I have not seen them
> > > with 5.18 final or older.
> > >
> > >
> > > [ 247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > > [ 247.150336] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > > [ 247.150339] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
> > > [ 247.150340] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > > [ 247.150341] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> > [...]
> > > [ 249.925909] amdgpu :0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x4A00, 0x44D0)
> > > [ 250.434986] [drm] Fence fallback timer expired on ring sdma0
> > > [ 466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> > [...]
> > >
> > >
> > > There does not seem to be any apparent immediate problem with graphics
> > > but when running commit babf0bb978e3, there seemed to be a noticeable
> > > lag in some operations, e.g. when moving a window or repainting large
> > > part of the terminal window in konsole (no idea if it's related).
> > >
> > > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > > I collect to help debugging the issue?
> >
> > Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> > There seem to be later commits depending on it so I did not test
> > a revert on top of current mainline.
> >
> > I should also mention that most commits tested as "bad" during the
> > bisect did behave much worse than current mainline (errors starting as
> > early as with sddm, visibly damaged screen content, sometimes even
> > crashes). But all of them issued messages similar to those above into
> > kernel log.
>
> Can you verify that the kernel you tested has this patch:
> https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d

Yes, both of them:

mike@lion:~/work/git/kernel-upstream> git merge-base --is-ancestor 5be323562c6a babf0bb978e3 && echo yes
yes

(7e284070abe5 is a later mainline snapshot so it also contains
5be323562c6a)

But it's likely that commit 5be323562c6a fixed most of the problem and
only some corner case was left as most bisect steps had many more error
messages and some even crashed before I was able to even log into KDE.
Compared to that, the mainline snapshots show much fewer errors, no
distorted picture and no crash; on the other hand, applications like
firefox or stellarium seem to trigger the errors quite consistently.

Michal
(REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> Hello,
>
> while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> 7e284070abe5), I keep getting errors like below. I have not seen them
> with 5.18 final or older.
>
>
> [ 247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> [ 247.150336] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> [ 247.150339] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
> [ 247.150340] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> [ 247.150341] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
[...]
> [ 249.925909] amdgpu :0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x4A00, 0x44D0)
> [ 250.434986] [drm] Fence fallback timer expired on ring sdma0
> [ 466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
[...]
>
>
> There does not seem to be any apparent immediate problem with graphics
> but when running commit babf0bb978e3, there seemed to be a noticeable
> lag in some operations, e.g. when moving a window or repainting large
> part of the terminal window in konsole (no idea if it's related).
>
> My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> I collect to help debugging the issue?

Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
There seem to be later commits depending on it so I did not test
a revert on top of current mainline.

I should also mention that most commits tested as "bad" during the
bisect did behave much worse than current mainline (errors starting as
early as with sddm, visibly damaged screen content, sometimes even
crashes). But all of them issued messages similar to those above into
kernel log.

Michal Kubecek
amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
detected: 147 0x000a1002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.890018] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00143C28
[ 3979.890018] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.890019] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326120, write from 'CB3' (0x43423300) (16)
[ 3979.891937] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x02000802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.891937] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00143C40
[ 3979.891938] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008002
[ 3979.891938] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326144, read from 'TC2' (0x54433200) (8)
[ 4062.912573] gmc_v8_0_process_interrupt: 2 callbacks suppressed
[ 4062.912578] amdgpu :0c:00.0: amdgpu: GPU fault detected: 147 0x4802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 4062.912580] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0800
[ 4062.912581] amdgpu :0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048002
[ 4062.912582] amdgpu :0c:00.0: amdgpu: VM fault (0x02, vmid 3, pasid 32781) at page 2048, read from 'TC0' (0x54433000) (72)

There does not seem to be any apparent immediate problem with graphics
but when running commit babf0bb978e3, there seemed to be a noticeable
lag in some operations, e.g. when moving a window or repainting large
part of the terminal window in konsole (no idea if it's related).

My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
I collect to help debugging the issue?

Michal Kubecek
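As a reading aid for the fault lines above, the numbers in each "VM fault" summary appear to be derived from the two register dumps printed just before it. The sketch below is not the driver's code. The bit positions in decode_fault_status() are an assumption inferred from the samples quoted in this thread (they reproduce the printed vmid, read/write flag and trailing client number for each of them), and the names fault_info, decode_fault_status and decode_client_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the "VM fault (...)" summary line, as inferred from the logs. */
struct fault_info {
	unsigned int protections;	/* assumed bits 0-7, printed as 0x02 */
	unsigned int client_id;		/* assumed bits 12-19, trailing "(8)" or "(16)" */
	unsigned int is_write;		/* assumed bit 24, "write" vs "read" */
	unsigned int vmid;		/* assumed bits 25-28, "vmid N" */
};

/* Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value under the assumed layout. */
static struct fault_info decode_fault_status(uint32_t status)
{
	struct fault_info fi;

	fi.protections = status & 0xff;
	fi.client_id = (status >> 12) & 0xff;
	fi.is_write = (status >> 24) & 0x1;
	fi.vmid = (status >> 25) & 0xf;
	return fi;
}

/*
 * The quoted client value packs up to four ASCII characters, most
 * significant byte first; 0x43423300 is 'C', 'B', '3', NUL.
 * out must have room for 5 bytes.
 */
static void decode_client_name(uint32_t mc_client, char *out)
{
	assert(out != NULL);
	out[0] = (char)((mc_client >> 24) & 0xff);
	out[1] = (char)((mc_client >> 16) & 0xff);
	out[2] = (char)((mc_client >> 8) & 0xff);
	out[3] = (char)(mc_client & 0xff);
	out[4] = '\0';
}
```

Applied to the first sample above, status 0x05010002 decodes to vmid 2, a write access and client number 16, and 0x43423300 unpacks to "CB3". The page number is simply the fault address printed in decimal: 0x00143C28 is 1326120.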