Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle
On Saturday 24 July 2021 22:03:23 CEST Diederik de Haas wrote:
> > It's already backported to 5.10, just after 5.10.46 was released:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7
>
> Just found out the reverted commit was introduced just before the 5.10.46
> tag was created, which should mean that any version before 5.10.46 should
> NOT have this problem.

I just found out it's 2 commits that should be reverted; the corresponding stable reverts are:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=1bd81429d53ded4e111616c755a64fad80849354
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7

I saved (and attached) the first one as 'fix-bug991453-part1.patch' and the second one as 'fix-bug991453-part2.patch'.

Then I followed steps 4.2.1 and 4.2.2 of the kernel handbook:
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

That resulted in 'linux-image-5.10.0-8-amd64-unsigned_5.10.46-2a~test_amd64.deb', which I then installed on my system and rebooted into.

$ uname -a
Linux bagend 5.10.0-8-amd64 #1 SMP Debian 5.10.46-2a~test (2021-07-24) x86_64 GNU/Linux

$ cat /sys/class/drm/card0/device/gpu_busy_percent
0

$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +40.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +40.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +51.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       750.00 mV
fan1:         1208 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +42.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +42.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +43.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:       7.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +76.5°C
Tdie:         +56.5°C

# radeontop
Graphics pipe 0.83%
0.17G / 0.94G Memory Clock 17.67%
0.03G / 1.63G Shader Clock 1.78%

These are the same 'scores' as I had with the 5.10.0-7-amd64 kernel.
So applying the mentioned/attached patches on top of the current kernel as available in Debian Testing/Bullseye and Sid fixes the problem.

In the last year I've spent considerable time bringing down my energy usage/needs, and I never expected that (reverting) 2 kernel commits would save me 67 W (continuously), so thank you very much piorunz for bringing this to my attention.
I normally have a quiet system and noticed it often wasn't quiet lately; I blame(d) 'baloo' (file indexing) for that, but it turns out it was mostly my GPU running at 100% all the time.

As it looks like a lot of users with AMD GPUs are affected, and given the considerable energy wasted because of it (Climate Change), I really hope/urge that these 2 patches/reverts are applied before Bullseye gets released.

Cheers,
  Diederik

From 1bd81429d53ded4e111616c755a64fad80849354 Mon Sep 17 00:00:00 2001
From: Yifan Zhang
Date: Sat, 19 Jun 2021 11:40:54 +0800
Subject: Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."

commit ee5468b9f1d3bf48082eed351dace14598e8ca39 upstream.

This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20.

Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.

Signed-off-by: Yifan Zhang
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 1859d293ef712..fb15e8b5af32f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3619,12 +3619,8 @@ static int gfx_v9_0_kiq_init_register(struct amdgpu_ring *ring)
 	if (ring->use_doorbell) {
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_LOWER,
 					(adev->doorbell_index.kiq * 2) << 2);
-		/* If GC has entered CGPG, ringing doorbell > first page doesn't
-		 * wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround
-		 * this issue.
-		 */
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_UPPER,
-					(adev->doorbell.size - 4));
+					(adev->doorbell_index.userqueue_end * 2) << 2);
 	}

 	WREG32_SOC15_RLC(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
--
cgit 1.2.3-1.el7

From fea853aca3210c21dfcf07bb82d501b7fd1900a7 Mon Sep 17 00:00:00 2001
From: Yifan Zhang
Date: Sat, 19 Jun 2021 11:39:43 +0800
Subject: Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell."

commit baacf52a473b24e10322b67757ddb92ab8d86717 upstream.

This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3.
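
The 67 W saving mentioned in the message above is the difference between the two idle `power1` readings in this thread (74 W on the affected kernel, 7 W on the patched one). As a rough illustration of what that means over time, here is a small sketch of the arithmetic. The function names are made up for this example, and the assumption that the machine idles 24 hours a day all year is mine, not something stated in the thread.

```python
def idle_saving_watts(before_watts, after_watts):
    """Difference between two idle power readings, in watts."""
    return before_watts - after_watts


def annual_kwh(watts, hours_per_day=24.0, days_per_year=365):
    """Energy used per year by a constant load, in kWh.

    hours_per_day=24 is an assumption (machine always on and idle).
    """
    return watts * hours_per_day * days_per_year / 1000.0


if __name__ == "__main__":
    saving = idle_saving_watts(74.0, 7.0)  # readings taken from the thread
    print(f"idle saving: {saving:.0f} W")
    print(f"per year (always on): {annual_kwh(saving):.2f} kWh")
```

With a lower `hours_per_day` the yearly figure scales down linearly, so the result is only an upper bound for a machine that is switched off part of the day.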
Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle
> It's already backported to 5.10, just after 5.10.46 was released:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7

Just found out the reverted commit was introduced just before the 5.10.46 tag was created, which should mean that any version before 5.10.46 should NOT have this problem.

The version uploaded to the Debian archive before that was 5.10.40-1, which in my case was linux-image-5.10.0-7-amd64.
I no longer had that installed, so I added the following line to /etc/apt/sources.list:

deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20210529T204006Z/ sid main

Installed linux-image-5.10.0-7-amd64 and rebooted into that ...

On Saturday 24 July 2021 14:22:38 CEST Diederik de Haas wrote:
> On Saturday 24 July 2021 01:29:55 CEST piorunz wrote:
> > GPU core works at 100% usage at all times, even at idle.
> >
> > $ cat /sys/class/drm/card0/device/gpu_busy_percent
> > 99
> >
> > $ sensors
> > (...)
> > amdgpu-pci-0900
> > Adapter: PCI adapter
> > vddgfx:       1.14 V
> > fan1:         1098 RPM  (min =    0 RPM, max = 3000 RPM)
> > edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
> >                        (emerg = +105.0°C)
> > junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
> >                        (emerg = +115.0°C)
> > mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
> >                        (emerg = +105.0°C)
> > power1:       74.00 W  (cap = 272.00 W)
> >
> > radeontop - 100% GPU usage and full clocks:
> > Graphics pipe 100.00%
> > 1.00G / 1.00G Memory Clock 100.00%
> > 2.47G / 2.58G Shader Clock 95.92%
>
> I'm getting the same results on my Radeon RX Vega 64.
> $ cat /sys/class/drm/card0/device/gpu_busy_percent
> 99
> $ sensors
> nvme-pci-0100
> Adapter: PCI adapter
> Composite:    +43.9°C  (low  = -273.1°C, high = +72.8°C)
>                        (crit = +75.8°C)
> Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
> Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)
>
> amdgpu-pci-0c00
> Adapter: PCI adapter
> vddgfx:       1.09 V
> fan1:         1240 RPM  (min =    0 RPM, max = 3500 RPM)
> edge:         +50.0°C  (crit = +85.0°C, hyst = -273.1°C)
>                        (emerg = +90.0°C)
> junction:     +63.0°C  (crit = +105.0°C, hyst = -273.1°C)
>                        (emerg = +110.0°C)
> mem:          +51.0°C  (crit = +95.0°C, hyst = -273.1°C)
>                        (emerg = +100.0°C)
> power1:       74.00 W  (cap = 260.00 W)
>
> k10temp-pci-00c3
> Adapter: PCI adapter
> Tctl:         +66.0°C
> Tdie:         +46.0°C
>
> And also for radeontop.

$ cat /sys/class/drm/card0/device/gpu_busy_percent
0

$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +32.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +32.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +41.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       750.00 mV
fan1:         1182 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +37.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +38.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +39.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:       7.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +65.6°C
Tdie:         +45.6°C

# radeontop
Graphics pipe 0.83%
0.17G / 0.94G Memory Clock 17.67%
0.03G / 1.63G Shader Clock 1.81%

So running a kernel != 5.10.46 does not have this problem :)
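
The comparison above boils down to two numbers per kernel: `gpu_busy_percent` and the `power1` line of the amdgpu block in the `sensors` output. A minimal sketch of pulling the power reading out of pasted `sensors` text and flagging the "busy while idle" pattern might look like this. The function names and both thresholds (90 % and 50 W) are hypothetical choices for illustration; they are not taken from the thread or from any tool.

```python
import re


def parse_power_watts(sensors_text):
    """Return the first 'power1' reading (watts) in `sensors` output, or None."""
    match = re.search(r"^\s*power1:\s*([0-9.]+)\s*W", sensors_text, re.MULTILINE)
    return float(match.group(1)) if match else None


def looks_stuck_busy(busy_percent, power_watts, busy_limit=90, power_limit=50.0):
    """Heuristic: GPU reports near-full load and high power draw at the same time.

    busy_limit and power_limit are arbitrary example thresholds.
    """
    return busy_percent >= busy_limit and power_watts >= power_limit


if __name__ == "__main__":
    affected = "amdgpu-pci-0c00\nAdapter: PCI adapter\npower1:       74.00 W  (cap = 260.00 W)\n"
    patched = "amdgpu-pci-0c00\nAdapter: PCI adapter\npower1:7.00 W (cap = 260.00 W)\n"
    print(looks_stuck_busy(99, parse_power_watts(affected)))
    print(looks_stuck_busy(0, parse_power_watts(patched)))
```

The heuristic only makes sense when the desktop really is idle while the readings are taken; a game or compute job would trip it too.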
Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle
I see this on Vega (RX 56) as well as Navi10 (5700XT), and I think Polaris (RX 570) is affected too. So it's not only an issue with Navi 2X. Is it worth opening a separate bug report for each card?
Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle
Control: tags -1 confirmed

On Saturday 24 July 2021 01:29:55 CEST piorunz wrote:
> GPU core works at 100% usage at all times, even at idle.
>
> $ cat /sys/class/drm/card0/device/gpu_busy_percent
> 99
>
> $ sensors
> (...)
> amdgpu-pci-0900
> Adapter: PCI adapter
> vddgfx:       1.14 V
> fan1:         1098 RPM  (min =    0 RPM, max = 3000 RPM)
> edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
>                        (emerg = +105.0°C)
> junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
>                        (emerg = +115.0°C)
> mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
>                        (emerg = +105.0°C)
> power1:       74.00 W  (cap = 272.00 W)
>
> radeontop - 100% GPU usage and full clocks:
> Graphics pipe 100.00%
> 1.00G / 1.00G Memory Clock 100.00%
> 2.47G / 2.58G Shader Clock 95.92%

I'm getting the same results on my Radeon RX Vega 64.

$ cat /sys/class/drm/card0/device/gpu_busy_percent
99

$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +43.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       1.09 V
fan1:         1240 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +50.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +63.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +51.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:       74.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +66.0°C
Tdie:         +46.0°C

And also for radeontop.

> This I believe has been fixed upstream:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1632
>
> Can you please make sure this will be backported to Debian 5.10 kernel?

It's already backported to 5.10, just after 5.10.46 was released:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7
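
A single `cat` of `gpu_busy_percent` is only one sample. To confirm that the value stays near 99 on an idle desktop, one could average a few reads. The sketch below does that for any file containing an integer; the helper name, the sample count and the interval are illustrative assumptions, and the path in the usage line is simply the one quoted in this thread (it may be `card1` or similar on other systems).

```python
import time
from pathlib import Path


def sample_busy_percent(path, samples=5, interval=0.0):
    """Read an integer percentage from `path` several times and return the mean."""
    source = Path(path)
    readings = []
    for _ in range(samples):
        readings.append(int(source.read_text().strip()))
        if interval > 0:
            time.sleep(interval)  # pause between samples
    return sum(readings) / len(readings)


if __name__ == "__main__":
    sysfs_file = "/sys/class/drm/card0/device/gpu_busy_percent"
    if Path(sysfs_file).exists():
        print(sample_busy_percent(sysfs_file, samples=10, interval=0.5))
    else:
        print("no gpu_busy_percent file at", sysfs_file)
```

An average close to 99 with nothing running would match the symptom reported here, while an average near 0 would match the behaviour seen on the unaffected kernels.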
Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle
Package: src:linux
Version: 5.10.46-2
Severity: important
X-Debbugs-Cc: pior...@gmx.com

Dear Maintainer,

I am using Bullseye 11 with a Radeon 6800 XT and noticed higher temperature and noise compared to Windows, so I investigated this and found the fault.

GPU core works at 100% usage at all times, even at idle.

$ cat /sys/class/drm/card0/device/gpu_busy_percent
99

$ sensors
(...)
amdgpu-pci-0900
Adapter: PCI adapter
vddgfx:       1.14 V
fan1:         1098 RPM  (min =    0 RPM, max = 3000 RPM)
edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
power1:       74.00 W  (cap = 272.00 W)

radeontop - 100% GPU usage and full clocks:
Graphics pipe 100.00%
1.00G / 1.00G Memory Clock 100.00%
2.47G / 2.58G Shader Clock 95.92%

This card should be using 0% GPU time when idle, downclocking the core to 10 MHz, and using 9 to 34 W depending on the number of monitors connected.
On the Debian 5.10 kernel, it's using 74 W minimum, at all times.

This I believe has been fixed upstream:
https://gitlab.freedesktop.org/drm/amd/-/issues/1632

Can you please make sure this will be backported to the Debian 5.10 kernel?
Thanks.

Kind regards,
piorunz

-- Package-specific info:
** Version:
Linux version 5.10.0-8-amd64 (debian-ker...@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-2 (2021-07-20)

** Command line:
BOOT_IMAGE=/@rootfs/boot/vmlinuz-5.10.0-8-amd64 root=UUID=77ba5989-5d64-4929-9145-ede6751a4102 ro rootflags=subvol=@rootfs amdgpu.ppfeaturemask=0xfffd7fff

** Not tainted

** Kernel log:
[ 4256.341304] microcode: CPU13: patch_level=0x0a201016
[ 4256.343423] ACPI: \_SB_.PLTF.C00B: Found 2 idle states
[ 4256.345209] CPU13 is up
[ 4256.345219] smpboot: Booting Node 0 Processor 14 APIC 0xd
[ 4256.345313] microcode: CPU14: patch_level=0x0a201016
[ 4256.347435] ACPI: \_SB_.PLTF.C00D: Found 2 idle states
[ 4256.349222] CPU14 is up
[ 4256.349236] smpboot: Booting Node 0 Processor 15 APIC 0xf
[ 4256.349330] microcode: CPU15: patch_level=0x0a201016
[ 4256.351451] ACPI: \_SB_.PLTF.C00F: Found 2 idle states
[ 4256.353230] CPU15 is up
[ 4256.354126] ACPI: Waking up from system sleep state S3
[ 4256.453431] usb usb1: root hub lost power or was reset
[ 4256.453432] usb usb2: root hub lost power or was reset
[ 4256.453565] sd 0:0:0:0: [sda] Starting disk
[ 4256.453566] sd 1:0:0:0: [sdb] Starting disk
[ 4256.453568] sd 2:0:0:0: [sdc] Starting disk
[ 4256.453573] sd 3:0:0:0: [sdd] Starting disk
[ 4256.453577] sd 5:0:0:0: [sde] Starting disk
[ 4256.453768] [drm] PCIE GART of 512M enabled (table at 0x0080).
[ 4256.453786] [drm] PSP is resuming...
[ 4256.648463] nvme nvme0: 8/0/0 default/read/poll queues
[ 4256.661079] [drm] reserve 0xa0 from 0x83fe00 for PSP TMR
[ 4256.768036] ata5: SATA link down (SStatus 0 SControl 330)
[ 4256.801229] usb 1-7: reset high-speed USB device number 8 using xhci_hcd
[ 4256.929094] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4256.929312] ata6.00: configured for UDMA/133
[ 4256.937100] amdgpu :09:00.0: amdgpu: SMU is resuming...
[ 4256.937104] amdgpu :09:00.0: amdgpu: smu driver if version = 0x0039, smu fw if version = 0x003b, smu fw version = 0x003a3100 (58.49.0)
[ 4256.937105] amdgpu :09:00.0: amdgpu: SMU driver if version not matched
[ 4257.005584] amdgpu :09:00.0: amdgpu: SMU is resumed successfully!
[ 4257.006633] [drm] DMUB hardware initialized: version=0x0201
[ 4257.081355] usb 1-6: reset full-speed USB device number 7 using xhci_hcd
[ 4257.445313] usb 1-9: reset high-speed USB device number 9 using xhci_hcd
[ 4257.737242] usb 1-10: reset full-speed USB device number 11 using xhci_hcd
[ 4258.093501] usb 1-2: reset full-speed USB device number 2 using xhci_hcd
[ 4258.108494] [drm] kiq ring mec 2 pipe 1 q 0
[ 4258.125155] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4258.134188] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 4258.134405] [drm] JPEG decode initialized successfully.
[ 4258.134427] amdgpu :09:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4258.134428] amdgpu :09:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4258.134428] amdgpu :09:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 4258.134430] amdgpu :09:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 4258.134430] amdgpu