Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle

2021-07-24 Thread Diederik de Haas
On Saturday, 24 July 2021 22:03:23 CEST Diederik de Haas wrote:
> > It's already backported to 5.10, just after 5.10.46 was released:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7
> 
> Just found out the reverted commit was introduced just before the 5.10.46
> tag was created, which should mean that any version before 5.10.46 should
> NOT have this problem. 

I just found out that 2 revert commits are needed, not just 1:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=1bd81429d53ded4e111616c755a64fad80849354
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7

I saved (and attached) the first one as 'fix-bug991453-part1.patch' and the
second one as 'fix-bug991453-part2.patch'.

Then I followed steps 4.2.1 and 4.2.2 of the kernel handbook:
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official
That resulted in 'linux-image-5.10.0-8-amd64-unsigned_5.10.46-2a~test_amd64.deb',
which I then installed on my system and rebooted into.

$ uname -a
Linux bagend 5.10.0-8-amd64 #1 SMP Debian 5.10.46-2a~test (2021-07-24) x86_64 GNU/Linux
$ cat /sys/class/drm/card0/device/gpu_busy_percent
0
$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +40.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +40.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +51.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:      750.00 mV
fan1:        1208 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +42.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +42.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +43.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:        7.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +76.5°C
Tdie:         +56.5°C

# radeontop
Graphics pipe 0.83%
0.17G / 0.94G Memory Clock 17.67%
0.03G / 1.63G Shader Clock  1.78%

These are the same 'scores' as I had with the 5.10.0-7-amd64 kernel.
So applying the mentioned/attached patches on top of the current kernel
as available in Debian Testing/Bullseye and Sid, fixes the problem.
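
The idle check above can also be scripted. The following is only a minimal sketch: it assumes the GPU is card0 (as in the commands above), and the 5% idle threshold is a made-up value for illustration, not something the driver defines.

```python
from pathlib import Path

# Path taken from the commands above; "card0" is an assumption that may not
# hold on systems with more than one GPU.
BUSY_PATH = Path("/sys/class/drm/card0/device/gpu_busy_percent")


def read_busy_percent(path: Path = BUSY_PATH) -> int:
    """Return the GPU load in percent as exposed by the amdgpu driver."""
    return int(path.read_text().strip())


def looks_idle(samples, threshold: int = 5) -> bool:
    """True if there is at least one sample and all are <= threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return bool(samples) and all(s <= threshold for s in samples)


if __name__ == "__main__":
    import time

    samples = []
    for _ in range(5):
        samples.append(read_busy_percent())
        time.sleep(1)
    print(samples, "idle" if looks_idle(samples) else "busy")
```

On an affected kernel the samples would be expected to stay near 99, as in the reports in this thread.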

In the last year I've spent considerable time bringing down my energy
usage/needs and I never expected that (reverting) 2 kernel commits
would save me 67W (continuously), so thank you very much piorunz for
bringing this to my attention.
I normally have a quiet system and noticed it often wasn't quiet lately; 
I blame(d) 'baloo' (file indexing) for that, but it turns out it was mostly
my GPU running at 100% all the time.
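
For a sense of scale, here is a rough sketch of what a continuous 67 W difference adds up to. The two power readings are the power1 values from the sensors output in this thread; running around the clock for a full year is an assumption, and no electricity price is implied.

```python
IDLE_WATTS_AFFECTED = 74.0  # power1 on 5.10.46-2, from the sensors output
IDLE_WATTS_FIXED = 7.0      # power1 with both reverts applied
HOURS_PER_YEAR = 24 * 365   # assumption: the machine is on all year


def wasted_kwh(affected_w: float, fixed_w: float,
               hours: float = HOURS_PER_YEAR) -> float:
    """Energy in kWh consumed by a constant power difference over `hours`."""
    return (affected_w - fixed_w) * hours / 1000.0


if __name__ == "__main__":
    kwh = wasted_kwh(IDLE_WATTS_AFFECTED, IDLE_WATTS_FIXED)
    print(f"{kwh:.1f} kWh per year")
```

Under those assumptions this comes to roughly 587 kWh per year for a single card.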

As it looks like a lot of users with AMD GPUs are affected, and a
considerable amount of energy is wasted because of it (Climate Change),
I really hope/urge that these 2 patches/reverts get applied before Bullseye
is released.

Cheers,
  Diederik
From 1bd81429d53ded4e111616c755a64fad80849354 Mon Sep 17 00:00:00 2001
From: Yifan Zhang 
Date: Sat, 19 Jun 2021 11:40:54 +0800
Subject: Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG
 issue."

commit ee5468b9f1d3bf48082eed351dace14598e8ca39 upstream.

This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20.

Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.

Signed-off-by: Yifan Zhang 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 1859d293ef712..fb15e8b5af32f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3619,12 +3619,8 @@ static int gfx_v9_0_kiq_init_register(struct amdgpu_ring *ring)
 	if (ring->use_doorbell) {
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_LOWER,
 	(adev->doorbell_index.kiq * 2) << 2);
-		/* If GC has entered CGPG, ringing doorbell > first page doesn't
-		 * wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround
-		 * this issue.
-		 */
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_UPPER,
-	(adev->doorbell.size - 4));
+	(adev->doorbell_index.userqueue_end * 2) << 2);
 	}
 
 	WREG32_SOC15_RLC(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
-- 
cgit 1.2.3-1.el7

From fea853aca3210c21dfcf07bb82d501b7fd1900a7 Mon Sep 17 00:00:00 2001
From: Yifan Zhang 
Date: Sat, 19 Jun 2021 11:39:43 +0800
Subject: Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to
 cover full doorbell."

commit baacf52a473b24e10322b67757ddb92ab8d86717 upstream.

This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3.



2021-07-24 Thread Diederik de Haas
> It's already backported to 5.10, just after 5.10.46 was released:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7

Just found out the reverted commit was introduced just before the 5.10.46 tag
was created, which should mean that any version before 5.10.46 should NOT
have this problem. The version uploaded to the Debian archive before that
was 5.10.40-1, which in my case was linux-image-5.10.0-7-amd64.
I no longer had that installed, so I added the following line to
/etc/apt/sources.list:
deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20210529T204006Z/ sid main

Installed linux-image-5.10.0-7-amd64 and rebooted into that ...

On Saturday, 24 July 2021 14:22:38 CEST Diederik de Haas wrote:
> On Saturday, 24 July 2021 01:29:55 CEST piorunz wrote:
> > GPU core works at 100% usage at all times, even at idle.
> > 
> > $ cat /sys/class/drm/card0/device/gpu_busy_percent
> > 99
> > 
> > $ sensors
> > (...)
> > amdgpu-pci-0900
> > Adapter: PCI adapter
> > vddgfx:        1.14 V
> > fan1:        1098 RPM  (min =    0 RPM, max = 3000 RPM)
> > edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
> >                        (emerg = +105.0°C)
> > junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
> >                        (emerg = +115.0°C)
> > mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
> >                        (emerg = +105.0°C)
> > power1:       74.00 W  (cap = 272.00 W)
> > 
> > radeontop - 100% GPU usage and full clocks:
> > Graphics pipe 100.00%
> > 1.00G / 1.00G Memory Clock 100.00%
> > 2.47G / 2.58G Shader Clock  95.92%
> 
> I'm getting the same results on my Radeon RX Vega 64.
> $ cat /sys/class/drm/card0/device/gpu_busy_percent
> 99
> $ sensors
> nvme-pci-0100
> Adapter: PCI adapter
> Composite:    +43.9°C  (low  = -273.1°C, high = +72.8°C)
>                        (crit = +75.8°C)
> Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
> Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)
> 
> amdgpu-pci-0c00
> Adapter: PCI adapter
> vddgfx:        1.09 V
> fan1:        1240 RPM  (min =    0 RPM, max = 3500 RPM)
> edge:         +50.0°C  (crit = +85.0°C, hyst = -273.1°C)
>                        (emerg = +90.0°C)
> junction:     +63.0°C  (crit = +105.0°C, hyst = -273.1°C)
>                        (emerg = +110.0°C)
> mem:          +51.0°C  (crit = +95.0°C, hyst = -273.1°C)
>                        (emerg = +100.0°C)
> power1:       74.00 W  (cap = 260.00 W)
> 
> k10temp-pci-00c3
> Adapter: PCI adapter
> Tctl:         +66.0°C
> Tdie:         +46.0°C
> 
> And also for radeontop.

$ cat /sys/class/drm/card0/device/gpu_busy_percent
0
$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +32.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +32.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +41.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:      750.00 mV
fan1:        1182 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +37.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +38.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +39.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:        7.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +65.6°C
Tdie:         +45.6°C

# radeontop
Graphics pipe 0.83%
0.17G / 0.94G Memory Clock 17.67%
0.03G / 1.63G Shader Clock  1.81%

So a kernel older than 5.10.46 (here 5.10.40-1) does not have this problem :)


2021-07-24 Thread tv.deb...@googlemail.com
I see this on Vega (RX 56) as well as Navi10 (5700XT), and I think
Polaris (RX 570) is affected too. So it's not only an issue with Navi 2X.
Is it worth opening a separate bug report for each card?





2021-07-24 Thread Diederik de Haas
Control: tags -1 confirmed

On Saturday, 24 July 2021 01:29:55 CEST piorunz wrote:
> GPU core works at 100% usage at all times, even at idle.
> 
> $ cat /sys/class/drm/card0/device/gpu_busy_percent
> 99
> 
> $ sensors
> (...)
> amdgpu-pci-0900
> Adapter: PCI adapter
> vddgfx:        1.14 V
> fan1:        1098 RPM  (min =    0 RPM, max = 3000 RPM)
> edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
>                        (emerg = +105.0°C)
> junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
>                        (emerg = +115.0°C)
> mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
>                        (emerg = +105.0°C)
> power1:       74.00 W  (cap = 272.00 W)
> 
> radeontop - 100% GPU usage and full clocks:
> Graphics pipe 100.00%
> 1.00G / 1.00G Memory Clock 100.00%
> 2.47G / 2.58G Shader Clock  95.92%

I'm getting the same results on my Radeon RX Vega 64.
$ cat /sys/class/drm/card0/device/gpu_busy_percent
99
$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +43.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +49.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:        1.09 V
fan1:        1240 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +50.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +63.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +51.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:       74.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +66.0°C
Tdie:         +46.0°C

And also for radeontop.

> This I believe has been fixed upstream:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1632
> 
> Can you please make sure this will be backported to Debian 5.10 kernel?

It's already backported to 5.10, just after 5.10.46 was released:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7


2021-07-23 Thread piorunz
Package: src:linux
Version: 5.10.46-2
Severity: important
X-Debbugs-Cc: pior...@gmx.com

Dear Maintainer,

I am using Bullseye 11 with a Radeon 6800 XT and noticed higher
temperature and noise compared to Windows, so I investigated this and
found the fault. The GPU core works at 100% usage at all times, even at idle.

$ cat /sys/class/drm/card0/device/gpu_busy_percent
99

$ sensors
(...)
amdgpu-pci-0900
Adapter: PCI adapter
vddgfx:        1.14 V
fan1:        1098 RPM  (min =    0 RPM, max = 3000 RPM)
edge:         +51.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +55.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +56.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
power1:       74.00 W  (cap = 272.00 W)

radeontop - 100% GPU usage and full clocks:
Graphics pipe 100.00%
1.00G / 1.00G Memory Clock 100.00%
2.47G / 2.58G Shader Clock  95.92%

This card should be using 0% GPU time when idle, downclocking the core to 10 MHz,
and using 9 to 34W depending on the number of monitors connected.
On the Debian 5.10 kernel, it's using 74W minimum, at all times.
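
The comparison above can be automated with a small sketch that pulls the power1 reading out of `sensors` output and checks it against the 9 to 34W idle range mentioned here. The regular expression is an assumption based on the output format shown in this report. It returns the first power1 line it finds, so it does not distinguish between several GPUs.

```python
import re
from typing import Optional

# Idle range quoted in this report: 9 to 34 W depending on connected monitors.
EXPECTED_IDLE_W = (9.0, 34.0)

# Matches lines such as "power1:       74.00 W  (cap = 272.00 W)".
POWER_RE = re.compile(r"^power1:\s*([0-9.]+)\s*W", re.MULTILINE)


def power1_watts(sensors_output: str) -> Optional[float]:
    """Return the first power1 reading in watts, or None if there is none."""
    match = POWER_RE.search(sensors_output)
    return float(match.group(1)) if match else None


def above_expected_idle(watts: float, expected=EXPECTED_IDLE_W) -> bool:
    """True if the reading exceeds the upper end of the expected idle range."""
    return watts > expected[1]


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    watts = power1_watts(output)
    if watts is None:
        print("no power1 reading found")
    else:
        print(watts, "W", "(too high for idle)" if above_expected_idle(watts) else "")
```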


This I believe has been fixed upstream:
https://gitlab.freedesktop.org/drm/amd/-/issues/1632

Can you please make sure this will be backported to Debian 5.10 kernel?

Thanks.

Kind regards,
piorunz


-- Package-specific info:
** Version:
Linux version 5.10.0-8-amd64 (debian-ker...@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-2 (2021-07-20)

** Command line:
BOOT_IMAGE=/@rootfs/boot/vmlinuz-5.10.0-8-amd64 root=UUID=77ba5989-5d64-4929-9145-ede6751a4102 ro rootflags=subvol=@rootfs amdgpu.ppfeaturemask=0xfffd7fff

** Not tainted

** Kernel log:
[ 4256.341304] microcode: CPU13: patch_level=0x0a201016
[ 4256.343423] ACPI: \_SB_.PLTF.C00B: Found 2 idle states
[ 4256.345209] CPU13 is up
[ 4256.345219] smpboot: Booting Node 0 Processor 14 APIC 0xd
[ 4256.345313] microcode: CPU14: patch_level=0x0a201016
[ 4256.347435] ACPI: \_SB_.PLTF.C00D: Found 2 idle states
[ 4256.349222] CPU14 is up
[ 4256.349236] smpboot: Booting Node 0 Processor 15 APIC 0xf
[ 4256.349330] microcode: CPU15: patch_level=0x0a201016
[ 4256.351451] ACPI: \_SB_.PLTF.C00F: Found 2 idle states
[ 4256.353230] CPU15 is up
[ 4256.354126] ACPI: Waking up from system sleep state S3
[ 4256.453431] usb usb1: root hub lost power or was reset
[ 4256.453432] usb usb2: root hub lost power or was reset
[ 4256.453565] sd 0:0:0:0: [sda] Starting disk
[ 4256.453566] sd 1:0:0:0: [sdb] Starting disk
[ 4256.453568] sd 2:0:0:0: [sdc] Starting disk
[ 4256.453573] sd 3:0:0:0: [sdd] Starting disk
[ 4256.453577] sd 5:0:0:0: [sde] Starting disk
[ 4256.453768] [drm] PCIE GART of 512M enabled (table at 0x0080).
[ 4256.453786] [drm] PSP is resuming...
[ 4256.648463] nvme nvme0: 8/0/0 default/read/poll queues
[ 4256.661079] [drm] reserve 0xa0 from 0x83fe00 for PSP TMR
[ 4256.768036] ata5: SATA link down (SStatus 0 SControl 330)
[ 4256.801229] usb 1-7: reset high-speed USB device number 8 using xhci_hcd
[ 4256.929094] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4256.929312] ata6.00: configured for UDMA/133
[ 4256.937100] amdgpu :09:00.0: amdgpu: SMU is resuming...
[ 4256.937104] amdgpu :09:00.0: amdgpu: smu driver if version = 0x0039, smu fw if version = 0x003b, smu fw version = 0x003a3100 (58.49.0)
[ 4256.937105] amdgpu :09:00.0: amdgpu: SMU driver if version not matched
[ 4257.005584] amdgpu :09:00.0: amdgpu: SMU is resumed successfully!
[ 4257.006633] [drm] DMUB hardware initialized: version=0x0201
[ 4257.081355] usb 1-6: reset full-speed USB device number 7 using xhci_hcd
[ 4257.445313] usb 1-9: reset high-speed USB device number 9 using xhci_hcd
[ 4257.737242] usb 1-10: reset full-speed USB device number 11 using xhci_hcd
[ 4258.093501] usb 1-2: reset full-speed USB device number 2 using xhci_hcd
[ 4258.108494] [drm] kiq ring mec 2 pipe 1 q 0
[ 4258.125155] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4258.134188] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 4258.134405] [drm] JPEG decode initialized successfully.
[ 4258.134427] amdgpu :09:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4258.134428] amdgpu :09:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4258.134428] amdgpu :09:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 4258.134429] amdgpu :09:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 4258.134430] amdgpu :09:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 4258.134430] amdgpu