[Kernel-packages] [Bug 2026658] Re: CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

Eli Mon, 10 Jul 2023 18:41:16 -0700

> Maybe it's caused by thermald? See if `sudo systemctl stop thermald`
can help.


I will try this, and I will also reinstall the package.

I am waiting to see if using sane power parameters in my script for bug
1 fixes the bug 2 issue, but that means I need to leave it sitting for
12+ hours so my iteration speed here is very slow.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2026658

Title:
  CPU frequency governor broken after upgrading from 22.10 to 23.04,
  stuck at 400Mhz on Alder Lake

Status in linux package in Ubuntu:
  New

Bug description:
  I've tried to include as much detail as possible in this bug report. I
  originally assembled it just after the release of Ubuntu 23.04, and there
  has been no change since then.

  
  I have had substantial performance problems since updating from Ubuntu 22.10
  to 23.04.
  The computer in question is the 17 inch Razer Blade laptop from 2022 with an
  Intel i7-12800H.
  Current kernel is 6.2.0-20-generic. (I am now on 6.2.0-24-generic and nothing
  has changed.)
  This issue occurs regardless of whether the OpenRazer
  (https://openrazer.github.io/) drivers etc. are installed.

  
  Description of problem:
  I have discovered what may be two separate bugs involving low-level power
  management details on the cpu. They involve the cpu entering different types
  of throttled states and never recovering. These issues appeared immediately
  after upgrading from Ubuntu 22.10. The computer is a large ~gaming laptop
  with plenty of thermal headroom; cpu temperatures cannot reach concerning
  values except when using stress testing tools.

  (I don't know how to properly untangle these two issues, so I'm posting
  them as one. I apologize for the review complexity this causes, but I
  think posting the information all in one spot is more constructive
  here.)

  
  High level testing notes:
  - This issue occurs with both the intel_pstate driver and the cpufreq
    driver. (I don't have the same level of detail for cpufreq, but the issue
    still occurs.)
  - I have additionally tested a handful of intel_pstate parameters (and
    others) via grub kernel command line arguments, to no effect. All testing
    reported here was done with:
    GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau"
    GRUB_CMDLINE_LINUX=""
    (Loading nouveau caused problems for me on 22.10; I have not bothered
    reinvestigating it on 23.04.)
  - There is a firmware update available from the manufacturer when I boot into
    Windows. I have not installed it (yet).
  - - Update: I installed it. No change.
  - Changing the cpu governor setting from "powersave" to "performance" using
    `cpupower frequency-set -g performance` has no effect. (Note: this action
    is separate from the intel_pstate's power-saver/balanced/performance
    setting visible with the `powerprofilesctl` utility.) It doesn't seem to be
    a governor bug.
  - - (There is a tertiary issue where I also see substantial (+50%)
    performance degradation using the "performance" profile in a test suite I
    run constantly for my job. That is clearly a problem, but it is an
    unrelated bug that has existed for quite some time.)

  
  Summary and my own conclusions:
  These are my takeaways; the ~raw data is in the section that follows.

  
  Bug 1)
  The reported cpu power limits are progressively constrained over time. Once
  this failure mode starts, performance never recovers.
    - As this situation progresses, the observed cpu speeds (I'm using htop)
      list as 2800MHz at idle, but the instant any load at all is placed on a
      cpu core, that core immediately drops to exactly 400MHz.
    - This situation occurs quite quickly in human terms, frequently within 20
      minutes of normal usage after a boot, but it will also occur when the
      computer is just sitting there unused for a handful of hours.
    - This also occurs when using the cpufreq governor (by including
      "intel_pstate=disable" on the grub command line args).
    - At boot the default value for short_term_time looks wrong to me. This is
      the duration of higher thermal targets in seconds; ~0.002 seconds seems
      extremely short. A normal value would be a handful of seconds.
    - This situation can be remedied by running the following python script. It
      uses the undervolt package (pip install undervolt==0.3.0) to force
      particular power limits (the provided values are intentional overkill):
        from undervolt import read_power_limit, set_power_limit, PowerLimit, ADDRESSES
        from pprint import pprint

        limits = read_power_limit(ADDRESSES)
        pprint(vars(limits))  # print current values before setting them

        POWER_LIMITS = PowerLimit()
        POWER_LIMITS.locked = True  # lock means don't allow the value to be reset until a reboot.
        POWER_LIMITS.backup_rest = 281474976776192  # afaik this is just a backup-on-failure setting, it has no effect here.
        POWER_LIMITS.long_term_enabled = True
        POWER_LIMITS.long_term_power = 160  # values are intentional overkill
        POWER_LIMITS.long_term_time = 2880.0
        POWER_LIMITS.short_term_enabled = True
        POWER_LIMITS.short_term_power = 250
        POWER_LIMITS.short_term_time = 500.0
        set_power_limit(POWER_LIMITS, ADDRESSES)

        limits2 = read_power_limit(ADDRESSES)  # and print the new state
        pprint(vars(limits2))
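
  A note on the short_term_time value: 0.00244140625 is not arbitrary. Intel's
  RAPL power-limit register stores each time window as a 5-bit exponent Y and
  a 2-bit fraction Z, with window = 2^Y * (1 + Z/4) * time_unit. The following
  is a minimal sketch of that arithmetic, assuming the common time unit of
  1/1024 s (the actual unit is reported by MSR_RAPL_POWER_UNIT and was not
  captured in this report). The helper names are illustrative and are not part
  of undervolt.

```python
# Sketch: RAPL time-window encoding, ASSUMING time_unit = 1/1024 s.
# The real unit comes from MSR_RAPL_POWER_UNIT and may differ on other parts.

ASSUMED_TIME_UNIT = 1 / 1024  # seconds; an assumption, not read from hardware


def decode_time_window(y, z, time_unit=ASSUMED_TIME_UNIT):
    """Return the window in seconds for exponent y (5 bits) and fraction z (2 bits)."""
    return (2 ** y) * (1 + z / 4) * time_unit


def find_encoding(seconds, time_unit=ASSUMED_TIME_UNIT):
    """Search all (y, z) pairs for one that decodes exactly to `seconds`."""
    for y in range(32):
        for z in range(4):
            if decode_time_window(y, z, time_unit) == seconds:
                return (y, z)
    return None  # not representable with this time unit


if __name__ == "__main__":
    # Values taken from the log in this report.
    for label, value in [
        ("short_term_time", 0.00244140625),
        ("long_term_time at boot", 32.0),
        ("long_term_time later", 28.0),
    ]:
        print(f"{label}: {value} s -> (Y, Z) = {find_encoding(value)}")
```

  Under that assumption the logged values decode to (Y, Z) = (1, 1) for
  0.00244140625 s, (15, 0) for 32.0 s and (14, 3) for 28.0 s, so the
  short-term window looks like a valid encoding near the bottom of the range
  rather than a corrupted reading.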

  
  Bug 2)
  `powerprofilesctl` has unearthed a bug where cpu performance enters the
  degraded state "high-operating-temperature" and never recovers.
    - This appears to happen for no reason. There is a brief cpu temperature
      spike in the example data below, but it does not hit the listed hardware
      limit values, so I am at a loss for its cause.
    - I ran a cpu stress test (prime95/mprime torture test). It immediately
      spikes cpu temperature to 100 degrees and throttles the cpu, but doesn't
      trigger the high temperature degraded state. Go figure.
    - This bug takes quite a while to kick in; uptime in my example below was
      at over 14 hours.
    - When this situation occurs the maximum cpu speed becomes 2400MHz across
      all cpu cores. The cpu power management appears to behave correctly in
      the 400-2400MHz range. I believe this means all turbo frequencies are
      disabled.
    - Running the command `sudo cpupower frequency-set -u 4800000` (or any
      value above 2400000) does not correct the reported cpu_policy_range; it
      remains locked at 2400MHz.
    - The only fix I know is a reboot.
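
  One way to confirm which cpus are affected is to compare each policy's
  scaling_max_freq with its cpuinfo_max_freq in sysfs. The following is a
  minimal sketch, assuming the standard cpufreq sysfs layout under
  /sys/devices/system/cpu. clamped_policies is an illustrative name, and the
  no_turbo file only exists when intel_pstate is the active driver.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_khz(path):
    """Read a cpufreq sysfs file holding a frequency in kHz."""
    return int(path.read_text().strip())


def clamped_policies(cpu_root=CPU_ROOT):
    """Return {cpu_name: (policy_max_khz, hardware_max_khz)} for every cpu
    whose policy maximum is below its hardware maximum."""
    clamped = {}
    for cpufreq in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        policy_max = read_khz(cpufreq / "scaling_max_freq")
        hardware_max = read_khz(cpufreq / "cpuinfo_max_freq")
        if policy_max < hardware_max:
            clamped[cpufreq.parent.name] = (policy_max, hardware_max)
    return clamped


if __name__ == "__main__":
    for cpu, (policy_max, hardware_max) in clamped_policies().items():
        print(f"{cpu}: policy max {policy_max / 1000:.0f} MHz, "
              f"hardware max {hardware_max / 1000:.0f} MHz")
    no_turbo = CPU_ROOT / "intel_pstate" / "no_turbo"
    if no_turbo.exists():
        print("intel_pstate no_turbo:", no_turbo.read_text().strip())
```

  The comparison is done per cpu because performance and efficiency cores on a
  hybrid part can report different hardware maximums.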

  
  THE DATA:

  Bug 1:
  This output was gathered with the read_power_limit function from the
  undervolt python package, called from a script that starts running at ~boot.
  The long_term_power and short_term_power metrics are values in watts;
  long_term_time and short_term_time are values in seconds.
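
  The same limits can usually be read without undervolt through the kernel's
  powercap sysfs interface. The following is a minimal sketch, assuming the
  package zone is exposed at /sys/class/powercap/intel-rapl:0. The path and
  helper names are assumptions, and this is not the script that produced the
  log below.

```python
import time
from pathlib import Path

RAPL_PACKAGE = Path("/sys/class/powercap/intel-rapl:0")  # assumed zone path


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_constraints(zone=RAPL_PACKAGE):
    """Return {constraint_name: {"power_w": ..., "time_s": ...}} for one zone.

    sysfs reports microwatts and microseconds; both are converted here.
    Values that cannot be read are returned as None.
    """
    constraints = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        prefix = name_file.name[: -len("name")]  # e.g. "constraint_0_"
        name = name_file.read_text().strip()
        power_uw = read_int(zone / f"{prefix}power_limit_uw")
        time_us = read_int(zone / f"{prefix}time_window_us")
        constraints[name] = {
            "power_w": None if power_uw is None else power_uw / 1e6,
            "time_s": None if time_us is None else time_us / 1e6,
        }
    return constraints


if __name__ == "__main__":
    # One timestamped snapshot; run it from cron or a loop to build a log.
    print(time.strftime("%Y-%m-%d %H:%M:%S"))
    for name, values in read_constraints().items():
        print(f" {name}_power: {values['power_w']}")
        print(f" {name}_time: {values['time_s']}")
```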

  2023-05-12  15:14:32 up 0 min,  0 user,  load average: 0.39, 0.10, 0.03 
  (boot, log starts after normal user login)
   long_term_power: 65.0
   long_term_time: 32.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:29 up 6 min,  2 users,  load average: 1.90, 0.86, 0.37 
   long_term_power: 20.875  <-- down
   long_term_time: 28.0  <-- down
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:46 up 6 min,  2 users,  load average: 1.63, 0.87, 0.38 
   long_term_power: 22.625  <-- hey it went up! I was still using the computer
                                at this point
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:46:15 up 32 min,  2 users,  load average: 0.66, 0.84, 0.79 
  (no longer at computer by the time this occurs)
   long_term_power: 20.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  16:04:46 up 50 min,  3 users,  load average: 0.46, 0.70, 0.79 
   long_term_power: 16.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  17:23:07 up  2:08,  3 users,  load average: 0.49, 0.61, 0.68 
  (By the time long_term_power hits 8.625, all cpu cores throttle to 400MHz
  under any load. This one was preceded by ~1 second of a single cpu core
  randomly spiking to 78 degrees; output from `powerprofilesctl` remains
  normal. At this point long_term_power will never go up again. I have seen
  one more lowered stage at ~4.3125 W.)
   long_term_power: 8.625  <-- way down - I've seen lower, though.
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  (And then after several hours stuck in this mode I returned to the
  computer and needed to run the script in the bug 1 summary to make it
  usable again.)

  
  Bug 2:
  (Some cleanup of output, script starts at ~boot)
  2023-05-11  22:21:15 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52 

  Output from powerprofilesctl:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   no
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 2
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 4.70 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the
  cpu core numbers):
    | iwlwifi_1-virtual-0 - Adapter: Virtual device - temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter - Composite:
    |   +40.9°C  (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter:
    |   Composite:   +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:    +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:    +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    | Package id 0:  +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 0:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 4:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 8:        +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 12:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 16:       +64.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 20:       +45.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 24:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 25:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 26:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 27:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 28:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 29:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 30:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 31:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface: temp1: +27.8°C (crit = +105.0°C)

  
  2023-05-11  22:21:17 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52
  (2 seconds later)

  output from `powerprofilesctl`:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   yes (high-operating-temperature)
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 8
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 2.40 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the
  cpu core numbers):
    | iwlwifi_1-virtual-0 Adapter: Virtual device temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter
    |   Composite:    +40.9°C  (low =  -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter
    |   Composite:    +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:     +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:     +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    |   Package id 0:  +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 0:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 4:        +59.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 8:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 12:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 16:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 20:       +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 24:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 25:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 26:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 27:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 28:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 29:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 30:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 31:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface - temp1: +27.8°C (crit = +105.0°C)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2026658/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
