On 2022-01-09, Ryan Kavanagh wrote:
> Hi Stuart,
>
> On Sun, Jan 09, 2022 at 03:12:06PM -0000, Stuart Henderson wrote:
>> That's shown after the dmesg, though "vmstat -i" would probably be
>> better as it has the totals as well.
>
> vmstat -i under GENERIC before rebooting into GENERIC.MP:
>
> interrupt                       total     rate
> irq0/clock                     885333       99
> irq96/acpi0                      1789        0
> irq144/inteldrm0               117024       13
> irq114/em0                      12414        1
> irq176/azalia1                   1668        0
> irq115/iwm0                    416586       47
> irq102/ehci0                      221        0
> irq103/ahci0                   124340       14
> irq104/ichiic0                   1770        0
> irq145/pckbc0                   17710        1
> irq146/pckbc0                   44588        5
> Total                         1623443      183
>
>> Ryan, any difference if you use GENERIC.MP rather than GENERIC?
>
> Yes. Looking at top, I see that CPU0 is spending roughly the same amount
> of time processing interrupts, while CPU1 is free to do other stuff.
> Representative capture:
>
> CPU0 states: 18.6% user,  0.0% nice,  3.0% sys,  0.2% spin, 44.3% intr, 33.9% idle
> CPU1 states: 50.3% user,  0.0% nice,  5.0% sys,  1.6% spin,  0.0% intr, 43.1% idle
>
> `systat vm` shows an Int% hovering around 20% (I imagine it's the mean
> of the two interrupt rates shown by top), with lows in the low 10% and
> highs in the low 30%.
>
> My laptop does not seem to get as hot as quickly under GENERIC.MP. I
> don't have hard data (I suppose one could plot the output of `sysctl
> hw.sensors` across time to compare), but at least now my laptop fan
> isn't spinning most of the time and my laptop doesn't quickly get too
> hot to handle comfortably.
>
> vmstat -i under GENERIC.MP after being up ~35 minutes with light load:
>
> interrupt                       total     rate
> irq0/clock                     860124      399
> irq0/ipi                       714019      331
> irq96/acpi0                       436        0
> irq144/inteldrm0                33945       15
> irq114/em0                       7965        3
> irq176/azalia1                      1        0
> irq115/iwm0                     73471       34
> irq102/ehci0                       73        0
> irq103/ahci0                    44676       20
> irq104/ichiic0                    431        0
> irq145/pckbc0                    6441        2
> irq146/pckbc0                    3178        1
> Total                         1744760      810
>
> Switching to GENERIC.MP solves the practical issue for me: my laptop no
> longer quickly overheats. That said, I don't know what is a normal rate
> of interrupts under GENERIC.MP, and I'm happy to debug further if you
> think it would be helpful.

The interrupt rates look pretty normal to me, but the cpu time spent
processing them is much higher than I'd expect.
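
One caveat when reading these tables: the rate column in vmstat -i is
just the total divided by uptime (860124 / 399 is about 2156 seconds,
which matches your ~35 minutes), so it is an average since boot rather
than the current rate. If you want rates over a short window, a rough
sketch is to save two snapshots and diff them. The function name and
file names here are made up for illustration, and it assumes the
three-column layout shown above:

```sh
# intr_delta BEFORE AFTER SECONDS
# Print per-source interrupts/second between two saved "vmstat -i" outputs.
# Hypothetical helper; assumes the columns are: name, total, rate.
intr_delta() {
    awk -v secs="$3" '
        FNR == 1  { next }                    # skip the header line of each file
        NR == FNR { before[$1] = $2; next }   # first file: remember the totals
        ($1 in before) {                      # second file: diff against the first
            printf "%-20s %10.1f\n", $1, ($2 - before[$1]) / secs
        }
    ' "$1" "$2"
}

# Possible usage (paths are examples only):
#   vmstat -i > /tmp/intr.1; sleep 10; vmstat -i > /tmp/intr.2
#   intr_delta /tmp/intr.1 /tmp/intr.2 10
```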

If you're interested in digging deeper, I think the simplest way to find
out which part of the kernel is involved is probably dt(4) / btrace(8).
This is a system tracing facility which can be used for many things, but
one common and simple use is to see which functions the kernel spends its
time in, which will be helpful here.

In recent OpenBSD this is included in the default kernel builds but must
be enabled by setting "kern.allowdt=1" in /etc/sysctl.conf and rebooting
(it can't be changed in normal runtime, once the system securelevel has
been raised).

When that's done you can run the following as root:

btrace -e 'profile:hz:100 { @[kstack] = count(); }' > /tmp/btrace.out

Run it for a while and press ^C. In many use cases you'd want to be
carrying out some particular activity while tracing, to see what the
kernel does during it, but here I'd think it's better to do it with the
machine mostly idle; running it for 5-10 seconds would probably be
enough.

That produces a bpftrace-compatible file which you can process by piping
through the following two perl scripts in order (IIRC they are self-
contained and don't require installing any non-base Perl modules,
but ask if you have any problems getting it to work) :-

https://github.com/brendangregg/FlameGraph/raw/master/stackcollapse-bpftrace.pl
https://github.com/brendangregg/FlameGraph/raw/master/flamegraph.pl

The output from the second is an interactive SVG file showing what the
kernel was doing during that time; you can load it into a browser and
click around to expand things.
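
Put together, the processing step might look something like the sketch
below. The function name and the output path are made up for
illustration, and it assumes both scripts were saved into one directory
(the current one by default):

```sh
# make_flamegraph TRACE OUT [SCRIPT_DIR]
# Hypothetical wrapper: collapse the btrace stacks, then render them.
# Assumes both FlameGraph scripts were saved in SCRIPT_DIR (default: ".").
make_flamegraph() {
    trace=$1 out=$2 dir=${3:-.}
    # Bail out early if the trace file is missing or empty.
    [ -s "$trace" ] || { echo "no trace data in $trace" >&2; return 1; }
    perl "$dir/stackcollapse-bpftrace.pl" "$trace" |
        perl "$dir/flamegraph.pl" > "$out"
}

# e.g.: make_flamegraph /tmp/btrace.out /tmp/btrace.svg
```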

If you do this then I'd suggest following up on bugs@ - make it into a
self-contained mail so anyone looking into it doesn't need to jump around
other list posts to gather details. Preferably generate it with sendbug(1)
to include dmesg/pcidump/etc. It's often easier to use "sendbug -P" and
move that to a standard email client. Describe the problem again, i.e.
high cpu % in interrupt, include the information you've already posted
on this thread, and either attach or link to an online copy of the
results from btrace.

