On Fri, Nov 02, 2018 at 10:19:15PM +0100, Mark Kettenis wrote:
> I think Ted is pointing out a different issue here that doesn't really
> have anything to do with SMT. The issue is that in some cases we end
> up with CPUs reporting being 100% busy running the idle thread instead
> of reporting being 100% idle. This happens quite a lot on machines
> with lots of CPUs immediately after they are booted. Usually this
> funny state disappears after some time.
>
> An idle CPU is of course running the idle thread, so in that sense
> this isn't super-strange. But it does indicate there is some kind of
> accounting issue. I have a feeling this happens before any processes
> have been scheduled on these CPUs. But I've never found the problem...
I can easily reproduce this on a two-socket machine with 4 cores per
socket. Hyper-threading is turned off in the BIOS.

schedcpu() does not account the run time if p->p_slptime > 1. So
fresh idle threads keep a CPU percentage of 99%.
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11006 99.0 0.0 0 0 ?? RK/6 10:19PM 0:15.51 (idle6)
root 26529 99.0 0.0 0 0 ?? RK/7 10:19PM 2:55.45 (idle7)
root 55382 99.0 0.0 0 0 ?? RK/3 10:19PM 2:55.27 (idle3)
root 84574 99.0 0.0 0 0 ?? RK/4 10:19PM 2:54.90 (idle4)
root 53490 99.0 0.0 0 0 ?? RK/5 10:19PM 2:49.72 (idle5)
root 59318 73.2 0.0 0 0 ?? DK 10:19PM 2:53.36 (idle0)
root 24358 0.0 0.0 0 0 ?? RK/1 10:19PM 2:54.66 (idle1)
root 15902 0.0 0.0 0 0 ?? RK/2 10:19PM 2:55.04 (idle2)
This effect goes away after running multiple processes on all cores.
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 55382 11.7 0.0 0 0 ?? RK/3 10:19PM 4:06.52 (idle3)
root 84574 11.6 0.0 0 0 ?? DK 10:19PM 4:12.00 (idle4)
root 53490 11.7 0.0 0 0 ?? RK/5 10:19PM 4:11.94 (idle5)
root 11006 11.7 0.0 0 0 ?? RK/6 10:19PM 3:48.90 (idle6)
root 26529 1.3 0.0 0 0 ?? RK/7 10:19PM 4:11.81 (idle7)
root 59318 0.4 0.0 0 0 ?? DK 10:19PM 4:09.78 (idle0)
root 24358 0.0 0.0 0 0 ?? RK/1 10:19PM 4:11.37 (idle1)
root 15902 0.0 0.0 0 0 ?? RK/2 10:19PM 4:11.87 (idle2)
If I initialize p_slptime with 127, the 99% effect does not appear
in the first place.
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 86374 0.0 0.0 0 0 ?? RK/1 10:18PM 3:10.15 (idle1)
root 78654 0.0 0.0 0 0 ?? DK 10:18PM 3:09.52 (idle2)
root 85088 0.0 0.0 0 0 ?? DK 10:18PM 3:10.82 (idle3)
root 34080 0.0 0.0 0 0 ?? RK/4 10:18PM 2:06.89 (idle4)
root 65391 0.0 0.0 0 0 ?? RK/5 10:18PM 0:20.39 (idle5)
root 6812 0.0 0.0 0 0 ?? RK/6 10:18PM 0:36.33 (idle6)
root 6433 0.0 0.0 0 0 ?? RK/7 10:18PM 0:53.09 (idle7)
root 87244 0.0 0.0 0 0 ?? RK/0 10:18PM 3:07.58 (idle0)
There are still things I don't understand. After a while the CPU
time for idle5, idle6, and idle7 stops increasing. I am doing
iperf3 performance tests on this machine. My patch makes the results
more unsteady and the throughput lower. It seems that iperf3
processes get scheduled on CPUs with less memory affinity.
bluhm
Index: kern/kern_sched.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.54
diff -u -p -r1.54 kern_sched.c
--- kern/kern_sched.c 17 Nov 2018 23:10:08 -0000 1.54
+++ kern/kern_sched.c 17 Dec 2018 19:35:33 -0000
@@ -145,6 +145,7 @@ sched_idle(void *v)
 	 */
 	SCHED_LOCK(s);
 	cpuset_add(&sched_idle_cpus, ci);
+	p->p_slptime = 127;
 	p->p_stat = SSLEEP;
 	p->p_cpu = ci;
 	atomic_setbits_int(&p->p_flag, P_CPUPEG);