This is fun.
On 17/12/18(Mon) 22:30, Alexander Bluhm wrote:
> On Fri, Nov 02, 2018 at 10:19:15PM +0100, Mark Kettenis wrote:
> > I think Ted is pointing out a different issue here that doesn't really
> > have anything to do with SMT. The issue is that in some cases we end
> > up with CPUs reporting being 100% busy running the idle thread instead
> > of reporting being 100% idle. This happens quite a lot on machines
> > with lots of CPUs immediately after they are booted. Usually this
> > funny state disappears after some time.
> >
> > An idle CPU is of course running the idle thread, so in that sense
> > this isn't super-strange. But it does indicate there is some kind of
> > accounting issue. I have a feeling this happens before any processes
> > have been scheduled on these CPUs. But I've never found the problem...
>
> I can easily reproduce this on a 2 socket machine with 4 cores each.
> Hyper threads are turned off in BIOS.
>
> schedcpu() does not account the run time if p->p_slptime > 1. So
> fresh idle threads have a CPU percentage of 99%.
First note that `p_slptime' is supposed to be reset before/after calling
mi_switch(). However, the idle thread isn't doing that 8)
Until a thread is scheduled on a secondary CPU, mi_switch() will always
select the Idle thread and set `p_stat' to SONPROC. This causes
schedcpu() to compute `p_pctcpu' for Idle threads until real threads
get scheduled. It keeps happening until schedcpu() has incremented
`p_slptime' twice for every Idle thread, and for that schedcpu() needs
to look at `p_stat' while the Idle thread isn't running. Since the
timer is set to run every second, it may take some time for this to
happen :)
That's why bluhm@'s diff works.
> There are still things I don't understand. After a while the CPU
> time for idle5, idle6, idle7 does not increase anymore. I am doing
> iperf3 performance tests on this machine. My patch makes the results
> more unsteady and throughput lower. It seems that iperf3 processes
> get scheduled on CPUs with less memory affinity.
What do you mean by CPU time?
Since you're doing tests, could you try the diff below? It stops
recalculating the priority of Idle threads. The first effect is less
contention on SCHED_LOCK(), and it also implicitly documents the
current fun :)
I left an "#if 0" below. You could try enabling it. Does it change
anything? I hope not.
Index: kern/kern_clock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_clock.c,v
retrieving revision 1.97
diff -u -p -r1.97 kern_clock.c
--- kern/kern_clock.c 17 Oct 2018 12:25:38 -0000 1.97
+++ kern/kern_clock.c 22 Jan 2019 13:07:25 -0000
@@ -400,8 +400,7 @@ statclock(struct clockframe *frame)
* ~~16 Hz is best
*/
if (schedhz == 0) {
- if ((++curcpu()->ci_schedstate.spc_schedticks & 3) ==
- 0)
+ if ((++spc->spc_schedticks & 3) == 0)
schedclock(p);
}
}
Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.48
diff -u -p -r1.48 sched_bsd.c
--- kern/sched_bsd.c 6 Jan 2019 12:59:45 -0000 1.48
+++ kern/sched_bsd.c 22 Jan 2019 14:15:28 -0000
@@ -218,6 +218,13 @@ schedcpu(void *arg)
LIST_FOREACH(p, &allproc, p_list) {
/*
+ * Idle threads are never placed in runqueue, therefore
+ * computing their priority is pointless.
+ */
+ if (p->p_cpu != NULL &&
+ p->p_cpu->ci_schedstate.spc_idleproc == p)
+ continue;
+ /*
* Increment sleep time (if sleeping). We ignore overflow.
*/
if (p->p_stat == SSLEEP || p->p_stat == SSTOP)
@@ -528,7 +535,17 @@ resetpriority(struct proc *p)
void
schedclock(struct proc *p)
{
+ struct cpu_info *ci = curcpu();
+ struct schedstate_percpu *spc = &ci->ci_schedstate;
int s;
+
+ if (p == spc->spc_idleproc) {
+#if 0
+ if (spc->spc_nrun)
+ need_resched(ci);
+#endif
+ return;
+ }
SCHED_LOCK(s);
p->p_estcpu = ESTCPULIM(p->p_estcpu + 1);