...
I now have more experience with ULE. A version built today gave
dramatically worse interactivity, so much so that I think it must have
been broken recently. A simple shell loop hangs the rest of the system
in some cases, and a background build has similar bad effects, probably
limited mainly by useful loops not being endless.
I'm not able to reproduce this and no one else has reported it.
This always happens with hz = 100. Reducing preempt_thresh to below
about 50 mostly fixes the problem, and reducing the threshold to 0
fixes the problem a bit more. The shell loop processes still take too
long to start up (often several seconds for just 20), but the second
process starts within a second, instead of showing signs of taking
forever to start up. Apparently, in the broken case, an IPI to stop
the first process is never delivered. ^Z works to stop the whole
process group, and then two %'s usually result in proceeding to
the next process. Having to use two %'s is strange but may be just
a shell bug.
-current with 4BSD also takes too long to start all the processes,
while ~5.2 restarts them all apparently instantly.  In fact it starts
them too fast and runs into the old exec resource shortage bug after
16 processes, and 3 or 4 of the starts fail in exec.
With hz = 1000 and ULE, the default preempt_thresh of 64 works but
reducing it to 0 works better. Startup is still too slow.
Apparently there is a scaling bug for hz, or else the extra interrupts from
the larger hz help, and the default preempt_thresh is not best.
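For reference, the two knobs involved are a boot-time tunable and a
runtime sysctl, so the configurations above can be reproduced with
something like this (values as used in the tests):
%%%
# hz is a loader tunable; set it in /boot/loader.conf and reboot:
echo 'kern.hz="100"' >> /boot/loader.conf

# preempt_thresh can be changed on a running ULE kernel:
sysctl kern.sched.preempt_thresh=0
%%%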
I saw this behaviour for 2 different kernels:
- SMP kernel (all this is running on an A64 UP in i386 mode) built on
Aug 5. Timer interrupts were via the APIC. hz was set to 100 at
boot time. stathz was always 100 and in perfect sync with hz.
(Plain current with APIC timer interrupts gives a broken stathz of
13 when hz is 100, and stathz in bogus sync with hz.)
- UP kernel built today. Timer interrupts were via the i8254 and the
RTC. hz was set to 100 or 1000 at boot time. stathz was always
128. The different interrupt configuration and timing (except for
increasing hz for ULE) made little difference.
The SMP kernel got a bit further in the shell loop startup when hz = 100
but otherwise behaved similarly.
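For anyone checking their own configuration, the hz and stathz that a
kernel actually ends up with are visible at run time via kern.clockrate:
%%%
sysctl kern.clockrate	# reports hz, tick, profhz and stathz
%%%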
This may be the result of some incompatibility between bdebsd and ULE.
Nah, I don't use ULE in bdebsd (except all userland is bdebsd), and
I don't touch schedulers in -current (I mainly touch filesystems and
network drivers). Current kernels are remarkably compatible with
old userlands.
Is this a SMP machine? Do you have PREEMPTION enabled? ULE recently
started honoring preemption. Try setting:
See above.  I always use PREEMPTION for UP, since without it problems like the
above are almost to be expected. I think 5.2 has them. ~5.2 preempts
a lot as a side effect of switching context for clock interrupt handlers
and then (without the queueing hack) rescheduling on switching back.
kern.sched.preempt_thresh: 64
But this setting is part of the problem.
if it is not already. I know you deal with hardclock differently. Without
PREEMPTION it may not work correctly.
No, the difference for hardclock is not in ULE kernels.
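(For completeness: PREEMPTION is a kernel config option rather than a
tunable, so turning it off means building another kernel; in the config
file it is just the line below.)
%%%
options 	PREEMPTION		# enable kernel thread preemption
%%%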
First I tried an old regression test for nice[1-2]:
%%%
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
do
nice -$i sh -c "while :; do echo -n;done" &
done
top -o time
%%%
I use this:
for i in -20 -16 -12 -8 -4 0 4 8 12 16 20
do
nice -$i sh -c "while :; do echo -n;done" &
done
top -o time
I like to verify that the distribution doesn't get out of whack. It takes
Then non-multiple-of-4 entries in my list are almost useless. I mostly
use the [0-20] list because it is in the first file in a test directory
and doesn't have any negative values so it doesn't need privilege to run.
some time to settle before the higher nice threads get enough runtime to
sort properly.  My results are as follows:
The settling time/inertia is both a bug and a feature. It's good to have
inertia for long-running processes, but makeworld can start several hundred
processes per second and finish many of them, so there is nowhere near
enough settling time for these processes and their behaviour is hard to
predict.
  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  868 root        1  81  -20  3492K  1404K RUN      0:28 23.58% sh
  869 root        1  83  -16  3492K  1404K RUN      0:20 15.09% sh
  870 root        1  86  -12  3492K  1404K RUN      0:16 12.16% sh
  871 root        1  90   -8  3492K  1404K RUN      0:12  8.89% sh
  872 root        1  93   -4  3492K  1404K RUN      0:11  7.96% sh
  873 root        1  97    0  3492K  1404K RUN      0:09  6.59% sh
  874 root        1 101    4  3492K  1404K RUN      0:08  4.88% sh
  875 root        1 105    8  3492K  1404K RUN      0:07  5.37% sh
  876 root        1 109   12  3492K  1404K RUN      0:06  3.37% sh
  877 root        1 113   16  3492K  1404K RUN      0:06  4.05% sh
  878 root        1 116   20  3492K  1404K RUN      0:05  3.96% sh
Really might not be enough difference with positive nice values. I've
never really had a good feeling about how nice should really behave but
this mostly seems reasonable. It would be possible to tweak the algorithm
to further penalize nice.
I still use a table-driven algorithm with weights 2**(nice_value/4). This
gives a dynamic range of a factor of 1024.
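Just to illustrate the shape of such a table (this is only the plain
2**(nice/4) weights, not bdebsd's actual table), note that the extremes
differ by 2**5 / 2**-5 = 2**10 = 1024:
%%%
awk 'BEGIN { for (n = -20; n <= 20; n += 4) printf "%4d %9.5f\n", n, 2^(n/4) }'
%%%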
This hung after starting only about one of the shell processes. After
cutting the list down to just one process with nice -20, it still hung.
Shells on other syscons terminals running at rtprio 0 could not compete
with the nice -20 process:
- they could not start top to look at what was happening
- an already-running top could not display anything new
- they could not start killall.
With the list cut down to about 6 processes, ps in ddb showed evidence of
all the processes starting, and I was able to kill them all using
kill in ddb.
Fixed using larger hz and/or smaller preempt_thresh; ddb wasn't necessary
since ^Z worked (if I hit it before ^C?) -- see above.
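For reference, a shell at rtprio 0, like the ones on the other syscons
consoles in the hang test above, can be started with rtprio(1):
%%%
rtprio 0 sh		# needs root; runs sh at realtime priority 0
%%%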
[hz = 100 case not so bad]
Other strange behaviour with preempt_thresh = 64, at least with hz = 100:
start two identical CPU hogs, each with a runtime of 2.5 seconds, on
separate consoles. Then one is given 100% of the CPU until it completes,
and it is always the second one started that gets 100% CPU first. Thus
the first one started takes about 5.0 seconds to complete and the second
one started takes about 2.5 seconds to complete.
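A minimal way to reproduce the two-hog test is to run the same fixed-work
loop on each of two consoles (the count is arbitrary and has to be
adjusted so that one instance alone takes about 2.5 seconds):
%%%
time sh -c 'i=0; while [ $i -lt 500000 ]; do i=$((i+1)); done'
%%%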
Running makeworld with just -j4 in the background gives similar symptoms.
When a new process is started, it sometimes gets too many cycles to
begin with, and apparently completely stops all processes in the
makeworld (but not the top displaying things) for several seconds.
After a while (I guess when the interactivity score decreases), this
behaviour changes to giving the new process very few cycles even if
it is semi-interactive (a foreground process started from a shell).
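The background load can be approximated with a plain parallel buildworld
(assuming make -j4 buildworld from a stock /usr/src is close enough to
makeworld here), watched from another console:
%%%
cd /usr/src && make -j4 buildworld > /tmp/mw.log 2>&1 &
top -o time
%%%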
~5.2 behaves similarly, but I think a little better. In ~5.2 (and
maybe in all schedulers), the initial priority is just a function of
the parent's priority (I use a simple function that might be slightly
different from 5.2. I forget what it is). If neither the parent nor
the child runs for long, then new processes tend to get almost all the
CPU until they run for too long. When the children exit, the parent
inherits some priority according to another simple function. ~5.2
works best here since it uses better functions than 5.2 does (much
better than the exponential functions in 4.x), and it keeps track of
history better than ULE can.
I tested this mainly using:
time /tmp/q1 & time /tmp/q1 & acroread *pdf   # type ^q to exit acroread
where /tmp/q1 measures latency by calling clock_gettime() in a loop and
there are 12 pdf files of total size 4.75MB. acroread is sufficiently
bloated and hoggish to have very bad behaviour here. The results when
this is run on an xterm that has initially been idle for some time (or
is in some more magic state for ULE interactivity?) at loadavg 20 are
approximately:
all: acroread starts fast for the first few runs (would be ~1
second with no load; this only increases by a second or two)
/tmp/q1 runs for ~2.5 seconds self time and shows low max
latency (would be ~ 200 usec with no load; this increases to
~10 msec; both high variance)
~5.2-4BSD: after a few runs, the parent priority becomes near the
max so further runs take 5-10 seconds to start. 20 seconds at
a load avg of 20 would be fairer, but the parent priority
doesn't get as near the max as the background hogs' priorities.
After a few runs, max latency is usually 100-500 msec and was
once 2 seconds.
Latency in mouse movements is not noticeable.
current-4BSD: further runs don't take much longer to start.
Apparently the parent doesn't inherit enough priority.
(In 4.2 it inherited far too much.)
After a few runs, max latency is usually 1-2 seconds and was
once 27 seconds.
The latency of 1-2 seconds is often noticeable for mouse movements and
even for echo in xterms.
current-ULE: further runs sometimes take _much_ longer, a minute
or so, and there is a high variance in the length.
After a few runs, max latency is usually a few hundred msec
larger than for ~5.2.
Latency in mouse movements is not noticeable.
In at least this phase, ^C to kill processes doesn't work, but ^Z to
suspend them and then kill from the shell works normally, and
interactivity in not-very-bloated mail programs and editors is very bad.
A ^C fails only in the phase where hz is small, preempt_thresh is larger,
and (?) the parent hasn't gained much priority and/or (negative?)
interactivity.
Other behaviour with 4BSD schedulers and various kernels:
- the max scheduling delay is almost independent of the CPU speed.
This may be because it is just a function of the priorities which are
mainly a function of the algorithm.
- the max scheduling delay is slightly worse for -current with 4BSD
than with my ~5.2.
Actually, it is much worse.
- -current has anomalous behaviour relative to ~5.2 for background
makeworld -j16: many fewer runnable processes, a much smaller max
load average, and many more zombies visible when top looks.
This may be related to the slow startup of the shell loops and caused by
the priority inheritance for fork/exit.
- [queue hack]
...
essentially roundrobin scheduling under loads that generate lots
of interrupts. Interactivity is still poor because makeworld
sometimes generates a few hundred processes per second and cycling
through that many takes a long time even with a tiny quantum.
makeworld actually generates remarkably few interrupts when run on
disk file systems (an average of only about 30 non-clock interrupts per
second in my config).
- reducing kern.sched.quantum never had much effect. Same for
increasing HZ in -current with 4BSD.
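For reference, the quantum is a runtime sysctl under SCHED_4BSD, so
reducing it looks like this (the value is assumed to be in microseconds
here):
%%%
sysctl kern.sched.quantum		# show the current quantum
sysctl kern.sched.quantum=20000		# try a much smaller one
%%%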
Bruce