...
I now have more experience with ULE. A version built today gave
dramatically worse interactivity, so much so that I think it must have
been broken recently. A simple shell loop hangs the rest of the system
in some cases, and a background build has similar bad effects, probably
limited mainly by useful loops not being endless.
I'm not able to reproduce this and no one else has reported it.
This always happens with hz = 100. Reducing preempt_thresh to below
about 50 mostly fixes the problem, and reducing the threshold to 0
fixes the problem a bit more. The shell loop processes still take too
long to start up (often several seconds for just 20), but the second
process starts within a second, instead of showing signs of taking
forever to start up. Apparently, in the broken case, an IPI to stop
the first process is never delivered. ^Z works to stop the whole
process group, and then two %'s usually result in proceeding to
the next process. Having to use two %'s is strange but may be just
a shell bug.
-current with 4BSD also takes too long to start all the processes,
while ~5.2 restarts them all apparently instantly.  In fact it starts
them too fast and runs into the old exec resource shortage bug after
16 processes, and 3 or 4 of the starts fail in exec.
With hz = 1000 and ULE, the default preempt_thresh of 64 works but
reducing it to 0 works better. Startup is still too slow.
Apparently there is a scaling bug for hz, or else the extra interrupts from
the larger hz help, and the default preempt_thresh is not best.
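For reference, the two knobs involved are a boot-time tunable and a
runtime sysctl, so the configurations above can be reproduced with
something like this (values as used in the tests):
%%%
# hz is a loader tunable; set it in /boot/loader.conf and reboot:
echo 'kern.hz="100"' >> /boot/loader.conf

# preempt_thresh can be changed on a running ULE kernel:
sysctl kern.sched.preempt_thresh=0
%%%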
I saw this behaviour for 2 different kernels:
- SMP kernel (all this is running on an A64 UP in i386 mode) built on
Aug 5. Timer interrupts were via the APIC. hz was set to 100 at
boot time. stathz was always 100 and in perfect sync with hz.
(Plain current with APIC timer interrupts gives a broken stathz of
13 when hz is 100, and stathz in bogus sync with hz.)
- UP kernel built today. Timer interrupts were via the i8254 and the
RTC. hz was set to 100 or 1000 at boot time. stathz was always
128. The different interrupt configuration and timing (except for
increasing hz for ULE) made little difference.
The SMP kernel got a bit further in the shell loop startup when hz = 100
but otherwise behaved similarly.
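For anyone checking their own configuration, the hz and stathz that a
kernel actually ends up with are visible at run time via kern.clockrate:
%%%
sysctl kern.clockrate	# reports hz, tick, profhz and stathz
%%%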
This may be the result of some incompatibility between bdebsd and ULE.
Nah, I don't use ULE in bdebsd (except all userland is bdebsd), and
I don't touch schedulers in -current (I mainly touch filesystems and
network drivers). Current kernels are remarkably compatible with
old userlands.
Is this a SMP machine? Do you have PREEMPTION enabled? ULE recently
started honoring preemption. Try setting:
See above.  I always use PREEMPTION for UP, since without it problems like the
above are almost to be expected. I think 5.2 has them. ~5.2 preempts
a lot as a side effect of switching context for clock interrupt handlers
and then (without the queueing hack) rescheduling on switching back.
kern.sched.preempt_thresh: 64
But this setting is part of the problem.
if it is not already. I know you deal with hardclock differently. Without
PREEMPTION it may not work correctly.
No, the difference for hardclock is not in ULE kernels.
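(For completeness: PREEMPTION is a kernel config option rather than a
tunable, so turning it off means building another kernel; in the config
file it is just the line below.)
%%%
options 	PREEMPTION		# enable kernel thread preemption
%%%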
First I tried an old regression test for nice[1-2]:
%%%
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
do
nice -$i sh -c "while :; do echo -n;done" &
done
top -o time
%%%
I use this:
for i in -20 -16 -12 -8 -4 0 4 8 12 16 20
do
nice -$i sh -c "while :; do echo -n;done" &
done
top -o time
I like to verify that the distribution doesn't get out of whack. It takes
Then non-multiple-of-4 entries in my list are almost useless. I mostly
use the [0-20] list because it is in the first file in a test directory
and doesn't have any negative values so it doesn't need privilege to run.
some time to settle before the higher nice threads get enough runtime to
sort properly.  My results are as follows:
The settling time/inertia is both a bug and a feature. It's good to have
inertia for long-running processes, but makeworld can start several hundred
processes per second and finish many of them, so there is nowhere near
enough settling time for these processes and their behaviour is hard to
predict.
  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  868 root        1  81  -20  3492K  1404K RUN      0:28 23.58% sh
  869 root        1  83  -16  3492K  1404K RUN      0:20 15.09% sh
  870 root        1  86  -12  3492K  1404K RUN      0:16 12.16% sh
  871 root        1  90   -8  3492K  1404K RUN      0:12  8.89% sh
  872 root        1  93   -4  3492K  1404K RUN      0:11  7.96% sh
  873 root        1  97    0  3492K  1404K RUN      0:09  6.59% sh
  874 root        1 101    4  3492K  1404K RUN      0:08  4.88% sh
  875 root        1 105    8  3492K  1404K RUN      0:07  5.37% sh
  876 root        1 109   12  3492K  1404K RUN      0:06  3.37% sh
  877 root        1 113   16  3492K  1404K RUN      0:06  4.05% sh
  878 root        1 116   20  3492K  1404K RUN      0:05  3.96% sh
Really might not be enough difference with positive nice values. I've
never really had a good feeling about how nice should really behave but
this mostly seems reasonable. It would be possible to tweak the algorithm
to further penalize nice.
I still use a table-driven algorithm with weights 2**(nice_value/4). This
gives a dynamic range of a factor of 1024.
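Just to illustrate the shape of such a table (this is only the plain
2**(nice/4) weights, not bdebsd's actual table), note that the extremes
differ by 2**5 / 2**-5 = 2**10 = 1024:
%%%
awk 'BEGIN { for (n = -20; n <= 20; n += 4) printf "%4d %9.5f\n", n, 2^(n/4) }'
%%%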
This hung after starting only about one of the shell processes. After
cutting the list down to just one process with nice -20, it still hung.
Shells on other syscons terminals running at rtprio 0 could not compete
with the nice -20 process:
- they could not start top to look at what was happening
- an already-running top could not display anything new
- they could not start killall.
With the list cut down to about 6 processes, ps in ddb showed evidence of
all the processes starting, and I was able to kill them all using
kill in ddb.
Fixed using larger hz and/or smaller preempt_thresh; ddb wasn't necessary
since ^Z worked (if I hit it before ^C?) -- see above.
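For reference, a shell at rtprio 0, like the ones on the other syscons
consoles in the hang test above, can be started with rtprio(1):
%%%
rtprio 0 sh		# needs root; runs sh at realtime priority 0
%%%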
[hz = 100 case not so bad]
Other strange behaviour with preempt_thresh = 64, at least with hz = 100:
start two identical CPU hogs, each with a runtime of 2.5 seconds, on
separate consoles. Then one is given 100% of the CPU until it completes,
and it is always the second one started that gets 100% CPU first. Thus
the first one started takes about 5.0 seconds to complete and the second
one started takes about 2.5 seconds to complete.
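A minimal way to reproduce the two-hog test is to run the same fixed-work
loop on each of two consoles (the count is arbitrary and has to be
adjusted so that one instance alone takes about 2.5 seconds):
%%%
time sh -c 'i=0; while [ $i -lt 500000 ]; do i=$((i+1)); done'
%%%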
Running makeworld with just -j4 in the background gives similar symptoms.
When a new process is started, it sometimes gets too many cycles to
begin with, and apparently completely stops all processes in the
makeworld (but not the top displaying things) for several seconds.
After a while (I guess when the interactivity score decreases), this
behaviour changes to giving the new process very few cycles even if
it is semi-interactive (a foreground process started from a shell).
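The background load can be approximated with a plain parallel buildworld
(assuming make -j4 buildworld from a stock /usr/src is close enough to
makeworld here), watched from another console:
%%%
cd /usr/src && make -j4 buildworld > /tmp/mw.log 2>&1 &
top -o time
%%%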
~5.2 behaves similarly, but I think a little better. In ~5.2 (and
maybe in all schedulers), the initial priority is just a function of
the parent's priority (I use a simple function that might be slightly
different from 5.2. I forget what it is). If neither the parent nor
the child runs for long, then new processes tend to get almost all the
CPU until they run for too long. When the children exit, the parent
inherits some priority according to another simple function. ~5.2
works best here since it uses better functions than 5.2 does (much
better than the exponential functions in 4.x), and it keeps track of
history better than ULE can.
I tested this mainly using:
time /tmp/q1 & time /tmp/q1 & acroread *pdf   # type ^q to exit acroread
where /tmp/q1 measures latency by calling clock_gettime() in a loop and
there are 12 pdf files of total size 4.75MB. acroread is sufficiently
bloated and hoggish to have very bad behaviour here. The results when
this is run on an xterm that has initially been idle for some time (or
is in some more magic state for ULE interactivity?) at loadavg 20 are
approximately:
all: acroread starts fast for the first few runs (would be ~1
second with no load; this only increases by a second or two)
/tmp/q1 runs for ~2.5 seconds self time and shows low max
latency (would be ~ 200 usec with no load; this increases to
~10 msec; both high variance)
~5.2-4BSD: after a few runs, the parent priority becomes near the
max so further runs take 5-10 seconds to start. 20 seconds at
a load avg of 20 would be fairer, but the parent priority
doesn't get as near the max as the background hogs' priorities.
After a few runs, max latency is usually 100-500 msec and was
once 2 seconds.
Latency in mouse movements is not noticeable.
current-4BSD: further runs don't take much longer to start.
Apparently the parent doesn't inherit enough priority.
(In 4.2 it inherited far too much.)
After a few runs, max latency is usually 1-2 seconds and was
once 27 seconds.
The latency of 1-2 seconds is often noticeable for mouse movements and
even for echo in xterms.
current-ULE: further runs sometimes take _much_ longer, a minute
or so, and there is a high variance in the length.
After a few runs, max latency is usually a few hundred msec
larger than for ~5.2.
Latency in mouse movements is not noticeable.
In at least this phase, ^C to kill processes doesn't work, but ^Z to
suspend them and then kill from the shell works normally, and
interactivity in not-very-bloated mail programs and editors is very bad.
A ^C fails only in the phase where hz is small, preempt_thresh is larger,
and (?) the parent hasn't gained much priority and/or (negative?)
interactivity.
Other behaviour with 4BSD schedulers and various kernels:
- the max scheduling delay is almost independent of the CPU speed.
This may be because it is just a function of the priorities which are
mainly a function of the algorithm.
- the max scheduling delay is slightly worse for -current with 4BSD
than with my ~5.2.
Actually, it is much worse.
- -current has anomalous behaviour relative to ~5.2 for background
makeworld -j16: many fewer runnable processes, a much smaller max
load average, and many more zombies visible when top looks.
This may be related to the slow startup of the shell loops and caused by
the priority inheritance for fork/exit.
- [queue hack]
...
essentially roundrobin scheduling under loads that generate lots
of interrupts. Interactivity is still poor because makeworld
sometimes generates a few hundred processes per second and cycling
through that many takes a long time even with a tiny quantum.
makeworld actually generates remarkably few interrupts when run on
disk file systems (an average of only about 30 non-clock interrupts per
second in my config).
- reducing kern.sched.quantum never had much effect. Same for
increasing HZ in -current with 4BSD.
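For reference, the quantum is a runtime sysctl under SCHED_4BSD, so
reducing it looks like this (the value is assumed to be in microseconds
here):
%%%
sysctl kern.sched.quantum		# show the current quantum
sysctl kern.sched.quantum=20000		# try a much smaller one
%%%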
Bruce