more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler

Peter Sat, 07 Apr 2018 08:17:12 -0700

Hi all,
 in the meantime I did some tests and found the following:



A. The Problem:
---------------
On a single CPU, there are -exactly- two processes runnable:

One is doing mostly compute without I/O - this can be a compressing jobor similar; in the tests I used simply an endless-loop. Lets call thisthe "piglet".

The other is doing frequent file reads, but also some compute interim -

this can be a backup job traversing the FS, or a postgres VACUUM, orsome fast compressor like lz4. Lets call this the "worker".

It then happens that the piglet gets 99% CPU, while the worker gets only0.5% CPU and makes nearly no progress at all.

Investigations shows that the worker makes precisely one I/O pertimeslice (timeslice as defined in kern.sched.quantum) - or two I/O on amirrored ZFS.



B. Findings:
------------
1. Filesystem

I could never reproduce this when reading from plain UFS. Only whenreading from ZFS (direct or via l2arc).


2. Machine

The problem originally appeared on a pentium3@1GHz. I was able toreproduce it on an i5-3570T, given the following measures:

 * config in BIOS to use only one CPU
 * reduce speed: "dev.cpu.0.freq=200"

I did see the problem also when running full speed (which means ithappens there also), but could not reproduce it well.


3. kern.sched.preempt_thresh

I could make the problem disappear by changing kern.sched.preempt_threshfrom the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. Thisseems to correspond to the disk interrupt threads, which run at intr:12(i5-3570T) or intr:8 (p3).


4. dynamic behaviour

Here the piglet is already running as PID=2119. Then we can watch thedynamic behaviour as follows (on i5-3570T@200MHz):


a. with kern.sched.preempt_thresh=80

$ lz4 DATABASE_TEST_FILE /dev/null & while true;do ps -o pid,pri,"%cpu",command -p 2119,$!sleep 3done[1] 6073PID PRI %CPU COMMAND6073 20 0.0 lz4 DATABASE_TEST_FILE /dev/null2119 100 91.0 -bash (bash)PID PRI %CPU COMMAND6073 76 15.0 lz4 DATABASE_TEST_FILE /dev/null2119 95 74.5 -bash (bash)PID PRI %CPU COMMAND6073 52 19.0 lz4 DATABASE_TEST_FILE /dev/null2119 94 71.5 -bash (bash)PID PRI %CPU COMMAND6073 52 16.0 lz4 DATABASE_TEST_FILE /dev/null2119 95 76.5 -bash (bash)PID PRI %CPU COMMAND6073 52 14.0 lz4 DATABASE_TEST_FILE /dev/null2119 96 80.0 -bash (bash)PID PRI %CPU COMMAND6073 52 12.5 lz4 DATABASE_TEST_FILE /dev/null2119 96 82.5 -bash (bash)PID PRI %CPU COMMAND6073 74 10.0 lz4 DATABASE_TEST_FILE /dev/null2119 98 86.5 -bash (bash)PID PRI %CPU COMMAND6073 52 8.0 lz4 DATABASE_TEST_FILE /dev/null2119 98 89.0 -bash (bash)PID PRI %CPU COMMAND6073 52 7.0 lz4 DATABASE_TEST_FILE /dev/null2119 98 90.5 -bash (bash)PID PRI %CPU COMMAND6073 52 6.5 lz4 DATABASE_TEST_FILE /dev/null

2119  99 91.5 -bash (bash)

b. with kern.sched.preempt_thresh=11

PID PRI %CPU COMMAND4920 21 0.0 lz4 DATABASE_TEST_FILE /dev/null2119 101 93.5 -bash (bash)PID PRI %CPU COMMAND4920 78 20.0 lz4 DATABASE_TEST_FILE /dev/null2119 94 70.5 -bash (bash)PID PRI %CPU COMMAND4920 82 34.5 lz4 DATABASE_TEST_FILE /dev/null2119 88 54.0 -bash (bash)PID PRI %CPU COMMAND4920 85 42.5 lz4 DATABASE_TEST_FILE /dev/null2119 86 45.0 -bash (bash)PID PRI %CPU COMMAND4920 85 43.5 lz4 DATABASE_TEST_FILE /dev/null2119 86 44.5 -bash (bash)PID PRI %CPU COMMAND4920 85 43.0 lz4 DATABASE_TEST_FILE /dev/null2119 85 45.0 -bash (bash)PID PRI %CPU COMMAND4920 85 43.0 lz4 DATABASE_TEST_FILE /dev/null

2119  85 45.5 -bash (bash)

From this we can see that in case b. both processes balance out nicelyand meet at equal CPU shares.Whereas in case a., after about 10 Seconds (the first 3 records) theymove to opposite ends of the scale and stay there.

From this I might suppose that here is some kind of mis-calculation ormis-adjustment of the task priorities happening.


P.
_______________________________________________
[email protected] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"

more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler

Reply via email to