The patch titled
CFS scheduler: v14
has been added to the -mm tree. Its filename is
cfs-scheduler-v14-rc2-mm1.patch
*** Remember to use Documentation/SubmitChecklist when testing your code ***
See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this
------------------------------------------------------
Subject: CFS scheduler: v14
From: Ingo Molnar <[EMAIL PROTECTED]>
CFS -v14:
- increase sleeper-fairness (Mike Galbraith, me)
- CFS documentation fixes (Pranith Kumar D)
- increased the default rescheduling granularity to 3msecs on UP,
6 msecs on 2-way systems
- small update_curr() precision fix
- added an overview section to Documentation/sched-design-CFS.txt
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---
Documentation/sched-design-CFS.txt | 98 +++++++++++++++------------
kernel/sched_fair.c | 73 +++++++++++++++++---
2 files changed, 117 insertions(+), 54 deletions(-)
diff -puN Documentation/sched-design-CFS.txt~cfs-scheduler-v14-rc2-mm1
Documentation/sched-design-CFS.txt
--- a/Documentation/sched-design-CFS.txt~cfs-scheduler-v14-rc2-mm1
+++ a/Documentation/sched-design-CFS.txt
@@ -1,20 +1,59 @@
-[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
-i'm pleased to announce the first release of the "Modular Scheduler Core
-and Completely Fair Scheduler [CFS]" patchset:
+This is the CFS scheduler.
- http://redhat.com/~mingo/cfs-scheduler/
+80% of CFS's design can be summed up in a single sentence: CFS basically
+models an "ideal, precise multi-tasking CPU" on real hardware.
-This project is a complete rewrite of the Linux task scheduler. My goal
-is to address various feature requests and to fix deficiencies in the
-vanilla scheduler that were suggested/found in the past few years, both
-for desktop scheduling and for server scheduling workloads.
+"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
+physical power and which can run each task at precise equal speed, in
+parallel, each at 1/nr_running speed. For example: if there are 2 tasks
+running then it runs each at 50% physical power - totally in parallel.
+
+On real hardware, we can run only a single task at once, so while that
+one task runs, the other tasks that are waiting for the CPU are at a
+disadvantage - the current task gets an unfair amount of CPU time. In
+CFS this fairness imbalance is expressed and tracked via the per-task
+p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
+time the task should now run on the CPU for it to become completely fair
+and balanced.
+
+( small detail: on 'ideal' hardware, the p->wait_runtime value would
+ always be zero - no task would ever get 'out of balance' from the
+ 'ideal' share of CPU time. )
+
+CFS's task picking logic is based on this p->wait_runtime value and it
+is thus very simple: it always tries to run the task with the largest
+p->wait_runtime value. In other words, CFS tries to run the task with
+the 'gravest need' for more CPU time. So CFS always tries to split up
+CPU time between runnable tasks as close to 'ideal multitasking
+hardware' as possible.
+
+Most of the rest of CFS's design just falls out of this really simple
+concept, with a few add-on embellishments like nice levels,
+multiprocessing and various algorithm variants to recognize sleepers.
+
+In practice it works like this: the system runs a task a bit, and when
+the task schedules (or a scheduler tick happens) the task's CPU usage is
+'accounted for': the (small) time it just spent using the physical CPU
+is deducted from p->wait_runtime. [minus the 'fair share' it would have
+gotten anyway]. Once p->wait_runtime gets low enough so that another
+task becomes the 'leftmost task' of the time-ordered rbtree it maintains
+(plus a small amount of 'granularity' distance relative to the leftmost
+task so that we do not over-schedule tasks and trash the cache) then the
+new leftmost task is picked and the current task is preempted.
+
+The rq->fair_clock value tracks the 'CPU time a runnable task would have
+fairly gotten, had it been runnable during that time'. So by using
+rq->fair_clock values we can accurately timestamp and measure the
+'expected CPU time' a task should have gotten. All runnable tasks are
+sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
+CFS picks the 'leftmost' task and sticks to it. As the system progresses
+forwards, newly woken tasks are put into the tree more and more to the
+right - slowly but surely giving a chance for every task to become the
+'leftmost task' and thus get on the CPU within a deterministic amount of
+time.
-[ QuickStart: apply the patch, recompile, reboot. The new scheduler
- will be active by default and all tasks will default to the
- SCHED_NORMAL interactive scheduling class. ]
-
-Highlights are:
+Some implementation details:
- the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
@@ -25,7 +64,7 @@ Highlights are:
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.
- i'd like to give credit to Con Kolivas for the general approach here:
+ I'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!
@@ -53,7 +92,7 @@ Highlights are:
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.
- due to its design, the CFS scheduler is not prone to any of the
+ Due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
@@ -63,7 +102,7 @@ Highlights are:
SCHED_BATCH: both types of workloads should be isolated much more
agressively than under the vanilla scheduler.
- ( another rdetail: due to nanosec accounting and timeline sorting,
+ ( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )
@@ -78,30 +117,3 @@ Highlights are:
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
-the core scheduler got smaller by more than 700 lines:
-
- kernel/sched.c | 1454
++++++++++++++++------------------------------------------------
- 1 file changed, 372 insertions(+), 1082 deletions(-)
-
-and even adding all the scheduling modules, the total size impact is
-relatively small:
-
- 18 files changed, 1454 insertions(+), 1133 deletions(-)
-
-most of the increase is due to extensive comments. The kernel size
-impact is in fact a small negative:
-
- text data bss dec hex filename
- 23366 4001 24 27391 6aff kernel/sched.o.vanilla
- 24159 2705 56 26920 6928 kernel/sched.o.CFS
-
-(this is mainly due to the benefit of getting rid of the expired array
-and its data structure overhead.)
-
-thanks go to Thomas Gleixner and Arjan van de Ven for review of this
-patchset.
-
-as usual, any sort of feedback, bugreports, fixes and suggestions are
-more than welcome,
-
- Ingo
diff -puN kernel/sched_fair.c~cfs-scheduler-v14-rc2-mm1 kernel/sched_fair.c
--- a/kernel/sched_fair.c~cfs-scheduler-v14-rc2-mm1
+++ a/kernel/sched_fair.c
@@ -16,11 +16,11 @@
* number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
* systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
*/
-unsigned int sysctl_sched_granularity __read_mostly = 1000000000/HZ;
+unsigned int sysctl_sched_granularity __read_mostly = 3000000000ULL/HZ;
/*
* Wake-up granularity.
- * (default: 1 msec, units: nanoseconds)
+ * (default: 0, units: nanoseconds)
*
* This option delays the preemption effects of decoupled workloads
* and reduces their over-scheduling. Synchronous workloads will still
@@ -28,9 +28,9 @@ unsigned int sysctl_sched_granularity __
*/
unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0;
-unsigned int sysctl_sched_runtime_limit __read_mostly = 2000000000/HZ;
+unsigned int sysctl_sched_runtime_limit __read_mostly = 6000000000ULL/HZ;
-unsigned int sysctl_sched_load_smoothing __read_mostly = 0;
+unsigned int sysctl_sched_load_smoothing __read_mostly = 8 | 16 | 32;
/*
* sys_sched_yield unfairness bug workaround switch.
@@ -192,8 +192,8 @@ static inline void update_curr(struct rq
u64 delta_exec, delta_fair, delta_mine;
struct task_struct *curr = rq->curr;
- if (curr->sched_class != &fair_sched_class || curr == rq->idle
- || !curr->on_rq)
+ if (curr->sched_class != &fair_sched_class ||
+ !rq->raw_weighted_load || curr == rq->idle)
return;
/*
* Get the amount of time the current task was running
@@ -384,6 +384,50 @@ update_stats_curr_end(struct rq *rq, str
p->exec_start = 0;
}
+long div64_s(s64 divident, unsigned long divisor)
+{
+ u64 tmp;
+
+ if (divident < 0) {
+ tmp = -divident;
+ do_div(tmp, divisor);
+ return -(s64)tmp;
+ } else {
+ tmp = divident;
+ do_div(tmp, divisor);
+ return (s64)tmp;
+ }
+}
+
+/*
+ * A task gets added back to the runnable tasks and gets
+ * a small credit for the CPU time it missed out on while
+ * it slept, so fix up all other runnable task's wait_runtime
+ * so that the sum stays constant (around 0).
+ *
+ * Instead of looping over all runnable tasks in an O(N)
+ * manner we move the fair clock back by a proportional
+ * amount of the new wait_runtime this task adds to the pool.
+ */
+static void distribute_fair_add(struct rq *rq, s64 delta)
+{
+ struct task_struct *curr = rq->curr;
+ s64 delta_fair = 0;
+
+ if (!(sysctl_sched_load_smoothing & 32))
+ return;
+
+ if (rq->nr_running) {
+ delta_fair = div64_s(delta, rq->nr_running);
+ /*
+ * The currently running task's next wait_runtime value does
+ * not depend on the fair_clock, so fix it up explicitly:
+ */
+ add_wait_runtime(rq, curr, -delta_fair);
+ rq->fair_clock -= delta_fair;
+ }
+}
+
/**************************************************************/
/* Scheduling class queueing methods:
*/
@@ -391,7 +435,7 @@ update_stats_curr_end(struct rq *rq, str
static void enqueue_sleeper(struct rq *rq, struct task_struct *p)
{
unsigned long load = get_rq_load(rq);
- u64 delta_fair = 0;
+ s64 delta_fair, prev_runtime;
if (!(sysctl_sched_load_smoothing & 16))
goto out;
@@ -404,12 +448,19 @@ static void enqueue_sleeper(struct rq *r
* Fix up delta_fair with the effect of us running
* during the whole sleep period:
*/
- if (sysctl_sched_load_smoothing & 8) {
- delta_fair = delta_fair * load;
- do_div(delta_fair, load + p->load_weight);
- }
+ if (sysctl_sched_load_smoothing & 8)
+ delta_fair = div64_s(delta_fair * load, load + p->load_weight);
+ prev_runtime = p->wait_runtime;
__add_wait_runtime(rq, p, delta_fair);
+ delta_fair = p->wait_runtime - prev_runtime;
+
+ /*
+ * We move the fair clock back by a load-proportional
+ * amount of the new wait_runtime this task adds to
+ * the 'pool':
+ */
+ distribute_fair_add(rq, delta_fair);
out:
rq->wait_runtime += p->wait_runtime;
_
Patches currently in -mm which might be from [EMAIL PROTECTED] are
git-kvm.patch
prevent-going-idle-with-softirq-pending.patch
fix-crash-with-irqpoll-due-to-the-irqf_irqpoll-flag.patch
only-allow-nonlinear-vmas-for-ram-backed-filesystems.patch
cpuset-remove-sched-domain-hooks-from-cpusets.patch
introduce-write_trylock_irqsave.patch
use-write_trylock_irqsave-in-ptrace_attach.patch
fix-stop_machine_run-problem-with-naughty-real-time-process.patch
cpu-hotplug-fix-ksoftirqd-termination-on-cpu-hotplug-with-naughty-realtime-process.patch
cpu-hotplug-fix-ksoftirqd-termination-on-cpu-hotplug-with-naughty-realtime-process-fix.patch
pie-randomization.patch
cfs-scheduler.patch
cfs-scheduler-vs-detach-schedh-from-mmh.patch
cfs-scheduler-v14-rc2-mm1.patch
cfs-scheduler-warning-fixes.patch
sched-add-above-background-load-function.patch
mm-implement-swap-prefetching.patch
detect-atomic-counter-underflows.patch
make-frame_pointer-default=y.patch
mutex-subsystem-synchro-test-module.patch
vdso-print-fatal-signals.patch
lockdep-show-held-locks-when-showing-a-stackdump.patch
kmap_atomic-debugging.patch
random-warning-squishes.patch
-
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html