Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=908a7c1b9b80d06708177432020c80d147754691
Commit:     908a7c1b9b80d06708177432020c80d147754691
Parent:     cd79007634854f9e936e2369890f2512f94b8759
Author:     Ken Chen <[EMAIL PROTECTED]>
AuthorDate: Wed Oct 17 16:55:11 2007 +0200
Committer:  Ingo Molnar <[EMAIL PROTECTED]>
CommitDate: Wed Oct 17 16:55:11 2007 +0200

    sched: fix improper load balance across sched domain
    
    We recently discovered a nasty performance bug in the kernel CPU load
    balancer that hit us with a 50% performance regression.
    
    When tasks are assigned, via cpu affinity, to a subset of CPUs that
    spans multiple sched_domains (either ccNUMA nodes or the new
    multi-core domain), the kernel fails to perform proper load balancing
    at these domains, because several pieces of logic in
    find_busiest_group() misidentify the busiest sched group within a
    given domain.  This leads to inadequate load balancing and causes the
    50% performance hit.
    
    To give a concrete example: on a dual-core, 2-socket NUMA system
    there are 4 logical CPUs, organized as follows (the hex spans are
    decoded in the sketch after the dump):
    
    CPU0 attaching sched-domain:
     domain 0: span 0003  groups: 0001 0002
     domain 1: span 000f  groups: 0003 000c
    CPU1 attaching sched-domain:
     domain 0: span 0003  groups: 0002 0001
     domain 1: span 000f  groups: 0003 000c
    CPU2 attaching sched-domain:
     domain 0: span 000c  groups: 0004 0008
     domain 1: span 000f  groups: 000c 0003
    CPU3 attaching sched-domain:
     domain 0: span 000c  groups: 0008 0004
     domain 1: span 000f  groups: 000c 0003
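
    Each hex value in the dump above is a cpumask: bit N set means CPU N
    is a member of that span or group.  Below is a minimal user-space
    sketch (not kernel code; the helper name and the socket labels are
    just illustrative, inferred from the dual-core/2-socket layout
    described above) that decodes the three masks:

    #include <stdio.h>

    /* print which CPUs are set in a 4-CPU cpumask */
    static void print_mask(const char *name, unsigned long mask)
    {
            int cpu;

            printf("%s = 0x%04lx -> CPUs:", name, mask);
            for (cpu = 0; cpu < 4; cpu++)
                    if (mask & (1UL << cpu))
                            printf(" %d", cpu);
            printf("\n");
    }

    int main(void)
    {
            print_mask("span 0003 (cores of socket 0)", 0x3); /* CPUs 0 1 */
            print_mask("span 000c (cores of socket 1)", 0xc); /* CPUs 2 3 */
            print_mask("span 000f (whole machine)    ", 0xf); /* CPUs 0-3 */
            return 0;
    }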
    
    If I run 2 tasks with CPU affinity set to 0x5, there are situations
    where cpu0 ends up with a run queue length of 2 while cpu2 sits
    idle.  The kernel load balancer is unable to spread these two tasks
    over cpu0 and cpu2 because at least three pieces of logic in
    find_busiest_group() heavily bias load balancing towards power
    saving mode.  E.g. while determining the "busiest" variable, the
    kernel only sets it when "sum_nr_running > group_capacity".  This
    test is flawed because "sum_nr_running" is not necessarily the same
    as the number of tasks allowed to run within the sched group.  The
    end result is that the kernel "thinks" everything is balanced, while
    in reality there is an imbalance, leaving one CPU over-subscribed
    and another idle.  Two other pieces of logic in the same function
    cause a similar effect.  The nasty part of this bug is that the
    kernel cannot get itself unstuck from this broken state.  From what
    we've seen in our environment, the kernel stays stuck in the
    imbalanced state for extended periods of time, and it is also very
    easy to get it into that state (it's pretty much 100% reproducible
    for us).
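
    To see why the "sum_nr_running > group_capacity" test misses this
    case, plug in the numbers from the 0x5-affinity example: cpu1 is not
    in the affinity mask and can never take one of the tasks, yet it
    still counts towards group_capacity.  A standalone sketch with
    illustrative values (not kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long sum_nr_running = 2; /* both pinned tasks sit on cpu0   */
            unsigned long group_capacity = 2; /* group {cpu0, cpu1} has two CPUs */

            /* old test: only strictly over-capacity groups become "busiest" */
            if (sum_nr_running > group_capacity)
                    printf("group marked busiest, tasks get spread out\n");
            else
                    printf("group looks balanced, cpu2 stays idle\n"); /* what happens */

            return 0;
    }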
    
    So here is the proposed fix: add additional logic to
    find_busiest_group() to detect intrinsic imbalance within the busiest
    group.  When such a condition is detected, load balancing goes into
    spread mode instead of the default grouping mode.
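
    For reference, here is a condensed, standalone restatement of the
    detection the patch below adds (a sketch with a hypothetical helper
    name, not the actual kernel function): a group whose most and least
    loaded CPUs differ by more than SCHED_LOAD_SCALE is flagged as
    internally imbalanced and may be picked as busiest even when it is
    not over capacity:

    #include <stdio.h>

    #define SCHED_LOAD_SCALE 1024UL /* scale of one nice-0 task's load */

    /* 1 when the group's internal load spread exceeds one full task */
    static int group_is_imbalanced(const unsigned long *cpu_load, int nr_cpus)
    {
            unsigned long max_cpu_load = 0, min_cpu_load = ~0UL;
            int i;

            for (i = 0; i < nr_cpus; i++) {
                    if (cpu_load[i] > max_cpu_load)
                            max_cpu_load = cpu_load[i];
                    if (cpu_load[i] < min_cpu_load)
                            min_cpu_load = cpu_load[i];
            }
            return (max_cpu_load - min_cpu_load) > SCHED_LOAD_SCALE;
    }

    int main(void)
    {
            /* group {cpu0, cpu1}: both pinned tasks on cpu0, cpu1 idle */
            unsigned long load[] = { 2 * SCHED_LOAD_SCALE, 0 };

            printf("group_imb = %d\n", group_is_imbalanced(load, 2)); /* prints 1 */
            return 0;
    }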
    
    Signed-off-by: Ken Chen <[EMAIL PROTECTED]>
    Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 kernel/sched.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5e220bf..9754364 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2336,7 +2336,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
        unsigned long max_pull;
        unsigned long busiest_load_per_task, busiest_nr_running;
        unsigned long this_load_per_task, this_nr_running;
-       int load_idx;
+       int load_idx, group_imb = 0;
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
        int power_savings_balance = 1;
        unsigned long leader_nr_running = 0, min_load_per_task = 0;
@@ -2355,9 +2355,10 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
                load_idx = sd->idle_idx;
 
        do {
-               unsigned long load, group_capacity;
+               unsigned long load, group_capacity, max_cpu_load, min_cpu_load;
                int local_group;
                int i;
+               int __group_imb = 0;
                unsigned int balance_cpu = -1, first_idle_cpu = 0;
                unsigned long sum_nr_running, sum_weighted_load;
 
@@ -2368,6 +2369,8 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 
                /* Tally up the load of all CPUs in the group */
                sum_weighted_load = sum_nr_running = avg_load = 0;
+               max_cpu_load = 0;
+               min_cpu_load = ~0UL;
 
                for_each_cpu_mask(i, group->cpumask) {
                        struct rq *rq;
@@ -2388,8 +2391,13 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
                                }
 
                                load = target_load(i, load_idx);
-                       } else
+                       } else {
                                load = source_load(i, load_idx);
+                               if (load > max_cpu_load)
+                                       max_cpu_load = load;
+                               if (min_cpu_load > load)
+                                       min_cpu_load = load;
+                       }
 
                        avg_load += load;
                        sum_nr_running += rq->nr_running;
@@ -2415,6 +2423,9 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
                avg_load = sg_div_cpu_power(group,
                                avg_load * SCHED_LOAD_SCALE);
 
+               if ((max_cpu_load - min_cpu_load) > SCHED_LOAD_SCALE)
+                       __group_imb = 1;
+
                group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
 
                if (local_group) {
@@ -2423,11 +2434,12 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
                        this_nr_running = sum_nr_running;
                        this_load_per_task = sum_weighted_load;
                } else if (avg_load > max_load &&
-                          sum_nr_running > group_capacity) {
+                          (sum_nr_running > group_capacity || __group_imb)) {
                        max_load = avg_load;
                        busiest = group;
                        busiest_nr_running = sum_nr_running;
                        busiest_load_per_task = sum_weighted_load;
+                       group_imb = __group_imb;
                }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -2499,6 +2511,9 @@ group_next:
                goto out_balanced;
 
        busiest_load_per_task /= busiest_nr_running;
+       if (group_imb)
+               busiest_load_per_task = min(busiest_load_per_task, avg_load);
+
        /*
         * We're trying to get all the cpus to the average_load, so we don't
         * want to push ourselves above the average load, nor do we wish to