On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> > 
> > > The bugs they found seem real, and their analysis is great (although
> > > using visualizations to find and fix scheduler bugs isn't new), and
> > > it would be good to see these fixed.  However, it would also be
> > > useful to double check how widespread these issues really are.  I
> > > suspect many on this list can test these patches in different
> > > environments.
> > 
> > Part of it sounded to me very much like they're meeting and "fixing"
> > SMP group fairness...
> 
> Ew, NUMA boxen look like they could use a hug or two.  Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest sized NUMA box, any load that wants to scale
is essentially reduced to SCHED_IDLE class by smp nice scaling.
Limit the effective niceness to prevent cramming a box-wide load
into a too-small space.  Since niceness also affects latency, give
the user the option to disable box-wide group fairness entirely.

time make -j192 modules on a 4 node NUMA box..

Before:
root cgroup
real    1m6.987s      1.00

cgroup vs 1 group of 1 hog
real    1m20.871s     1.20

cgroup vs 2 groups of 1 hog
real    1m48.803s     1.62

Each single task group receives a ~full socket, because the box-wide
kbuild's group weight is spread so thin that the build becomes an
essentially massless object fitting in practically no space at all.
Near perfect math led directly to far from good scaling/performance:
a "perfect is the enemy of good" poster child.
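For the curious, the dilution is simple division.  The userspace sketch
below is purely illustrative (not kernel code, and not part of the patch);
the 192 CPU count and the weight values 1024 / 9548 for nice 0 / nice -10
are assumptions matching the box above.  It mimics the calc_cfs_shares()
arithmetic with and without the tg_weight clamp:

/*
 * Illustrative userspace sketch: approximate the calc_cfs_shares()
 * division with and without the tg_weight clamp.  The 192 CPU count
 * and the weight table values (1024 for nice 0, 9548 for nice -10)
 * are assumptions for the sake of the example.
 */
#include <stdio.h>

#define MIN_SHARES	2

static long group_entity_weight(long tg_shares, long cpu_load, long tg_weight)
{
	long shares = tg_weight ? tg_shares * cpu_load / tg_weight : tg_shares;

	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}

int main(void)
{
	long nice0 = 1024, nice_m10 = 9548;	/* ~sched_prio_to_weight[20], [10] */
	long cpus = 192;

	/* Box-wide kbuild: one nice-0 runner per CPU, tg_weight ~= cpus * 1024. */
	printf("kbuild per-cpu entity weight, stock:   %ld\n",
	       group_entity_weight(nice0, nice0, cpus * nice0));
	/* Same group with tg_weight clamped to the nice -10 weight. */
	printf("kbuild per-cpu entity weight, clamped: %ld\n",
	       group_entity_weight(nice0, nice0, nice_m10));
	/* A one-hog group keeps its full weight either way. */
	printf("hog per-cpu entity weight:             %ld\n",
	       group_entity_weight(nice0, nice0, nice0));
	return 0;
}

Stock math leaves each of the kbuild's per-cpu group entities weighing
~5, barely above the SCHED_IDLE weight, next to the lone hog group's
1024.  Clamping tg_weight to the nice -10 weight lifts that to ~110,
roughly the weight of a nice 10 task: still lighter than the hog, but
no longer massless.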

After "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with
the box sized kbuild.

cgroup vs 2 groups of 1 hog
real    1m8.151s     1.01  192/190=1.01

Good enough works better.. nearly perfectly in this case.

Signed-off-by: Mike Galbraith <[email protected]>
---
 kernel/sched/fair.c     |   22 ++++++++++++++++++----
 kernel/sched/features.h |    3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct
 
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-       long tg_weight, load, shares;
+       long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-       tg_weight = calc_tg_weight(tg, cfs_rq);
+       if (!sched_feat(SMP_NICE_GROUPS))
+               return tg->shares;
+
+       /*
+        * Bound niceness to prevent everything that wants to scale from
+        * essentially becoming SCHED_IDLE on multi/large socket boxen,
+        * screwing up our ability to distribute load properly and/or
+        * deliver acceptable latencies.
+        */
+       tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
        load = cfs_rq->load.weight;
 
        shares = (tg->shares * load);
        if (tg_weight)
                shares /= tg_weight;
 
-       if (shares < MIN_SHARES)
-               shares = MIN_SHARES;
+       if (tg->shares > sched_prio_to_weight[20])
+               min_shares = sched_prio_to_weight[20];
+       if (shares < min_shares)
+               shares = min_shares;
        if (shares > tg->shares)
                shares = tg->shares;
 
@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
        if (likely(se->load.weight == tg->shares))
                return;
+#else
+       if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+               return;
 #endif
        shares = calc_cfs_shares(cfs_rq, tg);
 
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif
