Hi Hillf,

thanks a lot for your patch!

Compared to 5.7 rc4 vanilla, we observe the following:
  * Single-tenant jobs show improvements of up to 15% for SPECjbb2005 and
up to 100% for NAS in low-thread mode. In other words, it fixes all
the problems we have reported in this thread.
  * Multi-tenant jobs show a performance degradation of up to 30% for SPECjbb2005.

While it fixes the problems with single-tenant jobs and with performance
at low system load, it hurts multi-tenant workloads.

We have compared it against a kernel with adjust_numa_imbalance disabled
[1]. Both kernels perform at the same level for the single-tenant jobs,
but the proposed patch performs poorly in the multi-tenant mode. The
kernel with adjust_numa_imbalance disabled is the clear winner here.

We would be very interested in what others think about disabling the
adjust_numa_imbalance function. The patch is below [1]. It would be great
to collect performance results for different scenarios to make sure
the conclusions are objective.

Thanks a lot!
Jirka

[1] Patch to test kernel with adjust_numa_imbalance disabled:
===============================================
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..8c43d29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8907,14 +8907,6 @@ static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
       unsigned int imbalance_min;

-       /*
-        * Allow a small imbalance based on a simple pair of communicating
-        * tasks that remain local when the source domain is almost idle.
-        */
-       imbalance_min = 2;
-       if (src_nr_running <= imbalance_min)
-               return 0;
-
       return imbalance;
}
===============================================
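For completeness, an alternative to removing the check outright would be a
runtime knob that defaults to the current value of 2 and restores the old
behaviour when set to 0 (roughly what a tuned profile could flip). The
sketch below is purely illustrative; the numa_imbalance_min variable and
whatever debugfs/sysctl wiring would expose it are hypothetical and not
part of any patch posted in this thread:

/*
 * Illustrative only: "numa_imbalance_min" is a hypothetical knob; it is
 * not part of the patches in this thread. With the default of 2 the
 * behaviour matches 5.7-rc4; setting it to 0 is equivalent to the
 * adjust_numa_imbalance-disabled kernel tested above.
 */
static unsigned int numa_imbalance_min __read_mostly = 2;

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	if (src_nr_running <= numa_imbalance_min)
		return 0;

	return imbalance;
}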

On Fri, May 8, 2020 at 5:47 AM Hillf Danton <hdan...@sina.com> wrote:
>
>
> On Thu, 7 May 2020 13:49:34 Phil Auld wrote:
> >
> > On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > we are not targeting just OMP applications. We see the performance
> > > degradation also for other workloads, like SPECjbb2005 and
> > > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > > threads (the system has 64 CPUs in total). We observe this degradation
> > > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > > binaries in parallel, there is no change in performance between 5.6
> > > and 5.7.
> > >
> > > That's why we are asking for the kernel tunable, which we would add to
> > > the tuned profile. We don't expect users to change this frequently but
> > > rather to set the performance profile once based on the purpose of the
> > > server.
> > >
> > > If you could prepare a patch for us, we would be more than happy to
> > > test it extensively. Based on the results, we can then evaluate if
> > > it's the way to go. Thoughts?
> > >
> >
> > I'm happy to spin up a patch once I'm sure what exactly the tuning would
> > effect. At an initial glance I'm thinking it would be the imbalance_min
> > which is currently hardcoded to 2. But there may be something else...
>
> hrm... try to restore the old behavior by skipping task migrate in favor
> of the local node if there is no imbalance.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1928,18 +1928,16 @@ static void task_numa_find_cpu(struct ta
>                 src_running = env->src_stats.nr_running - 1;
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
> -               imbalance = adjust_numa_imbalance(imbalance, src_running);
> +               imbalance = adjust_numa_imbalance(imbalance, src_running +1);
>
> -               /* Use idle CPU if there is no imbalance */
> +               /* No task migrate without imbalance */
>                 if (!imbalance) {
> -                       maymove = true;
> -                       if (env->dst_stats.idle_cpu >= 0) {
> -                               env->dst_cpu = env->dst_stats.idle_cpu;
> -                               task_numa_assign(env, NULL, 0);
> -                               return;
> -                       }
> +                       env->best_cpu = -1;
> +                       return;
>                 }
> -       } else {
> +       }
> +
> +       do {
>                 long src_load, dst_load, load;
>                 /*
>                  * If the improvement from just moving env->p direction is better
> @@ -1949,7 +1947,7 @@ static void task_numa_find_cpu(struct ta
>                 dst_load = env->dst_stats.load + load;
>                 src_load = env->src_stats.load - load;
>                 maymove = !load_too_imbalanced(src_load, dst_load, env);
> -       }
> +       } while (0);
>
>         for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
>                 /* Skip this CPU if the source task cannot migrate */
>
>


--
-Jirka
