On 2021/1/25 17:06, Mel Gorman wrote:
> On Mon, Jan 25, 2021 at 02:02:58PM +0800, Aubrey Li wrote:
>> A long-tail load balance cost is observed on the newly idle path,
>> caused by a race window between the first nr_running check of the
>> busiest runqueue and its nr_running recheck in detach_tasks.
>>
>> Before the busiest runqueue is locked, its tasks could be pulled
>> by other CPUs, and nr_running of the busiest runqueue drops to 1.
>> This causes detach_tasks to break out with the LBF_ALL_PINNED flag
>> set, which triggers a load_balance redo at the same sched_domain
>> level.
>>
>> To find the new busiest sched_group and CPU, load balance then
>> recomputes and updates the various load statistics, which
>> eventually leads to the long-tail load balance cost.
>>
>> This patch introduces a variable (sched_nr_lb_redo) to limit the
>> number of load balance redos. Combined with sysctl_sched_nr_migrate,
>> the max load balance cost is reduced from 100+ us to 70+ us,
>> measured on a 4s x86 system with 192 logical CPUs.
>>
>> Cc: Andi Kleen <a...@linux.intel.com>
>> Cc: Tim Chen <tim.c.c...@linux.intel.com>
>> Cc: Srinivas Pandruvada <srinivas.pandruv...@linux.intel.com>
>> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>> Signed-off-by: Aubrey Li <aubrey...@linux.intel.com>
>
> If redo_max is a constant, why is it not a #define instead of
> increasing the size of lb_env?
>
I followed the existing variable sched_nr_migrate_break; I think this
might become a tunable as well.

Thanks,
-Aubrey