Re: wake_wide mechanism clarification

Atish Patra Thu, 10 Aug 2017 10:44:18 -0700

On 08/10/2017 04:48 AM, Brendan Jackman wrote:


On Wed, Aug 09 2017 at 21:22, Atish Patra wrote:

On 08/03/2017 10:05 AM, Brendan Jackman wrote:


On Thu, Aug 03 2017 at 13:15, Josef Bacik wrote:

On Thu, Aug 03, 2017 at 11:53:19AM +0100, Brendan Jackman wrote:


Hi,

On Fri, Jun 30 2017 at 17:55, Josef Bacik wrote:

On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:

On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:

On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:

That makes sense that we multiply slave's flips by a factor because
its low, but I still didn't get why the factor is chosen to be
llc_size instead of something else for the multiplication with slave
(slave * factor).

Yeah I don't know why llc_size was chosen...


static void update_top_cache_domain(int cpu)
{
struct sched_domain_shared *sds = NULL;
struct sched_domain *sd;
int id = cpu;
int size = 1;

sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
sds = sd->shared;
}

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;

The goal of wake wide was to approximate when pulling would be a futile
consolidation effort and counterproductive to scaling. 'course with
ever increasing socket size, any 1:N waker is ever more likely to run
out of CPU for its one and only self (slamming into scaling wall)
before it needing to turn its minions loose to conquer the world.

Something else to consider: network interrupt waking multiple workers
at high frequency. If the waking CPU is idle, do you really want to
place a worker directly in front of a tattoo artist, or is it better
off nearly anywhere but there?

If the box is virtual, with no topology exposed (or real but ancient)
to let select_idle_sibling() come to the rescue, two workers can even
get tattooed simultaneously (see sync wakeup).


Heuristics are hard, news at 11.  I think messing with wake_wide() itself is too
big of a hammer, we probably need a middle ground.  I'm messing with it right
now so it's too early to say for sure, but i _suspect_ the bigger latencies we
see are not because we overload the cpu we're trying to pull to, but because
when we fail to do the wake_affine() we only look at siblings of the affine_sd
instead of doing the full "find the idlest cpu in the land!" thing.


This is the problem I've been hitting lately. My use case is 1 task per
CPU on ARM big.LITTLE (asymmetrical CPU capacity). The workload is 1
task per CPU, they all do X amount of work then pthread_barrier_wait
(i.e. sleep until the last task finishes its X and hits the barrier). On
big.LITTLE, the tasks which get a "big" CPU finish faster, and then
those CPUs pull over the tasks that are still running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up, there
are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. Assuming that those wakeups happen on CPU4,
regardless of wake_affine, want_affine means that we'll only look in
sd_llc (cpus 0 and 1), so tasks will be unnecessarily coscheduled on the
bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

I _think_
the answer is to make select_idle_sibling() try less hard to find something
workable and only use obviously idle cpu's in the affine sd, and fall back to
the full load balance esque search.


So this idea of allowing select_idle_sibling to fail, and falling back
to the slow path, would help me too, I think.


Unfortunately this statement of mine was wrong, I had it in my head that we
would fall back to a find the idlest cpu thing provided we failed to wake
affine, but we just do select_idle_sibling() and expect the load balancer to
move things around as needed.


Ah yes, when wake_affine() returns false, we still do
select_idle_sibling (except in prev_cpu's sd_llc instead of
smp_processor_id()'s), and that is the problem faced by my workload. I
thought you were suggesting to change the flow so that
select_idle_sibling can say "I didn't find any idle siblings - go to the
find_idlest_group path".

This is also why I was playing with your
don't-affine-recently-balanced-tasks patch[1], which also helps my case
since it prevents want_affine for tasks 3 and 4 (which were recently
moved by an active balance).

[1] https://marc.info/?l=linux-kernel&m=150003849602535&w=2
    (also linked elsewhere in this thread)


Would you try peter's sched/experimental branch and see how that affects your
workload?  I'm still messing with my patches and I may drop this one as it now
appears to be too aggressive with the new set of patches.  Thanks,


Sure, I'll take a look at those, thanks. I guess the idea of caching
values in LB and then using them in wakeup[2] is a lighter-handed way of
achieving the same thing as last_balance_ts? It won't solve my problem
directly since we'll still only look in sd_llc, but I think it could be
a basis for a way to say "go find_idlest_group path on these tasks" at
the beginning of select_task_rq_fair.

[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/experimental&id=5b4ed509027a5b6f495e6fe871cae850d5762bef

Thanks,
Brendan


Would it be better if it searches for idle cpus in next higher domain
(if that is not NUMA) instead of doing find_idlest_group ?

The above approach [2] will still try to search a idle cpu in the llc
domain of new_cpu. If it finds one great. Otherwise, let's search for an
idle cpu in next higher domain(if that is not NUMA) excluding the
current domain which we have already searched. I think it would help in
cases where LLC < NUMA.


Well for my use case, "search for idle cpus in the next higher domain"
and "go to find_busiest_group & co" mean the same thing.

Yeah. It just may be tad cheaper to do idle cpu search than tofind_idlest_group and find_idlest_cpu.


But for NUMA

boxes (where I'm guessing there are >=3 sched_domains?)

Yeah.
then yeah that

sounds like a good idea at least to my inexpert ears. Currently it's
either "only look at siblings of either prev_cpu or smp_processor_id()"
or "look through the whole system (or at least as far as sd_flag takes
us)", so maybe you want middle ground?

On the other hand, I'm guessing remote wakeups across NUMA nodes are
expensive, so that's why we only look at siblings?

Correct. But I am suggesting the above modification impacting only forarchs where LLC < NUMA i.e. you have SMT, LLC, DIE/CPU, NUMA.. domains.For your case, I guess the domains would stop at CPU. Any architecturewith above domain hierarchy can search wide to find out an idle cpu.


Other architectures where NUMA is the next domain to LLC, it won't do
anything.

There's already a

choice between prev_cpu's and smp_processor_id()'s LLC siblings.

Yes. So we may benefit by searching idle cpus other LLC groups as well.
Initially, I was worried about it may be too expensive to search wide

but Peter's recent patch (1ad3aaf3fcd2:Implement new approach to scaleselect_idle_cpu) should abandon the search if it becomes too costly.


Did I miss any case it may affect performance negatively ?

Regards,
Atish

Re: wake_wide mechanism clarification

Reply via email to