Morten Rasmussen <[email protected]> writes:

> On Tue, Sep 16, 2014 at 06:49:14PM +0100, [email protected] wrote:
>> Morten Rasmussen <[email protected]> writes:
>> 
>> > Task group load-contributions are not updated when tasks belonging to
>> > task groups are migrated by active load-balancing. If no other task
>> > belonging to the same task group is already queued at the destination
>> > cpu the group sched_entity will be enqueued with load_avg_contrib=0.
>> > Hence, weighted_cpuload() won't reflect the newly added load.
>> >
>> > The load may remain invisible until the next tick, when the sched_entity
>> > load_avg_contrib and task group contributions are reevaluated.
>> >
>> > The enqueue loop
>> >
>> >    for_each_entity(se) {
>> >       enqueue_entity(cfs_rq,se)
>> >          ...
>> >          enqueue_entity_load_avg(cfs_rq,se)
>> >             ...
>> >             update_entity_load_avg(se)
>> >                ...
>> >                __update_entity_load_avg_contrib(se)
>> >                ...
>> >             ...
>> >             update_cfs_rq_blocked_load(cfs_rq)
>> >                ...
>> >                __update_cfs_rq_tg_load_contrib(cfs_rq)
>> >                ...
>> >    }
>> >
>> > currently skips __update_entity_load_avg_contrib() and
>> > __update_cfs_rq_tg_load_contrib() for group entities for active
>> > load-balance migrations. The former updates the sched_entity
>> > load_avg_contrib, and the latter updates the task group contribution
>> > which is needed by the former. They must both be called to ensure that
>> > load doesn't temporarily disappear.
>> >
>> > cc: Paul Turner <[email protected]>
>> > cc: Ben Segall <[email protected]>
>> >
>> > Signed-off-by: Morten Rasmussen <[email protected]>
>> > ---
>> >  kernel/sched/fair.c |    7 ++++++-
>> >  1 file changed, 6 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index be9e97b..2b6e2eb 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2521,7 +2521,8 @@ static inline void update_entity_load_avg(struct 
>> > sched_entity *se,
>> >    else
>> >            now = cfs_rq_clock_task(group_cfs_rq(se));
>> >  
>> > -  if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
>> > +  if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq) &&
>> > +                                                  entity_is_task(se))
>> >            return;
>> >  
>> >    contrib_delta = __update_entity_load_avg_contrib(se);
>> > @@ -2609,6 +2610,10 @@ static inline void enqueue_entity_load_avg(struct 
>> > cfs_rq *cfs_rq,
>> >    cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>> >    /* we force update consideration on load-balancer moves */
>> >    update_cfs_rq_blocked_load(cfs_rq, !wakeup);
>> > +
>> > +  /* We force update group contributions on load-balancer moves */
>> > +  if (wakeup && !entity_is_task(se))
>> > +          __update_cfs_rq_tg_load_contrib(cfs_rq, 0);
>> >  }
>> >  
>> >  /*
>> 
>> It should probably be clearer that what this actually means is that we
>> always update tg_load_contrib for any group-of-group (which is weird),
>
> Would it not only update groups entities if they aren't already on a
> runqueue? I'm clearly missing something.

No, that's correct, but it will do it on normal sleep/wake too, not just
load-balancer moves.

>
>> so that we update the entire tree on load-balancer moves (which is
>> sensible and what we care about, because it determines each 
>> se.load_avg_contrib).
>> 
>> Wouldn't we need this on dequeue as well?
>
> Yes. The problem is there too I think. It is not as visible though. If
> the last task of a particular group is migrated away from a cpu, the
> group entities will be dequeued and we should be fine. However, if there
> are other tasks left in the group or any parent group the load of that
> group will be off, I think. I need to verify that.

Yeah, the case where it comes up would be different, and possibly less
bad.

>
>> 
>> I believe the reason for both of these ratelimits was performance on
>> sleep/wake (where blocked_load_avg means tg_load_contrib should not have
>> changed), it would be good to know how much this hurts. Given the
>> fast-exit in __update_cfs_rq_tg_load_contrib, it seems like it might be
>> fine, but the group-entity __update_entity_load_avg_contrib is less
>> free when div_u64 is expensive, and is currently only called 1/ms.
>
> Right. Is there a suitable group scheduling benchmark I can use to test
> this?

The most severe would probably be pipe test where you moved the threads
to be in separate cgroups. A more realistic one would probably just be
looping around nanosleep(100ns) on a bunch of cpus or something like that.

>
>> 
>> It seems like it would be a bit of a mess to force an update of the
>> entire path in migrate, but it would certainly dodge the performance
>> concerns if they wind up being an issue. (I suppose we could have
>> ENQUEUE/DEQUEUE_MIGRATE or something...)
>
> Yes, I had the same thought. We might have to do something like that
> anyway to fix the dequeue path. It seems less straightforward to fix,
> but I need to take a closer look.
>
> Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to