Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Steven Rostedt
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
> 
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
> 
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
> 
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
> 
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
> 
> A new idle_exit function is called when the prev task is the idle function
> so the elapsed time will be accounted as idle time in the rq's load.
> 
> Changes since V5:
> - Rename idle_enter/exit function to idle_enter/exit_fair
> 
> Changes since V4:
> - Rebase on v3.9-rc6 instead of Steven Rostedt's patches

Acked-by: Steven Rostedt 

-- Steve

> - Create the post_schedule_idle function that was previously created by 
> Steven's patches
> 
> Changes since V3:
> - Remove dependency on CONFIG_FAIR_GROUP_SCHED
> - Add a new idle_enter function and create a post_schedule callback for
>  idle class
> - Remove the update_runnable_avg from idle_balance
> 
> Changes since V2:
> - remove useless definition for UP platform
> - rebased on top of Steven Rostedt's patches :
> https://lkml.org/lkml/2013/2/12/558
> 
> Changes since V1:
> - move code out of schedule function and create a pre_schedule callback for
>   idle class instead.
> 
> Signed-off-by: Vincent Guittot 
> ---


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Peter Zijlstra
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
> 
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
> 
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
> 
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
> 
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
> 
> A new idle_exit function is called when the prev task is the idle function
> so the elapsed time will be accounted as idle time in the rq's load.

Acked-by: Peter Zijlstra 

Thanks Vince!



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Vincent Guittot
On 19 April 2013 11:21, Mike Galbraith  wrote:
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
>> On 19 April 2013 10:14, Mike Galbraith  wrote:
>> > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
>> >> On 19 April 2013 06:30, Mike Galbraith  wrote:
>> >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> >> >> The current update of the rq's load can be erroneous when RT tasks are
>> >> >> involved
>> >> >>
>> >> >> The update of the load of a rq that becomes idle, is done only if the 
>> >> >> avg_idle
>> >> >> is less than sysctl_sched_migration_cost. If RT tasks and short idle 
>> >> >> duration
>> >> >> alternate, the runnable_avg will not be updated correctly and the time 
>> >> >> will be
>> >> >> accounted as idle time when a CFS task wakes up.
>> >> >>
>> >> >> A new idle_enter function is called when the next task is the idle 
>> >> >> function
>> >> >> so the elapsed time will be accounted as run time in the load of the 
>> >> >> rq,
>> >> >> whatever the average idle time is. The function update_rq_runnable_avg 
>> >> >> is
>> >> >> removed from idle_balance.
>> >> >>
>> >> >> When a RT task is scheduled on an idle CPU, the update of the rq's 
>> >> >> load is
>> >> >> not done when the rq exit idle state because CFS's functions are not
>> >> >> called. Then, the idle_balance, which is called just before entering 
>> >> >> the
>> >> >> idle function, updates the rq's load and makes the assumption that the
>> >> >> elapsed time since the last update, was only running time.
>> >> >>
>> >> >> As a consequence, the rq's load of a CPU that only runs a periodic RT 
>> >> >> task,
>> >> >> is close to LOAD_AVG_MAX whatever the running duration of the RT task 
>> >> >> is.
>> >> >
>> >> > Why do we care what rq's load says, if the only thing running is a
>> >> > periodic RT task?  I _think_ I recall that stuff being put under the
>> >>
>> >> cfs scheduler will use a wrong rq load the next time it wants to schedule 
>> >> a task
>> >>
>> >> > throttle specifically to not waste cycles doing that on every
>> >> > microscopic idle.
>> >>
>> >> Yes, but this leads to the wrong computation of runnable_avg_sum. To be
>> >> more precise, we only need to call __update_entity_runnable_avg;
>> >> __update_tg_runnable_avg is not mandatory in this step.
>> >
>> > If it only scares fair class tasks away from the periodic rt load, that
>> > seems like a benefit to me, not a liability.  If we really really need
>>
>> I'm not sure that such behavior, which is based only on an erroneous
>> value, is a good one.
>>
>> > perfect load numbers, fine, we have to eat some cycles, but when I look
>> > at it, it looks like one of those "Perfect is the enemy of good" things.
>>
>> The target is not a perfect number but one good enough to be usable. The
>> sysctl_sched_migration_cost threshold is good for idle balancing but can
>> generate a wrong load value.
>
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case

If runnable_avg_sum can be wrong, it becomes unusable, and everything
built on top of it becomes useless.

> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.  If you muck about with rt classes, you need to
> have a good reason for doing that.  If you do have a good reason, you
> also allocated all resources, including CPU, so don't need the kernel to

Some tasks have responsiveness constraints, so they use the rt class,
but they still coexist with cfs tasks.

Vincent

> balance the load for you.  Paying any fast path price to make the kernel
> balance a mixed rt/fair load just seems fundamentally wrong to me.
>
> -Mike
>


Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Mike Galbraith
On Fri, 2013-04-19 at 11:21 +0200, Mike Galbraith wrote: 
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote: 
> > On 19 April 2013 10:14, Mike Galbraith  wrote:
> > > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> > >> On 19 April 2013 06:30, Mike Galbraith  wrote:
> > >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> > >> >> The current update of the rq's load can be erroneous when RT tasks are
> > >> >> involved
> > >> >>
> > >> >> The update of the load of a rq that becomes idle, is done only if the 
> > >> >> avg_idle
> > >> >> is less than sysctl_sched_migration_cost. If RT tasks and short idle 
> > >> >> duration
> > >> >> alternate, the runnable_avg will not be updated correctly and the 
> > >> >> time will be
> > >> >> accounted as idle time when a CFS task wakes up.
> > >> >>
> > >> >> A new idle_enter function is called when the next task is the idle 
> > >> >> function
> > >> >> so the elapsed time will be accounted as run time in the load of the 
> > >> >> rq,
> > >> >> whatever the average idle time is. The function 
> > >> >> update_rq_runnable_avg is
> > >> >> removed from idle_balance.
> > >> >>
> > >> >> When a RT task is scheduled on an idle CPU, the update of the rq's 
> > >> >> load is
> > >> >> not done when the rq exit idle state because CFS's functions are not
> > >> >> called. Then, the idle_balance, which is called just before entering 
> > >> >> the
> > >> >> idle function, updates the rq's load and makes the assumption that the
> > >> >> elapsed time since the last update, was only running time.
> > >> >>
> > >> >> As a consequence, the rq's load of a CPU that only runs a periodic RT 
> > >> >> task,
> > >> >> is close to LOAD_AVG_MAX whatever the running duration of the RT task 
> > >> >> is.
> > >> >
> > >> > Why do we care what rq's load says, if the only thing running is a
> > >> > periodic RT task?  I _think_ I recall that stuff being put under the
> > >>
> > >> cfs scheduler will use a wrong rq load the next time it wants to 
> > >> schedule a task
> > >>
> > >> > throttle specifically to not waste cycles doing that on every
> > >> > microscopic idle.
> > >>
> > >> Yes, but this leads to the wrong computation of runnable_avg_sum. To be
> > >> more precise, we only need to call __update_entity_runnable_avg;
> > >> __update_tg_runnable_avg is not mandatory in this step.
> > >
> > > If it only scares fair class tasks away from the periodic rt load, that
> > > seems like a benefit to me, not a liability.  If we really really need
> > 
> > I'm not sure that such behavior, which is based only on an erroneous
> > value, is a good one.
> > 
> > > perfect load numbers, fine, we have to eat some cycles, but when I look
> > > at it, it looks like one of those "Perfect is the enemy of good" things.
> > 
> > The target is not a perfect number but one good enough to be usable. The
> > sysctl_sched_migration_cost threshold is good for idle balancing but can
> > generate a wrong load value.
> 
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case
> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.

So I'm not convinced this is a good thing to do, but it's not my call,
that's Peter's and Ingo's job, so having expressed my opinion, I'll shut up
and let them do their thing ;-)

-Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Mike Galbraith
On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote: 
> On 19 April 2013 10:14, Mike Galbraith  wrote:
> > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> >> On 19 April 2013 06:30, Mike Galbraith  wrote:
> >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> >> >> The current update of the rq's load can be erroneous when RT tasks are
> >> >> involved
> >> >>
> >> >> The update of the load of a rq that becomes idle, is done only if the 
> >> >> avg_idle
> >> >> is less than sysctl_sched_migration_cost. If RT tasks and short idle 
> >> >> duration
> >> >> alternate, the runnable_avg will not be updated correctly and the time 
> >> >> will be
> >> >> accounted as idle time when a CFS task wakes up.
> >> >>
> >> >> A new idle_enter function is called when the next task is the idle 
> >> >> function
> >> >> so the elapsed time will be accounted as run time in the load of the rq,
> >> >> whatever the average idle time is. The function update_rq_runnable_avg 
> >> >> is
> >> >> removed from idle_balance.
> >> >>
> >> >> When a RT task is scheduled on an idle CPU, the update of the rq's load 
> >> >> is
> >> >> not done when the rq exit idle state because CFS's functions are not
> >> >> called. Then, the idle_balance, which is called just before entering the
> >> >> idle function, updates the rq's load and makes the assumption that the
> >> >> elapsed time since the last update, was only running time.
> >> >>
> >> >> As a consequence, the rq's load of a CPU that only runs a periodic RT 
> >> >> task,
> >> >> is close to LOAD_AVG_MAX whatever the running duration of the RT task 
> >> >> is.
> >> >
> >> > Why do we care what rq's load says, if the only thing running is a
> >> > periodic RT task?  I _think_ I recall that stuff being put under the
> >>
> >> cfs scheduler will use a wrong rq load the next time it wants to schedule 
> >> a task
> >>
> >> > throttle specifically to not waste cycles doing that on every
> >> > microscopic idle.
> >>
> >> Yes, but this leads to the wrong computation of runnable_avg_sum. To be
> >> more precise, we only need to call __update_entity_runnable_avg;
> >> __update_tg_runnable_avg is not mandatory in this step.
> >
> > If it only scares fair class tasks away from the periodic rt load, that
> > seems like a benefit to me, not a liability.  If we really really need
> 
> I'm not sure that such behavior, which is based only on an erroneous
> value, is a good one.
> 
> > perfect load numbers, fine, we have to eat some cycles, but when I look
> > at it, it looks like one of those "Perfect is the enemy of good" things.
> 
> The target is not a perfect number but one good enough to be usable. The
> sysctl_sched_migration_cost threshold is good for idle balancing but can
> generate a wrong load value.

But again, why do we care?  To be able to mix rt and fair loads and
still make pretty mixed load utilization numbers?  Paying a general case
fast path price to make strange (to me) load utilization numbers pretty
is not very attractive.  If you muck about with rt classes, you need to
have a good reason for doing that.  If you do have a good reason, you
also allocated all resources, including CPU, so don't need the kernel to
balance the load for you.  Paying any fast path price to make the kernel
balance a mixed rt/fair load just seems fundamentally wrong to me.

-Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Vincent Guittot
On 19 April 2013 10:14, Mike Galbraith  wrote:
> On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
>> On 19 April 2013 06:30, Mike Galbraith  wrote:
>> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> >> The current update of the rq's load can be erroneous when RT tasks are
>> >> involved
>> >>
>> >> The update of the load of a rq that becomes idle, is done only if the 
>> >> avg_idle
>> >> is less than sysctl_sched_migration_cost. If RT tasks and short idle 
>> >> duration
>> >> alternate, the runnable_avg will not be updated correctly and the time 
>> >> will be
>> >> accounted as idle time when a CFS task wakes up.
>> >>
>> >> A new idle_enter function is called when the next task is the idle 
>> >> function
>> >> so the elapsed time will be accounted as run time in the load of the rq,
>> >> whatever the average idle time is. The function update_rq_runnable_avg is
>> >> removed from idle_balance.
>> >>
>> >> When a RT task is scheduled on an idle CPU, the update of the rq's load is
>> >> not done when the rq exit idle state because CFS's functions are not
>> >> called. Then, the idle_balance, which is called just before entering the
>> >> idle function, updates the rq's load and makes the assumption that the
>> >> elapsed time since the last update, was only running time.
>> >>
>> >> As a consequence, the rq's load of a CPU that only runs a periodic RT 
>> >> task,
>> >> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
>> >
>> > Why do we care what rq's load says, if the only thing running is a
>> > periodic RT task?  I _think_ I recall that stuff being put under the
>>
>> cfs scheduler will use a wrong rq load the next time it wants to schedule a 
>> task
>>
>> > throttle specifically to not waste cycles doing that on every
>> > microscopic idle.
>>
>> Yes, but this leads to the wrong computation of runnable_avg_sum. To be
>> more precise, we only need to call __update_entity_runnable_avg;
>> __update_tg_runnable_avg is not mandatory in this step.
>
> If it only scares fair class tasks away from the periodic rt load, that
> seems like a benefit to me, not a liability.  If we really really need

I'm not sure that such behavior, which is based only on an erroneous
value, is a good one.

> perfect load numbers, fine, we have to eat some cycles, but when I look
> at it, it looks like one of those "Perfect is the enemy of good" things.

The target is not a perfect number but one good enough to be usable. The
sysctl_sched_migration_cost threshold is good for idle balancing but can
generate a wrong load value.

Vincent
>
> -Mike
>


Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Mike Galbraith
On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote: 
> On 19 April 2013 06:30, Mike Galbraith  wrote:
> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> >> The current update of the rq's load can be erroneous when RT tasks are
> >> involved
> >>
> >> The update of the load of a rq that becomes idle, is done only if the 
> >> avg_idle
> >> is less than sysctl_sched_migration_cost. If RT tasks and short idle 
> >> duration
> >> alternate, the runnable_avg will not be updated correctly and the time 
> >> will be
> >> accounted as idle time when a CFS task wakes up.
> >>
> >> A new idle_enter function is called when the next task is the idle function
> >> so the elapsed time will be accounted as run time in the load of the rq,
> >> whatever the average idle time is. The function update_rq_runnable_avg is
> >> removed from idle_balance.
> >>
> >> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> >> not done when the rq exit idle state because CFS's functions are not
> >> called. Then, the idle_balance, which is called just before entering the
> >> idle function, updates the rq's load and makes the assumption that the
> >> elapsed time since the last update, was only running time.
> >>
> >> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> >> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
> >
> > Why do we care what rq's load says, if the only thing running is a
> > periodic RT task?  I _think_ I recall that stuff being put under the
> 
> cfs scheduler will use a wrong rq load the next time it wants to schedule a 
> task
> 
> > throttle specifically to not waste cycles doing that on every
> > microscopic idle.
> 
> Yes, but this leads to the wrong computation of runnable_avg_sum. To be
> more precise, we only need to call __update_entity_runnable_avg;
> __update_tg_runnable_avg is not mandatory in this step.

If it only scares fair class tasks away from the periodic rt load, that
seems like a benefit to me, not a liability.  If we really really need
perfect load numbers, fine, we have to eat some cycles, but when I look
at it, it looks like one of those "Perfect is the enemy of good" things.

-Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Vincent Guittot
On 19 April 2013 06:30, Mike Galbraith  wrote:
> On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> The current update of the rq's load can be erroneous when RT tasks are
>> involved
>>
>> The update of the load of a rq that becomes idle, is done only if the 
>> avg_idle
>> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
>> alternate, the runnable_avg will not be updated correctly and the time will 
>> be
>> accounted as idle time when a CFS task wakes up.
>>
>> A new idle_enter function is called when the next task is the idle function
>> so the elapsed time will be accounted as run time in the load of the rq,
>> whatever the average idle time is. The function update_rq_runnable_avg is
>> removed from idle_balance.
>>
>> When a RT task is scheduled on an idle CPU, the update of the rq's load is
>> not done when the rq exit idle state because CFS's functions are not
>> called. Then, the idle_balance, which is called just before entering the
>> idle function, updates the rq's load and makes the assumption that the
>> elapsed time since the last update, was only running time.
>>
>> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
>> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
>
> Why do we care what rq's load says, if the only thing running is a
> periodic RT task?  I _think_ I recall that stuff being put under the

cfs scheduler will use a wrong rq load the next time it wants to schedule a task

> throttle specifically to not waste cycles doing that on every
> microscopic idle.

Yes, but this leads to the wrong computation of runnable_avg_sum. To be
more precise, we only need to call __update_entity_runnable_avg;
__update_tg_runnable_avg is not mandatory in this step.

>
> Seems to me when scheduling an rt task, you want to do as little other
> than switching to/from the rt task as possible.  I don't let rt tasks do
> idle balancing either, their job isn't to balance fair class on the way
> out the door, it's to get off/onto the cpu ASAP, and do rt work.

I agree, but the patch is not about balancing fair tasks; it is about
keeping a coherent runnable value.

Vincent
>
> -Mike
>


Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Mike Galbraith
On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote: 
 On 19 April 2013 10:14, Mike Galbraith efa...@gmx.de wrote:
  On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
  On 19 April 2013 06:30, Mike Galbraith efa...@gmx.de wrote:
   On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
   The current update of the rq's load can be erroneous when RT tasks are
   involved
  
   The update of the load of a rq that becomes idle, is done only if the avg_idle
   is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
   alternate, the runnable_avg will not be updated correctly and the time will be
   accounted as idle time when a CFS task wakes up.
  
   A new idle_enter function is called when the next task is the idle function
   so the elapsed time will be accounted as run time in the load of the rq,
   whatever the average idle time is. The function update_rq_runnable_avg is
   removed from idle_balance.
  
   When a RT task is scheduled on an idle CPU, the update of the rq's load is
   not done when the rq exit idle state because CFS's functions are not
   called. Then, the idle_balance, which is called just before entering the
   idle function, updates the rq's load and makes the assumption that the
   elapsed time since the last update, was only running time.
  
   As a consequence, the rq's load of a CPU that only runs a periodic RT task,
   is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
  
   Why do we care what rq's load says, if the only thing running is a
   periodic RT task?  I _think_ I recall that stuff being put under the
 
  The CFS scheduler will use a wrong rq load the next time it wants to
  schedule a task.
 
   throttle specifically to not waste cycles doing that on every
   microscopic idle.
 
  Yes, but this leads to the wrong computation of runnable_avg_sum. To be
  more precise, we only need to call __update_entity_runnable_avg;
  __update_tg_runnable_avg is not mandatory in this step.
 
  If it only scares fair class tasks away from the periodic rt load, that
  seems like a benefit to me, not a liability.  If we really really need
 
 I'm not sure that such behavior, based only on an erroneous value, is a
 good one.
 
  perfect load numbers, fine, we have to eat some cycles, but when I look
  at it, it looks like one of those Perfect is the enemy of good things.
 
 The target is not a perfect number but one good enough to be usable. The
 sysctl_sched_migration_cost threshold is good for idle balancing but can
 generate a wrong load value.

But again, why do we care?  To be able to mix rt and fair loads and
still make pretty mixed load utilization numbers?  Paying a general case
fast path price to make strange (to me) load utilization numbers pretty
is not very attractive.  If you muck about with rt classes, you need to
have a good reason for doing that.  If you do have a good reason, you
also allocated all resources, including CPU, so don't need the kernel to
balance the load for you.  Paying any fast path price to make the kernel
balance a mixed rt/fair load just seems fundamentally wrong to me.

-Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Mike Galbraith
On Fri, 2013-04-19 at 11:21 +0200, Mike Galbraith wrote: 
 On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote: 
  On 19 April 2013 10:14, Mike Galbraith efa...@gmx.de wrote:
   On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
   On 19 April 2013 06:30, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
The current update of the rq's load can be erroneous when RT tasks are
involved
   
The update of the load of a rq that becomes idle, is done only if the avg_idle
is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
alternate, the runnable_avg will not be updated correctly and the time will be
accounted as idle time when a CFS task wakes up.
   
A new idle_enter function is called when the next task is the idle function
so the elapsed time will be accounted as run time in the load of the rq,
whatever the average idle time is. The function update_rq_runnable_avg is
removed from idle_balance.
   
When a RT task is scheduled on an idle CPU, the update of the rq's load is
not done when the rq exit idle state because CFS's functions are not
called. Then, the idle_balance, which is called just before entering the
idle function, updates the rq's load and makes the assumption that the
elapsed time since the last update, was only running time.
   
As a consequence, the rq's load of a CPU that only runs a periodic RT task,
is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
   
Why do we care what rq's load says, if the only thing running is a
periodic RT task?  I _think_ I recall that stuff being put under the
  
   The CFS scheduler will use a wrong rq load the next time it wants to
   schedule a task.
  
throttle specifically to not waste cycles doing that on every
microscopic idle.
  
   Yes, but this leads to the wrong computation of runnable_avg_sum. To be
   more precise, we only need to call __update_entity_runnable_avg;
   __update_tg_runnable_avg is not mandatory in this step.
  
   If it only scares fair class tasks away from the periodic rt load, that
   seems like a benefit to me, not a liability.  If we really really need
  
  I'm not sure that such behavior, based only on an erroneous value, is a
  good one.
  
   perfect load numbers, fine, we have to eat some cycles, but when I look
   at it, it looks like one of those Perfect is the enemy of good things.
  
  The target is not a perfect number but one good enough to be usable. The
  sysctl_sched_migration_cost threshold is good for idle balancing but can
  generate a wrong load value.
 
 But again, why do we care?  To be able to mix rt and fair loads and
 still make pretty mixed load utilization numbers?  Paying a general case
 fast path price to make strange (to me) load utilization numbers pretty
 is not very attractive.

So I'm not convinced this is a good thing to do, but it's not my call,
that's Peter and Ingo's job, so having expressed my opinion, I'll shut up
and let them do their thing ;-)

-Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Vincent Guittot
On 19 April 2013 11:21, Mike Galbraith efa...@gmx.de wrote:
 On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
 On 19 April 2013 10:14, Mike Galbraith efa...@gmx.de wrote:
  On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
  On 19 April 2013 06:30, Mike Galbraith efa...@gmx.de wrote:
   On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
   The current update of the rq's load can be erroneous when RT tasks are
   involved
  
   The update of the load of a rq that becomes idle, is done only if the avg_idle
   is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
   alternate, the runnable_avg will not be updated correctly and the time will be
   accounted as idle time when a CFS task wakes up.
  
   A new idle_enter function is called when the next task is the idle function
   so the elapsed time will be accounted as run time in the load of the rq,
   whatever the average idle time is. The function update_rq_runnable_avg is
   removed from idle_balance.
  
   When a RT task is scheduled on an idle CPU, the update of the rq's load is
   not done when the rq exit idle state because CFS's functions are not
   called. Then, the idle_balance, which is called just before entering the
   idle function, updates the rq's load and makes the assumption that the
   elapsed time since the last update, was only running time.
  
   As a consequence, the rq's load of a CPU that only runs a periodic RT task,
   is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
  
   Why do we care what rq's load says, if the only thing running is a
   periodic RT task?  I _think_ I recall that stuff being put under the
 
  The CFS scheduler will use a wrong rq load the next time it wants to
  schedule a task.
 
   throttle specifically to not waste cycles doing that on every
   microscopic idle.
 
  Yes, but this leads to the wrong computation of runnable_avg_sum. To be
  more precise, we only need to call __update_entity_runnable_avg;
  __update_tg_runnable_avg is not mandatory in this step.
 
  If it only scares fair class tasks away from the periodic rt load, that
  seems like a benefit to me, not a liability.  If we really really need

 I'm not sure that such behavior, based only on an erroneous value, is a
 good one.

  perfect load numbers, fine, we have to eat some cycles, but when I look
  at it, it looks like one of those Perfect is the enemy of good things.

 The target is not a perfect number but one good enough to be usable. The
 sysctl_sched_migration_cost threshold is good for idle balancing but can
 generate a wrong load value.

 But again, why do we care?  To be able to mix rt and fair loads and
 still make pretty mixed load utilization numbers?  Paying a general case

If runnable_avg_sum can be wrong, it becomes unusable, and everything
built around it becomes useless.

 fast path price to make strange (to me) load utilization numbers pretty
 is not very attractive.  If you muck about with rt classes, you need to
 have a good reason for doing that.  If you do have a good reason, you
 also allocated all resources, including CPU, so don't need the kernel to

Some tasks have responsiveness constraints, so they use the rt class, but
they also live alongside cfs tasks.

Vincent

 balance the load for you.  Paying any fast path price to make the kernel
 balance a mixed rt/fair load just seems fundamentally wrong to me.

 -Mike



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Peter Zijlstra
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
 The current update of the rq's load can be erroneous when RT tasks are
 involved
 
 The update of the load of a rq that becomes idle, is done only if the avg_idle
 is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
 alternate, the runnable_avg will not be updated correctly and the time will be
 accounted as idle time when a CFS task wakes up.
 
 A new idle_enter function is called when the next task is the idle function
 so the elapsed time will be accounted as run time in the load of the rq,
 whatever the average idle time is. The function update_rq_runnable_avg is
 removed from idle_balance.
 
 When a RT task is scheduled on an idle CPU, the update of the rq's load is
 not done when the rq exit idle state because CFS's functions are not
 called. Then, the idle_balance, which is called just before entering the
 idle function, updates the rq's load and makes the assumption that the
 elapsed time since the last update, was only running time.
 
 As a consequence, the rq's load of a CPU that only runs a periodic RT task,
 is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
 
 A new idle_exit function is called when the prev task is the idle function
 so the elapsed time will be accounted as idle time in the rq's load.

Acked-by: Peter Zijlstra a.p.zijls...@chello.nl

Thanks Vince!



Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-19 Thread Steven Rostedt
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
 The current update of the rq's load can be erroneous when RT tasks are
 involved
 
 The update of the load of a rq that becomes idle, is done only if the avg_idle
 is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
 alternate, the runnable_avg will not be updated correctly and the time will be
 accounted as idle time when a CFS task wakes up.
 
 A new idle_enter function is called when the next task is the idle function
 so the elapsed time will be accounted as run time in the load of the rq,
 whatever the average idle time is. The function update_rq_runnable_avg is
 removed from idle_balance.
 
 When a RT task is scheduled on an idle CPU, the update of the rq's load is
 not done when the rq exit idle state because CFS's functions are not
 called. Then, the idle_balance, which is called just before entering the
 idle function, updates the rq's load and makes the assumption that the
 elapsed time since the last update, was only running time.
 
 As a consequence, the rq's load of a CPU that only runs a periodic RT task,
 is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
 
 A new idle_exit function is called when the prev task is the idle function
 so the elapsed time will be accounted as idle time in the rq's load.
 
 Changes since V5:
 - Rename idle_enter/exit function to idle_enter/exit_fair
 
 Changes since V4:
 - Rebase on v3.9-rc6 instead of Steven Rostedt's patches

Acked-by: Steven Rostedt rost...@goodmis.org

-- Steve

 - Create the post_schedule_idle function that was previously created by 
 Steven's patches
 
 Changes since V3:
 - Remove dependency with CONFIG_FAIR_GROUP_SCHED
 - Add a new idle_enter function and create a post_schedule callback for
  idle class
 - Remove the update_runnable_avg from idle_balance
 
 Changes since V2:
 - remove useless definition for UP platform
 - rebased on top of Steven Rostedt's patches :
 https://lkml.org/lkml/2013/2/12/558
 
 Changes since V1:
 - move code out of schedule function and create a pre_schedule callback for
   idle class instead.
 
 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---




Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks

2013-04-18 Thread Mike Galbraith
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote: 
> The current update of the rq's load can be erroneous when RT tasks are
> involved
> 
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
> 
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
> 
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
> 
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.

Why do we care what rq's load says, if the only thing running is a
periodic RT task?  I _think_ I recall that stuff being put under the
throttle specifically to not waste cycles doing that on every
microscopic idle.

Seems to me when scheduling an rt task, you want to do as little other
than switching to/from the rt task as possible.  I don't let rt tasks do
idle balancing either, their job isn't to balance fair class on the way
out the door, it's to get off/onto the cpu ASAP, and do rt work.

-Mike


