Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
>
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
>
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
>
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
>
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
>
> A new idle_exit function is called when the prev task is the idle function
> so the elapsed time will be accounted as idle time in the rq's load.
>
> Changes since V5:
> - Rename idle_enter/exit function to idle_enter/exit_fair
>
> Changes since V4:
> - Rebase on v3.9-rc6 instead of Steven Rostedt's patches

Acked-by: Steven Rostedt

-- Steve

> - Create the post_schedule_idle function that was previously created by
>   Steven's patches
>
> Changes since V3:
> - Remove dependency on CONFIG_FAIR_GROUP_SCHED
> - Add a new idle_enter function and create a post_schedule callback for
>   idle class
> - Remove the update_runnable_avg from idle_balance
>
> Changes since V2:
> - remove useless definition for UP platform
> - rebased on top of Steven Rostedt's patches :
>   https://lkml.org/lkml/2013/2/12/558
>
> Changes since V1:
> - move code out of schedule function and create a pre_schedule callback for
>   idle class instead.
>
> Signed-off-by: Vincent Guittot
> ---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
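To make the failure mode in the patch description concrete, here is a minimal Python sketch of the accounting. It is a simplified model, not kernel code: the class `RqAvg`, the function `simulate` and the one-step-per-period loop are hypothetical names and choices made for illustration. The only values borrowed from the kernel's load tracking are the 1024 us period and the 32-period half-life of the decay. The `with_idle_exit_hook` flag stands in for the idle_exit call that the patch adds.

```python
# Simplified model of the rq load accounting described in the patch (not kernel code).
LOAD_AVG_PERIOD = 32                          # half-life, in 1024 us periods
Y = 0.5 ** (1.0 / LOAD_AVG_PERIOD)            # per-period decay factor, Y**32 == 0.5
PERIOD_CONTRIB = 1024                         # contribution of one fully running period
GEOMETRIC_MAX = PERIOD_CONTRIB / (1.0 - Y)    # float counterpart of LOAD_AVG_MAX


class RqAvg:
    """Hypothetical stand-in for the rq's runnable average."""

    def __init__(self):
        self.runnable_sum = 0.0
        self.last_update = 0                  # timestamp, in periods

    def update(self, now, running):
        """Account every period since the last update as running or as idle."""
        contrib = PERIOD_CONTRIB if running else 0
        for _ in range(now - self.last_update):
            self.runnable_sum = self.runnable_sum * Y + contrib
        self.last_update = now


def simulate(run_periods, cycle_periods, cycles, with_idle_exit_hook):
    """Periodic RT task: idle first, then run_periods of work, once per cycle.

    Returns the final runnable sum as a fraction of its maximum.
    """
    avg = RqAvg()
    now = 0
    for _ in range(cycles):
        now += cycle_periods - run_periods    # CPU sits idle until the RT task wakes
        if with_idle_exit_hook:
            avg.update(now, running=False)    # idle exit: elapsed time was idle
        now += run_periods                    # RT task runs, then the CPU goes idle
        avg.update(now, running=True)         # idle entry: elapsed time was run time
    return avg.runnable_sum / GEOMETRIC_MAX


if __name__ == "__main__":
    for hook in (False, True):
        ratio = simulate(run_periods=1, cycle_periods=10, cycles=200,
                         with_idle_exit_hook=hook)
        print(f"idle-exit update={hook}: load is {ratio:.3f} of maximum")
```

Without the idle-exit update, the whole cycle is charged as running and the ratio ends up at essentially 1.0 even though the task only runs one period in ten. With the update, the ratio settles near 0.11 under this model.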
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
>
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
>
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
>
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
>
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
>
> A new idle_exit function is called when the prev task is the idle function
> so the elapsed time will be accounted as idle time in the rq's load.

Acked-by: Peter Zijlstra

Thanks Vince!
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On 19 April 2013 11:21, Mike Galbraith wrote:
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
>> On 19 April 2013 10:14, Mike Galbraith wrote:
>> > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
>> >> On 19 April 2013 06:30, Mike Galbraith wrote:
>> >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> >> >> The current update of the rq's load can be erroneous when RT tasks
>> >> >> are involved
>> >> >>
>> >> >> The update of the load of a rq that becomes idle, is done only if
>> >> >> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
>> >> >> and short idle duration alternate, the runnable_avg will not be
>> >> >> updated correctly and the time will be accounted as idle time when
>> >> >> a CFS task wakes up.
>> >> >>
>> >> >> A new idle_enter function is called when the next task is the idle
>> >> >> function so the elapsed time will be accounted as run time in the
>> >> >> load of the rq, whatever the average idle time is. The function
>> >> >> update_rq_runnable_avg is removed from idle_balance.
>> >> >>
>> >> >> When a RT task is scheduled on an idle CPU, the update of the rq's
>> >> >> load is not done when the rq exit idle state because CFS's
>> >> >> functions are not called. Then, the idle_balance, which is called
>> >> >> just before entering the idle function, updates the rq's load and
>> >> >> makes the assumption that the elapsed time since the last update,
>> >> >> was only running time.
>> >> >>
>> >> >> As a consequence, the rq's load of a CPU that only runs a periodic
>> >> >> RT task, is close to LOAD_AVG_MAX whatever the running duration of
>> >> >> the RT task is.
>> >> >
>> >> > Why do we care what rq's load says, if the only thing running is a
>> >> > periodic RT task? I _think_ I recall that stuff being put under the
>> >>
>> >> cfs scheduler will use a wrong rq load the next time it wants to
>> >> schedule a task
>> >>
>> >> > throttle specifically to not waste cycles doing that on every
>> >> > microscopic idle.
>> >>
>> >> yes but this leads to the wrong computation of runnable_avg_sum. To be
>> >> more precise, we only need to call __update_entity_runnable_avg;
>> >> __update_tg_runnable_avg is not mandatory in this step.
>> >
>> > If it only scares fair class tasks away from the periodic rt load, that
>> > seems like a benefit to me, not a liability. If we really really need
>>
>> I'm not sure that such behavior, based only on an erroneous value,
>> is a good one.
>>
>> > perfect load numbers, fine, we have to eat some cycles, but when I look
>> > at it, it looks like one of those "Perfect is the enemy of good" things.
>>
>> The target is not a perfect number but one good enough to be usable. The
>> sysctl_sched_migration_cost threshold is good for idle balancing but can
>> generate a wrong load value
>
> But again, why do we care? To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers? Paying a general case

If runnable_avg_sum can be wrong, it becomes unusable, and everything
built on top of it becomes useless.

> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive. If you muck about with rt classes, you need to
> have a good reason for doing that. If you do have a good reason, you
> also allocated all resources, including CPU, so don't need the kernel to

Some tasks have responsiveness constraints, so they use the rt class, but
they also live alongside cfs tasks.

Vincent

> balance the load for you. Paying any fast path price to make the kernel
> balance a mixed rt/fair load just seems fundamentally wrong to me.
>
> -Mike
>
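The claim in the quoted patch description, that the rq's load sits close to LOAD_AVG_MAX whatever the running duration of the RT task, can be checked by hand under a simplified model. The assumptions here are mine, not the thread's: time is cut into equal periods, a fully running period contributes c, and older periods are decayed by a factor y per period with y^32 = 1/2. The symbols r (running periods per cycle) and p (periods per cycle) are introduced only for this illustration.

```latex
% Simplified geometric-series model; c, y, r and p are illustrative symbols.
\begin{align*}
S_{\max} &= c \sum_{k=0}^{\infty} y^{k} = \frac{c}{1-y}
  && \text{(always running; plays the role of LOAD\_AVG\_MAX)} \\
S_{r,p} &= c \sum_{j=0}^{\infty} y^{jp} \sum_{k=0}^{r-1} y^{k}
         = \frac{c}{1-y} \cdot \frac{1-y^{r}}{1-y^{p}}
  && \text{(runs $r$ of every $p$ periods, read at the end of a run)} \\
\frac{S_{r,p}}{S_{\max}} &= \frac{1-y^{r}}{1-y^{p}}
\end{align*}
```

For r = 1 and p = 10 the ratio is about 0.11. If the whole cycle is instead charged as running time, the sum converges to S_max, so the reported load is roughly nine times too high in this example.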
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Fri, 2013-04-19 at 11:21 +0200, Mike Galbraith wrote:
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
> > On 19 April 2013 10:14, Mike Galbraith wrote:
> > > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> > >> On 19 April 2013 06:30, Mike Galbraith wrote:
> > >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> > >> >> The current update of the rq's load can be erroneous when RT tasks
> > >> >> are involved
> > >> >>
> > >> >> The update of the load of a rq that becomes idle, is done only if
> > >> >> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
> > >> >> and short idle duration alternate, the runnable_avg will not be
> > >> >> updated correctly and the time will be accounted as idle time when
> > >> >> a CFS task wakes up.
> > >> >>
> > >> >> A new idle_enter function is called when the next task is the idle
> > >> >> function so the elapsed time will be accounted as run time in the
> > >> >> load of the rq, whatever the average idle time is. The function
> > >> >> update_rq_runnable_avg is removed from idle_balance.
> > >> >>
> > >> >> When a RT task is scheduled on an idle CPU, the update of the rq's
> > >> >> load is not done when the rq exit idle state because CFS's
> > >> >> functions are not called. Then, the idle_balance, which is called
> > >> >> just before entering the idle function, updates the rq's load and
> > >> >> makes the assumption that the elapsed time since the last update,
> > >> >> was only running time.
> > >> >>
> > >> >> As a consequence, the rq's load of a CPU that only runs a periodic
> > >> >> RT task, is close to LOAD_AVG_MAX whatever the running duration of
> > >> >> the RT task is.
> > >> >
> > >> > Why do we care what rq's load says, if the only thing running is a
> > >> > periodic RT task? I _think_ I recall that stuff being put under the
> > >>
> > >> cfs scheduler will use a wrong rq load the next time it wants to
> > >> schedule a task
> > >>
> > >> > throttle specifically to not waste cycles doing that on every
> > >> > microscopic idle.
> > >>
> > >> yes but this leads to the wrong computation of runnable_avg_sum. To be
> > >> more precise, we only need to call __update_entity_runnable_avg;
> > >> __update_tg_runnable_avg is not mandatory in this step.
> > >
> > > If it only scares fair class tasks away from the periodic rt load, that
> > > seems like a benefit to me, not a liability. If we really really need
> >
> > I'm not sure that such behavior, based only on an erroneous value,
> > is a good one.
> >
> > > perfect load numbers, fine, we have to eat some cycles, but when I look
> > > at it, it looks like one of those "Perfect is the enemy of good" things.
> >
> > The target is not a perfect number but one good enough to be usable. The
> > sysctl_sched_migration_cost threshold is good for idle balancing but can
> > generate a wrong load value
>
> But again, why do we care? To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers? Paying a general case
> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.

So I'm not convinced this is a good thing to do, but it's not my call,
that's Peter and Ingo's job, so having expressed my opinion, I'll shut up
and let them do their thing ;-)

-Mike
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
> On 19 April 2013 10:14, Mike Galbraith wrote:
> > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> >> On 19 April 2013 06:30, Mike Galbraith wrote:
> >> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> >> >> The current update of the rq's load can be erroneous when RT tasks
> >> >> are involved
> >> >>
> >> >> The update of the load of a rq that becomes idle, is done only if
> >> >> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
> >> >> and short idle duration alternate, the runnable_avg will not be
> >> >> updated correctly and the time will be accounted as idle time when
> >> >> a CFS task wakes up.
> >> >>
> >> >> A new idle_enter function is called when the next task is the idle
> >> >> function so the elapsed time will be accounted as run time in the
> >> >> load of the rq, whatever the average idle time is. The function
> >> >> update_rq_runnable_avg is removed from idle_balance.
> >> >>
> >> >> When a RT task is scheduled on an idle CPU, the update of the rq's
> >> >> load is not done when the rq exit idle state because CFS's
> >> >> functions are not called. Then, the idle_balance, which is called
> >> >> just before entering the idle function, updates the rq's load and
> >> >> makes the assumption that the elapsed time since the last update,
> >> >> was only running time.
> >> >>
> >> >> As a consequence, the rq's load of a CPU that only runs a periodic
> >> >> RT task, is close to LOAD_AVG_MAX whatever the running duration of
> >> >> the RT task is.
> >> >
> >> > Why do we care what rq's load says, if the only thing running is a
> >> > periodic RT task? I _think_ I recall that stuff being put under the
> >>
> >> cfs scheduler will use a wrong rq load the next time it wants to
> >> schedule a task
> >>
> >> > throttle specifically to not waste cycles doing that on every
> >> > microscopic idle.
> >>
> >> yes but this leads to the wrong computation of runnable_avg_sum. To be
> >> more precise, we only need to call __update_entity_runnable_avg;
> >> __update_tg_runnable_avg is not mandatory in this step.
> >
> > If it only scares fair class tasks away from the periodic rt load, that
> > seems like a benefit to me, not a liability. If we really really need
>
> I'm not sure that such behavior, based only on an erroneous value,
> is a good one.
>
> > perfect load numbers, fine, we have to eat some cycles, but when I look
> > at it, it looks like one of those "Perfect is the enemy of good" things.
>
> The target is not a perfect number but one good enough to be usable. The
> sysctl_sched_migration_cost threshold is good for idle balancing but can
> generate a wrong load value

But again, why do we care? To be able to mix rt and fair loads and
still make pretty mixed load utilization numbers? Paying a general case
fast path price to make strange (to me) load utilization numbers pretty
is not very attractive. If you muck about with rt classes, you need to
have a good reason for doing that. If you do have a good reason, you
also allocated all resources, including CPU, so don't need the kernel to
balance the load for you. Paying any fast path price to make the kernel
balance a mixed rt/fair load just seems fundamentally wrong to me.

-Mike
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On 19 April 2013 10:14, Mike Galbraith wrote:
> On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
>> On 19 April 2013 06:30, Mike Galbraith wrote:
>> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> >> The current update of the rq's load can be erroneous when RT tasks
>> >> are involved
>> >>
>> >> The update of the load of a rq that becomes idle, is done only if
>> >> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
>> >> and short idle duration alternate, the runnable_avg will not be
>> >> updated correctly and the time will be accounted as idle time when
>> >> a CFS task wakes up.
>> >>
>> >> A new idle_enter function is called when the next task is the idle
>> >> function so the elapsed time will be accounted as run time in the
>> >> load of the rq, whatever the average idle time is. The function
>> >> update_rq_runnable_avg is removed from idle_balance.
>> >>
>> >> When a RT task is scheduled on an idle CPU, the update of the rq's
>> >> load is not done when the rq exit idle state because CFS's
>> >> functions are not called. Then, the idle_balance, which is called
>> >> just before entering the idle function, updates the rq's load and
>> >> makes the assumption that the elapsed time since the last update,
>> >> was only running time.
>> >>
>> >> As a consequence, the rq's load of a CPU that only runs a periodic
>> >> RT task, is close to LOAD_AVG_MAX whatever the running duration of
>> >> the RT task is.
>> >
>> > Why do we care what rq's load says, if the only thing running is a
>> > periodic RT task? I _think_ I recall that stuff being put under the
>>
>> cfs scheduler will use a wrong rq load the next time it wants to
>> schedule a task
>>
>> > throttle specifically to not waste cycles doing that on every
>> > microscopic idle.
>>
>> yes but this leads to the wrong computation of runnable_avg_sum. To be
>> more precise, we only need to call __update_entity_runnable_avg;
>> __update_tg_runnable_avg is not mandatory in this step.
>
> If it only scares fair class tasks away from the periodic rt load, that
> seems like a benefit to me, not a liability. If we really really need

I'm not sure that such behavior, based only on an erroneous value,
is a good one.

> perfect load numbers, fine, we have to eat some cycles, but when I look
> at it, it looks like one of those "Perfect is the enemy of good" things.

The target is not a perfect number but one good enough to be usable. The
sysctl_sched_migration_cost threshold is good for idle balancing but can
generate a wrong load value

Vincent

>
> -Mike
>
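The threshold argument above turns on how quickly the avg_idle estimate reacts when idle periods get short. The sketch below is a hedged illustration only: it assumes avg_idle behaves like a running average that moves one eighth of the way toward each new idle-duration sample, and it assumes a 0.5 ms value for sysctl_sched_migration_cost. Neither figure is stated in this thread, and the function names `update_avg`, `track_avg_idle` and `first_below` are hypothetical.

```python
# Illustrative model of an avg_idle estimate; all constants are assumptions.
SCHED_MIGRATION_COST_NS = 500_000   # assumed value of sysctl_sched_migration_cost


def update_avg(avg, sample):
    """Move the estimate one eighth of the way toward the new sample (assumed weight)."""
    return avg + (sample - avg) / 8


def track_avg_idle(idle_durations_ns, start_ns=2 * SCHED_MIGRATION_COST_NS):
    """Return the estimate after each observed idle period, in order."""
    avg_idle = start_ns
    history = []
    for delta in idle_durations_ns:
        avg_idle = update_avg(avg_idle, delta)
        history.append(avg_idle)
    return history


def first_below(history, threshold):
    """1-based count of samples needed before the estimate drops under threshold."""
    for count, value in enumerate(history, start=1):
        if value < threshold:
            return count
    return None


if __name__ == "__main__":
    # A periodic RT task leaving only 100 us idle gaps between runs.
    history = track_avg_idle([100_000] * 30)
    print("samples until below threshold:",
          first_below(history, SCHED_MIGRATION_COST_NS))
    print("final estimate (ns):", round(history[-1]))
```

Under these assumptions the estimate crosses the threshold after a handful of short idle gaps and then stays on that side, so any rq load update gated on the comparison keeps taking the same branch for as long as the RT task keeps its period. That is the situation in which the patch description says the load value drifts.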
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> On 19 April 2013 06:30, Mike Galbraith wrote:
> > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> >> The current update of the rq's load can be erroneous when RT tasks
> >> are involved
> >>
> >> The update of the load of a rq that becomes idle, is done only if
> >> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
> >> and short idle duration alternate, the runnable_avg will not be
> >> updated correctly and the time will be accounted as idle time when
> >> a CFS task wakes up.
> >>
> >> A new idle_enter function is called when the next task is the idle
> >> function so the elapsed time will be accounted as run time in the
> >> load of the rq, whatever the average idle time is. The function
> >> update_rq_runnable_avg is removed from idle_balance.
> >>
> >> When a RT task is scheduled on an idle CPU, the update of the rq's
> >> load is not done when the rq exit idle state because CFS's
> >> functions are not called. Then, the idle_balance, which is called
> >> just before entering the idle function, updates the rq's load and
> >> makes the assumption that the elapsed time since the last update,
> >> was only running time.
> >>
> >> As a consequence, the rq's load of a CPU that only runs a periodic
> >> RT task, is close to LOAD_AVG_MAX whatever the running duration of
> >> the RT task is.
> >
> > Why do we care what rq's load says, if the only thing running is a
> > periodic RT task? I _think_ I recall that stuff being put under the
>
> cfs scheduler will use a wrong rq load the next time it wants to
> schedule a task
>
> > throttle specifically to not waste cycles doing that on every
> > microscopic idle.
>
> yes but this leads to the wrong computation of runnable_avg_sum. To be
> more precise, we only need to call __update_entity_runnable_avg;
> __update_tg_runnable_avg is not mandatory in this step.

If it only scares fair class tasks away from the periodic rt load, that
seems like a benefit to me, not a liability. If we really really need
perfect load numbers, fine, we have to eat some cycles, but when I look
at it, it looks like one of those "Perfect is the enemy of good" things.

-Mike
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On 19 April 2013 06:30, Mike Galbraith wrote:
> On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
>> The current update of the rq's load can be erroneous when RT tasks
>> are involved
>>
>> The update of the load of a rq that becomes idle, is done only if
>> the avg_idle is less than sysctl_sched_migration_cost. If RT tasks
>> and short idle duration alternate, the runnable_avg will not be
>> updated correctly and the time will be accounted as idle time when
>> a CFS task wakes up.
>>
>> A new idle_enter function is called when the next task is the idle
>> function so the elapsed time will be accounted as run time in the
>> load of the rq, whatever the average idle time is. The function
>> update_rq_runnable_avg is removed from idle_balance.
>>
>> When a RT task is scheduled on an idle CPU, the update of the rq's
>> load is not done when the rq exit idle state because CFS's
>> functions are not called. Then, the idle_balance, which is called
>> just before entering the idle function, updates the rq's load and
>> makes the assumption that the elapsed time since the last update,
>> was only running time.
>>
>> As a consequence, the rq's load of a CPU that only runs a periodic
>> RT task, is close to LOAD_AVG_MAX whatever the running duration of
>> the RT task is.
>
> Why do we care what rq's load says, if the only thing running is a
> periodic RT task? I _think_ I recall that stuff being put under the

cfs scheduler will use a wrong rq load the next time it wants to schedule
a task

> throttle specifically to not waste cycles doing that on every
> microscopic idle.

yes but this leads to the wrong computation of runnable_avg_sum. To be
more precise, we only need to call __update_entity_runnable_avg;
__update_tg_runnable_avg is not mandatory in this step.

>
> Seems to me when scheduling an rt task, you want to do as little other
> than switching to/from the rt task as possible. I don't let rt tasks do
> idle balancing either, their job isn't to balance fair class on the way
> out the door, it's to get off/onto the cpu ASAP, and do rt work.

I agree, but the patch is not about balancing fair tasks; it is about
keeping a coherent runnable value.

Vincent

>
> -Mike
>
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Fri, 2013-04-19 at 11:21 +0200, Mike Galbraith wrote:
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
> > On 19 April 2013 10:14, Mike Galbraith efa...@gmx.de wrote:
> > > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> > > > On 19 April 2013 06:30, Mike Galbraith efa...@gmx.de wrote:
> > > > > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> > > > > > The current update of the rq's load can be erroneous when RT tasks are
> > > > > > involved
> > > > > >
> > > > > > The update of the load of a rq that becomes idle, is done only if the avg_idle
> > > > > > is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> > > > > > alternate, the runnable_avg will not be updated correctly and the time will be
> > > > > > accounted as idle time when a CFS task wakes up.
> > > > > >
> > > > > > A new idle_enter function is called when the next task is the idle function
> > > > > > so the elapsed time will be accounted as run time in the load of the rq,
> > > > > > whatever the average idle time is. The function update_rq_runnable_avg is
> > > > > > removed from idle_balance.
> > > > > >
> > > > > > When a RT task is scheduled on an idle CPU, the update of the rq's load is
> > > > > > not done when the rq exit idle state because CFS's functions are not
> > > > > > called. Then, the idle_balance, which is called just before entering the
> > > > > > idle function, updates the rq's load and makes the assumption that the
> > > > > > elapsed time since the last update, was only running time.
> > > > > >
> > > > > > As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> > > > > > is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
> > > > >
> > > > > Why do we care what rq's load says, if the only thing running is a
> > > > > periodic RT task?
> > > >
> > > > cfs scheduler will use a wrong rq load the next time it wants to
> > > > schedule a task
> > > >
> > > > > I _think_ I recall that stuff being put under the throttle
> > > > > specifically to not waste cycles doing that on every microscopic
> > > > > idle.
> > > >
> > > > yes but this leads to the wrong computation of runnable_avg_sum. To be
> > > > more precise, we only need to call __update_entity_runnable_avg,
> > > > __update_tg_runnable_avg is not mandatory in this step.
> > >
> > > If it only scares fair class tasks away from the periodic rt load, that
> > > seems like a benefit to me, not a liability.
> >
> > I'm not sure that such behavior, which is only based on an erroneous value,
> > is a good one.
> >
> > > If we really really need perfect load numbers, fine, we have to eat some
> > > cycles, but when I look at it, it looks like one of those "Perfect is
> > > the enemy of good" things.
> >
> > The target is not a perfect number but one good enough to be usable. The
> > sysctl_sched_migration_cost threshold is good for idle balancing but can
> > generate a wrong load value.
>
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case
> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.

So I'm not convinced this is a good thing to do, but it's not my call,
that's Peter's and Ingo's job, so having expressed my opinion, I'll shut
up and let them do their thing ;-)

-Mike
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On 19 April 2013 11:21, Mike Galbraith efa...@gmx.de wrote:
> On Fri, 2013-04-19 at 10:50 +0200, Vincent Guittot wrote:
> > On 19 April 2013 10:14, Mike Galbraith efa...@gmx.de wrote:
> > > On Fri, 2013-04-19 at 09:49 +0200, Vincent Guittot wrote:
> > > > On 19 April 2013 06:30, Mike Galbraith efa...@gmx.de wrote:
> > > > > On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> > > > > > The current update of the rq's load can be erroneous when RT tasks are
> > > > > > involved
> > > > > >
> > > > > > The update of the load of a rq that becomes idle, is done only if the avg_idle
> > > > > > is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> > > > > > alternate, the runnable_avg will not be updated correctly and the time will be
> > > > > > accounted as idle time when a CFS task wakes up.
> > > > > >
> > > > > > A new idle_enter function is called when the next task is the idle function
> > > > > > so the elapsed time will be accounted as run time in the load of the rq,
> > > > > > whatever the average idle time is. The function update_rq_runnable_avg is
> > > > > > removed from idle_balance.
> > > > > >
> > > > > > When a RT task is scheduled on an idle CPU, the update of the rq's load is
> > > > > > not done when the rq exit idle state because CFS's functions are not
> > > > > > called. Then, the idle_balance, which is called just before entering the
> > > > > > idle function, updates the rq's load and makes the assumption that the
> > > > > > elapsed time since the last update, was only running time.
> > > > > >
> > > > > > As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> > > > > > is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
> > > > >
> > > > > Why do we care what rq's load says, if the only thing running is a
> > > > > periodic RT task?
> > > >
> > > > cfs scheduler will use a wrong rq load the next time it wants to
> > > > schedule a task
> > > >
> > > > > I _think_ I recall that stuff being put under the throttle
> > > > > specifically to not waste cycles doing that on every microscopic
> > > > > idle.
> > > >
> > > > yes but this leads to the wrong computation of runnable_avg_sum. To be
> > > > more precise, we only need to call __update_entity_runnable_avg,
> > > > __update_tg_runnable_avg is not mandatory in this step.
> > >
> > > If it only scares fair class tasks away from the periodic rt load, that
> > > seems like a benefit to me, not a liability.
> >
> > I'm not sure that such behavior, which is only based on an erroneous value,
> > is a good one.
> >
> > > If we really really need perfect load numbers, fine, we have to eat some
> > > cycles, but when I look at it, it looks like one of those "Perfect is
> > > the enemy of good" things.
> >
> > The target is not a perfect number but one good enough to be usable. The
> > sysctl_sched_migration_cost threshold is good for idle balancing but can
> > generate a wrong load value.
>
> But again, why do we care?  To be able to mix rt and fair loads and
> still make pretty mixed load utilization numbers?  Paying a general case
> fast path price to make strange (to me) load utilization numbers pretty
> is not very attractive.

If runnable_avg_sum can be wrong, it becomes unusable and all the stuff
around it becomes useless.

> If you muck about with rt classes, you need to have a good reason for
> doing that.  If you do have a good reason, you also allocated all
> resources, including CPU, so don't need the kernel to balance the load
> for you.

Some tasks have responsiveness constraints so they use the rt class, but
they also live with cfs tasks.

Vincent

> Paying any fast path price to make the kernel balance a mixed
> rt/fair load just seems fundamentally wrong to me.
>
> -Mike
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
>
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
>
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
>
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
>
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.
>
> A new idle_exit function is called when the prev task is the idle function
> so the elapsed time will be accounted as idle time in the rq's load.

Acked-by: Peter Zijlstra a.p.zijls...@chello.nl

Thanks Vince!
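For readers following the thread, the second failure case in the patch description (rq load close to LOAD_AVG_MAX for a CPU that only runs a periodic RT task) can be reproduced with a toy model. This is only a sketch, not kernel code: it assumes a geometric series in which the sum halves every 32 periods, uses floating point instead of the kernel's fixed-point tables, and the names advance, rq_load_ratio and MAX_SUM are made up for the illustration (MAX_SUM stands in for LOAD_AVG_MAX).

```python
# Toy model of a decaying runnable sum. All names here are hypothetical;
# the real kernel code uses fixed-point arithmetic and lookup tables.
Y = 0.5 ** (1 / 32)          # per-period decay factor, so the sum halves every 32 periods
CONTRIB = 1024.0             # contribution of one fully running period
MAX_SUM = CONTRIB / (1 - Y)  # limit of the series; stands in for LOAD_AVG_MAX


def advance(avg_sum, periods, runnable):
    """Age avg_sum by `periods` periods, counted as running or as idle."""
    decay = Y ** periods
    gained = CONTRIB * (1 - decay) / (1 - Y) if runnable else 0.0
    return avg_sum * decay + gained


def rq_load_ratio(run, idle, cycles, with_idle_exit):
    """Load ratio of a CPU that alternates `idle` idle periods with an
    RT task running for `run` periods, sampled at idle entry."""
    avg_sum = 0.0
    for _ in range(cycles):
        if with_idle_exit:
            # Patched behaviour: idle time is accounted when idle is left,
            # run time when idle is entered again.
            avg_sum = advance(avg_sum, idle, runnable=False)
            avg_sum = advance(avg_sum, run, runnable=True)
        else:
            # No update at idle exit: at the next idle entry the whole
            # span (idle + RT run) is assumed to be running time.
            avg_sum = advance(avg_sum, idle + run, runnable=True)
    return avg_sum / MAX_SUM


if __name__ == "__main__":
    for run, idle in ((1, 9), (5, 5)):
        print(run, idle,
              round(rq_load_ratio(run, idle, 200, False), 3),
              round(rq_load_ratio(run, idle, 200, True), 3))
```

Without the idle-exit update the ratio converges to 1 whatever the duty cycle of the RT task is; with it, the ratio follows the fraction of time the RT task actually ran.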
Re: [PATCH Resend v6] sched: fix wrong rq's runnable_avg update with rt tasks
On Thu, 2013-04-18 at 18:34 +0200, Vincent Guittot wrote:
> The current update of the rq's load can be erroneous when RT tasks are
> involved
>
> The update of the load of a rq that becomes idle, is done only if the avg_idle
> is less than sysctl_sched_migration_cost. If RT tasks and short idle duration
> alternate, the runnable_avg will not be updated correctly and the time will be
> accounted as idle time when a CFS task wakes up.
>
> A new idle_enter function is called when the next task is the idle function
> so the elapsed time will be accounted as run time in the load of the rq,
> whatever the average idle time is. The function update_rq_runnable_avg is
> removed from idle_balance.
>
> When a RT task is scheduled on an idle CPU, the update of the rq's load is
> not done when the rq exit idle state because CFS's functions are not
> called. Then, the idle_balance, which is called just before entering the
> idle function, updates the rq's load and makes the assumption that the
> elapsed time since the last update, was only running time.
>
> As a consequence, the rq's load of a CPU that only runs a periodic RT task,
> is close to LOAD_AVG_MAX whatever the running duration of the RT task is.

Why do we care what rq's load says, if the only thing running is a
periodic RT task?  I _think_ I recall that stuff being put under the
throttle specifically to not waste cycles doing that on every
microscopic idle.

Seems to me when scheduling an rt task, you want to do as little other
than switching to/from the rt task as possible.  I don't let rt tasks do
idle balancing either, their job isn't to balance fair class on the way
out the door, it's to get off/onto the cpu ASAP, and do rt work.

-Mike
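The trade-off behind the throttle can be illustrated with the first failure case from the patch description: RT tasks alternate with short idle periods, no update happens in between, and the whole span is folded in as idle time when a CFS task finally wakes up. The sketch below is a toy model under stated assumptions, not kernel code. The real gate compares the rq's avg_idle with sysctl_sched_migration_cost; here it is reduced to a boolean flag, and advance, load_at_cfs_wakeup and MAX_SUM are hypothetical names.

```python
# Toy model; every name below is hypothetical and the arithmetic is a
# floating-point simplification of a sum that halves every 32 periods.
Y = 0.5 ** (1 / 32)          # per-period decay factor
CONTRIB = 1024.0             # contribution of one fully running period
MAX_SUM = CONTRIB / (1 - Y)  # limit of the series; stands in for LOAD_AVG_MAX


def advance(avg_sum, periods, runnable):
    """Age avg_sum by `periods` periods, counted as running or as idle."""
    decay = Y ** periods
    gained = CONTRIB * (1 - decay) / (1 - Y) if runnable else 0.0
    return avg_sum * decay + gained


def load_at_cfs_wakeup(run, idle, cycles, updates_skipped, start_ratio=0.5):
    """Load ratio seen when a CFS task wakes up after `cycles` rounds of
    an RT task running `run` periods followed by `idle` idle periods."""
    avg_sum = start_ratio * MAX_SUM
    if updates_skipped:
        # The gated update never ran, so the whole elapsed span is
        # accounted as idle time at the CFS wakeup.
        return advance(avg_sum, cycles * (run + idle), runnable=False) / MAX_SUM
    for _ in range(cycles):
        avg_sum = advance(avg_sum, run, runnable=True)    # accounted at idle entry
        avg_sum = advance(avg_sum, idle, runnable=False)  # accounted at idle exit
    return avg_sum / MAX_SUM


if __name__ == "__main__":
    print(round(load_at_cfs_wakeup(5, 5, 100, True), 3))
    print(round(load_at_cfs_wakeup(5, 5, 100, False), 3))
```

In this model the skipped-update case reports an almost idle rq although the RT task ran half of the time. Whether that error is worth extra work on every idle entry is exactly the question being argued in the replies.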