Re: [PATCH v2 1/2] sched: fix init NOHZ_IDLE flag
On 19 February 2013 11:29, Vincent Guittot vincent.guit...@linaro.org wrote, replying to Frederic Weisbecker fweis...@gmail.com (the original exchange is pasted at http://pastebin.com/DMm5U8J8):

Vincent: We can clear the idle flag only in nohz_kick_needed(), which will not be called if the sched_domain is NULL, so the sequence will be:

= CPU 0 =                               = CPU 1 =
detach_and_destroy_domain {
    rcu_assign_pointer(cpu1_dom, NULL);
}
dom = new_domain(...) {
    nr_cpus_busy = 0;
    set_idle(CPU 1);
}
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom == NULL, return
rcu_assign_pointer(cpu1_dom, dom);
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom != NULL
                                        nohz_kick_needed {
                                            set_idle(CPU 1)
                                            dom = rcu_dereference(cpu1_dom)
                                            // dec nr_cpus_busy
                                        }

Frederic: Ok, but CPU 0 can assign NULL to the domain of CPU 1 while CPU 1 is already in the middle of nohz_kick_needed().

Vincent: Yes, nothing prevents the sequence below from occurring:

= CPU 0 =                               = CPU 1 =
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom != NULL
detach_and_destroy_domain {
    rcu_assign_pointer(cpu1_dom, NULL);
}
dom = new_domain(...) {
    nr_cpus_busy = 0; // nr_cpus_busy in the new_dom
    set_idle(CPU 1);
}
                                        nohz_kick_needed {
                                            clear_idle(CPU 1)
                                            dom = rcu_dereference(cpu1_dom)
                                            // cpu1_dom == old_dom
                                            inc nr_cpus_busy
                                            // nr_cpus_busy in the old_dom
                                        }
rcu_assign_pointer(cpu1_dom, dom);
// cpu1_dom == new_dom

The sequence above is not correct, in addition to having become unreadable after going through gmail; the correct and readable version is at https://pastebin.linaro.org/1750/

Vincent: I'm not sure that this can happen in practice, because CPU 1 is in an interrupt handler, but we don't have any mechanism to prevent the sequence.
The NULL sched_domain can be used to detect this situation, and the set_cpu_sd_state_busy function can be modified like below:

static inline void set_cpu_sd_state_busy(void)
{
	struct sched_domain *sd;
	int cpu = smp_processor_id();
+	int clear = 0;

	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));

	rcu_read_lock();
	for_each_domain(cpu, sd) {
		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
	}
	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));
}

The NOHZ_IDLE flag will not be cleared if a NULL sched_domain is attached to the CPU. With this implementation, we still don't need to get the sched_domain for testing the NOHZ_IDLE flag, which occurs each time a CPU becomes idle. Patch 2 becomes useless.

Vincent
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH v4] sched: fix init NOHZ_IDLE flag
On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag.

During the initialization of the sched_domain, we set the NOHZ_IDLE flag when nr_busy_cpus is initialized to 0 in order to have a coherent configuration. If a CPU enters idle and calls set_cpu_sd_state_idle during the build of the new sched_domain, it will not corrupt the initial state. set_cpu_sd_state_busy is modified and clears NOHZ_IDLE only if a non-NULL sched_domain is attached to the CPU (which is the case during the rebuild).

Changes since V3:
- NOHZ flag is not cleared if a NULL domain is attached to the CPU
- Remove patch 2/2, which becomes useless with the latest modifications

Changes since V2:
- change the initialization to idle state instead of busy state so that a CPU that enters idle during the build of the sched_domain will not corrupt the initialization state

Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it was a side effect of the other use cases.
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    4 +++-
 kernel/sched/fair.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c730a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5884,7 +5884,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 		return;
 
 	update_group_power(sd, cpu);
-	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	atomic_set(&sg->sgp->nr_busy_cpus, 0);
+	set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..2701a92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5403,15 +5403,20 @@ static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
 	int cpu = smp_processor_id();
+	int clear = 0;
 
 	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
 		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 
 	rcu_read_lock();
-	for_each_domain(cpu, sd)
+	for_each_domain(cpu, sd) {
 		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
+	}
 	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 }
 
 void set_cpu_sd_state_idle(void)
-- 
1.7.9.5
Re: [PATCH v4] sched: fix init NOHZ_IDLE flag
On 22 February 2013 13:32, Frederic Weisbecker fweis...@gmail.com wrote:
On Thu, Feb 21, 2013 at 09:29:16AM +0100, Vincent Guittot wrote:

On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag.

During the initialization of the sched_domain, we set the NOHZ_IDLE flag when nr_busy_cpus is initialized to 0 in order to have a coherent configuration. If a CPU enters idle and calls set_cpu_sd_state_idle during the build of the new sched_domain, it will not corrupt the initial state. set_cpu_sd_state_busy is modified and clears NOHZ_IDLE only if a non-NULL sched_domain is attached to the CPU (which is the case during the rebuild).

Changes since V3:
- NOHZ flag is not cleared if a NULL domain is attached to the CPU
- Remove patch 2/2, which becomes useless with the latest modifications

Changes since V2:
- change the initialization to idle state instead of busy state so that a CPU that enters idle during the build of the sched_domain will not corrupt the initialization state

Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it was a side effect of the other use cases.
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    4 +++-
 kernel/sched/fair.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c730a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5884,7 +5884,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 		return;
 
 	update_group_power(sd, cpu);
-	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	atomic_set(&sg->sgp->nr_busy_cpus, 0);
+	set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..2701a92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5403,15 +5403,20 @@ static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
 	int cpu = smp_processor_id();
+	int clear = 0;
 
 	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
 		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 
 	rcu_read_lock();
-	for_each_domain(cpu, sd)
+	for_each_domain(cpu, sd) {
 		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
+	}
 	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));

Frederic: I fear there is still a race window:

= CPU 0 =                               = CPU 1 =
                                        // NOHZ_IDLE is set
                                        set_cpu_sd_state_busy() {
                                            dom1 = rcu_dereference(dom1);
                                            inc(dom1->nr_busy_cpus)
rcu_assign_pointer(dom1, NULL)
// create new domain
init_sched_group_power() {
    atomic_set(&tmp->nr_busy_cpus, 0);
    set_bit(NOHZ_IDLE, nohz_flags(cpu 1));
    rcu_assign_pointer(dom1, tmp)
}
                                            clear_bit(NOHZ_IDLE, nohz_flags(cpu));
                                        }

I don't know if there is any sane way to deal with this issue other than having nr_busy_cpus and nohz_flags in the same object sharing the same lifecycle.

Vincent: I wanted to avoid having to use the sd pointer for testing the NOHZ_IDLE flag, because that test occurs each time we go into idle, but it does not seem easily feasible.
Another solution could be to add a synchronization step between rcu_assign_pointer(dom 1, NULL) and the creation of the new domain, to ensure that all pending accesses to the old sd values have finished. But this would imply a potential delay in the rebuild of the sched_domain, and I'm not sure that it's acceptable.

Vincent
[RFC] sched: nohz_idle_balance
On a tickless system, one CPU runs the load balance for all idle CPUs. The cpu_load of this CPU is updated before starting the load balance of each other idle CPU. We should instead update the cpu_load of the balance_cpu.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ca4fe4..9ae3a5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,14 +4794,15 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 		if (need_resched())
 			break;
 
-		raw_spin_lock_irq(&this_rq->lock);
-		update_rq_clock(this_rq);
-		update_idle_cpu_load(this_rq);
-		raw_spin_unlock_irq(&this_rq->lock);
+		rq = cpu_rq(balance_cpu);
+
+		raw_spin_lock_irq(&rq->lock);
+		update_rq_clock(rq);
+		update_idle_cpu_load(rq);
+		raw_spin_unlock_irq(&rq->lock);
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
-		rq = cpu_rq(balance_cpu);
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}
-- 
1.7.9.5
Re: [RFC] sched: nohz_idle_balance
A wrong button made me remove the other guys from the thread. Sorry for this mistake.

On 13 September 2012 09:56, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 09:44 +0200, Vincent Guittot wrote:
On 13 September 2012 09:29, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 08:59 +0200, Vincent Guittot wrote:
On 13 September 2012 08:49, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 06:11 +0200, Vincent Guittot wrote:

Vincent: On a tickless system, one CPU runs the load balance for all idle CPUs. The cpu_load of this CPU is updated before starting the load balance of each other idle CPU. We should instead update the cpu_load of the balance_cpu.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ca4fe4..9ae3a5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,14 +4794,15 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 		if (need_resched())
 			break;
 
-		raw_spin_lock_irq(&this_rq->lock);
-		update_rq_clock(this_rq);
-		update_idle_cpu_load(this_rq);
-		raw_spin_unlock_irq(&this_rq->lock);
+		rq = cpu_rq(balance_cpu);
+
+		raw_spin_lock_irq(&rq->lock);
+		update_rq_clock(rq);
+		update_idle_cpu_load(rq);
+		raw_spin_unlock_irq(&rq->lock);
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
-		rq = cpu_rq(balance_cpu);
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}

Mike: Ew, banging locks and updating clocks to what good end?

Vincent: The goal is to update the cpu_load table of the CPU before starting the load balance. Otherwise we will use outdated values in the load balance sequence.

Mike: If there's load to distribute, seems it should all work out fine without doing that. What harm is being done that makes this worthwhile?
Vincent: this_load and avg_load can be wrong, and can make an idle CPU be seen as balanced compared to the busy one.

Mike: I think you need to present numbers showing benefit. Crawling all over a mostly idle (4096p?) box is a decidedly bad thing to do.

Vincent: Yep, let me prepare some figures. You should also notice that you are already crawling all over the idle processor in rebalance_domains.

Vincent
-Mike
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 10 July 2012 15:42, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Tue, 2012-07-10 at 14:35 +0200, Vincent Guittot wrote:

Vincent: Maybe the last one, which enables ARCH_POWER, should also go into tip?

Peter: OK, I can take it.

Hi Peter,

I can't find the patch that enables ARCH_POWER in the tip tree. Have you taken it into your tree?

Regards,
Vincent
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 13 September 2012 14:07, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Thu, 2012-09-13 at 11:17 +0200, Vincent Guittot wrote:
On 10 July 2012 15:42, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Tue, 2012-07-10 at 14:35 +0200, Vincent Guittot wrote:

Vincent: Maybe the last one, which enables ARCH_POWER, should also go into tip?

Peter: OK, I can take it.

Vincent: Hi Peter, I can't find the patch that enables ARCH_POWER in the tip tree. Have you taken it into your tree?

Peter: Uhmmm, how about I say I have now? Sorry about that.

ok, thanks
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 15 August 2012 13:05, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:

Alex: Since there is no power saving consideration in the scheduler's CFS, I have a very rough idea for enabling a new power saving schema in CFS.

Peter: Adding Thomas, he always delights in poking holes in power schemes.

Alex: It is based on the following assumptions:

1. If many tasks crowd the system, just letting a few domain CPUs run and letting the other CPUs idle cannot save power. Letting all CPUs take the load, finish the tasks early, and then get into idle will save more power and give a better user experience.

Peter: I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.

Alex: 2. Scheduler domains and scheduler groups perfectly match the hardware and the power consumption units. So, pulling tasks out of a domain means this power consumption unit can potentially go idle.

Peter: I'm not sure I understand what you're saying, sorry.

Alex: So, as Peter mentioned in commit 8e7fbcbc22c ("sched: Remove stale power aware scheduling"), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Peter: Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern; you have to have something to pick between before you can pick :-)

Alex: And in scheduling, 2 places will care about the policy: load_balance() and, in task fork/exec, select_task_rq_fair().

Peter: ack

Alex: Here is some pseudo-code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair():

Peter: Oh man.. A few words outlining the general idea would've been nice.

load_balance() {
	update_sd_lb_stats(); // get busiest group, idlest group data

	if (sd->nr_running > sd's capacity) {
		// power saving policy is not suitable for
		// this scenario, it runs like performance policy
		mv tasks from busiest cpu in busiest group to
		idlest cpu in idlest group;

Peter: Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group. But I think we should not go on nr_running here; PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already, Paul! :-) Also, I'm not sure this is entirely correct: the thing you want to do for power aware stuff is to minimize the number of active power domains, which means you don't want idlest, you want least busy non-idle.

	} else { // the sd has enough capacity to hold all tasks
		if (sg->nr_running > sg's capacity) {
			// imbalanced between groups
			if (schedule policy == performance) {
				// when 2 busiest groups are at the same busy
				// degree, need to prefer the one that has
				// the softest group??
				move tasks from busiest group to idlest group;

Peter: So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.

			} else if (schedule policy == power)
				move tasks from busiest group to idlest group
				until busiest is just full of capacity;
				// the busiest group can balance
				// internally after the next LB

Peter: There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that; you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.

		} else { // all groups have enough capacity for their tasks
			if (schedule policy == performance)
				// all tasks may have enough cpu
				// resources to run,
				// mv tasks from busiest to idlest group?
				// no, at this time, it's better to keep
				// the task on the current cpu.
				// so, it is maybe better to do balance
				// in each of the groups
				for_each_imbalance_groups()
					move tasks from busiest cpu to
					idlest cpu in each of the groups;
			else if (schedule policy == power) {
				if (no hard pin in idlest group)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 16 August 2012 07:03, Alex Shi alex@intel.com wrote:
On 08/16/2012 12:19 AM, Matthew Garrett wrote:
On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:

Alex: ...this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Matthew: Are there workloads in which power might provide more performance than performance? If so, don't use these terms.

Alex: The power scheme should have no chance of better performance, by design.

Vincent: A side effect of packing small tasks on one core is that you always use the core with the lowest C-state, which minimizes the wake-up latency, so you can sometimes get better results than performance mode, which will try to use another core in another cluster that takes more time to wake up than waiting for the end of the current task.

Vincent
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 17 August 2012 10:43, Paul Turner p...@google.com wrote:
On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:

Alex: Since there is no power saving consideration in the scheduler's CFS, I have a very rough idea for enabling a new power saving schema in CFS.

Peter: Adding Thomas, he always delights in poking holes in power schemes.

Alex: It is based on the following assumptions:

1. If many tasks crowd the system, just letting a few domain CPUs run and letting the other CPUs idle cannot save power. Letting all CPUs take the load, finish the tasks early, and then get into idle will save more power and give a better user experience.

Peter: I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.

Alex: 2. Scheduler domains and scheduler groups perfectly match the hardware and the power consumption units. So, pulling tasks out of a domain means this power consumption unit can potentially go idle.

Peter: I'm not sure I understand what you're saying, sorry.

Alex: So, as Peter mentioned in commit 8e7fbcbc22c ("sched: Remove stale power aware scheduling"), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Peter: Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern; you have to have something to pick between before you can pick :-)

Alex: And in scheduling, 2 places will care about the policy: load_balance() and, in task fork/exec, select_task_rq_fair().

Peter: ack

Alex: Here is some pseudo-code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair():

Peter: Oh man.. A few words outlining the general idea would've been nice.

load_balance() {
	update_sd_lb_stats(); // get busiest group, idlest group data

	if (sd->nr_running > sd's capacity) {
		// power saving policy is not suitable for
		// this scenario, it runs like performance policy
		mv tasks from busiest cpu in busiest group to
		idlest cpu in idlest group;

Peter: Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group. But I think we should not go on nr_running here; PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already, Paul! :-)

Paul: Yes -- I just got back from Africa this week. It's updated for almost all the previous comments, but I ran out of time before I left to re-post. I'm just about caught up enough that I should be able to get this done over the upcoming weekend. Monday at the latest.

Peter: Also, I'm not sure this is entirely correct: the thing you want to do for power aware stuff is to minimize the number of active power domains, which means you don't want idlest, you want least busy non-idle.

	} else { // the sd has enough capacity to hold all tasks
		if (sg->nr_running > sg's capacity) {
			// imbalanced between groups
			if (schedule policy == performance) {
				// when 2 busiest groups are at the same busy
				// degree, need to prefer the one that has
				// the softest group??
				move tasks from busiest group to idlest group;

Peter: So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.

			} else if (schedule policy == power)
				move tasks from busiest group to idlest group
				until busiest is just full of capacity;
				// the busiest group can balance
				// internally after the next LB

Peter: There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that; you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.

		} else { // all groups have enough capacity for their tasks
			if (schedule policy == performance)
				// all tasks may have enough cpu
				// resources to run,
				// mv tasks from busiest to idlest group?
				// no, at this time, it's better to keep
				// the task on the current cpu.
				// so, it is maybe better to do balance
				// in each of the groups
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 21 August 2012 02:58, Alex Shi alex@intel.com wrote:
On 08/20/2012 11:36 PM, Vincent Guittot wrote:

Peter (quoted): What you want is to keep track of a per-cpu utilization level (inverse of idle-time) and, using PJT's per-task runnable avg, see if placing the new task there will exceed the utilization limit. I think some of the Linaro people actually played around with this, Vincent?

Vincent: Sorry for the late reply, but I had almost no network access during the last weeks. So Linaro also works on a power aware scheduler, as Peter mentioned. Based on previous tests, we have concluded that the main drawback of the (now removed) old power scheduler was that we had no way to tell the difference between short and long running tasks, whereas it's a key input (at least for phones) for deciding to pack tasks and for selecting the core on an asymmetric system.

Alex: It is hard to estimate the future from a general point of view, but from a hack point of view, maybe you can add something to hint at this from task_struct. :)

Vincent: The per-task load tracking patchset gives you a good view of the last dozen ms.

Vincent: One additional key piece of information is the power distribution in the system, which can have a finer granularity than the current sched_domain description. Peter's proposal was to use a SHARE_POWERLINE flag, similar to the flags that already describe whether a sched_domain shares resources or cpu capacity.

Alex: Seems I missed this. What's the difference with the current SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

Vincent: SD_SHARE_CPUPOWER is set in a sched_domain at SMT level (threads sharing some parts of the physical core). SD_SHARE_PKG_RESOURCES is set at MC level (cores sharing some resources like cache and memory access). With these two new pieces of information, we can have a first power saving scheduler which spreads or packs tasks across cores and packages.

Alex: Fine, I'd like to test them on x86, plus SMT and NUMA :)

Vincent
[PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
This patchset creates an arch_scale_freq_power function for ARM, which is used to set the relative capacity of each core of a big.LITTLE system. It also removes the broken power estimation of x86.

Modifications since v3:
- Add comments
- Add optimization for SMP system
- Ensure that the capacity of a CPU will be at most 1

Modifications since v2:
- set_power_scale function becomes static
- Rework loop in update_siblings_masks
- Remove useless code in parse_dt_topology

Modifications since v1:
- Add and update explanation about the use of the table and the range of the value
- Remove the use of NR_CPUS and use nr_cpu_ids instead
- Remove broken power estimation of x86

Peter Zijlstra (1):
  sched, x86: Remove broken power estimation

Vincent Guittot (4):
  ARM: topology: Add arch_scale_freq_power function
  ARM: topology: factorize the update of sibling masks
  ARM: topology: Update cpu_power according to DT information
  sched: cpu_power: enable ARCH_POWER

 arch/arm/kernel/topology.c   | 239 ++++++++++++++++++++++++++++++------------
 arch/x86/kernel/cpu/Makefile |   2 +-
 arch/x86/kernel/cpu/sched.c  |  55 ----------
 kernel/sched/features.h      |   2 +-
 4 files changed, 219 insertions(+), 79 deletions(-)
 delete mode 100644 arch/x86/kernel/cpu/sched.c

-- 
1.7.9.5
[PATCH v4 1/5] ARM: topology: Add arch_scale_freq_power function
Add infrastructure to be able to modify the cpu_power of each core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 8200dea..51f23b3 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -22,6 +22,37 @@
 #include <asm/cputype.h>
 #include <asm/topology.h>
 
+/*
+ * cpu power scale management
+ */
+
+/*
+ * cpu power table
+ * This per cpu data structure describes the relative capacity of each core.
+ * On a heterogeneous system, cores don't have the same computation capacity
+ * and we reflect that difference in the cpu_power field so the scheduler can
+ * take this difference into account during load balance. A per cpu structure
+ * is preferred because each CPU updates its own cpu_power field during the
+ * load balance except for idle cores. One idle core is selected to run the
+ * rebalance_domains for all idle cores and the cpu_power can be updated
+ * during this sequence.
+ */
+static DEFINE_PER_CPU(unsigned long, cpu_scale);
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return per_cpu(cpu_scale, cpu);
+}
+
+static void set_power_scale(unsigned int cpu, unsigned long power)
+{
+	per_cpu(cpu_scale, cpu) = power;
+}
+
+/*
+ * cpu topology management
+ */
+
 #define MPIDR_SMP_BITMASK (0x3 << 30)
 #define MPIDR_SMP_VALUE (0x2 << 30)
 
@@ -41,6 +72,9 @@
 #define MPIDR_LEVEL2_MASK 0xFF
 #define MPIDR_LEVEL2_SHIFT 16
 
+/*
+ * cpu topology table
+ */
 struct cputopo_arm cpu_topology[NR_CPUS];
 
 const struct cpumask *cpu_coregroup_mask(int cpu)
@@ -134,7 +168,7 @@ void init_cpu_topology(void)
 {
 	unsigned int cpu;
 
-	/* init core mask */
+	/* init core mask and power*/
 	for_each_possible_cpu(cpu) {
 		struct cputopo_arm *cpu_topo = &(cpu_topology[cpu]);
 
@@ -143,6 +177,8 @@ void init_cpu_topology(void)
 		cpu_topo->socket_id = -1;
 		cpumask_clear(&cpu_topo->core_sibling);
 		cpumask_clear(&cpu_topo->thread_sibling);
+
+		set_power_scale(cpu, SCHED_POWER_SCALE);
 	}
 	smp_wmb();
 }
-- 
1.7.9.5
[PATCH v4 2/5] ARM: topology: factorize the update of sibling masks
This factorization has also been proposed in another patch that has not been merged yet:
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html
So, this patch could be dropped depending on the state of the other one.

Signed-off-by: Lorenzo Pieralisi lorenzo.pieral...@arm.com
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 48 +++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 51f23b3..eb5fc81 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -82,6 +82,32 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+void update_siblings_masks(unsigned int cpuid)
+{
+	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
+	int cpu;
+
+	/* update core and thread sibling masks */
+	for_each_possible_cpu(cpu) {
+		cpu_topo = &cpu_topology[cpu];
+
+		if (cpuid_topo->socket_id != cpu_topo->socket_id)
+			continue;
+
+		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
+		if (cpu != cpuid)
+			cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
+
+		if (cpuid_topo->core_id != cpu_topo->core_id)
+			continue;
+
+		cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);
+		if (cpu != cpuid)
+			cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
+	}
+	smp_wmb();
+}
+
 /*
  * store_cpu_topology is called at boot when only one cpu is running
  * and with the mutex cpu_hotplug.lock locked, when several cpus have booted,
@@ -91,7 +117,6 @@ void store_cpu_topology(unsigned int cpuid)
 {
 	struct cputopo_arm *cpuid_topo = &cpu_topology[cpuid];
 	unsigned int mpidr;
-	unsigned int cpu;
 
 	/* If the cpu topology has been already set, just return */
 	if (cpuid_topo->core_id != -1)
@@ -133,26 +158,7 @@ void store_cpu_topology(unsigned int cpuid)
 		cpuid_topo->socket_id = -1;
 	}
 
-	/* update core and thread sibling masks */
-	for_each_possible_cpu(cpu) {
-		struct cputopo_arm *cpu_topo = &cpu_topology[cpu];
-
-		if (cpuid_topo->socket_id == cpu_topo->socket_id) {
-			cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
-			if (cpu != cpuid)
-				cpumask_set_cpu(cpu,
-					&cpuid_topo->core_sibling);
-
-			if (cpuid_topo->core_id == cpu_topo->core_id) {
-				cpumask_set_cpu(cpuid,
-					&cpu_topo->thread_sibling);
-				if (cpu != cpuid)
-					cpumask_set_cpu(cpu,
-						&cpuid_topo->thread_sibling);
-			}
-		}
-	}
-	smp_wmb();
+	update_siblings_masks(cpuid);
 
 	printk(KERN_INFO "CPU%u: thread %d, cpu %d, socket %d, mpidr %x\n",
 		cpuid, cpu_topology[cpuid].thread_id,
-- 
1.7.9.5
[PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
Use the cpu compatibility field and clock-frequency field of DT to estimate
the capacity of each core of the system and to update the cpu_power field
accordingly. This patch enables the scheduler to put more running tasks on
big cores than on LITTLE ones. But this patch doesn't ensure that long
running tasks will run on big cores and short ones on LITTLE cores.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index eb5fc81..198b084 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -17,7 +17,9 @@
 #include <linux/percpu.h>
 #include <linux/node.h>
 #include <linux/nodemask.h>
+#include <linux/of.h>
 #include <linux/sched.h>
+#include <linux/slab.h>

 #include <asm/cputype.h>
 #include <asm/topology.h>
@@ -49,6 +51,152 @@ static void set_power_scale(unsigned int cpu, unsigned long power)
 	per_cpu(cpu_scale, cpu) = power;
 }

+#ifdef CONFIG_OF
+struct cpu_efficiency {
+	const char *compatible;
+	unsigned long efficiency;
+};
+
+/*
+ * Table of relative efficiency of each processors
+ * The efficiency value must fit in 20bit and the final
+ * cpu_scale value must be in the range
+ *   0 < cpu_scale < 3*SCHED_POWER_SCALE/2
+ * in order to return at most 1 when DIV_ROUND_CLOSEST
+ * is used to compute the capacity of a CPU.
+ * Processors that are not defined in the table,
+ * use the default SCHED_POWER_SCALE value for cpu_scale.
+ */
+struct cpu_efficiency table_efficiency[] = {
+	{"arm,cortex-a15", 3891},
+	{"arm,cortex-a7",  2048},
+	{NULL, },
+};
+
+struct cpu_capacity {
+	unsigned long hwid;
+	unsigned long capacity;
+};
+
+struct cpu_capacity *cpu_capacity;
+
+unsigned long middle_capacity = 1;
+
+/*
+ * Iterate all CPUs' descriptor in DT and compute the efficiency
+ * (as per table_efficiency). Also calculate a middle efficiency
+ * as close as possible to  (max{eff_i} - min{eff_i}) / 2
+ * This is later used to scale the cpu_power field such that an
+ * 'average' CPU is of middle power. Also see the comments near
+ * table_efficiency[] and update_cpu_power().
+ */
+static void __init parse_dt_topology(void)
+{
+	struct cpu_efficiency *cpu_eff;
+	struct device_node *cn = NULL;
+	unsigned long min_capacity = (unsigned long)(-1);
+	unsigned long max_capacity = 0;
+	unsigned long capacity = 0;
+	int alloc_size, cpu = 0;
+
+	alloc_size = nr_cpu_ids * sizeof(struct cpu_capacity);
+	cpu_capacity = (struct cpu_capacity *)kzalloc(alloc_size, GFP_NOWAIT);
+
+	while ((cn = of_find_node_by_type(cn, "cpu"))) {
+		const u32 *rate, *reg;
+		int len;
+
+		if (cpu >= num_possible_cpus())
+			break;
+
+		for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
+			if (of_device_is_compatible(cn, cpu_eff->compatible))
+				break;
+
+		if (cpu_eff->compatible == NULL)
+			continue;
+
+		rate = of_get_property(cn, "clock-frequency", &len);
+		if (!rate || len != 4) {
+			pr_err("%s missing clock-frequency property\n",
+				cn->full_name);
+			continue;
+		}
+
+		reg = of_get_property(cn, "reg", &len);
+		if (!reg || len != 4) {
+			pr_err("%s missing reg property\n", cn->full_name);
+			continue;
+		}
+
+		capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+
+		/* Save min capacity of the system */
+		if (capacity < min_capacity)
+			min_capacity = capacity;
+
+		/* Save max capacity of the system */
+		if (capacity > max_capacity)
+			max_capacity = capacity;
+
+		cpu_capacity[cpu].capacity = capacity;
+		cpu_capacity[cpu++].hwid = be32_to_cpup(reg);
+	}
+
+	if (cpu < num_possible_cpus())
+		cpu_capacity[cpu].hwid = (unsigned long)(-1);
+
+	/* If min and max capacities are equals, we bypass the update of the
+	 * cpu_scale because all CPUs have the same capacity. Otherwise, we
+	 * compute a middle_capacity factor that will ensure that the capacity
+	 * of an 'average' CPU of the system will be as close as possible to
+	 * SCHED_POWER_SCALE, which is the default value, but with the
+	 * constraint explained near table_efficiency[].
+	 */
+	if (min_capacity == max_capacity)
+		cpu_capacity[0].hwid = (unsigned long)(-1);
+	else if (4*max_capacity < (3*(max_capacity + min_capacity)))
+		middle_capacity = (min_capacity + max_capacity
[PATCH v4 4/5] sched, x86: Remove broken power estimation
From: Peter Zijlstra a.p.zijls...@chello.nl

The x86 sched power implementation has been broken forever and gets in
the way of other stuff, remove it.

For archaeological interest, fixing this code would require dealing with
the cross-cpu calling of these functions and, more importantly, we need
to filter idle time out of the a/m-perf stuff because the ratio will go
down to 0 when idle, giving a 0 capacity which is not what we'd want.

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Link: http://lkml.kernel.org/n/tip-wjjwelpti8f8k7i1pdnzm...@git.kernel.org
---
 arch/x86/kernel/cpu/Makefile |  2 +-
 arch/x86/kernel/cpu/sched.c  | 55 --------------------------------------------
 2 files changed, 1 insertion(+), 56 deletions(-)
 delete mode 100644 arch/x86/kernel/cpu/sched.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 6ab6aa2..c598126 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -14,7 +14,7 @@ CFLAGS_common.o := $(nostackp)

 obj-y			:= intel_cacheinfo.o scattered.o topology.o
 obj-y			+= proc.o capflags.o powerflags.o common.o
-obj-y			+= vmware.o hypervisor.o sched.o mshyperv.o
+obj-y			+= vmware.o hypervisor.o mshyperv.o
 obj-y			+= rdrand.o
 obj-y			+= match.o

diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
deleted file mode 100644
index a640ae5..0000000
--- a/arch/x86/kernel/cpu/sched.c
+++ /dev/null
@@ -1,55 +0,0 @@
-#include <linux/sched.h>
-#include <linux/math64.h>
-#include <linux/percpu.h>
-#include <linux/irqflags.h>
-
-#include <asm/cpufeature.h>
-#include <asm/processor.h>
-
-#ifdef CONFIG_SMP
-
-static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
-
-static unsigned long scale_aperfmperf(void)
-{
-	struct aperfmperf val, *old = &__get_cpu_var(old_perf_sched);
-	unsigned long ratio, flags;
-
-	local_irq_save(flags);
-	get_aperfmperf(&val);
-	local_irq_restore(flags);
-
-	ratio = calc_aperfmperf_ratio(old, &val);
-	*old = val;
-
-	return ratio;
-}
-
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
-{
-	/*
-	 * do aperf/mperf on the cpu level because it includes things
-	 * like turbo mode, which are relevant to full cores.
-	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return scale_aperfmperf();
-
-	/*
-	 * maybe have something cpufreq here
-	 */
-
-	return default_scale_freq_power(sd, cpu);
-}
-
-unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
-{
-	/*
-	 * aperf/mperf already includes the smt gain
-	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return SCHED_LOAD_SCALE;
-
-	return default_scale_smt_power(sd, cpu);
-}
-
-#endif
--
1.7.9.5
[PATCH v4 5/5] sched: cpu_power: enable ARCH_POWER
Heterogeneous ARM platforms use the arch_scale_freq_power function to
reflect the relative capacity of each core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index de00a48..d98ae90 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -42,7 +42,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, false)
+SCHED_FEAT(ARCH_POWER, true)

 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
--
1.7.9.5
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? 
Regards,
Vincent

Regards
Santosh
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 15:00, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 6:02 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I see. Its new processor but it is just 1 one of the IP in an entire SOC. As mentioned below I was talking about full SOC support including all the driver subsystem with DT. I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. 
Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? That is what eventually people end up doing who don't have DT ready for entire SOC. I was trying to ask whether at least some method is proposed(need not be merged in mainline) to have the big.LITTLE information parsing without DT. Ok, IIUC, you need a temporary methods, which doesn't need to be merged in mainline, to set the cpu_scale field ? Regards Santosh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 16:37, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 8:06 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 15:00, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 6:02 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I see. Its new processor but it is just 1 one of the IP in an entire SOC. As mentioned below I was talking about full SOC support including all the driver subsystem with DT. 
I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? That is what eventually people end up doing who don't have DT ready for entire SOC. I was trying to ask whether at least some method is proposed(need not be merged in mainline) to have the big.LITTLE information parsing without DT. Ok, IIUC, you need a temporary methods, which doesn't need to be merged in mainline, to set the cpu_scale field ? Yep. You could use set_power_scale in your platform code to set the cpu_power of your CPUs until DT is ready for your platform. Regards, Vincent Regards Santosh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 10 July 2012 13:27, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-07-09 at 11:27 +0200, Vincent Guittot wrote:

This patchset creates an arch_scale_freq_power function for ARM, which is used to set the relative capacity of each core of a big.LITTLE system. It also removes the broken power estimation of x86.

Modification since v3:
- Add comments
- Add optimization for SMP system
- Ensure that capacity of a CPU will be at most 1

Modification since v2:
- set_power_scale function becomes static
- Rework loop in update_siblings_masks
- Remove useless code in parse_dt_topology

Modification since v1:
- Add and update explanation about the use of the table and the range of the value
- Remove the use of NR_CPUS and use nr_cpu_ids instead
- Remove broken power estimation of x86

Peter Zijlstra (1):
  sched, x86: Remove broken power estimation

Note that this patch already made its way into tip as commit bcae21d6e793a7047d38abc9ac0946c53733c1dd so it might be best to base whatever tree this is supposed to go into on something that includes that.

yes

Vincent Guittot (4):
  ARM: topology: Add arch_scale_freq_power function
  ARM: topology: factorize the update of sibling masks
  ARM: topology: Update cpu_power according to DT information
  sched: cpu_power: enable ARCH_POWER

How would you like to proceed with these patches? I'm fine with them going through the ARM tree. Maybe the last one, which enables ARCH_POWER, should also go into tip? In which case:

Acked-by: Peter Zijlstra a.p.zijls...@chello.nl

Ingo, Russell, any other preferences?

I'm going to push the patches related to ARM into Russell's patch system

Vincent
Re: [RFC 3/6] sched: pack small tasks
On 20 November 2012 15:28, Morten Rasmussen morten.rasmus...@arm.com wrote:
Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +0000, Vincent Guittot wrote:
On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag setting for a system with only cluster level power gating. To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

Hi Morten,

Thanks for testing the patches. It seems that I have over-optimized the loop and removed some use cases.

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
During sched_domain creation, we define a pack buddy CPU if available. On a system that shares the powerline at all levels, the buddy is set to -1.

On a dual cluster / dual core system which can powergate each core and cluster independently, the buddy configuration will be:

      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   1 +
 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dab7908..70cadbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);

+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f4a4f6..8c9d3ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -157,6 +157,63 @@ void sched_init_granularity(void)
 	update_sysctl();
 }

+
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it isn't worth to pack on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERLINE flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;

sd is the highest level where SD_SHARE_POWERLINE is enabled, so the sched groups of the parent level would represent the power domains. If I get it right, we want to pack inside the cluster first and only let the first cpu

You probably wanted to use sched_group instead of cluster, because cluster is only a special use case, didn't you?

of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find their buddy within the domain, and only the first one should go to the parent sched domain to find its buddy.

We don't want to pack in the current sched_domain because it shares the power domain. We want to pack at the parent level.

Yes. I think we mean the same thing. The packing takes place at the parent sched_domain, but the sched_group that we are looking at only contains the cpus of the level below.

I propose the following fix:

-	sd = sd->parent;
+	if (cpumask_first(sched_domain_span(sd)) == cpu
+	    || !sd->parent)
+		sd = sd->parent;

We always look for the buddy in the parent level, whatever the cpu position in the mask is.

+
+	while (sd) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp = sg->next;
+
+		/* 1st CPU of the sched domain is a good candidate */
+		if (id == -1)
+			id = cpumask_first(sched_domain_span(sd));

There is no guarantee that id is in the sched group pointed to by sd->groups, which is implicitly assumed later in the search loop. We need to find the sched group that contains id and point sg to that instead. I haven't found an elegant way to find that group, but the fix below should at least give the right result
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 06:19, Viresh Kumar viresh.ku...@linaro.org wrote: Hi Tejun, On 26 November 2012 22:45, Tejun Heo t...@kernel.org wrote: On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote: I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there. My bad. I completely missed/forgot that comment from your earlier mails. Will do it. That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quicky by e.g. forcing all work items to execute on different CPUs, but you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about? Hmm.. I actually implemented the problem discussed here: (I know you have seen this earlier :) ) http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf Specifically slides: 12 19. I haven't done much power calculations with it and have tested it more from functionality point of view. @Vincent: Can you add some comments here? Sorry for this late reply. We have faced some situations on TC2 (as an example) where the tasks are running in the LITTLE cluster whereas some periodic works stay on the big cluster so we can have one cluster that wakes up for tasks and another one that wakes up for work. We would like to consolidate the behaviour of the work with the tasks behaviour. 
Sorry, I don't have relevant figures as the patches are used with other ones which also impact the power consumption.

This series introduces the possibility to run a work on another CPU, which is necessary if we want a better correlation of task and work scheduling on the system. Most of the time, queue_work is used when a driver doesn't mind which CPU the work runs on, whereas it looks like it should be used only when the work must run locally. We would like to solve this point with the new interface that is proposed by Viresh.

Vincent

--
viresh
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 14:59, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote: On 27 November 2012 18:56, Steven Rostedt rost...@goodmis.org wrote: A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own. Not sure if you missed the first check sched_select_cpu() +int sched_select_cpu(unsigned int sd_flags) +{ + /* If Current cpu isn't idle, don't migrate anything */ + if (!idle_cpu(cpu)) + return cpu; We aren't going to search if we aren't idle. OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-) By migrating it now, it will create its activity and wake up on the right CPU next time. If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case. We tried with an KCONFIG option for it, which Tejun rejected. Yeah, I saw that. I don't like adding KCONFIG options either. Best is to get something working that doesn't add any regressions. If you can get this to work without making *any* regressions in the normal case than I'm totally fine with that. But if this adds any issues with the normal case, then it's a show stopper. To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. 
Doing searches of possibly all CPUs (unlikely, but it is there) just seems wrong to me.

As Vincent pointed out, on big.LITTLE systems we just don't want to serve works on big cores. That would waste too much power, especially if we are going to wake up big cores. It would be difficult to steer the source driver (which queues the work) to little cores. We thought that if somebody wanted to queue work on the current cpu then they must use queue_work_on().

As Tejun has mentioned earlier, are there any assumptions anywhere that expect an unbound workqueue not to migrate? Where per-cpu variables might be used. Tejun had a good idea of forcing this to migrate the work *every* time, to not let a work item run on the same CPU that it was queued on. If it can survive that, then it is probably OK. Maybe add a config option that forces this? That way, anyone can test that this isn't an issue.

-- Steve
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 16:04, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote: On 27 November 2012 14:59, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote: On 27 November 2012 18:56, Steven Rostedt rost...@goodmis.org wrote: A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own. Not sure if you missed the first check sched_select_cpu() +int sched_select_cpu(unsigned int sd_flags) +{ + /* If Current cpu isn't idle, don't migrate anything */ + if (!idle_cpu(cpu)) + return cpu; We aren't going to search if we aren't idle. OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-) By migrating it now, it will create its activity and wake up on the right CPU next time. If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu That's assuming that the CPUs stay idle. Now if we move the work to another CPU and it goes idle, then it may move that again. It could end up being a ping pong approach. I don't think idle is a strong enough heuristic for the general case. If interrupts are constantly going off on a CPU that happens to be idle most of the time, it will constantly be moving work onto CPUs that are currently doing real work, and by doing so, it will be slowing those CPUs down. I agree that idle is probably not enough but it's the heuristic that is currently used for selecting a CPU for a timer and the timer also uses sched_select_cpu in this series. 
So, in order to go step by step, a common interface has been introduced for selecting a CPU, and this function uses the same algorithm that the timers already do. Once we agree on an interface, the heuristic can be updated.

-- Steve
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 17 December 2012 16:24, Alex Shi alex@intel.com wrote:

The scheme below tries to summarize the idea:

Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0  |  1-15  |  16 | 17-31  |  32 | 33-47  |  48 | 49-63  |
buddy conf0 |  0  |   0    |  1  |  16    |  2  |  32    |  3  |  48    |
buddy conf1 |  0  |   0    |  0  |  16    |  16 |  32    |  32 |  48    |
buddy conf2 |  0  |   0    |  16 |  16    |  32 |  32    |  48 |  48    |

But I don't know how this can interact with the NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3

So, it has 4 levels 0/16/32/ for socket 3 and 0 levels for socket 0; it is unbalanced between the different sockets.

That's the target, because we have decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.

Iteration on domain and group is an advantage for the power-efficiency requirement, not a shortcoming. If some CPUs are already idle before forking, letting another waking CPU check their load/util and then decide which one is the best CPU can reduce late migrations, and that saves both performance and power.

In fact, we have already done this job once at boot, and we consider that moving small tasks to the buddy CPU is always a benefit, so we don't need to waste time looping over sched_domain and sched_group to compute the current capacity of each LCPU for each wake up of each small task. We want all small tasks and background activity to wake up on the same buddy CPU and let the default behavior of the scheduler choose the best CPU for heavy tasks or loaded CPUs.

IMHO, the design should be very good for your scenario and your machine, but when the code moves to the general scheduler, we do want it to handle more general scenarios. Like when the 'small task' is not as small as the tasks in cyclictest, which can hardly run longer than the migration granularity or one tick; thus we really don't need to consider task migration cost.

Cyclictest is the ultimate small-task use case, which points out all the weaknesses of a scheduler for such kind of tasks. Music playback is a more realistic one, and it also shows improvement.

But when the tasks are not too small, migration is heavier than domain/group walking; that is the common sense in fork/exec/waking balance.

For which kind of machine are you stating that hypothesis?

I would have said the opposite: the current scheduler limits its computation of statistics during fork/exec/wakeup compared to a periodic load balance because it's too heavy. It's even more true for wake up if wake affine is possible.

On the contrary, moving a task by walking the buddies at each level is not only bad on performance but also bad on power. Consider the quite big latency of waking a deep-idle CPU: we lose too much.

My results have shown a different conclusion. That should be because your tasks are too small to need to consider migration cost. In fact, there is a much higher chance that the buddy will not be in a deep idle state, as all the small tasks and background activity are already waking up on this CPU.

powertop is helpful to tune your system for more idle time. Another reason is that the current kernel just tries to spread tasks onto more cpus for performance consideration. My power scheduling patch should be helpful on this.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you have just 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the CPU, not the load weight of tasks, right? Yes, but not only: the number of tasks that run simultaneously is another important input. Yes, I just talked about task numbers, but it naturally extends to the task utilization on a CPU: e.g. 8 tasks with 25% util can just fill 2 CPUs, but that is clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket has some LCPUs idle... 8 tasks with a running period of 25ms per 100ms that wake up simultaneously should probably run on 8 different LCPUs in order to race to idle. Nope, 8 tasks waking up simultaneously is a rare probability. And multimedia is one example of tasks waking up simultaneously. Even so, they should run in the same socket for power-saving consideration (my power scheduling patch can do this), instead of spreading to all sockets.
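As a side note, the capacity argument above is simple arithmetic and can be sketched in isolation. The snippet below is a hypothetical user-space model (not code from the patch): packing by capacity needs only the ceiling of the summed utilization, whereas race-to-idle spreads N simultaneously-waking tasks onto N LCPUs.

```c
#include <assert.h>

/* Rough model: how many LCPUs do n_tasks tasks of util_pct percent
 * average utilization need under each policy discussed above? */
static int cpus_needed_packed(int n_tasks, int util_pct)
{
	/* ceiling of the total utilization, expressed in whole CPUs */
	return (n_tasks * util_pct + 99) / 100;
}

static int cpus_needed_race_to_idle(int n_tasks)
{
	/* one LCPU per simultaneously-waking task */
	return n_tasks;
}
```

With 8 tasks at 25% utilization the packed estimate is 2 CPUs, matching the "can just fill 2 CPUs" figure in the discussion, while race-to-idle would use all 8.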
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 21 December 2012 06:47, Namhyung Kim namhy...@kernel.org wrote: Hi Vincent, On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote:

+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return ((rq->avg.runnable_avg_sum << rq->nr_running) >

If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running is zero and you will get the wrong decision. Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running)); Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks of 15% cputime each, right? Yes, that's right. I don't know what is correct, but just guessing that from a cpu's point of view it'd be busier if it had a higher runnable_avg_sum rather than a higher nr_running, IMHO. The nr_running is used to point out how many tasks are running simultaneously and the potential scheduling latency of adding

+			rq->avg.runnable_avg_period);
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 25% in average */
+	return ((p->se.avg.runnable_avg_sum << 1) <
+			p->se.avg.runnable_avg_period);

25% may not be suitable for a big machine. Threshold is always an issue; which threshold would be suitable for a big machine? I'm wondering if I should use the imbalance_pct value for computing the threshold. Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be '<< 2'? The 1st version of the patch was using 2, but I received a comment saying that it was maybe not aggressive enough, so I have updated the formula with 1 but forgot to update the comment. I will align the comment and formula in the next version.
Thanks for pointing this Vincent Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
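The overflow concern and the proposed right-shift rewrite discussed above can be modeled outside the kernel. The function below is a hypothetical user-space sketch (not the kernel code): it implements the suggested form sum > (period >> nr_running), with an explicit guard so the model stays well defined even for a huge nr_running (C leaves shifts by 32 or more bits of a 32-bit value undefined, which is the root of the "is zero / wrong decision" worry).

```c
#include <stdint.h>

/* Sketch of the rewritten busy check: a buddy is busy when its load is
 * high, or when many running tasks shrink the period-derived threshold. */
static int is_buddy_busy_model(uint32_t runnable_avg_sum,
			       uint32_t runnable_avg_period,
			       unsigned int nr_running)
{
	/* guard the shift: >> by >= 32 on a u32 is undefined in C */
	uint32_t threshold =
		nr_running >= 32 ? 0 : runnable_avg_period >> nr_running;

	return runnable_avg_sum > threshold;
}
```

This reproduces Namhyung's observation: one task at 40% load (sum 40, period 100, nr_running 1) is reported not busy, while three tasks at 10% each (sum 30, nr_running 3) are reported busy.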
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 21 December 2012 09:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 21 December 2012 06:47, Namhyung Kim namhy...@kernel.org wrote: Hi Vincent, On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote:

+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return ((rq->avg.runnable_avg_sum << rq->nr_running) >

If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running is zero and you will get the wrong decision. Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running)); Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks of 15% cputime each, right? Yes, that's right. I don't know what is correct, but just guessing that from a cpu's point of view it'd be busier if it had a higher runnable_avg_sum rather than a higher nr_running, IMHO. Sorry, the mail was sent before I finished it. The nr_running is used to point out how many tasks are running simultaneously and, as a result, the potential scheduling latency. I have used the shift instruction because it was quite simple and efficient, but it may give too much weight to nr_running. I could use a simple division instead of shifting runnable_avg_sum.

+			rq->avg.runnable_avg_period);
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 25% in average */
+	return ((p->se.avg.runnable_avg_sum << 1) <
+			p->se.avg.runnable_avg_period);

25% may not be suitable for a big machine.
Threshold is always an issue; which threshold would be suitable for a big machine? I'm wondering if I should use the imbalance_pct value for computing the threshold. Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be '<< 2'? The 1st version of the patch was using 2, but I received a comment saying that it was maybe not aggressive enough, so I have updated the formula with 1 but forgot to update the comment. I will align the comment and formula in the next version. Thanks for pointing this out. Vincent Thanks, Namhyung
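The threshold question raised above is easy to check in isolation. Below is a hypothetical user-space sketch (not the kernel function) of the is_light_task() comparison, parameterized by the shift: with (sum << shift) < period, a task is "light" when its average run time is below 1/2^shift of the period, so shift 2 gives the 25% of the comment while the v2 shift of 1 actually gives 50%.

```c
#include <stdint.h>

/* Model of the light-task test: light when sum * 2^shift < period,
 * i.e. the task runs less than 1/2^shift of the time on average. */
static int is_light_task_model(uint32_t runnable_avg_sum,
			       uint32_t runnable_avg_period,
			       unsigned int shift)
{
	return (runnable_avg_sum << shift) < runnable_avg_period;
}
```

With a period of 100, shift 1 classifies a 49% task as light but not a 50% one, and shift 2 draws the line at 25%, which is exactly the comment/formula mismatch Namhyung spotted.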
[RFC PATCH v2 0/6] sched: packing small tasks
Hi, This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing small tasks on as few CPUs/clusters/cores as possible. The main goal of packing small tasks is to reduce the power consumption by minimizing the number of power domains that are enabled. The packing is done in 2 steps: The 1st step looks for the best place to pack tasks in a system according to its topology, and it defines a pack buddy CPU for each CPU if one is available. The policy for defining a buddy CPU is that we pack at all levels where a group of CPUs can be power gated independently from the others. For describing this capability, a new flag, SD_SHARE_POWERDOMAIN, has been introduced; it is used to indicate whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domains in order to keep the current behavior of the scheduler unchanged. In the 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU, and can decide to migrate the task to the buddy. This check is done during the wake up because small tasks tend to wake up between load balances and asynchronously to each other, which prevents the default mechanism from catching and migrating them efficiently. Changes since V1: Patch 2/6 - Change the flag name, which was not clear. The new name is SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6 - Fix issues in the algorithm that looks for the best buddy CPU - Use pr_debug instead of pr_info - Fix for uniprocessor
Patch 4/6 - Remove the use of usage_avg_sum which has not been merged
Patch 5/6 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured
Patch 6/6 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for the ARM platform

New results for V2: This series has been tested with MP3 playback on an ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 cluster). The measurements have been done on an Ubuntu image during 60 seconds of playback, and the result has been normalized to 100.

          | CA15 | CA7 | total |
 default  |  81  |  97 |  178  |
 pack     |  13  | 100 |  113  |

Previous results for V1: The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For the ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels. The performance tests have been done on both platforms with sysbench. The results don't show any performance regression. These results are aligned with the policy, which keeps the normal behavior for heavy use cases.

test: sysbench --test=cpu --num-threads=N --max-requests=R run

The results below are the average duration of 3 tests on the quad CA-9. "default" is the current scheduler behavior (pack buddy CPU is -1), "pack" is the scheduler with the pack mechanism.

              | default | pack    |
 N=8;  R=200  |  3.1999 |  3.1921 |
 N=8;  R=2000 | 31.4939 | 31.4844 |
 N=12; R=200  |  3.2043 |  3.2084 |
 N=12; R=2000 | 31.4897 | 31.4831 |
 N=16; R=200  |  3.1774 |  3.1824 |
 N=16; R=2000 | 31.4899 | 31.4897 |

The power consumption tests have been done only on the TC2 platform, which has accessible power lines, and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 100 -D 20

The measurements have been done during 16 seconds and the result has been normalized to 100.

          | CA15 | CA7 | total |
 default  | 100  |  40 |  140  |
 pack     |   1  |  45 |   46  |

The A15 cluster is less power efficient than the A7 cluster, but if we assume that the tasks are well spread on both clusters, we can roughly estimate what the power consumption on a dual cluster of CA7 would have been for a default kernel:

Vincent Guittot (6): Revert sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking sched: add a new SD SHARE_POWERLINE flag for sched_domain sched: pack small tasks sched: secure access to other CPU statistics sched: pack the idle load balance ARM: sched: clear SD_SHARE_POWERLINE

 arch/arm/kernel/topology.c       |   9 +++
 arch/ia64/include/asm/topology.h |   1 +
 arch/tile/include/asm/topology.h |   1 +
 include/linux/sched.h            |   9 +--
 include/linux/topology.h         |   4 ++
 kernel/sched/core.c              |  14 ++--
 kernel/sched/fair.c              | 134 +-
 kernel/sched/sched.h             |  14 ++--
 8 files changed, 163 insertions(+), 23 deletions(-)

-- 1.7.9.5
[RFC PATCH v2 1/6] Revert sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- include/linux/sched.h |8 +--- kernel/sched/core.c |7 +-- kernel/sched/fair.c |3 +-- kernel/sched/sched.h |9 + 4 files changed, 4 insertions(+), 23 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index fd17ca3..046e39a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1195,13 +1195,7 @@ struct sched_entity { /* rq owned by this entity/group: */ struct cfs_rq *my_q; #endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8482628..c25c75d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1544,12 +1544,7 @@ static void __sched_fork(struct task_struct *p) p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61c7a10..9916d41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2820,8 +2820,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq) } #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f00eb80..92ba891 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -226,12 +226,6 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,8 +235,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 1.7.9.5
[RFC PATCH v2 5/6] sched: pack the idle load balance
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1a4c24..6475f54 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7274,7 +7274,25 @@ static struct { static inline int find_new_ilb(int call_cpu) {

+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}

 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
-- 1.7.9.5
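The span search in the patch above can be illustrated with a toy model. This is a hypothetical user-space sketch (not kernel code): CPUs are bits in a 64-bit mask, and we walk widening "domain spans" around the buddy, returning the first idle CPU in the smallest span that contains one, which mimics what cpumask_first_and() does for each sched_domain level.

```c
#include <stdint.h>

/* Toy model of find_new_ilb(): spans[] are domain spans ordered from
 * the level closest to the buddy outwards; idle_mask marks idle CPUs. */
static int first_idle_near(uint64_t idle_mask,
			   const uint64_t *spans, int nr_spans)
{
	for (int i = 0; i < nr_spans; i++) {
		uint64_t hit = spans[i] & idle_mask;

		if (hit)
			return __builtin_ctzll(hit); /* lowest set bit = first CPU */
	}
	return -1; /* no idle CPU found in any span */
}
```

For a buddy in a 4-CPU cluster (span 0x0F) inside an 8-CPU system (span 0xFF), an idle CPU inside the cluster wins even when lower-numbered CPUs outside it are also idle, which is exactly the "stay in the buddy's power domain" intent.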
[RFC PATCH v2 6/6] ARM: sched: clear SD_SHARE_POWERLINE
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be power gated independently. So we clear SD_SHARE_POWERDOMAIN at MC and CPU level. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/kernel/topology.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 79282eb..f89a4a2 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_local_flags(int level)
+{
+	/* Powergate at threading level doesn't make sense */
+	if (level & SD_SHARE_CPUPOWER)
+		return 1*SD_SHARE_POWERDOMAIN;
+
+	return 0*SD_SHARE_POWERDOMAIN;
+}
+
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	return cpu_topology[cpu].core_sibling;
-- 1.7.9.5
[RFC PATCH v2 4/6] sched: secure access to other CPU statistics
If a CPU accesses the runnable_avg_sum and runnable_avg_period fields of its buddy CPU while the latter updates them, it can get the new version of one field and the old version of the other. This can generate erroneous decisions. We don't want to use a lock mechanism for ensuring the coherency because of the overhead in this critical path. The previous attempt can't ensure coherency of both fields for 100% of platforms and use cases, as it will depend on the toolchain and the platform architecture. The runnable_avg_period of a runqueue tends to the max value in less than 345ms after plugging a CPU, which implies that we could use the max value instead of reading runnable_avg_period after 345ms. During the starting phase, we must ensure a minimum of coherency between the fields. A simple rule is runnable_avg_sum <= runnable_avg_period. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fc93d96..f1a4c24 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5153,13 +5153,16 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp static bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu);

+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);

 	/*
 	 * A busy buddy is a CPU with a high load or a small load with a lot of
 	 * running tasks.
 	 */
-	return ((rq->avg.runnable_avg_sum << rq->nr_running) >
-			rq->avg.runnable_avg_period);
+	return ((sum << rq->nr_running) > period);
 }

 static bool is_light_task(struct task_struct *p)
-- 1.7.9.5
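The clamp in the patch above is worth spelling out. This is a hypothetical user-space sketch (not the kernel's min() macro): a lockless reader may observe a torn pair, e.g. a freshly-updated sum next to a stale, smaller period; clamping restores the sum <= period invariant before the ratio is used.

```c
#include <stdint.h>

/* Enforce the invariant sum <= period on a possibly-torn read of the
 * (runnable_avg_sum, runnable_avg_period) pair. */
static uint32_t clamp_runnable_sum(uint32_t sum, uint32_t period)
{
	return sum < period ? sum : period; /* min(sum, period) */
}
```

A torn read such as (sum=120, period=100) is clamped back to 100, so the busy check can never see an apparent utilization above 100%.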
[PATCH 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain
This new flag SD_SHARE_POWERDOMAIN is used to reflect whether the groups of CPUs in a sched_domain level can or cannot reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks, so the flag is set in all sched_domains. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |1 + arch/tile/include/asm/topology.h |1 + include/linux/sched.h|1 + include/linux/topology.h |4 kernel/sched/core.c |6 ++ 5 files changed, 13 insertions(+) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain
members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_CPUPOWER|\ + SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE\ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing()\ @@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ @@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ | 1*SD_PREFER_SIBLING \ , \ diff --git a/kernel/sched/core.c
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12 December 2012 14:55, Alex Shi lkml.a...@gmail.com wrote: well... it's not always beneficial to group or to spread out; it depends mostly on cache behavior which is best. Let me try to understand what this means: so "performance" above with 8 threads means that those threads are spread out across more than one socket, no? If so, this would mean that you have a smaller number of tasks on each socket, thus the smaller wattage. The powersaving method OTOH fills up the one socket up to the brim, thus the slightly higher consumption due to all threads being occupied. Is that it? Not sure. By and large, power efficiency is the same as performance efficiency, with some twists. Or, to reword that to be more clear: if you waste performance due to something that becomes inefficient, you're wasting power as well. Now, you might have some hardware effects that can then save you power... but those effects then first need to overcome the waste from the performance inefficiency... and that almost never happens. For example, if you have two workloads that each barely fit inside the last level cache, it's much more efficient to spread these over two sockets, where each has its own full LLC to use. If you'd group these together, both would thrash the cache all the time and run inefficiently -- bad for power. Now, on the other hand, if you have two threads of a process that share a bunch of data structures, and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot, running inefficiently -- bad for power. Agree with all of the above. However.. having said all this, if you have two tasks that don't have such cache effects, the most efficient way of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that. .. there are alternatives to hyperthreading. On ARM's big.LITTLE architecture you could simply schedule them on the LITTLE cores.
The big cores just can't beat the power efficiency of the LITTLE ones, even with the 'race to halt' that you allude to below. And use cases like mp3 playback simply don't require the kind of performance that the big cores can offer. But this assumes the tasks don't compete much for resources at the HT level, and achieve good scaling. And this still has to compete with race to halt, because if you're done quicker, you can put the memory in self-refresh quicker. None of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards. Heck, even for just performance it's really really hard already, never mind adding power. My personal gut feeling is that we should just optimize this scheduler stuff for performance, and that we're going to be doing quite well on power already if we achieve that. If Linux is to continue to work efficiently on heterogeneous multi-processing platforms, it needs to provide scheduling mechanisms that can be exploited as per the demands of the HW architecture. Linus definitely disagrees with such ideas. :) So we need to summarize the logic beyond all hardware. An example is the small task packing (and spreading) for which Vincent Guittot has posted a patchset[1] earlier, and so has Alex now. Sure. I just thought my patchset should handle the 'small task packing' scenario. Would you guys like to have a try? Hi Alex, Yes, I will give your patchset a try when I have some spare time. Vincent [1] http://lwn.net/Articles/518834/
Re: [PATCH 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain
On 13 December 2012 03:24, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPU in a sched_domain level can or not reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gated the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks so the flag is set into all sched_domains Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |1 + arch/tile/include/asm/topology.h |1 + include/linux/sched.h|1 + include/linux/topology.h |4 kernel/sched/core.c |6 ++ 5 files changed, 13 insertions(+) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define 
SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_CPUPOWER|\ + SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE\ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing()\ @@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ @@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ The general style looks like preferring the SD_XXX flag directly. In fact, I have followed the same style
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:

       | Cluster 0   | Cluster 1   |
       | CPU0 | CPU1 | CPU2 | CPU3 |
 ----------------------------------
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance, so the best time to migrate them is during their wake up. The decision is in O(1) as we only check against one buddy CPU. Just have a little worry about the scalability on a big machine: on a 4-socket NUMA machine * 8 cores * HT, the buddy cpu in the whole system needs to care for 64 LCPUs, while in your case cpu0 just cares for 4 LCPUs. That is different for the task distribution decision. The buddy CPU should probably not be the same for all 64 LCPUs; it depends on where it's worth packing small tasks. Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/core.c |1 + kernel/sched/fair.c | 110 ++ kernel/sched/sched.h |5 +++ 3 files changed, 116 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4f36e9d..3436aad 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)

 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
+	update_packing_domain(cpu);
 	update_domain_cache(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9916d41..fc93d96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -163,6 +163,73 @@ void sched_init_granularity(void) update_sysctl(); }

+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't worth to pack on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag.
Then We look for the sched_group witht the lowest + * power per core based on the assumption that their power efficiency is + * better */ +void update_packing_domain(int cpu) +{ + struct sched_domain *sd; + int id = -1; + + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN SD_LOAD_BALANCE); + if (!sd) + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)-sd); + else + sd = sd-parent; + + while (sd (sd-flags SD_LOAD_BALANCE)) { + struct sched_group *sg = sd-groups; + struct sched_group *pack = sg; + struct sched_group *tmp; + + /* + * The sched_domain of a CPU points on the local sched_group + * and the 1st CPU of this local group is a good candidate + */ + id = cpumask_first(sched_group_cpus(pack)); + + /* loop the sched groups to find the best one */ + for (tmp = sg-next; tmp != sg; tmp = tmp-next) { + if (tmp-sgp-power * pack-group_weight + pack-sgp-power * tmp-group_weight) + continue; + + if ((tmp-sgp-power * pack-group_weight == + pack-sgp-power * tmp-group_weight) + (cpumask_first(sched_group_cpus(tmp)) = id)) + continue; + + /* we have found a better group */ + pack = tmp; + + /* Take the 1st CPU of the new group */ + id = cpumask_first(sched_group_cpus(pack)); + } + + /* Look for another CPU than itself */ + if (id != cpu) + break; + + sd = sd-parent; + } + + pr_debug(CPU%d packing on CPU%d\n, cpu, id); + per_cpu(sd_pack_buddy, cpu) = id; +} + +#endif /* CONFIG_SMP */ + #if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp return true; } +static bool is_buddy_busy(int cpu
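For readers skimming the thread, the group-selection loop in the quoted patch can be modelled in a few lines of userspace C. This is only a sketch: the struct below is a made-up stand-in for sched_group / sched_group_power, and pick_pack_cpu() mirrors the two cross-multiplication tests that pick the group with the lowest power per core, tie-broken by the lowest first-CPU id.

```c
#include <stddef.h>

/* Simplified stand-ins for sched_group / sched_group_power fields. */
struct group {
    unsigned long power;  /* models sg->sgp->power                       */
    unsigned int weight;  /* models sg->group_weight                     */
    int first;            /* models cpumask_first(sched_group_cpus(sg))  */
};

/*
 * Return the first CPU of the best packing group: minimise power per
 * core, i.e. power/weight, compared by cross-multiplication to stay in
 * integer arithmetic, exactly as the two tests in the patch do.
 */
int pick_pack_cpu(const struct group *g, size_t n)
{
    size_t pack = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        /* strictly worse power per core: skip this group */
        if (g[i].power * g[pack].weight > g[pack].power * g[i].weight)
            continue;
        /* equal power per core but higher first CPU id: skip too */
        if (g[i].power * g[pack].weight == g[pack].power * g[i].weight
            && g[i].first >= g[pack].first)
            continue;
        pack = i;  /* found a better group */
    }
    return g[pack].first;
}
```

On a tie in power per core, the walk settles on the group containing the lowest-numbered CPU, which is what makes all CPUs of a cluster converge on one buddy.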
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

Which kind of sched_domain configuration do you have for such a system, and how many sched_domain levels do you have?

it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
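The three buddy layouts in the table can be captured by a small helper. A sketch only — the enum and function are invented for illustration, assuming the table's numbering where socket s owns LCPUs 16*s to 16*s+15:

```c
/* CPU numbering assumed from the table: socket s owns LCPUs 16*s..16*s+15. */
enum conf { CONF0, CONF1, CONF2 };

int buddy(int cpu, enum conf c)
{
    int socket = cpu / 16;
    int first = socket * 16;      /* first LCPU of this CPU's socket      */

    if (cpu != first)
        return first;             /* everyone packs inside its own socket */
    if (socket == 0 || c == CONF2)
        return first;             /* conf2: sockets are left independent  */
    if (c == CONF1)
        return (socket - 1) * 16; /* conf1: chain socket s to socket s-1  */
    return socket;                /* conf0: dedicated LCPU s of socket 0  */
}
```

conf2 keeps every socket independent, conf1 chains socket s to socket s-1, and conf0 gives each remote socket a dedicated slot among the first LCPUs of socket 0.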
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3.

Which kind of sched_domain configuration do you have for such a system, and how many sched_domain levels do you have?

it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 14 December 2012 02:46, Alex Shi alex@intel.com wrote: On 12/13/2012 11:48 PM, Vincent Guittot wrote: On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3.

So, it has 4 levels 0/16/32 for socket 3 and 0 levels for socket 0; it is unbalanced across the sockets.

That's the target, because we have decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you just have 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from). Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.

Obviously, more and more cores is the trend for all kinds of CPUs; the buddy system seems hard-pressed to catch up with this.
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 14 December 2012 08:45, Mike Galbraith bitbuc...@online.de wrote: On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: On 12/14/2012 12:45 PM, Mike Galbraith wrote: Do you have further ideas for buddy cpu on such example? Which kind of sched_domain configuration have you for such system ? and how many sched_domain level have you ? it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa. CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo). Maybe. the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING. What I noticed during (an unrelated) bisection on a 40 core box was domains going from so.. 3.4.0-bisect (virgin) [5.056214] CPU0 attaching sched-domain: [5.065009] domain 0: span 0,32 level SIBLING [5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [5.107669]groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [5.162115]domain 2: span 0-79 level NODE [5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770) ..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do. 
3.6.0-bisect (virgin) [3.978338] CPU0 attaching sched-domain: [3.987125] domain 0: span 0,32 level SIBLING [3.997125] groups: 0 (cpu_power = 588) 32 (cpu_power = 589) [4.010477] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [4.029748]groups: 0,32 (cpu_power = 1177) 4,36 (cpu_power = 1177) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178) 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177) [4.084143]domain 2: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU [4.103796] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) [4.124373] domain 3: span 0-79 level NUMA [4.134369] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780) Thanks. that's an interesting example of a numa topology For your sched_domain difference, On 3.4, SD_PREFER_SIBLING was set for both MC and CPU level thanks to sd_balance_for_mc_power and sd_balance_for_package_power On 3.6, SD_PREFER_SIBLING is only set for CPU level and this flag difference with MC level prevents the destruction of CPU sched_domain during the degeneration We may need to set SD_PREFER_SIBLING for MC level Vincent -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
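The degeneration rule being discussed can be summarised with a toy check. This is a sketch only — the flag values are invented and the real kernel test (sd_parent_degenerate()) compares a specific set of behavioural flags — but it shows why SD_PREFER_SIBLING set on CPU and not on MC keeps the redundant CPU level alive in the 3.6 dump above even though its span equals MC's:

```c
#include <stdbool.h>

/* Illustrative flag values, not the kernel's. */
#define SD_LOAD_BALANCE    0x0001
#define SD_PREFER_SIBLING  0x1000

struct dom {
    unsigned long span;   /* bitmask of spanned CPUs, enough for a model */
    int flags;
};

/* true if 'parent' is redundant relative to 'child' and would be destroyed */
bool degenerates(const struct dom *child, const struct dom *parent)
{
    if (child->span != parent->span)
        return false;
    /* any flag the parent adds on top of the child keeps it alive */
    return (parent->flags & ~child->flags) == 0;
}
```

With identical spans, the only thing that can save the parent level is a flag difference — which is why aligning SD_PREFER_SIBLING on MC would let the duplicate CPU level collapse again.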
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 16 December 2012 08:12, Alex Shi alex@intel.com wrote: On 12/14/2012 05:33 PM, Vincent Guittot wrote: On 14 December 2012 02:46, Alex Shi alex@intel.com wrote: On 12/13/2012 11:48 PM, Vincent Guittot wrote: On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU The scheme below tries to summaries the idea: Socket | socket 0 | socket 1 | socket 2 | socket 3 | LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 | buddy conf0 | 0 | 0| 1 | 16| 2 | 32| 3 | 48| buddy conf1 | 0 | 0| 0 | 16| 16 | 32| 32 | 48| buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48| But, I don't know how this can interact with NUMA load balance and the better might be to use conf3. I mean conf2 not conf3 So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it is unbalanced for different socket. That the target because we have decided to pack the small tasks in socket 0 when we have parsed the topology at boot. We don't have to loop into sched_domain or sched_group anymore to find the best LCPU when a small tasks wake up. iteration on domain and group is a advantage feature for power efficient requirement, not shortage. If some CPU are already idle before forking, let another waking CPU check their load/util and then decide which one is best CPU can reduce late migrations, that save both the performance and power. In fact, we have already done this job once at boot and we consider that moving small tasks in the buddy CPU is always benefit so we don't need to waste time looping sched_domain and sched_group to compute current capacity of each LCPU for each wake up of each small tasks. We want all small tasks and background activity waking up on the same buddy CPU and let the default behavior of the scheduler choosing the best CPU for heavy tasks or loaded CPUs. On the contrary, move task walking on each level buddies is not only bad on performance but also bad on power. Consider the quite big latency of waking a deep idle CPU. we lose too much.. 
My results have shown a different conclusion. In fact, there is a much bigger chance that the buddy will not be in deep idle, as all the small tasks and background activity are already waking up on this CPU.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you just have 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).

According to your patch, what you mean by 'notion of load' is the utilization of the CPU, not the load weight of tasks, right?

Yes, but not only. The number of tasks that run simultaneously is another
Re: [RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing
Hi Preeti, On 15 November 2012 17:54, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Currently the load balancer weighs a task based upon its priority,and this weight consequently gets added up to the weight of the run queue that it is on.It is this weight of the runqueue that sums up to a sched group's load which is used to decide the busiest or the idlest group and the runqueue thereof. The Per-Entity-Load-Tracking metric however measures how long a task has been runnable over the duration of its lifetime.This gives us a hint of the amount of CPU time that the task can demand.This metric takes care of the task priority as well.Therefore apart from the priority of a task we also have an idea of the live behavior of the task.This seems to be a more realistic metric to use to compute task weight which adds upto the run queue weight and the weight of the sched group.Consequently they can be used for load balancing. The semantics of load balancing is left untouched.The two functions load_balance() and select_task_rq_fair() perform the task of load balancing.These two paths have been browsed through in this patch to make necessary changes. weighted_cpuload() and task_h_load() provide the run queue weight and the weight of the task respectively.They have been modified to provide the Per-Entity-Load-Tracking metric as relevant for each. The rest of the modifications had to be made to suit these two changes. 
The Completely Fair Scheduler class is the only sched_class which contributes to the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be used, as this is the right reflection of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects the percentage uptime of the tasks that are queued on it and hence contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric earlier used in weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib, which captures the runnable time vs the alive time of the task against its priority. This metric replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables; they abound in number, because cfs_rq->runnable_load_avg needs to be big enough to capture the tasks' load often and accurately.

You are now using cfs_rq->runnable_load_avg instead of cfs_rq->load.weight for the calculation of cpu_load, but cfs_rq->runnable_load_avg is smaller than or equal to the cfs_rq->load.weight value. This implies that the new value is smaller than or equal to the old statistic, so you should be able to keep the same variable width for the computation of cpu_load.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED and CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the simplest scenario. Earlier discussions can be found in the link below.
Link: https://lkml.org/lkml/2012/10/25/162 Signed-off-by: Preeti U Murthypre...@linux.vnet.ibm.com --- include/linux/sched.h |2 +- kernel/sched/core.c | 12 + kernel/sched/fair.c | 64 + kernel/sched/sched.h |2 +- 4 files changed, 40 insertions(+), 40 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 087dd20..302756e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -924,7 +924,7 @@ struct sched_domain { unsigned int lb_count[CPU_MAX_IDLE_TYPES]; unsigned int lb_failed[CPU_MAX_IDLE_TYPES]; unsigned int lb_balanced[CPU_MAX_IDLE_TYPES]; - unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES]; + u64 lb_imbalance[CPU_MAX_IDLE_TYPES]; unsigned int lb_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES]; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 24d8b9b..4dea057 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2415,8 +2415,8 @@ static const unsigned char * would be when CPU is idle and so we just decay the old load without * adding any new load. */ -static unsigned long -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) +static u64 +decay_load_missed(u64 load, unsigned long missed_updates, int idx) { int j = 0; @@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. */ -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, +static void __update_cpu_load(struct rq *this_rq, u64 this_load, unsigned long pending_updates) { int
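Vincent's width argument — that runnable_load_avg can never exceed the old load.weight sum — follows from the shape of the per-entity metric: a task contributes weight * runnable_sum / period_sum, and the runnable sum can never exceed the period sum. A small userspace model makes this concrete; the decay constant approximates PELT's ~32ms half-life per 1ms window, and all names here are illustrative, not the kernel's:

```c
#include <stdint.h>

/*
 * Model of a per-entity load contribution: accumulate a geometrically
 * decayed "runnable" sum and "period" sum over 1ms windows, then scale
 * the task weight by their ratio.  y is chosen so y^32 ~= 0.5, here
 * approximated as 4008/4096 in fixed point.
 */
uint64_t contrib(uint64_t weight, unsigned runnable_ms, unsigned period_ms)
{
    uint64_t run = 0, per = 0;
    unsigned i;

    for (i = 0; i < period_ms; i++) {
        run = (run * 4008) >> 12;   /* decay both sums each window */
        per = (per * 4008) >> 12;
        if (i < runnable_ms)        /* task was runnable this window */
            run += 1024;
        per += 1024;                /* the window itself always counts */
    }
    return per ? weight * run / per : 0;
}
```

Since run <= per at every step, contrib() is bounded by the weight, so summing contributions over a runqueue stays within the range the old load.weight sum already needed.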
[RFC 3/3] sched: fix update NOHZ_IDLE flag
The function nohz_kick_needed modifies the NOHZ_IDLE flag that is used to update the nr_busy_cpus field of the sched_group. When the sched_domains are updated (because of the unplug of a CPU, for example), a null_domain is attached to the CPUs. We have to test likely(!on_null_domain(cpu)) first in order to detect such an initialization step and to not modify the NOHZ_IDLE flag.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d0686c..1bf7c87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5490,7 +5490,7 @@ void trigger_load_balance(struct rq *rq, int cpu)
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ
-	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+	if (likely(!on_null_domain(cpu)) && nohz_kick_needed(rq, cpu))
 		nohz_balancer_kick(cpu);
 #endif
 }
--
1.7.10
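The one-line reorder above relies on the short-circuit evaluation of &&: once on_null_domain() is tested first, the flag-modifying predicate can no longer run against a CPU whose domain was just replaced by the null_domain. A stub model of the patched ordering (all names here are stand-ins, not the kernel functions):

```c
#include <stdbool.h>

static bool null_domain;       /* models rcu_dereference(sd) == NULL */
static int  kick_needed_calls; /* counts entries into the predicate  */

static bool on_null_domain(void) { return null_domain; }

static bool nohz_kick_needed(void)
{
    kick_needed_calls++;       /* would clear NOHZ_IDLE, walk domains... */
    return true;
}

/* patched ordering: the domain test guards the flag-modifying call */
bool trigger_kick(void)
{
    return !on_null_domain() && nohz_kick_needed();
}
```

With the original order, nohz_kick_needed() ran (and touched the flag) before the null-domain check could veto the kick; after the swap it is never entered on a detached CPU.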
[RFC 0/3] sched: fix nr_busy_cpus
The nr_busy_cpus field of the sched_group_power is sometimes different from 0 whereas the platform is fully idle. This series fixes 3 use cases:
- when the SCHED softirq is raised on an idle core for idle load balance but the platform doesn't go out of the cpuidle state
- when some CPUs enter idle state while booting all CPUs
- when a CPU is unplugged and/or replugged

Vincent Guittot (3):
  sched: fix nr_busy_cpus with coupled cpuidle
  sched: fix init NOHZ_IDLE flag
  sched: fix update NOHZ_IDLE flag

 kernel/sched/core.c      |    1 +
 kernel/sched/fair.c      |    2 +-
 kernel/time/tick-sched.c |    2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

--
1.7.10
[RFC 1/3] sched: fix nr_busy_cpus with coupled cpuidle
With the coupled cpuidle driver (but probably also with other drivers), a CPU loops in a temporary safe state while waiting for the other CPUs of its cluster to be ready to enter the coupled C-state. If an IRQ or a softirq occurs, the CPU will stay in this internal loop if there is no need to resched. The SCHED softirq clears the NOHZ_IDLE flag and increases nr_busy_cpus. If there is no need to resched, we will not call set_cpu_sd_state_idle because of this internal loop in a cpuidle state. We have to call set_cpu_sd_state_idle in tick_nohz_irq_exit, which is used to handle such a situation.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/time/tick-sched.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..e19bbc9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -526,6 +526,8 @@ void tick_nohz_irq_exit(void)
 	if (!ts->inidle)
 		return;
 
+	set_cpu_sd_state_idle();
+
 	__tick_nohz_idle_enter(ts);
 }
--
1.7.10
[RFC 2/3] sched: fix init NOHZ_IDLE flag
On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag. We clear the NOHZ_IDLE flag when nr_busy_cpus is initialized in order to have a coherent configuration.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..05058e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5817,6 +5817,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	update_group_power(sd, cpu);
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
--
1.7.10
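The incoherence being fixed can be modelled in a few lines: nr_busy_cpus is (re)initialised to the group weight on the assumption that every CPU is busy, so any NOHZ_IDLE flag already set by a CPU that reached the idle loop first must be dropped at the same time, otherwise that CPU's next idle transition is skipped and the counter never reaches zero. Plain ints stand in for the kernel's atomic_t and bitops; the names are illustrative.

```c
struct grp {
    int nr_busy_cpus;  /* models sg->sgp->nr_busy_cpus   */
    int nohz_idle[2];  /* models the per-CPU NOHZ_IDLE bit */
    int weight;        /* models sg->group_weight          */
};

void init_power(struct grp *g)
{
    int c;

    g->nr_busy_cpus = g->weight;   /* assume every CPU is busy       */
    for (c = 0; c < g->weight; c++)
        g->nohz_idle[c] = 0;       /* the fix: drop any stale flags  */
}

void cpu_goes_idle(struct grp *g, int cpu)
{
    if (g->nohz_idle[cpu])         /* already flagged idle: skip     */
        return;
    g->nohz_idle[cpu] = 1;
    g->nr_busy_cpus--;
}
```

Without the clearing step in init_power(), a CPU whose flag was set before the initialisation would be skipped in cpu_goes_idle(), leaving nr_busy_cpus at 1 on a fully idle group — the symptom described above.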
Re: [PATCH V2 3/3] workqueue: Schedule work on non-idle cpu instead of current one
On 1 October 2012 05:47, Viresh Kumar viresh.ku...@linaro.org wrote:

On 1 October 2012 06:02, Tejun Heo t...@kernel.org wrote:

It isn't about the CPU being actually idle?

No. Being idle only from the scheduler's perspective. :)

Also, if it's only about timers, shouldn't it be enough to implement it for timer and delayed_work?

What if we need a timer which must re-arm itself + schedule a work? delayed_work will not be sufficient in that case, and we would need to use a normal work. If I am not wrong, there can be other future users of this routine too. @Vincent: Can you please comment on this?

The goal is to consolidate the place where the selection of a target CPU for running something is done. The scheduler should select the CPU in order to have coherent behavior across all frameworks. A delayed work can be delayed for a long period, and the target CPU that was selected for the timer may no longer be the best place to queue a work. The goal is to be able to migrate a work if the scheduler thinks that is necessary.

Vincent

It would be great if you explain what you're trying to achieve and how. I can't tell what you're aiming for and why that would be beneficial from these submissions.

The following slides were prepared by Vincent and presented during LPC. Please have a look at them; they explain the problem statement well: http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf Specifically slides 12 and 19.

There aren't too many users with this behavior, but even a single user will be sufficient to not let the CPU get idle at all. And that will result in less power saving.

--
viresh
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
On 2 January 2013 05:22, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi everyone, I have been looking at how different workloads react when the per entity load tracking metric is integrated into the load balancer and what are the possible reasons for it. I had posted the integration patch earlier: https://lkml.org/lkml/2012/11/15/391 Essentially what I am doing is: 1.I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple 2.I have replaced cfs_rq-load.weight in weighted_cpuload() with cfs.runnable_load_avg,the active load tracking metric. 3.I have replaced se.load.weight in task_h_load() with se.load.avg.contrib,the per entity load tracking metric. 4.The load balancer will end up using these metrics. After conducting experiments on several workloads I found out that the performance of the workloads with the above integration would neither improve nor deteriorate.And this observation was consistent. Ideally the performance should have improved considering,that the metric does better tracking of load. Let me explain with a simple example as to why we should see a performance improvement ideally:Consider 2 80% tasks and 1 40% task. With integration: 40% 80%40% cpu1 cpu2 The above will be the scenario when the tasks fork initially.And this is a perfectly balanced system,hence no more load balancing.And proper distribution of loads on the cpu. Without integration --- 40% 40% 80%40% 80%40% cpu1 cpu2OR cpu1 cpu2 Because the view is that all the tasks as having the same load.The load balancer could ping pong tasks between these two situations. When I performed this experiment,I did not see an improvement in the performance though in the former case.On further observation I found that the following was actually happening. 
With integration:

 Initially          40% task sleeps    40% task wakes up and
                                       select_idle_sibling()
                                       decides to wake it up on cpu1

        40%                 -                 40%
 80%    40%         80%    40%         80%    40%
 cpu1   cpu2        cpu1   cpu2        cpu1   cpu2

This makes load balance trigger movement of the 40% task from cpu1 back to cpu2. Hence the stability that the load balancer was trying to achieve is gone. Hence the culprit boils down to select_idle_sibling(). How is it the culprit and how is it hindering the performance of the workloads?

*What is the way ahead with the per entity load tracking metric in the load balancer then?*

In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105, he mentions the following:

"It is my intuition that the greatest carnage here is actually caused by wake-up load-balancing getting in the way of periodic in establishing a steady state. I suspect more mileage would result from reducing the interference wake-up load-balancing has with steady state. The whole point of using blocked load is so that you can converge on a steady state where you don't NEED to move tasks. What disrupts this is we naturally prefer idle cpus on wake-up balance to reduce wake-up latency."

I think the better answer is making these two processes, load balancing and select_idle_sibling(), more co-operative. I had not realised how this would happen until I saw it happening in the above experiment.

Based on what Paul explained above, let us use the runnable load + the blocked load for calculating the load on a cfs runqueue, rather than just the runnable load (which is what I am doing now), and see its consequence.

 Initially           40% task sleeps

        40%                 -
 80%    40%          80%    40%
 cpu1   cpu2         cpu1   cpu2

So initially the load on cpu1 is say 80 and on cpu2 it is also 80. Balanced. Now when the 40% task sleeps, the total load on cpu2 = runnable load + blocked load, which is still 80.

As a consequence, firstly, during periodic load balancing the load is not moved from cpu1 to cpu2 when the 40% task sleeps (it sees the load on cpu2 as 80 and not as 40).
Hence the above scenario remains the same. On wake up, what happens? Here comes the point of making both load balancing and wake up balancing (select_idle_sibling) co-operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible, considering that load balancing counts the blocked load as part of the load of cpu2.

Hi Preeti,

I'm not sure that we want such a steady state at the core level, because we take advantage of migrating waking tasks between cores that share their cache, as Matthew demonstrated. But I agree that reaching such a steady state at the cluster and CPU level is interesting.

IMHO, you're right that taking the blocked load into consideration should minimize task migration between clusters, but it should not prevent fast task migration between cores that share their cache.

Vincent
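To make the discussion above concrete, here is a rough, self-contained sketch of the per entity load tracking idea: each ~1ms period contributes to a geometrically decayed sum, with a decay factor y chosen so that y^32 is about 1/2, and a task that is runnable r% of the time converges to a contribution of roughly weight * r/100. The struct name, the fixed-point constants, and the update granularity are illustrative stand-ins, not the kernel's exact implementation.

```c
#include <assert.h>

/* y ~= 0.9785 approximated as 978/1000, so y^32 ~= 0.5 */
#define DECAY_NUM 978
#define DECAY_DEN 1000

struct entity_load {
	unsigned long runnable_sum; /* decayed sum of runnable periods */
	unsigned long period_sum;   /* decayed sum of elapsed periods  */
};

/* Account one ~1ms period; 'runnable' says whether the task ran in it. */
static void update_entity_load(struct entity_load *el, int runnable)
{
	el->runnable_sum = el->runnable_sum * DECAY_NUM / DECAY_DEN
			 + (runnable ? 1024 : 0);
	el->period_sum = el->period_sum * DECAY_NUM / DECAY_DEN + 1024;
}

/* Contribution of this entity to the runqueue load, scaled by weight. */
static unsigned long load_avg_contrib(const struct entity_load *el,
				      unsigned long weight)
{
	return el->period_sum ? weight * el->runnable_sum / el->period_sum : 0;
}
```

With weight 1024, a task runnable 2ms out of every 5ms settles around a contribution of ~400 (i.e. ~40% of its weight, with some ripple depending on where in the on/off cycle you sample), which is exactly the property that lets the balancer tell the 40% task apart from the 80% tasks in the example above.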
Re: [RFC][PATCH 2/2] ARM64: introduce cluster id and make a difference between socket id
On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote:

In the cpu topology information, we define topology_physical_package_id() as the cpu socket id, which means that the socket id is the identification for a physical processor, not for a cluster in a cpu die. On the ARM64 platform, multiple clusters in a cpu die will be normal. Here is an example with 2 cores in a cluster and 2 clusters in a socket:

 +---------------------------------------+
 | socket                                |
 |  +---------------+  +---------------+ |
 |  | cluster       |  | cluster       | |
 |  | +----+ +----+ |  | +----+ +----+ | |
 |  | |core| |core| |  | |core| |core| | |
 |  | +----+ +----+ |  | +----+ +----+ | |
 |  +---------------+  +---------------+ |
 +---------------------------------------+

ARM64 extended the MPIDR to 64 bit and introduced another affinity level. We can use this affinity level for the socket id and use the third highest affinity level for the cluster id, which keeps the socket id behaving in its original way.

Signed-off-by: Hanjun Guo hanjun@linaro.org
---
 arch/arm64/include/asm/topology.h | 1 +
 arch/arm64/kernel/topology.c      | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 8631808..ff68ecc 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -8,6 +8,7 @@
 struct cputopo_arm64 {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int socket_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 1eb0435..6d1e5a6 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -80,12 +80,14 @@ void store_cpu_topology(unsigned int cpuid)
 		/* core performance interdependency */
 		cpuid_topo->thread_id = MPIDR_AFFINITY_LEVEL_0(mpidr);
 		cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
-		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
+		cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
+		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_3(mpidr);

socket_id is currently used by update_siblings_masks() to update the core_sibling mask.
This mask defines which CPUs share their cache, and AFAICT the caches are shared at the cluster level, so cluster_id should be used instead of socket_id. Have you got more information about the goal of this new level 3?

Vincent

 	} else {
 		/* largely independent cores */
 		cpuid_topo->thread_id = -1;
 		cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL_0(mpidr);
-		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
+		cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
+		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
 	}
 } else {
 	/*
@@ -95,6 +97,7 @@
 	 */
 	cpuid_topo->thread_id = -1;
 	cpuid_topo->core_id = 0;
+	cpuid_topo->cluster_id = -1;
 	cpuid_topo->socket_id = -1;
 }

@@ -120,6 +123,7 @@ void __init init_cpu_topology(void)
 	topo->thread_id = -1;
 	topo->core_id = -1;
+	topo->cluster_id = -1;
 	topo->socket_id = -1;
 	cpumask_clear(&topo->core_sibling);
 	cpumask_clear(&topo->thread_sibling);
--
1.7.9.5
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote:

Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making.

It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

Vincent

For ARM64, we can get the topology from the MPIDR register which defines the affinity of processors. This patch is mainly based on arch/arm/kernel/topology.c written by Vincent Guittot, with the topology array replaced by a per cpu variable.

Signed-off-by: Hanjun Guo hanjun@linaro.org
---
 arch/arm64/Kconfig                |   9 +++
 arch/arm64/include/asm/cputype.h  |  11 ++++
 arch/arm64/include/asm/topology.h |  41 ++++++++
 arch/arm64/kernel/Makefile        |   1 +
 arch/arm64/kernel/smp.c           |   6 ++
 arch/arm64/kernel/topology.c      | 128 +++++++++
 6 files changed, 196 insertions(+)
 create mode 100644 arch/arm64/include/asm/topology.h
 create mode 100644 arch/arm64/kernel/topology.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9737e97..f0ce91b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -150,6 +150,15 @@ config SMP
 	  If you don't know what to do here, say N.

+config ARM64_CPU_TOPOLOGY
+	bool "Support cpu topology definition"
+	depends on SMP && ARM64
+	default y
+	help
+	  Support ARM64 cpu topology definition. The MPIDR register defines
+	  affinity between processors which is then used to describe the cpu
+	  topology of an ARM64 System.
+
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
 	range 2 32

diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 5fe138e..68b55af 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -30,6 +30,17 @@

 #define MPIDR_HWID_BITMASK	0xff00ffffff

+#define MPIDR_SMP_BITMASK	(0x1 << 30)
+#define MPIDR_MT_BITMASK	(0x1 << 24)
+
+#define MPIDR_LEVEL_BITS	8
+#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
+
+#define MPIDR_AFFINITY_LEVEL_0(mpidr)	((mpidr) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_1(mpidr)	(((mpidr) >> 8) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_2(mpidr)	(((mpidr) >> 16) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_3(mpidr)	(((mpidr) >> 32) & MPIDR_LEVEL_MASK)
+
 #define read_cpuid(reg) ({ \
 	u64 __val; \
 	asm("mrs %0, " #reg : "=r" (__val)); \

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
new file mode 100644
index 000..8631808
--- /dev/null
+++ b/arch/arm64/include/asm/topology.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_ARM64_TOPOLOGY_H
+#define _ASM_ARM64_TOPOLOGY_H
+
+#ifdef CONFIG_ARM64_CPU_TOPOLOGY
+
+#include <linux/cpumask.h>
+
+struct cputopo_arm64 {
+	int thread_id;
+	int core_id;
+	int socket_id;
+	cpumask_t thread_sibling;
+	cpumask_t core_sibling;
+};
+
+DECLARE_PER_CPU(struct cputopo_arm64, cpu_topology);
+
+#define cpu_topo(cpu) per_cpu(cpu_topology, cpu)
+
+#define topology_physical_package_id(cpu)	(cpu_topo(cpu).socket_id)
+#define topology_core_id(cpu)			(cpu_topo(cpu).core_id)
+#define topology_core_cpumask(cpu)		(&cpu_topo(cpu).core_sibling)
+#define topology_thread_cpumask(cpu)		(&cpu_topo(cpu).thread_sibling)
+
+#define mc_capable()	(cpu_topo(0).socket_id != -1)
+#define smt_capable()	(cpu_topo(0).thread_id != -1)
+
+void init_cpu_topology(void);
+void store_cpu_topology(unsigned int cpuid);
+const struct cpumask *cpu_coregroup_mask(int cpu);
+
+#else
+
+static inline void init_cpu_topology(void) { }
+static inline void store_cpu_topology(unsigned int cpuid) { }
+#endif
+
+#include <asm-generic/topology.h>
+
+#endif /* _ASM_ARM64_TOPOLOGY_H */

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7b4b564..a47c359 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -18,6 +18,7 @@
 arm64-obj-$(CONFIG_SMP)			+= smp.o smp_spin_table.o smp_psci.o
 arm64-obj-$(CONFIG_HW_PERF_EVENTS)	+= perf_event.o
 arm64-obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
 arm64-obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
+arm64-obj-$(CONFIG_ARM64_CPU_TOPOLOGY)	+= topology.o

 obj-y += $(arm64-obj-y) vdso/
 obj-m += $(arm64-obj-m)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index fee5cce..197b1da 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -39,6 +39,7 @@
 #include <asm/atomic.h>
 #include asm
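The shift and mask operators in the quoted MPIDR accessors were garbled in transit; assuming the reconstruction above (8-bit affinity fields at bits 0, 8 and 16, with ARM64's Aff3 at bits 39:32 of the 64-bit register value), the macros can be exercised in plain C. `1UL` replaces `1` in the mask so the sketch is safe on any host; the sample MPIDR value is of course made up.

```c
#include <stdint.h>

#define MPIDR_LEVEL_BITS 8
#define MPIDR_LEVEL_MASK ((1UL << MPIDR_LEVEL_BITS) - 1)

/* Extract one 8-bit affinity field per level from a 64-bit MPIDR value. */
#define MPIDR_AFFINITY_LEVEL_0(mpidr) ((mpidr) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_1(mpidr) (((mpidr) >> 8) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_2(mpidr) (((mpidr) >> 16) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_3(mpidr) (((mpidr) >> 32) & MPIDR_LEVEL_MASK)
```

For an MPIDR of 0x0000000100020304, the patch under discussion would read thread_id = 4 (Aff0), core_id = 3 (Aff1), cluster_id = 2 (Aff2) and socket_id = 1 (Aff3).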
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 29 July 2013 12:15, Sudeep KarkadaNagesha sudeep.karkadanage...@arm.com wrote:

On 29/07/13 10:46, Vincent Guittot wrote: On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote: Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making. It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

Just for my knowledge, I thought power aware scheduling using SCHED_MC/SMT was removed. I see commit 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 "sched: Remove stale power aware scheduling remnants and dysfunctional knobs". I may be missing something here.

It's a common mistake to mix up SCHED_MC with the powersaving balance policy that sat on top of it. Only the powersaving policy has been removed; SCHED_MC and SCHED_SMT are still in the scheduler and give a performance improvement on 32-bit ARM.

Regards, Sudeep
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 29 July 2013 11:54, Will Deacon will.dea...@arm.com wrote:

On Mon, Jul 29, 2013 at 10:46:06AM +0100, Vincent Guittot wrote: On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote: Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making. It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

... except that the architecture doesn't define what the AFF fields in MPIDR really represent. Using them to make key scheduling decisions relating to

Do you mean that it's not defined in the arm64 ARM? AFAIK, there is a good explanation in the arm32 ARM, and it's currently used with SCHED_MC and SCHED_SMT.

cache proximity seems pretty risky to me, especially given the track record we've seen already on AArch32 silicon. It's a convenient register if it contains the data we want it to contain, but we need to force ourselves to come to terms with reality here and simply use it as an identifier for a CPU. Can't we just use the device-tree to represent this topological data for arm64? Lorenzo has been working on bindings in this area.

I agree that we should probably use DT if we can't rely on MPIDR for arm64.

Vincent

Will
Re: [RFC][PATCH 1/9] sched: Introduce power scheduler
On 10 July 2013 13:11, Morten Rasmussen morten.rasmus...@arm.com wrote:

On Wed, Jul 10, 2013 at 03:10:15AM +0100, Arjan van de Ven wrote: On 7/9/2013 8:55 AM, Morten Rasmussen wrote:

+	mod_delayed_work_on(schedule_cpu(), system_wq, dwork,
+			    msecs_to_jiffies(INTERVAL));

so thinking about this more, this really really should not be a work queue. a work queue will cause a large number of context switches for no reason (on Intel and AMD you can switch P state from interrupt context, and I'm pretty sure that holds for many ARM as well)

Agree. I should have made it clear that this is only a temporary solution. I would prefer to tie the power scheduler to the existing scheduler tick instead so we don't wake up cpus unnecessarily. nohz may be able to handle that for us. Also, currently the power scheduler updates all cpus. Going forward this would change to per cpu updates and partial updates of the global view to improve scalability.

For the packing tasks patches, we are using the periodic load balance sequence to update the activity, like it is done for cpu_power. I have planned to update the packing patches to see how they can cooperate with Morten's patches, as they have similar needs.

and in addition, it causes some really nasty cases, especially around real time tasks. Your workqueue will schedule a kernel thread, which will run BEHIND real time tasks, and such a real time task will then never be able to start running at a higher performance. (and with the delta between lowest and highest performance sometimes being 10x or more, the real time task will be running SLOW... quite possibly longer than several milliseconds) and all for no good reason; a normal timer running in irq context would be much better for this kind of thing!
Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
On 13 July 2013 12:23, Catalin Marinas catalin.mari...@arm.com wrote: Hi Peter, (Morten's away for a week, I'll try cover some bits in the meantime) On Sat, Jul 13, 2013 at 07:49:09AM +0100, Peter Zijlstra wrote: On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote: This patch set is an initial prototype aiming at the overall power-aware scheduler design proposal that I previously described http://permalink.gmane.org/gmane.linux.kernel/1508480. The patch set introduces a cpu capacity managing 'power scheduler' which lives by the side of the existing (process) scheduler. Its role is to monitor the system load and decide which cpus that should be available to the process scheduler. Hmm... This looks like a userspace hotplug deamon approach lifted to kernel space :/ The difference is that this is faster. We even had hotplug in mind some years ago for big.LITTLE but it wouldn't give the performance we need (hotplug is incredibly slow even if driven from the kernel). How about instead of layering over the load-balancer to constrain its behaviour you change the behaviour to not need constraint? Fix it so it does the right thing, instead of limiting it. I don't think its _that_ hard to make the balancer do packing over spreading. The power balance code removed in 8e7fbcbc had things like that (although it was broken). And I'm sure I've seen patches over the years that did similar things. Didn't Vincent and Alex also do things like that? We should take the good bits from all that and make something of it. And I think its easier now that we have the per task and per rq utilization numbers [1]. That's what we've been pushing for. From a big.LITTLE perspective, I would probably vote for Vincent's patches but I guess we could probably adapt any of the other options. But then we got Ingo NAK'ing all these approaches. 
Taking the best bits from the current load balancing patches would create yet another set of patches which don't fall under Ingo's requirements (at least as I understand them).

In fact we are currently updating our patchset based on Ingo's feedback. The move of the cpuidle and cpufreq statistics was planned to appear later in our development, but we are now integrating it based on Ingo's request. We start with the cpuidle statistics and are moving them into the scheduler. In addition, we want to integrate the current C-state of a core in the wake up decision.

Just start by changing the balancer to pack instead of spread. Once that works, see where the two modes diverge and put a knob in.

That's the approach we've had so far (not sure about the knob). But it doesn't solve Ingo's complaint about fragmentation between scheduler, cpufreq and cpuidle policies.

Then worry about power thingies. To quote Ingo: To create a new low level idle driver mechanism the scheduler could use and integrate proper power saving / idle policy into the scheduler.

That's unless we all agree (including Ingo) that the above requirement is orthogonal to task packing and, as a *separate* project, we look at better integrating cpufreq/cpuidle with the scheduler, possibly with a new driver model and governors as libraries used by such drivers. In which case the current packing patches shouldn't be NAK'ed but reviewed so that they can be improved further or rewritten.

The integration of cpuidle and cpufreq should start by unifying all the statistics stuff. For cpuidle we need to pull in the per-cpu idle time guestimator. For cpufreq the per-cpu usage stuff -- which we already have in the scheduler these days! Once we have all the statistics in place, it's also easier to see what we can do with them and what might be missing. At this point, mandate that policy drivers may not do any statistics gathering of their own. If they feel the need to do so, we're missing something and that's not right.
I agree in general, but there is the intel_pstate.c driver which has its own separate statistics that the scheduler does not track. We could move to invariant task load tracking which uses aperf/mperf (and could do similar things with perf counters on ARM). As I understand from Arjan, the new pstate driver will be different, so we don't know exactly what it requires.

I'm not entirely sold on differentiating between short running and other tasks either. Although I suppose I see where that comes from. A task that would run 50% on a big core would unlikely be qualified as small; however, if it would require 85% of a small core and there's room on the small cores, it's a good move to run it there. So where's the limit for being small? It seems like an artificial limit and such should be avoided where possible.

I agree. With Morten's approach, it doesn't care about how small a task is but rather, when a CPU (or cluster) is loaded to a certain threshold, just spreads tasks to the next. I think a small task threshold on its own doesn't make much sense
Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
On 20 June 2013 04:18, Alex Shi alex@intel.com wrote:

They are the base values used in load balance; update them with the rq runnable load average, and the load balance will consider the runnable load avg naturally. We also tried to include the blocked_load_avg as cpu load in balancing, but that caused a 6% kbuild performance drop on every Intel machine, and aim7/oltp drops on some of the 4-CPU-socket machines. Only adding blocked_load_avg into get_rq_runable_load still drops hackbench a little on NHM EX.

Signed-off-by: Alex Shi alex@intel.com
Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 kernel/sched/fair.c |  5 +++--
 kernel/sched/proc.c | 17 +++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e5a5e6..7d5c477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }

Alex,

In the wake-affine function, we use current->se.load.weight and p->se.load.weight to update the load of this_cpu and prev_cpu, whereas these loads are now equal to runnable_load_avg, which is the sum of se->avg.load_avg_contrib. Shouldn't we use se->avg.load_avg_contrib instead of se.load.weight?
Vincent

 /*
@@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long load_avg = rq->cfs.runnable_load_avg;

 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return load_avg / nr_running;

 	return 0;
 }

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 	sched_avg_update(this_rq);
 }

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->load.weight;
+}
+#endif
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = get_rq_runnable_load(this_rq);
 	unsigned long pending_updates;

 	/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
  */
 void update_cpu_load_active(struct rq *this_rq)
 {
+	unsigned long load = get_rq_runnable_load(this_rq);
 	/*
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+	__update_cpu_load(this_rq, load, 1);

 	calc_load_account_active(this_rq);
 }
--
1.7.12
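The cpu_avg_load_per_task() hunk above is small enough to check by hand; here is a minimal userspace sketch of the changed logic, with a simplified stand-in struct instead of the kernel's rq/cfs_rq. The point of the change is only which field feeds the division: runnable_load_avg rather than load.weight, with the nr_running == 0 guard kept intact.

```c
/* Simplified stand-in for the kernel's runqueue. */
struct fake_rq {
	unsigned long nr_running;        /* tasks currently on the rq      */
	unsigned long runnable_load_avg; /* sum of per-entity contribs     */
};

/* Average load per task, now derived from the tracked runnable load. */
static unsigned long cpu_avg_load_per_task(const struct fake_rq *rq)
{
	if (rq->nr_running)
		return rq->runnable_load_avg / rq->nr_running;

	return 0;
}
```

So three fully runnable tasks of weight 1024 give an average of 1024 per task, while an empty runqueue safely reports 0 instead of dividing by zero.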
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On 2 August 2013 03:50, Joonsoo Kim iamjoonsoo@lge.com wrote:

Now, checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking in update_sg_lb_stats(), except saving one iteration over sched_group_cpus.

In this patch, I factor out this checking into a should_we_balance() function. And before doing the actual work for load balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quits the work immediately.

With this change, we can save two memset costs and can expect better compiler optimization. Below is the result of this patch.

* Vanilla *
   text	   data	    bss	    dec	    hex	filename
  34499	   1136	    116	  35751	   8ba7	kernel/sched/fair.o

* Patched *
   text	   data	    bss	    dec	    hex	filename
  34243	   1136	    116	  35495	   8aa7	kernel/sched/fair.o

In addition, rename @balance to @should_balance in order to represent its purpose more clearly.

Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaae77e..7f51b8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4426,22 +4426,17 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
  * @group: sched_group whose statistics are to be updated.
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, int *balance, struct sg_lb_stats *sgs)
+			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
 	int i;

-	if (local_group)
-		balance_cpu = group_balance_cpu(group);
-
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
@@ -4454,15 +4449,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;

 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu &&
-					cpumask_test_cpu(i, sched_group_mask(group))) {
-				first_idle_cpu = 1;
-				balance_cpu = i;
-			}
-
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else {
 			load = source_load(i, load_idx);
 			if (load > max_cpu_load)
 				max_cpu_load = load;
@@ -4482,22 +4471,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}

-	/*
-	 * First idle cpu or the first cpu(busiest) in this sched group
-	 * is eligible for doing load balancing at this and above
-	 * domains. In the newly idle case, we will allow all the cpu's
-	 * to do the newly idle load balance.
-	 */
-	if (local_group) {
-		if (env->idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != env->dst_cpu) {
-				*balance = 0;
-				return;
-			}
-			update_group_power(env->sd, env->dst_cpu);
-		} else if (time_after_eq(jiffies, group->sgp->next_update))
-			update_group_power(env->sd, env->dst_cpu);
-	}
+	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+			time_after_eq(jiffies, group->sgp->next_update)))
+		update_group_power(env->sd, env->dst_cpu);

 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load * SCHED_POWER_SCALE) / group->sgp->power;
@@ -4576,7 +4552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
  * @sds: variable to hold the statistics for this sched_domain.
  */
 static inline void update_sd_lb_stats(struct lb_env *env,
-			int *balance, struct sd_lb_stats *sds)
+			struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -4593,10 +4569,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
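The check being factored out above can be sketched as a standalone function: for non-newly-idle balancing, only one CPU of the local group — the first idle one, or the group's first CPU if none is idle — is allowed to proceed, while a newly idle CPU may always balance. This is a hedged userspace sketch of that policy, not the kernel function: plain arrays stand in for cpumasks, and the `group_cpus`/`cpu_is_idle` parameters are illustrative.

```c
#include <stdbool.h>

#define CPU_NEWLY_IDLE 1 /* illustrative stand-in for the kernel's enum */

static bool should_we_balance(int dst_cpu, int idle_flag,
			      const int *group_cpus, const bool *cpu_is_idle,
			      int ncpus)
{
	int balance_cpu = -1;

	/* In the newly idle case, any CPU is allowed to balance. */
	if (idle_flag == CPU_NEWLY_IDLE)
		return true;

	/* Otherwise, pick the first idle CPU of the local group... */
	for (int i = 0; i < ncpus; i++) {
		if (cpu_is_idle[group_cpus[i]]) {
			balance_cpu = group_cpus[i];
			break;
		}
	}

	/* ...falling back to the group's first CPU if all are busy. */
	if (balance_cpu == -1)
		balance_cpu = group_cpus[0];

	/* Only the elected CPU proceeds with this load balance pass. */
	return balance_cpu == dst_cpu;
}
```

Calling this once at the top of load_balance() is what lets the patch drop the @balance in/out parameter threaded through update_sg_lb_stats() and update_sd_lb_stats().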
Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
On 5 November 2013 15:06, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:15PM +0200, Vincent Guittot wrote: The function arch_sd_local_flags is used to set flags in sched_domains according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is also created to reflect whether groups of CPUs in a sched_domain level can or not reach different power state. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. The cpu parameter of arch_sd_local_flags can be used by architecture to fine tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be set differently for groups of CPUs according to DT information Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Not your fault, but you're making a bigger mess of the arch topology interface. How about we start with something like the below -- compile tested only. And then try and lift default_topology so that an arch can override it? Your proposal looks fine for me. It's clearly better to move in one place the configuration of sched_domain fields. Have you already got an idea about how to let architecture override the topology? 
My primary need comes from the fact that the topology configuration is not the same for all cores Vincent --- arch/ia64/include/asm/topology.h | 24 - arch/metag/include/asm/topology.h | 27 -- arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 --- include/linux/topology.h | 115 --- kernel/sched/core.c | 188 +++--- 6 files changed, 114 insertions(+), 275 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e449b75..20d12fa7e0cd 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,30 +46,6 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 1,\ - .max_interval = 4,\ - .busy_factor= 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_NEWIDLE\ - | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_AFFINE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ -} - #endif /* CONFIG_NUMA */ #ifdef CONFIG_SMP diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h index 8e9c0b3b9691..e95f874ded1b 100644 --- a/arch/metag/include/asm/topology.h +++ b/arch/metag/include/asm/topology.h @@ -3,33 +3,6 @@ #ifdef CONFIG_NUMA -/* sched_domains SD_NODE_INIT for Meta machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8,\ - .max_interval = 32, \ - .busy_factor= 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 3,\ - .idle_idx = 2,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC
Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
On 5 November 2013 23:27, Peter Zijlstra pet...@infradead.org wrote:

On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:

Your proposal looks fine to me. It's clearly better to gather the configuration of the sched_domain fields in one place. Have you already got an idea about how to let the architecture override the topology?

Maybe something like the below -- completely untested (my s390 compiler is on a machine that's currently powered off).

My primary need comes from the fact that the topology configuration is not the same for all cores

Do expand.. the various cpu masks used in the topology list are per cpu, is that sufficient room to wriggle or do you need more?

My current implementation sets a flag in each level (SMT, MC and CPU) to describe the power gating capabilities for the groups of cpus, but the capabilities can be different at the same level; I mean that we can have a group of cpus that can power gate at MC level in the system whereas another group of CPUs can only power gate at CPU level. With the current implementation I can't tell the difference, so I have added the cpu parameter when setting the flags.

The other solution is to add new topology levels whose cpu masks describe the power dependency with others (currently power gating, but we could add more levels, for frequency dependency as an example). In this case the current implementation is enough, and the main difficulty will be where to insert these new levels relative to the current ones.
A typical example with one cluster that can power gate at core level whereas the other cluster can power gate at cluster level, will give the following domain topology: If we set a flag in the current topology levels we should have something like below CPU0: domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 0 1 domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 0-1 2-3 4-5 6-7 domain 2: span 0-15 level: CPU flags: groups: 0-7 8-15 CPU8 domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 8 9 domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 8-9 10-11 12-13 14-15 domain 2: span 0-15 level CPU flags: groups: 8-15 0-7 If we create new levels, we could have something like below CPU0 domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES groups: 0 1 domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 0-1 2-3 4-5 6-7 domain 2: span 0-15 level PWR flags SD_NOT_SHARE_POWERDOMAIN groups: 0-1 2-3 4-5 6-7 8-15 domain 3: span 0-15 level: CPU flags: groups: 0-7 8-15 CPU8 domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES groups: 8 9 domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 8-9 10-11 12-13 14-15 domain 2: span 0-15 level PWR flags SD_NOT_SHARE_POWERDOMAIN groups: 0-1 2-3 4-5 6-7 8-15 domain 3: span 0-15 level CPU flags: groups: 8-15 0-7 Vincent --- --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -1070,3 +1070,23 @@ static int __init s390_smp_init(void) return 0; } subsys_initcall(s390_smp_init); + +static struct sched_domain_topology_level s390_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES }, +#endif +#ifdef CONFIG_SCHED_MC + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES }, +#endif +#ifdef CONFIG_SCHED_BOOK + { cpu_book_mask, }, +#endif + { 
cpu_cpu_mask, }, + { NULL, }, +}; + +static int __init s390_sched_topology(void) +{ + sched_domain_topology = s390_topology; +} +early_initcall(s390_sched_topology); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -889,6 +889,20 @@ void free_sched_domains(cpumask_var_t do bool cpus_share_cache(int this_cpu, int that_cpu); +typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); + +#define SDTL_OVERLAP 0x01 + +struct sched_domain_topology_level { + sched_domain_mask_f mask; + int sd_flags; + int flags; + int numa_level; + struct sd_data data; +}; + +extern struct sched_domain_topology_level *sched_domain_topology; + #else /* CONFIG_SMP */ struct sched_domain_attr; --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5377,20 +5377,6 @@ enum s_alloc { sa_none, }; -struct sched_domain_topology_level; - -typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); - -#define SDTL_OVERLAP 0x01 - -struct sched_domain_topology_level { - sched_domain_mask_f mask; - int sd_flags; - int flags; - int numa_level; - struct sd_data data; -}; - /* * Build an iteration mask that can exclude
Bench for testing scheduler
Hi,

During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake-up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping periods and get the average wake-up latency. The range of sleeping periods must cover all residency times in the idle state table of the platform. I have run such tests on a TC2 platform with the packing tasks patchset. I have used the following command:

#cyclictest -t <number of cores> -q -e 1000 -i 500-12000 -d 150 -l 2000

The figures below give the average wake-up latency and power consumption for the default scheduler behavior, packing tasks at cluster level, and packing tasks at core level. We can see variation in both wake-up latency and power consumption. The detailed result is not a simple single value, which makes comparison not so easy, but the average of all measurements should give us a usable “score”.
I know that Ingo would like to add the benches in Tools/* but I wonder if it make sense to copy cyclictest in this directory when we have an official git tree here: git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git I have put both final score and detailed results below so everybody can check the score vs detailed figures: | Default average results | Cluster Packing average results | Core Packing average results | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | (us) (J)(J) | (us) (J)(J) | (us) (J) (J) | 8797948902364175 | 416 879688 12750 | 189897452 30052 Cyclictest | Default | Packing at Cluster level | Packing at Core level Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy (us) | (us) (J)(J) | (us) (J)(J) | (us) (J) (J) 500 24 111474772479576 21 11136768 11693 22 11126062 30138 700 22 111360843058419 21 01125280 11761 21 11109950 23503 900 22 111360173036768 21 11112542 12017 20 01101089 23733 1100 24 111329642506132 21 01109039 12248 21 11091832 23621 1300 24 111238962488459 21 01099308 12015 21 11086301 23264 1500 24 111208422488272 21 01099811 12685 20 01083658 22499 1700 41 3811171663042091 21 01090920 12393 21 11080387 23015 190011918211205522737555 21 01087900 11900 21 11078711 23177 210016719511224253210655 22 21090420 11900 20 11077985 22639 230015215611198542497773 43 221087278 11921 21 11075943 26282 250018216311208182365870 63 291089169 11551 21 01073717 24290 270043920210589523058516107 411077955 12122 21 01070951 23126 290057026810282383099162148 301067562 13287 24 11064200 24260 3100751137 9465123158095178 301059395 12236 29 11058887 23225 3300696203 9648223042524206 281041194 13934 36 11056656 23941 3500728191 9593983006066235 361028150 13387 44 31045841 23873 3700844138 9217803033189245 311019065 14582 62 61034466 22501 3900815172 9256002862994273 331001974 12091 80 91014650 2 4100870179 
8976162940444279 35 996226 12014 88 111030588 25461 4300979119 8469122996911306 36 980075 12641100
Re: Bench for testing scheduler
On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (for whatever reason, the text is wrapped and results hard to read) Yes, i have just seen that. It looks like gmail has wrapped the lines. I have added the results which should not be wrapped, at the end of this email On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents from having repeatable and comparable tests and results. 
I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system We are working on describing some basic scenarios (plain English for now) and one of them could be video playing with threads for audio and video decoding with random change in the workload. So I think the first step should be a set of tools/scripts to analyse the scheduler behaviour, both in terms of latency and power, and these can use perf sched. We can then run some real life scenarios (e.g. Android video playback) and build a benchmark that matches such behaviour as close as possible. We can probably use (or improve) perf sched replay to also simulate such workload (we may need additional features like thread dependencies). The figures below give the average wakeup latency and power consumption for default scheduler behavior, packing tasks at cluster level and packing tasks at core level. We can see both wakeup latency and power consumption variation. The detailed result is not a simple single value which makes comparison not so easy but the average of all measurements should give us a usable “score”. How did you assess the power/energy? I have use the embedded joule meter of the tc2. Thanks. 
-- Catalin | Default average results | Cluster Packing average results | Core Packing average results | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | (us) (J)(J) | (us) (J)(J) | (us) (J)(J) | 8797948902364175 | 416 879688 12750 | 189897452 30052 Cyclictest | Default | Packing at Cluster level | Packing at Core level Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy (us) | (us) (J)(J) | (us) (J)(J) | (us) (J)(J) 500 24 111474772479576 21 1 1136768 11693 22 11126062 30138 700 22 111360843058419 21 0 1125280 11761 21 11109950 23503 900 22 111360173036768 21 1 1112542 12017 20 01101089 23733 1100 24 111329642506132 21 0 1109039 12248 21 11091832 23621 1300 24 111238962488459 21 0 1099308 12015 21 11086301 23264 1500
Re: Bench for testing scheduler
On 8 November 2013 01:04, Rowand, Frank frank.row...@sonymobile.com wrote: Hi Vincent, Thanks for creating some benchmark numbers! you're welcome On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guit...@linaro.org] wrote: On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (for whatever reason, the text is wrapped and results hard to read) Yes, i have just seen that. It looks like gmail has wrapped the lines. I have added the results which should not be wrapped, at the end of this email On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 The number of loops (-l 2000) should be much larger to create useful results. I don't have a specific number that is large enough, I just know from experience that 2000 is way too small. For example, running cyclictest several times with the same values on my laptop gives values that are not consistent: The Avg figures look almost stable IMO. Are you speaking about the Max value for the inconsistency ? 
$ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243 T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226 T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017 T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390 T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810 T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287 T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252 T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354 T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143 T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213 T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279 T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244 T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. 
In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents from having repeatable and comparable tests and results. I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system We are working on describing some basic scenarios (plain English for now) and one of them could be video playing with threads for audio and video decoding with random change in the workload. So I think the first step should be a set of tools/scripts to analyse the scheduler behaviour, both in terms of latency and power, and these can use perf sched. We can
Re: Bench for testing scheduler
On 7 November 2013 15:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Nov 07, 2013 at 01:33:43PM +, Vincent Guittot wrote: On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. These figures are indeed useful to make sure we don't have any regression in terms of latency but I would not use cyclictest (as it is) to assess power improvements since the test is too artificial. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. 
I have one concern with randomness because it prevents from having repeatable and comparable tests and results. We can avoid randomness but still make it varying by some predictable function. I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system I agree. My point is not to use this as the benchmark. ok, so i don't plan to make cyclictest the benchmark but a benchmark among others because i'm not sure that we can cover all needs with only one benchmark. As an example, cyclictest gives information of the wake up latency that can't be collected with trace I would prefer to assess the impact on latency (and power) using a tool independent from benchmarks like cyclictest (e.g. use the reports from power sched). The reason is that once we have those tools/scripts in the kernel, a third party can run it on real workloads and provide the kernel developers with real numbers on performance vs power scheduling, regressions between kernel versions etc. We can't create a power model that you can run on an x86 for example and give you an indication of the power saving on ARM, you need to run the benchmarks on the actual hardware (that's why I don't think linsched is of much use from a power perspective). -- Catalin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH v5 03/14] sched: define pack buddy CPUs
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior for all architectures.

On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:

      |   Cluster 0 |   Cluster 1 |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

If the cores in a cluster can't be power gated independently, the buddy configuration becomes:

      |   Cluster 0 |   Cluster 1 |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU1 | CPU0 | CPU0 |

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  5 ++++
 3 files changed, 76 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 735e964..0bf5f4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5184,6 +5184,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11cd136..5547831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,76 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
 
+#ifdef CONFIG_SMP
+#ifdef CONFIG_SCHED_PACKING_TASKS
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points on the local sched_group
+		 * and this CPU of this local group is a good candidate
+		 */
+		id = cpu;
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+				&& (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..22e3f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,11 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern void update_packing_domain(int cpu);
+#else
+static inline void update_packing_domain(int cpu) {};
+#endif
+
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);
-- 
1.7.9.5
[RFC][PATCH v5 00/14] sched: packing tasks
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured

Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM platform

Vincent Guittot (14):
  sched: add a new arch_sd_local_flags for sched_domain init
  ARM: sched: clear SD_SHARE_POWERDOMAIN
  sched: define pack buddy CPUs
  sched: do load balance only with packing cpus
  sched: add a packing level knob
  sched: create a new field with available capacity
  sched: get CPU's activity statistic
  sched: move load idx selection in find_idlest_group
  sched: update the packing cpu list
  sched: init this_load to max in find_idlest_group
  sched: add a SCHED_PACKING_TASKS config
  sched: create a statistic structure
  sched: differantiate idle cpu
  cpuidle: set the current wake up latency

 arch/arm/include/asm/topology.h  |   4 +
 arch/arm/kernel/topology.c       |  50 +++-
 arch/ia64/include/asm/topology.h |   3 +-
 arch/tile/include/asm/topology.h |   3 +-
 drivers/cpuidle/cpuidle.c        |  11 ++
 include/linux/sched.h            |  13 +-
 include/linux/sched/sysctl.h     |   9 +
 include/linux/topology.h         |  11 +-
 init/Kconfig                     |  11 ++
 kernel/sched/core.c              |  11 +-
 kernel/sched/fair.c              | 395 ++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h             |   8 +-
 kernel/sysctl.c                  |  17 ++
 13 files changed, 521 insertions(+), 25 deletions(-)

-- 
1.7.9.5
[RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group
Init this_load to max value instead of 0 in find_idlest_group. If the local group is skipped because it doesn't have allowed CPUs, this_load stays at 0, no idlest group will be returned, and the selected CPU will be a non-allowed one (which will be replaced in select_fallback_rq by a random one). With a default value set to max, we will use the idlest group even if we skip the local_group.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9b03c1..2d9f782 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3532,7 +3532,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
-- 
1.7.9.5
[RFC][PATCH v5 09/14] sched: update the packing cpu list
Use the activity statistics to update the list of CPUs that should be used to handle the current system activity.

The cpu_power is updated for CPUs that don't participate in the packing effort. We consider that their cpu_power is allocated to idleness, as it could be allocated by rt. So the cpu_power that remains available for cfs is set to the min value (i.e. 1).

The cpu_power is used for a task that wakes up, because a waking-up task is already taken into account in the current activity, whereas we use the power_available for fork and exec because the task is not yet part of the current activity.

In order to quickly find the packing starting point, we save information that will be used to directly start with the right sched_group at the right sched_domain level instead of running the complete update_packing_domain algorithm each time we need to use the packing cpu list.

The sd_power_leader defines the leader of a group of CPUs that can't be power gated independently. As soon as this CPU is used, all the CPUs in the same group will be used, based on the fact that it isn't worth keeping some cores idle if they can't be power gated while one core in the group is running.

The sd_pack_group and sd_pack_domain are used to quickly check if a power leader should be used in the packing effort.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 149 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c258c38..f9b03c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -185,11 +185,20 @@ static unsigned long available_of(int cpu)
 }
 
 #ifdef CONFIG_SCHED_PACKING_TASKS
+struct sd_pack {
+	int my_buddy;			/* cpu on which tasks should be packed */
+	int my_leader;			/* cpu which leads the packing state of a group */
+	struct sched_domain *domain;	/* domain at which the check is done */
+	struct sched_group *group;	/* starting group for checking */
+};
+
 /*
- * Save the id of the optimal CPU that should be used to pack small tasks
- * The value -1 is used when no buddy has been found
+ * Save per_cpu information about the optimal CPUs that should be used to pack
+ * tasks.
  */
-DEFINE_PER_CPU(int, sd_pack_buddy);
+DEFINE_PER_CPU(struct sd_pack, sd_pack_buddy) = {
+	.my_buddy = -1,
+};
 
 /*
  * The packing level of the scheduler
@@ -202,6 +211,15 @@ int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
 
 unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
 
+static inline int get_buddy(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_buddy;
+}
+
+static inline int get_leader(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_leader;
+}
 
 int sched_proc_update_packing(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -219,13 +237,19 @@ int sched_proc_update_packing(struct ctl_table *table, int write,
 
 static inline bool is_packing_cpu(int cpu)
 {
-	int my_buddy = per_cpu(sd_pack_buddy, cpu);
+	int my_buddy = get_buddy(cpu);
 
 	return (my_buddy == -1) || (cpu == my_buddy);
 }
 
-static inline int get_buddy(int cpu)
+static inline bool is_leader_cpu(int cpu, struct sched_domain *sd)
 {
-	return per_cpu(sd_pack_buddy, cpu);
+	if (sd != per_cpu(sd_pack_buddy, cpu).domain)
+		return 0;
+
+	if (cpu != get_leader(cpu))
+		return 0;
+
+	return 1;
 }
 
 /*
@@ -239,7 +263,9 @@ static inline int get_buddy(int cpu)
 void update_packing_domain(int cpu)
 {
 	struct sched_domain *sd;
-	int id = -1;
+	struct sched_group *target = NULL;
+	struct sd_pack *pack = &per_cpu(sd_pack_buddy, cpu);
+	int id = cpu, pcpu = cpu;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
 	if (!sd)
@@ -247,6 +273,12 @@ void update_packing_domain(int cpu)
 	else
 		sd = sd->parent;
 
+	if (sd) {
+		pcpu = cpumask_first(sched_group_cpus(sd->groups));
+		if (pcpu != cpu)
+			goto end;
+	}
+
 	while (sd && (sd->flags & SD_LOAD_BALANCE)
 		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
 		struct sched_group *sg = sd->groups;
@@ -258,15 +290,16 @@ void update_packing_domain(int cpu)
 		 * and this CPU of this local group is a good candidate
 		 */
 		id = cpu;
+		target = pack;
 
 		/* loop the sched groups to find the best one */
 		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
-			if (tmp->sgp->power *
pack-group_weight - pack-sgp-power * tmp-group_weight) + if (tmp-sgp-power_available * pack-group_weight + pack-sgp-power_available * tmp-group_weight) continue
[RFC][PATCH v5 02/14] ARM: sched: clear SD_SHARE_POWERDOMAIN
The ARM platforms take advantage of packing tasks on few cores if the latters can be powergated independantly. We use DT and the cpu topology descirption to define at which level a core can be independantly powergated to the others and the SD_SHARE_POWERDOMAIN will be set accordingly at MC and CPU sched_domain level. The power-gate properties should be added with the value 1 in cpu and cluster node when then can power gate independantly from the other. As an example of a quad cores system which can power gate each core independantly, we should have a DT similar to the example below cpus { #address-cells = 1; #size-cells = 0; cpu-map { cluster0 { power-gate = 1; core0 { cpu = cpu0; power-gate = 1; }; core1 { cpu = cpu1; power-gate = 1; }; core2 { cpu = cpu2; power-gate = 1; }; core3 { cpu = cpu3; power-gate = 1; }; }; }; ... }; Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/include/asm/topology.h |4 arch/arm/kernel/topology.c | 50 ++- 2 files changed, 53 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index 58b8b84..5102847 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h @@ -5,12 +5,16 @@ #include linux/cpumask.h +#define CPU_CORE_GATE 0x1 +#define CPU_CLUSTER_GATE 0x2 + struct cputopo_arm { int thread_id; int core_id; int socket_id; cpumask_t thread_sibling; cpumask_t core_sibling; + int flags; }; extern struct cputopo_arm cpu_topology[NR_CPUS]; diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 85a8737..f38f1f9 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -24,6 +24,7 @@ #include asm/cputype.h #include asm/topology.h +#include asm/smp_plat.h /* * cpu power scale management @@ -79,6 +80,51 @@ unsigned long *__cpu_capacity; unsigned long middle_capacity = 1; +static int __init get_dt_power_topology(struct device_node *topo) +{ + const u32 *reg; + int len, power = 0; + int flag = CPU_CORE_GATE; + 
+ for (; topo; topo = of_get_next_parent(topo)) { + reg = of_get_property(topo, power-gate, len); + if (reg len == 4 be32_to_cpup(reg)) + power |= flag; + flag = 1; + } + + return power; +} + +#define for_each_subnode_with_property(dn, pn, prop_name) \ + for (dn = of_find_node_with_property(pn, prop_name); dn; \ +dn = of_find_node_with_property(dn, prop_name)) + +static void __init init_dt_power_topology(void) +{ + struct device_node *cn, *topo; + + /* Get power domain topology information */ + cn = of_find_node_by_path(/cpus/cpu-map); + if (!cn) { + pr_warn(Missing cpu-map node, bailing out\n); + return; + } + + for_each_subnode_with_property(topo, cn, cpu) { + struct device_node *cpu; + + cpu = of_parse_phandle(topo, cpu, 0); + if (cpu) { + u32 hwid; + + of_property_read_u32(cpu, reg, hwid); + cpu_topology[get_logical_index(hwid)].flags = get_dt_power_topology(topo); + + } + } +} + /* * Iterate all CPUs' descriptor in DT and compute the efficiency * (as per table_efficiency). Also calculate a middle efficiency @@ -151,6 +197,8 @@ static void __init parse_dt_topology(void) middle_capacity = ((max_capacity / 3) (SCHED_POWER_SHIFT-1)) + 1; + /* Retrieve power topology information from DT */ + init_dt_power_topology(); } /* @@ -283,7 +331,7 @@ void __init init_cpu_topology(void) cpu_topo-socket_id = -1; cpumask_clear(cpu_topo-core_sibling); cpumask_clear(cpu_topo-thread_sibling); - + cpu_topo-flags = 0; set_power_scale(cpu, SCHED_POWER_SCALE); } smp_wmb(); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More
[RFC][PATCH v5 05/14] sched: add a packing level knob
The knob is used to set an average load threshold that will be used to trigger the inclusion/removal of CPUs in the packing effort list.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 include/linux/sched/sysctl.h |  9 +
 kernel/sched/fair.c          | 26 ++
 kernel/sysctl.c              | 17 +
 3 files changed, 52 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..f41afa5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,14 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern int __read_mostly sysctl_sched_packing_level;
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+#endif
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
@@ -61,6 +69,7 @@ extern unsigned int sysctl_sched_shares_window;
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
+
 #endif
 #ifdef CONFIG_SCHED_DEBUG
 static inline unsigned int get_sysctl_timer_migration(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7149f38..5568980 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,32 @@ void sched_init_granularity(void)
  */
 DEFINE_PER_CPU(int, sd_pack_buddy);

+/*
+ * The packing level of the scheduler
+ *
+ * This level defines the activity % above which we should add another CPU to
+ * participate in the packing effort of the tasks
+ */
+#define DEFAULT_PACKING_LEVEL 80
+int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
+
+unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
+
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	if (sysctl_sched_packing_level)
+		sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
+
+	return 0;
+}
+
 static inline bool is_packing_cpu(int cpu)
 {
 	int my_buddy = per_cpu(sd_pack_buddy, cpu);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..77383fc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -255,11 +255,17 @@ static struct ctl_table sysctl_base_table[] = {
 	{ }
 };

+#ifdef CONFIG_SCHED_PACKING_TASKS
+static int min_sched_packing_level;
+static int max_sched_packing_level = 100;
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+
 #ifdef CONFIG_SCHED_DEBUG
 static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+
 #ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
@@ -279,6 +285,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHED_PACKING_TASKS
+	{
+		.procname	= "sched_packing_level",
+		.data		= &sysctl_sched_packing_level,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_proc_update_packing,
+		.extra1		= &min_sched_packing_level,
+		.extra2		= &max_sched_packing_level,
+	},
+#endif
 #ifdef CONFIG_SCHED_DEBUG
 	{
 		.procname	= "sched_min_granularity_ns",
-- 
1.7.9.5
[RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group
load_idx is used in find_idlest_group but initialized in select_task_rq_fair even when not used. The load_idx initialisation is moved into find_idlest_group and the sd_flag replaces it in the function's args.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e26f65..c258c38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3430,12 +3430,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;

+	if (sd_flag & SD_BALANCE_WAKE)
+		load_idx = sd->wake_idx;
+
 	do {
 		unsigned long load, avg_load;
 		int local_group, packing_cpus = 0;
@@ -3632,7 +3636,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}

 	while (sd) {
-		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;

@@ -3641,10 +3644,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}

-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
 			continue;
-- 
1.7.9.5
[RFC][PATCH v5 07/14] sched: get CPU's activity statistic
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db9b871..7e26f65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -179,6 +179,11 @@ void sched_init_granularity(void)
 }

 #ifdef CONFIG_SMP
+static unsigned long available_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_available;
+}
+
 #ifdef CONFIG_SCHED_PACKING_TASKS
 /*
  * Save the id of the optimal CPU that should be used to pack small tasks
@@ -3549,6 +3554,22 @@ done:
 	return target;
 }

+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);
+
+	if (sum == period) {
+		u32 overload = rq->nr_running > 1 ? 1 : 0;
+		return available_of(cpu) + overload;
+	}
+
+	return (sum * available_of(cpu)) / period;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -4430,6 +4451,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -4446,6 +4468,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;	/* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */

@@ -4465,6 +4488,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -4771,6 +4795,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}

 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
@@ -4894,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,

 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;

 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
-- 
1.7.9.5
[RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
The function arch_sd_local_flags is used to set flags in sched_domains according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is also created to reflect whether groups of CPUs in a sched_domain level can or not reach different power state. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. The cpu parameter of arch_sd_local_flags can be used by architecture to fine tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be set differently for groups of CPUs according to DT information Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |3 ++- arch/tile/include/asm/topology.h |3 ++- include/linux/sched.h|1 + include/linux/topology.h | 11 --- kernel/sched/core.c | 10 -- 5 files changed, 21 insertions(+), 7 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..4844896 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,7 +46,7 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ +#define SD_CPU_INIT(cpu) (struct sched_domain) { \ .parent = NULL, \ .child = NULL, \ .groups = NULL, \ @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0, cpu)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d15c0d8..6f7a97d 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -51,7 +51,7 @@ static inline const 
struct cpumask *cpumask_of_node(int node) */ /* sched_domains SD_CPU_INIT for TILE architecture */ -#define SD_CPU_INIT (struct sched_domain) {\ +#define SD_CPU_INIT(cpu) (struct sched_domain) { \ .min_interval = 4,\ .max_interval = 128, \ .busy_factor= 64, \ @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0, cpu) \ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 6682da3..2004cdb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -763,6 +763,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..f3cd3c2 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -85,7 +85,7 @@ int arch_update_cpu_topology(void); #define ARCH_HAS_SCHED_WAKE_IDLE /* Common values for SMT siblings */ #ifndef SD_SIBLING_INIT -#define SD_SIBLING_INIT (struct sched_domain) { \ +#define SD_SIBLING_INIT(cpu) (struct sched_domain) { \ .min_interval = 1,\ .max_interval = 2,\ .busy_factor= 64, \ @@ -99,6 +99,8 @@ int
[RFC][PATCH v5 14/14] cpuidle: set the current wake up latency
Save the current wake up latency of a core. This latency is not always the latency of a defined c-state but can also be an intermediate value when a core is ready to shut down (timer migration, cache flush ...) but waits for the last core of the cluster to finalize the cluster power down. This latter use case is not managed by the current version of the patch because it implies that the cpuidle drivers set the wake up latency instead of the cpuidle core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org

Conflicts:
	drivers/cpuidle/cpuidle.c
---
 drivers/cpuidle/cpuidle.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index d75040d..7b6553b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -19,6 +19,7 @@
 #include <linux/ktime.h>
 #include <linux/hrtimer.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>

 #include "cpuidle.h"
@@ -42,6 +43,12 @@ void disable_cpuidle(void)
 	off = 1;
 }

+static void cpuidle_set_current_state(int cpu, int latency)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+
+	atomic_set(&(stat->wake_latency), latency);
+}
 /**
  * cpuidle_play_dead - cpu off-lining
  *
@@ -79,12 +86,16 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	ktime_t time_start, time_end;
 	s64 diff;

+	cpuidle_set_current_state(dev->cpu, target_state->exit_latency);
+
 	time_start = ktime_get();

 	entered_state = target_state->enter(dev, drv, index);

 	time_end = ktime_get();

+	cpuidle_set_current_state(dev->cpu, 0);
+
 	local_irq_enable();

 	diff = ktime_to_us(ktime_sub(time_end, time_start));
-- 
1.7.9.5
[RFC][PATCH v5 06/14] sched: create a new field with available capacity
This new field power_available reflects the available capacity of a CPU unlike the cpu_power which reflects the current capacity. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 14 +++--- kernel/sched/sched.h |3 ++- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5568980..db9b871 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4584,15 +4584,19 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) if (!power) power = 1; + cpu_rq(cpu)-cpu_available = power; + sdg-sgp-power_available = power; + cpu_rq(cpu)-cpu_power = power; sdg-sgp-power = power; + } void update_group_power(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd-child; struct sched_group *group, *sdg = sd-groups; - unsigned long power; + unsigned long power, available; unsigned long interval; interval = msecs_to_jiffies(sd-balance_interval); @@ -4604,7 +4608,7 @@ void update_group_power(struct sched_domain *sd, int cpu) return; } - power = 0; + power = available = 0; if (child-flags SD_OVERLAP) { /* @@ -4614,6 +4618,8 @@ void update_group_power(struct sched_domain *sd, int cpu) for_each_cpu(cpu, sched_group_cpus(sdg)) power += power_of(cpu); + available += available_of(cpu); + } else { /* * !SD_OVERLAP domains can assume that child groups @@ -4623,11 +4629,13 @@ void update_group_power(struct sched_domain *sd, int cpu) group = child-groups; do { power += group-sgp-power; + available += group-sgp-power_available; group = group-next; } while (group != child-groups); } - sdg-sgp-power_orig = sdg-sgp-power = power; + sdg-sgp-power_orig = sdg-sgp-power_available = available; + sdg-sgp-power = power; } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 22e3f1d..d5a4ec0 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -459,6 +459,7 @@ struct rq { struct sched_domain *sd; unsigned long cpu_power; + unsigned long cpu_available; unsigned char 
idle_balance; /* For active balancing */ @@ -603,7 +604,7 @@ struct sched_group_power { * CPU power of this group, SCHED_LOAD_SCALE being max power for a * single CPU. */ - unsigned int power, power_orig; + unsigned int power, power_orig, power_available; unsigned long next_update; /* * Number of busy cpus in this group. -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH v5 04/14] sched: do load balance only with packing cpus
The tasks will be scheduled only on the CPUs that participate to the packing effort. A CPU participates to the packing effort when it is its own buddy. For ILB, look for an idle CPU close to the packing CPUs whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 80 --- 1 file changed, 76 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5547831..7149f38 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -186,6 +186,17 @@ void sched_init_granularity(void) */ DEFINE_PER_CPU(int, sd_pack_buddy); +static inline bool is_packing_cpu(int cpu) +{ + int my_buddy = per_cpu(sd_pack_buddy, cpu); + return (my_buddy == -1) || (cpu == my_buddy); +} + +static inline int get_buddy(int cpu) +{ + return per_cpu(sd_pack_buddy, cpu); +} + /* * Look for the best buddy CPU that can be used to pack small tasks * We make the assumption that it doesn't wort to pack on CPU that share the @@ -245,6 +256,32 @@ void update_packing_domain(int cpu) pr_debug(CPU%d packing on CPU%d\n, cpu, id); per_cpu(sd_pack_buddy, cpu) = id; } + +static int check_nohz_packing(int cpu) +{ + if (!is_packing_cpu(cpu)) + return true; + + return false; +} +#else /* CONFIG_SCHED_PACKING_TASKS */ + +static inline bool is_packing_cpu(int cpu) +{ + return 1; +} + +static inline int get_buddy(int cpu) +{ + return -1; +} + +static inline int check_nohz_packing(int cpu) +{ + return false; +} + + #endif /* CONFIG_SCHED_PACKING_TASKS */ #endif /* CONFIG_SMP */ @@ -3370,7 +3407,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, do { unsigned long load, avg_load; - int local_group; + int local_group, packing_cpus = 0; int i; /* Skip over this group if it has no CPUs allowed */ @@ -3392,8 +3429,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, load = target_load(i, load_idx); avg_load += 
load; + + if (is_packing_cpu(i)) + packing_cpus = 1; } + if (!packing_cpus) + continue; + /* Adjust by relative CPU power of the group */ avg_load = (avg_load * SCHED_POWER_SCALE) / group-sgp-power; @@ -3448,7 +3491,8 @@ static int select_idle_sibling(struct task_struct *p, int target) /* * If the prevous cpu is cache affine and idle, don't be stupid. */ - if (i != target cpus_share_cache(i, target) idle_cpu(i)) + if (i != target cpus_share_cache(i, target) idle_cpu(i) +is_packing_cpu(i)) return i; /* @@ -3463,7 +3507,8 @@ static int select_idle_sibling(struct task_struct *p, int target) goto next; for_each_cpu(i, sched_group_cpus(sg)) { - if (i == target || !idle_cpu(i)) + if (i == target || !idle_cpu(i) + || !is_packing_cpu(i)) goto next; } @@ -3528,9 +3573,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) } if (affine_sd) { - if (cpu != prev_cpu wake_affine(affine_sd, p, sync)) + if (cpu != prev_cpu (wake_affine(affine_sd, p, sync) + || !is_packing_cpu(prev_cpu))) prev_cpu = cpu; + if (!is_packing_cpu(prev_cpu)) + prev_cpu = get_buddy(prev_cpu); + new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } @@ -5593,7 +5642,26 @@ static struct { static inline int find_new_ilb(int call_cpu) { + struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask); + int buddy = get_buddy(call_cpu); + + /* +* If we have a pack buddy CPU, we try to run load balance on a CPU +* that is close to the buddy. +*/ + if (buddy != -1) { + for_each_domain(buddy, sd) { + if (sd-flags SD_SHARE_CPUPOWER) + continue; + + ilb = cpumask_first_and(sched_domain_span(sd), + nohz.idle_cpus_mask); + + if (ilb nr_cpu_ids) + break; + } + } if (ilb nr_cpu_ids idle_cpu(ilb)) return ilb; @@ -5874,6 +5942,10 @@ static inline int
[RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config
The SCHED_PACKING_TASKS config is used to enable the packing tasks mechanism.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 init/Kconfig | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 3ecd8a1..4d2b5db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1178,6 +1178,17 @@ config SCHED_AUTOGROUP
 	  desktop applications.  Task group autogeneration is currently based
 	  upon task session.

+config SCHED_PACKING_TASKS
+	bool "Automatic tasks packing"
+	depends on SMP && SCHED_MC
+	default n
+	help
+	  This option enables the packing mode of the scheduler.
+	  This mode ensures that the minimal number of CPUs will
+	  be used to handle the activity of the system. The CPUs
+	  are selected to minimize the number of power domains
+	  that must be kept on.
+
 config MM_OWNER
 	bool
-- 
1.7.9.5
[RFC][PATCH v5 12/14] sched: create a statistic structure
Create a statistic structure that will be used to share information with other frameworks like cpuidle and cpufreq. This structure only contains the current wake up latency of a core for now but could be extended with other useful information.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 include/linux/sched.h | 12 +++-
 kernel/sched/fair.c   |  5 +
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2004cdb..d676aa2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2,7 +2,7 @@
 #define _LINUX_SCHED_H

 #include <uapi/linux/sched.h>
-
+#include <linux/atomic.h>

 struct sched_param {
 	int sched_priority;
@@ -63,6 +63,16 @@ struct fs_struct;
 struct perf_event_context;
 struct blk_plug;

+/* This structure is used to share information and statistics with other
+ * frameworks. It only shares wake up latency for the moment but should be
+ * extended with other useful information
+ */
+struct sched_pm {
+	atomic_t wake_latency; /* time to wake up the cpu */
+};
+
+DECLARE_PER_CPU(struct sched_pm, sched_stat);
+
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d9f782..ad8b99a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,11 @@ void sched_init_granularity(void)
 	update_sysctl();
 }

+/* Save per_cpu information that will be shared with other frameworks */
+DEFINE_PER_CPU(struct sched_pm, sched_stat) = {
+	.wake_latency = ATOMIC_INIT(0)
+};
+
 #ifdef CONFIG_SMP
 static unsigned long available_of(int cpu)
 {
-- 
1.7.9.5
[RFC][PATCH v5 13/14] sched: differentiate idle cpu
The cost of waking up a core varies according to its current idle state. This includes the C-state and intermediate states when some sync between cores is required to reach a deep C-state. Waking up a CPU in a deep C-state to run a short task is not efficient from both a power and a performance point of view. We should take into account the wake up latency of an idle CPU when the scheduler looks for the best CPU to use for a waking task. The wake up latency of a CPU is computed into a load that can be directly compared with task load and other CPUs' load.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad8b99a..4863dad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -394,6 +394,20 @@ static int check_nohz_packing(int cpu)
 	return false;
 }

+int sched_get_idle_load(int cpu)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+	int latency = atomic_read(&(stat->wake_latency));
+	/*
+	 * Transform the current wakeup latency (us) into an idle load that
+	 * will be compared to task load to decide if it's worth to wake up
+	 * the cpu. The current formula is quite simple but gives a good
+	 * approximation in the range [0:10ms]
+	 */
+	return (latency * 21) >> 10;
+}
+
 #else /* CONFIG_SCHED_PACKING_TASKS */

 static inline bool is_packing_cpu(int cpu)
@@ -416,6 +430,10 @@ static inline int check_nohz_packing(int cpu)
 	return false;
 }

+static inline int sched_get_idle_load(int cpu)
+{
+	return 0;
+}
 #endif /* CONFIG_SCHED_PACKING_TASKS */
 #endif /* CONFIG_SMP */

@@ -3207,6 +3225,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
+	if (idle_cpu(cpu))
+		return sched_get_idle_load(cpu);
 	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }

@@ -3655,6 +3675,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			if (i == target || !idle_cpu(i) || !is_packing_cpu(i))
 				goto next;
+			if (weighted_cpuload(i) > p->se.avg.load_avg_contrib)
+				goto next;
 		}

 		target = cpumask_first_and(sched_group_cpus(sg),
-- 
1.7.9.5
Re: Bench for testing scheduler
On 8 November 2013 22:12, Rowand, Frank frank.row...@sonymobile.com wrote: On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guit...@linaro.org] wrote: On 8 November 2013 01:04, Rowand, Frank frank.row...@sonymobile.com wrote: snip The Avg figures look almost stable IMO. Are you speaking about the Max value for the inconsistency? The values on my laptop for -l 2000 are not stable. If I collapse all of the threads in each of the following tests to a single value I get the following table. Note that each thread completes a different number of cycles, so I calculate the average as: total count = T0_count + T1_count + T2_count + T3_count avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count min is the smallest min for any of the threads max is the largest max for any of the threads

              total
  test  T     count   min    avg    max
  ----  -     -----   ---   ----   ----
     1  4      5886     2   76.0   1017
     2  4      5881     2   71.5    810
     3  4      5885     2   74.2   1143
     4  4      5884     2   68.9   1279

test 1 average is 10% larger than test 4. test 4 maximum is 50% larger than test 2. But all of this is just a minor detail of how to run cyclictest. The more important question is whether to use cyclictest results as a valid workload or metric, so for the moment I won't comment further on the cyclictest parameters you used to collect the example data you provided. snip Thanks for clarifying how the data was calculated (below). Again, I don't think this level of detail is the most important issue at this point, but I'm going to comment on it while it is still fresh in my mind. Some questions about what these metrics are: The cyclictest data is reported per thread. How did you combine the per thread data to get a single latency and stddev value? Is Latency the average latency? Yes. I have described below the procedure I have followed to get my results: I run the same test (same parameters) several times (I have tried between 5 and 10 runs and the results were similar).
For each run, I compute the average of the per thread average figures and I compute the stddev between the per thread results. So the test run stddev is the standard deviation of the values for average latency of the 8 (???) cyclictest threads in a test run? I have used 5 threads for my tests. If so, I don't think that the calculated stddev has much actual meaning for comparing the algorithms (I do find it useful to get a loose sense of how consistent multiple test runs with the same parameters are). The results that I sent are an average of all runs with the same parameters. Then the stddev in the table is the average of the stddev in several test runs? Yes it is. The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100% of the average latency. That is rather large. Yes, I agree, and it's an interesting figure IMHO because it points out how the wake up of a core can impact the task scheduling latency and how it's possible to reduce it or make it more stable (even if we still have some large max values which are probably not linked to the wake up of a core but to other activities like deferrable timers that have fired). stddev is not reported by cyclictest. How did you create this value? Did you use the -v cyclictest option to report detailed data, then calculate stddev from the detailed data? No, I haven't used -v because it generates too many spurious wake ups that make the results irrelevant. Yes, I agree about not using -v. It was just a wild guess on my part since I did not know how stddev was calculated. And I was incorrectly guessing that stddev was describing the frequency distribution of the latencies from a single test run. I haven't been so precise in my computation, mainly because the outputs were almost coherent, but we probably need more precise statistics in a final step. As a general comment on cyclictest, I don't find average latency (in isolation) sufficient to compare different runs of cyclictest.
And the stddev of the frequency distribution of the latencies (which can be calculated from the -h data, with fairly low cyclictest overhead) is usually interesting but should be viewed with a healthy skepticism since that frequency distribution is often not a normal distribution. In addition to average latency, I normally look at maximum latency and the frequency distribution of latencies (in table or graph form). (One side effect of specifying -h is that the -d option is then ignored.) I'm going to have a look at the -h parameter, which can be useful to get a better view of the frequency distribution as you point out. Having the distance set to 0 (-d) can be an issue because we could have a synchronization of the wake up of the threads which will finally hide the real wake up latency. It's interesting to have a distance
Re: [RFC][PATCH v5 00/14] sched: packing tasks
On 11 November 2013 17:38, Peter Zijlstra pet...@infradead.org wrote: On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote: My understanding from the recent discussions is that the scheduler should decide directly on the C-state (or rather the deepest C-state possible since we don't want to duplicate the backend logic for synchronising CPUs going up or down). This means that the scheduler needs to know about C-state target residency, wake-up latency (I think we can leave coupled C-states to the backend, there is some complex synchronisation which I wouldn't duplicate). Alternatively (my preferred approach), we get the scheduler to predict and pass the expected residency and latency requirements down to a power driver and read back the actual C-states for making task placement decisions. Some of the menu governor prediction logic could be turned into a library and used by the scheduler. Basically what this tries to achieve is better scheduler awareness of the current C-states decided by a cpuidle/power driver based on the scheduler constraints. Ah yes.. so I _think_ the scheduler wants to eventually know about idle topology constraints. But we can get there in a gradual fashion I hope. Like the package C states on x86 -- for those to be effective the scheduler needs to pack tasks and keep entire packages idle for as long as possible. That's the purpose of patches 12, 13 and 14. To get the current wakeup latency of a core and use it when selecting a core
Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
On 12 November 2013 11:32, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote: +int sched_proc_update_packing(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + + if (sysctl_sched_packing_level) + sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level; + + return 0; +} +#ifdef CONFIG_SCHED_PACKING_TASKS +static int min_sched_packing_level; +static int max_sched_packing_level = 100; +#endif /* CONFIG_SMP */ +#ifdef CONFIG_SCHED_PACKING_TASKS + { + .procname = "sched_packing_level", + .data = &sysctl_sched_packing_level, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sched_proc_update_packing, + .extra1 = &min_sched_packing_level, + .extra2 = &max_sched_packing_level, + }, +#endif Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and expect something; but then we don't update sd_pack_threshold so nothing really changed. Value 0 is used to disable the packing feature and the scheduler falls back to the default behavior. This value is tested when setting which cpus will be used by the scheduler.
Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
On 12 November 2013 11:55, Peter Zijlstra pet...@infradead.org wrote: On Tue, Nov 12, 2013 at 11:44:15AM +0100, Vincent Guittot wrote: On 12 November 2013 11:32, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote: +int sched_proc_update_packing(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + + if (sysctl_sched_packing_level) + sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level; + + return 0; +} +#ifdef CONFIG_SCHED_PACKING_TASKS +static int min_sched_packing_level; +static int max_sched_packing_level = 100; +#endif /* CONFIG_SMP */ +#ifdef CONFIG_SCHED_PACKING_TASKS + { + .procname = "sched_packing_level", + .data = &sysctl_sched_packing_level, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sched_proc_update_packing, + .extra1 = &min_sched_packing_level, + .extra2 = &max_sched_packing_level, + }, +#endif Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and expect something; but then we don't update sd_pack_threshold so nothing really changed. Value 0 is used to disable the packing feature and the scheduler falls back to the default behavior. This value is tested when setting which cpus will be used by the scheduler. I suspected as much, but it wasn't clear from the Changelog, the patch or any comments. Plz as to fix. Ok, I will add it
Re: [RFC][PATCH v5 06/14] sched: create a new field with available capacity
On 12 November 2013 11:34, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:19PM +0200, Vincent Guittot wrote: This new field power_available reflects the available capacity of a CPU unlike the cpu_power which reflects the current capacity. - sdg->sgp->power_orig = sdg->sgp->power = power; + sdg->sgp->power_orig = sdg->sgp->power_available = available; + sdg->sgp->power = power; This patch leaves me confused as to power_available vs power_orig and the Changelog doesn't really clarify anything much at all. Ok, I will add more details in the changelog. power_orig can only be modified for SMT purposes, otherwise it stays at 1024, whereas power_available takes into account the modifications that have been done by the platform. I can probably re-factor that and merge power_orig and power_available. As of now, arch_scale_smt_power and arch_scale_freq_power are not used simultaneously by an architecture so I can probably move the returned value of arch_scale_freq_power into power_orig. This would even make more sense regarding its current use.
Re: [RFC][PATCH v5 00/14] sched: packing tasks
On 11 November 2013 12:33, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (cross-posting to linux-pm as it was agreed to follow up on this list) snip So, IMO, defining the power topology is a good starting point and I think it's better to separate the patches from the energy saving algorithms like packing. We need to agree on what information we have Daniel and Tuukka, who are working on cpuidle consolidation in the scheduler, are also interested in using topology information similar to mine. I have made one patchset only because the information was only used here. So I can probably make a separate patchset of the power topology description in DT. Vincent (C-state details, coupling, power gating) and what we can/need to expose to the scheduler. This can be revisited once we start implementing/refining the energy awareness. snip -- Catalin
[RFC 0/6] rework sched_domain topology description
This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into 3 different patchsets (at least) to make things easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures add a specific level like the current BOOK level or the proposed power gating level for the ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c | 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) -- 1.7.9.5
[RFC 3/6] sched: s390: create a dedicated topology table
BOOK level is only relevant for s390 so we create a dedicated topology table with BOOK level and remove it from the default table. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/s390/include/asm/topology.h | 11 +-- arch/s390/kernel/topology.c | 25 + kernel/sched/core.c | 3 --- 3 files changed, 26 insertions(+), 13 deletions(-) diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h index 07763bd..56af530 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS]; #define mc_capable() 1 -static inline const struct cpumask *cpu_coregroup_mask(int cpu) -{ - return &cpu_topology[cpu].core_mask; -} - -static inline const struct cpumask *cpu_book_mask(int cpu) -{ - return &cpu_topology[cpu].book_mask; -} - int topology_cpu_init(struct cpu *); int topology_set_cpu_management(int fc); void topology_schedule_update(void); void store_topology(struct sysinfo_15_1_x *info); void topology_expect_change(void); +const struct cpumask *cpu_coregroup_mask(int cpu); #else /* CONFIG_SCHED_BOOK */ diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c index 4b2e3e3..8a7ac47 100644 --- a/arch/s390/kernel/topology.c +++ b/arch/s390/kernel/topology.c @@ -443,6 +443,28 @@ int topology_cpu_init(struct cpu *cpu) return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group); } +const struct cpumask *cpu_coregroup_mask(int cpu) +{ + return &cpu_topology[cpu].core_mask; +} + +static const struct cpumask *cpu_book_mask(int cpu) +{ + return &cpu_topology[cpu].book_mask; +} + +static struct sched_domain_topology_level s390_topology[] = { + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, + { cpu_book_mask, SD_INIT_NAME(BOOK) }, + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = s390_topology; +} + static int __init topology_init(void) { if (!MACHINE_HAS_TOPOLOGY) { @@ -452,6 +474,9 @@ static int __init topology_init(void) set_topology_timer(); out: update_cpu_masks(); + + set_sched_topology(); + return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching); } device_initcall(topology_init); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6f6a8f0..3479467 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6043,9 +6043,6 @@ static struct sched_domain_topology_level default_topology[] = { #ifdef CONFIG_SCHED_MC { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, #endif -#ifdef CONFIG_SCHED_BOOK - { cpu_book_mask, SD_INIT_NAME(BOOK) }, -#endif { cpu_cpu_mask, SD_INIT_NAME(DIE) }, { NULL, }, }; -- 1.7.9.5
[RFC 6/6] sched: ARM: create a dedicated scheduler topology table
Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can or cannot power gate independently from others. The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/kernel/topology.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 0bc94b1..ae8ffbc 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu) return &cpu_topology[cpu].core_sibling; } +/* + * The current assumption is that we can power gate each core independently. + * This will be superseded by DT binding once available. + */ +const struct cpumask *cpu_corepower_mask(int cpu) +{ + return &cpu_topology[cpu].thread_sibling; +} + static void update_siblings_masks(unsigned int cpuid) { struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); } +static struct sched_domain_topology_level arm_topology[] = { +#ifdef CONFIG_SCHED_MC + { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) }, + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = arm_topology; +} + /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array @@ -289,4 +312,7 @@ void __init init_cpu_topology(void) smp_wmb(); parse_dt_topology(); + + /* Set scheduler topology descriptor */ + set_sched_topology(); } -- 1.7.9.5
[RFC 4/6] sched: powerpc: create a dedicated topology table
Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology. Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence and before the build of the sched_domain topology. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/powerpc/kernel/smp.c | 35 +++ kernel/sched/core.c | 6 -- 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..75da054 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric SMT dependency */ +static const struct cpumask *cpu_asmt_mask(int cpu) +{ + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + return topology_thread_cpumask(cpu); + } + return cpumask_of(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) }, + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = powerpc_topology; +} + void __init smp_cpus_done(unsigned int max_cpus) { cpumask_var_t old_mask; @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -} + set_sched_topology(); -int arch_sd_sibling_asym_packing(void) -{ - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - return SD_ASYM_PACKING; - } - return 0; } #ifdef CONFIG_HOTPLUG_CPU diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3479467..7606de0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight); } -int __weak arch_sd_sibling_asym_packing(void) -{ - return 0*SD_ASYM_PACKING; -} - /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) if (sd->flags & SD_SHARE_CPUPOWER) { sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ - sd->flags |= arch_sd_sibling_asym_packing(); } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { sd->imbalance_pct = 117; -- 1.7.9.5
[RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can or cannot reach different power states. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used to add a load balancing level between groups of CPUs that can power gate independently. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- include/linux/sched.h | 1 + kernel/sched/core.c | 9 ++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index dbc35dd..182a080 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -861,6 +861,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7606de0..b28cff0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5283,7 +5283,8 @@ static int sd_degenerate(struct sched_domain *sd) SD_BALANCE_FORK | SD_BALANCE_EXEC | SD_SHARE_CPUPOWER | -SD_SHARE_PKG_RESOURCES)) { +SD_SHARE_PKG_RESOURCES | +SD_SHARE_POWERDOMAIN)) { if (sd->groups != sd->groups->next) return 0; } @@ -5314,7 +5315,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) SD_BALANCE_EXEC | SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | - SD_PREFER_SIBLING); + SD_PREFER_SIBLING | + SD_SHARE_POWERDOMAIN); if (nr_node_ids == 1) pflags &= ~SD_SERIALIZE; } @@ -5932,7 +5934,8 @@ static struct cpumask ***sched_domains_numa_masks; (SD_SHARE_CPUPOWER |\ SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ -SD_ASYM_PACKING) +SD_ASYM_PACKING | \ +SD_SHARE_POWERDOMAIN) static struct sched_domain * sd_init(struct sched_domain_topology_level *tl, int cpu) -- 1.7.9.5
[PATCH 2/6] sched: rework of sched_domain topology definition
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add a new level where a load balance makes sense, like a BOOK or powergating level. For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its child. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h | 24 arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 29 + include/linux/topology.h | 128 +++-- kernel/sched/core.c | 227 +++--- 6 files changed, 156 insertions(+), 287 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..20d12fa 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,30 +46,6 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 1,\ - .max_interval = 4,\ - .busy_factor = 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx = 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_NEWIDLE\ - | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_AFFINE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ -} - #endif /* CONFIG_NUMA */ #ifdef CONFIG_SMP diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void) }; #endif -#define SD_BOOK_INIT SD_CPU_INIT - #include asm-generic/topology.h #endif /* _ASM_S390_TOPOLOGY_H */ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d15c0d8..9383118 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node) /* For now, use numa node -1 for global allocation. */ #define pcibus_to_node(bus)((void)(bus), -1) -/* - * TILE architecture has many cores integrated in one processor, so we need - * setup bigger balance_interval for both CPU/NODE scheduling domains to - * reduce process scheduling costs. - */ - -/* sched_domains SD_CPU_INIT for TILE architecture */ -#define SD_CPU_INIT (struct sched_domain) {\ - .min_interval = 4,\ - .max_interval = 128, \ - .busy_factor= 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 1,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - \ - .flags = 1*SD_LOAD_BALANCE \ - | 1*SD_BALANCE_NEWIDLE \ - | 1*SD_BALANCE_EXEC \ - | 1*SD_BALANCE_FORK
[RFC 1/6] sched: remove unused SCHED_INIT_NODE
Not used since the new NUMA scheduler init sequence. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/metag/include/asm/topology.h | 27 --- 1 file changed, 27 deletions(-) diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h index 8e9c0b3..e95f874 100644 --- a/arch/metag/include/asm/topology.h +++ b/arch/metag/include/asm/topology.h @@ -3,33 +3,6 @@ #ifdef CONFIG_NUMA -/* sched_domains SD_NODE_INIT for Meta machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8,\ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 3,\ - .idle_idx = 2,\ - .newidle_idx = 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC \ - | SD_BALANCE_NEWIDLE\ - | SD_SERIALIZE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ - .max_newidle_lb_cost = 0,\ - .next_decay_max_lb_cost = jiffies, \ -} - #define cpu_to_node(cpu) ((void)(cpu), 0) #define parent_node(node) ((void)(node), 0) -- 1.7.9.5
Re: [PATCH 2/6] sched: rework of sched_domain topology definition
On 6 March 2014 01:09, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add a new level where a load balance makes sense, like a BOOK or powergating level. For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of Maybe it's worth mentioning here that those flags are from the set of the sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface. Yes, I will add the list of flags that can be set with the table. the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its child. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h | 24 arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 29 + include/linux/topology.h | 128 +++-- kernel/sched/core.c | 227 +++--- 6 files changed, 156 insertions(+), 287 deletions(-) [snip] - -#define for_each_sd_topology(tl) \ - for (tl = sched_domain_topology; tl->init; tl++) Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
It should not be, and neither should its use in sd_init. +static int sched_domains_curr_level; #ifdef CONFIG_NUMA - static int sched_domains_numa_levels; static int *sched_domains_numa_distance; static struct cpumask ***sched_domains_numa_masks; -static int sched_domains_curr_level; - -static inline int sd_local_flags(int level) -{ - if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) - return 0; +#endif - return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; -} +/* + * SD_flags allowed in topology descriptions. + * + * SD_SHARE_CPUPOWER - describes SMT topologies + * SD_SHARE_PKG_RESOURCES - describes shared caches + * SD_NUMA - describes NUMA topologies + * + * Odd one out: + * SD_ASYM_PACKING - describes SMT quirks + */ +#define TOPOLOGY_SD_FLAGS \ + (SD_SHARE_CPUPOWER |\ +SD_SHARE_PKG_RESOURCES | \ +SD_NUMA | \ +SD_ASYM_PACKING) static struct sched_domain * -sd_numa_init(struct sched_domain_topology_level *tl, int cpu) +sd_init(struct sched_domain_topology_level *tl, int cpu) { struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu); - int level = tl->numa_level; - int sd_weight = cpumask_weight( - sched_domains_numa_masks[level][cpu_to_node(cpu)]); + int sd_weight; + Next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though. + /* +* Ugly hack to pass state to sd_numa_mask()... +*/ + sched_domains_curr_level = tl->numa_level; + + sd_weight = cpumask_weight(tl->mask(cpu)); + + if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS, + "wrong sd_flags in topology description\n")) + tl->sd_flags &= ~TOPOLOGY_SD_FLAGS; *sd = (struct sched_domain){ .min_interval = sd_weight, .max_interval = 2*sd_weight, .busy_factor = 32, .imbalance_pct = 125, - .cache_nice_tries = 2, - .busy_idx = 3, - .idle_idx = 2, + + .cache_nice_tries = 0, + .busy_idx = 0, + .idle_idx = 0, .newidle_idx = 0, .wake_idx = 0, .forkexec_idx = 0, Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out before.
This is true for all data members which are set to 0 later in this function, including the | 0*SD_FOO. IMHO, leaving them out would make the code more readable. I would say that it makes the configuration more readable and modifiable because you have the list of possible flags to set: .flags = 1*SD_LOAD_BALANCE | 1*SD_BALANCE_NEWIDLE - | 0*SD_BALANCE_EXEC
Re: [RFC 6/6] sched: ARM: create a dedicated scheduler topology table
On 6 March 2014 06:38, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can powergate independently from the others from those that cannot. The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org [snip] parse_dt_topology(); + + /* Set scheduler topology descriptor */ + set_sched_topology(); } How about the core scheduler provides an interface set_sched_topology() instead of each arch having its own __init function? Sketched out below for ARM. yes, i will add it -- >8 -- Subject: [PATCH] sched: set_sched_topology() as an interface of core scheduler --- arch/arm/kernel/topology.c | 7 +-- include/linux/sched.h | 2 +- kernel/sched/core.c | 5 + 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index ae8ffbc..89d5592 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -284,11 +284,6 @@ static struct sched_domain_topology_level arm_topology[] = { { NULL, }, }; -static void __init set_sched_topology(void) -{ - sched_domain_topology = arm_topology; -} - /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array */ @@ -314,5 +309,5 @@ void __init init_cpu_topology(void) parse_dt_topology(); /* Set scheduler topology descriptor */ - set_sched_topology(); + set_sched_topology(arm_topology); } diff --git a/include/linux/sched.h b/include/linux/sched.h index 8831413..fefd4e7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -998,7 +998,7 @@ struct sched_domain_topology_level { #endif }; -extern struct sched_domain_topology_level *sched_domain_topology; +extern void set_sched_topology(struct sched_domain_topology_level *tl); #ifdef CONFIG_SCHED_DEBUG # define SD_INIT_NAME(type) .name = #type
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b4fb0df..a748c92 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6105,6 +6105,11 @@ static struct sched_domain_topology_level default_topology[] = { struct sched_domain_topology_level *sched_domain_topology = default_topology; +void set_sched_topology(struct sched_domain_topology_level *tl) +{ + sched_domain_topology = tl; +} + #define for_each_sd_topology(tl) \ for (tl = sched_domain_topology; tl->mask; tl++) -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/6] rework sched_domain topology description
On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures add a specific level like the current BOOK level or the proposed power gating level for the ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c | 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level.
int (*sched_domain_flags_f)(int cpu); We have to create additional level for some kind of topology as described in my trial https://lkml.org/lkml/2013/12/18/279 which is not possible with function pointer. Have you got a situation in mind where it will be necessary to use the function pointer with cpu number as an argument ? In the current example of this patchset, the flags are statically set in the table but nothing prevents an architecture to update the flags value before being given to the scheduler This function pointer would be simply another member of struct sched_domain_topology_level and would replace int sd_flags. AFAICS, you have to create additional cpu mask functions anyway for the additional sd levels, like cpu_corepower_mask() for the GMC level in the ARM case. There could be a set of standard sd topology flags function for the default sd layer and archs which want to pass in something special define those function locally since they will use them only in their arch specific struct sched_domain_topology_level table anyway. I know that you use the existing sd degenerate functionality for this and that the freeing of the redundant data structures (sched_domain, sched_group and sched_group_power) is there too but it still doesn't seem to me to be the right thing to do. The problem that we now expose internal data structures (struct sd_data and struct sched_domain_topology_level) could be dealt with later. -- Dietmar -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/6] rework sched_domain topology description
On 6 March 2014 20:31, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 06/03/14 09:04, Vincent Guittot wrote: On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures to add specific level like the current BOOK level or the proposed power gating level for ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c| 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. 
Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level. int (*sched_domain_flags_f)(int cpu); We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer. This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpu 0-7 can powergate while cpu 8-15 can't), why can't we have The 2nd example is a bit more complicated and needs an additional level to describe the topology static inline int cpu_coregroup_flags(int cpu) { int flags = SD_SHARE_PKG_RESOURCES; if (cpu >= 8) flags |= SD_SHARE_POWERDOMAIN; return flags; } inside the arch specific topology file and use it in the MC level as the int (*sched_domain_flags_f)(int cpu) member of struct sched_domain_topology_level? Have you got a situation in mind where it will be necessary to use the function pointer with the cpu number as an argument? The one above. Fundamentally all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example but it's the same as powergated/!powergated in this perspective. You see the flags of a level as something that can be changed per cpu, whereas the proposal is to define a number of levels with interesting properties for the scheduler and to associate a mask of cpus to these properties. I don't have a strong opinion about using or not a cpu argument for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of sched_domain). Nevertheless, I see one potential concern: you can have completely different flags configurations for the same sd level on 2 cpus.
Peter, was the use of the cpu as a parameter in the initialization of the sched_domain's flags a reason for asking for a rework of the sched_domain initialization? In the current example of this patchset, the flags are statically set in the table but nothing prevents an architecture from updating the flags value before it is given to the scheduler What will be the infrastructure if the arch wants to do so? I'm not sure I catch your point. The arch is now able to define its own table and fill it before giving it to the scheduler, so nothing prevents it from setting/updating the flags field according to the system configuration during the boot sequence. Thanks, Vincent Thanks, -- Dietmar
Re: [RFC 0/6] rework sched_domain topology description
On 8 March 2014 13:40, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 07/03/14 02:47, Vincent Guittot wrote: On 6 March 2014 20:31, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 06/03/14 09:04, Vincent Guittot wrote: On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures to add specific level like the current BOOK level or the proposed power gating level for ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c| 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. 
Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level. int (*sched_domain_flags_f)(int cpu); We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer. This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpu 0-7 can powergate while cpu 8-15 can't), why can't we have The 2nd example is a bit more complicated and needs an additional level to describe the topology I see. In the 2nd example you want to have an additional MC level for cpu 2-7 because you want to do load balance between those cpus more often than for cpu 0-7, for dst cpus from the set (2-7). Not sure if such topologies (only a subset of cpus inside a cluster can't be powergated) exist today in the real world though? Be sure that such a topology is studied and considered by HW guys from https://lkml.org/lkml/2013/12/18/279: CPU2: domain 0: span 2-3 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 0 1 -- Doesn't this have to be 'groups: 2 3' domain 1: span 2-7 level: MC flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 2-7 4-5 6-7 domain 2: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 2-7 0-1 domain 3: span 0-15 level: CPU flags: groups: 0-7 8-15 static inline int cpu_coregroup_flags(int cpu) { int flags = SD_SHARE_PKG_RESOURCES; if (cpu >= 8) flags |= SD_SHARE_POWERDOMAIN; return flags; } inside the arch specific topology file and use it in the MC level as the int (*sched_domain_flags_f)(int cpu) member of struct sched_domain_topology_level?
Have you got a situation in mind where it will be necessary to use the function pointer with cpu number as an argument ? The one above. Fundamentally all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example but it's the same as powergated/!powergated in this perspective. You see the flag of a level as something that can be changed per cpu whereas the proposal is to define a number of level with interesting properties for the scheduler and to associate a mask of cpus to this properties. That's right. I don't have a strong opinion about using or not a cpu argument for setting the flags of a level (it was part of the initial
Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
On 11 March 2014 11:08, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi Vincent, On 03/05/2014 12:48 PM, Vincent Guittot wrote: Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology. Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence and before the build of the sched_domain topology. Is the below what you mean as the other solution? If it is so, I would strongly recommend this approach rather than adding another level to the topology table to represent the asymmetric behaviour. +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; Not exactly like that but something like below +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ +#ifdef CONFIG_SCHED_SMT + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing(); +#endif + sched_domain_topology = powerpc_topology; +} Regards Preeti U Murthy Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/powerpc/kernel/smp.c | 35 +++ kernel/sched/core.c | 6 -- 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..75da054 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric
SMT dependency */ +static const struct cpumask *cpu_asmt_mask(int cpu) +{ + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + return topology_thread_cpumask(cpu); + } + return cpumask_of(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) }, + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = powerpc_topology; +} + void __init smp_cpus_done(unsigned int max_cpus) { cpumask_var_t old_mask; @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -} + set_sched_topology(); -int arch_sd_sibling_asym_packing(void) -{ - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - return SD_ASYM_PACKING; - } - return 0; } #ifdef CONFIG_HOTPLUG_CPU diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3479467..7606de0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight); } -int __weak arch_sd_sibling_asym_packing(void) -{ - return 0*SD_ASYM_PACKING; -} - /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() */ @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) if (sd->flags & SD_SHARE_CPUPOWER) { sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ - sd->flags |= arch_sd_sibling_asym_packing(); } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { sd->imbalance_pct = 117;
Re: [PATCH 2/6] sched: rework of sched_domain topology definition
On 11 March 2014 11:31, Peter Zijlstra pet...@infradead.org wrote: On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote: Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one(). In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, i can come back to CPU Yeah, not sure DIE is the right thing either; because there's multi-die packages that get classified under CPU :-) Take for example the Core 2 Quad, which was 2 dual core dies glued together in a single package. There's also the AMD bulldozer which glued two dies into a single package; but for those its not a problem because each die is a separate numa node, so there DIE would actually be the correct term and PACKAGE would be wrong. So while CPU sucks, I'm not sure we can come up with anything that's actually correct. That said; we could try for something less wrong than CPU :-) OK Dietmar, Have you got another naming than DIE that could suit better? otherwise i will keep it Vincent I'm not sure there are a lot of people who see/know the names of these domains to be bothered by a change in them; it might be limited to just us for all I know.
[PATCH] sched: fix: remove double calculation in fix_small_imbalance
The tmp value has already been calculated in: scaled_busy_load_per_task = (busiest->load_per_task * SCHED_POWER_SCALE) / busiest->group_power; Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1eedae..b301918 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6061,12 +6061,10 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) pwr_now /= SCHED_POWER_SCALE; /* Amount of load we'd subtract */ - tmp = (busiest->load_per_task * SCHED_POWER_SCALE) / - busiest->group_power; - if (busiest->avg_load > tmp) { + if (busiest->avg_load > scaled_busy_load_per_task) { pwr_move += busiest->group_power * min(busiest->load_per_task, - busiest->avg_load - tmp); + busiest->avg_load - scaled_busy_load_per_task); } /* Amount of load we'd add */ -- 1.9.0