Re: [PATCH v2 1/2] sched: fix init NOHZ_IDLE flag
On 19 February 2013 11:29, Vincent Guittot vincent.guit...@linaro.org wrote, replying to Frederic Weisbecker fweis...@gmail.com (the original exchange is pasted at http://pastebin.com/DMm5U8J8):

Vincent: We can clear the idle flag only in nohz_kick_needed(), which will not be called if the sched_domain is NULL, so the sequence will be:

= CPU 0 =                               = CPU 1 =
detach_and_destroy_domain {
    rcu_assign_pointer(cpu1_dom, NULL);
}
dom = new_domain(...) {
    nr_cpus_busy = 0;
    set_idle(CPU 1);
}
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom == NULL, return
rcu_assign_pointer(cpu1_dom, dom);
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom != NULL
                                        nohz_kick_needed {
                                            set_idle(CPU 1)
                                            dom = rcu_dereference(cpu1_dom)
                                            // dec nr_cpus_busy
                                        }

Frederic: Ok, but CPU 0 can assign NULL to the domain of CPU 1 while CPU 1 is already in the middle of nohz_kick_needed().

Vincent: Yes, nothing prevents the sequence below from occurring:

= CPU 0 =                               = CPU 1 =
                                        dom = rcu_dereference(cpu1_dom)
                                        // dom != NULL
detach_and_destroy_domain {
    rcu_assign_pointer(cpu1_dom, NULL);
}
dom = new_domain(...) {
    nr_cpus_busy = 0; // nr_cpus_busy in the new_dom
    set_idle(CPU 1);
}
                                        nohz_kick_needed {
                                            clear_idle(CPU 1)
                                            dom = rcu_dereference(cpu1_dom)
                                            // cpu1_dom == old_dom
                                            inc nr_cpus_busy
                                            // nr_cpus_busy in the old_dom
                                        }
rcu_assign_pointer(cpu1_dom, dom);
// cpu1_dom == new_dom

The sequence above is not correct, in addition to having become unreadable after going through gmail; the correct and readable version is at https://pastebin.linaro.org/1750/

Vincent: I'm not sure that this can happen in practice, because CPU 1 is in an interrupt handler, but we don't have any mechanism to prevent the sequence.
The NULL sched_domain can be used to detect this situation, and the set_cpu_sd_state_busy function can be modified like below:

static inline void set_cpu_sd_state_busy(void)
{
	struct sched_domain *sd;
	int cpu = smp_processor_id();
+	int clear = 0;

	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));

	rcu_read_lock();
	for_each_domain(cpu, sd) {
		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
	}
	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));
}

The NOHZ_IDLE flag will not be cleared if a NULL sched_domain is attached to the CPU. With this implementation, we still don't need to get the sched_domain for testing the NOHZ_IDLE flag, which occurs each time a CPU becomes idle. Patch 2 becomes useless.

Vincent
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH v4] sched: fix init NOHZ_IDLE flag
On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag.

During the initialization of the sched_domain, we set the NOHZ_IDLE flag when nr_busy_cpus is initialized to 0 in order to have a coherent configuration. If a CPU enters idle and calls set_cpu_sd_state_idle during the build of the new sched_domain, it will not corrupt the initial state. set_cpu_sd_state_busy is modified and clears NOHZ_IDLE only if a non-NULL sched_domain is attached to the CPU (which is the case during the rebuild).

Changes since V3:
- NOHZ flag is not cleared if a NULL domain is attached to the CPU
- Remove patch 2/2, which becomes useless with the latest modifications

Changes since V2:
- change the initialization to idle state instead of busy state so that a CPU that enters idle during the build of the sched_domain will not corrupt the initialization state

Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it was a side effect of the other use cases.
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    4 +++-
 kernel/sched/fair.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c730a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5884,7 +5884,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 		return;
 
 	update_group_power(sd, cpu);
-	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	atomic_set(&sg->sgp->nr_busy_cpus, 0);
+	set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..2701a92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5403,15 +5403,20 @@ static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
 	int cpu = smp_processor_id();
+	int clear = 0;
 
 	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
 		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 
 	rcu_read_lock();
-	for_each_domain(cpu, sd)
+	for_each_domain(cpu, sd) {
 		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
+	}
 	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 }
 
 void set_cpu_sd_state_idle(void)
-- 
1.7.9.5
Re: [PATCH v4] sched: fix init NOHZ_IDLE flag
On 22 February 2013 13:32, Frederic Weisbecker fweis...@gmail.com wrote:
On Thu, Feb 21, 2013 at 09:29:16AM +0100, Vincent Guittot wrote:

On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag.

During the initialization of the sched_domain, we set the NOHZ_IDLE flag when nr_busy_cpus is initialized to 0 in order to have a coherent configuration. If a CPU enters idle and calls set_cpu_sd_state_idle during the build of the new sched_domain, it will not corrupt the initial state. set_cpu_sd_state_busy is modified and clears NOHZ_IDLE only if a non-NULL sched_domain is attached to the CPU (which is the case during the rebuild).

Changes since V3:
- NOHZ flag is not cleared if a NULL domain is attached to the CPU
- Remove patch 2/2, which becomes useless with the latest modifications

Changes since V2:
- change the initialization to idle state instead of busy state so that a CPU that enters idle during the build of the sched_domain will not corrupt the initialization state

Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it was a side effect of the other use cases.
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    4 +++-
 kernel/sched/fair.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c730a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5884,7 +5884,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 		return;
 
 	update_group_power(sd, cpu);
-	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	atomic_set(&sg->sgp->nr_busy_cpus, 0);
+	set_bit(NOHZ_IDLE, nohz_flags(cpu));
+
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..2701a92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5403,15 +5403,20 @@ static inline void set_cpu_sd_state_busy(void)
 {
 	struct sched_domain *sd;
 	int cpu = smp_processor_id();
+	int clear = 0;
 
 	if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
 		return;
-	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 
 	rcu_read_lock();
-	for_each_domain(cpu, sd)
+	for_each_domain(cpu, sd) {
 		atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+		clear = 1;
+	}
 	rcu_read_unlock();
+
+	if (likely(clear))
+		clear_bit(NOHZ_IDLE, nohz_flags(cpu));

Frederic: I fear there is still a race window:

= CPU 0 =                               = CPU 1 =
                                        // NOHZ_IDLE is set
                                        set_cpu_sd_state_busy() {
                                            dom1 = rcu_dereference(dom1);
                                            inc(dom1->nr_busy_cpus)
rcu_assign_pointer(dom1, NULL)
// create new domain
init_sched_group_power() {
    atomic_set(&tmp->nr_busy_cpus, 0);
    set_bit(NOHZ_IDLE, nohz_flags(cpu 1));
    rcu_assign_pointer(dom1, tmp)
}
                                            clear_bit(NOHZ_IDLE, nohz_flags(cpu));
                                        }

I don't know if there is any sane way to deal with this issue other than having nr_busy_cpus and nohz_flags in the same object sharing the same lifecycle.

Vincent: I wanted to avoid having to use the sd pointer for testing the NOHZ_IDLE flag, because that test occurs each time we go into idle, but it does not seem easily feasible.
Another solution could be to add a synchronization step between rcu_assign_pointer(dom 1, NULL) and the creation of the new domain, to ensure that all pending accesses to the old sd values have finished. But this would imply a potential delay in the rebuild of the sched_domain, and I'm not sure that it's acceptable.

Vincent
[RFC] sched: nohz_idle_balance
On a tickless system, one CPU runs the load balance for all idle CPUs. The cpu_load of this CPU is updated before starting the load balance of each other idle CPU. We should instead update the cpu_load of the balance_cpu.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ca4fe4..9ae3a5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,14 +4794,15 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 		if (need_resched())
 			break;
 
-		raw_spin_lock_irq(&this_rq->lock);
-		update_rq_clock(this_rq);
-		update_idle_cpu_load(this_rq);
-		raw_spin_unlock_irq(&this_rq->lock);
+		rq = cpu_rq(balance_cpu);
+
+		raw_spin_lock_irq(&rq->lock);
+		update_rq_clock(rq);
+		update_idle_cpu_load(rq);
+		raw_spin_unlock_irq(&rq->lock);
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
-		rq = cpu_rq(balance_cpu);
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}
-- 
1.7.9.5
Re: [RFC] sched: nohz_idle_balance
A wrong button made me remove the other guys from the thread. Sorry for this mistake.

On 13 September 2012 09:56, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 09:44 +0200, Vincent Guittot wrote:
On 13 September 2012 09:29, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 08:59 +0200, Vincent Guittot wrote:
On 13 September 2012 08:49, Mike Galbraith efa...@gmx.de wrote:
On Thu, 2012-09-13 at 06:11 +0200, Vincent Guittot wrote:

Vincent: On a tickless system, one CPU runs the load balance for all idle CPUs. The cpu_load of this CPU is updated before starting the load balance of each other idle CPU. We should instead update the cpu_load of the balance_cpu.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ca4fe4..9ae3a5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,14 +4794,15 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 		if (need_resched())
 			break;
 
-		raw_spin_lock_irq(&this_rq->lock);
-		update_rq_clock(this_rq);
-		update_idle_cpu_load(this_rq);
-		raw_spin_unlock_irq(&this_rq->lock);
+		rq = cpu_rq(balance_cpu);
+
+		raw_spin_lock_irq(&rq->lock);
+		update_rq_clock(rq);
+		update_idle_cpu_load(rq);
+		raw_spin_unlock_irq(&rq->lock);
 
 		rebalance_domains(balance_cpu, CPU_IDLE);
 
-		rq = cpu_rq(balance_cpu);
 		if (time_after(this_rq->next_balance, rq->next_balance))
 			this_rq->next_balance = rq->next_balance;
 	}

Mike: Ew, banging locks and updating clocks to what good end?

Vincent: The goal is to update the cpu_load table of the CPU before starting the load balance. Otherwise we will use outdated values in the load balance sequence.

Mike: If there's load to distribute, seems it should all work out fine without doing that. What harm is being done that makes this worthwhile?
Vincent: this_load and avg_load can be wrong, and can make an idle CPU be seen as balanced compared to the busy one.

Mike: I think you need to present numbers showing benefit. Crawling all over a mostly idle (4096p?) box is a decidedly bad thing to do.

Vincent: Yep, let me prepare some figures. You should also notice that you are already crawling all over the idle processor in rebalance_domains.

Vincent
-Mike
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 10 July 2012 15:42, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Tue, 2012-07-10 at 14:35 +0200, Vincent Guittot wrote:

Vincent: Maybe the last one, which enables ARCH_POWER, should also go into tip?

Peter: OK, I can take it.

Hi Peter,

I can't find the patch that enables ARCH_POWER in the tip tree. Have you taken it into your tree?

Regards,
Vincent
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 13 September 2012 14:07, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Thu, 2012-09-13 at 11:17 +0200, Vincent Guittot wrote:
On 10 July 2012 15:42, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Tue, 2012-07-10 at 14:35 +0200, Vincent Guittot wrote:

Vincent: Maybe the last one, which enables ARCH_POWER, should also go into tip?

Peter: OK, I can take it.

Vincent: Hi Peter, I can't find the patch that enables ARCH_POWER in the tip tree. Have you taken it into your tree?

Peter: Uhmmm, how about I say I have now? Sorry about that.

ok, thanks
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 15 August 2012 13:05, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:

Alex: Since there is no power saving consideration in the scheduler's CFS, I have a very rough idea for enabling a new power saving schema in CFS.

Peter: Adding Thomas, he always delights in poking holes in power schemes.

Alex: It is based on the following assumptions:

1. If many tasks crowd the system, just letting a few domain CPUs run and letting the other CPUs idle cannot save power. Letting all CPUs take the load, finish the tasks early, and then get into idle will save more power and give a better user experience.

Peter: I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.

Alex: 2. Scheduler domains and scheduler groups perfectly match the hardware and the power consumption units. So, pulling tasks out of a domain means this power consumption unit can potentially go idle.

Peter: I'm not sure I understand what you're saying, sorry.

Alex: So, as Peter mentioned in commit 8e7fbcbc22c ("sched: Remove stale power aware scheduling"), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Peter: Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern; you have to have something to pick between before you can pick :-)

Alex: And in scheduling, 2 places will care about the policy: load_balance() and, in task fork/exec, select_task_rq_fair().

Peter: ack

Alex: Here is some pseudo-code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair():

Peter: Oh man.. A few words outlining the general idea would've been nice.

load_balance() {
	update_sd_lb_stats(); // get busiest group, idlest group data

	if (sd->nr_running > sd's capacity) {
		// power saving policy is not suitable for
		// this scenario, it runs like performance policy
		mv tasks from busiest cpu in busiest group to
		idlest cpu in idlest group;

Peter: Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group. But I think we should not go on nr_running here; PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already, Paul! :-) Also, I'm not sure this is entirely correct: the thing you want to do for power aware stuff is to minimize the number of active power domains, which means you don't want idlest, you want least busy non-idle.

	} else { // the sd has enough capacity to hold all tasks
		if (sg->nr_running > sg's capacity) {
			// imbalanced between groups
			if (schedule policy == performance) {
				// when 2 busiest groups are at the same busy
				// degree, need to prefer the one that has
				// the softest group??
				move tasks from busiest group to idlest group;

Peter: So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.

			} else if (schedule policy == power)
				move tasks from busiest group to idlest group
				until busiest is just full of capacity;
				// the busiest group can balance
				// internally after the next LB

Peter: There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that; you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.

		} else { // all groups have enough capacity for their tasks
			if (schedule policy == performance)
				// all tasks may have enough cpu
				// resources to run,
				// mv tasks from busiest to idlest group?
				// no, at this time, it's better to keep
				// the task on the current cpu.
				// so, it is maybe better to do balance
				// in each of the groups
				for_each_imbalance_groups()
					move tasks from busiest cpu to
					idlest cpu in each of the groups;
			else if (schedule policy == power) {
				if (no hard pin in idlest group)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 16 August 2012 07:03, Alex Shi alex@intel.com wrote:
On 08/16/2012 12:19 AM, Matthew Garrett wrote:
On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:

Alex: ...this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Matthew: Are there workloads in which power might provide more performance than performance? If so, don't use these terms.

Alex: The power scheme should have no chance of better performance, by design.

Vincent: A side effect of packing small tasks on one core is that you always use the core with the lowest C-state, which minimizes the wake-up latency, so you can sometimes get better results than performance mode, which will try to use another core in another cluster that takes more time to wake up than waiting for the end of the current task.

Vincent
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 17 August 2012 10:43, Paul Turner p...@google.com wrote:
On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:

Alex: Since there is no power saving consideration in the scheduler's CFS, I have a very rough idea for enabling a new power saving schema in CFS.

Peter: Adding Thomas, he always delights in poking holes in power schemes.

Alex: It is based on the following assumptions:

1. If many tasks crowd the system, just letting a few domain CPUs run and letting the other CPUs idle cannot save power. Letting all CPUs take the load, finish the tasks early, and then get into idle will save more power and give a better user experience.

Peter: I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.

Alex: 2. Scheduler domains and scheduler groups perfectly match the hardware and the power consumption units. So, pulling tasks out of a domain means this power consumption unit can potentially go idle.

Peter: I'm not sure I understand what you're saying, sorry.

Alex: So, as Peter mentioned in commit 8e7fbcbc22c ("sched: Remove stale power aware scheduling"), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance and power.

Peter: Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern; you have to have something to pick between before you can pick :-)

Alex: And in scheduling, 2 places will care about the policy: load_balance() and, in task fork/exec, select_task_rq_fair().

Peter: ack

Alex: Here is some pseudo-code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair():

Peter: Oh man.. A few words outlining the general idea would've been nice.

load_balance() {
	update_sd_lb_stats(); // get busiest group, idlest group data

	if (sd->nr_running > sd's capacity) {
		// power saving policy is not suitable for
		// this scenario, it runs like performance policy
		mv tasks from busiest cpu in busiest group to
		idlest cpu in idlest group;

Peter: Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group. But I think we should not go on nr_running here; PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already, Paul! :-)

Paul: Yes -- I just got back from Africa this week. It's updated for almost all the previous comments, but I ran out of time before I left to re-post. I'm just about caught up enough that I should be able to get this done over the upcoming weekend. Monday at the latest.

Peter: Also, I'm not sure this is entirely correct: the thing you want to do for power aware stuff is to minimize the number of active power domains, which means you don't want idlest, you want least busy non-idle.

	} else { // the sd has enough capacity to hold all tasks
		if (sg->nr_running > sg's capacity) {
			// imbalanced between groups
			if (schedule policy == performance) {
				// when 2 busiest groups are at the same busy
				// degree, need to prefer the one that has
				// the softest group??
				move tasks from busiest group to idlest group;

Peter: So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.

			} else if (schedule policy == power)
				move tasks from busiest group to idlest group
				until busiest is just full of capacity;
				// the busiest group can balance
				// internally after the next LB

Peter: There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that; you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.

		} else { // all groups have enough capacity for their tasks
			if (schedule policy == performance)
				// all tasks may have enough cpu
				// resources to run,
				// mv tasks from busiest to idlest group?
				// no, at this time, it's better to keep
				// the task on the current cpu.
				// so, it is maybe better to do balance
				// in each of the groups
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 21 August 2012 02:58, Alex Shi alex@intel.com wrote:
On 08/20/2012 11:36 PM, Vincent Guittot wrote:

Peter (quoted): What you want is to keep track of a per-cpu utilization level (inverse of idle-time) and, using PJT's per-task runnable avg, see if placing the new task there will exceed the utilization limit. I think some of the Linaro people actually played around with this, Vincent?

Vincent: Sorry for the late reply, but I had almost no network access during the last weeks. So Linaro also works on a power aware scheduler, as Peter mentioned. Based on previous tests, we have concluded that the main drawback of the (now removed) old power scheduler was that we had no way to tell the difference between short and long running tasks, whereas it's a key input (at least for phones) for deciding to pack tasks and for selecting the core on an asymmetric system.

Alex: It is hard to estimate the future from a general point of view, but from a hack point of view, maybe you can add something to hint at this from task_struct. :)

Vincent: The per-task load tracking patchset gives you a good view of the last dozen ms.

Vincent: One additional key piece of information is the power distribution in the system, which can have a finer granularity than the current sched_domain description. Peter's proposal was to use a SHARE_POWERLINE flag, similar to the flags that already describe whether a sched_domain shares resources or cpu capacity.

Alex: Seems I missed this. What's the difference with the current SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

Vincent: SD_SHARE_CPUPOWER is set in a sched_domain at SMT level (threads sharing some parts of the physical core). SD_SHARE_PKG_RESOURCES is set at MC level (cores sharing some resources like cache and memory access). With these two new pieces of information, we can have a first power saving scheduler which spreads or packs tasks across cores and packages.

Alex: Fine, I'd like to test them on x86, plus SMT and NUMA :)

Vincent
[PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
This patchset creates an arch_scale_freq_power function for ARM, which is used to set the relative capacity of each core of a big.LITTLE system. It also removes the broken power estimation of x86.

Modifications since v3:
- Add comments
- Add optimization for SMP system
- Ensure that the capacity of a CPU will be at most 1

Modifications since v2:
- set_power_scale function becomes static
- Rework loop in update_siblings_masks
- Remove useless code in parse_dt_topology

Modifications since v1:
- Add and update explanation about the use of the table and the range of the value
- Remove the use of NR_CPUS and use nr_cpu_ids instead
- Remove broken power estimation of x86

Peter Zijlstra (1):
  sched, x86: Remove broken power estimation

Vincent Guittot (4):
  ARM: topology: Add arch_scale_freq_power function
  ARM: topology: factorize the update of sibling masks
  ARM: topology: Update cpu_power according to DT information
  sched: cpu_power: enable ARCH_POWER

 arch/arm/kernel/topology.c   | 239 ++++++++++++++++++++++++++++++------------
 arch/x86/kernel/cpu/Makefile |   2 +-
 arch/x86/kernel/cpu/sched.c  |  55 ----------
 kernel/sched/features.h      |   2 +-
 4 files changed, 219 insertions(+), 79 deletions(-)
 delete mode 100644 arch/x86/kernel/cpu/sched.c

-- 
1.7.9.5
[PATCH v4 1/5] ARM: topology: Add arch_scale_freq_power function
Add infrastructure to be able to modify the cpu_power of each core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 8200dea..51f23b3 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -22,6 +22,37 @@
 #include <asm/cputype.h>
 #include <asm/topology.h>
 
+/*
+ * cpu power scale management
+ */
+
+/*
+ * cpu power table
+ * This per cpu data structure describes the relative capacity of each core.
+ * On a heterogeneous system, cores don't have the same computation capacity
+ * and we reflect that difference in the cpu_power field so the scheduler can
+ * take this difference into account during load balance. A per cpu structure
+ * is preferred because each CPU updates its own cpu_power field during the
+ * load balance except for idle cores. One idle core is selected to run the
+ * rebalance_domains for all idle cores and the cpu_power can be updated
+ * during this sequence.
+ */
+static DEFINE_PER_CPU(unsigned long, cpu_scale);
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return per_cpu(cpu_scale, cpu);
+}
+
+static void set_power_scale(unsigned int cpu, unsigned long power)
+{
+	per_cpu(cpu_scale, cpu) = power;
+}
+
+/*
+ * cpu topology management
+ */
+
 #define MPIDR_SMP_BITMASK (0x3 << 30)
 #define MPIDR_SMP_VALUE (0x2 << 30)
 
@@ -41,6 +72,9 @@
 #define MPIDR_LEVEL2_MASK 0xFF
 #define MPIDR_LEVEL2_SHIFT 16
 
+/*
+ * cpu topology table
+ */
 struct cputopo_arm cpu_topology[NR_CPUS];
 
 const struct cpumask *cpu_coregroup_mask(int cpu)
@@ -134,7 +168,7 @@ void init_cpu_topology(void)
 {
 	unsigned int cpu;
 
-	/* init core mask */
+	/* init core mask and power*/
 	for_each_possible_cpu(cpu) {
 		struct cputopo_arm *cpu_topo = &(cpu_topology[cpu]);
 
@@ -143,6 +177,8 @@ void init_cpu_topology(void)
 		cpu_topo->socket_id = -1;
 		cpumask_clear(&cpu_topo->core_sibling);
 		cpumask_clear(&cpu_topo->thread_sibling);
+
+		set_power_scale(cpu, SCHED_POWER_SCALE);
 	}
 	smp_wmb();
 }
-- 
1.7.9.5
[PATCH v4 2/5] ARM: topology: factorize the update of sibling masks
This factorization has also been proposed in another patch that has not been merged yet:
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html
So, this patch could be dropped depending on the state of the other one.

Signed-off-by: Lorenzo Pieralisi lorenzo.pieral...@arm.com
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 48 +++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 51f23b3..eb5fc81 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -82,6 +82,32 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+void update_siblings_masks(unsigned int cpuid)
+{
+	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
+	int cpu;
+
+	/* update core and thread sibling masks */
+	for_each_possible_cpu(cpu) {
+		cpu_topo = &cpu_topology[cpu];
+
+		if (cpuid_topo->socket_id != cpu_topo->socket_id)
+			continue;
+
+		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
+		if (cpu != cpuid)
+			cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
+
+		if (cpuid_topo->core_id != cpu_topo->core_id)
+			continue;
+
+		cpumask_set_cpu(cpuid, &cpu_topo->thread_sibling);
+		if (cpu != cpuid)
+			cpumask_set_cpu(cpu, &cpuid_topo->thread_sibling);
+	}
+	smp_wmb();
+}
+
 /*
  * store_cpu_topology is called at boot when only one cpu is running
  * and with the mutex cpu_hotplug.lock locked, when several cpus have booted,
@@ -91,7 +117,6 @@ void store_cpu_topology(unsigned int cpuid)
 {
 	struct cputopo_arm *cpuid_topo = &cpu_topology[cpuid];
 	unsigned int mpidr;
-	unsigned int cpu;
 
 	/* If the cpu topology has been already set, just return */
 	if (cpuid_topo->core_id != -1)
@@ -133,26 +158,7 @@ void store_cpu_topology(unsigned int cpuid)
 		cpuid_topo->socket_id = -1;
 	}
 
-	/* update core and thread sibling masks */
-	for_each_possible_cpu(cpu) {
-		struct cputopo_arm *cpu_topo = &cpu_topology[cpu];
-
-		if (cpuid_topo->socket_id == cpu_topo->socket_id) {
-			cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
-			if (cpu != cpuid)
-				cpumask_set_cpu(cpu,
-					&cpuid_topo->core_sibling);
-
-			if (cpuid_topo->core_id == cpu_topo->core_id) {
-				cpumask_set_cpu(cpuid,
-					&cpu_topo->thread_sibling);
-				if (cpu != cpuid)
-					cpumask_set_cpu(cpu,
-						&cpuid_topo->thread_sibling);
-			}
-		}
-	}
-	smp_wmb();
+	update_siblings_masks(cpuid);
 
 	printk(KERN_INFO "CPU%u: thread %d, cpu %d, socket %d, mpidr %x\n",
 		cpuid, cpu_topology[cpuid].thread_id,
-- 
1.7.9.5
[PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
Use the cpu compatibility field and clock-frequency field of DT to estimate
the capacity of each core of the system and to update the cpu_power field
accordingly. This patch enables the scheduler to put more running tasks on
big cores than on LITTLE ones. But this patch doesn't ensure that long
running tasks will run on big cores and short ones on LITTLE cores.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
Reviewed-by: Namhyung Kim namhy...@kernel.org
---
 arch/arm/kernel/topology.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index eb5fc81..198b084 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -17,7 +17,9 @@
 #include <linux/percpu.h>
 #include <linux/node.h>
 #include <linux/nodemask.h>
+#include <linux/of.h>
 #include <linux/sched.h>
+#include <linux/slab.h>

 #include <asm/cputype.h>
 #include <asm/topology.h>
@@ -49,6 +51,152 @@ static void set_power_scale(unsigned int cpu, unsigned long power)
 	per_cpu(cpu_scale, cpu) = power;
 }

+#ifdef CONFIG_OF
+struct cpu_efficiency {
+	const char *compatible;
+	unsigned long efficiency;
+};
+
+/*
+ * Table of relative efficiency of each processors
+ * The efficiency value must fit in 20bit and the final
+ * cpu_scale value must be in the range
+ *   0 < cpu_scale < 3*SCHED_POWER_SCALE/2
+ * in order to return at most 1 when DIV_ROUND_CLOSEST
+ * is used to compute the capacity of a CPU.
+ * Processors that are not defined in the table,
+ * use the default SCHED_POWER_SCALE value for cpu_scale.
+ */
+struct cpu_efficiency table_efficiency[] = {
+	{"arm,cortex-a15", 3891},
+	{"arm,cortex-a7",  2048},
+	{NULL, },
+};
+
+struct cpu_capacity {
+	unsigned long hwid;
+	unsigned long capacity;
+};
+
+struct cpu_capacity *cpu_capacity;
+
+unsigned long middle_capacity = 1;
+
+/*
+ * Iterate all CPUs' descriptor in DT and compute the efficiency
+ * (as per table_efficiency). Also calculate a middle efficiency
+ * as close as possible to  (max{eff_i} - min{eff_i}) / 2
+ * This is later used to scale the cpu_power field such that an
+ * 'average' CPU is of middle power. Also see the comments near
+ * table_efficiency[] and update_cpu_power().
+ */
+static void __init parse_dt_topology(void)
+{
+	struct cpu_efficiency *cpu_eff;
+	struct device_node *cn = NULL;
+	unsigned long min_capacity = (unsigned long)(-1);
+	unsigned long max_capacity = 0;
+	unsigned long capacity = 0;
+	int alloc_size, cpu = 0;
+
+	alloc_size = nr_cpu_ids * sizeof(struct cpu_capacity);
+	cpu_capacity = (struct cpu_capacity *)kzalloc(alloc_size, GFP_NOWAIT);
+
+	while ((cn = of_find_node_by_type(cn, "cpu"))) {
+		const u32 *rate, *reg;
+		int len;
+
+		if (cpu >= num_possible_cpus())
+			break;
+
+		for (cpu_eff = table_efficiency; cpu_eff->compatible; cpu_eff++)
+			if (of_device_is_compatible(cn, cpu_eff->compatible))
+				break;
+
+		if (cpu_eff->compatible == NULL)
+			continue;
+
+		rate = of_get_property(cn, "clock-frequency", &len);
+		if (!rate || len != 4) {
+			pr_err("%s missing clock-frequency property\n",
+				cn->full_name);
+			continue;
+		}
+
+		reg = of_get_property(cn, "reg", &len);
+		if (!reg || len != 4) {
+			pr_err("%s missing reg property\n", cn->full_name);
+			continue;
+		}
+
+		capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+
+		/* Save min capacity of the system */
+		if (capacity < min_capacity)
+			min_capacity = capacity;
+
+		/* Save max capacity of the system */
+		if (capacity > max_capacity)
+			max_capacity = capacity;
+
+		cpu_capacity[cpu].capacity = capacity;
+		cpu_capacity[cpu++].hwid = be32_to_cpup(reg);
+	}
+
+	if (cpu < num_possible_cpus())
+		cpu_capacity[cpu].hwid = (unsigned long)(-1);
+
+	/* If min and max capacities are equals, we bypass the update of the
+	 * cpu_scale because all CPUs have the same capacity. Otherwise, we
+	 * compute a middle_capacity factor that will ensure that the capacity
+	 * of an 'average' CPU of the system will be as close as possible to
+	 * SCHED_POWER_SCALE, which is the default value, but with the
+	 * constraint explained near table_efficiency[].
+	 */
+	if (min_capacity == max_capacity)
+		cpu_capacity[0].hwid = (unsigned long)(-1);
+	else if (4*max_capacity < (3*(max_capacity + min_capacity)))
+		middle_capacity = (min_capacity + max_capacity
[PATCH v4 4/5] sched, x86: Remove broken power estimation
From: Peter Zijlstra a.p.zijls...@chello.nl

The x86 sched power implementation has been broken forever and gets in
the way of other stuff, remove it.

For archaeological interest, fixing this code would require dealing with
the cross-cpu calling of these functions and, more importantly, we need
to filter idle time out of the a/m-perf stuff because the ratio will go
down to 0 when idle, giving a 0 capacity which is not what we'd want.

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Link: http://lkml.kernel.org/n/tip-wjjwelpti8f8k7i1pdnzm...@git.kernel.org
---
 arch/x86/kernel/cpu/Makefile |  2 +-
 arch/x86/kernel/cpu/sched.c  | 55 --------------------------------------------
 2 files changed, 1 insertion(+), 56 deletions(-)
 delete mode 100644 arch/x86/kernel/cpu/sched.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 6ab6aa2..c598126 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -14,7 +14,7 @@ CFLAGS_common.o := $(nostackp)

 obj-y			:= intel_cacheinfo.o scattered.o topology.o
 obj-y			+= proc.o capflags.o powerflags.o common.o
-obj-y			+= vmware.o hypervisor.o sched.o mshyperv.o
+obj-y			+= vmware.o hypervisor.o mshyperv.o
 obj-y			+= rdrand.o
 obj-y			+= match.o

diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
deleted file mode 100644
index a640ae5..0000000
--- a/arch/x86/kernel/cpu/sched.c
+++ /dev/null
@@ -1,55 +0,0 @@
-#include <linux/sched.h>
-#include <linux/math64.h>
-#include <linux/percpu.h>
-#include <linux/irqflags.h>
-
-#include <asm/cpufeature.h>
-#include <asm/processor.h>
-
-#ifdef CONFIG_SMP
-
-static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
-
-static unsigned long scale_aperfmperf(void)
-{
-	struct aperfmperf val, *old = &__get_cpu_var(old_perf_sched);
-	unsigned long ratio, flags;
-
-	local_irq_save(flags);
-	get_aperfmperf(&val);
-	local_irq_restore(flags);
-
-	ratio = calc_aperfmperf_ratio(old, &val);
-	*old = val;
-
-	return ratio;
-}
-
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
-{
-	/*
-	 * do aperf/mperf on the cpu level because it includes things
-	 * like turbo mode, which are relevant to full cores.
-	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return scale_aperfmperf();
-
-	/*
-	 * maybe have something cpufreq here
-	 */
-
-	return default_scale_freq_power(sd, cpu);
-}
-
-unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
-{
-	/*
-	 * aperf/mperf already includes the smt gain
-	 */
-	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
-		return SCHED_LOAD_SCALE;
-
-	return default_scale_smt_power(sd, cpu);
-}
-
-#endif
--
1.7.9.5
[PATCH v4 5/5] sched: cpu_power: enable ARCH_POWER
Heterogeneous ARM platforms use the arch_scale_freq_power function to
reflect the relative capacity of each core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index de00a48..d98ae90 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -42,7 +42,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
 /*
  * Use arch dependent cpu power functions
  */
-SCHED_FEAT(ARCH_POWER, false)
+SCHED_FEAT(ARCH_POWER, true)

 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
--
1.7.9.5
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? 
Regards,
Vincent

Regards
Santosh
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 15:00, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 6:02 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I see. Its new processor but it is just 1 one of the IP in an entire SOC. As mentioned below I was talking about full SOC support including all the driver subsystem with DT. I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. 
Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? That is what eventually people end up doing who don't have DT ready for entire SOC. I was trying to ask whether at least some method is proposed(need not be merged in mainline) to have the big.LITTLE information parsing without DT. Ok, IIUC, you need a temporary methods, which doesn't need to be merged in mainline, to set the cpu_scale field ? Regards Santosh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 3/5] ARM: topology: Update cpu_power according to DT information
On 9 July 2012 16:37, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 8:06 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 15:00, Shilimkar, Santosh santosh.shilim...@ti.com wrote: On Mon, Jul 9, 2012 at 6:02 PM, Vincent Guittot vincent.guit...@linaro.org wrote: On 9 July 2012 12:55, Shilimkar, Santosh santosh.shilim...@ti.com wrote: Vincent, On Mon, Jul 9, 2012 at 2:57 PM, Vincent Guittot vincent.guit...@linaro.org wrote: Use cpu compatibility field and clock-frequency field of DT to estimate the capacity of each core of the system and to update the cpu_power field accordingly. This patch enables to put more running tasks on big cores than on LITTLE ones. But this patch doesn't ensure that long running tasks will run on big cores and short ones on LITTLE cores. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Reviewed-by: Namhyung Kim namhy...@kernel.org --- arch/arm/kernel/topology.c | 153 1 file changed, 153 insertions(+) Sorry for not giving this comment on previous version but we should also have a way to provide the big.LITTLE information without Device Tree. May be a platform device/data. Hi Santosh, I had thought of adding such additional way to set cpu_power of big.LITTLE but my conclusion was -it's a new platform so it should come with DT -DT is already required by other patches linked to big.LITTLE (http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080873.html) -There is no device that can be easily used to get such information at this early boot stage. I see. Its new processor but it is just 1 one of the IP in an entire SOC. As mentioned below I was talking about full SOC support including all the driver subsystem with DT. 
I know we are moving DT way, but remember apart from core kernel infrastructure, to have a complete product build with DT means all the drivers must be already supporting DT which is not the case with many huge driver sub-systems like USB, display subsystem, Audio etc. Having that support would greatly help for the SOC's which have not yet reached to stage where entire SOC is DT compliant and want to use big.LITTLE infrastructure. Can't you support both type of devices on your platform ? You can move your device to DT mode when it is supported ? That is what eventually people end up doing who don't have DT ready for entire SOC. I was trying to ask whether at least some method is proposed(need not be merged in mainline) to have the big.LITTLE information parsing without DT. Ok, IIUC, you need a temporary methods, which doesn't need to be merged in mainline, to set the cpu_scale field ? Yep. You could use set_power_scale in your platform code to set the cpu_power of your CPUs until DT is ready for your platform. Regards, Vincent Regards Santosh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 0/5] ARM: topology: set the capacity of each cores for big.LITTLE
On 10 July 2012 13:27, Peter Zijlstra a.p.zijls...@chello.nl wrote:
On Mon, 2012-07-09 at 11:27 +0200, Vincent Guittot wrote:

This patchset creates an arch_scale_freq_power function for ARM, which is used to set the relative capacity of each core of a big.LITTLE system. It also removes the broken power estimation of x86.

Modification since v3:
- Add comments
- Add optimization for SMP system
- Ensure that capacity of a CPU will be at most 1

Modification since v2:
- set_power_scale function becomes static
- Rework loop in update_siblings_masks
- Remove useless code in parse_dt_topology

Modification since v1:
- Add and update explanation about the use of the table and the range of the value
- Remove the use of NR_CPUS and use nr_cpu_ids instead
- Remove broken power estimation of x86

Peter Zijlstra (1):
  sched, x86: Remove broken power estimation

Note that this patch already made its way into tip as commit bcae21d6e793a7047d38abc9ac0946c53733c1dd so it might be best to base whatever tree this is supposed to go into on something that includes that.

yes

Vincent Guittot (4):
  ARM: topology: Add arch_scale_freq_power function
  ARM: topology: factorize the update of sibling masks
  ARM: topology: Update cpu_power according to DT information
  sched: cpu_power: enable ARCH_POWER

How would you like to proceed with these patches? I'm fine with them going through the ARM tree. Maybe the last one, which enables ARCH_POWER, should also go into tip? In which case:

Acked-by: Peter Zijlstra a.p.zijls...@chello.nl

Ingo, Russell, any other preferences?

I'm going to push the patches related to ARM into Russell's patch system

Vincent
Re: [RFC 3/6] sched: pack small tasks
On 20 November 2012 15:28, Morten Rasmussen morten.rasmus...@arm.com wrote:
Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +0000, Vincent Guittot wrote:
On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag setting for a system with only cluster level power gating. To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

Hi Morten,

Thanks for testing the patches. It seems that I have over-optimized the loop and removed some use cases.

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
During sched_domain creation, we define a pack buddy CPU if available. On a system that shares the powerline at all levels, the buddy is set to -1.

On a dual cluster / dual core system which can powergate each core and cluster independently, the buddy configuration will be:

      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   1 +
 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dab7908..70cadbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);

+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f4a4f6..8c9d3ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -157,6 +157,63 @@ void sched_init_granularity(void)
 	update_sysctl();
 }

+
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it isn't worth to pack on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERLINE flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;

sd is the highest level where SD_SHARE_POWERLINE is enabled, so the sched groups of the parent level would represent the power domains. If I get it right, we want to pack inside the cluster first and only let the first cpu

You probably wanted to use sched_group instead of cluster, because cluster is only a special use case, didn't you?

of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find their buddy within the domain, and only the first one should go to the parent sched domain to find its buddy.

We don't want to pack in the current sched_domain because it shares the power domain. We want to pack at the parent level.

Yes. I think we mean the same thing. The packing takes place at the parent sched_domain, but the sched_group that we are looking at only contains the cpus of the level below.

I propose the following fix:

-	sd = sd->parent;
+	if (cpumask_first(sched_domain_span(sd)) == cpu
+	    || !sd->parent)
+		sd = sd->parent;

We always look for the buddy in the parent level, whatever the cpu position in the mask is.

+
+	while (sd) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp = sg->next;
+
+		/* 1st CPU of the sched domain is a good candidate */
+		if (id == -1)
+			id = cpumask_first(sched_domain_span(sd));

There is no guarantee that id is in the sched group pointed to by sd->groups, which is implicitly assumed later in the search loop. We need to find the sched group that contains id and point sg to that instead. I haven't found an elegant way to find that group, but the fix below should at least give the right result
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 06:19, Viresh Kumar viresh.ku...@linaro.org wrote: Hi Tejun, On 26 November 2012 22:45, Tejun Heo t...@kernel.org wrote: On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote: I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there. My bad. I completely missed/forgot that comment from your earlier mails. Will do it. That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quicky by e.g. forcing all work items to execute on different CPUs, but you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about? Hmm.. I actually implemented the problem discussed here: (I know you have seen this earlier :) ) http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf Specifically slides: 12 19. I haven't done much power calculations with it and have tested it more from functionality point of view. @Vincent: Can you add some comments here? Sorry for this late reply. We have faced some situations on TC2 (as an example) where the tasks are running in the LITTLE cluster whereas some periodic works stay on the big cluster so we can have one cluster that wakes up for tasks and another one that wakes up for work. We would like to consolidate the behaviour of the work with the tasks behaviour. 
Sorry, I don't have relevant figures as the patches are used with other ones which also impact the power consumption.

This series introduces the possibility to run a work on another CPU, which is necessary if we want a better correlation of task and work scheduling on the system. Most of the time, queue_work is used when a driver doesn't mind which CPU the work runs on, whereas it looks like it should be used only when the work must run locally. We would like to solve this point with the new interface that is proposed by Viresh.

Vincent

--
viresh
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 14:59, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote: On 27 November 2012 18:56, Steven Rostedt rost...@goodmis.org wrote: A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own. Not sure if you missed the first check sched_select_cpu() +int sched_select_cpu(unsigned int sd_flags) +{ + /* If Current cpu isn't idle, don't migrate anything */ + if (!idle_cpu(cpu)) + return cpu; We aren't going to search if we aren't idle. OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-) By migrating it now, it will create its activity and wake up on the right CPU next time. If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case. We tried with an KCONFIG option for it, which Tejun rejected. Yeah, I saw that. I don't like adding KCONFIG options either. Best is to get something working that doesn't add any regressions. If you can get this to work without making *any* regressions in the normal case than I'm totally fine with that. But if this adds any issues with the normal case, then it's a show stopper. To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on wrong CPU. If it is a timer, find a way to move the timer. If it is something else, lets work to fix that. 
Doing searches of possibly all CPUs (unlikely, but it is there) just seems wrong to me.

As Vincent pointed out, on big.LITTLE systems we just don't want to serve works on big cores. That would waste too much power, especially if we are going to wake up big cores. It would be difficult to steer the source driver (which queues the work) to little cores. We thought that if somebody wanted to queue work on the current cpu then they must use queue_work_on().

As Tejun has mentioned earlier, are there any assumptions anywhere that expect an unbound workqueue not to migrate? Where per-cpu variables might be used. Tejun had a good idea of forcing this to migrate the work *every* time, to not let a work item run on the same CPU that it was queued on. If it can survive that, then it is probably OK. Maybe add a config option that forces this? That way, anyone can test that this isn't an issue.

-- Steve
Re: [PATCH V2 Resend 3/4] workqueue: Schedule work on non-idle cpu instead of current one
On 27 November 2012 16:04, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote: On 27 November 2012 14:59, Steven Rostedt rost...@goodmis.org wrote: On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote: On 27 November 2012 18:56, Steven Rostedt rost...@goodmis.org wrote: A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own. Not sure if you missed the first check sched_select_cpu() +int sched_select_cpu(unsigned int sd_flags) +{ + /* If Current cpu isn't idle, don't migrate anything */ + if (!idle_cpu(cpu)) + return cpu; We aren't going to search if we aren't idle. OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh we are idle we can spin. But then why go through this in the first place ;-) By migrating it now, it will create its activity and wake up on the right CPU next time. If migrating on any CPUs seems a bit risky, we could restrict the migration on a CPU on the same node. We can pass such contraints on sched_select_cpu That's assuming that the CPUs stay idle. Now if we move the work to another CPU and it goes idle, then it may move that again. It could end up being a ping pong approach. I don't think idle is a strong enough heuristic for the general case. If interrupts are constantly going off on a CPU that happens to be idle most of the time, it will constantly be moving work onto CPUs that are currently doing real work, and by doing so, it will be slowing those CPUs down. I agree that idle is probably not enough but it's the heuristic that is currently used for selecting a CPU for a timer and the timer also uses sched_select_cpu in this series. 
So, in order to go step by step, a common interface has been introduced for selecting a CPU, and this function uses the same algorithm that the timers already do. Once we agree on an interface, the heuristic can be updated.

-- Steve
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 17 December 2012 16:24, Alex Shi alex@intel.com wrote:

The scheme below tries to summarize the idea:

Socket      |   socket 0   |   socket 1   |   socket 2   |   socket 3   |
LCPU        |  0  |  1-15  |  16 | 17-31  |  32 | 33-47  |  48 | 49-63  |
buddy conf0 |  0  |   0    |  1  |  16    |  2  |  32    |  3  |  48    |
buddy conf1 |  0  |   0    |  0  |  16    |  16 |  32    |  32 |  48    |
buddy conf2 |  0  |   0    |  16 |  16    |  32 |  32    |  48 |  48    |

But I don't know how this can interact with the NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3

So, it has 4 levels 0/16/32/ for socket 3 and 0 levels for socket 0; it is unbalanced between the different sockets.

That's the target, because we have decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.

Iteration on domain and group is an advantage for the power-efficiency requirement, not a shortcoming. If some CPUs are already idle before forking, letting another waking CPU check their load/util and then decide which one is the best CPU can reduce late migrations, and that saves both performance and power.

In fact, we have already done this job once at boot, and we consider that moving small tasks to the buddy CPU is always a benefit, so we don't need to waste time looping over sched_domain and sched_group to compute the current capacity of each LCPU for each wake up of each small task. We want all small tasks and background activity to wake up on the same buddy CPU and let the default behavior of the scheduler choose the best CPU for heavy tasks or loaded CPUs.

IMHO, the design should be very good for your scenario and your machine, but when the code moves to the general scheduler, we do want it to handle more general scenarios. Like when the 'small task' is not as small as the tasks in cyclictest, which can hardly run longer than the migration granularity or one tick; thus we really don't need to consider task migration cost.

Cyclictest is the ultimate small-task use case, which points out all the weaknesses of a scheduler for such kind of tasks. Music playback is a more realistic one, and it also shows improvement.

But when the tasks are not too small, migration is heavier than domain/group walking; that is the common sense in fork/exec/waking balance.

For which kind of machine are you stating that hypothesis?

I would have said the opposite: the current scheduler limits its computation of statistics during fork/exec/wakeup compared to a periodic load balance because it's too heavy. It's even more true for wake up if wake affine is possible.

On the contrary, moving a task by walking the buddies at each level is not only bad on performance but also bad on power. Consider the quite big latency of waking a deep-idle CPU: we lose too much.

My results have shown a different conclusion. That should be because your tasks are too small to need to consider migration cost. In fact, there is a much higher chance that the buddy will not be in a deep idle state, as all the small tasks and background activity are already waking up on this CPU.

powertop is helpful to tune your system for more idle time. Another reason is that the current kernel just tries to spread tasks onto more cpus for performance consideration. My power scheduling patch should be helpful on this.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you have just 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).
According to your patch, what you mean by 'notion of load' is the utilization of the CPU, not the load weight of tasks, right? Yes, but not only: the number of tasks that run simultaneously is another important input. Yes, I just talked about task numbers, but it naturally extends to the task utilization on a CPU: e.g. 8 tasks with 25% util can just fill 2 CPUs, but that is clearly beyond the capacity of the buddy, so you need to wake up another CPU socket while the local socket has some LCPUs idle... 8 tasks with a running period of 25ms per 100ms that wake up simultaneously should probably run on 8 different LCPUs in order to race to idle. Nope, 8 tasks waking up simultaneously is a rare probability. And multimedia is one example of tasks waking up simultaneously. Even so, they should run in the same socket for power-saving consideration (my power scheduling patch can do this), instead of spreading to all sockets.
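As a side note, the capacity argument above is simple arithmetic and can be sketched in isolation. The snippet below is a hypothetical user-space model (not code from the patch): packing by capacity needs only the ceiling of the summed utilization, whereas race-to-idle spreads N simultaneously-waking tasks onto N LCPUs.

```c
#include <assert.h>

/* Rough model: how many LCPUs do n_tasks tasks of util_pct percent
 * average utilization need under each policy discussed above? */
static int cpus_needed_packed(int n_tasks, int util_pct)
{
	/* ceiling of the total utilization, expressed in whole CPUs */
	return (n_tasks * util_pct + 99) / 100;
}

static int cpus_needed_race_to_idle(int n_tasks)
{
	/* one LCPU per simultaneously-waking task */
	return n_tasks;
}
```

With 8 tasks at 25% utilization the packed estimate is 2 CPUs, matching the "can just fill 2 CPUs" figure in the discussion, while race-to-idle would use all 8.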
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 21 December 2012 06:47, Namhyung Kim namhy...@kernel.org wrote: Hi Vincent, On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote:

+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return ((rq->avg.runnable_avg_sum << rq->nr_running) >

If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running is zero and you will get the wrong decision. Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running)); Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks of 15% cputime each, right? Yes, that's right. I don't know what is correct, but just guessing that from a cpu's point of view it'd be busier if it had a higher runnable_avg_sum rather than a higher nr_running, IMHO. The nr_running is used to point out how many tasks are running simultaneously and the potential scheduling latency of adding

+			rq->avg.runnable_avg_period);
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 25% in average */
+	return ((p->se.avg.runnable_avg_sum << 1) <
+			p->se.avg.runnable_avg_period);

25% may not be suitable for a big machine. Threshold is always an issue; which threshold would be suitable for a big machine? I'm wondering if I should use the imbalance_pct value for computing the threshold. Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be '<< 2'? The 1st version of the patch was using 2, but I received a comment saying that it was maybe not aggressive enough, so I have updated the formula with 1 but forgot to update the comment. I will align the comment and formula in the next version.
Thanks for pointing this Vincent Thanks, Namhyung -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
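The overflow concern and the proposed right-shift rewrite discussed above can be modeled outside the kernel. The function below is a hypothetical user-space sketch (not the kernel code): it implements the suggested form sum > (period >> nr_running), with an explicit guard so the model stays well defined even for a huge nr_running (C leaves shifts by 32 or more bits of a 32-bit value undefined, which is the root of the "is zero / wrong decision" worry).

```c
#include <stdint.h>

/* Sketch of the rewritten busy check: a buddy is busy when its load is
 * high, or when many running tasks shrink the period-derived threshold. */
static int is_buddy_busy_model(uint32_t runnable_avg_sum,
			       uint32_t runnable_avg_period,
			       unsigned int nr_running)
{
	/* guard the shift: >> by >= 32 on a u32 is undefined in C */
	uint32_t threshold =
		nr_running >= 32 ? 0 : runnable_avg_period >> nr_running;

	return runnable_avg_sum > threshold;
}
```

This reproduces Namhyung's observation: one task at 40% load (sum 40, period 100, nr_running 1) is reported not busy, while three tasks at 10% each (sum 30, nr_running 3) are reported busy.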
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 21 December 2012 09:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 21 December 2012 06:47, Namhyung Kim namhy...@kernel.org wrote: Hi Vincent, On Thu, Dec 13, 2012 at 11:11:11AM +0100, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote:

+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return ((rq->avg.runnable_avg_sum << rq->nr_running) >

If nr_running is a bit big, rq->avg.runnable_avg_sum << rq->nr_running is zero and you will get the wrong decision. Yes, I'm going to do it like below instead: return (rq->avg.runnable_avg_sum > (rq->avg.runnable_avg_period >> rq->nr_running)); Doesn't it consider nr_running too much? It seems the current is_buddy_busy returns false on a cpu that has 1 task running 40% cputime, but returns true on a cpu that has 3 tasks running 10% cputime each, or 2 tasks of 15% cputime each, right? Yes, that's right. I don't know what is correct, but just guessing that from a cpu's point of view it'd be busier if it had a higher runnable_avg_sum rather than a higher nr_running, IMHO. Sorry, the mail was sent before I finished it. The nr_running is used to point out how many tasks are running simultaneously and, as a result, the potential scheduling latency. I have used the shift instruction because it was quite simple and efficient, but it may give too much weight to nr_running. I could use a simple division instead of shifting runnable_avg_sum.

+			rq->avg.runnable_avg_period);
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 25% in average */
+	return ((p->se.avg.runnable_avg_sum << 1) <
+			p->se.avg.runnable_avg_period);

25% may not be suitable for a big machine.
Threshold is always an issue; which threshold would be suitable for a big machine? I'm wondering if I should use the imbalance_pct value for computing the threshold. Anyway, I wonder how 'sum << 1' computes 25%. Shouldn't it be '<< 2'? The 1st version of the patch was using 2, but I received a comment saying that it was maybe not aggressive enough, so I have updated the formula with 1 but forgot to update the comment. I will align the comment and formula in the next version. Thanks for pointing this out. Vincent Thanks, Namhyung
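The threshold question raised above is easy to check in isolation. Below is a hypothetical user-space sketch (not the kernel function) of the is_light_task() comparison, parameterized by the shift: with (sum << shift) < period, a task is "light" when its average run time is below 1/2^shift of the period, so shift 2 gives the 25% of the comment while the v2 shift of 1 actually gives 50%.

```c
#include <stdint.h>

/* Model of the light-task test: light when sum * 2^shift < period,
 * i.e. the task runs less than 1/2^shift of the time on average. */
static int is_light_task_model(uint32_t runnable_avg_sum,
			       uint32_t runnable_avg_period,
			       unsigned int shift)
{
	return (runnable_avg_sum << shift) < runnable_avg_period;
}
```

With a period of 100, shift 1 classifies a 49% task as light but not a 50% one, and shift 2 draws the line at 25%, which is exactly the comment/formula mismatch Namhyung spotted.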
[RFC PATCH v2 0/6] sched: packing small tasks
Hi, This patchset takes advantage of the new per-task load tracking that is available in the kernel for packing small tasks on as few CPUs/clusters/cores as possible. The main goal of packing small tasks is to reduce the power consumption by minimizing the number of power domains that are enabled. The packing is done in 2 steps: The 1st step looks for the best place to pack tasks in a system according to its topology, and it defines a pack buddy CPU for each CPU if one is available. The policy for defining a buddy CPU is that we pack at all levels where a group of CPUs can be power gated independently from the others. For describing this capability, a new flag, SD_SHARE_POWERDOMAIN, has been introduced; it is used to indicate whether the groups of CPUs of a scheduling domain share their power state. By default, this flag is set in all sched_domains in order to keep the current behavior of the scheduler unchanged. In the 2nd step, the scheduler checks the load average of a task which wakes up as well as the load average of the buddy CPU, and can decide to migrate the task to the buddy. This check is done during the wake up because small tasks tend to wake up between load balances and asynchronously to each other, which prevents the default mechanism from catching and migrating them efficiently. Changes since V1: Patch 2/6 - Change the flag name, which was not clear. The new name is SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6 - Fix issues in the algorithm that looks for the best buddy CPU - Use pr_debug instead of pr_info - Fix for uniprocessor
Patch 4/6 - Remove the use of usage_avg_sum which has not been merged
Patch 5/6 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured
Patch 6/6 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for the ARM platform

New results for V2: This series has been tested with MP3 playback on an ARM platform: TC2 HMP (dual CA-15 and 3xCA-7 cluster). The measurements have been done on an Ubuntu image during 60 seconds of playback, and the result has been normalized to 100.

          | CA15 | CA7 | total |
 default  |  81  |  97 |  178  |
 pack     |  13  | 100 |  113  |

Previous results for V1: The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For the ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels. The performance tests have been done on both platforms with sysbench. The results don't show any performance regression. These results are aligned with the policy, which keeps the normal behavior for heavy use cases.

test: sysbench --test=cpu --num-threads=N --max-requests=R run

The results below are the average duration of 3 tests on the quad CA-9. "default" is the current scheduler behavior (pack buddy CPU is -1), "pack" is the scheduler with the pack mechanism.

              | default | pack    |
 N=8;  R=200  |  3.1999 |  3.1921 |
 N=8;  R=2000 | 31.4939 | 31.4844 |
 N=12; R=200  |  3.2043 |  3.2084 |
 N=12; R=2000 | 31.4897 | 31.4831 |
 N=16; R=200  |  3.1774 |  3.1824 |
 N=16; R=2000 | 31.4899 | 31.4897 |

The power consumption tests have been done only on the TC2 platform, which has accessible power lines, and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.
test: cyclictest -t 8 -q -e 100 -D 20

The measurements have been done during 16 seconds and the result has been normalized to 100.

          | CA15 | CA7 | total |
 default  | 100  |  40 |  140  |
 pack     |   1  |  45 |   46  |

The A15 cluster is less power efficient than the A7 cluster, but if we assume that the tasks are well spread on both clusters, we can roughly estimate what the power consumption on a dual cluster of CA7 would have been for a default kernel:

Vincent Guittot (6): Revert sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking sched: add a new SD SHARE_POWERLINE flag for sched_domain sched: pack small tasks sched: secure access to other CPU statistics sched: pack the idle load balance ARM: sched: clear SD_SHARE_POWERLINE

 arch/arm/kernel/topology.c       |   9 +++
 arch/ia64/include/asm/topology.h |   1 +
 arch/tile/include/asm/topology.h |   1 +
 include/linux/sched.h            |   9 +--
 include/linux/topology.h         |   4 ++
 kernel/sched/core.c              |  14 ++--
 kernel/sched/fair.c              | 134 +-
 kernel/sched/sched.h             |  14 ++--
 8 files changed, 163 insertions(+), 23 deletions(-)

-- 1.7.9.5
[RFC PATCH v2 1/6] Revert sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- include/linux/sched.h |8 +--- kernel/sched/core.c |7 +-- kernel/sched/fair.c |3 +-- kernel/sched/sched.h |9 + 4 files changed, 4 insertions(+), 23 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index fd17ca3..046e39a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1195,13 +1195,7 @@ struct sched_entity { /* rq owned by this entity/group: */ struct cfs_rq *my_q; #endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8482628..c25c75d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1544,12 +1544,7 @@ static void __sched_fork(struct task_struct *p) p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61c7a10..9916d41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2820,8 +2820,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq) } #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f00eb80..92ba891 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -226,12 +226,6 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -241,8 +235,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 1.7.9.5
[RFC PATCH v2 5/6] sched: pack the idle load balance
Look for an idle CPU close to the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1a4c24..6475f54 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7274,7 +7274,25 @@ static struct { static inline int find_new_ilb(int call_cpu) {

+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}

 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
-- 1.7.9.5
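The span search in the patch above can be illustrated with a toy model. This is a hypothetical user-space sketch (not kernel code): CPUs are bits in a 64-bit mask, and we walk widening "domain spans" around the buddy, returning the first idle CPU in the smallest span that contains one, which mimics what cpumask_first_and() does for each sched_domain level.

```c
#include <stdint.h>

/* Toy model of find_new_ilb(): spans[] are domain spans ordered from
 * the level closest to the buddy outwards; idle_mask marks idle CPUs. */
static int first_idle_near(uint64_t idle_mask,
			   const uint64_t *spans, int nr_spans)
{
	for (int i = 0; i < nr_spans; i++) {
		uint64_t hit = spans[i] & idle_mask;

		if (hit)
			return __builtin_ctzll(hit); /* lowest set bit = first CPU */
	}
	return -1; /* no idle CPU found in any span */
}
```

For a buddy in a 4-CPU cluster (span 0x0F) inside an 8-CPU system (span 0xFF), an idle CPU inside the cluster wins even when lower-numbered CPUs outside it are also idle, which is exactly the "stay in the buddy's power domain" intent.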
[RFC PATCH v2 6/6] ARM: sched: clear SD_SHARE_POWERLINE
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be power gated independently. So we clear SD_SHARE_POWERDOMAIN at MC and CPU level. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/kernel/topology.c |9 + 1 file changed, 9 insertions(+) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 79282eb..f89a4a2 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_local_flags(int level)
+{
+	/* Powergate at threading level doesn't make sense */
+	if (level & SD_SHARE_CPUPOWER)
+		return 1*SD_SHARE_POWERDOMAIN;
+
+	return 0*SD_SHARE_POWERDOMAIN;
+}
+
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	return cpu_topology[cpu].core_sibling;
-- 1.7.9.5
[RFC PATCH v2 4/6] sched: secure access to other CPU statistics
If a CPU accesses the runnable_avg_sum and runnable_avg_period fields of its buddy CPU while the latter updates them, it can get the new version of one field and the old version of the other. This can generate erroneous decisions. We don't want to use a lock mechanism for ensuring the coherency because of the overhead in this critical path. The previous attempt can't ensure coherency of both fields for 100% of platforms and use cases, as it will depend on the toolchain and the platform architecture. The runnable_avg_period of a runqueue tends to the max value in less than 345ms after plugging a CPU, which implies that we could use the max value instead of reading runnable_avg_period after 345ms. During the starting phase, we must ensure a minimum of coherency between the fields. A simple rule is runnable_avg_sum <= runnable_avg_period. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fc93d96..f1a4c24 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5153,13 +5153,16 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp static bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu);

+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);

 	/*
 	 * A busy buddy is a CPU with a high load or a small load with a lot of
 	 * running tasks.
 	 */
-	return ((rq->avg.runnable_avg_sum << rq->nr_running) >
-			rq->avg.runnable_avg_period);
+	return ((sum << rq->nr_running) > period);
 }

 static bool is_light_task(struct task_struct *p)
-- 1.7.9.5
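The clamp in the patch above is worth spelling out. This is a hypothetical user-space sketch (not the kernel's min() macro): a lockless reader may observe a torn pair, e.g. a freshly-updated sum next to a stale, smaller period; clamping restores the sum <= period invariant before the ratio is used.

```c
#include <stdint.h>

/* Enforce the invariant sum <= period on a possibly-torn read of the
 * (runnable_avg_sum, runnable_avg_period) pair. */
static uint32_t clamp_runnable_sum(uint32_t sum, uint32_t period)
{
	return sum < period ? sum : period; /* min(sum, period) */
}
```

A torn read such as (sum=120, period=100) is clamped back to 100, so the busy check can never see an apparent utilization above 100%.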
[PATCH 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain
This new flag SD_SHARE_POWERDOMAIN is used to reflect whether the groups of CPUs in a sched_domain level can or cannot reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks, so the flag is set in all sched_domains. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |1 + arch/tile/include/asm/topology.h |1 + include/linux/sched.h|1 + include/linux/topology.h |4 kernel/sched/core.c |6 ++ 5 files changed, 13 insertions(+) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain
members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_CPUPOWER|\ + SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE\ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing()\ @@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ @@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ | 1*SD_PREFER_SIBLING \ , \ diff --git a/kernel/sched/core.c
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12 December 2012 14:55, Alex Shi lkml.a...@gmail.com wrote: well... it's not always beneficial to group or to spread out; it depends mostly on cache behavior which is best. Let me try to understand what this means: so "performance" above with 8 threads means that those threads are spread out across more than one socket, no? If so, this would mean that you have a smaller number of tasks on each socket, thus the smaller wattage. The powersaving method OTOH fills up the one socket up to the brim, thus the slightly higher consumption due to all threads being occupied. Is that it? Not sure. By and large, power efficiency is the same as performance efficiency, with some twists. Or, to reword that to be more clear: if you waste performance due to something that becomes inefficient, you're wasting power as well. Now, you might have some hardware effects that can then save you power... but those effects then first need to overcome the waste from the performance inefficiency... and that almost never happens. For example, if you have two workloads that each barely fit inside the last level cache, it's much more efficient to spread these over two sockets, where each has its own full LLC to use. If you'd group these together, both would thrash the cache all the time and run inefficiently -- bad for power. Now, on the other hand, if you have two threads of a process that share a bunch of data structures, and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot, running inefficiently -- bad for power. Agree with all of the above. However.. having said all this, if you have two tasks that don't have such cache effects, the most efficient way of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that. .. there are alternatives to hyperthreading. On ARM's big.LITTLE architecture you could simply schedule them on the LITTLE cores.
The big cores just can't beat the power efficiency of the LITTLE ones, even with the 'race to halt' that you allude to below. And use cases like mp3 playback simply don't require the kind of performance that the big cores can offer. But this assumes the tasks don't compete much for resources at the HT level, and achieve good scaling. And this still has to compete with race to halt, because if you're done quicker, you can put the memory in self-refresh quicker. None of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards. Heck, even for just performance it's really really hard already, never mind adding power. My personal gut feeling is that we should just optimize this scheduler stuff for performance, and that we're going to be doing quite well on power already if we achieve that. If Linux is to continue to work efficiently on heterogeneous multi-processing platforms, it needs to provide scheduling mechanisms that can be exploited as per the demands of the HW architecture. Linus definitely disagrees with such ideas. :) So we need to summarize the logic beyond all hardware. An example is the small task packing (and spreading) for which Vincent Guittot has posted a patchset[1] earlier, and so has Alex now. Sure. I just thought my patchset should handle the 'small task packing' scenario. Would you guys like to have a try? Hi Alex, Yes, I will give your patchset a try when I have some spare time. Vincent [1] http://lwn.net/Articles/518834/
Re: [PATCH 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain
On 13 December 2012 03:24, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPU in a sched_domain level can or not reach a different power state. If clusters can be power gated independently, as an example, the flag should be cleared at CPU level. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gated the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks so the flag is set into all sched_domains Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |1 + arch/tile/include/asm/topology.h |1 + include/linux/sched.h|1 + include/linux/topology.h |4 kernel/sched/core.c |6 ++ 5 files changed, 13 insertions(+) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..6d0b617 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d5e86c9..adc8710 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 046e39a..3287be1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -844,6 +844,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define 
SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..3eab293 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,8 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_CPUPOWER|\ + SD_SHARE_PKG_RESOURCES) \ | 0*SD_SERIALIZE\ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing()\ @@ -131,6 +133,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ @@ -161,6 +164,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0)\ The general style looks like preferring the SD_XXX flag directly. In fact, I have followed the same style
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior. On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:

       | Cluster 0   | Cluster 1   |
       | CPU0 | CPU1 | CPU2 | CPU3 |
 ----------------------------------
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance, so the best time to migrate them is during their wake up. The decision is in O(1) as we only check against one buddy CPU. Just have a little worry about the scalability on a big machine: on a 4-socket NUMA machine * 8 cores * HT, the buddy cpu in the whole system needs to care for 64 LCPUs, while in your case cpu0 just cares for 4 LCPUs. That is different for the task distribution decision. The buddy CPU should probably not be the same for all 64 LCPUs; it depends on where it's worth packing small tasks. Which kind of sched_domain configuration do you have for such a system? And how many sched_domain levels do you have?
Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/core.c |1 + kernel/sched/fair.c | 110 ++ kernel/sched/sched.h |5 +++ 3 files changed, 116 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4f36e9d..3436aad 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5693,6 +5693,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)

 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
+	update_packing_domain(cpu);
 	update_domain_cache(cpu);
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9916d41..fc93d96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -163,6 +163,73 @@ void sched_init_granularity(void) update_sysctl(); }

+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/* Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't worth to pack on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag.
Then We look for the sched_group witht the lowest + * power per core based on the assumption that their power efficiency is + * better */ +void update_packing_domain(int cpu) +{ + struct sched_domain *sd; + int id = -1; + + sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN SD_LOAD_BALANCE); + if (!sd) + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)-sd); + else + sd = sd-parent; + + while (sd (sd-flags SD_LOAD_BALANCE)) { + struct sched_group *sg = sd-groups; + struct sched_group *pack = sg; + struct sched_group *tmp; + + /* + * The sched_domain of a CPU points on the local sched_group + * and the 1st CPU of this local group is a good candidate + */ + id = cpumask_first(sched_group_cpus(pack)); + + /* loop the sched groups to find the best one */ + for (tmp = sg-next; tmp != sg; tmp = tmp-next) { + if (tmp-sgp-power * pack-group_weight + pack-sgp-power * tmp-group_weight) + continue; + + if ((tmp-sgp-power * pack-group_weight == + pack-sgp-power * tmp-group_weight) + (cpumask_first(sched_group_cpus(tmp)) = id)) + continue; + + /* we have found a better group */ + pack = tmp; + + /* Take the 1st CPU of the new group */ + id = cpumask_first(sched_group_cpus(pack)); + } + + /* Look for another CPU than itself */ + if (id != cpu) + break; + + sd = sd-parent; + } + + pr_debug(CPU%d packing on CPU%d\n, cpu, id); + per_cpu(sd_pack_buddy, cpu) = id; +} + +#endif /* CONFIG_SMP */ + #if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -5083,6 +5150,46 @@ static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cp return true; } +static bool is_buddy_busy(int cpu
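For readers skimming the thread, the group-selection loop in the quoted patch can be modelled in a few lines of userspace C. This is only a sketch: the struct below is a made-up stand-in for sched_group / sched_group_power, and pick_pack_cpu() mirrors the two cross-multiplication tests that pick the group with the lowest power per core, tie-broken by the lowest first-CPU id.

```c
#include <stddef.h>

/* Simplified stand-ins for sched_group / sched_group_power fields. */
struct group {
    unsigned long power;  /* models sg->sgp->power                       */
    unsigned int weight;  /* models sg->group_weight                     */
    int first;            /* models cpumask_first(sched_group_cpus(sg))  */
};

/*
 * Return the first CPU of the best packing group: minimise power per
 * core, i.e. power/weight, compared by cross-multiplication to stay in
 * integer arithmetic, exactly as the two tests in the patch do.
 */
int pick_pack_cpu(const struct group *g, size_t n)
{
    size_t pack = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        /* strictly worse power per core: skip this group */
        if (g[i].power * g[pack].weight > g[pack].power * g[i].weight)
            continue;
        /* equal power per core but higher first CPU id: skip too */
        if (g[i].power * g[pack].weight == g[pack].power * g[i].weight
            && g[i].first >= g[pack].first)
            continue;
        pack = i;  /* found a better group */
    }
    return g[pack].first;
}
```

On a tie in power per core, the walk settles on the group containing the lowest-numbered CPU, which is what makes all CPUs of a cluster converge on one buddy.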
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

Which kind of sched_domain configuration do you have for such a system, and how many sched_domain levels do you have?

it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
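The three buddy layouts in the table can be captured by a small helper. A sketch only — the enum and function are invented for illustration, assuming the table's numbering where socket s owns LCPUs 16*s to 16*s+15:

```c
/* CPU numbering assumed from the table: socket s owns LCPUs 16*s..16*s+15. */
enum conf { CONF0, CONF1, CONF2 };

int buddy(int cpu, enum conf c)
{
    int socket = cpu / 16;
    int first = socket * 16;      /* first LCPU of this CPU's socket      */

    if (cpu != first)
        return first;             /* everyone packs inside its own socket */
    if (socket == 0 || c == CONF2)
        return first;             /* conf2: sockets are left independent  */
    if (c == CONF1)
        return (socket - 1) * 16; /* conf1: chain socket s to socket s-1  */
    return socket;                /* conf0: dedicated LCPU s of socket 0  */
}
```

conf2 keeps every socket independent, conf1 chains socket s to socket s-1, and conf0 gives each remote socket a dedicated slot among the first LCPUs of socket 0.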
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3.

Which kind of sched_domain configuration do you have for such a system, and how many sched_domain levels do you have?

it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa.
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 14 December 2012 02:46, Alex Shi alex@intel.com wrote: On 12/13/2012 11:48 PM, Vincent Guittot wrote: On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small systems but could be interesting for larger systems.

We keep the same algorithm within a socket, but we could either use another LCPU in the targeted socket (conf0) or chain the sockets (conf1) instead of packing directly onto one LCPU. The scheme below tries to summarize the idea:

Socket      |  socket 0  |  socket 1  |  socket 2  |  socket 3  |
LCPU        | 0 |  1-15  | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 |
buddy conf0 | 0 |   0    |  1 |  16   |  2 |  32   |  3 |  48   |
buddy conf1 | 0 |   0    |  0 |  16   | 16 |  32   | 32 |  48   |
buddy conf2 | 0 |   0    | 16 |  16   | 32 |  32   | 48 |  48   |

But I don't know how this can interact with NUMA load balance, and the better choice might be to use conf3.

I mean conf2, not conf3.

So, it has 4 levels 0/16/32 for socket 3 and 0 levels for socket 0; it is unbalanced across the sockets.

That's the target, because we have decided to pack the small tasks in socket 0 when we parsed the topology at boot. We don't have to loop over sched_domain or sched_group anymore to find the best LCPU when a small task wakes up.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you just have 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from). Then, if too many small tasks wake up simultaneously on the same LCPU, the default load balance will spread them across the core/cluster/socket.

Obviously, more and more cores is the trend for all kinds of CPUs; the buddy system seems hard-pressed to catch up with this.
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 14 December 2012 08:45, Mike Galbraith bitbuc...@online.de wrote: On Fri, 2012-12-14 at 14:36 +0800, Alex Shi wrote: On 12/14/2012 12:45 PM, Mike Galbraith wrote: Do you have further ideas for buddy cpu on such example? Which kind of sched_domain configuration have you for such system ? and how many sched_domain level have you ? it is general X86 domain configuration. with 4 levels, sibling/core/cpu/numa. CPU is a bug that slipped into domain degeneration. You should have SIBLING/MC/NUMA (chasing that down is on todo). Maybe. the CPU/NUMA is different on domain flags, CPU has SD_PREFER_SIBLING. What I noticed during (an unrelated) bisection on a 40 core box was domains going from so.. 3.4.0-bisect (virgin) [5.056214] CPU0 attaching sched-domain: [5.065009] domain 0: span 0,32 level SIBLING [5.075011] groups: 0 (cpu_power = 589) 32 (cpu_power = 589) [5.088381] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [5.107669]groups: 0,32 (cpu_power = 1178) 4,36 (cpu_power = 1178) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1177) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1177) 28,60 (cpu_power = 1177) 64,72 (cpu_power = 1176) 68,76 (cpu_power = 1176) [5.162115]domain 2: span 0-79 level NODE [5.171927] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11773) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11772) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78 (cpu_power = 11773) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11770) ..to so, which looks a little bent. CPU and MC have identical spans, so CPU should have gone away, as it used to do. 
3.6.0-bisect (virgin) [3.978338] CPU0 attaching sched-domain: [3.987125] domain 0: span 0,32 level SIBLING [3.997125] groups: 0 (cpu_power = 588) 32 (cpu_power = 589) [4.010477] domain 1: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level MC [4.029748]groups: 0,32 (cpu_power = 1177) 4,36 (cpu_power = 1177) 8,40 (cpu_power = 1178) 12,44 (cpu_power = 1178) 16,48 (cpu_power = 1178) 20,52 (cpu_power = 1178) 24,56 (cpu_power = 1178) 28,60 (cpu_power = 1178) 64,72 (cpu_power = 1178) 68,76 (cpu_power = 1177) [4.084143]domain 2: span 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 level CPU [4.103796] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) [4.124373] domain 3: span 0-79 level NUMA [4.134369] groups: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76 (cpu_power = 11777) 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77 (cpu_power = 11778) 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74 ,78 (cpu_power = 11778) 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79 (cpu_power = 11780) Thanks. that's an interesting example of a numa topology For your sched_domain difference, On 3.4, SD_PREFER_SIBLING was set for both MC and CPU level thanks to sd_balance_for_mc_power and sd_balance_for_package_power On 3.6, SD_PREFER_SIBLING is only set for CPU level and this flag difference with MC level prevents the destruction of CPU sched_domain during the degeneration We may need to set SD_PREFER_SIBLING for MC level Vincent -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
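The degeneration rule being discussed can be summarised with a toy check. This is a sketch only — the flag values are invented and the real kernel test (sd_parent_degenerate()) compares a specific set of behavioural flags — but it shows why SD_PREFER_SIBLING set on CPU and not on MC keeps the redundant CPU level alive in the 3.6 dump above even though its span equals MC's:

```c
#include <stdbool.h>

/* Illustrative flag values, not the kernel's. */
#define SD_LOAD_BALANCE    0x0001
#define SD_PREFER_SIBLING  0x1000

struct dom {
    unsigned long span;   /* bitmask of spanned CPUs, enough for a model */
    int flags;
};

/* true if 'parent' is redundant relative to 'child' and would be destroyed */
bool degenerates(const struct dom *child, const struct dom *parent)
{
    if (child->span != parent->span)
        return false;
    /* any flag the parent adds on top of the child keeps it alive */
    return (parent->flags & ~child->flags) == 0;
}
```

With identical spans, the only thing that can save the parent level is a flag difference — which is why aligning SD_PREFER_SIBLING on MC would let the duplicate CPU level collapse again.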
Re: [RFC PATCH v2 3/6] sched: pack small tasks
On 16 December 2012 08:12, Alex Shi alex@intel.com wrote: On 12/14/2012 05:33 PM, Vincent Guittot wrote: On 14 December 2012 02:46, Alex Shi alex@intel.com wrote: On 12/13/2012 11:48 PM, Vincent Guittot wrote: On 13 December 2012 15:53, Vincent Guittot vincent.guit...@linaro.org wrote: On 13 December 2012 15:25, Alex Shi alex@intel.com wrote: On 12/13/2012 06:11 PM, Vincent Guittot wrote: On 13 December 2012 03:17, Alex Shi alex@intel.com wrote: On 12/12/2012 09:31 PM, Vincent Guittot wrote: During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPU can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain level and the buddy is set to -1. This is the default behavior. On a dual clusters / dual cores system which can power gate each core and cluster independently, the buddy configuration will be : | Cluster 0 | Cluster 1 | | CPU0 | CPU1 | CPU2 | CPU3 | --- buddy | CPU0 | CPU0 | CPU0 | CPU2 | Small tasks tend to slip out of the periodic load balance so the best place to choose to migrate them is during their wake up. The decision is in O(1) as we only check again one buddy CPU Just have a little worry about the scalability on a big machine, like on a 4 sockets NUMA machine * 8 cores * HT machine, the buddy cpu in whole system need care 64 LCPUs. and in your case cpu0 just care 4 LCPU. That is different on task distribution decision. The buddy CPU should probably not be the same for all 64 LCPU it depends on where it's worth packing small tasks Do you have further ideas for buddy cpu on such example? 
yes, I have several ideas which were not really relevant for small system but could be interesting for larger system We keep the same algorithm in a socket but we could either use another LCPU in the targeted socket (conf0) or chain the socket (conf1) instead of packing directly in one LCPU The scheme below tries to summaries the idea: Socket | socket 0 | socket 1 | socket 2 | socket 3 | LCPU| 0 | 1-15 | 16 | 17-31 | 32 | 33-47 | 48 | 49-63 | buddy conf0 | 0 | 0| 1 | 16| 2 | 32| 3 | 48| buddy conf1 | 0 | 0| 0 | 16| 16 | 32| 32 | 48| buddy conf2 | 0 | 0| 16 | 16| 32 | 32| 48 | 48| But, I don't know how this can interact with NUMA load balance and the better might be to use conf3. I mean conf2 not conf3 So, it has 4 levels 0/16/32/ for socket 3 and 0 level for socket 0, it is unbalanced for different socket. That the target because we have decided to pack the small tasks in socket 0 when we have parsed the topology at boot. We don't have to loop into sched_domain or sched_group anymore to find the best LCPU when a small tasks wake up. iteration on domain and group is a advantage feature for power efficient requirement, not shortage. If some CPU are already idle before forking, let another waking CPU check their load/util and then decide which one is best CPU can reduce late migrations, that save both the performance and power. In fact, we have already done this job once at boot and we consider that moving small tasks in the buddy CPU is always benefit so we don't need to waste time looping sched_domain and sched_group to compute current capacity of each LCPU for each wake up of each small tasks. We want all small tasks and background activity waking up on the same buddy CPU and let the default behavior of the scheduler choosing the best CPU for heavy tasks or loaded CPUs. On the contrary, move task walking on each level buddies is not only bad on performance but also bad on power. Consider the quite big latency of waking a deep idle CPU. we lose too much.. 
My results have shown a different conclusion. In fact, there is a much bigger chance that the buddy will not be in deep idle, as all the small tasks and background activity are already waking up on this CPU.

And the ground level has just one buddy for 16 LCPUs - 8 cores; that's not a good design. Consider my previous examples: if there are 4 or 8 tasks in one socket, you just have 2 choices: spread them onto all cores, or pack them onto one LCPU. Actually, moving them onto just 2 or 4 cores may be a better solution, but the design misses this.

You speak about tasks without any notion of load. This patch only cares about small tasks and light LCPU load, and it falls back to the default behavior in other situations. So if there are 4 or 8 small tasks, they will migrate to socket 0 after 1 and up to 3 migrations (it depends on the conf and the LCPU they come from).

According to your patch, what you mean by 'notion of load' is the utilization of the CPU, not the load weight of tasks, right?

Yes, but not only. The number of tasks that run simultaneously is another
Re: [RFC v2 PATCH 2/2] sched: Use Per-Entity-Load-Tracking metric for load balancing
Hi Preeti, On 15 November 2012 17:54, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Currently the load balancer weighs a task based upon its priority,and this weight consequently gets added up to the weight of the run queue that it is on.It is this weight of the runqueue that sums up to a sched group's load which is used to decide the busiest or the idlest group and the runqueue thereof. The Per-Entity-Load-Tracking metric however measures how long a task has been runnable over the duration of its lifetime.This gives us a hint of the amount of CPU time that the task can demand.This metric takes care of the task priority as well.Therefore apart from the priority of a task we also have an idea of the live behavior of the task.This seems to be a more realistic metric to use to compute task weight which adds upto the run queue weight and the weight of the sched group.Consequently they can be used for load balancing. The semantics of load balancing is left untouched.The two functions load_balance() and select_task_rq_fair() perform the task of load balancing.These two paths have been browsed through in this patch to make necessary changes. weighted_cpuload() and task_h_load() provide the run queue weight and the weight of the task respectively.They have been modified to provide the Per-Entity-Load-Tracking metric as relevant for each. The rest of the modifications had to be made to suit these two changes. 
The Completely Fair Scheduler class is the only sched_class which contributes to the run queue load. Therefore rq->load.weight == cfs_rq->load.weight when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. When replacing this with the Per-Entity-Load-Tracking metric, cfs_rq->runnable_load_avg needs to be used, as this is the right reflection of the run queue load when the cfs_rq is the root cfs_rq (rq->cfs) of the hierarchy. This metric reflects the percentage uptime of the tasks that are queued on it and hence contribute to the load. Thus cfs_rq->runnable_load_avg replaces the metric earlier used in weighted_cpuload().

The task load is aptly captured by se.avg.load_avg_contrib, which captures the runnable time vs the alive time of the task against its priority. This metric replaces the earlier metric used in task_h_load().

The consequent changes appear as data type changes for the helper variables; they abound in number, because cfs_rq->runnable_load_avg needs to be big enough to capture the tasks' load often and accurately.

You are now using cfs_rq->runnable_load_avg instead of cfs_rq->load.weight for the calculation of cpu_load, but cfs_rq->runnable_load_avg is smaller than or equal to the cfs_rq->load.weight value. This implies that the new value is smaller than or equal to the old statistic, so you should be able to keep the same variable width for the computation of cpu_load.

The following patch does not consider CONFIG_FAIR_GROUP_SCHED and CONFIG_SCHED_NUMA. This is done so as to evaluate this approach starting from the simplest scenario. Earlier discussions can be found in the link below.
Link: https://lkml.org/lkml/2012/10/25/162 Signed-off-by: Preeti U Murthypre...@linux.vnet.ibm.com --- include/linux/sched.h |2 +- kernel/sched/core.c | 12 + kernel/sched/fair.c | 64 + kernel/sched/sched.h |2 +- 4 files changed, 40 insertions(+), 40 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 087dd20..302756e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -924,7 +924,7 @@ struct sched_domain { unsigned int lb_count[CPU_MAX_IDLE_TYPES]; unsigned int lb_failed[CPU_MAX_IDLE_TYPES]; unsigned int lb_balanced[CPU_MAX_IDLE_TYPES]; - unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES]; + u64 lb_imbalance[CPU_MAX_IDLE_TYPES]; unsigned int lb_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES]; unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES]; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 24d8b9b..4dea057 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2415,8 +2415,8 @@ static const unsigned char * would be when CPU is idle and so we just decay the old load without * adding any new load. */ -static unsigned long -decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) +static u64 +decay_load_missed(u64 load, unsigned long missed_updates, int idx) { int j = 0; @@ -2444,7 +2444,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. */ -static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, +static void __update_cpu_load(struct rq *this_rq, u64 this_load, unsigned long pending_updates) { int
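Vincent's width argument — that runnable_load_avg can never exceed the old load.weight sum — follows from the shape of the per-entity metric: a task contributes weight * runnable_sum / period_sum, and the runnable sum can never exceed the period sum. A small userspace model makes this concrete; the decay constant approximates PELT's ~32ms half-life per 1ms window, and all names here are illustrative, not the kernel's:

```c
#include <stdint.h>

/*
 * Model of a per-entity load contribution: accumulate a geometrically
 * decayed "runnable" sum and "period" sum over 1ms windows, then scale
 * the task weight by their ratio.  y is chosen so y^32 ~= 0.5, here
 * approximated as 4008/4096 in fixed point.
 */
uint64_t contrib(uint64_t weight, unsigned runnable_ms, unsigned period_ms)
{
    uint64_t run = 0, per = 0;
    unsigned i;

    for (i = 0; i < period_ms; i++) {
        run = (run * 4008) >> 12;   /* decay both sums each window */
        per = (per * 4008) >> 12;
        if (i < runnable_ms)        /* task was runnable this window */
            run += 1024;
        per += 1024;                /* the window itself always counts */
    }
    return per ? weight * run / per : 0;
}
```

Since run <= per at every step, contrib() is bounded by the weight, so summing contributions over a runqueue stays within the range the old load.weight sum already needed.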
[RFC 3/3] sched: fix update NOHZ_IDLE flag
The function nohz_kick_needed modifies the NOHZ_IDLE flag that is used to update the nr_busy_cpus field of the sched_group. When the sched_domains are updated (because of the unplug of a CPU, for example), a null_domain is attached to the CPUs. We have to test likely(!on_null_domain(cpu)) first in order to detect such an initialization step and to not modify the NOHZ_IDLE flag.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d0686c..1bf7c87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5490,7 +5490,7 @@ void trigger_load_balance(struct rq *rq, int cpu)
 	    likely(!on_null_domain(cpu)))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ
-	if (nohz_kick_needed(rq, cpu) && likely(!on_null_domain(cpu)))
+	if (likely(!on_null_domain(cpu)) && nohz_kick_needed(rq, cpu))
 		nohz_balancer_kick(cpu);
 #endif
 }
--
1.7.10
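The one-line reorder above relies on the short-circuit evaluation of &&: once on_null_domain() is tested first, the flag-modifying predicate can no longer run against a CPU whose domain was just replaced by the null_domain. A stub model of the patched ordering (all names here are stand-ins, not the kernel functions):

```c
#include <stdbool.h>

static bool null_domain;       /* models rcu_dereference(sd) == NULL */
static int  kick_needed_calls; /* counts entries into the predicate  */

static bool on_null_domain(void) { return null_domain; }

static bool nohz_kick_needed(void)
{
    kick_needed_calls++;       /* would clear NOHZ_IDLE, walk domains... */
    return true;
}

/* patched ordering: the domain test guards the flag-modifying call */
bool trigger_kick(void)
{
    return !on_null_domain() && nohz_kick_needed();
}
```

With the original order, nohz_kick_needed() ran (and touched the flag) before the null-domain check could veto the kick; after the swap it is never entered on a detached CPU.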
[RFC 0/3] sched: fix nr_busy_cpus
The nr_busy_cpus field of the sched_group_power is sometimes different from 0 whereas the platform is fully idle. This series fixes 3 use cases:
- when the SCHED softirq is raised on an idle core for idle load balance but the platform doesn't go out of the cpuidle state
- when some CPUs enter idle state while booting all CPUs
- when a CPU is unplugged and/or replugged

Vincent Guittot (3):
  sched: fix nr_busy_cpus with coupled cpuidle
  sched: fix init NOHZ_IDLE flag
  sched: fix update NOHZ_IDLE flag

 kernel/sched/core.c      |    1 +
 kernel/sched/fair.c      |    2 +-
 kernel/time/tick-sched.c |    2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

--
1.7.10
[RFC 1/3] sched: fix nr_busy_cpus with coupled cpuidle
With the coupled cpuidle driver (but probably also with other drivers), a CPU loops in a temporary safe state while waiting for the other CPUs of its cluster to be ready to enter the coupled C-state. If an IRQ or a softirq occurs, the CPU will stay in this internal loop if there is no need to resched. The SCHED softirq clears the NOHZ_IDLE flag and increases nr_busy_cpus. If there is no need to resched, we will not call set_cpu_sd_state_idle because of this internal loop in a cpuidle state. We have to call set_cpu_sd_state_idle in tick_nohz_irq_exit, which is used to handle such a situation.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/time/tick-sched.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a402608..e19bbc9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -526,6 +526,8 @@ void tick_nohz_irq_exit(void)
 	if (!ts->inidle)
 		return;
 
+	set_cpu_sd_state_idle();
+
 	__tick_nohz_idle_enter(ts);
 }
--
1.7.10
[RFC 2/3] sched: fix init NOHZ_IDLE flag
On my SMP platform, which is made of 5 cores in 2 clusters, the nr_busy_cpus field of the sched_group_power struct is not null when the platform is fully idle. The root cause seems to be: during the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for other CPUs to boot. But the nr_busy_cpus field is initialized later, with the assumption that all CPUs are in the busy state, whereas some CPUs have already set their NOHZ_IDLE flag. We clear the NOHZ_IDLE flag when nr_busy_cpus is initialized in order to have a coherent configuration.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..05058e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5817,6 +5817,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	update_group_power(sd, cpu);
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+	clear_bit(NOHZ_IDLE, nohz_flags(cpu));
 }
 
 int __weak arch_sd_sibling_asym_packing(void)
--
1.7.10
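The incoherence being fixed can be modelled in a few lines: nr_busy_cpus is (re)initialised to the group weight on the assumption that every CPU is busy, so any NOHZ_IDLE flag already set by a CPU that reached the idle loop first must be dropped at the same time, otherwise that CPU's next idle transition is skipped and the counter never reaches zero. Plain ints stand in for the kernel's atomic_t and bitops; the names are illustrative.

```c
struct grp {
    int nr_busy_cpus;  /* models sg->sgp->nr_busy_cpus   */
    int nohz_idle[2];  /* models the per-CPU NOHZ_IDLE bit */
    int weight;        /* models sg->group_weight          */
};

void init_power(struct grp *g)
{
    int c;

    g->nr_busy_cpus = g->weight;   /* assume every CPU is busy       */
    for (c = 0; c < g->weight; c++)
        g->nohz_idle[c] = 0;       /* the fix: drop any stale flags  */
}

void cpu_goes_idle(struct grp *g, int cpu)
{
    if (g->nohz_idle[cpu])         /* already flagged idle: skip     */
        return;
    g->nohz_idle[cpu] = 1;
    g->nr_busy_cpus--;
}
```

Without the clearing step in init_power(), a CPU whose flag was set before the initialisation would be skipped in cpu_goes_idle(), leaving nr_busy_cpus at 1 on a fully idle group — the symptom described above.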
Re: [PATCH V2 3/3] workqueue: Schedule work on non-idle cpu instead of current one
On 1 October 2012 05:47, Viresh Kumar viresh.ku...@linaro.org wrote:

On 1 October 2012 06:02, Tejun Heo t...@kernel.org wrote:

It isn't about the CPU being actually idle?

No. Being idle only from the scheduler's perspective. :)

Also, if it's only about timers, shouldn't it be enough to implement it for timer and delayed_work?

What if we need a timer which must re-arm itself + schedule a work? delayed_work will not be sufficient in that case, and we would need to use a normal work. If I am not wrong, there can be other future users of this routine too. @Vincent: Can you please comment on this?

The goal is to consolidate the place where the selection of a target CPU for running something is done. The scheduler should select the CPU in order to have coherent behavior across all frameworks. A delayed work can be delayed for a long period, and the target CPU that was selected for the timer may no longer be the best place to queue a work. The goal is to be able to migrate a work if the scheduler thinks that is necessary.

Vincent

It would be great if you explain what you're trying to achieve and how. I can't tell what you're aiming for and why that would be beneficial from these submissions.

The following slides were prepared by Vincent and presented during LPC. Please have a look at them; they explain the problem statement well: http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf Specifically slides 12 and 19.

There aren't too many users with this behavior, but even a single user will be sufficient to not let the CPU get idle at all. And that will result in less power saving.

--
viresh
Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
On 2 January 2013 05:22, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi everyone, I have been looking at how different workloads react when the per entity load tracking metric is integrated into the load balancer and what are the possible reasons for it. I had posted the integration patch earlier: https://lkml.org/lkml/2012/11/15/391 Essentially what I am doing is: 1.I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple 2.I have replaced cfs_rq-load.weight in weighted_cpuload() with cfs.runnable_load_avg,the active load tracking metric. 3.I have replaced se.load.weight in task_h_load() with se.load.avg.contrib,the per entity load tracking metric. 4.The load balancer will end up using these metrics. After conducting experiments on several workloads I found out that the performance of the workloads with the above integration would neither improve nor deteriorate.And this observation was consistent. Ideally the performance should have improved considering,that the metric does better tracking of load. Let me explain with a simple example as to why we should see a performance improvement ideally:Consider 2 80% tasks and 1 40% task. With integration: 40% 80%40% cpu1 cpu2 The above will be the scenario when the tasks fork initially.And this is a perfectly balanced system,hence no more load balancing.And proper distribution of loads on the cpu. Without integration --- 40% 40% 80%40% 80%40% cpu1 cpu2OR cpu1 cpu2 Because the view is that all the tasks as having the same load.The load balancer could ping pong tasks between these two situations. When I performed this experiment,I did not see an improvement in the performance though in the former case.On further observation I found that the following was actually happening. 
With integration:

 Initially          40% task sleeps    40% task wakes up and
                                       select_idle_sibling()
                                       decides to wake it up on cpu1

        40%                 -                 40%
 80%    40%         80%    40%         80%    40%
 cpu1   cpu2        cpu1   cpu2        cpu1   cpu2

This makes load balance trigger movement of the 40% task from cpu1 back to cpu2. Hence the stability that the load balancer was trying to achieve is gone. Hence the culprit boils down to select_idle_sibling(). How is it the culprit and how is it hindering the performance of the workloads?

*What is the way ahead with the per entity load tracking metric in the load balancer then?*

In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105, he mentions the following:

"It is my intuition that the greatest carnage here is actually caused by wake-up load-balancing getting in the way of periodic in establishing a steady state. I suspect more mileage would result from reducing the interference wake-up load-balancing has with steady state. The whole point of using blocked load is so that you can converge on a steady state where you don't NEED to move tasks. What disrupts this is we naturally prefer idle cpus on wake-up balance to reduce wake-up latency."

I think the better answer is making these two processes, load balancing and select_idle_sibling(), more co-operative. I had not realised how this would happen until I saw it happening in the above experiment.

Based on what Paul explained above, let us use the runnable load + the blocked load for calculating the load on a cfs runqueue, rather than just the runnable load (which is what I am doing now), and see its consequence.

 Initially           40% task sleeps

        40%                 -
 80%    40%          80%    40%
 cpu1   cpu2         cpu1   cpu2

So initially the load on cpu1 is say 80 and on cpu2 it is also 80. Balanced. Now when the 40% task sleeps, the total load on cpu2 = runnable load + blocked load, which is still 80.

As a consequence, firstly, during periodic load balancing the load is not moved from cpu1 to cpu2 when the 40% task sleeps (it sees the load on cpu2 as 80 and not as 40).
Hence the above scenario remains the same. On wake up, what happens? Here comes the point of making both load balancing and wake up balancing (select_idle_sibling) co-operative. How about we always schedule the woken up task on the prev_cpu? This seems more sensible, considering that load balancing counts the blocked load as part of the load of cpu2.

Hi Preeti,

I'm not sure that we want such a steady state at the core level, because we take advantage of migrating waking tasks between cores that share their cache, as Matthew demonstrated. But I agree that reaching such a steady state at the cluster and CPU level is interesting.

IMHO, you're right that taking the blocked load into consideration should minimize task migration between clusters, but it should not prevent fast task migration between cores that share their cache.

Vincent
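To make the discussion above concrete, here is a rough, self-contained sketch of the per entity load tracking idea: each ~1ms period contributes to a geometrically decayed sum, with a decay factor y chosen so that y^32 is about 1/2, and a task that is runnable r% of the time converges to a contribution of roughly weight * r/100. The struct name, the fixed-point constants, and the update granularity are illustrative stand-ins, not the kernel's exact implementation.

```c
#include <assert.h>

/* y ~= 0.9785 approximated as 978/1000, so y^32 ~= 0.5 */
#define DECAY_NUM 978
#define DECAY_DEN 1000

struct entity_load {
	unsigned long runnable_sum; /* decayed sum of runnable periods */
	unsigned long period_sum;   /* decayed sum of elapsed periods  */
};

/* Account one ~1ms period; 'runnable' says whether the task ran in it. */
static void update_entity_load(struct entity_load *el, int runnable)
{
	el->runnable_sum = el->runnable_sum * DECAY_NUM / DECAY_DEN
			 + (runnable ? 1024 : 0);
	el->period_sum = el->period_sum * DECAY_NUM / DECAY_DEN + 1024;
}

/* Contribution of this entity to the runqueue load, scaled by weight. */
static unsigned long load_avg_contrib(const struct entity_load *el,
				      unsigned long weight)
{
	return el->period_sum ? weight * el->runnable_sum / el->period_sum : 0;
}
```

With weight 1024, a task runnable 2ms out of every 5ms settles around a contribution of ~400 (i.e. ~40% of its weight, with some ripple depending on where in the on/off cycle you sample), which is exactly the property that lets the balancer tell the 40% task apart from the 80% tasks in the example above.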
Re: [RFC][PATCH 2/2] ARM64: introduce cluster id and make a difference between socket id
On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote:

In the cpu topology information, we define topology_physical_package_id() as the cpu socket id, which means that the socket id is the identification for a physical processor, not for a cluster in a cpu die. On the ARM64 platform, multiple clusters in a cpu die will be normal. Here is an example with 2 cores in a cluster and 2 clusters in a socket:

 +---------------------------------------+
 | socket                                |
 |  +---------------+  +---------------+ |
 |  | cluster       |  | cluster       | |
 |  | +----+ +----+ |  | +----+ +----+ | |
 |  | |core| |core| |  | |core| |core| | |
 |  | +----+ +----+ |  | +----+ +----+ | |
 |  +---------------+  +---------------+ |
 +---------------------------------------+

ARM64 extended the MPIDR to 64 bit and introduced another affinity level. We can use this affinity level for the socket id and use the third highest affinity level for the cluster id, which keeps the socket id behaving in its original way.

Signed-off-by: Hanjun Guo hanjun@linaro.org
---
 arch/arm64/include/asm/topology.h | 1 +
 arch/arm64/kernel/topology.c      | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 8631808..ff68ecc 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -8,6 +8,7 @@
 struct cputopo_arm64 {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int socket_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 1eb0435..6d1e5a6 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -80,12 +80,14 @@ void store_cpu_topology(unsigned int cpuid)
 		/* core performance interdependency */
 		cpuid_topo->thread_id = MPIDR_AFFINITY_LEVEL_0(mpidr);
 		cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
-		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
+		cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
+		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_3(mpidr);

socket_id is currently used by update_siblings_masks() to update the core_sibling mask.
This mask defines which CPUs share their cache, and AFAICT the caches are shared at the cluster level, so cluster_id should be used instead of socket_id. Have you got more information about the goal of this new level 3?

Vincent

 	} else {
 		/* largely independent cores */
 		cpuid_topo->thread_id = -1;
 		cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL_0(mpidr);
-		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
+		cpuid_topo->cluster_id = MPIDR_AFFINITY_LEVEL_1(mpidr);
+		cpuid_topo->socket_id = MPIDR_AFFINITY_LEVEL_2(mpidr);
 	}
 } else {
 	/*
@@ -95,6 +97,7 @@
 	 */
 	cpuid_topo->thread_id = -1;
 	cpuid_topo->core_id = 0;
+	cpuid_topo->cluster_id = -1;
 	cpuid_topo->socket_id = -1;
 }

@@ -120,6 +123,7 @@ void __init init_cpu_topology(void)
 	topo->thread_id = -1;
 	topo->core_id = -1;
+	topo->cluster_id = -1;
 	topo->socket_id = -1;
 	cpumask_clear(&topo->core_sibling);
 	cpumask_clear(&topo->thread_sibling);
--
1.7.9.5
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote:

Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making.

It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

Vincent

For ARM64, we can get the topology from the MPIDR register which defines the affinity of processors. This patch is mainly based on arch/arm/kernel/topology.c written by Vincent Guittot, with the topology array replaced by a per cpu variable.

Signed-off-by: Hanjun Guo hanjun@linaro.org
---
 arch/arm64/Kconfig                |   9 +++
 arch/arm64/include/asm/cputype.h  |  11 ++++
 arch/arm64/include/asm/topology.h |  41 ++++++++
 arch/arm64/kernel/Makefile        |   1 +
 arch/arm64/kernel/smp.c           |   6 ++
 arch/arm64/kernel/topology.c      | 128 +++++++++
 6 files changed, 196 insertions(+)
 create mode 100644 arch/arm64/include/asm/topology.h
 create mode 100644 arch/arm64/kernel/topology.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9737e97..f0ce91b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -150,6 +150,15 @@ config SMP
 	  If you don't know what to do here, say N.

+config ARM64_CPU_TOPOLOGY
+	bool "Support cpu topology definition"
+	depends on SMP && ARM64
+	default y
+	help
+	  Support ARM64 cpu topology definition. The MPIDR register defines
+	  affinity between processors which is then used to describe the cpu
+	  topology of an ARM64 System.
+
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
 	range 2 32

diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 5fe138e..68b55af 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -30,6 +30,17 @@

 #define MPIDR_HWID_BITMASK	0xff00ffffff

+#define MPIDR_SMP_BITMASK	(0x1 << 30)
+#define MPIDR_MT_BITMASK	(0x1 << 24)
+
+#define MPIDR_LEVEL_BITS	8
+#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
+
+#define MPIDR_AFFINITY_LEVEL_0(mpidr)	((mpidr) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_1(mpidr)	(((mpidr) >> 8) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_2(mpidr)	(((mpidr) >> 16) & MPIDR_LEVEL_MASK)
+#define MPIDR_AFFINITY_LEVEL_3(mpidr)	(((mpidr) >> 32) & MPIDR_LEVEL_MASK)
+
 #define read_cpuid(reg) ({ \
 	u64 __val; \
 	asm("mrs %0, " #reg : "=r" (__val)); \

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
new file mode 100644
index 000..8631808
--- /dev/null
+++ b/arch/arm64/include/asm/topology.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_ARM64_TOPOLOGY_H
+#define _ASM_ARM64_TOPOLOGY_H
+
+#ifdef CONFIG_ARM64_CPU_TOPOLOGY
+
+#include <linux/cpumask.h>
+
+struct cputopo_arm64 {
+	int thread_id;
+	int core_id;
+	int socket_id;
+	cpumask_t thread_sibling;
+	cpumask_t core_sibling;
+};
+
+DECLARE_PER_CPU(struct cputopo_arm64, cpu_topology);
+
+#define cpu_topo(cpu) per_cpu(cpu_topology, cpu)
+
+#define topology_physical_package_id(cpu)	(cpu_topo(cpu).socket_id)
+#define topology_core_id(cpu)			(cpu_topo(cpu).core_id)
+#define topology_core_cpumask(cpu)		(&cpu_topo(cpu).core_sibling)
+#define topology_thread_cpumask(cpu)		(&cpu_topo(cpu).thread_sibling)
+
+#define mc_capable()	(cpu_topo(0).socket_id != -1)
+#define smt_capable()	(cpu_topo(0).thread_id != -1)
+
+void init_cpu_topology(void);
+void store_cpu_topology(unsigned int cpuid);
+const struct cpumask *cpu_coregroup_mask(int cpu);
+
+#else
+
+static inline void init_cpu_topology(void) { }
+static inline void store_cpu_topology(unsigned int cpuid) { }
+#endif
+
+#include <asm-generic/topology.h>
+
+#endif /* _ASM_ARM64_TOPOLOGY_H */

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7b4b564..a47c359 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -18,6 +18,7 @@
 arm64-obj-$(CONFIG_SMP)			+= smp.o smp_spin_table.o smp_psci.o
 arm64-obj-$(CONFIG_HW_PERF_EVENTS)	+= perf_event.o
 arm64-obj-$(CONFIG_HAVE_HW_BREAKPOINT)	+= hw_breakpoint.o
 arm64-obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
+arm64-obj-$(CONFIG_ARM64_CPU_TOPOLOGY)	+= topology.o

 obj-y += $(arm64-obj-y) vdso/
 obj-m += $(arm64-obj-m)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index fee5cce..197b1da 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -39,6 +39,7 @@
 #include <asm/atomic.h>
 #include asm
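The shift and mask operators in the quoted MPIDR accessors were garbled in transit; assuming the reconstruction above (8-bit affinity fields at bits 0, 8 and 16, with ARM64's Aff3 at bits 39:32 of the 64-bit register value), the macros can be exercised in plain C. `1UL` replaces `1` in the mask so the sketch is safe on any host; the sample MPIDR value is of course made up.

```c
#include <stdint.h>

#define MPIDR_LEVEL_BITS 8
#define MPIDR_LEVEL_MASK ((1UL << MPIDR_LEVEL_BITS) - 1)

/* Extract one 8-bit affinity field per level from a 64-bit MPIDR value. */
#define MPIDR_AFFINITY_LEVEL_0(mpidr) ((mpidr) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_1(mpidr) (((mpidr) >> 8) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_2(mpidr) (((mpidr) >> 16) & MPIDR_LEVEL_MASK)
#define MPIDR_AFFINITY_LEVEL_3(mpidr) (((mpidr) >> 32) & MPIDR_LEVEL_MASK)
```

For an MPIDR of 0x0000000100020304, the patch under discussion would read thread_id = 4 (Aff0), core_id = 3 (Aff1), cluster_id = 2 (Aff2) and socket_id = 1 (Aff3).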
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 29 July 2013 12:15, Sudeep KarkadaNagesha sudeep.karkadanage...@arm.com wrote:

On 29/07/13 10:46, Vincent Guittot wrote: On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote: Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making. It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

Just for my knowledge, I thought power aware scheduling using SCHED_MC/SMT was removed. I see commit 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 "sched: Remove stale power aware scheduling remnants and dysfunctional knobs". I may be missing something here.

It's a common mistake to mix up SCHED_MC with the powersaving balance policy that sat on top of it. Only the powersaving policy has been removed; SCHED_MC and SCHED_SMT are still in the scheduler and give a performance improvement on 32-bit ARM.

Regards, Sudeep
Re: [RFC][PATCH 1/2] ARM64: add cpu topology definition
On 29 July 2013 11:54, Will Deacon will.dea...@arm.com wrote:

On Mon, Jul 29, 2013 at 10:46:06AM +0100, Vincent Guittot wrote: On 27 July 2013 12:42, Hanjun Guo hanjun@linaro.org wrote: Power aware scheduling needs the cpu topology information to improve the cpu scheduler decision making. It's not only power aware scheduling. The scheduler already uses the topology and cache sharing when CONFIG_SCHED_MC and/or CONFIG_SCHED_SMT are enabled. So you should also add these configs for arm64 so the scheduler can use them.

... except that the architecture doesn't define what the AFF fields in MPIDR really represent. Using them to make key scheduling decisions relating to

Do you mean that it's not defined in the arm64 ARM? AFAIK, there is a good explanation in the arm32 ARM, and it's currently used with SCHED_MC and SCHED_SMT.

cache proximity seems pretty risky to me, especially given the track record we've seen already on AArch32 silicon. It's a convenient register if it contains the data we want it to contain, but we need to force ourselves to come to terms with reality here and simply use it as an identifier for a CPU. Can't we just use the device-tree to represent this topological data for arm64? Lorenzo has been working on bindings in this area.

I agree that we should probably use DT if we can't rely on MPIDR for arm64.

Vincent

Will
Re: [RFC][PATCH 1/9] sched: Introduce power scheduler
On 10 July 2013 13:11, Morten Rasmussen morten.rasmus...@arm.com wrote:

On Wed, Jul 10, 2013 at 03:10:15AM +0100, Arjan van de Ven wrote: On 7/9/2013 8:55 AM, Morten Rasmussen wrote:

+	mod_delayed_work_on(schedule_cpu(), system_wq, dwork,
+			    msecs_to_jiffies(INTERVAL));

so thinking about this more, this really really should not be a work queue. a work queue will cause a large number of context switches for no reason (on Intel and AMD you can switch P state from interrupt context, and I'm pretty sure that holds for many ARM as well)

Agree. I should have made it clear that this is only a temporary solution. I would prefer to tie the power scheduler to the existing scheduler tick instead so we don't wake up cpus unnecessarily. nohz may be able to handle that for us. Also, currently the power scheduler updates all cpus. Going forward this would change to per cpu updates and partial updates of the global view to improve scalability.

For the packing tasks patches, we are using the periodic load balance sequence to update the activity, like it is done for cpu_power. I have planned to update the packing patches to see how they can cooperate with Morten's patches, as they have similar needs.

and in addition, it causes some really nasty cases, especially around real time tasks. Your workqueue will schedule a kernel thread, which will run BEHIND real time tasks, and such a real time task will then never be able to start running at a higher performance. (and with the delta between lowest and highest performance sometimes being 10x or more, the real time task will be running SLOW... quite possibly longer than several milliseconds) and all for no good reason; a normal timer running in irq context would be much better for this kind of thing!
Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
On 13 July 2013 12:23, Catalin Marinas catalin.mari...@arm.com wrote: Hi Peter, (Morten's away for a week, I'll try cover some bits in the meantime) On Sat, Jul 13, 2013 at 07:49:09AM +0100, Peter Zijlstra wrote: On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote: This patch set is an initial prototype aiming at the overall power-aware scheduler design proposal that I previously described http://permalink.gmane.org/gmane.linux.kernel/1508480. The patch set introduces a cpu capacity managing 'power scheduler' which lives by the side of the existing (process) scheduler. Its role is to monitor the system load and decide which cpus that should be available to the process scheduler. Hmm... This looks like a userspace hotplug deamon approach lifted to kernel space :/ The difference is that this is faster. We even had hotplug in mind some years ago for big.LITTLE but it wouldn't give the performance we need (hotplug is incredibly slow even if driven from the kernel). How about instead of layering over the load-balancer to constrain its behaviour you change the behaviour to not need constraint? Fix it so it does the right thing, instead of limiting it. I don't think its _that_ hard to make the balancer do packing over spreading. The power balance code removed in 8e7fbcbc had things like that (although it was broken). And I'm sure I've seen patches over the years that did similar things. Didn't Vincent and Alex also do things like that? We should take the good bits from all that and make something of it. And I think its easier now that we have the per task and per rq utilization numbers [1]. That's what we've been pushing for. From a big.LITTLE perspective, I would probably vote for Vincent's patches but I guess we could probably adapt any of the other options. But then we got Ingo NAK'ing all these approaches. 
Taking the best bits from the current load balancing patches would create yet another set of patches which don't fall under Ingo's requirements (at least as I understand them).

In fact we are currently updating our patchset based on Ingo's feedback. The move of the cpuidle and cpufreq statistics was planned to appear later in our development, but we are now integrating it based on Ingo's request. We start with the cpuidle statistics and are moving them into the scheduler. In addition, we want to integrate the current C-state of a core in the wake up decision.

Just start by changing the balancer to pack instead of spread. Once that works, see where the two modes diverge and put a knob in.

That's the approach we've had so far (not sure about the knob). But it doesn't solve Ingo's complaint about fragmentation between scheduler, cpufreq and cpuidle policies.

Then worry about power thingies. To quote Ingo: To create a new low level idle driver mechanism the scheduler could use and integrate proper power saving / idle policy into the scheduler.

That's unless we all agree (including Ingo) that the above requirement is orthogonal to task packing and, as a *separate* project, we look at better integrating cpufreq/cpuidle with the scheduler, possibly with a new driver model and governors as libraries used by such drivers. In which case the current packing patches shouldn't be NAK'ed but reviewed so that they can be improved further or rewritten.

The integration of cpuidle and cpufreq should start by unifying all the statistics stuff. For cpuidle we need to pull in the per-cpu idle time guestimator. For cpufreq the per-cpu usage stuff -- which we already have in the scheduler these days! Once we have all the statistics in place, it's also easier to see what we can do with them and what might be missing. At this point, mandate that policy drivers may not do any statistics gathering of their own. If they feel the need to do so, we're missing something and that's not right.
I agree in general, but there is the intel_pstate.c driver which has its own separate statistics that the scheduler does not track. We could move to invariant task load tracking which uses aperf/mperf (and could do similar things with perf counters on ARM). As I understand from Arjan, the new pstate driver will be different, so we don't know exactly what it requires.

I'm not entirely sold on differentiating between short running and other tasks either. Although I suppose I see where that comes from. A task that would run 50% on a big core would unlikely be qualified as small; however, if it would require 85% of a small core and there's room on the small cores, it's a good move to run it there. So where's the limit for being small? It seems like an artificial limit and such should be avoided where possible.

I agree. With Morten's approach, it doesn't care about how small a task is but rather, when a CPU (or cluster) is loaded to a certain threshold, just spreads tasks to the next. I think a small task threshold on its own doesn't make much sense
Re: [Resend patch v8 06/13] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
On 20 June 2013 04:18, Alex Shi alex@intel.com wrote:

They are the base values used in load balance; update them with the rq runnable load average, and the load balance will consider the runnable load avg naturally. We also tried to include the blocked_load_avg as cpu load in balancing, but that caused a 6% kbuild performance drop on every Intel machine, and aim7/oltp drops on some of the 4-CPU-socket machines. Only adding blocked_load_avg into get_rq_runable_load still drops hackbench a little on NHM EX.

Signed-off-by: Alex Shi alex@intel.com
Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 kernel/sched/fair.c |  5 +++--
 kernel/sched/proc.c | 17 +++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e5a5e6..7d5c477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,7 +2968,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }

Alex,

In the wake-affine function, we use current->se.load.weight and p->se.load.weight to update the load of this_cpu and prev_cpu, whereas these loads are now equal to runnable_load_avg, which is the sum of se->avg.load_avg_contrib. Shouldn't we use se->avg.load_avg_contrib instead of se.load.weight?
Vincent

 /*
@@ -3013,9 +3013,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long load_avg = rq->cfs.runnable_load_avg;

 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return load_avg / nr_running;

 	return 0;
 }

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 	sched_avg_update(this_rq);
 }

+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->load.weight;
+}
+#endif
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = get_rq_runnable_load(this_rq);
 	unsigned long pending_updates;

 	/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
  */
 void update_cpu_load_active(struct rq *this_rq)
 {
+	unsigned long load = get_rq_runnable_load(this_rq);
 	/*
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+	__update_cpu_load(this_rq, load, 1);

 	calc_load_account_active(this_rq);
 }
--
1.7.12
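The cpu_avg_load_per_task() hunk above is small enough to check by hand; here is a minimal userspace sketch of the changed logic, with a simplified stand-in struct instead of the kernel's rq/cfs_rq. The point of the change is only which field feeds the division: runnable_load_avg rather than load.weight, with the nr_running == 0 guard kept intact.

```c
/* Simplified stand-in for the kernel's runqueue. */
struct fake_rq {
	unsigned long nr_running;        /* tasks currently on the rq      */
	unsigned long runnable_load_avg; /* sum of per-entity contribs     */
};

/* Average load per task, now derived from the tracked runnable load. */
static unsigned long cpu_avg_load_per_task(const struct fake_rq *rq)
{
	if (rq->nr_running)
		return rq->runnable_load_avg / rq->nr_running;

	return 0;
}
```

So three fully runnable tasks of weight 1024 give an average of 1024 per task, while an empty runqueue safely reports 0 instead of dividing by zero.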
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On 2 August 2013 03:50, Joonsoo Kim iamjoonsoo@lge.com wrote:

Now, checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking in update_sg_lb_stats(), except saving one iteration over sched_group_cpus.

In this patch, I factor out this checking into a should_we_balance() function. And before doing the actual work for load balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quits the work immediately.

With this change, we can save two memset costs and can expect better compiler optimization. Below is the result of this patch.

* Vanilla *
   text	   data	    bss	    dec	    hex	filename
  34499	   1136	    116	  35751	   8ba7	kernel/sched/fair.o

* Patched *
   text	   data	    bss	    dec	    hex	filename
  34243	   1136	    116	  35495	   8aa7	kernel/sched/fair.o

In addition, rename @balance to @should_balance in order to represent its purpose more clearly.

Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaae77e..7f51b8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4426,22 +4426,17 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
  * @group: sched_group whose statistics are to be updated.
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
-			int local_group, int *balance, struct sg_lb_stats *sgs)
+			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running, max_nr_running, min_nr_running;
 	unsigned long load, max_cpu_load, min_cpu_load;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
 	int i;

-	if (local_group)
-		balance_cpu = group_balance_cpu(group);
-
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
@@ -4454,15 +4449,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;

 		/* Bias balancing toward cpus of our domain */
-		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu &&
-					cpumask_test_cpu(i, sched_group_mask(group))) {
-				first_idle_cpu = 1;
-				balance_cpu = i;
-			}
-
+		if (local_group)
 			load = target_load(i, load_idx);
-		} else {
+		else {
 			load = source_load(i, load_idx);
 			if (load > max_cpu_load)
 				max_cpu_load = load;
@@ -4482,22 +4471,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}

-	/*
-	 * First idle cpu or the first cpu(busiest) in this sched group
-	 * is eligible for doing load balancing at this and above
-	 * domains. In the newly idle case, we will allow all the cpu's
-	 * to do the newly idle load balance.
-	 */
-	if (local_group) {
-		if (env->idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != env->dst_cpu) {
-				*balance = 0;
-				return;
-			}
-			update_group_power(env->sd, env->dst_cpu);
-		} else if (time_after_eq(jiffies, group->sgp->next_update))
-			update_group_power(env->sd, env->dst_cpu);
-	}
+	if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+			time_after_eq(jiffies, group->sgp->next_update)))
+		update_group_power(env->sd, env->dst_cpu);

 	/* Adjust by relative CPU power of the group */
 	sgs->avg_load = (sgs->group_load * SCHED_POWER_SCALE) / group->sgp->power;
@@ -4576,7 +4552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
  * @sds: variable to hold the statistics for this sched_domain.
  */
 static inline void update_sd_lb_stats(struct lb_env *env,
-			int *balance, struct sd_lb_stats *sds)
+			struct sd_lb_stats *sds)
 {
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
@@ -4593,10 +4569,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
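The check being factored out above can be sketched as a standalone function: for non-newly-idle balancing, only one CPU of the local group — the first idle one, or the group's first CPU if none is idle — is allowed to proceed, while a newly idle CPU may always balance. This is a hedged userspace sketch of that policy, not the kernel function: plain arrays stand in for cpumasks, and the `group_cpus`/`cpu_is_idle` parameters are illustrative.

```c
#include <stdbool.h>

#define CPU_NEWLY_IDLE 1 /* illustrative stand-in for the kernel's enum */

static bool should_we_balance(int dst_cpu, int idle_flag,
			      const int *group_cpus, const bool *cpu_is_idle,
			      int ncpus)
{
	int balance_cpu = -1;

	/* In the newly idle case, any CPU is allowed to balance. */
	if (idle_flag == CPU_NEWLY_IDLE)
		return true;

	/* Otherwise, pick the first idle CPU of the local group... */
	for (int i = 0; i < ncpus; i++) {
		if (cpu_is_idle[group_cpus[i]]) {
			balance_cpu = group_cpus[i];
			break;
		}
	}

	/* ...falling back to the group's first CPU if all are busy. */
	if (balance_cpu == -1)
		balance_cpu = group_cpus[0];

	/* Only the elected CPU proceeds with this load balance pass. */
	return balance_cpu == dst_cpu;
}
```

Calling this once at the top of load_balance() is what lets the patch drop the @balance in/out parameter threaded through update_sg_lb_stats() and update_sd_lb_stats().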
Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
On 5 November 2013 15:06, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:15PM +0200, Vincent Guittot wrote: The function arch_sd_local_flags is used to set flags in sched_domains according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is also created to reflect whether groups of CPUs in a sched_domain level can or not reach different power state. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. The cpu parameter of arch_sd_local_flags can be used by architecture to fine tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be set differently for groups of CPUs according to DT information Signed-off-by: Vincent Guittot vincent.guit...@linaro.org Not your fault, but you're making a bigger mess of the arch topology interface. How about we start with something like the below -- compile tested only. And then try and lift default_topology so that an arch can override it? Your proposal looks fine for me. It's clearly better to move in one place the configuration of sched_domain fields. Have you already got an idea about how to let architecture override the topology? 
My primary need comes from the fact that the topology configuration is not the same for all cores Vincent --- arch/ia64/include/asm/topology.h | 24 - arch/metag/include/asm/topology.h | 27 -- arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 --- include/linux/topology.h | 115 --- kernel/sched/core.c | 188 +++--- 6 files changed, 114 insertions(+), 275 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e449b75..20d12fa7e0cd 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,30 +46,6 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 1,\ - .max_interval = 4,\ - .busy_factor= 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_NEWIDLE\ - | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_AFFINE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ -} - #endif /* CONFIG_NUMA */ #ifdef CONFIG_SMP diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h index 8e9c0b3b9691..e95f874ded1b 100644 --- a/arch/metag/include/asm/topology.h +++ b/arch/metag/include/asm/topology.h @@ -3,33 +3,6 @@ #ifdef CONFIG_NUMA -/* sched_domains SD_NODE_INIT for Meta machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8,\ - .max_interval = 32, \ - .busy_factor= 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 3,\ - .idle_idx = 2,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC
Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
On 5 November 2013 23:27, Peter Zijlstra pet...@infradead.org wrote:

On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:

Your proposal looks fine to me. It's clearly better to gather the configuration of the sched_domain fields in one place. Have you already got an idea about how to let the architecture override the topology?

Maybe something like the below -- completely untested (my s390 compiler is on a machine that's currently powered off).

My primary need comes from the fact that the topology configuration is not the same for all cores

Do expand.. the various cpu masks used in the topology list are per cpu, is that sufficient room to wriggle or do you need more?

My current implementation sets a flag in each level (SMT, MC and CPU) to describe the power gating capabilities for the groups of cpus, but the capabilities can be different at the same level; I mean that we can have a group of cpus that can power gate at MC level in the system whereas another group of CPUs can only power gate at CPU level. With the current implementation I can't tell the difference, so I have added the cpu parameter when setting the flags.

The other solution is to add new topology levels whose cpu masks describe the power dependency with others (currently power gating, but we could add more levels, for frequency dependency as an example). In this case the current implementation is enough, and the main difficulty will be where to insert these new levels relative to the current ones.
A typical example with one cluster that can power gate at core level whereas the other cluster can power gate at cluster level, will give the following domain topology: If we set a flag in the current topology levels we should have something like below CPU0: domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 0 1 domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 0-1 2-3 4-5 6-7 domain 2: span 0-15 level: CPU flags: groups: 0-7 8-15 CPU8 domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 8 9 domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 8-9 10-11 12-13 14-15 domain 2: span 0-15 level CPU flags: groups: 8-15 0-7 If we create new levels, we could have something like below CPU0 domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES groups: 0 1 domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 0-1 2-3 4-5 6-7 domain 2: span 0-15 level PWR flags SD_NOT_SHARE_POWERDOMAIN groups: 0-1 2-3 4-5 6-7 8-15 domain 3: span 0-15 level: CPU flags: groups: 0-7 8-15 CPU8 domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES groups: 8 9 domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 8-9 10-11 12-13 14-15 domain 2: span 0-15 level PWR flags SD_NOT_SHARE_POWERDOMAIN groups: 0-1 2-3 4-5 6-7 8-15 domain 3: span 0-15 level CPU flags: groups: 8-15 0-7 Vincent --- --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -1070,3 +1070,23 @@ static int __init s390_smp_init(void) return 0; } subsys_initcall(s390_smp_init); + +static struct sched_domain_topology_level s390_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES }, +#endif +#ifdef CONFIG_SCHED_MC + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES }, +#endif +#ifdef CONFIG_SCHED_BOOK + { cpu_book_mask, }, +#endif + { 
cpu_cpu_mask, }, + { NULL, }, +}; + +static int __init s390_sched_topology(void) +{ + sched_domain_topology = s390_topology; +} +early_initcall(s390_sched_topology); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -889,6 +889,20 @@ void free_sched_domains(cpumask_var_t do bool cpus_share_cache(int this_cpu, int that_cpu); +typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); + +#define SDTL_OVERLAP 0x01 + +struct sched_domain_topology_level { + sched_domain_mask_f mask; + int sd_flags; + int flags; + int numa_level; + struct sd_data data; +}; + +extern struct sched_domain_topology_level *sched_domain_topology; + #else /* CONFIG_SMP */ struct sched_domain_attr; --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5377,20 +5377,6 @@ enum s_alloc { sa_none, }; -struct sched_domain_topology_level; - -typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); - -#define SDTL_OVERLAP 0x01 - -struct sched_domain_topology_level { - sched_domain_mask_f mask; - int sd_flags; - int flags; - int numa_level; - struct sd_data data; -}; - /* * Build an iteration mask that can exclude
Bench for testing scheduler
Hi,

During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake-up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping periods and get the average wake-up latency. The range of sleeping periods must cover all residency times in the idle state table of the platform. I have run such tests on a TC2 platform with the packing tasks patchset. I have used the following command:

#cyclictest -t <number of cores> -q -e 1000 -i 500-12000 -d 150 -l 2000

The figures below give the average wake-up latency and power consumption for the default scheduler behavior, packing tasks at cluster level, and packing tasks at core level. We can see variation in both wake-up latency and power consumption. The detailed result is not a simple single value, which makes comparison not so easy, but the average of all measurements should give us a usable “score”.
I know that Ingo would like to add the benches in Tools/* but I wonder if it make sense to copy cyclictest in this directory when we have an official git tree here: git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git I have put both final score and detailed results below so everybody can check the score vs detailed figures: | Default average results | Cluster Packing average results | Core Packing average results | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | (us) (J)(J) | (us) (J)(J) | (us) (J) (J) | 8797948902364175 | 416 879688 12750 | 189897452 30052 Cyclictest | Default | Packing at Cluster level | Packing at Core level Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy (us) | (us) (J)(J) | (us) (J)(J) | (us) (J) (J) 500 24 111474772479576 21 11136768 11693 22 11126062 30138 700 22 111360843058419 21 01125280 11761 21 11109950 23503 900 22 111360173036768 21 11112542 12017 20 01101089 23733 1100 24 111329642506132 21 01109039 12248 21 11091832 23621 1300 24 111238962488459 21 01099308 12015 21 11086301 23264 1500 24 111208422488272 21 01099811 12685 20 01083658 22499 1700 41 3811171663042091 21 01090920 12393 21 11080387 23015 190011918211205522737555 21 01087900 11900 21 11078711 23177 210016719511224253210655 22 21090420 11900 20 11077985 22639 230015215611198542497773 43 221087278 11921 21 11075943 26282 250018216311208182365870 63 291089169 11551 21 01073717 24290 270043920210589523058516107 411077955 12122 21 01070951 23126 290057026810282383099162148 301067562 13287 24 11064200 24260 3100751137 9465123158095178 301059395 12236 29 11058887 23225 3300696203 9648223042524206 281041194 13934 36 11056656 23941 3500728191 9593983006066235 361028150 13387 44 31045841 23873 3700844138 9217803033189245 311019065 14582 62 61034466 22501 3900815172 9256002862994273 331001974 12091 80 91014650 2 4100870179 
8976162940444279 35 996226 12014 88 111030588 25461 4300979119 8469122996911306 36 980075 12641100
Re: Bench for testing scheduler
On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (for whatever reason, the text is wrapped and results hard to read) Yes, i have just seen that. It looks like gmail has wrapped the lines. I have added the results which should not be wrapped, at the end of this email On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents from having repeatable and comparable tests and results. 
I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system We are working on describing some basic scenarios (plain English for now) and one of them could be video playing with threads for audio and video decoding with random change in the workload. So I think the first step should be a set of tools/scripts to analyse the scheduler behaviour, both in terms of latency and power, and these can use perf sched. We can then run some real life scenarios (e.g. Android video playback) and build a benchmark that matches such behaviour as close as possible. We can probably use (or improve) perf sched replay to also simulate such workload (we may need additional features like thread dependencies). The figures below give the average wakeup latency and power consumption for default scheduler behavior, packing tasks at cluster level and packing tasks at core level. We can see both wakeup latency and power consumption variation. The detailed result is not a simple single value which makes comparison not so easy but the average of all measurements should give us a usable “score”. How did you assess the power/energy? I have use the embedded joule meter of the tc2. Thanks. 
-- Catalin | Default average results | Cluster Packing average results | Core Packing average results | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | (us) (J)(J) | (us) (J)(J) | (us) (J)(J) | 8797948902364175 | 416 879688 12750 | 189897452 30052 Cyclictest | Default | Packing at Cluster level | Packing at Core level Interval | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy | Latency stddev A7 energy A15 energy (us) | (us) (J)(J) | (us) (J)(J) | (us) (J)(J) 500 24 111474772479576 21 1 1136768 11693 22 11126062 30138 700 22 111360843058419 21 0 1125280 11761 21 11109950 23503 900 22 111360173036768 21 1 1112542 12017 20 01101089 23733 1100 24 111329642506132 21 0 1109039 12248 21 11091832 23621 1300 24 111238962488459 21 0 1099308 12015 21 11086301 23264 1500
Re: Bench for testing scheduler
On 8 November 2013 01:04, Rowand, Frank frank.row...@sonymobile.com wrote: Hi Vincent, Thanks for creating some benchmark numbers! you're welcome On Thursday, November 07, 2013 5:33 AM, Vincent Guittot [vincent.guit...@linaro.org] wrote: On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (for whatever reason, the text is wrapped and results hard to read) Yes, i have just seen that. It looks like gmail has wrapped the lines. I have added the results which should not be wrapped, at the end of this email On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 The number of loops (-l 2000) should be much larger to create useful results. I don't have a specific number that is large enough, I just know from experience that 2000 is way too small. For example, running cyclictest several times with the same values on my laptop gives values that are not consistent: The Avg figures look almost stable IMO. Are you speaking about the Max value for the inconsistency ? 
$ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9703) P: 0 I:500 C: 2000 Min: 2 Act: 90 Avg: 77 Max: 243 T: 1 ( 9704) P: 0 I:650 C: 1557 Min: 2 Act: 58 Avg: 68 Max: 226 T: 2 ( 9705) P: 0 I:800 C: 1264 Min: 2 Act: 54 Avg: 81 Max: 1017 T: 3 ( 9706) P: 0 I:950 C: 1065 Min: 2 Act: 11 Avg: 80 Max: 260 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9709) P: 0 I:500 C: 2000 Min: 2 Act: 45 Avg: 74 Max: 390 T: 1 ( 9710) P: 0 I:650 C: 1554 Min: 2 Act: 82 Avg: 61 Max: 810 T: 2 ( 9711) P: 0 I:800 C: 1263 Min: 2 Act: 83 Avg: 74 Max: 287 T: 3 ( 9712) P: 0 I:950 C: 1064 Min: 2 Act: 103 Avg: 79 Max: 551 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9716) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 72 Max: 252 T: 1 ( 9717) P: 0 I:650 C: 1556 Min: 2 Act: 115 Avg: 77 Max: 354 T: 2 ( 9718) P: 0 I:800 C: 1264 Min: 2 Act: 59 Avg: 78 Max: 1143 T: 3 ( 9719) P: 0 I:950 C: 1065 Min: 2 Act: 104 Avg: 70 Max: 238 $ sudo ./cyclictest -t -q -e 1000 -i 500 -d 150 -l 2000 # /dev/cpu_dma_latency set to 1000us T: 0 ( 9722) P: 0 I:500 C: 2000 Min: 2 Act: 82 Avg: 68 Max: 213 T: 1 ( 9723) P: 0 I:650 C: 1555 Min: 2 Act: 65 Avg: 65 Max: 1279 T: 2 ( 9724) P: 0 I:800 C: 1264 Min: 2 Act: 91 Avg: 69 Max: 244 T: 3 ( 9725) P: 0 I:950 C: 1065 Min: 2 Act: 58 Avg: 76 Max: 242 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. 
In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. I have one concern with randomness because it prevents from having repeatable and comparable tests and results. I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system We are working on describing some basic scenarios (plain English for now) and one of them could be video playing with threads for audio and video decoding with random change in the workload. So I think the first step should be a set of tools/scripts to analyse the scheduler behaviour, both in terms of latency and power, and these can use perf sched. We can
Re: Bench for testing scheduler
On 7 November 2013 15:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Nov 07, 2013 at 01:33:43PM +, Vincent Guittot wrote: On 7 November 2013 12:32, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Nov 07, 2013 at 10:54:30AM +, Vincent Guittot wrote: During the Energy-aware scheduling mini-summit, we spoke about benches that should be used to evaluate the modifications of the scheduler. I’d like to propose a bench that uses cyclictest to measure the wake up latency and the power consumption. The goal of this bench is to exercise the scheduler with various sleeping period and get the average wakeup latency. The range of the sleeping period must cover all residency times of the idle state table of the platform. I have run such tests on a tc2 platform with the packing tasks patchset. I have use the following command: #cyclictest -t number of cores -q -e 1000 -i 500-12000 -d 150 -l 2000 cyclictest could be a good starting point but we need to improve it to allow threads of different loads, possibly starting multiple processes (can be done with a script), randomly varying load threads. These parameters should be loaded from a file so that we can have multiple configurations (per SoC and per use-case). But the big risk is that we try to optimise the scheduler for something which is not realistic. The goal of this simple bench is to measure the wake up latency and the reachable value of the scheduler on a platform but not to emulate a real use case. In the same way than sched-pipe tests a specific behavior of the scheduler, this bench tests the wake up latency of a system. These figures are indeed useful to make sure we don't have any regression in terms of latency but I would not use cyclictest (as it is) to assess power improvements since the test is too artificial. Starting multi processes and adding some loads can also be useful but the target will be a bit different from wake up latency. 
I have one concern with randomness because it prevents from having repeatable and comparable tests and results. We can avoid randomness but still make it varying by some predictable function. I agree that we have to test real use cases but it doesn't prevent from testing the limit of a characteristic on a system I agree. My point is not to use this as the benchmark. ok, so i don't plan to make cyclictest the benchmark but a benchmark among others because i'm not sure that we can cover all needs with only one benchmark. As an example, cyclictest gives information of the wake up latency that can't be collected with trace I would prefer to assess the impact on latency (and power) using a tool independent from benchmarks like cyclictest (e.g. use the reports from power sched). The reason is that once we have those tools/scripts in the kernel, a third party can run it on real workloads and provide the kernel developers with real numbers on performance vs power scheduling, regressions between kernel versions etc. We can't create a power model that you can run on an x86 for example and give you an indication of the power saving on ARM, you need to run the benchmarks on the actual hardware (that's why I don't think linsched is of much use from a power perspective). -- Catalin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH v5 03/14] sched: define pack buddy CPUs
During the creation of sched_domain, we define a pack buddy CPU for each CPU when one is available. We want to pack at all levels where a group of CPUs can be power gated independently from others. On a system that can't power gate a group of CPUs independently, the flag is set at all sched_domain levels and the buddy is set to -1. This is the default behavior for all architectures.

On a dual-cluster / dual-core system which can power gate each core and cluster independently, the buddy configuration will be:

      |   Cluster 0 |   Cluster 1 |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

If the cores in a cluster can't be power gated independently, the buddy configuration becomes:

      |   Cluster 0 |   Cluster 1 |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU1 | CPU0 | CPU0 |

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  5 ++++
 3 files changed, 76 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 735e964..0bf5f4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5184,6 +5184,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11cd136..5547831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,76 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
 
+#ifdef CONFIG_SMP
+#ifdef CONFIG_SCHED_PACKING_TASKS
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points on the local sched_group
+		 * and this CPU of this local group is a good candidate
+		 */
+		id = cpu;
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+				&& (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..22e3f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,11 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
 
+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern void update_packing_domain(int cpu);
+#else
+static inline void update_packing_domain(int cpu) {};
+#endif
+
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);
-- 
1.7.9.5
[RFC][PATCH v5 00/14] sched: packing tasks
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is ensured

Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM platform

Vincent Guittot (14):
  sched: add a new arch_sd_local_flags for sched_domain init
  ARM: sched: clear SD_SHARE_POWERDOMAIN
  sched: define pack buddy CPUs
  sched: do load balance only with packing cpus
  sched: add a packing level knob
  sched: create a new field with available capacity
  sched: get CPU's activity statistic
  sched: move load idx selection in find_idlest_group
  sched: update the packing cpu list
  sched: init this_load to max in find_idlest_group
  sched: add a SCHED_PACKING_TASKS config
  sched: create a statistic structure
  sched: differantiate idle cpu
  cpuidle: set the current wake up latency

 arch/arm/include/asm/topology.h  |   4 +
 arch/arm/kernel/topology.c       |  50 +++-
 arch/ia64/include/asm/topology.h |   3 +-
 arch/tile/include/asm/topology.h |   3 +-
 drivers/cpuidle/cpuidle.c        |  11 ++
 include/linux/sched.h            |  13 +-
 include/linux/sched/sysctl.h     |   9 +
 include/linux/topology.h         |  11 +-
 init/Kconfig                     |  11 ++
 kernel/sched/core.c              |  11 +-
 kernel/sched/fair.c              | 395 ++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h             |   8 +-
 kernel/sysctl.c                  |  17 ++
 13 files changed, 521 insertions(+), 25 deletions(-)

-- 
1.7.9.5
[RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group
Init this_load to max value instead of 0 in find_idlest_group. If the local group is skipped because it doesn't have allowed CPUs, this_load stays at 0, no idlest group will be returned, and the selected CPU will be a non-allowed one (which will be replaced in select_fallback_rq by a random one). With a default value set to max, we will use the idlest group even if we skip the local_group.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9b03c1..2d9f782 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3532,7 +3532,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
-- 
1.7.9.5
[RFC][PATCH v5 09/14] sched: update the packing cpu list
Use the activity statistics to update the list of CPUs that should be used to handle the current system activity.

The cpu_power is updated for CPUs that don't participate in the packing effort. We consider that their cpu_power is allocated to idleness, as it could be allocated by rt. So the cpu_power that remains available for cfs is set to the min value (i.e. 1).

The cpu_power is used for a task that wakes up, because a waking-up task is already taken into account in the current activity, whereas we use the power_available for fork and exec because the task is not yet part of the current activity.

In order to quickly find the packing starting point, we save information that will be used to directly start with the right sched_group at the right sched_domain level instead of running the complete update_packing_domain algorithm each time we need to use the packing cpu list.

The sd_power_leader defines the leader of a group of CPUs that can't be power gated independently. As soon as this CPU is used, all the CPUs in the same group will be used, based on the fact that it isn't worth keeping some cores idle if they can't be power gated while one core in the group is running.

The sd_pack_group and sd_pack_domain are used to quickly check if a power leader should be used in the packing effort.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 149 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c258c38..f9b03c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -185,11 +185,20 @@ static unsigned long available_of(int cpu)
 }
 
 #ifdef CONFIG_SCHED_PACKING_TASKS
+struct sd_pack {
+	int my_buddy;			/* cpu on which tasks should be packed */
+	int my_leader;			/* cpu which leads the packing state of a group */
+	struct sched_domain *domain;	/* domain at which the check is done */
+	struct sched_group *group;	/* starting group for checking */
+};
+
 /*
- * Save the id of the optimal CPU that should be used to pack small tasks
- * The value -1 is used when no buddy has been found
+ * Save per_cpu information about the optimal CPUs that should be used to pack
+ * tasks.
  */
-DEFINE_PER_CPU(int, sd_pack_buddy);
+DEFINE_PER_CPU(struct sd_pack, sd_pack_buddy) = {
+	.my_buddy = -1,
+};
 
 /*
  * The packing level of the scheduler
@@ -202,6 +211,15 @@ int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
 
 unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
 
+static inline int get_buddy(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_buddy;
+}
+
+static inline int get_leader(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_leader;
+}
 
 int sched_proc_update_packing(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -219,13 +237,19 @@ int sched_proc_update_packing(struct ctl_table *table, int write,
 
 static inline bool is_packing_cpu(int cpu)
 {
-	int my_buddy = per_cpu(sd_pack_buddy, cpu);
+	int my_buddy = get_buddy(cpu);
 
 	return (my_buddy == -1) || (cpu == my_buddy);
 }
 
-static inline int get_buddy(int cpu)
+static inline bool is_leader_cpu(int cpu, struct sched_domain *sd)
 {
-	return per_cpu(sd_pack_buddy, cpu);
+	if (sd != per_cpu(sd_pack_buddy, cpu).domain)
+		return 0;
+
+	if (cpu != get_leader(cpu))
+		return 0;
+
+	return 1;
 }
 
 /*
@@ -239,7 +263,9 @@ static inline int get_buddy(int cpu)
 void update_packing_domain(int cpu)
 {
 	struct sched_domain *sd;
-	int id = -1;
+	struct sched_group *target = NULL;
+	struct sd_pack *pack = &per_cpu(sd_pack_buddy, cpu);
+	int id = cpu, pcpu = cpu;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
 	if (!sd)
@@ -247,6 +273,12 @@ void update_packing_domain(int cpu)
 	else
 		sd = sd->parent;
 
+	if (sd) {
+		pcpu = cpumask_first(sched_group_cpus(sd->groups));
+		if (pcpu != cpu)
+			goto end;
+	}
+
 	while (sd && (sd->flags & SD_LOAD_BALANCE)
 		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
 		struct sched_group *sg = sd->groups;
@@ -258,15 +290,16 @@ void update_packing_domain(int cpu)
 		 * and this CPU of this local group is a good candidate
 		 */
 		id = cpu;
+		target = pack;
 
 		/* loop the sched groups to find the best one */
 		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
-			if (tmp->sgp->power *
pack-group_weight - pack-sgp-power * tmp-group_weight) + if (tmp-sgp-power_available * pack-group_weight + pack-sgp-power_available * tmp-group_weight) continue
[RFC][PATCH v5 02/14] ARM: sched: clear SD_SHARE_POWERDOMAIN
The ARM platforms take advantage of packing tasks on few cores if the latters can be powergated independantly. We use DT and the cpu topology descirption to define at which level a core can be independantly powergated to the others and the SD_SHARE_POWERDOMAIN will be set accordingly at MC and CPU sched_domain level. The power-gate properties should be added with the value 1 in cpu and cluster node when then can power gate independantly from the other. As an example of a quad cores system which can power gate each core independantly, we should have a DT similar to the example below cpus { #address-cells = 1; #size-cells = 0; cpu-map { cluster0 { power-gate = 1; core0 { cpu = cpu0; power-gate = 1; }; core1 { cpu = cpu1; power-gate = 1; }; core2 { cpu = cpu2; power-gate = 1; }; core3 { cpu = cpu3; power-gate = 1; }; }; }; ... }; Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/include/asm/topology.h |4 arch/arm/kernel/topology.c | 50 ++- 2 files changed, 53 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index 58b8b84..5102847 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h @@ -5,12 +5,16 @@ #include linux/cpumask.h +#define CPU_CORE_GATE 0x1 +#define CPU_CLUSTER_GATE 0x2 + struct cputopo_arm { int thread_id; int core_id; int socket_id; cpumask_t thread_sibling; cpumask_t core_sibling; + int flags; }; extern struct cputopo_arm cpu_topology[NR_CPUS]; diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 85a8737..f38f1f9 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -24,6 +24,7 @@ #include asm/cputype.h #include asm/topology.h +#include asm/smp_plat.h /* * cpu power scale management @@ -79,6 +80,51 @@ unsigned long *__cpu_capacity; unsigned long middle_capacity = 1; +static int __init get_dt_power_topology(struct device_node *topo) +{ + const u32 *reg; + int len, power = 0; + int flag = CPU_CORE_GATE; + 
+ for (; topo; topo = of_get_next_parent(topo)) { + reg = of_get_property(topo, power-gate, len); + if (reg len == 4 be32_to_cpup(reg)) + power |= flag; + flag = 1; + } + + return power; +} + +#define for_each_subnode_with_property(dn, pn, prop_name) \ + for (dn = of_find_node_with_property(pn, prop_name); dn; \ +dn = of_find_node_with_property(dn, prop_name)) + +static void __init init_dt_power_topology(void) +{ + struct device_node *cn, *topo; + + /* Get power domain topology information */ + cn = of_find_node_by_path(/cpus/cpu-map); + if (!cn) { + pr_warn(Missing cpu-map node, bailing out\n); + return; + } + + for_each_subnode_with_property(topo, cn, cpu) { + struct device_node *cpu; + + cpu = of_parse_phandle(topo, cpu, 0); + if (cpu) { + u32 hwid; + + of_property_read_u32(cpu, reg, hwid); + cpu_topology[get_logical_index(hwid)].flags = get_dt_power_topology(topo); + + } + } +} + /* * Iterate all CPUs' descriptor in DT and compute the efficiency * (as per table_efficiency). Also calculate a middle efficiency @@ -151,6 +197,8 @@ static void __init parse_dt_topology(void) middle_capacity = ((max_capacity / 3) (SCHED_POWER_SHIFT-1)) + 1; + /* Retrieve power topology information from DT */ + init_dt_power_topology(); } /* @@ -283,7 +331,7 @@ void __init init_cpu_topology(void) cpu_topo-socket_id = -1; cpumask_clear(cpu_topo-core_sibling); cpumask_clear(cpu_topo-thread_sibling); - + cpu_topo-flags = 0; set_power_scale(cpu, SCHED_POWER_SCALE); } smp_wmb(); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More
[RFC][PATCH v5 05/14] sched: add a packing level knob
The knob is used to set an average load threshold that will be used to trigger the inclusion/removal of CPUs in the packing effort list.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 include/linux/sched/sysctl.h |  9 +
 kernel/sched/fair.c          | 26 ++
 kernel/sysctl.c              | 17 +
 3 files changed, 52 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..f41afa5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,14 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern int __read_mostly sysctl_sched_packing_level;
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+#endif
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
@@ -61,6 +69,7 @@ extern unsigned int sysctl_sched_shares_window;
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
+
 #endif
 #ifdef CONFIG_SCHED_DEBUG
 static inline unsigned int get_sysctl_timer_migration(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7149f38..5568980 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,32 @@ void sched_init_granularity(void)
  */
 DEFINE_PER_CPU(int, sd_pack_buddy);

+/*
+ * The packing level of the scheduler
+ *
+ * This level defines the activity % above which we should add another CPU to
+ * participate in the packing effort of the tasks
+ */
+#define DEFAULT_PACKING_LEVEL 80
+int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
+
+unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
+
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	if (sysctl_sched_packing_level)
+		sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
+
+	return 0;
+}
+
 static inline bool is_packing_cpu(int cpu)
 {
 	int my_buddy = per_cpu(sd_pack_buddy, cpu);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..77383fc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -255,11 +255,17 @@ static struct ctl_table sysctl_base_table[] = {
 	{ }
 };

+#ifdef CONFIG_SCHED_PACKING_TASKS
+static int min_sched_packing_level;
+static int max_sched_packing_level = 100;
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+
 #ifdef CONFIG_SCHED_DEBUG
 static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+
 #ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
@@ -279,6 +285,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHED_PACKING_TASKS
+	{
+		.procname	= "sched_packing_level",
+		.data		= &sysctl_sched_packing_level,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_proc_update_packing,
+		.extra1		= &min_sched_packing_level,
+		.extra2		= &max_sched_packing_level,
+	},
+#endif
 #ifdef CONFIG_SCHED_DEBUG
 	{
 		.procname	= "sched_min_granularity_ns",
-- 
1.7.9.5
[RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group
load_idx is used in find_idlest_group but initialized in select_task_rq_fair even when not used. The load_idx initialisation is moved into find_idlest_group and the sd_flag replaces it in the function's args.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e26f65..c258c38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3430,12 +3430,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;

+	if (sd_flag & SD_BALANCE_WAKE)
+		load_idx = sd->wake_idx;
+
 	do {
 		unsigned long load, avg_load;
 		int local_group, packing_cpus = 0;
@@ -3632,7 +3636,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}

 	while (sd) {
-		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;

@@ -3641,10 +3644,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}

-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
 			continue;
-- 
1.7.9.5
[RFC][PATCH v5 07/14] sched: get CPU's activity statistic
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db9b871..7e26f65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -179,6 +179,11 @@ void sched_init_granularity(void)
 }

 #ifdef CONFIG_SMP
+static unsigned long available_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_available;
+}
+
 #ifdef CONFIG_SCHED_PACKING_TASKS
 /*
  * Save the id of the optimal CPU that should be used to pack small tasks
@@ -3549,6 +3554,22 @@ done:
 	return target;
 }

+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);
+
+	if (sum == period) {
+		u32 overload = rq->nr_running > 1 ? 1 : 0;
+		return available_of(cpu) + overload;
+	}
+
+	return (sum * available_of(cpu)) / period;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -4430,6 +4451,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -4446,6 +4468,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;	/* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */

@@ -4465,6 +4488,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -4771,6 +4795,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}

 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
@@ -4894,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,

 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;

 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
-- 
1.7.9.5
[RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
The function arch_sd_local_flags is used to set flags in sched_domains according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is also created to reflect whether groups of CPUs in a sched_domain level can or not reach different power state. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information is used to decide if it's worth packing some tasks in a group of CPUs in order to power gate the other groups instead of spreading the tasks. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. The cpu parameter of arch_sd_local_flags can be used by architecture to fine tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be set differently for groups of CPUs according to DT information Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h |3 ++- arch/tile/include/asm/topology.h |3 ++- include/linux/sched.h|1 + include/linux/topology.h | 11 --- kernel/sched/core.c | 10 -- 5 files changed, 21 insertions(+), 7 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..4844896 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,7 +46,7 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ +#define SD_CPU_INIT(cpu) (struct sched_domain) { \ .parent = NULL, \ .child = NULL, \ .groups = NULL, \ @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_local_flags(0, cpu)\ .last_balance = jiffies, \ .balance_interval = 1,\ .nr_balance_failed = 0,\ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d15c0d8..6f7a97d 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -51,7 +51,7 @@ static inline const 
struct cpumask *cpumask_of_node(int node) */ /* sched_domains SD_CPU_INIT for TILE architecture */ -#define SD_CPU_INIT (struct sched_domain) {\ +#define SD_CPU_INIT(cpu) (struct sched_domain) { \ .min_interval = 4,\ .max_interval = 128, \ .busy_factor= 64, \ @@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_WAKE_AFFINE \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_local_flags(0, cpu) \ | 0*SD_SERIALIZE\ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 6682da3..2004cdb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -763,6 +763,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index d3cf0d6..f3cd3c2 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -85,7 +85,7 @@ int arch_update_cpu_topology(void); #define ARCH_HAS_SCHED_WAKE_IDLE /* Common values for SMT siblings */ #ifndef SD_SIBLING_INIT -#define SD_SIBLING_INIT (struct sched_domain) { \ +#define SD_SIBLING_INIT(cpu) (struct sched_domain) { \ .min_interval = 1,\ .max_interval = 2,\ .busy_factor= 64, \ @@ -99,6 +99,8 @@ int
[RFC][PATCH v5 14/14] cpuidle: set the current wake up latency
Save the current wake up latency of a core. This latency is not always the latency of a defined c-state but can also be an intermediate value when a core is ready to shut down (timer migration, cache flush ...) but waits for the last core of the cluster to finalize the cluster power down. This latter use case is not managed by the current version of the patch because it implies that the cpuidle drivers set the wake up latency instead of the cpuidle core.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org

Conflicts:
	drivers/cpuidle/cpuidle.c
---
 drivers/cpuidle/cpuidle.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index d75040d..7b6553b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -19,6 +19,7 @@
 #include <linux/ktime.h>
 #include <linux/hrtimer.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>

 #include "cpuidle.h"
@@ -42,6 +43,12 @@ void disable_cpuidle(void)
 	off = 1;
 }

+static void cpuidle_set_current_state(int cpu, int latency)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+
+	atomic_set(&(stat->wake_latency), latency);
+}
 /**
  * cpuidle_play_dead - cpu off-lining
  *
@@ -79,12 +86,16 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	ktime_t time_start, time_end;
 	s64 diff;

+	cpuidle_set_current_state(dev->cpu, target_state->exit_latency);
+
 	time_start = ktime_get();

 	entered_state = target_state->enter(dev, drv, index);

 	time_end = ktime_get();

+	cpuidle_set_current_state(dev->cpu, 0);
+
 	local_irq_enable();

 	diff = ktime_to_us(ktime_sub(time_end, time_start));
-- 
1.7.9.5
[RFC][PATCH v5 06/14] sched: create a new field with available capacity
This new field power_available reflects the available capacity of a CPU unlike the cpu_power which reflects the current capacity. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 14 +++--- kernel/sched/sched.h |3 ++- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5568980..db9b871 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4584,15 +4584,19 @@ static void update_cpu_power(struct sched_domain *sd, int cpu) if (!power) power = 1; + cpu_rq(cpu)-cpu_available = power; + sdg-sgp-power_available = power; + cpu_rq(cpu)-cpu_power = power; sdg-sgp-power = power; + } void update_group_power(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd-child; struct sched_group *group, *sdg = sd-groups; - unsigned long power; + unsigned long power, available; unsigned long interval; interval = msecs_to_jiffies(sd-balance_interval); @@ -4604,7 +4608,7 @@ void update_group_power(struct sched_domain *sd, int cpu) return; } - power = 0; + power = available = 0; if (child-flags SD_OVERLAP) { /* @@ -4614,6 +4618,8 @@ void update_group_power(struct sched_domain *sd, int cpu) for_each_cpu(cpu, sched_group_cpus(sdg)) power += power_of(cpu); + available += available_of(cpu); + } else { /* * !SD_OVERLAP domains can assume that child groups @@ -4623,11 +4629,13 @@ void update_group_power(struct sched_domain *sd, int cpu) group = child-groups; do { power += group-sgp-power; + available += group-sgp-power_available; group = group-next; } while (group != child-groups); } - sdg-sgp-power_orig = sdg-sgp-power = power; + sdg-sgp-power_orig = sdg-sgp-power_available = available; + sdg-sgp-power = power; } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 22e3f1d..d5a4ec0 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -459,6 +459,7 @@ struct rq { struct sched_domain *sd; unsigned long cpu_power; + unsigned long cpu_available; unsigned char 
idle_balance; /* For active balancing */ @@ -603,7 +604,7 @@ struct sched_group_power { * CPU power of this group, SCHED_LOAD_SCALE being max power for a * single CPU. */ - unsigned int power, power_orig; + unsigned int power, power_orig, power_available; unsigned long next_update; /* * Number of busy cpus in this group. -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH v5 04/14] sched: do load balance only with packing cpus
The tasks will be scheduled only on the CPUs that participate to the packing effort. A CPU participates to the packing effort when it is its own buddy. For ILB, look for an idle CPU close to the packing CPUs whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power domain of the pack buddy CPU. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 80 --- 1 file changed, 76 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5547831..7149f38 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -186,6 +186,17 @@ void sched_init_granularity(void) */ DEFINE_PER_CPU(int, sd_pack_buddy); +static inline bool is_packing_cpu(int cpu) +{ + int my_buddy = per_cpu(sd_pack_buddy, cpu); + return (my_buddy == -1) || (cpu == my_buddy); +} + +static inline int get_buddy(int cpu) +{ + return per_cpu(sd_pack_buddy, cpu); +} + /* * Look for the best buddy CPU that can be used to pack small tasks * We make the assumption that it doesn't wort to pack on CPU that share the @@ -245,6 +256,32 @@ void update_packing_domain(int cpu) pr_debug(CPU%d packing on CPU%d\n, cpu, id); per_cpu(sd_pack_buddy, cpu) = id; } + +static int check_nohz_packing(int cpu) +{ + if (!is_packing_cpu(cpu)) + return true; + + return false; +} +#else /* CONFIG_SCHED_PACKING_TASKS */ + +static inline bool is_packing_cpu(int cpu) +{ + return 1; +} + +static inline int get_buddy(int cpu) +{ + return -1; +} + +static inline int check_nohz_packing(int cpu) +{ + return false; +} + + #endif /* CONFIG_SCHED_PACKING_TASKS */ #endif /* CONFIG_SMP */ @@ -3370,7 +3407,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, do { unsigned long load, avg_load; - int local_group; + int local_group, packing_cpus = 0; int i; /* Skip over this group if it has no CPUs allowed */ @@ -3392,8 +3429,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, load = target_load(i, load_idx); avg_load += 
load; + + if (is_packing_cpu(i)) + packing_cpus = 1; } + if (!packing_cpus) + continue; + /* Adjust by relative CPU power of the group */ avg_load = (avg_load * SCHED_POWER_SCALE) / group-sgp-power; @@ -3448,7 +3491,8 @@ static int select_idle_sibling(struct task_struct *p, int target) /* * If the prevous cpu is cache affine and idle, don't be stupid. */ - if (i != target cpus_share_cache(i, target) idle_cpu(i)) + if (i != target cpus_share_cache(i, target) idle_cpu(i) +is_packing_cpu(i)) return i; /* @@ -3463,7 +3507,8 @@ static int select_idle_sibling(struct task_struct *p, int target) goto next; for_each_cpu(i, sched_group_cpus(sg)) { - if (i == target || !idle_cpu(i)) + if (i == target || !idle_cpu(i) + || !is_packing_cpu(i)) goto next; } @@ -3528,9 +3573,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) } if (affine_sd) { - if (cpu != prev_cpu wake_affine(affine_sd, p, sync)) + if (cpu != prev_cpu (wake_affine(affine_sd, p, sync) + || !is_packing_cpu(prev_cpu))) prev_cpu = cpu; + if (!is_packing_cpu(prev_cpu)) + prev_cpu = get_buddy(prev_cpu); + new_cpu = select_idle_sibling(p, prev_cpu); goto unlock; } @@ -5593,7 +5642,26 @@ static struct { static inline int find_new_ilb(int call_cpu) { + struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask); + int buddy = get_buddy(call_cpu); + + /* +* If we have a pack buddy CPU, we try to run load balance on a CPU +* that is close to the buddy. +*/ + if (buddy != -1) { + for_each_domain(buddy, sd) { + if (sd-flags SD_SHARE_CPUPOWER) + continue; + + ilb = cpumask_first_and(sched_domain_span(sd), + nohz.idle_cpus_mask); + + if (ilb nr_cpu_ids) + break; + } + } if (ilb nr_cpu_ids idle_cpu(ilb)) return ilb; @@ -5874,6 +5942,10 @@ static inline int
[RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config
The SCHED_PACKING_TASKS config is used to enable the packing tasks mechanism.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 init/Kconfig | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 3ecd8a1..4d2b5db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1178,6 +1178,17 @@ config SCHED_AUTOGROUP
 	  desktop applications.  Task group autogeneration is currently based
 	  upon task session.

+config SCHED_PACKING_TASKS
+	bool "Automatic tasks packing"
+	depends on SMP && SCHED_MC
+	default n
+	help
+	  This option enables the packing mode of the scheduler.
+	  This mode ensures that the minimal number of CPUs will
+	  be used to handle the activity of the system. The CPUs
+	  are selected to minimize the number of power domains
+	  that must be kept on.
+
 config MM_OWNER
 	bool
-- 
1.7.9.5
[RFC][PATCH v5 12/14] sched: create a statistic structure
Create a statistic structure that will be used to share information with other frameworks like cpuidle and cpufreq. This structure only contains the current wake up latency of a core for now but could be extended with other useful information.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 include/linux/sched.h | 12 +++-
 kernel/sched/fair.c   |  5 +
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2004cdb..d676aa2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2,7 +2,7 @@
 #define _LINUX_SCHED_H

 #include <uapi/linux/sched.h>
-
+#include <linux/atomic.h>

 struct sched_param {
 	int sched_priority;
@@ -63,6 +63,16 @@ struct fs_struct;
 struct perf_event_context;
 struct blk_plug;

+/* This structure is used to share information and statistics with other
+ * frameworks. It only shares wake up latency for the moment but should be
+ * extended with other useful information
+ */
+struct sched_pm {
+	atomic_t wake_latency; /* time to wake up the cpu */
+};
+
+DECLARE_PER_CPU(struct sched_pm, sched_stat);
+
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d9f782..ad8b99a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,11 @@ void sched_init_granularity(void)
 	update_sysctl();
 }

+/* Save per_cpu information that will be shared with other frameworks */
+DEFINE_PER_CPU(struct sched_pm, sched_stat) = {
+	.wake_latency = ATOMIC_INIT(0)
+};
+
 #ifdef CONFIG_SMP
 static unsigned long available_of(int cpu)
 {
-- 
1.7.9.5
[RFC][PATCH v5 13/14] sched: differentiate idle cpu
The cost of waking up a core varies according to its current idle state. This includes the C-state and intermediate states when some sync between cores is required to reach a deep C-state. Waking up a CPU in a deep C-state to run a short task is not efficient from both a power and a performance point of view. We should take into account the wake up latency of an idle CPU when the scheduler looks for the best CPU to use for a waking task. The wake up latency of a CPU is computed into a load that can be directly compared with task load and other CPUs' load.

Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
---
 kernel/sched/fair.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad8b99a..4863dad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -394,6 +394,20 @@ static int check_nohz_packing(int cpu)
 	return false;
 }

+int sched_get_idle_load(int cpu)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+	int latency = atomic_read(&(stat->wake_latency));
+	/*
+	 * Transform the current wakeup latency (us) into an idle load that
+	 * will be compared to task load to decide if it's worth to wake up
+	 * the cpu. The current formula is quite simple but gives a good
+	 * approximation in the range [0:10ms]
+	 */
+	return (latency * 21) >> 10;
+}
+
 #else /* CONFIG_SCHED_PACKING_TASKS */

 static inline bool is_packing_cpu(int cpu)
@@ -416,6 +430,10 @@ static inline int check_nohz_packing(int cpu)
 	return false;
 }

+static inline int sched_get_idle_load(int cpu)
+{
+	return 0;
+}
 #endif /* CONFIG_SCHED_PACKING_TASKS */
 #endif /* CONFIG_SMP */

@@ -3207,6 +3225,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
+	if (idle_cpu(cpu))
+		return sched_get_idle_load(cpu);
 	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }

@@ -3655,6 +3675,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			if (i == target || !idle_cpu(i) || !is_packing_cpu(i))
 				goto next;
+			if (weighted_cpuload(i) > p->se.avg.load_avg_contrib)
+				goto next;
 		}

 		target = cpumask_first_and(sched_group_cpus(sg),
-- 
1.7.9.5
Re: Bench for testing scheduler
On 8 November 2013 22:12, Rowand, Frank frank.row...@sonymobile.com wrote: On Friday, November 08, 2013 1:28 AM, Vincent Guittot [vincent.guit...@linaro.org] wrote: On 8 November 2013 01:04, Rowand, Frank frank.row...@sonymobile.com wrote: snip The Avg figures look almost stable IMO. Are you speaking about the Max value for the inconsistency? The values on my laptop for -l 2000 are not stable. If I collapse all of the threads in each of the following tests to a single value I get the following table. Note that each thread completes a different number of cycles, so I calculate the average as: total count = T0_count + T1_count + T2_count + T3_count avg = ( (T0_count * T0_avg) + (T1_count * T1_avg) + ... + (T3_count * T3_avg) ) / total count min is the smallest min for any of the threads max is the largest max for any of the threads

              total
  test  T     count   min    avg    max
  ----  -     -----   ---   ----   ----
     1  4      5886     2   76.0   1017
     2  4      5881     2   71.5    810
     3  4      5885     2   74.2   1143
     4  4      5884     2   68.9   1279

test 1 average is 10% larger than test 4. test 4 maximum is 50% larger than test 2. But all of this is just a minor detail of how to run cyclictest. The more important question is whether to use cyclictest results as a valid workload or metric, so for the moment I won't comment further on the cyclictest parameters you used to collect the example data you provided. snip Thanks for clarifying how the data was calculated (below). Again, I don't think this level of detail is the most important issue at this point, but I'm going to comment on it while it is still fresh in my mind. Some questions about what these metrics are: The cyclictest data is reported per thread. How did you combine the per thread data to get a single latency and stddev value? Is Latency the average latency? Yes. I have described below the procedure I have followed to get my results: I run the same test (same parameters) several times (I have tried between 5 and 10 runs and the results were similar).
For each run, I compute the average of the per thread average figures and I compute the stddev between the per thread results. So the test run stddev is the standard deviation of the values for average latency of the 8 (???) cyclictest threads in a test run? I have used 5 threads for my tests. If so, I don't think that the calculated stddev has much actual meaning for comparing the algorithms (I do find it useful to get a loose sense of how consistent multiple test runs with the same parameters are). The results that I sent are an average of all runs with the same parameters. Then the stddev in the table is the average of the stddev in several test runs? Yes it is. The stddev later on in the table is often in the range of 10%, 20%, 50%, and 100% of the average latency. That is rather large. Yes, I agree, and it's an interesting figure IMHO because it points out how the wake up of a core can impact the task scheduling latency and how it's possible to reduce it or make it more stable (even if we still have some large max values which are probably not linked to the wake up of a core but to other activities like deferrable timers that have fired). stddev is not reported by cyclictest. How did you create this value? Did you use the -v cyclictest option to report detailed data, then calculate stddev from the detailed data? No, I haven't used -v because it generates too many spurious wake ups that make the results irrelevant. Yes, I agree about not using -v. It was just a wild guess on my part since I did not know how stddev was calculated. And I was incorrectly guessing that stddev was describing the frequency distribution of the latencies from a single test run. I haven't been so precise in my computation, mainly because the outputs were almost coherent, but we probably need more precise statistics in a final step. As a general comment on cyclictest, I don't find average latency (in isolation) sufficient to compare different runs of cyclictest.
And the stddev of the frequency distribution of the latencies (which can be calculated from the -h data, with fairly low cyclictest overhead) is usually interesting but should be viewed with a healthy skepticism since that frequency distribution is often not a normal distribution. In addition to average latency, I normally look at maximum latency and the frequency distribution of latencies (in table or graph form). (One side effect of specifying -h is that the -d option is then ignored.) I'm going to have a look at the -h parameter, which can be useful to get a better view of the frequency distribution as you point out. Having the distance set to 0 (-d) can be an issue because we could have a synchronization of the wake up of the threads which will finally hide the real wake up latency. It's interesting to have a distance
Re: [RFC][PATCH v5 00/14] sched: packing tasks
On 11 November 2013 17:38, Peter Zijlstra pet...@infradead.org wrote: On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote: My understanding from the recent discussions is that the scheduler should decide directly on the C-state (or rather the deepest C-state possible since we don't want to duplicate the backend logic for synchronising CPUs going up or down). This means that the scheduler needs to know about C-state target residency, wake-up latency (I think we can leave coupled C-states to the backend, there is some complex synchronisation which I wouldn't duplicate). Alternatively (my preferred approach), we get the scheduler to predict and pass the expected residency and latency requirements down to a power driver and read back the actual C-states for making task placement decisions. Some of the menu governor prediction logic could be turned into a library and used by the scheduler. Basically what this tries to achieve is better scheduler awareness of the current C-states decided by a cpuidle/power driver based on the scheduler constraints. Ah yes.. so I _think_ the scheduler wants to eventually know about idle topology constraints. But we can get there in a gradual fashion I hope. Like the package C states on x86 -- for those to be effective the scheduler needs to pack tasks and keep entire packages idle for as long as possible. That's the purpose of patches 12, 13 and 14. To get the current wakeup latency of a core and use it when selecting a core
Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
On 12 November 2013 11:32, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote: +int sched_proc_update_packing(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + + if (sysctl_sched_packing_level) + sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level; + + return 0; +} +#ifdef CONFIG_SCHED_PACKING_TASKS +static int min_sched_packing_level; +static int max_sched_packing_level = 100; +#endif /* CONFIG_SMP */ +#ifdef CONFIG_SCHED_PACKING_TASKS + { + .procname = "sched_packing_level", + .data = &sysctl_sched_packing_level, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sched_proc_update_packing, + .extra1 = &min_sched_packing_level, + .extra2 = &max_sched_packing_level, + }, +#endif Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and expect something; but then we don't update sd_pack_threshold so nothing really changed. Value 0 is used to disable the packing feature and the scheduler falls back to the default behavior. This value is tested when setting which cpus will be used by the scheduler.
Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
On 12 November 2013 11:55, Peter Zijlstra pet...@infradead.org wrote: On Tue, Nov 12, 2013 at 11:44:15AM +0100, Vincent Guittot wrote: On 12 November 2013 11:32, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote: +int sched_proc_update_packing(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret || !write) + return ret; + + if (sysctl_sched_packing_level) + sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level; + + return 0; +} +#ifdef CONFIG_SCHED_PACKING_TASKS +static int min_sched_packing_level; +static int max_sched_packing_level = 100; +#endif /* CONFIG_SMP */ +#ifdef CONFIG_SCHED_PACKING_TASKS + { + .procname = "sched_packing_level", + .data = &sysctl_sched_packing_level, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = sched_proc_update_packing, + .extra1 = &min_sched_packing_level, + .extra2 = &max_sched_packing_level, + }, +#endif Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and expect something; but then we don't update sd_pack_threshold so nothing really changed. Value 0 is used to disable the packing feature and the scheduler falls back to the default behavior. This value is tested when setting which cpus will be used by the scheduler. I suspected as much, but it wasn't clear from the Changelog, the patch or any comments. Plz as to fix. Ok, I will add it
Re: [RFC][PATCH v5 06/14] sched: create a new field with available capacity
On 12 November 2013 11:34, Peter Zijlstra pet...@infradead.org wrote: On Fri, Oct 18, 2013 at 01:52:19PM +0200, Vincent Guittot wrote: This new field power_available reflects the available capacity of a CPU unlike the cpu_power which reflects the current capacity. - sdg->sgp->power_orig = sdg->sgp->power = power; + sdg->sgp->power_orig = sdg->sgp->power_available = available; + sdg->sgp->power = power; This patch leaves me confused as to power_available vs power_orig and the Changelog doesn't really clarify anything much at all. Ok, I will add more details in the changelog. power_orig can only be modified for SMT purposes, otherwise it stays at 1024, whereas power_available takes into account the modifications that have been done by the platform. I can probably re-factor that and merge power_orig and power_available. As of now, arch_scale_smt_power and arch_scale_freq_power are not used simultaneously by an architecture so I can probably move the returned value of arch_scale_freq_power into power_orig. This would even make more sense regarding its current use.
Re: [RFC][PATCH v5 00/14] sched: packing tasks
On 11 November 2013 12:33, Catalin Marinas catalin.mari...@arm.com wrote: Hi Vincent, (cross-posting to linux-pm as it was agreed to follow up on this list) snip So, IMO, defining the power topology is a good starting point and I think it's better to separate the patches from the energy saving algorithms like packing. We need to agree on what information we have Daniel and Tuukka, who are working on cpuidle consolidation in the scheduler, are also interested in using topology information similar to mine. I have made one patchset only because the information was only used here. So I can probably make a separate patchset of the power topology description in DT. Vincent (C-state details, coupling, power gating) and what we can/need to expose to the scheduler. This can be revisited once we start implementing/refining the energy awareness. snip -- Catalin
[RFC 0/6] rework sched_domain topology description
This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into 3 different patchsets (at least) to make things easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures add a specific level like the current BOOK level or the proposed power gating level for the ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c | 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) -- 1.7.9.5
[RFC 3/6] sched: s390: create a dedicated topology table
BOOK level is only relevant for s390 so we create a dedicated topology table with BOOK level and remove it from the default table. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/s390/include/asm/topology.h | 11 +-- arch/s390/kernel/topology.c | 25 + kernel/sched/core.c | 3 --- 3 files changed, 26 insertions(+), 13 deletions(-) diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h index 07763bd..56af530 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS]; #define mc_capable() 1 -static inline const struct cpumask *cpu_coregroup_mask(int cpu) -{ - return &cpu_topology[cpu].core_mask; -} - -static inline const struct cpumask *cpu_book_mask(int cpu) -{ - return &cpu_topology[cpu].book_mask; -} - int topology_cpu_init(struct cpu *); int topology_set_cpu_management(int fc); void topology_schedule_update(void); void store_topology(struct sysinfo_15_1_x *info); void topology_expect_change(void); +const struct cpumask *cpu_coregroup_mask(int cpu); #else /* CONFIG_SCHED_BOOK */ diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c index 4b2e3e3..8a7ac47 100644 --- a/arch/s390/kernel/topology.c +++ b/arch/s390/kernel/topology.c @@ -443,6 +443,28 @@ int topology_cpu_init(struct cpu *cpu) return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group); } +const struct cpumask *cpu_coregroup_mask(int cpu) +{ + return &cpu_topology[cpu].core_mask; +} + +static const struct cpumask *cpu_book_mask(int cpu) +{ + return &cpu_topology[cpu].book_mask; +} + +static struct sched_domain_topology_level s390_topology[] = { + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, + { cpu_book_mask, SD_INIT_NAME(BOOK) }, + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = s390_topology; +} + static int __init topology_init(void) { if (!MACHINE_HAS_TOPOLOGY) { @@ -452,6 +474,9 @@ static int __init topology_init(void) set_topology_timer(); out: update_cpu_masks(); + + set_sched_topology(); + return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching); } device_initcall(topology_init); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6f6a8f0..3479467 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6043,9 +6043,6 @@ static struct sched_domain_topology_level default_topology[] = { #ifdef CONFIG_SCHED_MC { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, #endif -#ifdef CONFIG_SCHED_BOOK - { cpu_book_mask, SD_INIT_NAME(BOOK) }, -#endif { cpu_cpu_mask, SD_INIT_NAME(DIE) }, { NULL, }, }; -- 1.7.9.5
[RFC 6/6] sched: ARM: create a dedicated scheduler topology table
Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can or cannot power gate independently from others. The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/arm/kernel/topology.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 0bc94b1..ae8ffbc 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu) return &cpu_topology[cpu].core_sibling; } +/* + * The current assumption is that we can power gate each core independently. + * This will be superseded by DT binding once available. + */ +const struct cpumask *cpu_corepower_mask(int cpu) +{ + return &cpu_topology[cpu].thread_sibling; +} + static void update_siblings_masks(unsigned int cpuid) { struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); } +static struct sched_domain_topology_level arm_topology[] = { +#ifdef CONFIG_SCHED_MC + { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) }, + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = arm_topology; +} + /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array @@ -289,4 +312,7 @@ void __init init_cpu_topology(void) smp_wmb(); parse_dt_topology(); + + /* Set scheduler topology descriptor */ + set_sched_topology(); } -- 1.7.9.5
[RFC 4/6] sched: powerpc: create a dedicated topology table
Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology. Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence and before the build of the sched_domain topology. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/powerpc/kernel/smp.c | 35 +++ kernel/sched/core.c | 6 -- 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..75da054 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric SMT dependency */ +static const struct cpumask *cpu_asmt_mask(int cpu) +{ + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + return topology_thread_cpumask(cpu); + } + return cpumask_of(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) }, + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = powerpc_topology; +} + void __init smp_cpus_done(unsigned int max_cpus) { cpumask_var_t old_mask; @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -} + set_sched_topology(); -int arch_sd_sibling_asym_packing(void) -{ - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - return SD_ASYM_PACKING; - } - return 0; } #ifdef CONFIG_HOTPLUG_CPU diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3479467..7606de0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight); } -int __weak arch_sd_sibling_asym_packing(void) -{ - return 0*SD_ASYM_PACKING; -} - /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) if (sd->flags & SD_SHARE_CPUPOWER) { sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ - sd->flags |= arch_sd_sibling_asym_packing(); } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { sd->imbalance_pct = 117; -- 1.7.9.5
[RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can or cannot reach different power states. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used to add a load balancing level between groups of CPUs that can power gate independently. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so the flag is set into all sched_domains. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- include/linux/sched.h | 1 + kernel/sched/core.c | 9 ++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index dbc35dd..182a080 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -861,6 +861,7 @@ enum cpu_idle_type { #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7606de0..b28cff0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5283,7 +5283,8 @@ static int sd_degenerate(struct sched_domain *sd) SD_BALANCE_FORK | SD_BALANCE_EXEC | SD_SHARE_CPUPOWER | -SD_SHARE_PKG_RESOURCES)) { +SD_SHARE_PKG_RESOURCES | +SD_SHARE_POWERDOMAIN)) { if (sd->groups != sd->groups->next) return 0; } @@ -5314,7 +5315,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) SD_BALANCE_EXEC | SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | - SD_PREFER_SIBLING); + SD_PREFER_SIBLING | + SD_SHARE_POWERDOMAIN); if (nr_node_ids == 1) pflags &= ~SD_SERIALIZE; } @@ -5932,7 +5934,8 @@ static struct cpumask ***sched_domains_numa_masks; (SD_SHARE_CPUPOWER |\ SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ -SD_ASYM_PACKING) +SD_ASYM_PACKING | \ +SD_SHARE_POWERDOMAIN) static struct sched_domain * sd_init(struct sched_domain_topology_level *tl, int cpu) -- 1.7.9.5
[PATCH 2/6] sched: rework of sched_domain topology definition
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add a new level where a load balance makes sense, like a BOOK or powergating level. For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its child. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h | 24 arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 29 + include/linux/topology.h | 128 +++-- kernel/sched/core.c | 227 +++--- 6 files changed, 156 insertions(+), 287 deletions(-) diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..20d12fa 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -46,30 +46,6 @@ void build_cpu_to_node_map(void); -#define SD_CPU_INIT (struct sched_domain) {\ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 1,\ - .max_interval = 4,\ - .busy_factor = 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx = 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_NEWIDLE\ - | SD_BALANCE_EXEC \ - | SD_BALANCE_FORK \ - | SD_WAKE_AFFINE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ -} - #endif /* CONFIG_NUMA */ #ifdef CONFIG_SMP diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644 --- a/arch/s390/include/asm/topology.h +++ b/arch/s390/include/asm/topology.h @@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void) }; #endif -#define SD_BOOK_INIT SD_CPU_INIT - #include asm-generic/topology.h #endif /* _ASM_S390_TOPOLOGY_H */ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index d15c0d8..9383118 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node) /* For now, use numa node -1 for global allocation. */ #define pcibus_to_node(bus)((void)(bus), -1) -/* - * TILE architecture has many cores integrated in one processor, so we need - * setup bigger balance_interval for both CPU/NODE scheduling domains to - * reduce process scheduling costs. - */ - -/* sched_domains SD_CPU_INIT for TILE architecture */ -#define SD_CPU_INIT (struct sched_domain) {\ - .min_interval = 4,\ - .max_interval = 128, \ - .busy_factor= 64, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 1,\ - .busy_idx = 2,\ - .idle_idx = 1,\ - .newidle_idx= 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - \ - .flags = 1*SD_LOAD_BALANCE \ - | 1*SD_BALANCE_NEWIDLE \ - | 1*SD_BALANCE_EXEC \ - | 1*SD_BALANCE_FORK
[RFC 1/6] sched: remove unused SCHED_INIT_NODE
Not used since the new NUMA scheduler init sequence. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/metag/include/asm/topology.h | 27 --- 1 file changed, 27 deletions(-) diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h index 8e9c0b3..e95f874 100644 --- a/arch/metag/include/asm/topology.h +++ b/arch/metag/include/asm/topology.h @@ -3,33 +3,6 @@ #ifdef CONFIG_NUMA -/* sched_domains SD_NODE_INIT for Meta machines */ -#define SD_NODE_INIT (struct sched_domain) { \ - .parent = NULL, \ - .child = NULL, \ - .groups = NULL, \ - .min_interval = 8,\ - .max_interval = 32, \ - .busy_factor = 32, \ - .imbalance_pct = 125, \ - .cache_nice_tries = 2,\ - .busy_idx = 3,\ - .idle_idx = 2,\ - .newidle_idx = 0,\ - .wake_idx = 0,\ - .forkexec_idx = 0,\ - .flags = SD_LOAD_BALANCE \ - | SD_BALANCE_FORK \ - | SD_BALANCE_EXEC \ - | SD_BALANCE_NEWIDLE\ - | SD_SERIALIZE, \ - .last_balance = jiffies, \ - .balance_interval = 1,\ - .nr_balance_failed = 0,\ - .max_newidle_lb_cost = 0,\ - .next_decay_max_lb_cost = jiffies, \ -} - #define cpu_to_node(cpu) ((void)(cpu), 0) #define parent_node(node) ((void)(node), 0) -- 1.7.9.5
Re: [PATCH 2/6] sched: rework of sched_domain topology definition
On 6 March 2014 01:09, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add a new level where a load balance makes sense, like a BOOK or powergating level. For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of Maybe it's worth mentioning here that those flags are from the set of the sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface. Yes, I will add the list of flags that can be set with the table. the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its child. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/ia64/include/asm/topology.h | 24 arch/s390/include/asm/topology.h | 2 - arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 29 + include/linux/topology.h | 128 +++-- kernel/sched/core.c | 227 +++--- 6 files changed, 156 insertions(+), 287 deletions(-) [snip] - -#define for_each_sd_topology(tl) \ - for (tl = sched_domain_topology; tl->init; tl++) Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
It should not be, and neither should its use in sd_init. +static int sched_domains_curr_level; #ifdef CONFIG_NUMA - static int sched_domains_numa_levels; static int *sched_domains_numa_distance; static struct cpumask ***sched_domains_numa_masks; -static int sched_domains_curr_level; - -static inline int sd_local_flags(int level) -{ - if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) - return 0; +#endif - return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; -} +/* + * SD_flags allowed in topology descriptions. + * + * SD_SHARE_CPUPOWER - describes SMT topologies + * SD_SHARE_PKG_RESOURCES - describes shared caches + * SD_NUMA - describes NUMA topologies + * + * Odd one out: + * SD_ASYM_PACKING - describes SMT quirks + */ +#define TOPOLOGY_SD_FLAGS \ + (SD_SHARE_CPUPOWER |\ +SD_SHARE_PKG_RESOURCES | \ +SD_NUMA | \ +SD_ASYM_PACKING) static struct sched_domain * -sd_numa_init(struct sched_domain_topology_level *tl, int cpu) +sd_init(struct sched_domain_topology_level *tl, int cpu) { struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu); - int level = tl->numa_level; - int sd_weight = cpumask_weight( - sched_domains_numa_masks[level][cpu_to_node(cpu)]); + int sd_weight; + Next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though. + /* +* Ugly hack to pass state to sd_numa_mask()... +*/ + sched_domains_curr_level = tl->numa_level; + + sd_weight = cpumask_weight(tl->mask(cpu)); + + if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS, + "wrong sd_flags in topology description\n")) + tl->sd_flags &= ~TOPOLOGY_SD_FLAGS; *sd = (struct sched_domain){ .min_interval = sd_weight, .max_interval = 2*sd_weight, .busy_factor = 32, .imbalance_pct = 125, - .cache_nice_tries = 2, - .busy_idx = 3, - .idle_idx = 2, + + .cache_nice_tries = 0, + .busy_idx = 0, + .idle_idx = 0, .newidle_idx = 0, .wake_idx = 0, .forkexec_idx = 0, Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out before.
This is true for all data members which are set to 0 later in this function, including the | 0*SD_FOO. IMHO, leaving them out would make the code more readable. I would say that it makes the configuration more readable and modifiable because you have the list of possible flags to set: .flags = 1*SD_LOAD_BALANCE | 1*SD_BALANCE_NEWIDLE - | 0*SD_BALANCE_EXEC
Re: [RFC 6/6] sched: ARM: create a dedicated scheduler topology table
On 6 March 2014 06:38, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can powergate independently from the others from those that cannot. The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN. Signed-off-by: Vincent Guittot vincent.guit...@linaro.org [snip] parse_dt_topology(); + + /* Set scheduler topology descriptor */ + set_sched_topology(); } How about the core scheduler provides an interface set_sched_topology() instead of each arch having its own __init function? Sketched out below for ARM. yes, i will add it -- >8 -- Subject: [PATCH] sched: set_sched_topology() as an interface of core scheduler --- arch/arm/kernel/topology.c | 7 +-- include/linux/sched.h | 2 +- kernel/sched/core.c | 5 + 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index ae8ffbc..89d5592 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -284,11 +284,6 @@ static struct sched_domain_topology_level arm_topology[] = { { NULL, }, }; -static void __init set_sched_topology(void) -{ - sched_domain_topology = arm_topology; -} - /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array */ @@ -314,5 +309,5 @@ void __init init_cpu_topology(void) parse_dt_topology(); /* Set scheduler topology descriptor */ - set_sched_topology(); + set_sched_topology(arm_topology); } diff --git a/include/linux/sched.h b/include/linux/sched.h index 8831413..fefd4e7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -998,7 +998,7 @@ struct sched_domain_topology_level { #endif }; -extern struct sched_domain_topology_level *sched_domain_topology; +extern void set_sched_topology(struct sched_domain_topology_level *tl); #ifdef CONFIG_SCHED_DEBUG # define SD_INIT_NAME(type) .name = #type
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b4fb0df..a748c92 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6105,6 +6105,11 @@ static struct sched_domain_topology_level default_topology[] = { struct sched_domain_topology_level *sched_domain_topology = default_topology; +void set_sched_topology(struct sched_domain_topology_level *tl) +{ + sched_domain_topology = tl; +} + #define for_each_sd_topology(tl) \ for (tl = sched_domain_topology; tl->mask; tl++) -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/6] rework sched_domain topology description
On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures add a specific level like the current BOOK level or the proposed power gating level for the ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c | 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level.
int (*sched_domain_flags_f)(int cpu); We have to create additional level for some kind of topology as described in my trial https://lkml.org/lkml/2013/12/18/279 which is not possible with function pointer. Have you got a situation in mind where it will be necessary to use the function pointer with cpu number as an argument ? In the current example of this patchset, the flags are statically set in the table but nothing prevents an architecture to update the flags value before being given to the scheduler This function pointer would be simply another member of struct sched_domain_topology_level and would replace int sd_flags. AFAICS, you have to create additional cpu mask functions anyway for the additional sd levels, like cpu_corepower_mask() for the GMC level in the ARM case. There could be a set of standard sd topology flags function for the default sd layer and archs which want to pass in something special define those function locally since they will use them only in their arch specific struct sched_domain_topology_level table anyway. I know that you use the existing sd degenerate functionality for this and that the freeing of the redundant data structures (sched_domain, sched_group and sched_group_power) is there too but it still doesn't seem to me to be the right thing to do. The problem that we now expose internal data structures (struct sd_data and struct sched_domain_topology_level) could be dealt with later. -- Dietmar -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/6] rework sched_domain topology description
On 6 March 2014 20:31, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 06/03/14 09:04, Vincent Guittot wrote: On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures to add specific level like the current BOOK level or the proposed power gating level for ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c| 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. 
Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level. int (*sched_domain_flags_f)(int cpu); We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer. This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpu 0-7 can powergate while cpu 8-15 can't), why can't we have The 2nd example is a bit more complicated and needs an additional level to describe the topology static inline int cpu_coregroup_flags(int cpu) { int flags = SD_SHARE_PKG_RESOURCES; if (cpu >= 8) flags |= SD_SHARE_POWERDOMAIN; return flags; } inside the arch specific topology file and use it in the MC level as the int (*sched_domain_flags_f)(int cpu) member of struct sched_domain_topology_level? Have you got a situation in mind where it will be necessary to use the function pointer with the cpu number as an argument? The one above. Fundamentally all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example but it's the same as powergated/!powergated in this perspective. You see the flags of a level as something that can be changed per cpu, whereas the proposal is to define a number of levels with interesting properties for the scheduler and to associate a mask of cpus to these properties. I don't have a strong opinion about using or not a cpu argument for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of sched_domain). Nevertheless, I see one potential concern: you can have completely different flags configurations for the same sd level on 2 cpus.
Peter, was the use of the cpu as a parameter in the initialization of the sched_domain's flags a reason for asking for a rework of the sched_domain initialization? In the current example of this patchset, the flags are statically set in the table but nothing prevents an architecture from updating the flags value before it is given to the scheduler What will be the infrastructure if the arch wants to do so? I'm not sure I catch your point. The arch is now able to define its own table and fill it before giving it to the scheduler, so nothing prevents it from setting/updating the flags field according to the system configuration during the boot sequence. Thanks, Vincent Thanks, -- Dietmar
Re: [RFC 0/6] rework sched_domain topology description
On 8 March 2014 13:40, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 07/03/14 02:47, Vincent Guittot wrote: On 6 March 2014 20:31, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 06/03/14 09:04, Vincent Guittot wrote: On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggem...@arm.com wrote: On 05/03/14 07:18, Vincent Guittot wrote: This patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures to add specific level like the current BOOK level or the proposed power gating level for ARM architecture. [1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449 Vincent Guittot (6): sched: remove unused SCHED_INIT_NODE sched: rework of sched_domain topology definition sched: s390: create a dedicated topology table sched: powerpc: create a dedicated topology table sched: add a new SD_SHARE_POWERDOMAIN for sched_domain sched: ARM: create a dedicated scheduler topology table arch/arm/kernel/topology.c| 26 arch/ia64/include/asm/topology.h | 24 arch/metag/include/asm/topology.h | 27 - arch/powerpc/kernel/smp.c | 35 -- arch/s390/include/asm/topology.h | 13 +- arch/s390/kernel/topology.c | 25 arch/tile/include/asm/topology.h | 33 -- include/linux/sched.h | 30 + include/linux/topology.h | 128 ++-- kernel/sched/core.c | 235 ++--- 10 files changed, 237 insertions(+), 339 deletions(-) Hi Vincent, I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it. 
Thanks One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level. int (*sched_domain_flags_f)(int cpu); We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer. This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpu 0-7 can powergate while cpu 8-15 can't), why can't we have The 2nd example is a bit more complicated and needs an additional level to describe the topology I see. In the 2nd example you want to have an additional MC level for cpu 2-7 because you want to do load balance between those cpus more often than for cpu 0-7, for dst cpus from the set (2-7). Not sure if such topologies (only a subset of cpus inside a cluster can't be powergated) exist today in the real world though? Be sure that such a topology is studied and considered by HW guys from https://lkml.org/lkml/2013/12/18/279: CPU2: domain 0: span 2-3 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 0 1 -- Doesn't this have to be 'groups: 2 3' domain 1: span 2-7 level: MC flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN groups: 2-7 4-5 6-7 domain 2: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES groups: 2-7 0-1 domain 3: span 0-15 level: CPU flags: groups: 0-7 8-15 static inline int cpu_coregroup_flags(int cpu) { int flags = SD_SHARE_PKG_RESOURCES; if (cpu >= 8) flags |= SD_SHARE_POWERDOMAIN; return flags; } inside the arch specific topology file and use it in the MC level as the int (*sched_domain_flags_f)(int cpu) member of struct sched_domain_topology_level?
Have you got a situation in mind where it will be necessary to use the function pointer with cpu number as an argument ? The one above. Fundamentally all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example but it's the same as powergated/!powergated in this perspective. You see the flag of a level as something that can be changed per cpu whereas the proposal is to define a number of level with interesting properties for the scheduler and to associate a mask of cpus to this properties. That's right. I don't have a strong opinion about using or not a cpu argument for setting the flags of a level (it was part of the initial
Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
On 11 March 2014 11:08, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi Vincent, On 03/05/2014 12:48 PM, Vincent Guittot wrote: Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology. Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence and before the build of the sched_domain topology. Is the below what you mean as the other solution? If it is so, I would strongly recommend this approach rather than adding another level to the topology table to represent the asymmetric behaviour. +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; Not exactly like that but something like below +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ +#ifdef CONFIG_SCHED_SMT + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing(); +#endif + sched_domain_topology = powerpc_topology; +} Regards Preeti U Murthy Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- arch/powerpc/kernel/smp.c | 35 +++ kernel/sched/core.c | 6 -- 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..75da054 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric
SMT dependency */ +static const struct cpumask *cpu_asmt_mask(int cpu) +{ + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + return topology_thread_cpumask(cpu); + } + return cpumask_of(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) }, + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) }, +#endif + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + +static void __init set_sched_topology(void) +{ + sched_domain_topology = powerpc_topology; +} + void __init smp_cpus_done(unsigned int max_cpus) { cpumask_var_t old_mask; @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -} + set_sched_topology(); -int arch_sd_sibling_asym_packing(void) -{ - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - return SD_ASYM_PACKING; - } - return 0; } #ifdef CONFIG_HOTPLUG_CPU diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3479467..7606de0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight); } -int __weak arch_sd_sibling_asym_packing(void) -{ - return 0*SD_ASYM_PACKING; -} - /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() */ @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) if (sd->flags & SD_SHARE_CPUPOWER) { sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ - sd->flags |= arch_sd_sibling_asym_packing(); } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { sd->imbalance_pct = 117;
Re: [PATCH 2/6] sched: rework of sched_domain topology definition
On 11 March 2014 11:31, Peter Zijlstra pet...@infradead.org wrote: On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote: Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one(). In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, i can come back to CPU Yeah, not sure DIE is the right thing either; because there's multi-die packages that get classified under CPU :-) Take for example the Core 2 Quad, which was 2 dual core dies glued together in a single package. There's also the AMD bulldozer which glued two dies into a single package; but for those its not a problem because each die is a separate numa node, so there DIE would actually be the correct term and PACKAGE would be wrong. So while CPU sucks, I'm not sure we can come up with anything that's actually correct. That said; we could try for something less wrong than CPU :-) OK Dietmar, Have you got another naming than DIE that could suit better? otherwise i will keep it Vincent I'm not sure there are a lot of people who see/know the names of these domains to be bothered by a change in them; it might be limited to just us for all I know.
[PATCH] sched: fix: remove double calculation in fix_small_imbalance
The tmp value has already been calculated in: scaled_busy_load_per_task = (busiest->load_per_task * SCHED_POWER_SCALE) / busiest->group_power; Signed-off-by: Vincent Guittot vincent.guit...@linaro.org --- kernel/sched/fair.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f1eedae..b301918 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6061,12 +6061,10 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) pwr_now /= SCHED_POWER_SCALE; /* Amount of load we'd subtract */ - tmp = (busiest->load_per_task * SCHED_POWER_SCALE) / - busiest->group_power; - if (busiest->avg_load > tmp) { + if (busiest->avg_load > scaled_busy_load_per_task) { pwr_move += busiest->group_power * min(busiest->load_per_task, - busiest->avg_load - tmp); + busiest->avg_load - scaled_busy_load_per_task); } /* Amount of load we'd add */ -- 1.9.0