Re: [patch 2/2] sched: dynticks idle load balancing - v2

2007-02-22 Thread Nick Piggin
On Thu, Feb 22, 2007 at 02:33:00PM -0800, Suresh B wrote:
> On Thu, Feb 22, 2007 at 04:26:54AM +0100, Nick Piggin wrote:
> > This is really ugly, sorry :(
> 
> Hm, I and others thought it was a simple and nice idea.

The idea is not bad. I won't guarantee mine will be as good or better,
but I think it is sensible to try implementing the simplest approach
first, so we can get a baseline to justify more complexity against...

Your code just needs work, but if it really produces good results then
it can be turned into a mergeable patch.

> > My suggestion for handling this was to increase the maximum balance
> > interval for an idle CPU, and just implement a global shutdown when
> > the entire system goes idle.
> > 
> > The former should take care of the power savings issues for bare metal
> > hardware, and the latter should solve performance problems for many idle
> > SMP guests. It should take very little code to implement.
> 
> Getting the max balance interval right will be challenging. It needs to
> save power and at the same time respond to load changes fast enough.

Yep.

> > If that approach doesn't cut it, then at least we can have some numbers
> > to see how much better yours is so we can justify including it.
> > 
> > If you are against my approach, then I can have a try at coding it up
> > if you like?
> 
> Sure. If you can provide a patch, I will be glad to provide power and
> performance comparison numbers for both approaches.

OK that would be good. I'll see if I can code something up by next week.

Thanks,
Nick


Re: [patch 2/2] sched: dynticks idle load balancing - v2

2007-02-22 Thread Siddha, Suresh B
On Thu, Feb 22, 2007 at 04:26:54AM +0100, Nick Piggin wrote:
> This is really ugly, sorry :(

Hm, I and others thought it was a simple and nice idea.

> My suggestion for handling this was to increase the maximum balance
> interval for an idle CPU, and just implement a global shutdown when
> the entire system goes idle.
> 
> The former should take care of the power savings issues for bare metal
> hardware, and the latter should solve performance problems for many idle
> SMP guests. It should take very little code to implement.

Getting the max balance interval right will be challenging. It needs to
save power and at the same time respond to load changes fast enough.

> If that approach doesn't cut it, then at least we can have some numbers
> to see how much better yours is so we can justify including it.
> 
> If you are against my approach, then I can have a try at coding it up
> if you like?

Sure. If you can provide a patch, I will be glad to provide power and
performance comparison numbers for both approaches.

thanks,
suresh


Re: [patch 2/2] sched: dynticks idle load balancing - v2

2007-02-21 Thread Nick Piggin
On Fri, Feb 16, 2007 at 06:08:42PM -0800, Suresh B wrote:
> Changes since v1:
> 
>   - Move the idle load balancer selection from schedule()
> to the first busy scheduler_tick() after restarting the tick.
> This will avoid unnecessary ownership changes when softirqs like
> timer and sched (which run in softirqd context in certain -rt
> configurations) are invoked for every idle tick
> that happens.
> 
>   - Misc cleanups.
> 
> ---
> Fix the process idle load balancing in the presence of dynticks.
> CPUs for which ticks are stopped will sleep till the next event wakes them
> up. Potentially these sleeps can be for large durations, during which today
> no periodic idle load balancing is done.
> 
> This patch nominates an owner among the idle CPUs, which does the idle load
> balancing on behalf of the other idle CPUs. Once all the CPUs are completely
> idle, this idle load balancing can be stopped too. Checks added in the fast
> path are minimized. Whenever there are busy CPUs in the system, there will
> be an owner (idle CPU) doing the system-wide idle load balancing.

This is really ugly, sorry :(

My suggestion for handling this was to increase the maximum balance
interval for an idle CPU, and just implement a global shutdown when
the entire system goes idle.

The former should take care of the power savings issues for bare metal
hardware, and the latter should solve performance problems for many idle
SMP guests. It should take very little code to implement.
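
Concretely, I mean something like this (an untested sketch against
rebalance_tick() in 2.6.20; sd->balance_interval, sd->busy_factor and
SCHED_IDLE are existing code, while the backoff field and the cap are
made-up names):

	/* sketch only: stretch the interval while the CPU stays idle */
	interval = sd->balance_interval;
	if (idle == SCHED_IDLE) {
		interval *= sd->idle_backoff;		  /* made-up field */
		if (interval > MAX_IDLE_BALANCE_INTERVAL) /* made-up cap */
			interval = MAX_IDLE_BALANCE_INTERVAL;
	} else
		interval *= sd->busy_factor;

plus a global "everybody idle" check to shut balancing off entirely when
no CPU in the system has runnable tasks.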

If that approach doesn't cut it, then at least we can have some numbers
to see how much better yours is so we can justify including it.

If you are against my approach, then I can have a try at coding it up
if you like?


Re: [patch 2/2] sched: dynticks idle load balancing - v2

2007-02-21 Thread Nick Piggin
On Wed, Feb 21, 2007 at 12:23:44PM -0800, Andrew Morton wrote:
> On Fri, 16 Feb 2007 18:08:42 -0800

> > +int select_nohz_load_balancer(int stop_tick)
> > +{
> > +   int cpu = smp_processor_id();
> > +
> > +   if (stop_tick) {
> > +   cpu_set(cpu, nohz.cpu_mask);
> > +   cpu_rq(cpu)->in_nohz_recently = 1;
> > +
> > +   /*
> > +* If we are going offline and still the leader, give up!
> > +*/
> > +   if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
> > +   if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
> 
> So we require that architectures which implement CONFIG_NO_HZ also
> implement cmpxchg.

Just use atomic_cmpxchg, please.
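
i.e. make the owner field an atomic_t, so the generic atomic_cmpxchg()
works on every arch. A sketch (untested, not a real patch):

	static struct {
		atomic_t load_balancer;		/* -1 == no ilb owner */
		cpumask_t cpu_mask;
	} nohz ____cacheline_aligned = {
		.load_balancer = ATOMIC_INIT(-1),
		.cpu_mask = CPU_MASK_NONE,
	};

	/* e.g. the offline-leader case then becomes: */
	if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
		BUG();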


Re: [patch 2/2] sched: dynticks idle load balancing - v2

2007-02-21 Thread Andrew Morton
On Fri, 16 Feb 2007 18:08:42 -0800
"Siddha, Suresh B" <[EMAIL PROTECTED]> wrote:

> Changes since v1:
> 
>   - Move the idle load balancer selection from schedule()
> to the first busy scheduler_tick() after restarting the tick.
> This will avoid unnecessary ownership changes when softirqs like
> timer and sched (which run in softirqd context in certain -rt
> configurations) are invoked for every idle tick
> that happens.
> 
>   - Misc cleanups.
> 
> ---
> Fix the process idle load balancing in the presence of dynticks.
> CPUs for which ticks are stopped will sleep till the next event wakes them
> up. Potentially these sleeps can be for large durations, during which today
> no periodic idle load balancing is done.
> 
> This patch nominates an owner among the idle CPUs, which does the idle load
> balancing on behalf of the other idle CPUs. Once all the CPUs are completely
> idle, this idle load balancing can be stopped too. Checks added in the fast
> path are minimized. Whenever there are busy CPUs in the system, there will
> be an owner (idle CPU) doing the system-wide idle load balancing.
> 
> Open items:
> 1. Intelligent owner selection (like an idle core in a busy package).
> 2. Merge with rcu's nohz_cpu_mask?
> 

I suppose I'll hold my nose and merge this, but it creates too much of a mess
to be mergeable into the CPU scheduler, IMO.

Can we please do something to reduce the ifdef density?  And if possible,
all the newly-added returns-from-the-middle-of-a-function?
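
(For the ifdefs, one option is to hide the nohz details behind helpers
with empty !CONFIG_NO_HZ stubs -- a sketch, the helper names are invented:

	#ifdef CONFIG_NO_HZ
	static int nohz_balance_enter(int cpu);		/* invented name */
	static void nohz_balance_exit(int cpu);		/* invented name */
	#else
	static inline int nohz_balance_enter(int cpu) { return 0; }
	static inline void nohz_balance_exit(int cpu) { }
	#endif

so the scheduler paths can call them unconditionally.)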


> +#ifdef CONFIG_NO_HZ
> +static struct {
> + int load_balancer;
> + cpumask_t  cpu_mask;
> +} nohz ____cacheline_aligned = {
> + .load_balancer = -1,
> + .cpu_mask = CPU_MASK_NONE,
> +};
> +
> +/*
> + * This routine will try to nominate the ilb (idle load balancing)
> + * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> + * load balancing on behalf of all those cpus. If all the cpus in the system
> + * go into this tickless mode, then there will be no ilb owner (as there is
> + * no need for one) and all the cpus will sleep till the next wakeup event
> + * arrives...
> + *
> + * For the ilb owner, tick is not stopped. And this tick will be used
> + * for idle load balancing. ilb owner will still be part of
> + * nohz.cpu_mask..
> + *
> + * While stopping the tick, this cpu will become the ilb owner if there
> + * is no other owner. And will be the owner till that cpu becomes busy
> + * or if all cpus in the system stop their ticks at which point
> + * there is no need for ilb owner.
> + *
> + * When the ilb owner becomes busy, it nominates another owner, during the
> + * next busy scheduler_tick()
> + */
> +int select_nohz_load_balancer(int stop_tick)
> +{
> + int cpu = smp_processor_id();
> +
> + if (stop_tick) {
> + cpu_set(cpu, nohz.cpu_mask);
> + cpu_rq(cpu)->in_nohz_recently = 1;
> +
> + /*
> +  * If we are going offline and still the leader, give up!
> +  */
> + if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
> + if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)

So we require that architectures which implement CONFIG_NO_HZ also
implement cmpxchg.

> + BUG();
> + return 0;
> + }
> +
> + /* time for ilb owner also to sleep */
> + if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
> + if (nohz.load_balancer == cpu)
> + nohz.load_balancer = -1;
> + return 0;
> + }
> +
> + if (nohz.load_balancer == -1) {
> + /* make me the ilb owner */
> + if (cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
> + return 1;
> + } else if (nohz.load_balancer == cpu)
> + return 1;
> + } else {
> + if (!cpu_isset(cpu, nohz.cpu_mask))
> + return 0;
> +
> + cpu_clear(cpu, nohz.cpu_mask);
> +
> + if (nohz.load_balancer == cpu)
> + if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
> + BUG();
> + }
> + return 0;
> +}
> +#endif
> +
>  /*
>   * run_rebalance_domains is triggered when needed from the scheduler tick.
>   *
> @@ -3347,15 +3437,46 @@ static DEFINE_SPINLOCK(balancing);
>  
>  static void run_rebalance_domains(struct softirq_action *h)
>  {
> - int this_cpu = smp_processor_id(), balance = 1;
> - struct rq *this_rq = cpu_rq(this_cpu);
> - unsigned long interval;
> + int balance_cpu = smp_processor_id(), balance;
> + struct rq *balance_rq = cpu_rq(balance_cpu);
> + unsigned long interval, next_balance;

One definition per line is preferred.
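
i.e., something like:

	int balance_cpu = smp_processor_id();
	int balance;
	struct rq *balance_rq = cpu_rq(balance_cpu);
	unsigned long interval;
	unsigned long next_balance;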

>   struct sched_domain *sd;
> - enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
> + enum idle_type idle;
> +
> 


[patch 2/2] sched: dynticks idle load balancing - v2

2007-02-16 Thread Siddha, Suresh B
Changes since v1:

  - Move the idle load balancer selection from schedule()
to the first busy scheduler_tick() after restarting the tick.
This will avoid unnecessary ownership changes when softirqs like
timer and sched (which run in softirqd context in certain -rt
configurations) are invoked for every idle tick
that happens.

  - Misc cleanups.

---
Fix the process idle load balancing in the presence of dynticks.
CPUs for which ticks are stopped will sleep till the next event wakes them
up. Potentially these sleeps can be for large durations, during which today
no periodic idle load balancing is done.

This patch nominates an owner among the idle CPUs, which does the idle load
balancing on behalf of the other idle CPUs. Once all the CPUs are completely
idle, this idle load balancing can be stopped too. Checks added in the fast
path are minimized. Whenever there are busy CPUs in the system, there will
be an owner (idle CPU) doing the system-wide idle load balancing.

Open items:
1. Intelligent owner selection (like an idle core in a busy package).
2. Merge with rcu's nohz_cpu_mask?

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff -pNru linux-2.6.20.x86_64/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.20.x86_64/include/linux/sched.h	2007-02-10 03:41:27.0 -0800
+++ linux/include/linux/sched.h 2007-02-16 17:44:08.0 -0800
@@ -237,6 +237,14 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern cpumask_t nohz_cpu_mask;
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+extern int select_nohz_load_balancer(int cpu);
+#else
+static inline int select_nohz_load_balancer(int cpu)
+{
+   return 0;
+}
+#endif
 
 /*
  * Only dump TASK_* tasks. (-1 for all tasks)
diff -pNru linux-2.6.20.x86_64/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.20.x86_64/kernel/sched.c  2007-02-16 16:40:16.0 -0800
+++ linux/kernel/sched.c	2007-02-16 17:56:13.0 -0800
@@ -241,6 +241,9 @@ struct rq {
 #ifdef CONFIG_SMP
unsigned long cpu_load[3];
unsigned char idle_at_tick;
+#ifdef CONFIG_NO_HZ
+   unsigned char in_nohz_recently;
+#endif
 #endif
unsigned long long nr_switches;
 
@@ -1169,6 +1172,17 @@ static void resched_task(struct task_str
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
 }
+
+static void resched_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   unsigned long flags;
+
+   if (!spin_trylock_irqsave(&rq->lock, flags))
+   return;
+   resched_task(cpu_curr(cpu));
+   spin_unlock_irqrestore(&rq->lock, flags);
+}
 #else
 static inline void resched_task(struct task_struct *p)
 {
@@ -3067,6 +3081,9 @@ redo:
double_rq_unlock(this_rq, busiest);
local_irq_restore(flags);
 
+   if (nr_moved && this_cpu != smp_processor_id())
+   resched_cpu(this_cpu);
+
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(all_pinned)) {
cpu_clear(cpu_of(busiest), cpus);
@@ -3335,6 +3352,79 @@ static void update_load(struct rq *this_
}
 }
 
+#ifdef CONFIG_NO_HZ
+static struct {
+   int load_balancer;
+   cpumask_t  cpu_mask;
+} nohz ____cacheline_aligned = {
+   .load_balancer = -1,
+   .cpu_mask = CPU_MASK_NONE,
+};
+
+/*
+ * This routine will try to nominate the ilb (idle load balancing)
+ * owner among the cpus whose ticks are stopped. ilb owner will do the idle
+ * load balancing on behalf of all those cpus. If all the cpus in the system
+ * go into this tickless mode, then there will be no ilb owner (as there is
+ * no need for one) and all the cpus will sleep till the next wakeup event
+ * arrives...
+ *
+ * For the ilb owner, tick is not stopped. And this tick will be used
+ * for idle load balancing. ilb owner will still be part of
+ * nohz.cpu_mask..
+ *
+ * While stopping the tick, this cpu will become the ilb owner if there
+ * is no other owner. And will be the owner till that cpu becomes busy
+ * or if all cpus in the system stop their ticks at which point
+ * there is no need for ilb owner.
+ *
+ * When the ilb owner becomes busy, it nominates another owner, during the
+ * next busy scheduler_tick()
+ */
+int select_nohz_load_balancer(int stop_tick)
+{
+   int cpu = smp_processor_id();
+
+   if (stop_tick) {
+   cpu_set(cpu, nohz.cpu_mask);
+   cpu_rq(cpu)->in_nohz_recently = 1;
+
+   /*
+* If we are going offline and still the leader, give up!
+*/
+   if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
+   if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+   BUG();
+   return 0;
+   }
+
+   /* time for ilb owner also to sleep */
+   if 
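
For context, the dynticks core is expected to call
select_nohz_load_balancer() when a cpu stops or restarts its tick (that
hunk isn't visible above); a rough sketch of the caller side, as an
assumption rather than the actual patch text:

	/* e.g. in tick_nohz_stop_sched_tick(), sketch only */
	if (select_nohz_load_balancer(1)) {
		/*
		 * This cpu just became the ilb owner: leave its tick
		 * running so it can do idle load balancing on behalf
		 * of the other tickless idle cpus.
		 */
		return;		/* i.e. don't stop the tick */
	}

with select_nohz_load_balancer(0) called on the tick-restart path.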
