Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-26 Thread Preeti U Murthy
On 06/26/2015 06:08 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
>>> if (state == TICK_BROADCAST_ENTER) {
>>> +   /*
>>> +* If the current CPU owns the hrtimer broadcast
>>> +* mechanism, it cannot go deep idle and we do not add
>>> +* the CPU to the broadcast mask. We don't have to go
>>> +* through the EXIT path as the local timer is not
>>> +* shutdown.
>>> +*/
>>> +   ret = broadcast_needs_cpu(bc, cpu);
>>> +
>>> +   /*
>>> +* If the hrtimer broadcast check tells us that the
>>> +* cpu cannot go deep idle, or if the broadcast device
>>> +* is in periodic mode, we just return.
>>> +*/
>>> +   if (ret || tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>>> +   goto out;
>>
>> The check on PERIODIC mode is good, but I don't see the point of moving
>> broadcast_needs_cpu() up above. broadcast_shutdown_local() calls
>> broadcast_needs_cpu() internally.
>>
>> Besides, by jumping to 'out', we will miss programming the broadcast
>> hrtimer in tick_broadcast_set_event() below, if the cpu happens to be
>> the broadcast cpu (which is why it was not allowed to go into deep idle).
> 
> Hmm. Need to think a bit more about this convoluted maze ...

I think you cover all the cases just by having that check on periodic
mode. This covers the nohz=off, highres=off, TICK_ONESHOT=y and
GENERIC_CLOCKEVENTS_BROADCAST=y combination. broadcast_needs_cpu()
should remain where it was, though.

And of course, we still need the additional patch handling
tick_broadcast_device.evtdev == NULL in BROADCAST_ON.
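
Put together, the entry of __tick_broadcast_oneshot_control() would then
look roughly like this (an untested sketch of the combination above, with
broadcast_needs_cpu() left at its original call site further down):

int __tick_broadcast_oneshot_control(enum tick_broadcast_state state)
{
	/* No broadcast device registered: deep idle must be refused. */
	if (!tick_broadcast_device.evtdev)
		return -EBUSY;

	/*
	 * Periodic broadcast does not care about the enter/exit of
	 * power states.
	 */
	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
		return 0;

	/* ... oneshot ENTER/EXIT handling unchanged below ... */
}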

Regards
Preeti U Murthy
> 



Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-26 Thread Preeti U Murthy
On 06/26/2015 05:20 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
>>> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>>>> What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
>>>> TICK_ONESHOT=n (HZ_PERIODIC=y) ? Have you tested this ?
>>>>
>>>> This will hang the kernel at boot if you are using the hrtimer mode of
>>>> broadcast. This is because the local timers of all cpus are shutdown
>>>> when the cpuidle driver registers itself, on finding out that there are
>>>> idle states where local tick devices stop. The broadcast tick device is
>>>> then in charge of waking up the cpus at every period. In hrtimer mode of
>>>> broadcast, there is no such real device and we hang.
>>>
>>> Hmm, no. tick-broadcast-hrtimer.o depends on TICK_ONESHOT=y. So this
>>> is covered by the check for the broadcast device, which is NULL.
>>>
>>> But there is another variant:
>>>
>>> GENERIC_CLOCKEVENTS_BROADCAST=y TICK_ONESHOT=y and 'highres=off
>>> nohz=off' on the kernel command line. 
>>
>> Can this happen at all? It is during tick_init_highres() or
>> tick_nohz_switch_to_nohz() that we switch to oneshot mode, not otherwise
>> AFAICT.
> 
> And how does that matter? If 'highres=off nohz=off' is on the kernel
> command line none of the switchovers happens. So system stays in
> periodic mode and the broadcast hrtimer thing is registered, right?

Yes, we are good here. I overlooked the fact that high resolution/nohz
can be disabled on the kernel command line at boot.

> 
>> I was actually talking of the following scenario. In periodic mode,
>> where GENERIC_CLOCKEVENTS_BROADCAST=y, the arch can execute
>> tick_setup_hrtimer_broadcast(), which will return nothing as you point
>> out above. So there is no broadcast clockevent device.
>>
>> When the cpuidle driver registers with the cpuidle core however,
>> cpuidle_setup_broadcast_timer() on every cpu is executed if it finds
>> that there is an idle state where ticks stop.
>>
>> cpuidle_setup_broadcast_timer()
>>   tick_broadcast_enable()
>>     tick_broadcast_control(BROADCAST_ON)
>>       bc = tick_broadcast_device.evtdev, which is NULL in this case
>>
>>       TICK_BROADCAST_ON:
>>         checks for periodic mode of the broadcast device - succeeds
>>         even though we haven't registered a broadcast device, because
>>         TICKDEV_MODE_PERIODIC is 0, the default value of
>>         tick_broadcast_device.mode
>>
>>         clockevents_shutdown(dev)
>>
>> At this point all cpus stop.
> 
> Right. That's a different one, if tick_broadcast_device.evtdev == NULL.
> 
> We should not stop any cpu local timer in that case. Combined with the
> patch I sent, we prevent the idle stuff from going into a state where
> the cpu local timers stop.

Right. We need a check above too.
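
Something along these lines in tick_broadcast_control(), untested:

	case TICK_BROADCAST_ON:
		cpumask_set_cpu(cpu, tick_broadcast_on);
		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_mask)) {
			/*
			 * Only shut down the cpu local device if a real
			 * broadcast device exists to take over; with
			 * evtdev == NULL there is nothing left to wake us.
			 */
			if (bc && tick_broadcast_device.mode ==
				  TICKDEV_MODE_PERIODIC)
				clockevents_shutdown(dev);
		}
		break;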

Regards
Preeti U Murthy
> 
> Thanks,
> 
>   tglx
> 



Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-26 Thread Preeti U Murthy
On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index d39f32cdd1b5..281ce29d295e 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -662,46 +662,39 @@ static void broadcast_shutdown_local(struct clock_event_device *bc,
>   clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
>  }
> 
> -/**
> - * tick_broadcast_oneshot_control - Enter/exit broadcast oneshot mode
> - * @state:   The target state (enter/exit)
> - *
> - * The system enters/leaves a state, where affected devices might stop
> - * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
> - *
> - * Called with interrupts disabled, so clockevents_lock is not
> - * required here because the local clock event device cannot go away
> - * under us.
> - */
> -int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
> +int __tick_broadcast_oneshot_control(enum tick_broadcast_state state)
>  {
>   struct clock_event_device *bc, *dev;
> - struct tick_device *td;
>   int cpu, ret = 0;
>   ktime_t now;
> 
> - /*
> -  * Periodic mode does not care about the enter/exit of power
> -  * states
> -  */
> - if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> - return 0;
> + if (!tick_broadcast_device.evtdev)
> + return -EBUSY;
> 
> - /*
> -  * We are called with preemtion disabled from the depth of the
> -  * idle code, so we can't be moved away.
> -  */
> - td = this_cpu_ptr(&tick_cpu_device);
> - dev = td->evtdev;
> -
> - if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
> - return 0;
> + dev = this_cpu_ptr(&tick_cpu_device)->evtdev;
> 
>   raw_spin_lock(&tick_broadcast_lock);
>   bc = tick_broadcast_device.evtdev;
>   cpu = smp_processor_id();
> 
>   if (state == TICK_BROADCAST_ENTER) {
> + /*
> +  * If the current CPU owns the hrtimer broadcast
> +  * mechanism, it cannot go deep idle and we do not add
> +  * the CPU to the broadcast mask. We don't have to go
> +  * through the EXIT path as the local timer is not
> +  * shutdown.
> +  */
> + ret = broadcast_needs_cpu(bc, cpu);
> +
> + /*
> +  * If the hrtimer broadcast check tells us that the
> +  * cpu cannot go deep idle, or if the broadcast device
> +  * is in periodic mode, we just return.
> +  */
> + if (ret || tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> + goto out;

The check on PERIODIC mode is good, but I don't see the point of moving
broadcast_needs_cpu() up above. broadcast_shutdown_local() calls
broadcast_needs_cpu() internally.

Besides, by jumping to 'out', we will miss programming the broadcast
hrtimer in tick_broadcast_set_event() below, if the cpu happens to be
the broadcast cpu (which is why it was not allowed to go into deep idle).
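
For reference, this is the programming step further down the ENTER path
that the early 'goto out' skips for the broadcast cpu (abridged from the
unchanged code below):

	if (dev->next_event.tv64 < bc->next_event.tv64)
		tick_broadcast_set_event(bc, cpu, dev->next_event);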

> +
> + if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
> WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
> broadcast_shutdown_local(bc, dev);
> @@ -717,16 +710,6 @@ int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
>   dev->next_event.tv64 < bc->next_event.tv64)
>   tick_broadcast_set_event(bc, cpu, dev->next_event);
>   }
>   }
> - /*
> -  * If the current CPU owns the hrtimer broadcast
> -  * mechanism, it cannot go deep idle and we remove the
> -  * CPU from the broadcast mask. We don't have to go
> -  * through the EXIT path as the local timer is not
> -  * shutdown.
> -  */
> - ret = broadcast_needs_cpu(bc, cpu);
> - if (ret)
> - cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
>   } else {
> if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
>   clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
> @@ -938,6 +921,13 @@ bool tick_broadcast_oneshot_available(void)
>   return bc ? bc->features & CLOCK_EVT_FEAT_ONESHOT : false;
>  }
> 

Regards
Preeti U Murthy



Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-26 Thread Preeti U Murthy
On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
>> TICK_ONESHOT=n (HZ_PERIODIC=y) ? Have you tested this ?
>>
>> This will hang the kernel at boot if you are using the hrtimer mode of
>> broadcast. This is because the local timers of all cpus are shutdown
>> when the cpuidle driver registers itself, on finding out that there are
>> idle states where local tick devices stop. The broadcast tick device is
>> then in charge of waking up the cpus at every period. In hrtimer mode of
>> broadcast, there is no such real device and we hang.
> 
> Hmm, no. tick-broadcast-hrtimer.o depends on TICK_ONESHOT=y. So this
> is covered by the check for the broadcast device, which is NULL.
> 
> But there is another variant:
> 
> GENERIC_CLOCKEVENTS_BROADCAST=y TICK_ONESHOT=y and 'highres=off
> nohz=off' on the kernel command line. 

Can this happen at all? It is during tick_init_highres() or
tick_nohz_switch_to_nohz() that we switch to oneshot mode, not otherwise
AFAICT.

I was actually talking of the following scenario. In periodic mode,
where GENERIC_CLOCKEVENTS_BROADCAST=y, the arch can execute
tick_setup_hrtimer_broadcast(), which will do nothing, as you point
out above. So there is no broadcast clockevent device.

When the cpuidle driver registers with the cpuidle core however,
cpuidle_setup_broadcast_timer() on every cpu is executed if it finds
that there is an idle state where ticks stop.

cpuidle_setup_broadcast_timer()
  tick_broadcast_enable()
    tick_broadcast_control(BROADCAST_ON)
      bc = tick_broadcast_device.evtdev, which is NULL in this case

      TICK_BROADCAST_ON:
        checks for periodic mode of the broadcast device - succeeds
        even though we haven't registered a broadcast device, because
        TICKDEV_MODE_PERIODIC is 0, the default value of
        tick_broadcast_device.mode

        clockevents_shutdown(dev)

At this point all cpus stop.

Regards
Preeti U Murthy



Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-25 Thread Preeti U Murthy
On 06/25/2015 09:00 PM, Sudeep Holla wrote:
> 
> 
> On 25/06/15 14:55, Thomas Gleixner wrote:
>> On Thu, 25 Jun 2015, Sudeep Holla wrote:
>>
>>> tick_broadcast_enter returns 0 when CPU can switch to broadcast
>>> timer and non-zero otherwise. However when GENERIC_CLOCKEVENTS_BROADCAST
>>> and TICK_ONESHOT are disabled, tick_broadcast_oneshot_control returns 0
>>> which indicates to the CPUIdle framework that the CPU can enter deeper
>>> idle states even when the CPU local timer will be shutdown. If the
>>> target state needs broadcast but no broadcast timer is available, then
>>> the CPU can not resume back from that idle state.
>>>
>>> This patch returns an error when there's no broadcast timer support
>>> available, so that the CPUIdle framework prevents the CPU from entering
>>> any idle state that loses the local timer.
>>
>> That's wrong and breaks stuff which does not require the broadcast
>> nonsense.
>>
> 
> OK, sorry for not considering that case.
> 
>> If TICK_ONESHOT is disabled, then everything is in periodic mode and
>> tick_broadcast_enter() rightfully returns 0. Ditto for 'highres=off'
>> on the command line.
>>
>> But there is a case which is not correctly handled right now. That's
>> what you are trying to solve in the wrong way.
>>
> 
> Correct I was trying to solve exactly the case mentioned below.
> 
>> If
>> GENERIC_CLOCKEVENTS_BROADCAST=n
>>
>> or
>>
>> GENERIC_CLOCKEVENTS_BROADCAST=y and no broadcast device is available,
>>
>> AND cpu local tick device has the C3STOP flag set,
>>
>> then we have no way to tell the idle code that going deep is not
>> allowed.
>>
>> So we need to be smarter than blindly changing a return
>> value. Completely untested patch below.
>>
> 
> Agreed, thanks for the quick patch, I have tested it and it works fine.
> You can add
> 
> Tested-by: Sudeep Holla 

What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
TICK_ONESHOT=n (HZ_PERIODIC=y) ? Have you tested this ?

This will hang the kernel at boot if you are using the hrtimer mode of
broadcast. This is because the local timers of all cpus are shut down
when the cpuidle driver registers itself, on finding out that there are
idle states where local tick devices stop. The broadcast tick device is
then in charge of waking up the cpus at every period. In hrtimer mode of
broadcast, there is no such real device and we hang.

There was a patch sent out recently to fix this on powerpc.
https://lkml.org/lkml/2015/6/24/42
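
For context, the return value under discussion is consumed by the
cpuidle core roughly as follows (a simplified sketch of
drivers/cpuidle/cpuidle.c from this era, not the exact code):

int cpuidle_enter_state(struct cpuidle_device *dev,
			struct cpuidle_driver *drv, int index)
{
	bool broadcast = !!(drv->states[index].flags &
			    CPUIDLE_FLAG_TIMER_STOP);

	/*
	 * If this state stops the local timer, hand over to the
	 * broadcast framework first. A non-zero return here is what
	 * keeps the CPU out of the state.
	 */
	if (broadcast && tick_broadcast_enter())
		return -EBUSY;

	/* ... enter the state, tick_broadcast_exit() on the way out ... */
}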

Regards
Preeti U Murthy
> 
> Regards,
> Sudeep
> 



Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST

2015-06-25 Thread Preeti U Murthy
>   #ifdef CONFIG_HOTPLUG_CPU
>  /*
>   * Transfer the do_timer job away from a dying cpu.
> diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
> index 42fdf4958bcc..a4a8d4e9baa1 100644
> --- a/kernel/time/tick-sched.h
> +++ b/kernel/time/tick-sched.h
> @@ -71,4 +71,14 @@ extern void tick_cancel_sched_timer(int cpu);
>  static inline void tick_cancel_sched_timer(int cpu) { }
>  #endif
> 
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
> +extern int __tick_broadcast_oneshot_control(enum tick_broadcast_state state);
> +#else
> +static inline int
> +__tick_broadcast_oneshot_control(enum tick_broadcast_state state)
> +{
> + return -EBUSY;
> +}
> +#endif
> +
>  #endif

Reviewed-by: Preeti U Murthy 
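
For completeness, the non-underscored wrapper that pairs with this stub
lives in tick-common.c and looks roughly like the below (sketch only,
not part of the quoted hunk):

int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
{
	struct tick_device *td = this_cpu_ptr(&tick_cpu_device);

	/* Local timers that keep ticking in deep idle need no broadcast. */
	if (!(td->evtdev->features & CLOCK_EVT_FEAT_C3STOP))
		return 0;

	return __tick_broadcast_oneshot_control(state);
}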
> 



Re: [PATCH] tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode

2015-06-24 Thread Preeti U Murthy
On 06/25/2015 05:36 AM, Rafael J. Wysocki wrote:
> On Thu, Jun 25, 2015 at 12:06 AM, Benjamin Herrenschmidt
>  wrote:
>> On Wed, 2015-06-24 at 15:50 +0200, Rafael J. Wysocki wrote:
>>> 4.2 material I suppose?
>>
>> And stable. Without this, if you configure without TICK_ONESHOT, the
>> machine will hang.
> 
> OK, which -stable?  All of them or any specific series?

This needs to go into stable/linux-3.19.y,
stable/linux-4.0.y, stable/linux-4.1.y.

Thanks

Regards
Preeti U Murthy
> 
> Rafael
> 



Re: [PATCH RESEND] nohz: Affining unpinned timers

2015-06-24 Thread Preeti U Murthy
On 06/23/2015 02:01 PM, Vatika Harlalka wrote:
> The problem addressed in this patch is about affining unpinned timers.
> Adaptive or Full Dynticks CPUs should not be disturbed by unnecessary
> jitter due to firing of such timers on them.
> This patch will affine timers to online CPUs which are not Full Dynticks
> in FULL_NOHZ configured systems. It will not bring about functional
> changes if NOHZ_FULL is not configured, because is_housekeeping_cpu()
> always returns true in CONFIG_NO_HZ_FULL=n.
> 
> Signed-off by: Vatika Harlalka 
> ---

Reviewed-by: Preeti U Murthy 

>  Changes: Patch sent to the right recipients.
> 
>  include/linux/tick.h | 9 -
>  kernel/sched/core.c  | 9 +++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index f8492da5..145bcba 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -133,13 +133,20 @@ static inline bool tick_nohz_full_cpu(int cpu)
> 
>   return cpumask_test_cpu(cpu, tick_nohz_full_mask);
>  }
> -
> +static inline int housekeeping_any_cpu(void)
> +{
> + return cpumask_any_and(housekeeping_mask, cpu_online_mask);
> +}
>  extern void __tick_nohz_full_check(void);
>  extern void tick_nohz_full_kick(void);
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void tick_nohz_full_kick_all(void);
>  extern void __tick_nohz_task_switch(struct task_struct *tsk);
>  #else
> +static inline int housekeeping_any_cpu(void)
> +{
> + return smp_processor_id();
> +}
>  static inline bool tick_nohz_full_enabled(void) { return false; }
>  static inline bool tick_nohz_full_cpu(int cpu) { return false; }
>  static inline void __tick_nohz_full_check(void) { }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 57bd333..acee856 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -599,18 +599,23 @@ int get_nohz_timer_target(int pinned)
>   int i;
>   struct sched_domain *sd;
> 
> - if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
> + if (pinned || !get_sysctl_timer_migration() ||
> + (!idle_cpu(cpu) && is_housekeeping_cpu(cpu)))
>   return cpu;
> 
>   rcu_read_lock();
>   for_each_domain(cpu, sd) {
>   for_each_cpu(i, sched_domain_span(sd)) {
> - if (!idle_cpu(i)) {
> + if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) {
>   cpu = i;
>   goto unlock;
>   }
>   }
>   }
> +
> + if (!is_housekeeping_cpu(cpu))
> + cpu = housekeeping_any_cpu();
> +
>  unlock:
>   rcu_read_unlock();
>   return cpu;
> 



[PATCH] tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode

2015-06-24 Thread Preeti U Murthy
On some archs, the local clockevent device stops in deep cpuidle states.
The broadcast framework is used to wakeup cpus in these idle states, in
which either an external clockevent device is used to send wakeup ipis
or the hrtimer broadcast framework kicks in in the absence of such a
device. One cpu is nominated as the broadcast cpu and this cpu sends
wakeup ipis to sleeping cpus at the appropriate time. This is the
implementation in the oneshot mode of broadcast.

In periodic mode of broadcast however, the presence of such cpuidle
states results in the cpuidle driver calling tick_broadcast_enable()
which shuts down the local clockevent devices of all the cpus and
appoints the tick broadcast device as the clockevent device for each of
them. This works on those archs where the tick broadcast device is a
real clockevent device.  But on archs which depend on the hrtimer mode
of broadcast, the tick broadcast device happens to be a pseudo device.
The consequence is that the local clockevent devices of all cpus are
shut down and the kernel hangs at boot time in periodic mode.

Let us thus not register the cpuidle states which have
CPUIDLE_FLAG_TIMER_STOP flag set, on archs which depend on the hrtimer
mode of broadcast in periodic mode. This patch takes care of doing this
on powerpc. The cpus would not have entered into such deep cpuidle
states in periodic mode on powerpc anyway. So there is no loss here.

Signed-off-by: Preeti U Murthy 
---
 drivers/cpuidle/cpuidle-powernv.c |   15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 5937207..3442764 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -60,6 +60,8 @@ static int nap_loop(struct cpuidle_device *dev,
return index;
 }
 
+/* Register for fastsleep only in oneshot mode of broadcast */
+#ifdef CONFIG_TICK_ONESHOT
 static int fastsleep_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
@@ -83,7 +85,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 
return index;
 }
-
+#endif
 /*
  * States for dedicated partition case.
  */
@@ -209,7 +211,14 @@ static int powernv_add_idle_states(void)
powernv_states[nr_idle_states].flags = 0;
powernv_states[nr_idle_states].target_residency = 100;
powernv_states[nr_idle_states].enter = &nap_loop;
-   } else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
+   }
+
+   /*
+* All cpuidle states with CPUIDLE_FLAG_TIMER_STOP set must come
+* within this config dependency check.
+*/
+#ifdef CONFIG_TICK_ONESHOT
+   if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
/* Add FASTSLEEP state */
strcpy(powernv_states[nr_idle_states].name, "FastSleep");
@@ -218,7 +227,7 @@ static int powernv_add_idle_states(void)
powernv_states[nr_idle_states].target_residency = 30;
powernv_states[nr_idle_states].enter = &fastsleep_loop;
}
-
+#endif
powernv_states[nr_idle_states].exit_latency =
((unsigned int)latency_ns[i]) / 1000;
 



Re: [PATCH v2] cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state

2015-06-18 Thread Preeti U Murthy
On 06/18/2015 04:53 PM, Shilpasri G Bhat wrote:
> The idle cpus which stay in snooze for a long period can degrade the
> performance of the sibling cpus. If the cpu stays in snooze for more
> than the target residency of the next available idle state, then exit
> from snooze. This gives a chance to the cpuidle governor to re-evaluate
> the last idle state of the cpu and promote it to deeper idle states.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---

Reviewed-by: Preeti U Murthy 
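
One worked example of the timeout arithmetic below, with illustrative
numbers: the POWER8 timebase runs at 512 MHz, so tb_ticks_per_usec is
512; if the next idle state has a target residency of 100 us, then

	snooze_timeout = 100 * 512 = 51200 timebase ticks,

i.e. snooze gives up at exactly the residency threshold at which the
governor could have picked the deeper state in the first place.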

> Changes from v1:
> -Modified commit message
> 
>  drivers/cpuidle/cpuidle-powernv.c | 12 
>  drivers/cpuidle/cpuidle-pseries.c | 11 +++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 5937207..1e3ef5e 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -29,18 +29,25 @@ struct cpuidle_driver powernv_idle_driver = {
>  
>  static int max_idle_state;
>  static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>  
>  static int snooze_loop(struct cpuidle_device *dev,
>   struct cpuidle_driver *drv,
>   int index)
>  {
> + u64 snooze_exit_time;
> +
>   local_irq_enable();
>   set_thread_flag(TIF_POLLING_NRFLAG);
>  
> + snooze_exit_time = get_tb() + snooze_timeout;
>   ppc64_runlatch_off();
>   while (!need_resched()) {
>   HMT_low();
>   HMT_very_low();
> + if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + break;
>   }
>  
>   HMT_medium();
> @@ -252,6 +259,11 @@ static int powernv_idle_probe(void)
>   cpuidle_state_table = powernv_states;
>   /* Device tree can indicate more idle states */
>   max_idle_state = powernv_add_idle_states();
> + if (max_idle_state > 1) {
> + snooze_timeout_en = true;
> + snooze_timeout = powernv_states[1].target_residency *
> +  tb_ticks_per_usec;
> + }
>   } else
>   return -ENODEV;
>  
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index bb9e2b6..07135e0 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -27,6 +27,8 @@ struct cpuidle_driver pseries_idle_driver = {
>  
>  static int max_idle_state;
>  static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>  
>  static inline void idle_loop_prolog(unsigned long *in_purr)
>  {
> @@ -58,14 +60,18 @@ static int snooze_loop(struct cpuidle_device *dev,
>   int index)
>  {
>   unsigned long in_purr;
> + u64 snooze_exit_time;
>  
>   idle_loop_prolog(&in_purr);
>   local_irq_enable();
>   set_thread_flag(TIF_POLLING_NRFLAG);
> + snooze_exit_time = get_tb() + snooze_timeout;
>  
>   while (!need_resched()) {
>   HMT_low();
>   HMT_very_low();
> + if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + break;
>   }
>  
>   HMT_medium();
> @@ -244,6 +250,11 @@ static int pseries_idle_probe(void)
>   } else
>   return -ENODEV;
>  
> + if (max_idle_state > 1) {
> + snooze_timeout_en = true;
> + snooze_timeout = cpuidle_state_table[1].target_residency *
> +  tb_ticks_per_usec;
> + }
>   return 0;
>  }
>  
> 



Re: [PATCH 0/5] Fix race in hrtimer broadcast and take care of dependencies

2015-06-18 Thread Preeti U Murthy
A reminder to pick this series up for stable-4.0

On 05/13/2015 05:19 PM, Preeti U Murthy wrote:
> The intention was to backport only PATCH[5/5]: clockevents: Fix
> cpu_down() race for hrtimer based broadcasting,
> 
> but this depends on commits upstream which did cleanup and
> reorganization of code. However these commits cannot be cherry-picked as
> is. There are a few minor changes upstream and hence these commits had
> to be backported too. Reasons for backports are given below:
> 
> 1. PATCH[1/5]: The upstream commit introduced a duplicate function which
> was removed much later by commit:
> 
> 9eed56e889d: clockevents: Clean up clockchips.h
> 
> We do not need to cherry-pick the entire above commit because the backport
> was a simple one.
> 
> 2. PATCH[3/5]: There are newer functions introduced upstream through
> 
> 554ef3876c6: clockevents: Handle tick device's resume separately
> 
> Yet again we did not need to cherry pick the entire above commit,
> firstly because it introduces significant new code, secondly because the
> backport was a simple one.
> 
> 3. PATCH[4/5]: The function clockevent_set_mode() was renamed to
> clockevent_set_state() upstream by commit:
> 
> 77e32c89a7: clockevents: Manage device's state separately for
> the core
> 
> We do not need to cherry-pick this commit for the same reason as in (2).
> 
> This series needs to be applied on top of stable-4.0. 
> ---
> 
> Preeti U Murthy (1):
>   clockevents: Fix cpu_down() race for hrtimer based broadcasting
> 
> Thomas Gleixner (4):
>   clockevents: Remove CONFIG_GENERIC_CLOCKEVENTS_BUILD
>   tick: Move clocksource related stuff to timekeeping.h
>   tick: Simplify tick-internal.h
>   tick: Move core only declarations and functions to core
> 
> 
>  include/linux/clockchips.h   |   23 ++
>  include/linux/tick.h |  138 ++-
>  kernel/cpu.c |2 +
>  kernel/time/Kconfig  |6 --
>  kernel/time/Makefile |6 +-
>  kernel/time/clockevents.c|3 -
>  kernel/time/hrtimer.c|2 -
>  kernel/time/jiffies.c|2 -
>  kernel/time/ntp.c|1 
>  kernel/time/tick-broadcast.c |   19 +++--
>  kernel/time/tick-internal.h  |  166 --
>  kernel/time/tick-sched.c |7 ++
>  kernel/time/tick-sched.h |   64 
>  kernel/time/timekeeping.h|7 ++
>  kernel/time/timer_list.c |2 -
>  15 files changed, 190 insertions(+), 258 deletions(-)
>  create mode 100644 kernel/time/tick-sched.h
> 
> --
> 



Re: [PATCH v2] cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state

2015-06-18 Thread Preeti U Murthy
On 06/18/2015 04:53 PM, Shilpasri G Bhat wrote:
 The idle cpus which stay in snooze for a long period can degrade the
 perfomance of the sibling cpus. If the cpu stays in snooze for more
 than target residency of the next available idle state, then exit from
 snooze. This gives a chance to the cpuidle governor to re-evaluate the
 last idle state of the cpu to promote it to deeper idle states.
 
 Signed-off-by: Shilpasri G Bhat shilpa.b...@linux.vnet.ibm.com
 ---

Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com

 Changes from v1:
 -Modified commit message
 
  drivers/cpuidle/cpuidle-powernv.c | 12 
  drivers/cpuidle/cpuidle-pseries.c | 11 +++
  2 files changed, 23 insertions(+)
 
 diff --git a/drivers/cpuidle/cpuidle-powernv.c 
 b/drivers/cpuidle/cpuidle-powernv.c
 index 5937207..1e3ef5e 100644
 --- a/drivers/cpuidle/cpuidle-powernv.c
 +++ b/drivers/cpuidle/cpuidle-powernv.c
 @@ -29,18 +29,25 @@ struct cpuidle_driver powernv_idle_driver = {
  
  static int max_idle_state;
  static struct cpuidle_state *cpuidle_state_table;
 +static u64 snooze_timeout;
 +static bool snooze_timeout_en;
  
  static int snooze_loop(struct cpuidle_device *dev,
   struct cpuidle_driver *drv,
   int index)
  {
 + u64 snooze_exit_time;
 +
   local_irq_enable();
   set_thread_flag(TIF_POLLING_NRFLAG);
  
 + snooze_exit_time = get_tb() + snooze_timeout;
   ppc64_runlatch_off();
   while (!need_resched()) {
   HMT_low();
   HMT_very_low();
 + if (snooze_timeout_en  get_tb()  snooze_exit_time)
 + break;
   }
  
   HMT_medium();
 @@ -252,6 +259,11 @@ static int powernv_idle_probe(void)
   cpuidle_state_table = powernv_states;
   /* Device tree can indicate more idle states */
   max_idle_state = powernv_add_idle_states();
 + if (max_idle_state  1) {
 + snooze_timeout_en = true;
 + snooze_timeout = powernv_states[1].target_residency *
 +  tb_ticks_per_usec;
 + }
   } else
   return -ENODEV;
  
 diff --git a/drivers/cpuidle/cpuidle-pseries.c 
 b/drivers/cpuidle/cpuidle-pseries.c
 index bb9e2b6..07135e0 100644
 --- a/drivers/cpuidle/cpuidle-pseries.c
 +++ b/drivers/cpuidle/cpuidle-pseries.c
 @@ -27,6 +27,8 @@ struct cpuidle_driver pseries_idle_driver = {
  
  static int max_idle_state;
  static struct cpuidle_state *cpuidle_state_table;
 +static u64 snooze_timeout;
 +static bool snooze_timeout_en;
  
  static inline void idle_loop_prolog(unsigned long *in_purr)
  {
 @@ -58,14 +60,18 @@ static int snooze_loop(struct cpuidle_device *dev,
   int index)
  {
   unsigned long in_purr;
 + u64 snooze_exit_time;
  
   idle_loop_prolog(in_purr);
   local_irq_enable();
   set_thread_flag(TIF_POLLING_NRFLAG);
 + snooze_exit_time = get_tb() + snooze_timeout;
  
   while (!need_resched()) {
   HMT_low();
   HMT_very_low();
 + if (snooze_timeout_en  get_tb()  snooze_exit_time)
 + break;
   }
  
   HMT_medium();
 @@ -244,6 +250,11 @@ static int pseries_idle_probe(void)
   } else
   return -ENODEV;
  
 + if (max_idle_state  1) {
 + snooze_timeout_en = true;
 + snooze_timeout = cpuidle_state_table[1].target_residency *
 +  tb_ticks_per_usec;
 + }
   return 0;
  }
  
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] Fix race in hrtimer broadcast and take care of dependencies

2015-06-18 Thread Preeti U Murthy
A reminder to pick this series up for stable-4.0

On 05/13/2015 05:19 PM, Preeti U Murthy wrote:
> The intention was to backport only PATCH[5/5]: clockevents: Fix
> cpu_down() race for hrtimer based broadcasting,
> 
> but this depends on commits upstream which did cleanup and
> reorganization of code. However these commits cannot be cherry-picked as
> is. There are a few minor changes upstream and hence these commits had
> to be backported too. Reasons for backports are given below:
> 
> 1. PATCH[1/5]: The upstream commit introduced a duplicate function which
> was removed much later by commit:
> 
> 9eed56e889d: clockevents: Clean up clockchips.h
> 
> We do not need to cherry-pick the entire above commit because the backport
> was a simple one.
> 
> 2. PATCH[3/5]: There are newer functions introduced upstream through
> 
> 554ef3876c6: clockevents: Handle tick device's resume separately
> 
> Yet again we did not need to cherry pick the entire above commit,
> firstly because it introduces significant new code, secondly because the
> backport was a simple one.
> 
> 3. PATCH[4/5]: The function clockevent_set_mode() was renamed to
> clockevent_set_state() upstream by commit:
> 
> 77e32c89a7: clockevents: Manage device's state separately for
> the core
> 
> We do not need to cherry-pick this commit for the same reason as in (2).
> 
> This series needs to be applied on top of stable-4.0.
> ---
> 
> Preeti U Murthy (1):
>       clockevents: Fix cpu_down() race for hrtimer based broadcasting
> 
> Thomas Gleixner (4):
>       clockevents: Remove CONFIG_GENERIC_CLOCKEVENTS_BUILD
>       tick: Move clocksource related stuff to timekeeping.h
>       tick: Simplify tick-internal.h
>       tick: Move core only declarations and functions to core
> 
> 
>  include/linux/clockchips.h   |   23 ++
>  include/linux/tick.h         |  138 ++-
>  kernel/cpu.c                 |    2 +
>  kernel/time/Kconfig          |    6 --
>  kernel/time/Makefile         |    6 +-
>  kernel/time/clockevents.c    |    3 -
>  kernel/time/hrtimer.c        |    2 -
>  kernel/time/jiffies.c        |    2 -
>  kernel/time/ntp.c            |    1 
>  kernel/time/tick-broadcast.c |   19 +++--
>  kernel/time/tick-internal.h  |  166 --
>  kernel/time/tick-sched.c     |    7 ++
>  kernel/time/tick-sched.h     |   64 
>  kernel/time/timekeeping.h    |    7 ++
>  kernel/time/timer_list.c     |    2 -
>  15 files changed, 190 insertions(+), 258 deletions(-)
>  create mode 100644 kernel/time/tick-sched.h
> 
> --
> 



Re: [PATCH 7/8] nohz: Evaluate tick dependency once on context switch

2015-06-16 Thread Preeti U Murthy
On 06/12/2015 02:16 AM, Rik van Riel wrote:
> On 06/11/2015 01:36 PM, Frederic Weisbecker wrote:
>> The tick dependency is evaluated on every irq. This is a batch of checks
>> which determine whether it is safe to stop the tick or not. These checks
>> are often split in many details: posix cpu timers, scheduler, sched clock,
>> perf events. Each of which are made of smaller details: posix cpu
>> timer involves checking process wide timers then thread wide timers. Perf
>> involves checking freq events then more per cpu details.
>>
>> Checking these details every time we update the full dynticks state
>> bring avoidable overhead.
>>
>> So lets evaluate these dependencies once on context switch. Then the
>> further dependency checks will be performed through a single state check.
>>
>> This is a first step that can be later optimized by dividing task level
>> dependency, CPU level dependency and global dependency and update
>> each at the right time.
> 
>> +static void tick_nohz_full_update_dependencies(void)
>> +{
>> +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>> +
>> +if (!posix_cpu_timers_can_stop_tick(current))
>> +ts->tick_needed |= TICK_NEEDED_POSIX_CPU_TIMER;
>> +
>> +if (!perf_event_can_stop_tick())
>> +ts->tick_needed |= TICK_NEEDED_PERF_EVENT;
>> +
>> +if (!sched_can_stop_tick())
>> +ts->tick_needed |= TICK_NEEDED_SCHED;
>>  
> 
> I see this getting kicked from task work and from ipi
> context, but does it get kicked on task wakeup, when
> we have a second runnable task on a CPU, but we decide
> not to preempt the currently running task to switch to
> it yet, but we will want to preempt the currently running
> task at a later point in time?

+1. This is not taken care of as far as I can see too.
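
For reference, the consolidation the patch aims at reduces every later
check to a single state test; a compilable sketch, with the TICK_NEEDED_*
flags assumed to be plain bit masks as the quoted diff suggests:

#include <stdbool.h>
#include <stdio.h>

/* Flags assumed to be bit masks, as the quoted diff suggests. */
enum {
	TICK_NEEDED_POSIX_CPU_TIMER = 1 << 0,
	TICK_NEEDED_PERF_EVENT      = 1 << 1,
	TICK_NEEDED_SCHED           = 1 << 2,
};

/* Stand-in for the relevant field of struct tick_sched. */
struct tick_sched_sketch {
	unsigned int tick_needed;
};

/* Once the dependencies are folded into tick_needed at context switch,
 * the per-irq "can we stop the tick?" question is one word test. */
static bool can_stop_tick(const struct tick_sched_sketch *ts)
{
	return ts->tick_needed == 0;
}

int main(void)
{
	struct tick_sched_sketch ts = { .tick_needed = TICK_NEEDED_SCHED };

	printf("can stop tick: %d\n", can_stop_tick(&ts));
	return 0;
}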

Regards
Preeti U Murthy
> 



Re: [PATCH v2 7/7]powerpc/powernv: nest pmu cpumask and cpu hotplug support

2015-06-16 Thread Preeti U Murthy
On 06/11/2015 10:47 AM, Madhavan Srinivasan wrote:
> Adds cpumask attribute to be used by each nest pmu since nest
> units are per-chip. Only one cpu (first online cpu) from each node/chip
> is designated to read counters.
> 
> On cpu hotplug, dying cpu is checked to see whether it is one of the
> designated cpus, if yes, next online cpu from the same node/chip is designated
> as new cpu to read counters.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Anton Blanchard 
> Cc: Sukadev Bhattiprolu 
> Cc: Anshuman Khandual 
> Cc: Stephane Eranian 
> Cc: Preeti U Murthy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Signed-off-by: Madhavan Srinivasan 
> ---
> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
> +{
> + int i;
> +
> + if (new_cpu >= 0) {
> + for (i = 0; per_nest_pmu_arr[i] != NULL; i++)
> + perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
> + old_cpu, new_cpu);
> + }
> +}
> +
> +static void nest_exit_cpu(int cpu)
> +{
> + int i, nid, target = -1;
> + const struct cpumask *l_cpumask;
> + int src_chipid;
> +
> + /*
> +  * Check in the designated list for this cpu. Dont bother
> +  * if not one of them.
> +  */
> + if (!cpumask_test_and_clear_cpu(cpu, &cpu_mask_nest_pmu))
> + return;
> +
> + /*
> +  * Now that this cpu is one of the designated,
> +  * find a new cpu a) which is not dying and

This comment is not right. nest_exit_cpu() is called in the hotplug
path, so another cpu cannot be dying in parallel. Hotplug operations are
done serially. The comment ought to be "a) which is online" instead.

> +  * b) is in same node/chip.

node is not the same as a chip right ? And you are interested in cpus on
the same chip alone. So shouldn't the above comment be b) in the same chip ?

> +  */
> + nid = cpu_to_node(cpu);
> + src_chipid = topology_physical_package_id(cpu);

What is the relation between a node and a chip ? Can't we have a
function which returns the cpumask of a chip straight away, since that
is what you seem to be more interested in ? You can then simply choose
the next cpu in this cpumask rather than going through the below loop.
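
A toy model of that suggestion, with a hypothetical cpu-to-chip table
standing in for the real topology (next_cpu_on_chip() is an invented
name, not an existing kernel helper):

#include <stdio.h>

#define NR_CPUS 8

/* Hypothetical topology: two 4-cpu chips. */
static const int chip_of[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
static int online[NR_CPUS]        = { 1, 1, 1, 1, 1, 1, 1, 1 };

/* Next online cpu on the same chip as @cpu, or -1. A real helper would
 * walk a per-chip cpumask instead of filtering the node mask. */
static int next_cpu_on_chip(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (i != cpu && online[i] && chip_of[i] == chip_of[cpu])
			return i;
	return -1;
}

int main(void)
{
	online[0] = 0;	/* cpu0 goes offline */
	printf("new designated cpu: %d\n", next_cpu_on_chip(0));
	return 0;
}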


> + l_cpumask = cpumask_of_node(nid);
> + for_each_cpu(i, l_cpumask) {
> + if (i == cpu)
> + continue;
> + if (src_chipid == topology_physical_package_id(i)) {
> + target = i;
> + break;
> + }
> + }
> +
> + /*
> +  * Update the cpumask with the target cpu and
> +  * migrate the context if needed
> +  */
> + if (target >= 0) {

You already check for target >= 0 here. So you don't need to check for
new_cpu >= 0 in nest_change_cpu_context() above ?

> + cpumask_set_cpu(target, &cpu_mask_nest_pmu);
> + nest_change_cpu_context (cpu, target);
> + }
> +}
> +
> +static void nest_init_cpu(int cpu)
> +{
> + int i, src_chipid;
> +
> + /*
> +  * Search for any existing designated thread from
> +  * the incoming cpu's node/chip. If found, do nothing.
> +  */
> + src_chipid = topology_physical_package_id(cpu);
> + for_each_cpu(i, &cpu_mask_nest_pmu)
> + if (src_chipid == topology_physical_package_id(i))
> + return;
> +
> + /*
> +  * Make incoming cpu as a designated thread for
> +  * this node/chip
> +  */
> + cpumask_set_cpu(cpu, _mask_nest_pmu);

Why can't we simply check if cpu is the first one coming online in the
chip and designate it as the cpu_mask_nest_pmu for that chip ? If it is
not the first cpu, it means that another cpu in the same chip already
took over this duty and it needn't bother.
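
That check could be as small as the sketch below, reusing the
hypothetical chip_of[]/online[] tables from the earlier sketch:

/* True if @cpu is the only online cpu on its chip, i.e. the first to
 * come up, and should therefore become the designated counter reader. */
static int is_first_online_in_chip(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (i != cpu && online[i] && chip_of[i] == chip_of[cpu])
			return 0;	/* duty already taken */
	return 1;
}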

And shouldn't we also call nest_init() on this cpu, just like you do in
cpumask_chip() on all cpu_mask_nest_pmu cpus ?

> +}
> +
> +static int nest_cpu_notifier(struct notifier_block *self,
> + unsigned long action, void *hcpu)
> +{
> + long cpu = (long)hcpu;
> +
> + switch (action & ~CPU_TASKS_FROZEN) {
> + case CPU_DOWN_FAILED:

Why do we need to handle the DOWN_FAILED case ? In DOWN_PREPARE, you
have ensured that the function moves on to another cpu. So even if the
offline failed, its no issue. The duty is safely taken over.

> + case CPU_STARTING:

I would suggest initializing nest in the CPU_ONLINE stage. This is
because CPU_STARTING phase can fail. In that case, we will be
unnecessarily initializing nest pre-maturely. CPU_ONLINE phase assures
that the cpu is successfully online and you can then initiate nest.

> + nest_init_cpu(cpu);
> + break;
> + case CPU_DOWN_PREPARE:
> + nest_exit_cpu(cpu);
> + break;
> + default


Re: [PATCH 5/8] nohz: Restart the tick from irq exit

2015-06-14 Thread Preeti U Murthy
On 06/12/2015 06:08 PM, Frederic Weisbecker wrote:
> On Fri, Jun 12, 2015 at 09:32:45AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 11, 2015 at 07:36:05PM +0200, Frederic Weisbecker wrote:
>>> Restart the tick when necessary from the irq exit path. It makes nohz
>>> full more flexible and allow it to piggyback the tick restart on the
>>> scheduler IPI in the future instead of sending a dedicated IPI that
>>> often doubles the scheduler IPI on task wakeup. This will require
>>> careful review of resched_curr() callers.
>>
>> This seems to assume schedule_ipi() callers use irq_exit(), this is
>> false.
> 
> Indeed there will be that too. Note the current patch doesn't yet rely on
> schedule_ipi(), we are still using the nohz ipis. But introducing the
> tick restart on irq exit prepares for later piggybacking on scheduler_ipi().
> 
> I think this will involve changes on got_nohz_idle_kick(), renamed to
> got_nohz_kick() and include nohz full related checks to trigger the
> irq_enter()/exit() pair.

I may be saying something obvious here, nevertheless: I am not sure about
other archs, but at least on powerpc, after handling an interrupt, we will
call irq_exit() and reevaluate restarting the tick. So in our case, even if
scheduler_ipi() callers do not call irq_exit(), it will be called after
handling the reschedule interrupt.

Regards
Preeti U Murthy
> 



Re: [PATCH 5/8] nohz: Restart the tick from irq exit

2015-06-14 Thread Preeti U Murthy
On 06/11/2015 11:06 PM, Frederic Weisbecker wrote:
> Restart the tick when necessary from the irq exit path. It makes nohz
> full more flexible and allow it to piggyback the tick restart on the
> scheduler IPI in the future instead of sending a dedicated IPI that
> often doubles the scheduler IPI on task wakeup. This will require

You can piggy back on the scheduler ipi when you add a timer/hrtimer and
add a new task to the runqueue of the nohz_full cpus, since we call
resched_curr() in these code paths. But what about the calls to kick
nohz_full cpus by perf events and posix cpu timers ? These call sites
seem to be concerned about specifically waking up nohz_full cpus as far
as I can see. IOW there is no scheduling ipi that we can fall back on in
these paths.

> careful review of resched_curr() callers.
> 

Regards
Preeti U Murthy



Re: [PATCH 4/8] nohz: Remove idle task special case

2015-06-13 Thread Preeti U Murthy
On 06/11/2015 11:06 PM, Frederic Weisbecker wrote:
> This is a leftover from old days to avoid conflicts with dynticks idle
> code. Now full dynticks and idle dynticks are better integrated and
> interact without known issue.

I am sorry but I fail to understand why the check on idle task was there
in the first place in the below code paths. It would help if you could
clarify this in the changelog as well.

> 
> So lets remove it.
> 
> Cc: Christoph Lameter 
> Cc: Ingo Molnar 
> Cc; John Stultz 
> Cc: Peter Zijlstra 
> Cc: Preeti U Murthy 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Viresh Kumar 
> Signed-off-by: Frederic Weisbecker 
> ---
>  kernel/time/tick-sched.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 812f7a3..324482f 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -208,10 +208,8 @@ void __tick_nohz_full_check(void)
>   struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> 
>   if (tick_nohz_full_cpu(smp_processor_id())) {
> - if (ts->tick_stopped && !is_idle_task(current)) {
> - if (!can_stop_full_tick())

can_stop_full_tick() would have bailed out if the current task was idle,
since it checks for the number of tasks being greater than 1 to restart
the tick. So why was the check is_idle_task() introduced earlier ?

> - tick_nohz_restart_sched_tick(ts, ktime_get());
> - }
> + if (ts->tick_stopped && !can_stop_full_tick())
> + tick_nohz_restart_sched_tick(ts, ktime_get());
>   }
>  }
> 
> @@ -710,7 +708,7 @@ static void tick_nohz_full_stop_tick(struct tick_sched 
> *ts)
>  #ifdef CONFIG_NO_HZ_FULL
>   int cpu = smp_processor_id();
> 
> - if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
> + if (!tick_nohz_full_cpu(cpu))

If the current task was indeed idle, the check on ts->inidle would have
succeeded in tick_irq_exit() and we would not have reached this function
at all, isn't it? So here too I am unable to understand why we had it in
the first place.

Regards
Preeti U Murthy
>   return;
> 
>   if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
> 




Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-02 Thread Preeti U Murthy
On 06/02/2015 11:57 AM, Viresh Kumar wrote:
> On 02-06-15, 11:50, Preeti U Murthy wrote:
>> CPU0                             CPU1
>>
>> store*                           store*
>>
>> lock(policy 1)                   lock(policy 2)
>> cpufreq_set_policy()             cpufreq_set_policy()
>> EXIT() :
>> dbs_data->usage_count--
>>
>> INIT()                           EXIT()
> 
> When will INIT() follow EXIT() in set_policy() for the same governor ?
> Maybe not, and so this sequence is hypothetical ?

Ah, yes, this will be hypothetical.
> 
>> dbs_data exists                  dbs_data->usage_count-- == 0
>>                                  kfree(dbs_data)
>> dbs_data->usage_count++
>> *NULL dereference*
> 
> But even if this happens, it should be handled with
> dbs_data->mutex_lock, which is used at other places already.

dbs_data->mutex_lock is used to serialize only START,STOP,LIMIT calls.
It does not serialize EXIT/INIT with these operations, but that is
understandable. We need to note that EXIT can proceed in parallel with
START/STOP/LIMIT operations which can result in null pointer
dereferences of dbs_data.

Yet again, this is due to the reason that you pointed out. One such case
is __cpufreq_remove_dev_finish(). It also drops policy locks before
calling into START/LIMIT. So this can proceed in parallel with an EXIT
operation from a store. So dbs_data->mutex does not help serialize these
two and START can refer a freed dbs_data. Consider the scenario today
where CPUFREQ_HAVE_GOVERNOR_PER_POLICY is set itself.

CPU0                                CPU1

cpufreq_set_policy()

__cpufreq_governor
(CPUFREQ_GOV_POLICY_EXIT)
since the only usage
becomes 0.
                                    __cpufreq_remove_dev_finish()

free(dbs_data)                      __cpufreq_governor
                                    (CPUFRQ_GOV_START)

                                    dbs_data->mutex <= NULL dereference

This is what we hit initially.

So we will need the policy wide lock even for those drivers that have a
governor per policy, before calls to __cpufreq_governor() in order to
avoid such scenarios. So, your patch at
https://git.linaro.org/people/viresh.kumar/linux.git/patch/57714d5b1778f2f610bcc5c74d85b29ba1cc1995
can lead to above races between different store operations themselves,
won't it ? An EXIT on one of the policy cpus, followed by a STOP on
another will lead to null dereference of dbs_data(For
GOVERNOR_PER_POLICY). Today stores are serialized through the policy lock.

Anyway, since taking the policy lock leads to ABBA deadlock and
releasing it can lead to races like above, shouldn't we try another
approach at serialization ?
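
One such alternative, sketched here purely as an illustration (a pthread
mutex stands in for a kernel mutex, and dbs_data is reduced to its
refcount): make every lifecycle transition happen under one governor-wide
lock, so an EXIT can never free the structure between another path's NULL
check and its dereference.

#include <pthread.h>
#include <stdlib.h>

/* Illustration only: dbs_data reduced to its refcount. */
struct dbs_data_sketch {
	int usage_count;
};

static pthread_mutex_t gov_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dbs_data_sketch *dbs_data;

/* INIT/START side: take a reference, or learn that EXIT already won. */
static struct dbs_data_sketch *dbs_get(void)
{
	struct dbs_data_sketch *d;

	pthread_mutex_lock(&gov_lock);
	d = dbs_data;
	if (d)
		d->usage_count++;
	pthread_mutex_unlock(&gov_lock);
	return d;	/* NULL means the governor has fully exited */
}

/* EXIT side: drop a reference; the last one frees under the same lock,
 * so no other path can sit between its NULL check and a dereference. */
static void dbs_put(void)
{
	pthread_mutex_lock(&gov_lock);
	if (dbs_data && --dbs_data->usage_count == 0) {
		free(dbs_data);
		dbs_data = NULL;	/* a later INIT must reallocate */
	}
	pthread_mutex_unlock(&gov_lock);
}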


Regards
Preeti U Murthy
> 



Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-02 Thread Preeti U Murthy
On 06/02/2015 11:41 AM, Viresh Kumar wrote:
> On 02-06-15, 11:33, Preeti U Murthy wrote:
>> No, dbs_data is a governor wide data structure and not a policy wide
> 
> Yeah, that's the common part which I was referring to. But normally
> its just read for policies in START/STOP, they just update per-cpu
> data for policy->cpus.
> 
>> one, which is manipulated in START/STOP calls for drivers where the
>> CPUFREQ_HAVE_GOVERNOR_PER_POLICY is not set.
>>
>> So even if we assume that we hold per-policy locks, the following race
>> is still present. Assume that we have just two cpus which do not have a
>> governor-per-policy set.
>>
>> CPU0                             CPU1
>>
>> store*                           store*
>>
>> lock(policy 1)                   lock(policy 2)
>> cpufreq_set_policy()             cpufreq_set_policy()
>> EXIT() :
>> dbs_data->usage_count--
>>
>> INIT()
>> dbs_data exists
> 
> You missed the usage_count++ here.

Ok, sorry about that. How about the below ?
> 
>> so return
>>                                  EXIT()
>>                                  dbs_data->usage_count-- == 0
>>                                  kfree(dbs_data)
> 
> And so this shouldn't happen. Else we
> are missing locking in governor's
> code, rather than cpufreq.c
> 
CPU0                                CPU1

store*                              store*

lock(policy 1)                      lock(policy 2)
cpufreq_set_policy()                cpufreq_set_policy()
EXIT() :
dbs_data->usage_count--

INIT()                              EXIT()
dbs_data exists                     dbs_data->usage_count-- == 0
                                    kfree(dbs_data)
dbs_data->usage_count++
*NULL dereference*

The point is there are potential race conditions. Its just a matter of
interleaving ?

Regards
Preeti U Murthy



Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-02 Thread Preeti U Murthy
On 06/02/2015 11:09 AM, Viresh Kumar wrote:
> On 02-06-15, 11:01, Preeti U Murthy wrote:
>> How will a policy lock help here at all, when cpus from multiple
>> policies are calling into __cpufreq_governor() ? How will a policy lock
>> serialize their entry into cpufreq_governor_dbs() ?
> 
> So different policies don't really depend on each other. The only
> thing common to them are the governor's sysfs files (only if
> governor-per-policy isn't set, i.e. in your case). Those sysfs files
> and their kernel counterpart variables aren't touched unless all the
> policies have EXITED. All these START/STOP calls touch only the data
> relevant to those policies only.

No, dbs_data is a governor wide data structure and not a policy wide
one, which is manipulated in START/STOP calls for drivers where the
CPUFREQ_HAVE_GOVERNOR_PER_POLICY is not set.

So even if we assume that we hold per-policy locks, the following race
is still present. Assume that we have just two cpus which do not have a
governor-per-policy set.

CPU0                                CPU1

store*                              store*

lock(policy 1)                      lock(policy 2)
cpufreq_set_policy()                cpufreq_set_policy()
EXIT() :
dbs_data->usage_count--

INIT()
dbs_data exists
so return
                                    EXIT()
                                    dbs_data->usage_count-- == 0
                                    kfree(dbs_data)

START()
dereference dbs_data
*NULL dereference*


Regards
Preeti U Murthy




Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
> On 01-06-15, 01:40, Preeti U Murthy wrote:
> 
> I have to mention that this is somewhat inspired by:
> 
> https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
> 
> and I was waiting to finish some core-changes to make all this simple.
> 
> I am fine to you trying to finish it though :)
> 
>> The problem showed up when running hotplug operations and changing
>> governors in parallel. The crash would be at:
>>
>> [  174.319645] Unable to handle kernel paging request for data at address 
>> 0x
>> [  174.319782] Faulting instruction address: 0xc053b3e0
>> cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
>> pc: c053b3e0: __bitmap_weight+0x70/0x100
>> lr: c085a338: need_load_eval+0x38/0xf0
>> sp: c3fdbaf0
>>msr: 90019033
>>dar: 0
>>  dsisr: 4000
>>   current = 0xc3151a40
>>   paca    = 0xc7da0980   softe: 0        irq_happened: 0x01
>> pid   = 842, comm = kworker/1:2
>> enter ? for help
>> [c3fdbb40] c085a338 need_load_eval+0x38/0xf0
>> [c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
>> [c3fdbbe0] c00f489c process_one_work+0x24c/0x910
>> [c3fdbc90] c00f50dc worker_thread+0x17c/0x540
>> [c3fdbd20] c00fed70 kthread+0x120/0x140
>> [c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64
>>
>> While debugging the issue, other problems in this area were uncovered,
>> all of them necessitating serialized calls to __cpufreq_governor().  One
>> potential race condition that can happen today is the below:
>>
>> CPU0                             CPU1
>>
>> cpufreq_set_policy()
>>
>> __cpufreq_governor
>> (CPUFREQ_GOV_POLICY_EXIT)
>>                                  __cpufreq_remove_dev_finish()
>>
>> free(dbs_data)                   __cpufreq_governor
>>                                  (CPUFRQ_GOV_START)
>>
>>                                  dbs_data->mutex <= NULL dereference
>>
>> The issue here is that calls to cpufreq_governor_dbs() is not serialized
>> and they can conflict with each other in numerous ways. One way to sort
>> this out would be to serialize all calls to cpufreq_governor_dbs()
>> by setting the governor busy if a call is in progress and
>> blocking all other calls. But this approach will not cover all loop
>> holes. Take the above scenario: CPU1 will still hit a NULL dereference if
>> care is not taken to check for a NULL dbs_data.
>>
>> To sort such scenarios, we could filter out the sequence of events: A
>> CPUFREQ_GOV_START cannot be called without an INIT, if the previous
>> event was an EXIT. However this results in analysing all possible
>> sequence of events and adding each of them as a filter. This results in
>> unmanagable code. There is high probability of missing out on a race
>> condition. Both the above approaches were tried out earlier [1]
> 
> I agree.
> 
>> Let us therefore look at the heart of the issue.
> 
> Yeah, we should :)
> 
>> It is not really about
>> serializing calls to cpufreq_governor_dbs(), it seems to be about
>> serializing entire sequence of CPUFREQ_GOV* operations. For instance, in
>> cpufreq_set_policy(), we STOP,EXIT the old policy and INIT and START the
>> new policy. Between the EXIT and INIT, there must not be
>> anybody else starting the policy. And between INIT and START, there must
>> be nobody stopping the policy.
> 
> Hmm..
> 
>> A similar argument holds for the CPUFREQ_GOV* operations in
>>  __cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence
>> until each of these functions complete in totality, none of the others
>> should run in parallel. The interleaving of the individual calls to
>> cpufreq_governor_dbs() is resulting in invalid operations. This patch
>> therefore tries to serialize entire cpufreq functions calling CPUFREQ_GOV*
>> operations, with respect to each other.
> 
> We were forced to put band-aids until this time and I am really
> looking into getting this fixed at the root.
> 
> The problem is that we drop policy locks before calling
> __cpufreq_governor() and that's the root cause of all these problems
> we are facing. We did that because we were getting warnings about
> circular locks (955ef4833574 ("cpufreq: Drop rwsem lock around
> CPUFREQ_GOV_POLICY_EXIT"))..
> 
> I have explained that problem here (Never sent this upstream, as I was
> waiting for some other patches to get included first):
> https://git.linaro.org/people/viresh.kumar/linux.git/commit/57714d5b1778f2f610bcc5c74d85b29ba1cc1995
> 
> The actual problem was:
> If we hold any locks, that the attribute operations grab, when
> removing the attribute, then it can result in a ABBA deadlock.
> 
> show()/store() holds the policy-rwsem lock while accessing any sysfs
> attributes under cpu/cpuX/cpufreq/ directory.
> 
> But something like what I have done is the real way to tackle all
> these problems.

How will a policy lock help here at all, when cpus from multiple
policies are calling into __cpufreq_governor() ? How will a policy lock
serialize their entry into cpufreq_governor_dbs() ?

Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
> On 01-06-15, 01:40, Preeti U Murthy wrote:
> 
> I have to mention that this is somewhat inspired by:
> 
> https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
> 
> and I was waiting to finish some core-changes to make all this simple.
> 
> I am fine to you trying to finish it though :)

I am extremely sorry for not having pointed it out explicitly. I assumed
mentioning a reference to it would do. But in retrospect, I should have
stated it clearly.

I will remember to check with the previous authors about their progress
on a previously posted out patch before posting out my version of the
same. Thank you for pointing it out :)

Regards
Preeti U Murthy



[RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
The problem showed up when running hotplug operations and changing
governors in parallel. The crash would be at:

[  174.319645] Unable to handle kernel paging request for data at address 
0x
[  174.319782] Faulting instruction address: 0xc053b3e0
cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
pc: c053b3e0: __bitmap_weight+0x70/0x100
lr: c085a338: need_load_eval+0x38/0xf0
sp: c3fdbaf0
   msr: 90019033
   dar: 0
 dsisr: 4000
  current = 0xc3151a40
  paca    = 0xc7da0980   softe: 0        irq_happened: 0x01
pid   = 842, comm = kworker/1:2
enter ? for help
[c3fdbb40] c085a338 need_load_eval+0x38/0xf0
[c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
[c3fdbbe0] c00f489c process_one_work+0x24c/0x910
[c3fdbc90] c00f50dc worker_thread+0x17c/0x540
[c3fdbd20] c00fed70 kthread+0x120/0x140
[c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64

While debugging the issue, other problems in this area were uncovered,
all of them necessitating serialized calls to __cpufreq_governor().  One
potential race condition that can happen today is the below:

CPU0                                CPU1

cpufreq_set_policy()

__cpufreq_governor
(CPUFREQ_GOV_POLICY_EXIT)
                                    __cpufreq_remove_dev_finish()

free(dbs_data)                      __cpufreq_governor
                                    (CPUFRQ_GOV_START)

                                    dbs_data->mutex <= NULL dereference

The issue here is that calls to cpufreq_governor_dbs() is not serialized
and they can conflict with each other in numerous ways. One way to sort
this out would be to serialize all calls to cpufreq_governor_dbs()
by setting the governor busy if a call is in progress and
blocking all other calls. But this approach will not cover all
loopholes. Take the above scenario: CPU1 will still hit a NULL dereference if
care is not taken to check for a NULL dbs_data.

To sort such scenarios, we could filter out the sequence of events: A
CPUFREQ_GOV_START cannot be called without an INIT, if the previous
event was an EXIT. However this results in analysing all possible
sequences of events and adding each of them as a filter. This results in
unmanageable code. There is a high probability of missing out on a race
condition. Both the above approaches were tried out earlier [1]

Let us therefore look at the heart of the issue. It is not really about
serializing calls to cpufreq_governor_dbs(), it seems to be about
serializing entire sequence of CPUFREQ_GOV* operations. For instance, in
cpufreq_set_policy(), we STOP,EXIT the old policy and INIT and START the
new policy. Between the EXIT and INIT, there must not be
anybody else starting the policy. And between INIT and START, there must
be nobody stopping the policy.

A similar argument holds for the CPUFREQ_GOV* operations in
 __cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence
until each of these functions complete in totality, none of the others
should run in parallel. The interleaving of the individual calls to
cpufreq_governor_dbs() is resulting in invalid operations. This patch
therefore tries to serialize entire cpufreq functions calling CPUFREQ_GOV*
operations, with respect to each other.

However there are a few concerns:

a. On those archs, which do not have CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag
set, this patch results in softlockups. This may be because we are setting
the governor busy in a preempt enabled section, which will block other
cpufreq operations across all cpus.

b. This has still not solved the kernel panic mentioned at the beginning of
the changelog. But it does resolve the kernel panic on a 3.18 stable kernel.
The 3.18 kernel survives parallel hotplug and cpufreq governor change
operations with this patch.

So the reason this patch was posted out as an RFC was to resolve the
above two issues wrt this patch and get the discussion going on resolving
potential race conditions in parallel cpufreq and hotplug operations which
seems rampant and not easily solvable. There are race conditions in
parallel cpufreq operations themselves.

This RFC patch brings forth potential issues and possible approaches to
solutions.

[1] http://comments.gmane.org/gmane.linux.power-management.general/49337

Signed-off-by: Preeti U Murthy 
---
 drivers/cpufreq/cpufreq.c |   68 +++--
 include/linux/cpufreq.h   |2 +
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 8ae655c..e03e738 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -121,6 +121,45 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
 }
 EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
+static bool is_governor_busy(struct cpufreq_policy *policy)
+{
+   if (have_governor_per_
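
The posting is truncated above, mid-way through is_governor_busy(); the
sketch below is only a guess at the helper's shape, inferred from the
changelog: the busy flags and the per-policy fallback are assumptions,
not the author's actual code.

#include <stdbool.h>

/* Stand-ins; the real patch operates on struct cpufreq_policy. */
struct policy_sketch {
	bool governor_busy;		/* hypothetical per-policy flag */
};

static bool global_governor_busy;	/* hypothetical system-wide flag */
static bool per_policy_governors;	/* CPUFREQ_HAVE_GOVERNOR_PER_POLICY */

static bool is_governor_busy_sketch(struct policy_sketch *policy)
{
	if (per_policy_governors)
		return policy->governor_busy;
	return global_governor_busy;
}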

[RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
The problem showed up when running hotplug operations and changing
governors in parallel. The crash would be at:

[  174.319645] Unable to handle kernel paging request for data at address 
0x
[  174.319782] Faulting instruction address: 0xc053b3e0
cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
pc: c053b3e0: __bitmap_weight+0x70/0x100
lr: c085a338: need_load_eval+0x38/0xf0
sp: c3fdbaf0
   msr: 90019033
   dar: 0
 dsisr: 4000
  current = 0xc3151a40
  paca= 0xc7da0980   softe: 0irq_happened: 0x01
pid   = 842, comm = kworker/1:2
enter ? for help
[c3fdbb40] c085a338 need_load_eval+0x38/0xf0
[c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
[c3fdbbe0] c00f489c process_one_work+0x24c/0x910
[c3fdbc90] c00f50dc worker_thread+0x17c/0x540
[c3fdbd20] c00fed70 kthread+0x120/0x140
[c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64

While debugging the issue, other problems in this area were uncovered,
all of them necessitating serialized calls to __cpufreq_governor().  One
potential race condition that can happen today is the below:

CPU0CPU1

cpufreq_set_policy()

__cpufreq_governor
(CPUFREQ_GOV_POLICY_EXIT)
__cpufreq_remove_dev_finish()

free(dbs_data)  __cpufreq_governor
(CPUFRQ_GOV_START)

dbs_data-mutex = NULL dereference

The issue here is that calls to cpufreq_governor_dbs() is not serialized
and they can conflict with each other in numerous ways. One way to sort
this out would be to serialize all calls to cpufreq_governor_dbs()
by setting the governor busy if a call is in progress and
blocking all other calls. But this approach will not cover all loop
holes. Take the above scenario: CPU1 will still hit a NULL dereference if
care is not taken to check for a NULL dbs_data.

To sort such scenarios, we could filter out the sequence of events: A
CPUFREQ_GOV_START cannot be called without an INIT, if the previous
event was an EXIT. However this results in analysing all possible
sequence of events and adding each of them as a filter. This results in
unmanagable code. There is high probability of missing out on a race
condition. Both the above approaches were tried out earlier [1]

Let us therefore look at the heart of the issue. It is not really about
serializing calls to cpufreq_governor_dbs(), it seems to be about
serializing entire sequence of CPUFREQ_GOV* operations. For instance, in
cpufreq_set_policy(), we STOP,EXIT the old policy and INIT and START the
new policy. Between the EXIT and INIT, there must not be
anybody else starting the policy. And between INIT and START, there must
be nobody stopping the policy.

A similar argument holds for the CPUFREQ_GOV* operations in
 __cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence
until each of these functions complete in totality, none of the others
should run in parallel. The interleaving of the individual calls to
cpufreq_governor_dbs() is resulting in invalid operations. This patch
therefore tries to serialize entire cpufreq functions calling CPUFREQ_GOV*
operations, with respect to each other.

However there are a few concerns:

a. On those archs, which do not have CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag
set, this patch results in softlockups. This may be because we are setting
the governor busy in a preempt enabled section, which will block other
cpufreq operations across all cpus.

b. This has still not solved the kernel panic mentioned a the beginning of
the changelog. But it does resolve the kernel panic on a 3.18 stable kernel.
The 3.18 kernel survives parallel hotplug and cpufreq governor change
operations with this patch.

So the reason this patch was posted out as an RFC was to resolve the
above two issues wrg to this patch and get the discussion going on resolving
potential race conditions in parallel cpufreq and hotplug operations which
seems rampant and not easily solvable. There are race conditions in
parallel cpufreq operations themselves.

This RFC patch brings forth potential issues and possible approaches to
solutions.

[1] http://comments.gmane.org/gmane.linux.power-management.general/49337

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 drivers/cpufreq/cpufreq.c |   68 +++--
 include/linux/cpufreq.h   |2 +
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 8ae655c..e03e738 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -121,6 +121,45 @@ struct kobject *get_governor_parent_kobj(struct 
cpufreq_policy *policy)
 }
 EXPORT_SYMBOL_GPL(get_governor_parent_kobj);
 
+static bool is_governor_busy(struct cpufreq_policy *policy

Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
 On 01-06-15, 01:40, Preeti U Murthy wrote:
 
 I have to mention that this is somewhat inspired by:
 
 https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
 
 and I was waiting to finish some core-changes to make all this simple.
 
 I am fine to you trying to finish it though :)

I am extremely sorry for not having pointed it out explicitly. I assumed
mentioning a reference to it would do. But in retrospect, I should have
stated it clearly.

I will remember to check with the previous authors about their progress
on a previously posted out patch before posting out my version of the
same. Thank you for pointing it out :)

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions

2015-06-01 Thread Preeti U Murthy
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
 On 01-06-15, 01:40, Preeti U Murthy wrote:
 
 I have to mention that this is somewhat inspired by:
 
 https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
 
 and I was waiting to finish some core-changes to make all this simple.
 
 I am fine to you trying to finish it though :)
 
 The problem showed up when running hotplug operations and changing
 governors in parallel. The crash would be at:

 [  174.319645] Unable to handle kernel paging request for data at address 
 0x
 [  174.319782] Faulting instruction address: 0xc053b3e0
 cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
 pc: c053b3e0: __bitmap_weight+0x70/0x100
 lr: c085a338: need_load_eval+0x38/0xf0
 sp: c3fdbaf0
msr: 90019033
dar: 0
  dsisr: 4000
   current = 0xc3151a40
   paca= 0xc7da0980softe: 0irq_happened: 0x01
 pid   = 842, comm = kworker/1:2
 enter ? for help
 [c3fdbb40] c085a338 need_load_eval+0x38/0xf0
 [c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
 [c3fdbbe0] c00f489c process_one_work+0x24c/0x910
 [c3fdbc90] c00f50dc worker_thread+0x17c/0x540
 [c3fdbd20] c00fed70 kthread+0x120/0x140
 [c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64

 While debugging the issue, other problems in this area were uncovered,
 all of them necessitating serialized calls to __cpufreq_governor().  One
 potential race condition that can happen today is the below:

 CPU0 CPU1

 cpufreq_set_policy()

 __cpufreq_governor
 (CPUFREQ_GOV_POLICY_EXIT)
  __cpufreq_remove_dev_finish()

 free(dbs_data)   __cpufreq_governor
  (CPUFRQ_GOV_START)

  dbs_data-mutex = NULL dereference

 The issue here is that calls to cpufreq_governor_dbs() are not serialized
 and they can conflict with each other in numerous ways. One way to sort
 this out would be to serialize all calls to cpufreq_governor_dbs()
 by setting the governor busy if a call is in progress and
 blocking all other calls. But this approach will not cover all
 loopholes. Take the above scenario: CPU1 will still hit a NULL dereference
 if care is not taken to check for a NULL dbs_data.
 
 To sort out such scenarios, we could filter out invalid sequences of
 events: a CPUFREQ_GOV_START cannot be called without an INIT, if the
 previous event was an EXIT. However this results in analysing all possible
 sequences of events and adding each of them as a filter. This results in
 unmanageable code. There is a high probability of missing out on a race
 condition. Both the above approaches were tried out earlier [1]
 
 I agree.
 
 Let us therefore look at the heart of the issue.
 
 Yeah, we should :)
 
 It is not really about
 serializing calls to cpufreq_governor_dbs(), it seems to be about
 serializing the entire sequence of CPUFREQ_GOV* operations. For instance, in
 cpufreq_set_policy(), we STOP,EXIT the old policy and INIT and START the
 new policy. Between the EXIT and INIT, there must not be
 anybody else starting the policy. And between INIT and START, there must
 be nobody stopping the policy.
 
 Hmm..
 
 A similar argument holds for the CPUFREQ_GOV* operations in
  __cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence
 until each of these functions completes in totality, none of the others
 should run in parallel. The interleaving of the individual calls to
 cpufreq_governor_dbs() is resulting in invalid operations. This patch
 therefore tries to serialize, with respect to each other, the cpufreq
 functions that invoke CPUFREQ_GOV* operations.
 
 We were forced to put band-aids until this time and I am really
 looking into getting this fixed at the root.
 
 The problem is that we drop policy locks before calling
 __cpufreq_governor() and that's the root cause of all these problems
 we are facing. We did that because we were getting warnings about
 circular locks (955ef4833574 (cpufreq: Drop rwsem lock around
 CPUFREQ_GOV_POLICY_EXIT))..
 
 I have explained that problem here (Never sent this upstream, as I was
 waiting for some other patches to get included first):
 https://git.linaro.org/people/viresh.kumar/linux.git/commit/57714d5b1778f2f610bcc5c74d85b29ba1cc1995
 
 The actual problem was:
 If we hold any locks that the attribute operations grab when
 removing the attribute, then it can result in an ABBA deadlock.
 
 show()/store() holds the policy->rwsem lock while accessing any sysfs
 attributes under cpu/cpuX/cpufreq/ directory.
 
 But something like what I have done is the real way to tackle all
 these problems.

How will a policy lock help here at all, when cpus from multiple
policies are calling into __cpufreq_governor()? How will a policy lock
serialize their entry?
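
To make the objection concrete, here is a minimal sketch of the kind of
whole-sequence serialization being argued about. This is an editorial
illustration, not the actual patch: the lock and the wrapper function are
hypothetical, only the __cpufreq_governor() events are real.

	static DEFINE_MUTEX(gov_seq_lock); /* hypothetical: one lock, all policies */

	/* Switch governors as one indivisible STOP/EXIT/INIT/START sequence. */
	static int cpufreq_switch_governor(struct cpufreq_policy *policy,
					   struct cpufreq_governor *new_gov)
	{
		int ret;

		mutex_lock(&gov_seq_lock);

		ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
		if (!ret)
			ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
		if (!ret) {
			policy->governor = new_gov;
			ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_INIT);
		}
		if (!ret)
			ret = __cpufreq_governor(policy, CPUFREQ_GOV_START);

		mutex_unlock(&gov_seq_lock);
		return ret;
	}

A per-policy lock cannot give this guarantee, because the dbs_data freed on
POLICY_EXIT is shared across all policies using the governor; a START
arriving through another policy can still observe the freed state.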

Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-05-30 Thread Preeti U Murthy
On 05/30/2015 11:31 AM, Vaidyanathan Srinivasan wrote:
> * Preeti U Murthy  [2015-05-29 19:17:17]:
> 
> [snip]
> 
>>> +   if (max_idle_state > 1) {
>>> +   snooze_timeout_en = true;
>>> +   snooze_timeout = cpuidle_state_table[1].target_residency *
>>> +tb_ticks_per_usec;
>>> +   }
>>
>> Any idea why we don't have snooze defined on the shared lpar configuration ?
> 
> In shared lpar case, spinning in guest context may potentially take
> away cycles from other lpars waiting to run on the same physical cpu.
> 
> So the policy in shared lpar case is to let PowerVM hypervisor know
> immediately that the guest cpu is idle which will allow the hypervisor
> to use the cycles for other tasks/lpars.
> 

Oh Ok! Thanks for the clarification !

Regards
Preeti U Murthy
> --Vaidy
> 
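
For readers following along: the driver encodes this policy at probe time by
selecting between two state tables. A rough sketch of that selection,
paraphrased from memory of cpuidle-pseries.c rather than quoted from it:

	if (lppaca_shared_proc(get_lppaca())) {
		/* Shared lpar: cede to the hypervisor right away, no snooze. */
		cpuidle_state_table = shared_states;
	} else {
		/* Dedicated lpar: snooze (polling) plus the deeper CEDE states. */
		cpuidle_state_table = dedicated_states;
	}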

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-05-29 Thread Preeti U Murthy
Hi Shilpa,

The subject does not convey the purpose of this patch clearly IMO.
I would definitely suggest changing the subject to something like
"Auto promotion of snooze to deeper idle state" or similar.

On 05/29/2015 06:02 PM, Shilpasri G Bhat wrote:
> The idle cpus which stay in snooze for a long period can degrade the
> performance of the sibling cpus. If the cpu stays in snooze for more
> than target residency of the next available idle state, then exit from
> snooze. This gives a chance to the cpuidle governor to re-evaluate the
> last idle state of the cpu to promote it to deeper idle states.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 12 
>  drivers/cpuidle/cpuidle-pseries.c | 11 +++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index bb9e2b6..07135e0 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -27,6 +27,8 @@ struct cpuidle_driver pseries_idle_driver = {
>  
>  static int max_idle_state;
>  static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>  
>  static inline void idle_loop_prolog(unsigned long *in_purr)
>  {
> @@ -58,14 +60,18 @@ static int snooze_loop(struct cpuidle_device *dev,
>   int index)
>  {
>   unsigned long in_purr;
> + u64 snooze_exit_time;
>  
>   idle_loop_prolog(&in_purr);
>   local_irq_enable();
>   set_thread_flag(TIF_POLLING_NRFLAG);
> + snooze_exit_time = get_tb() + snooze_timeout;
>  
>   while (!need_resched()) {
>   HMT_low();
>   HMT_very_low();
> + if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + break;
>   }
>  
>   HMT_medium();
> @@ -244,6 +250,11 @@ static int pseries_idle_probe(void)
>   } else
>   return -ENODEV;
>  
> + if (max_idle_state > 1) {
> + snooze_timeout_en = true;
> + snooze_timeout = cpuidle_state_table[1].target_residency *
> +  tb_ticks_per_usec;
> + }

Any idea why we don't have snooze defined on the shared lpar configuration ?

Regards
Preeti U Murthy
>   return 0;
>  }
>  
> 
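
A quick worked example of the hunk above, with assumed numbers: if
tb_ticks_per_usec is 512 and state 1 has a target_residency of 100 us, then
snooze_timeout = 100 * 512 = 51200 timebase ticks. snooze_loop() therefore
stops polling roughly 100 us after entry and returns to the governor, which
can then promote the cpu to a deeper state.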

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c

2015-05-27 Thread Preeti U Murthy
On 05/28/2015 07:39 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if
> CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0.
> However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0)
> entry in the cpuidle driver's table of states is overwritten with
> the default "poll" entry by the core.  The "state" defined by the
> "poll" entry doesn't provide ->enter_dead and ->enter_freeze
> callbacks and its exit_latency is 0.
> 
> For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START
> in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state"
> will be skipped by the loop).
> 
> It also is arguably not useful to return states with exit_latency
> equal to 0 from find_deepest_state(), so the function can be modified
> to start the loop from index 0 and the "poll state" will be skipped by
> it as a result of the check against latency_req.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  drivers/cpuidle/cpuidle.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/drivers/cpuidle/cpuidle.c
> ===
> --- linux-pm.orig/drivers/cpuidle/cpuidle.c
> +++ linux-pm/drivers/cpuidle/cpuidle.c
> @@ -65,7 +65,7 @@ int cpuidle_play_dead(void)
>   return -ENODEV;
> 
>   /* Find lowest-power state that supports long-term idle */
> - for (i = drv->state_count - 1; i >= CPUIDLE_DRIVER_STATE_START; i--)
> + for (i = drv->state_count - 1; i >= 0; i--)
>   if (drv->states[i].enter_dead)
>   return drv->states[i].enter_dead(dev, i);
> 
> @@ -79,9 +79,9 @@ static int find_deepest_state(struct cpu
> bool freeze)
>  {
>   unsigned int latency_req = 0;
> - int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
> + int i, ret = -ENXIO;
> 
> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> + for (i = 0; i < drv->state_count; i++) {
>   struct cpuidle_state *s = &drv->states[i];
>   struct cpuidle_state_usage *su = &dev->states_usage[i];
> 

Reviewed-by: Preeti U Murthy 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c

2015-05-27 Thread Preeti U Murthy
On 05/27/2015 07:27 PM, Rafael J. Wysocki wrote:
> On Wed, May 27, 2015 at 2:25 PM, Daniel Lezcano
>  wrote:
>> On 05/27/2015 01:31 PM, Preeti U Murthy wrote:
>>>
>>> On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote:
>>>>
>>>> From: Rafael J. Wysocki 
>>>>
>>>> The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if
>>>> CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0.
>>>> However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0)
>>>> entry in the cpuidle driver's table of states is overwritten with
>>>> the default "poll" entry by the core.  The "state" defined by the
>>>> "poll" entry doesn't provide ->enter_dead and ->enter_freeze
>>>> callbacks and its exit_latency is 0.
>>>>
>>>> For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START
>>>> in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state"
>>>> will be skipped by the loop) and in find_deepest_state() (since
>>>> exit_latency is 0, the "poll state" will become the default if the
>>>> "s->exit_latency <= latency_req" check is replaced with
>>>> "s->exit_latency < latency_req" which may only matter for drivers
>>>> providing different states with the same exit_latency).
>>>>
>>>> Signed-off-by: Rafael J. Wysocki 
>>>> ---
>>>>   drivers/cpuidle/cpuidle.c |8 
>>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> 
>>>
>>>>
>>>> @@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu
>>>>   bool freeze)
>>>>   {
>>>> unsigned int latency_req = 0;
>>>> -   int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
>>>> +   int i, ret = -ENXIO;
>>>>
>>>> -   for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
>>>> +   for (i = 0; i < drv->state_count; i++) {
>>>> struct cpuidle_state *s = &drv->states[i];
>>>> struct cpuidle_state_usage *su = &dev->states_usage[i];
>>>>
>>>> -   if (s->disabled || su->disable || s->exit_latency <=
>>>> latency_req
>>>> +   if (s->disabled || su->disable || s->exit_latency <
>>>> latency_req
>>>
>>>
>>> Prior to this patch,
>>>
>>> For drivers on which CPUIDLE_DRIVER_STATE_START takes a value 0 and
>>> whose first idle state has an exit_latency of 0, find_deepest_state()
>>> would return -1 if it failed to find a deeper idle state.
>>> But as an effect of this patch, find_deepest_state() returns 0 in the
>>> above circumstance.
>>
>>
>> Except I am missing something, with an exit_latency = 0, the state will be
>> never selected, because of the "s->exit_latency < latency_req" condition
>> (strictly greater than).
> 
> No, this is the condition to skip the state, so previously it wouldn't
> be selected, but after the patch it will.
> 
> So yes, the patch changes behavior for systems with
> CONFIG_ARCH_HAS_CPU_RELAX unset.
> 
>>> My concern is if these drivers do not intend to enter a polling state
>>> during suspend, this will cause an issue, won't it?
> 
> The change in behavior happens for architectures where
> CONFIG_ARCH_HAS_CPU_RELAX is not set.  In those cases the 0-index
> state is supposed to be provided by the driver.  Is there a reason to
> expect that this may not be a genuine idle state?

On PowerPC, we have the 0-index idle state, whose exit_latency is 0 and
all that the CPU does in this state, is poll on need_resched(), except
at a lower priority from the hardware's standpoint. Nevertheless, the
CPU is busy polling. So, I would not consider it a genuine idle state.

On a side note, we do not yet support suspend on Power servers, but we
may in the future. Hence the concern.

>> Definitively poll can cause thermal issues, especially when suspending. It
>> is a dangerous state (let's imagine you close your laptop => suspend/poll
>> and then put it in your bag for a travel).
> 
> With ARCH_HAS_CPU_RELAX set the "poll state" is supposed to be thermally safe.
> 
>> I don't think with the code above we can reach this situation but I agree
>> this is something we have to take care carefully.
>>
>> Actually, I am in favour of removing poll at all from the cpuidle driver and
>> poll only when a cpuidle state selection fails under certain condition.
>>
>> So I fully agree with your statement below.
>>
>>> I would expect the cpus to be in a hardware
>>> defined idle state.
> 
> Well, except for the degenerate case in which all of them are disabled
> and for the broadcast timer stopping avoidance use case for
> find_deepest_state().

During suspend, the CPUs can very well enter states where the timer
stops since we stop timer interrupts anyway. So unless the idle states
are explicitly disabled by the user/hardware for some reason, deeper
idle states will still be available during suspend as far as I can see.

> So there are two questions in my view:
> (1) Should find_deepest_state() ever return states with exit_latency equal to 0?

I would say no, since such an idle state would mostly be polling on a
wakeup event. At least, there is one such case in PowerPC.

> (2) If the answer to (1) is yes, should the poll state be ever
> returned by find_deepest_state()?
> 
> In any case, find_deepest_state() will only return a state with
> exit_latency equal to 0

Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c

2015-05-27 Thread Preeti U Murthy
On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if
> CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0.
> However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0)
> entry in the cpuidle driver's table of states is overwritten with
> the default "poll" entry by the core.  The "state" defined by the
> "poll" entry doesn't provide ->enter_dead and ->enter_freeze
> callbacks and its exit_latency is 0.
> 
> For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START
> in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state"
> will be skipped by the loop) and in find_deepest_state() (since
> exit_latency is 0, the "poll state" will become the default if the
> "s->exit_latency <= latency_req" check is replaced with
> "s->exit_latency < latency_req" which may only matter for drivers
> providing different states with the same exit_latency).
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  drivers/cpuidle/cpuidle.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)


> 
> @@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu
> bool freeze)
>  {
>   unsigned int latency_req = 0;
> - int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
> + int i, ret = -ENXIO;
> 
> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> + for (i = 0; i < drv->state_count; i++) {
>   struct cpuidle_state *s = &drv->states[i];
>   struct cpuidle_state_usage *su = &dev->states_usage[i];
> 
> - if (s->disabled || su->disable || s->exit_latency <= latency_req
> + if (s->disabled || su->disable || s->exit_latency < latency_req

Prior to this patch,

For drivers on which CPUIDLE_DRIVER_STATE_START takes a value 0 and
whose first idle state has an exit_latency of 0, find_deepest_state()
would return -1 if it failed to find a deeper idle state.

But as an effect of this patch, find_deepest_state() returns 0 in the
above circumstance.

My concern is if these drivers do not intend to enter a polling state
during suspend, this will cause an issue, won't it? This also gets me
wondering if polling state is an acceptable idle state during suspend,
given that the drivers with ARCH_HAS_CPU_RELAX permit entry into it
during suspend today. I would expect the cpus to be in a hardware
defined idle state.
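
To spell out the behavioural change with an illustrative driver whose state 0
has exit_latency == 0 (the loop skips a state when the condition holds):

	/* Old filter: skip if (s->exit_latency <= latency_req)
	 *   state 0: 0 <= 0 is true  -> skipped, ret can stay -1
	 * New filter: skip if (s->exit_latency <  latency_req)
	 *   state 0: 0 <  0 is false -> accepted, ret becomes 0
	 */

So on a driver where index 0 is really a polling loop, find_deepest_state()
can now fall back to a busy-polling "idle" state when every deeper state is
unavailable, which is exactly the concern raised above.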

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH linux-next] tracing/mm: Use raw_smp_processor_id() instead of smp_processor_id()

2015-05-15 Thread Preeti U Murthy
On 05/15/2015 06:57 PM, Shreyas B. Prabhu wrote:
> trace_mm_page_pcpu_drain, trace_kmem_cache_free, trace_mm_page_free
> can be potentially called from an offlined cpu. Since trace points use
> RCU and RCU should not be used from offlined cpus, we have checks to
> filter out such calls.
> 
> But these checks use smp_processor_id() and since these trace calls can
> happen from preemptable sections, this throws a warning when running
> with CONFIG_DEBUG_PREEMPT set.
> 
> Now consider task gets migrated after calling smp_processor_id()
>  - From an online cpu to another online cpu - No impact
>  - From an online cpu to an offline cpu - Should never happen
>  - From an offline cpu to an online cpu - Once a cpu has been
>offlined it returns to cpu_idle_loop, discovers it's offline and calls
>arch_cpu_idle_dead. All this happens with preemption disabled. So
>this scenario too should never happen.
> 
> Thus running with PREEMPT set has no impact on the condition. So use
> raw_smp_processor_id() so that the warnings are suppressed.
> 
> Signed-off-by: Shreyas B. Prabhu 
> ---
>  include/trace/events/kmem.h | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 6cd975f..f7554fd 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -146,7 +146,16 @@ DEFINE_EVENT_CONDITION(kmem_free, kmem_cache_free,
> 
>   TP_ARGS(call_site, ptr),
> 
> - TP_CONDITION(cpu_online(smp_processor_id()))
> + /*
> +  * This trace can be potentially called from an offlined cpu.
> +  * Since trace points use RCU and RCU should not be used from
> +  * offline cpus, filter such calls out.
> +  * While this trace can be called from a preemptable section,
> +  * it has no impact on the condition since tasks can migrate
> +  * only from online cpus to other online cpus. Thus it's safe
> +  * to use raw_smp_processor_id.
> +  */
> + TP_CONDITION(cpu_online(raw_smp_processor_id()))
>  );
> 
>  TRACE_EVENT_CONDITION(mm_page_free,
> @@ -155,7 +164,17 @@ TRACE_EVENT_CONDITION(mm_page_free,
> 
>   TP_ARGS(page, order),
> 
> - TP_CONDITION(cpu_online(smp_processor_id())),
> +
> + /*
> +  * This trace can be potentially called from an offlined cpu.
> +  * Since trace points use RCU and RCU should not be used from
> +  * offline cpus, filter such calls out.
> +  * While this trace can be called from a preemptable section,
> +  * it has no impact on the condition since tasks can migrate
> +  * only from online cpus to other online cpus. Thus it's safe
> +  * to use raw_smp_processor_id.
> +  */
> + TP_CONDITION(cpu_online(raw_smp_processor_id())),
> 
>   TP_STRUCT__entry(
>   __field(unsigned long,  pfn )
> @@ -263,7 +282,16 @@ TRACE_EVENT_CONDITION(mm_page_pcpu_drain,
> 
>   TP_ARGS(page, order, migratetype),
> 
> - TP_CONDITION(cpu_online(smp_processor_id())),
> + /*
> +  * This trace can be potentially called from an offlined cpu.
> +  * Since trace points use RCU and RCU should not be used from
> +  * offline cpus, filter such calls out.
> +  * While this trace can be called from a preemptable section,
> +  * it has no impact on the condition since tasks can migrate
> +  * only from online cpus to other online cpus. Thus it's safe
> +  * to use raw_smp_processor_id.
> +  */
> + TP_CONDITION(cpu_online(raw_smp_processor_id())),
> 
>   TP_STRUCT__entry(
>   __field(unsigned long,  pfn )

Reviewed-by: Preeti U Murthy 
> 
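
For readers wondering why smp_processor_id() warns at all in the quoted
paths: under CONFIG_DEBUG_PREEMPT it goes through a debug check which,
roughly paraphrased from memory of lib/smp_processor_id.c, only stays quiet
when the answer cannot go stale:

	int cpu = raw_smp_processor_id();

	if (likely(preempt_count()))
		goto out;	/* preemption disabled: cpu cannot change */
	if (irqs_disabled())
		goto out;	/* interrupts off: cannot be preempted */
	if (cpumask_equal(tsk_cpus_allowed(current), cpumask_of(cpu)))
		goto out;	/* task pinned to this one cpu */
	/* otherwise: print the DEBUG_PREEMPT warning */

raw_smp_processor_id() bypasses this check entirely, which is why the patch
silences the warning without changing the condition's semantics.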

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures

2015-05-13 Thread Preeti U Murthy
On 05/14/2015 04:29 AM, Kevin Hilman wrote:
> "Rafael J. Wysocki"  writes:
> 
> [...]
> 
>> Second, quite honestly, I don't see a connection to genpd here.
> 
> The connection with genpd is because the *reason* the timer was
> shutdown/stopped is because it shares power with the CPU, which is why
> the timer stops when the CPU hits certain low power states.  IOW, it's
> in the same power domain as the CPU.
> 
>> What you seem to be saying is "maybe we can eliminate the need to check the
>> return value of tick_broadcast_enter() in the idle loop if we proactively
>> disable the TIMER_STOP idle states of a CPU when we start to use that CPU's
>> timer as a broadcast one".
>>
>> So this seems to be about the timekeeping rather than power domains, because
>> that's where the broadcast thing is done.  So the code setting up the CPU's
>> timer for broadcast would pretty much need to pause cpuidle, go through the
>> CPU's idle states and disable the TIMER_STOP ones.  And do the reverse when 
>> the
>> timer is not going the be used for broadcast any more.  
> 
> Or..., modify the timer subystem to use runtime PM on the timer devices,
> create a genpd that includes the timer device, and use
> pm_genpd_attach_cpuidle() to attach that genpd so that whenever that
> timer is runtime PM active, the deeper C-states cannot be hit.

I think you are missing a point here. If such a solution were possible,
the tick broadcast framework would not have been designed to support
deep cpu idle states. One reason we cannot go this way of course, is not
all archs may support genpd as was pointed out. But the second reason
IMO is that a timer is runtime PM active as long as there is some
deferred work, either in the near or far future.

The point behind the broadcast framework is let these CPUs go to deeper
idle states when the timers are in the "far" future. We can potentially
save power by doing so and don't need to keep the entire power domain
active just because the timer is supposed to fire 5 minutes from now,
which is precisely what happens if we go the genpd way.

Hence I don't think we can trivially club timers with genpd unless we
have a way to power the timer PM domain down, depending on when it is
supposed to fire, in which case we will merely be replicating the
cpuidle governor code.
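
For reference, the division of labour defended here shows up directly in the
idle-entry path. A condensed sketch, paraphrasing the cpuidle code under
discussion rather than quoting it:

	if (target_state->flags & CPUIDLE_FLAG_TIMER_STOP) {
		if (tick_broadcast_enter()) {
			/* This cpu is the broadcast agent (hrtimer mode):
			 * it must keep a working timer, so stay shallow. */
			default_idle_call();
			return -EBUSY;
		}
	}

	entered_state = target_state->enter(dev, drv, index);
	/* A sleeping cpu is woken by an IPI from the broadcast agent
	 * when its next (possibly far-future) timer would have fired. */

	if (target_state->flags & CPUIDLE_FLAG_TIMER_STOP)
		tick_broadcast_exit();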

Regards
Preeti U Murthy

> 
>> So question is whether or not this is actually really more
>> straightforward than checking the return value of
>> tick_broadcast_enter() in the idle loop after all.
> 
> Unfortunately this problem doesn't only affect timers.
> 
> Daniel's broader point is that $SUBJECT series only handles this for the
> timer, but there's actually a more general problem to solve for *any*
> device that shares a power domain with a CPU (e.g. CPU-local
> timers, interrupt controllers, performance monitoring units, floating
> point units, etc. etc.)
> 
> If we keep adding checks to the idle loop for all those devices, we're
> heading for a mess.  (In fact, this is exactly what CPUidle drivers in
> lots of vendor trees are doing, and it is indeed quite messy, and very
> vendor specific.)
> 
> Also, solving this more general problem was the primary motivation for
> adding the genpd _attach_cpuidle() feature in the first place, so why not
> use that?
> 
> Longer term, IMO, these dependencies between CPUs and all these "extras"
> logic that share a power domain should be modeled by a genpd.  If all
> those devices are using runtime PM, including the CPUs, and they are
> grouped into a genpd, then we can very easily know at the genpd level
> whether or not the CPU could be powered down, and to what level.  This
> longer-term solution is what I want to discuss at LPC this year in my
> "Unify idle management of CPUs and IO devices" topic[1].  ( Also FYI,
> using a genpd to model a CPU and connected logic is part of the
> motivation behind the recent proposals to add support for multiple
> states to genpd by Axel Haslam. )
> 
> Anyways I digress...
> 
> In the short term, while your patches look fine to me, the objection I
> have is that it's only a band-aid fix that handles timers, but none of
> the other "extras" that might share a power rail with the CPU.  So,
> until we have the long-term stuff sorted out, the better
> short-term solution IMO is the _attach_cpuidle() one above.
> 
> Kevin
> 
> [1] http://wiki.linuxplumbersconf.org/2015:energy-aware_scheduling
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures

2015-05-10 Thread Preeti U Murthy
On 05/10/2015 04:45 AM, Rafael J. Wysocki wrote:
> On Saturday, May 09, 2015 10:33:05 PM Rafael J. Wysocki wrote:
>> On Saturday, May 09, 2015 10:11:41 PM Rafael J. Wysocki wrote:
>>> On Saturday, May 09, 2015 11:19:16 AM Preeti U Murthy wrote:
>>>> Hi Rafael,
>>>>
>>>> On 05/08/2015 07:48 PM, Rafael J. Wysocki wrote:
>>>
>>> [cut]
>>>
>>>>>>  
>>>>>> +/* Take note of the planned idle state. */
>>>>>> +idle_set_state(smp_processor_id(), target_state);
>>>>>
>>>>> And I wouldn't do this either.
>>>>>
>>>>> The behavior here is pretty much as though the driver demoted the state 
>>>>> chosen
>>>>> by the governor and we don't call idle_set_state() again in those cases.
>>>>
>>>> Why is this wrong?
>>>
>>> It is not "wrong", but incomplete, because demotions done by the cpuidle 
>>> driver
>>> should also be taken into account in the same way.
>>>
>>> But I'm seeing that the recent patch of mine that made cpuidle_enter_state()
>>> call default_idle_call() was a mistake, because it might confuse 
>>> find_idlest_cpu()
>>> significantly as to what state the CPU is in.  I'll drop that one for now.
>>
>> OK, done.
>>
>> So after I've dropped it I think we need to do three things:
>> (1) Move the idle_set_state() calls to cpuidle_enter_state().
>> (2) Make cpuidle_enter_state() call default_idle_call() again, but this time
>> do that *before* it has called idle_set_state() for target_state.
>> (3) Introduce demotion as per my last patch.
>>
>> Let me cut patches for that.
> 
> Done as per the above and the patches follow in replies to this message.
> 
> All on top of the current linux-next branch of the linux-pm.git tree.

The patches look good. Based and tested these patches on top of
linux-pm/linux-next (They are not yet in the branch as far as I can see.)

All patches in this series
Reviewed and Tested-by: Preeti U Murthy 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures

2015-05-10 Thread Preeti U Murthy
On 05/10/2015 04:45 AM, Rafael J. Wysocki wrote:
> On Saturday, May 09, 2015 10:33:05 PM Rafael J. Wysocki wrote:
>> On Saturday, May 09, 2015 10:11:41 PM Rafael J. Wysocki wrote:
>>> On Saturday, May 09, 2015 11:19:16 AM Preeti U Murthy wrote:
>>>> Hi Rafael,
>>>>
>>>> On 05/08/2015 07:48 PM, Rafael J. Wysocki wrote:
>>>
>>> [cut]
>>>
>>>>>>  
>>>>>> +/* Take note of the planned idle state. */
>>>>>> +idle_set_state(smp_processor_id(), target_state);
>>>>>
>>>>> And I wouldn't do this either.
>>>>>
>>>>> The behavior here is pretty much as though the driver demoted the state 
>>>>> chosen
>>>>> by the governor and we don't call idle_set_state() again in those cases.
>>>>
>>>> Why is this wrong?
>>>
>>> It is not "wrong", but incomplete, because demotions done by the cpuidle 
>>> driver
>>> should also be taken into account in the same way.
>>>
>>> But I'm seeing that the recent patch of mine that made cpuidle_enter_state()
>>> call default_idle_call() was a mistake, because it might confuse 
>>> find_idlest_cpu()
>>> significantly as to what state the CPU is in.  I'll drop that one for now.
>>
>> OK, done.
>>
>> So after I've dropped it I think we need to do three things:
>> (1) Move the idle_set_state() calls to cpuidle_enter_state().
>> (2) Make cpuidle_enter_state() call default_idle_call() again, but this time
>> do that *before* it has called idle_set_state() for target_state.
>> (3) Introduce demotion as per my last patch.
>>
>> Let me cut patches for that.
> 
> Done as per the above and the patches follow in replies to this message.
> 
> All on top of the current linux-next branch of the linux-pm.git tree.

I don't see the patches on linux-pm/linux-next.

Regards
Preeti U Murthy
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-08 Thread Preeti U Murthy
> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> + for (i = CPUIDLE_DRIVER_STATE_START; i < limit; i++) {
>   struct cpuidle_state *s = &drv->states[i];
>   struct cpuidle_state_usage *su = &dev->states_usage[i];
> 
>   if (s->disabled || su->disable || s->exit_latency <= latency_req
> - || (freeze && !s->enter_freeze))
> + || (freeze && !s->enter_freeze)
> + || (s->flags & flags_to_avoid))
>   continue;
> 
>   latency_req = s->exit_latency;
> @@ -100,7 +102,7 @@ static int find_deepest_state(struct cpu
>  int cpuidle_find_deepest_state(struct cpuidle_driver *drv,
>  struct cpuidle_device *dev)
>  {
> - return find_deepest_state(drv, dev, false);
> + return find_deepest_state(drv, dev, false, drv->state_count, 0);
>  }
> 
>  static void enter_freeze_proper(struct cpuidle_driver *drv,
> @@ -139,7 +141,7 @@ int cpuidle_enter_freeze(struct cpuidle_
>* that interrupts won't be enabled when it exits and allows the tick to
>* be frozen safely.
>*/
> - index = find_deepest_state(drv, dev, true);
> + index = find_deepest_state(drv, dev, true, drv->state_count, 0);
>   if (index >= 0)
>   enter_freeze_proper(drv, dev, index);
> 
> @@ -168,8 +170,13 @@ int cpuidle_enter_state(struct cpuidle_d
>* CPU as a broadcast timer, this call may fail if it is not available.
>*/
>   if (broadcast && tick_broadcast_enter()) {
> - default_idle_call();
> - return -EBUSY;
> + index = find_deepest_state(drv, dev, false, index,
> +CPUIDLE_FLAG_TIMER_STOP);
> + if (index < 0) {
> + default_idle_call();
> + return -EBUSY;
> + }
> + target_state = &drv->states[index];
>   }
> 
>   trace_cpu_idle_rcuidle(index, dev->cpu);
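
In effect, when tick_broadcast_enter() fails, the hunk above re-runs
find_deepest_state() restricted to indices below the originally chosen one,
filtering out CPUIDLE_FLAG_TIMER_STOP states: the cpu is demoted to the
deepest state that keeps its local timer running, and only if no such state
exists does it fall back to default_idle_call().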

Regards
Preeti U Murthy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:locking/core] kernel: Replace reference to ASSIGN_ONCE() with WRITE_ONCE() in comment

2015-05-08 Thread tip-bot for Preeti U Murthy
Commit-ID:  663fdcbee0a656cdaef934e7f50e6c2670373bc9
Gitweb: http://git.kernel.org/tip/663fdcbee0a656cdaef934e7f50e6c2670373bc9
Author: Preeti U Murthy <pre...@linux.vnet.ibm.com>
AuthorDate: Thu, 30 Apr 2015 17:27:21 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 8 May 2015 12:28:53 +0200

kernel: Replace reference to ASSIGN_ONCE() with WRITE_ONCE() in comment

Looks like commit :

 43239cbe79fc ("kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)")

left behind a reference to ASSIGN_ONCE(). Update this to WRITE_ONCE().

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: borntrae...@de.ibm.com
Cc: d...@stgolabs.net
Cc: paul...@linux.vnet.ibm.com
Link: 
http://lkml.kernel.org/r/20150430115721.22278.94082.st...@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 include/linux/compiler.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 8677225..a7c0941 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -450,7 +450,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * with an explicit memory barrier or atomic instruction that provides the
  * required ordering.
  *
- * If possible use READ_ONCE/ASSIGN_ONCE instead.
+ * If possible use READ_ONCE()/WRITE_ONCE() instead.
  */
 #define __ACCESS_ONCE(x) ({ \
 __maybe_unused typeof(x) __var = (__force typeof(x)) 0; \
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-08 Thread Preeti U Murthy
On 05/08/2015 02:20 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 11:17:21 PM Preeti U Murthy wrote:
>> When a CPU has to enter an idle state where tick stops, it makes a call
>> to tick_broadcast_enter(). The call will fail if this CPU is the
>> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
>> handles this CPU.  This is not convincing because not only are we not
>> aware what the arch cpuidle code does, but we also do not account for
>> the idle state residency time and usage of such a CPU.
>>
>> This scenario can be handled better by simply asking the cpuidle
>> governor to choose an idle state where in ticks do not stop. To
>> accommodate this change move the setting of runqueue idle state from the
>> core to the cpuidle driver, else the rq->idle_state will be set wrong.
>>
>> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
>> ---
>> Changes from V1: https://lkml.org/lkml/2015/5/7/24
>> Rebased on the latest linux-pm/bleeding-edge
>>
>>  drivers/cpuidle/cpuidle.c  |   21 +
>>  drivers/cpuidle/governors/ladder.c |   13 ++---
>>  drivers/cpuidle/governors/menu.c   |6 +-
>>  include/linux/cpuidle.h|6 +++---
>>  include/linux/sched.h  |   16 
>>  kernel/sched/core.c|   17 +
>>  kernel/sched/fair.c|2 +-
>>  kernel/sched/idle.c|8 +---
>>  kernel/sched/sched.h   |   24 
>>  9 files changed, 70 insertions(+), 43 deletions(-)
>>
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index 8c24f95..b7e86f4 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -21,6 +21,7 @@
>>  #include <linux/module.h>
>>  #include <linux/suspend.h>
>>  #include <linux/tick.h>
>> +#include <linux/sched.h>
>>  #include <trace/events/power.h>
>>  
>>  #include "cpuidle.h"
>> @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
>> struct cpuidle_driver *drv,
>>   * CPU as a broadcast timer, this call may fail if it is not available.
>>   */
>>  if (broadcast && tick_broadcast_enter()) {
>> -default_idle_call();
>> -return -EBUSY;
>> +index = cpuidle_select(drv, dev, !broadcast);
> 
> No, you can't do that.
> 
> This code path may be used by suspend-to-idle and that should not call
> cpuidle_select().
> 
> What's needed here seems to be a fallback mechanism like "choose the
> deepest state shallower than X and such that it won't stop the tick".
> You don't really need to run a full governor for that.

Agreed. Makes the patch a lot simpler as well. I have sent out V3 doing
this.

Thank you

Regards
Preeti U Murthy
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
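
The V3 patch below implements Rafael's suggestion with a plain loop. As a
self-contained illustration of the same idea, the deepest state below a
limit whose flags must be avoided, consider this userspace model
(FLAG_TIMER_STOP and the _model names are stand-ins, not the kernel
definitions, and the model scans from index 0 rather than
CPUIDLE_DRIVER_STATE_START):

/* fallback_model.c - build with: cc -Wall -o fallback_model fallback_model.c */
#include <stdio.h>
#include <stdbool.h>

#define FLAG_TIMER_STOP 0x1 /* stand-in for CPUIDLE_FLAG_TIMER_STOP */

struct state_model {
    const char *name;
    unsigned int flags;
    bool disabled;
};

/* Deepest usable state with index < limit avoiding 'flags_to_avoid'. */
static int find_fallback(const struct state_model *s, int limit,
                         unsigned int flags_to_avoid)
{
    int i, ret = -1;

    for (i = 0; i < limit; i++) {
        if (s[i].disabled || (s[i].flags & flags_to_avoid))
            continue;
        ret = i;
    }
    return ret;
}

int main(void)
{
    struct state_model states[] = {
        { "snooze", 0, false },
        { "nap",    0, false },
        { "sleep",  FLAG_TIMER_STOP, false },
    };

    /* The governor picked index 2, but tick_broadcast_enter() failed:
     * look for the deepest state below it that keeps the tick. */
    printf("fallback index: %d\n",
           find_fallback(states, 2, FLAG_TIMER_STOP));
    return 0;
}

Because the state table is ordered shallow to deep, remembering the last
acceptable index is enough to get the deepest usable state.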


[PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-08 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only do we not
know what the arch cpuidle code does, but we also do not account for the
idle state residency time and usage of such a CPU.

This scenario can be handled better by simply choosing an idle state
where in ticks do not stop. To accommodate this change move the setting
of runqueue idle state from the core to the cpuidle driver, else the
rq->idle_state will be set wrong.

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---
Changes from V2: https://lkml.org/lkml/2015/5/7/78
Introduce a function in cpuidle core to select an idle state where ticks do not
stop rather than going through the governors.

Changes from V1: https://lkml.org/lkml/2015/5/7/24
Rebased on the latest linux-pm/bleeding-edge branch

 drivers/cpuidle/cpuidle.c |   45 +++--
 include/linux/sched.h |   16 
 kernel/sched/core.c   |   17 +
 kernel/sched/fair.c   |2 +-
 kernel/sched/idle.c   |6 --
 kernel/sched/sched.h  |   24 
 6 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8c24f95..d1af760 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -146,6 +147,36 @@ int cpuidle_enter_freeze(struct cpuidle_driver *drv, struct cpuidle_device *dev)
return index;
 }
 
+/*
+ * find_tick_valid_state - select a state where tick does not stop
+ * @dev: cpuidle device for this cpu
+ * @drv: cpuidle driver for this cpu
+ */
+static int find_tick_valid_state(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv)
+{
+   int i, ret = -1;
+
+   for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
+   struct cpuidle_state *s = &drv->states[i];
+   struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+   /*
+* We do not explicitly check for latency requirement
+* since it is safe to assume that only shallower idle
+* states will have the CPUIDLE_FLAG_TIMER_STOP bit
+* cleared and they will invariably meet the latency
+* requirement.
+*/
+   if (s->disabled || su->disable ||
+   (s->flags & CPUIDLE_FLAG_TIMER_STOP))
+   continue;
+
+   ret = i;
+   }
+   return ret;
+}
+
 /**
  * cpuidle_enter_state - enter the state and update stats
  * @dev: cpuidle device for this cpu
@@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
if (broadcast && tick_broadcast_enter()) {
-   default_idle_call();
-   return -EBUSY;
+   index = find_tick_valid_state(dev, drv);
+   if (index < 0) {
+   default_idle_call();
+   return -EBUSY;
+   }
+   target_state = &drv->states[index];
}
 
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
+
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
 
@@ -180,6 +218,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e61..fef8359 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -45,6 +45,7 @@ struct sched_param {
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/rtmutex.h>
+#include <linux/cpuidle.h>
 
 #include <linux/time.h>
 #include <linux/param.h>
@@ -893,6 +894,21 @@ enum cpu_idle_type {
CPU_MAX_IDLE_TYPES
 };
 
+#ifdef CONFIG_CPU_IDLE
+extern void idle_set_state(int cpu, struct cpuidle_state *idle_state);
+extern struct cpuidle_state *idle_get_state(int cpu);
+#else
+static inline void idle_set_state(int cpu,
+ struct cpuidle_state *idle_state)
+{
+}
+
+static inline struct cpuidle_state *idle_get_state(int cpu)
+{
+   return NULL;
+}
+#endif
+
 /*
  * Increase resolution of cpu_capacity calculations
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f75..8e1cc50 100644
--- 


Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
>> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
>>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>>>> Hi Preeti,
>>>>
>>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>>>> Hi Shilpa,
>>>>>
>>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>>>> notification by executing *throttle_check() on any one of the cpu on
>>>>>> the chip. This is a sanity check to verify if we were indeed
>>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>>>
>>>>>> We cannot call *throttle_check() directly from the notification
>>>>>> handler because we could be handling chip1's notification in chip2. So
>>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>>>> in the notification handler, so use a worker thread to smp_call
>>>>>> throttle_check() on any of the cpu in the chipmask.
>>>>>
>>>>> I see that the first patch takes care of reporting *per-chip* throttling
>>>>> for pmax capping condition. But where are we taking care of reporting
>>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>>>
>>>>
>>>> IMO let us not have "psafe" and "freq control disabled" states managed 
>>>> per-chip.
>>>> Because when the above two conditions occur it is likely to happen across 
>>>> all
>>>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>>>> *throttle_check().
>>>
>>> Alright like I pointed in the previous reply, a comment to indicate that
>>> psafe and freq control disabled conditions will fail when occ is
>>> inactive and that all chips face the consequence of this will help.
>>
>> From your explanation on the thread of the first patch of this series,
>> this will not be required.
>>
>> So,
>> Reviewed-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
> 
> OK, so is the whole series reviewed now?

Yes the whole series has been reviewed.

Regards
Preeti U Murthy


> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
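
The pattern under review above, in sketch form (chip_model and the _model
functions are hypothetical names, not the actual powernv-cpufreq code): the
notifier runs with interrupts disabled, and smp_call_function_any() must not
be called in that context, so the notifier only queues work; the worker runs
in process context and can safely cross-call onto a CPU of the affected chip.

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

struct chip_model {
        cpumask_t mask;                 /* CPUs belonging to this chip */
        struct work_struct throttle;    /* deferred throttle re-check */
};

static void throttle_check_model(void *unused)
{
        /* Runs on some CPU of the chip; read PMSR and friends here. */
}

static void throttle_work_fn_model(struct work_struct *work)
{
        struct chip_model *chip = container_of(work, struct chip_model,
                                               throttle);

        /* Process context: cross-calling onto the chip is allowed here. */
        smp_call_function_any(&chip->mask, throttle_check_model, NULL, 0);
}

/*
 * The irq-disabled notification handler then only does:
 *      schedule_work(&chip->throttle);
 */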


Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 11:09 PM, Preeti U Murthy wrote:
> When a CPU has to enter an idle state where tick stops, it makes a call
> to tick_broadcast_enter(). The call will fail if this CPU is the
> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
> handles this CPU.  This is not convincing because not only are we not
> aware what the arch cpuidle code does, but we also do not account for
> the idle state residency time and usage of such a CPU.
> 
> This scenario can be handled better by simply asking the cpuidle
> governor to choose an idle state where in ticks do not stop. To
> accommodate this change move the setting of runqueue idle state from the
> core to the cpuidle driver, else the rq->idle_state will be set wrong.
> 
> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
> ---
> Rebased on the latest linux-pm/bleeding-edge

Kindly ignore this. I have sent the rebase as V2.
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

The below patch is not updated.

Regards
Preeti U Murthy
> 
>  drivers/cpuidle/cpuidle.c  |   21 +
>  drivers/cpuidle/governors/ladder.c |   13 ++---
>  drivers/cpuidle/governors/menu.c   |6 +-
>  include/linux/cpuidle.h|6 +++---
>  include/linux/sched.h  |   16 
>  kernel/sched/core.c|   17 +
>  kernel/sched/fair.c|2 +-
>  kernel/sched/idle.c|8 +---
>  kernel/sched/sched.h   |   24 
>  9 files changed, 70 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 61c417b..8f5657e 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -21,6 +21,7 @@
>  #include <linux/module.h>
>  #include <linux/suspend.h>
>  #include <linux/tick.h>
> +#include <linux/sched.h>
>  #include <trace/events/power.h>
> 
>  #include "cpuidle.h"
> @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>* local timer will be shut down.  If a local timer is used from another
>* CPU as a broadcast timer, this call may fail if it is not available.
>*/
> - if (broadcast && tick_broadcast_enter())
> - return -EBUSY;
> + if (broadcast && tick_broadcast_enter()) {
> + index = cpuidle_select(drv, dev, !broadcast);
> + if (index < 0)
> + return -EBUSY;
> + target_state = &drv->states[index];
> + }
> +
> + /* Take note of the planned idle state. */
> + idle_set_state(smp_processor_id(), target_state);
> 
>   trace_cpu_idle_rcuidle(index, dev->cpu);
>   time_start = ktime_get();
> @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>   time_end = ktime_get();
>   trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> 
> + /* The cpu is no longer idle or about to enter idle. */
> + idle_set_state(smp_processor_id(), NULL);
> +
>   if (broadcast) {
>   if (WARN_ON_ONCE(!irqs_disabled()))
>   local_irq_disable();
> @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
> struct cpuidle_driver *drv,
>   *
>   * @drv: the cpuidle driver
>   * @dev: the cpuidle device
> + * @timer_stop_valid: allow selection of idle state where tick stops
>   *
>   * Returns the index of the idle state.
>   */
> -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> +int cpuidle_select(struct cpuidle_driver *drv,
> + struct cpuidle_device *dev, int timer_stop_valid)
>  {
> - return cpuidle_curr_governor->select(drv, dev);
> + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
>  }
> 
>  /**
> diff --git a/drivers/cpuidle/governors/ladder.c 
> b/drivers/cpuidle/governors/ladder.c
> index 401c010..c437322 100644
> --- a/drivers/cpuidle/governors/ladder.c
> +++ b/drivers/cpuidle/governors/ladder.c
> @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct 
> ladder_device *ldev,
>   * ladder_select_state - selects the next state to enter
>   * @drv: cpuidle driver
>   * @dev: the CPU
> + * @timer_stop_valid: allow selection of idle state where tick stops
>   */
>  static int ladder_select_state(struct cpuidle_driver *drv,
> - struct cpuidle_device *dev)
> + struct cpuidle_device *dev, int 
> timer_stop_valid)
>  {
>   struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
>   struct ladder_device_state *last_state;
> @@ -86,6 +87,7 @@ static int ladder_s

Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 03:26 PM, Sudeep Holla wrote:
> Hi Preeti,
> 
> On 07/05/15 06:26, Preeti U Murthy wrote:
>> When a CPU has to enter an idle state where tick stops, it makes a call
>> to tick_broadcast_enter(). The call will fail if this CPU is the
>> broadcast CPU. Today, under such a circumstance, the arch cpuidle code
>> handles this CPU.  This is not convincing because not only are we not
>> aware what the arch cpuidle code does, but we also do not account for
>> the idle state residency time and usage of such a CPU.
>>
>> This scenario can be handled better by simply asking the cpuidle
>> governor to choose an idle state where in ticks do not stop. To
>> accommodate this change move the setting of runqueue idle state from the
>> core to the cpuidle driver, else the rq->idle_state will be set wrong.
>>
>> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
>> ---
>> Based on linux-pm/bleeding-edge
> 
> I am unable to apply this patch cleanly on linux-pm/bleeding-edge
> I think it conflicts with few patches that Rafael posted recently
> which are in the branch now.

Thanks so much for pointing out this! I have resent the patch as V2
after the rebase on the updated linux-pm/bleeding-edge branch

Regards
Preeti U Murthy
> 
> Regards,
> Sudeep
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state where in ticks do not stop. To
accommodate this change move the setting of runqueue idle state from the
core to the cpuidle driver, else the rq->idle_state will be set wrong.

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---
Changes from V1: https://lkml.org/lkml/2015/5/7/24
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8c24f95..b7e86f4 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
if (broadcast && tick_broadcast_enter()) {
-   default_idle_call();
-   return -EBUSY;
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index < 0) {
+   default_idle_call();
+   return -EBUSY;
+   }
+   target_state = &drv->states[index];
}
 
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
+
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
 
@@ -180,6 +188,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -215,12 +226,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor->select(drv, dev);
+   return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv->states[last_idx + 1].disabled &&
!dev->states_usage[last_idx + 1].disable &&
last_residency > last_state->threshold.promotion_time &&
+   !(!timer_stop_valid && (drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) &&
drv->states[last_idx + 1].exit_latency <= latency_req) {
last_state->stats.promotion_count++;
last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx > CPUIDLE_DRI

[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state where in ticks do not stop. To
accommodate this change move the setting of runqueue idle state from the
core to the cpuidle driver, else the rq->idle_state will be set wrong.

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 61c417b..8f5657e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 * local timer will be shut down.  If a local timer is used from another
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
-   if (broadcast && tick_broadcast_enter())
-   return -EBUSY;
+   if (broadcast && tick_broadcast_enter()) {
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index < 0)
+   return -EBUSY;
+   target_state = &drv->states[index];
+   }
+
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
 
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
@@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor->select(drv, dev);
+   return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv->states[last_idx + 1].disabled &&
!dev->states_usage[last_idx + 1].disable &&
last_residency > last_state->threshold.promotion_time &&
+   !(!timer_stop_valid && (drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) &&
drv->states[last_idx + 1].exit_latency <= latency_req) {
last_state->stats.promotion_count++;
last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx > CPUIDLE_DRIVER_STATE_START &&

Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>> notification by executing *throttle_check() on any one of the cpu on
>>>> the chip. This is a sanity check to verify if we were indeed
>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>
>>>> We cannot call *throttle_check() directly from the notification
>>>> handler because we could be handling chip1's notification in chip2. So
>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>> in the notification handler, so use a worker thread to smp_call
>>>> throttle_check() on any of the cpu in the chipmask.
>>>
>>> I see that the first patch takes care of reporting *per-chip* throttling
>>> for pmax capping condition. But where are we taking care of reporting
>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>
>>
>> IMO let us not have "psafe" and "freq control disabled" states managed 
>> per-chip.
>> Because when the above two conditions occur it is likely to happen across all
>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>> *throttle_check().
> 
> Alright like I pointed in the previous reply, a comment to indicate that
> psafe and freq control disabled conditions will fail when occ is
> inactive and that all chips face the consequence of this will help.

From your explanation on the thread of the first patch of this series,
this will not be required.

So,
Reviewed-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>

Regards
Preeti U Murthy
> 
>>
>>>>
>>>> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
>>>> ---
>>>>  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
>>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
>>>> b/drivers/cpufreq/powernv-cpufreq.c
>>>> index 9268424..9618813 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>>  static struct chip {
>>>>unsigned int id;
>>>>bool throttled;
>>>> +  cpumask_t mask;
>>>> +  struct work_struct throttle;
>>>>  } *chips;
>>>>
>>>>  static int nr_chips;
>>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>>>return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>>  }
>>>>
>>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>>  {
>>>> +  unsigned int cpu = smp_processor_id();
>>>>unsigned long pmsr;
>>>>int pmsr_pmax, pmsr_lp, i;
>>>>
>>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct 
>>>> cpufreq_policy *policy,
>>>>return 0;
>>>>
>>>>if (!throttled)
>>>> -  powernv_cpufreq_throttle_check(smp_processor_id());
>>>> +  powernv_cpufreq_throttle_check(NULL);
>>>>
>>>>freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>>
>>>> @@ -418,6 +421,14 @@ static struct notifier_block 
>>>> powernv_cpufreq_reboot_nb = {
>>>>.notifier_call = powernv_cpufreq_reboot_notifier,
>>>>  };
>>>>
>>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>>> +{
>>>> +  struct chip *chip = container_of(work, struct chip, throttle);
>>>> +
>>>> +  smp_call_function_any(&chip->mask,
>>>> +powernv_cpufreq_throttle_check, NULL, 0);
>>>> +}
>>>> +
>>>>  static char throttle_reason[][30] = {
>>>>"No throttling",
>>>>"Power Cap",
>>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct 
>>>> notifier_block *nb,

Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote:
> 
> 
> On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
>> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>>> Hi Preeti,
>>>
>>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>>> Hi Shilpa,
>>>>
>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>>> temperature limits. As Pmax capping is a chip level condition report
>>>>> this throttling behavior at chip level and also do not set the global
>>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>>
>>>>> This patch adds a structure to store chip id and throttled state of
>>>>> the chip.
>>>>>
>>>>> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
>>>>> ---
>>>>>  drivers/cpufreq/powernv-cpufreq.c | 59 
>>>>> ---
>>>>>  1 file changed, 55 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
>>>>> b/drivers/cpufreq/powernv-cpufreq.c
>>>>> index ebef0d8..d0c18c9 100644
>>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>>> @@ -27,6 +27,7 @@
>>>>>  #include <linux/smp.h>
>>>>>  #include <linux/of.h>
>>>>>  #include <linux/reboot.h>
>>>>> +#include <linux/slab.h>
>>>>>
>>>>>  #include <asm/cputhreads.h>
>>>>>  #include <asm/firmware.h>
>>>>> @@ -42,6 +43,13 @@
>>>>>  static struct cpufreq_frequency_table 
>>>>> powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>>>  static bool rebooting, throttled;
>>>>>
>>>>> +static struct chip {
>>>>> + unsigned int id;
>>>>> + bool throttled;
>>>>> +} *chips;
>>>>> +
>>>>> +static int nr_chips;
>>>>> +
>>>>>  /*
>>>>>   * Note: The set of pstates consists of contiguous integers, the
>>>>>   * smallest of which is indicated by powernv_pstate_info.min, the
>>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>>>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>>>  {
>>>>>   unsigned long pmsr;
>>>>> - int pmsr_pmax, pmsr_lp;
>>>>> + int pmsr_pmax, pmsr_lp, i;
>>>>>
>>>>>   pmsr = get_pmspr(SPRN_PMSR);
>>>>>
>>>>> + for (i = 0; i < nr_chips; i++)
>>>>> + if (chips[i].id == cpu_to_chip_id(cpu))
>>>>> + break;
>>>>> +
>>>>>   /* Check for Pmax Capping */
>>>>>   pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>>>   if (pmsr_pmax != powernv_pstate_info.max) {
>>>>> - throttled = true;
>>>>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>>> - pr_info("Max allowed Pstate is capped\n");
>>>>> + if (chips[i].throttled)
>>>>> + goto next;
>>>>> + chips[i].throttled = true;
>>>>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>>> + chips[i].id, pmsr_pmax);
>>>>> + } else if (chips[i].throttled) {
>>>>> + chips[i].throttled = false;
>>>>
>>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>>
>>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>>> sufficient to decide throttling/unthrottling when OCC is active for that 
>>> chip.
>>
>> Ok then we can perhaps exit after detecting unthrottling here.
> 
> This won't work for older firmwares which do not clear "Frequency control
> enabled bit" on OCC reset cycle. So let us check for remaining two conditions 
> on
> unthrottling as well.

ok.

> 
>>>
>>>>
>>>>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>>> + chips[i].id, pmsr_pmax);
>>>>>   }
>>>>>
>>>>>   /*
>>>>>* Check for Psafe by reading LocalPstate
>>>>>* or check if Psafe_mode_active is set in PMSR.
>>>>>*/
>>>>> +next:
>>>>>   pmsr_lp = (s8)PMSR_LP(pmsr);
>>>>>   if ((pmsr_lp < powernv_pstate_info.min) ||
>>>>>   (pmsr & PMSR_PSAFE_ENABLE)) {
>>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>>>   .attr   = powernv_cpu_freq_attr,
>>>>
>>>> What about the situation where although occ is active, this particular
>>>> chip has been throttled and we end up repeatedly reporting pstate set
>>>> to safe and frequency control disabled from OS ? Should we not have a
>>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>>> scenarios as well just like you have for pmsr_pmax ?
>>>
>>> We will not have Psafe and frequency control disabled repeatedly printed
>>> because of global variable 'throttled', which is set to true on passing any of
>>> these two conditions.
>>>
>>> It is quite unlikely behavior to have only one chip in Psafe or frequency
>>> control disabled state. These two conditions are most likely to happen during
>>> an OCC reset cycle which will occur across all chips.
>>
>> Let us then add a comment to indicate that Psafe and frequency control
>> disabled conditions will fail *only if OCC is inactive* and not
>> otherwise and that this is a system wide phenomenon.
>
> I agree that adding a comment here will clear global vs local throttling
> scenarios, but this will contradict the architectural design of OCC wherein it
> can independently go to Psafe and frequency control disabled state. It is
> the implementation in FSP today that has made the above two states global. My
> point is adding a comment here may be confusing if at all for the future
> firmwares this implementation is changed. Having said

Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 03:26 PM, Sudeep Holla wrote:
 Hi Preeti,
 
 On 07/05/15 06:26, Preeti U Murthy wrote:
 When a CPU has to enter an idle state where tick stops, it makes a call
 to tick_broadcast_enter(). The call will fail if this CPU is the
 broadcast CPU. Today, under such a circumstance, the arch cpuidle code
 handles this CPU.  This is not convincing because not only are we not
 aware what the arch cpuidle code does, but we also do not account for
 the idle state residency time and usage of such a CPU.

 This scenario can be handled better by simply asking the cpuidle
 governor to choose an idle state where in ticks do not stop. To
 accommodate this change move the setting of runqueue idle state from the
 core to the cpuidle driver, else the rq-idle_state will be set wrong.

 Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
 ---
 Based on linux-pm/bleeding-edge
 
 I am unable to apply this patch cleanly on linux-pm/bleeding-edge
 I think it conflicts with few patches that Rafael posted recently
 which are in the branch now.

Thanks so much for pointing out this! I have resent the patch as V2
after the rebase on the updated linux-pm/bleeding-edge branch

Regards
Preeti U Murthy
 
 Regards,
 Sudeep
 -- 
 To unsubscribe from this list: send the line unsubscribe linux-pm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 11:09 PM, Preeti U Murthy wrote:
 When a CPU has to enter an idle state where tick stops, it makes a call
 to tick_broadcast_enter(). The call will fail if this CPU is the
 broadcast CPU. Today, under such a circumstance, the arch cpuidle code
 handles this CPU.  This is not convincing because not only are we not
 aware what the arch cpuidle code does, but we also do not account for
 the idle state residency time and usage of such a CPU.
 
 This scenario can be handled better by simply asking the cpuidle
 governor to choose an idle state where in ticks do not stop. To
 accommodate this change move the setting of runqueue idle state from the
 core to the cpuidle driver, else the rq-idle_state will be set wrong.
 
 Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
 ---
 Rebased on the latest linux-pm/bleeding-edge

Kindly ignore this. I have sent the rebase as V2.
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

The below patch is not updated.

Regards
Preeti U Murthy
 
  drivers/cpuidle/cpuidle.c  |   21 +
  drivers/cpuidle/governors/ladder.c |   13 ++---
  drivers/cpuidle/governors/menu.c   |6 +-
  include/linux/cpuidle.h|6 +++---
  include/linux/sched.h  |   16 
  kernel/sched/core.c|   17 +
  kernel/sched/fair.c|2 +-
  kernel/sched/idle.c|8 +---
  kernel/sched/sched.h   |   24 
  9 files changed, 70 insertions(+), 43 deletions(-)
 
 diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
 index 61c417b..8f5657e 100644
 --- a/drivers/cpuidle/cpuidle.c
 +++ b/drivers/cpuidle/cpuidle.c
 @@ -21,6 +21,7 @@
  #include linux/module.h
  #include linux/suspend.h
  #include linux/tick.h
 +#include linux/sched.h
  #include trace/events/power.h
 
  #include cpuidle.h
 @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
 struct cpuidle_driver *drv,
* local timer will be shut down.  If a local timer is used from another
* CPU as a broadcast timer, this call may fail if it is not available.
*/
 - if (broadcast  tick_broadcast_enter())
 - return -EBUSY;
 + if (broadcast  tick_broadcast_enter()) {
 + index = cpuidle_select(drv, dev, !broadcast);
 + if (index  0)
 + return -EBUSY;
 + target_state = drv-states[index];
 + }
 +
 + /* Take note of the planned idle state. */
 + idle_set_state(smp_processor_id(), target_state);
 
   trace_cpu_idle_rcuidle(index, dev-cpu);
   time_start = ktime_get();
 @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
 struct cpuidle_driver *drv,
   time_end = ktime_get();
   trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev-cpu);
 
 + /* The cpu is no longer idle or about to enter idle. */
 + idle_set_state(smp_processor_id(), NULL);
 +
   if (broadcast) {
   if (WARN_ON_ONCE(!irqs_disabled()))
   local_irq_disable();
 @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
 struct cpuidle_driver *drv,
   *
   * @drv: the cpuidle driver
   * @dev: the cpuidle device
 + * @timer_stop_valid: allow selection of idle state where tick stops
   *
   * Returns the index of the idle state.
   */
 -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 +int cpuidle_select(struct cpuidle_driver *drv,
 + struct cpuidle_device *dev, int timer_stop_valid)
  {
 - return cpuidle_curr_governor-select(drv, dev);
 + return cpuidle_curr_governor-select(drv, dev, timer_stop_valid);
  }
 
  /**
 diff --git a/drivers/cpuidle/governors/ladder.c 
 b/drivers/cpuidle/governors/ladder.c
 index 401c010..c437322 100644
 --- a/drivers/cpuidle/governors/ladder.c
 +++ b/drivers/cpuidle/governors/ladder.c
 @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct 
 ladder_device *ldev,
   * ladder_select_state - selects the next state to enter
   * @drv: cpuidle driver
   * @dev: the CPU
 + * @timer_stop_valid: allow selection of idle state where tick stops
   */
  static int ladder_select_state(struct cpuidle_driver *drv,
 - struct cpuidle_device *dev)
 + struct cpuidle_device *dev, int 
 timer_stop_valid)
  {
   struct ladder_device *ldev = this_cpu_ptr(ladder_devices);
   struct ladder_device_state *last_state;
 @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
   !drv-states[last_idx + 1].disabled 
   !dev-states_usage[last_idx + 1].disable 
   last_residency  last_state-threshold.promotion_time 
 + !(!timer_stop_valid  (drv-states[last_idx + 1].flags  
 CPUIDLE_FLAG_TIMER_STOP)) 
   drv-states[last_idx + 1].exit_latency = latency_req) {
   last_state

[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state where in ticks do not stop. To
accommodate this change move the setting of runqueue idle state from the
core to the cpuidle driver, else the rq-idle_state will be set wrong.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
Changes from V1: https://lkml.org/lkml/2015/5/7/24
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 8c24f95..b7e86f4 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include linux/module.h
 #include linux/suspend.h
 #include linux/tick.h
+#include linux/sched.h
 #include trace/events/power.h
 
 #include cpuidle.h
@@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
if (broadcast  tick_broadcast_enter()) {
-   default_idle_call();
-   return -EBUSY;
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index  0) {
+   default_idle_call();
+   return -EBUSY;
+   }
+   target_state = drv-states[index];
}
 
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
+
trace_cpu_idle_rcuidle(index, dev-cpu);
time_start = ktime_get();
 
@@ -180,6 +188,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev-cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -215,12 +226,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor-select(drv, dev);
+   return cpuidle_curr_governor-select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c 
b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device 
*ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int 
timer_stop_valid)
 {
struct ladder_device *ldev = this_cpu_ptr(ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv-states[last_idx + 1].disabled 
!dev-states_usage[last_idx + 1].disable 
last_residency  last_state-threshold.promotion_time 
+   !(!timer_stop_valid  (drv-states[last_idx + 1].flags  
CPUIDLE_FLAG_TIMER_STOP)) 
drv-states[last_idx + 1].exit_latency = latency_req) {
last_state-stats.promotion_count++;
last_state-stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx  CPUIDLE_DRIVER_STATE_START 
(drv-states

Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-07 Thread Preeti U Murthy
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote:
>
>
> On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
>> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>>> Hi Preeti,
>>>
>>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>>> Hi Shilpa,
>>>>
>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>>> temperature limits. As Pmax capping is a chip level condition report
>>>>> this throttling behavior at chip level and also do not set the global
>>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>>
>>>>> This patch adds a structure to store chip id and throttled state of
>>>>> the chip.
>>>>>
>>>>> Signed-off-by: Shilpasri G Bhat shilpa.b...@linux.vnet.ibm.com
>>>>> ---
>>>>>  drivers/cpufreq/powernv-cpufreq.c | 59 ---
>>>>>  1 file changed, 55 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>>> index ebef0d8..d0c18c9 100644
>>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>>> @@ -27,6 +27,7 @@
>>>>>  #include <linux/smp.h>
>>>>>  #include <linux/of.h>
>>>>>  #include <linux/reboot.h>
>>>>> +#include <linux/slab.h>
>>>>>
>>>>>  #include <asm/cputhreads.h>
>>>>>  #include <asm/firmware.h>
>>>>> @@ -42,6 +43,13 @@
>>>>>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>>>  static bool rebooting, throttled;
>>>>>
>>>>> +static struct chip {
>>>>> +	unsigned int id;
>>>>> +	bool throttled;
>>>>> +} *chips;
>>>>> +
>>>>> +static int nr_chips;
>>>>> +
>>>>>  /*
>>>>>   * Note: The set of pstates consists of contiguous integers, the
>>>>>   * smallest of which is indicated by powernv_pstate_info.min, the
>>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>>>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>>>  {
>>>>>  	unsigned long pmsr;
>>>>> -	int pmsr_pmax, pmsr_lp;
>>>>> +	int pmsr_pmax, pmsr_lp, i;
>>>>>
>>>>>  	pmsr = get_pmspr(SPRN_PMSR);
>>>>>
>>>>> +	for (i = 0; i < nr_chips; i++)
>>>>> +		if (chips[i].id == cpu_to_chip_id(cpu))
>>>>> +			break;
>>>>> +
>>>>>  	/* Check for Pmax Capping */
>>>>>  	pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>>>  	if (pmsr_pmax != powernv_pstate_info.max) {
>>>>> -		throttled = true;
>>>>> -		pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>>> -		pr_info("Max allowed Pstate is capped\n");
>>>>> +		if (chips[i].throttled)
>>>>> +			goto next;
>>>>> +		chips[i].throttled = true;
>>>>> +		pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>>> +			chips[i].id, pmsr_pmax);
>>>>> +	} else if (chips[i].throttled) {
>>>>> +		chips[i].throttled = false;
>>>>
>>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>>
>>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>>> sufficient to decide throttling/unthrottling when OCC is active for that chip.
>>
>> Ok then we can perhaps exit after detecting unthrottling here.
>
> This won't work for older firmwares which do not clear the Frequency control
> enabled bit on an OCC reset cycle. So let us check for the remaining two
> conditions on unthrottling as well.

ok.

>
>>>>> +		pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>>> +			chips[i].id, pmsr_pmax);
>>>>>  	}
>>>>>
>>>>>  	/*
>>>>>  	 * Check for Psafe by reading LocalPstate
>>>>>  	 * or check if Psafe_mode_active is set in PMSR.
>>>>>  	 */
>>>>> +next:
>>>>>  	pmsr_lp = (s8)PMSR_LP(pmsr);
>>>>>  	if ((pmsr_lp < powernv_pstate_info.min) ||
>>>>>  	    (pmsr & PMSR_PSAFE_ENABLE)) {
>>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>>>  	.attr		= powernv_cpu_freq_attr,
>>>>
>>>> What about the situation where although occ is active, this particular
>>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>>> scenarios as well just like you have for pmsr_pmax ?
>>>
>>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>>> because of the global variable 'throttled', which is set to true on passing
>>> any of these two conditions.
>>>
>>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>>> control disabled" state. These two conditions are most likely to happen during
>>> an OCC reset cycle which will occur across all chips.
>>
>> Let us then add a comment to indicate that Psafe and frequency control
>> disabled conditions will fail *only if OCC is inactive* and not
>> otherwise and that this is a system wide phenomenon.
>
> I agree that adding a comment here will clear global vs local throttling
> scenarios, but this will contradict the architectural design of OCC wherein it
> can independently go to Psafe and frequency control disabled state. It is
> the implementation in FSP today that has made the above two states global. My
> point is adding a comment here may be confusing if at all for the future
> firmwares this implementation is changed. Having said

Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>> notification by executing *throttle_check() on any one of the cpu on
>>>> the chip. This is a sanity check to verify if we were indeed
>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>
>>>> We cannot call *throttle_check() directly from the notification
>>>> handler because we could be handling chip1's notification in chip2. So
>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>> in the notification handler, so use a worker thread to smp_call
>>>> throttle_check() on any of the cpu in the chipmask.
>>>
>>> I see that the first patch takes care of reporting *per-chip* throttling
>>> for pmax capping condition. But where are we taking care of reporting
>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>
>>
>> IMO let us not have "psafe" and "freq control disabled" states managed
>> per-chip.
>> Because when the above two conditions occur it is likely to happen across all
>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>> *throttle_check().
>
> Alright like I pointed in the previous reply, a comment to indicate that
> psafe and freq control disabled conditions will fail when occ is
> inactive and that all chips face the consequence of this will help.

From your explanation on the thread of the first patch of this series,
this will not be required.

So,
Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com

Regards
Preeti U Murthy

>
>>>> Signed-off-by: Shilpasri G Bhat shilpa.b...@linux.vnet.ibm.com
>>>> ---
>>>>  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
>>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>> index 9268424..9618813 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>>  static struct chip {
>>>>  	unsigned int id;
>>>>  	bool throttled;
>>>> +	cpumask_t mask;
>>>> +	struct work_struct throttle;
>>>>  } *chips;
>>>>
>>>>  static int nr_chips;
>>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>>>  	return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>>  }
>>>>
>>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>>  {
>>>> +	unsigned int cpu = smp_processor_id();
>>>>  	unsigned long pmsr;
>>>>  	int pmsr_pmax, pmsr_lp, i;
>>>>
>>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>>>>  		return 0;
>>>>
>>>>  	if (!throttled)
>>>> -		powernv_cpufreq_throttle_check(smp_processor_id());
>>>> +		powernv_cpufreq_throttle_check(NULL);
>>>>
>>>>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>>
>>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>>>>  	.notifier_call = powernv_cpufreq_reboot_notifier,
>>>>  };
>>>>
>>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>>> +{
>>>> +	struct chip *chip = container_of(work, struct chip, throttle);
>>>> +
>>>> +	smp_call_function_any(&chip->mask,
>>>> +			      powernv_cpufreq_throttle_check, NULL, 0);
>>>> +}
>>>> +
>>>>  static char throttle_reason[][30] = {
>>>>  	"No throttling",
>>>>  	"Power Cap",
>>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  	struct opal_msg *occ_msg = msg;
>>>>  	uint64_t token;
>>>>  	uint64_t chip_id, reason;
>>>> +	int i;
>>>>
>>>>  	if (msg_type != OPAL_MSG_OCC)
>>>>  		return 0;
>>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  		occ_reset = false;
>>>>  		throttled = false;
>>>>  		pr_info("OCC: Active\n");
>>>> +
>>>> +		for (i = 0; i < nr_chips; i++)
>>>> +			schedule_work(&chips[i].throttle);
>>>> +
>>>>  		return 0;
>>>>  	}
>>>>
>>>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>>  	else if (!reason)
>>>>  		pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>>>>  			throttle_reason[reason]);
>>>> +	else
>>>> +		return 0;
>>>
>>> Why the else section ? The code can never reach here, can it ?
>>
>> When reason > 5, we don't want to handle it.
>
> Of course! My bad!
>
>>
>>>> +
>>>> +	for (i = 0; i < nr_chips; i++)
>>>> +		if (chips[i].id == chip_id)
>>>> +			schedule_work(&chips[i].throttle);
>>>>  	}
>>>
>>> Should we not do this only when we get unthrottled so as to cross verify
>>> if it is indeed the case ? In case of throttling notification, opal's
>>> verdict is final and there is no need to cross verify right ?
>>
>> Two reasons for invoking
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-07 Thread Preeti U Murthy
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
>> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
>>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>>>> Hi Preeti,
>>>>
>>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>>>> Hi Shilpa,
>>>>>
>>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>>>>> notification by executing *throttle_check() on any one of the cpu on
>>>>>> the chip. This is a sanity check to verify if we were indeed
>>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>>>
>>>>>> We cannot call *throttle_check() directly from the notification
>>>>>> handler because we could be handling chip1's notification in chip2. So
>>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>>>> in the notification handler, so use a worker thread to smp_call
>>>>>> throttle_check() on any of the cpu in the chipmask.
>>>>>
>>>>> I see that the first patch takes care of reporting *per-chip* throttling
>>>>> for pmax capping condition. But where are we taking care of reporting
>>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>>>
>>>>
>>>> IMO let us not have "psafe" and "freq control disabled" states managed
>>>> per-chip.
>>>> Because when the above two conditions occur it is likely to happen across
>>>> all chips during an OCC reset cycle. So I am setting 'throttled' to false on
>>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>>>> *throttle_check().
>>>
>>> Alright like I pointed in the previous reply, a comment to indicate that
>>> psafe and freq control disabled conditions will fail when occ is
>>> inactive and that all chips face the consequence of this will help.
>>
>> From your explanation on the thread of the first patch of this series,
>> this will not be required.
>>
>> So,
>> Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com
>
> OK, so is the whole series reviewed now?

Yes the whole series has been reviewed.

Regards
Preeti U Murthy


[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-07 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state where in ticks do not stop. To
accommodate this change move the setting of runqueue idle state from the
core to the cpuidle driver, else the rq->idle_state will be set wrong.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
Rebased on the latest linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 61c417b..8f5657e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	 * local timer will be shut down.  If a local timer is used from another
 	 * CPU as a broadcast timer, this call may fail if it is not available.
 	 */
-	if (broadcast && tick_broadcast_enter())
-		return -EBUSY;
+	if (broadcast && tick_broadcast_enter()) {
+		index = cpuidle_select(drv, dev, !broadcast);
+		if (index < 0)
+			return -EBUSY;
+		target_state = &drv->states[index];
+	}
+
+	/* Take note of the planned idle state. */
+	idle_set_state(smp_processor_id(), target_state);
 
 	trace_cpu_idle_rcuidle(index, dev->cpu);
 	time_start = ktime_get();
@@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	time_end = ktime_get();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+	/* The cpu is no longer idle or about to enter idle. */
+	idle_set_state(smp_processor_id(), NULL);
+
 	if (broadcast) {
 		if (WARN_ON_ONCE(!irqs_disabled()))
 			local_irq_disable();
@@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+		   struct cpuidle_device *dev, int timer_stop_valid)
 {
-	return cpuidle_curr_governor->select(drv, dev);
+	return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-				struct cpuidle_device *dev)
+				struct cpuidle_device *dev, int timer_stop_valid)
 {
 	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
 	struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
 	    !drv->states[last_idx + 1].disabled &&
 	    !dev->states_usage[last_idx + 1].disable &&
 	    last_residency > last_state->threshold.promotion_time &&
+	    !(!timer_stop_valid && (drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) &&
 	    drv->states[last_idx + 1].exit_latency <= latency_req) {
 		last_state->stats.promotion_count++;
 		last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
 	if (last_idx > CPUIDLE_DRIVER_STATE_START &&
 	    (drv->states[last_idx].disabled
[PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully

2015-05-06 Thread Preeti U Murthy
When a CPU has to enter an idle state where tick stops, it makes a call
to tick_broadcast_enter(). The call will fail if this CPU is the
broadcast CPU. Today, under such a circumstance, the arch cpuidle code
handles this CPU.  This is not convincing because not only are we not
aware what the arch cpuidle code does, but we also do not account for
the idle state residency time and usage of such a CPU.

This scenario can be handled better by simply asking the cpuidle
governor to choose an idle state where in ticks do not stop. To
accommodate this change move the setting of runqueue idle state from the
core to the cpuidle driver, else the rq->idle_state will be set wrong.

Signed-off-by: Preeti U Murthy 
---
Based on linux-pm/bleeding-edge

 drivers/cpuidle/cpuidle.c  |   21 +
 drivers/cpuidle/governors/ladder.c |   13 ++---
 drivers/cpuidle/governors/menu.c   |6 +-
 include/linux/cpuidle.h|6 +++---
 include/linux/sched.h  |   16 
 kernel/sched/core.c|   17 +
 kernel/sched/fair.c|2 +-
 kernel/sched/idle.c|8 +---
 kernel/sched/sched.h   |   24 
 9 files changed, 70 insertions(+), 43 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 61c417b..8f5657e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include <linux/module.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
 * local timer will be shut down.  If a local timer is used from another
 * CPU as a broadcast timer, this call may fail if it is not available.
 */
-   if (broadcast && tick_broadcast_enter())
-   return -EBUSY;
+   if (broadcast && tick_broadcast_enter()) {
+   index = cpuidle_select(drv, dev, !broadcast);
+   if (index < 0)
+   return -EBUSY;
+   target_state = &drv->states[index];
+   }
+
+   /* Take note of the planned idle state. */
+   idle_set_state(smp_processor_id(), target_state);
 
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ktime_get();
@@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
time_end = ktime_get();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
+   /* The cpu is no longer idle or about to enter idle. */
+   idle_set_state(smp_processor_id(), NULL);
+
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
@@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
  *
  * @drv: the cpuidle driver
  * @dev: the cpuidle device
+ * @timer_stop_valid: allow selection of idle state where tick stops
  *
  * Returns the index of the idle state.
  */
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv,
+   struct cpuidle_device *dev, int timer_stop_valid)
 {
-   return cpuidle_curr_governor->select(drv, dev);
+   return cpuidle_curr_governor->select(drv, dev, timer_stop_valid);
 }
 
 /**
diff --git a/drivers/cpuidle/governors/ladder.c 
b/drivers/cpuidle/governors/ladder.c
index 401c010..c437322 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device 
*ldev,
  * ladder_select_state - selects the next state to enter
  * @drv: cpuidle driver
  * @dev: the CPU
+ * @timer_stop_valid: allow selection of idle state where tick stops
  */
 static int ladder_select_state(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev)
+   struct cpuidle_device *dev, int 
timer_stop_valid)
 {
	struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct ladder_device_state *last_state;
@@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv,
!drv->states[last_idx + 1].disabled &&
!dev->states_usage[last_idx + 1].disable &&
last_residency > last_state->threshold.promotion_time &&
+   !(!timer_stop_valid && (drv->states[last_idx + 1].flags & 
CPUIDLE_FLAG_TIMER_STOP)) &&
drv->states[last_idx + 1].exit_latency <= latency_req) {
last_state->stats.promotion_count++;
last_state->stats.demotion_count = 0;
@@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv,
if (last_idx > CPUIDLE_DRIVER_STATE_START &&
(dr
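
The control flow the changelog above describes (retry the governor's selection
when tick_broadcast_enter() fails) can be modeled outside the kernel. The
sketch below uses hypothetical stand-ins, not the kernel API, just to show the
two-step selection:

#include <stdio.h>

/* Illustrative userspace model only; every name here is a made-up
 * stand-in for the kernel functions named in the patch. */
static int broadcast_enter_fails(void) { return 1; /* broadcast CPU */ }

static int select_state(int timer_stop_valid)
{
	/* 2 models a deep, tick-stopping state; 0 a shallow one that
	 * keeps the tick running. -1 would model "no valid state". */
	return timer_stop_valid ? 2 : 0;
}

static int enter_idle(int broadcast)
{
	int index = select_state(1);

	if (broadcast && broadcast_enter_fails()) {
		/* Fall back to a state where the tick is not stopped. */
		index = select_state(0);
		if (index < 0)
			return -1;	/* -EBUSY in the patch */
	}
	printf("entering state %d\n", index);
	return 0;
}

int main(void) { return enter_idle(1); }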


Re: [PATCH v3 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

2015-05-05 Thread Preeti U Murthy
On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> If frequency is throttled due to OCC reset then cpus will be in Psafe
> frequency, so restore the frequency on all cpus to policy->cur when
> OCCs are active again. And if frequency is throttled due to Pmax
> capping then restore the frequency of all the cpus in the chip on
> unthrottling.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 31 +--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 0a59d5b..b2915bc 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -51,6 +51,7 @@ static struct chip {
>   bool throttled;
>   cpumask_t mask;
>   struct work_struct throttle;
> + bool restore;
>  } *chips;
> 
>  static int nr_chips;
> @@ -418,9 +419,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = 
> {
>  void powernv_cpufreq_work_fn(struct work_struct *work)
>  {
>   struct chip *chip = container_of(work, struct chip, throttle);
> + unsigned int cpu;
> + cpumask_var_t mask;
> 
>   smp_call_function_any(&chip->mask,
> powernv_cpufreq_throttle_check, NULL, 0);
> +
> + if (!chip->restore)
> + return;
> +
> + chip->restore = false;
> + cpumask_copy(mask, &chip->mask);
> + for_each_cpu_and(cpu, mask, cpu_online_mask) {
> + int index, tcpu;
> + struct cpufreq_policy policy;
> +
> + cpufreq_get_policy(&policy, cpu);
> + cpufreq_frequency_table_target(&policy, policy.freq_table,
> +policy.cur,
> +CPUFREQ_RELATION_C, &index);
> + powernv_cpufreq_target_index(&policy, index);
> + for_each_cpu(tcpu, policy.cpus)
> + cpumask_clear_cpu(tcpu, mask);
> + }
>  }
> 
>  static char throttle_reason[][30] = {
> @@ -473,8 +494,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block 
> *nb,
>   throttled = false;
>   pr_info("OCC: Active\n");
> 
> - for (i = 0; i < nr_chips; i++)
> + for (i = 0; i < nr_chips; i++) {
> + chips[i].restore = true;
>   schedule_work(&chips[i].throttle);
> + }
> 
>   return 0;
>   }
> @@ -490,8 +513,11 @@ static int powernv_cpufreq_occ_msg(struct notifier_block 
> *nb,
>   return 0;
> 
>   for (i = 0; i < nr_chips; i++)
> - if (chips[i].id == chip_id)
> + if (chips[i].id == chip_id) {
> + if (!reason)
> + chips[i].restore = true;
>   schedule_work(&chips[i].throttle);
> + }
>   }
>   return 0;
>  }
> @@ -545,6 +571,7 @@ static int init_chip_info(void)
>   chips[i].throttled = false;
>   cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
>   INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
> + chips[i].restore = false;
>   }
> 
>   return 0;
> 

Reviewed-by: Preeti U Murthy 
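
The restore loop in this patch clears every cpu of a policy from the working
mask so that each policy is re-programmed only once. A small userspace model of
that bookkeeping, where the bitmask and the policy table are invented sample
data rather than anything from the driver:

#include <stdio.h>

int main(void)
{
	unsigned int online = 0xff;	/* cpus 0-7 online */
	unsigned int pending = online;	/* models chip->mask & online */
	int policy_of[8] = { 0, 0, 0, 0, 1, 1, 1, 1 }; /* cpu -> policy */
	int cpu, tcpu;

	for (cpu = 0; cpu < 8; cpu++) {
		if (!(pending & (1u << cpu)))
			continue;
		printf("restore policy %d via cpu %d\n", policy_of[cpu], cpu);
		/* Clear every cpu sharing this policy so the frequency is
		 * re-requested only once per policy, as in the patch. */
		for (tcpu = 0; tcpu < 8; tcpu++)
			if (policy_of[tcpu] == policy_of[cpu])
				pending &= ~(1u << tcpu);
	}
	return 0;
}

Two policies, two restore requests: the inner loop is what prevents one
request per cpu.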



Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-05 Thread Preeti U Murthy
On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> Hi Preeti,
> 
> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>> Hi Shilpa,
>>
>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
>>> notification by executing *throttle_check() on any one of the cpu on
>>> the chip. This is a sanity check to verify if we were indeed
>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>
>>> We cannot call *throttle_check() directly from the notification
>>> handler because we could be handling chip1's notification in chip2. So
>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>> in the notification handler, so use a worker thread to smp_call
>>> throttle_check() on any of the cpu in the chipmask.
>>
>> I see that the first patch takes care of reporting *per-chip* throttling
>> for pmax capping condition. But where are we taking care of reporting
>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>
> 
> IMO let us not have "psafe" and "freq control disabled" states managed 
> per-chip.
> Because when the above two conditions occur it is likely to happen across all
> chips during an OCC reset cycle. So I am setting 'throttled' to false on
> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> *throttle_check().

Alright like I pointed in the previous reply, a comment to indicate that
psafe and freq control disabled conditions will fail when occ is
inactive and that all chips face the consequence of this will help.

> 
>>>
>>> Signed-off-by: Shilpasri G Bhat 
>>> ---
>>>  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
>>> b/drivers/cpufreq/powernv-cpufreq.c
>>> index 9268424..9618813 100644
>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>  static struct chip {
>>> unsigned int id;
>>> bool throttled;
>>> +   cpumask_t mask;
>>> +   struct work_struct throttle;
>>>  } *chips;
>>>
>>>  static int nr_chips;
>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>> return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>  }
>>>
>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>  {
>>> +   unsigned int cpu = smp_processor_id();
>>> unsigned long pmsr;
>>> int pmsr_pmax, pmsr_lp, i;
>>>
>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct 
>>> cpufreq_policy *policy,
>>> return 0;
>>>
>>> if (!throttled)
>>> -   powernv_cpufreq_throttle_check(smp_processor_id());
>>> +   powernv_cpufreq_throttle_check(NULL);
>>>
>>> freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>
>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb 
>>> = {
>>> .notifier_call = powernv_cpufreq_reboot_notifier,
>>>  };
>>>
>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>> +{
>>> +   struct chip *chip = container_of(work, struct chip, throttle);
>>> +
>>> + smp_call_function_any(&chip->mask,
>>> +   powernv_cpufreq_throttle_check, NULL, 0);
>>> +}
>>> +
>>>  static char throttle_reason[][30] = {
>>> "No throttling",
>>> "Power Cap",
>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct 
>>> notifier_block *nb,
>>> struct opal_msg *occ_msg = msg;
>>> uint64_t token;
>>> uint64_t chip_id, reason;
>>> +   int i;
>>>
>>> if (msg_type != OPAL_MSG_OCC)
>>> return 0;
>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct 
>>> notifier_block *nb,
>>> occ_reset = false;
>>> throttled = false;
>>> pr_info("OCC: Active\n");
>>> +
>>> +   for (
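
The constraint discussed in this changelog — the notifier runs with interrupts
disabled, so it may only schedule work, and a worker later runs the check on a
cpu of the chip — can be sketched with pthreads as rough stand-ins. None of
these names are the kernel API; build with -lpthread:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int work_pending;

static void throttle_check(void) { printf("throttle re-check on chip\n"); }

/* Models the worker that smp_calls throttle_check() on the chip. */
static void *worker(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!work_pending)
		pthread_cond_wait(&cond, &lock);
	work_pending = 0;
	pthread_mutex_unlock(&lock);
	throttle_check();
	return NULL;
}

/* Models the irq-disabled notification handler: it must not do the
 * check itself, it only marks work pending and wakes the worker. */
static void occ_msg_handler(void)
{
	pthread_mutex_lock(&lock);
	work_pending = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	occ_msg_handler();
	pthread_join(t, NULL);
	return 0;
}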

[tip:timers/core] tick-broadcast: Fix the printing of broadcast masks

2015-05-05 Thread tip-bot for Preeti U Murthy
Commit-ID:  1ef09cd713c90781b683a0b4e0a874803c172b1d
Gitweb: http://git.kernel.org/tip/1ef09cd713c90781b683a0b4e0a874803c172b1d
Author: Preeti U Murthy 
AuthorDate: Tue, 28 Apr 2015 14:15:20 +0530
Committer:  Thomas Gleixner 
CommitDate: Tue, 5 May 2015 10:35:58 +0200

tick-broadcast: Fix the printing of broadcast masks

Today the number of bits of the broadcast masks that is output into
/proc/timer_list is sizeof(unsigned long). This means that on machines
with a larger number of CPUs, the bitmasks of CPUs beyond this range do
not appear.

Fix this by using bitmap printing through "%*pb" instead, so as to
output the broadcast masks for the range of nr_cpu_ids into
/proc/timer_list.

Signed-off-by: Preeti U Murthy 
Cc: pet...@infradead.org
Cc: linuxppc-...@ozlabs.org
Cc: john.stu...@linaro.org
Link: http://lkml.kernel.org/r/20150428084520.3314.62668.st...@preeti.in.ibm.com
Signed-off-by: Thomas Gleixner 
---
 kernel/time/timer_list.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 66f39bb..18b074b 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -276,11 +276,11 @@ static void timer_list_show_tickdevices_header(struct 
seq_file *m)
 {
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
print_tickdevice(m, tick_get_broadcast_device(), -1);
-   SEQ_printf(m, "tick_broadcast_mask: %08lx\n",
-  cpumask_bits(tick_get_broadcast_mask())[0]);
+   SEQ_printf(m, "tick_broadcast_mask: %*pb\n",
+  cpumask_pr_args(tick_get_broadcast_mask()));
 #ifdef CONFIG_TICK_ONESHOT
-   SEQ_printf(m, "tick_broadcast_oneshot_mask: %08lx\n",
-  cpumask_bits(tick_get_broadcast_oneshot_mask())[0]);
+   SEQ_printf(m, "tick_broadcast_oneshot_mask: %*pb\n",
+  cpumask_pr_args(tick_get_broadcast_oneshot_mask()));
 #endif
SEQ_printf(m, "\n");
 #endif
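
For readers unfamiliar with "%*pb": it prints a bitmap out to a width of
nr_cpu_ids bits rather than just the first unsigned long. A loose userspace
approximation of the effect, with invented sample data (the kernel's exact
output format differs, and 64-bit unsigned long is assumed here):

#include <stdio.h>

int main(void)
{
	unsigned long bits[2] = { 0xfful, 0x3ul }; /* cpus 0-7 and 64-65 set */
	int nr_cpu_ids = 66;
	int word, nwords = (nr_cpu_ids + 63) / 64;

	/* Print every word covering nr_cpu_ids, most significant first,
	 * instead of only bits[0] as the old %08lx format did. */
	printf("tick_broadcast_mask: ");
	for (word = nwords - 1; word >= 0; word--)
		printf("%016lx%s", bits[word], word ? "," : "\n");
	return 0;
}

With the old single-word format the cpus at bit 64 and 65 would simply never
appear in /proc/timer_list.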


Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-05 Thread Preeti U Murthy
On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
> Hi Preeti,
> 
> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>> Hi Shilpa,
>>
>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>> max allowed frequency for that chip if the chip exceeds its power or
>>> temperature limits. As Pmax capping is a chip level condition report
>>> this throttling behavior at chip level and also do not set the global
>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>
>>> This patch adds a structure to store chip id and throttled state of
>>> the chip.
>>>
>>> Signed-off-by: Shilpasri G Bhat 
>>> ---
>>>  drivers/cpufreq/powernv-cpufreq.c | 59 
>>> ---
>>>  1 file changed, 55 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
>>> b/drivers/cpufreq/powernv-cpufreq.c
>>> index ebef0d8..d0c18c9 100644
>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>> @@ -27,6 +27,7 @@
>>>  #include <linux/smp.h>
>>>  #include <linux/of.h>
>>>  #include <linux/reboot.h>
>>> +#include <linux/slab.h>
>>>
>>>  #include <asm/cputhreads.h>
>>>  #include <asm/firmware.h>
>>> @@ -42,6 +43,13 @@
>>>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>  static bool rebooting, throttled;
>>>
>>> +static struct chip {
>>> +   unsigned int id;
>>> +   bool throttled;
>>> +} *chips;
>>> +
>>> +static int nr_chips;
>>> +
>>>  /*
>>>   * Note: The set of pstates consists of contiguous integers, the
>>>   * smallest of which is indicated by powernv_pstate_info.min, the
>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>  {
>>> unsigned long pmsr;
>>> -   int pmsr_pmax, pmsr_lp;
>>> +   int pmsr_pmax, pmsr_lp, i;
>>>
>>> pmsr = get_pmspr(SPRN_PMSR);
>>>
>>> +   for (i = 0; i < nr_chips; i++)
>>> +   if (chips[i].id == cpu_to_chip_id(cpu))
>>> +   break;
>>> +
>>> /* Check for Pmax Capping */
>>> pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>> if (pmsr_pmax != powernv_pstate_info.max) {
>>> -   throttled = true;
>>> -   pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>> -   pr_info("Max allowed Pstate is capped\n");
>>> +   if (chips[i].throttled)
>>> +   goto next;
>>> +   chips[i].throttled = true;
>>> +   pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>> +   chips[i].id, pmsr_pmax);
>>> +   } else if (chips[i].throttled) {
>>> +   chips[i].throttled = false;
>>
>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
> 
> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
> sufficient to decide throttling/unthrottling when OCC is active for that chip.

Ok then we can perhaps exit after detecting unthrottling here.
> 
>>
>>> +   pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>> +   chips[i].id, pmsr_pmax);
>>> }
>>>
>>> /*
>>>  * Check for Psafe by reading LocalPstate
>>>  * or check if Psafe_mode_active is set in PMSR.
>>>  */
>>> +next:
>>> pmsr_lp = (s8)PMSR_LP(pmsr);
>>> if ((pmsr_lp < powernv_pstate_info.min) ||
>>> (pmsr & PMSR_PSAFE_ENABLE)) {
>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>> .attr   = powernv_cpu_freq_attr,
>>
>> What about the situation where although occ is active, this particular
>> chip has been throttled and we end up repeatedly reporting "pstate set
>> to safe" and "frequency control disabled from OS" ? Should we not have a
>> check on (chips[i].throttled) before reporting an anomaly for these two
>> scenarios as well just like you have for pmsr_pmax ?
> 
> We will not have "Psafe" and "frequency control disabled" repeatedly printed
> because of global variable 'throttled', which is set to true on passing any of
> these two conditions.
> 
> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
> control disabled" state. These two conditions are most likely to happen during
> an OCC reset cycle which will occur across all chips.

Let us then add a comment to indicate that Psafe and frequency control
disabled conditions will fail *only if OCC is inactive* and not
otherwise and that this is a system wide phenomenon.

Regards
Preeti U Murthy

> 
> Thanks and Regards,
> Shilpa
> 
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 







Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

2015-05-04 Thread Preeti U Murthy
Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE
> notification by executing *throttle_check() on any one of the cpu on
> the chip. This is a sanity check to verify if we were indeed
> throttled/unthrottled after receiving OCC_THROTTLE notification.
> 
> We cannot call *throttle_check() directly from the notification
> handler because we could be handling chip1's notification in chip2. So
> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> in the notification handler, so use a worker thread to smp_call
> throttle_check() on any of the cpu in the chipmask.

I see that the first patch takes care of reporting *per-chip* throttling
for pmax capping condition. But where are we taking care of reporting
"pstate set to safe" and "freq control disabled" scenarios per-chip ?

> 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 28 ++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 9268424..9618813 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>  static struct chip {
>   unsigned int id;
>   bool throttled;
> + cpumask_t mask;
> + struct work_struct throttle;
>  } *chips;
> 
>  static int nr_chips;
> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>   return powernv_pstate_info.max - powernv_pstate_info.nominal;
>  }
> 
> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
> +static void powernv_cpufreq_throttle_check(void *data)
>  {
> + unsigned int cpu = smp_processor_id();
>   unsigned long pmsr;
>   int pmsr_pmax, pmsr_lp, i;
> 
> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct 
> cpufreq_policy *policy,
>   return 0;
> 
>   if (!throttled)
> - powernv_cpufreq_throttle_check(smp_processor_id());
> + powernv_cpufreq_throttle_check(NULL);
> 
>   freq_data.pstate_id = powernv_freqs[new_index].driver_data;
> 
> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = 
> {
>   .notifier_call = powernv_cpufreq_reboot_notifier,
>  };
> 
> +void powernv_cpufreq_work_fn(struct work_struct *work)
> +{
> + struct chip *chip = container_of(work, struct chip, throttle);
> +
> + smp_call_function_any(&chip->mask,
> +   powernv_cpufreq_throttle_check, NULL, 0);
> +}
> +
>  static char throttle_reason[][30] = {
>   "No throttling",
>   "Power Cap",
> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block 
> *nb,
>   struct opal_msg *occ_msg = msg;
>   uint64_t token;
>   uint64_t chip_id, reason;
> + int i;
> 
>   if (msg_type != OPAL_MSG_OCC)
>   return 0;
> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block 
> *nb,
>   occ_reset = false;
>   throttled = false;
>   pr_info("OCC: Active\n");
> +
> + for (i = 0; i < nr_chips; i++)
> + schedule_work(&chips[i].throttle);
> +
>   return 0;
>   }
> 
> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block 
> *nb,
>   else if (!reason)
>   pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>   throttle_reason[reason]);
> + else
> + return 0;

Why the else section ? The code can never reach here, can it ?

> +
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == chip_id)
> + schedule_work(&chips[i].throttle);
>   }

Should we not do this only when we get unthrottled so as to cross verify
if it is indeed the case ? In case of throttling notification, opal's
verdict is final and there is no need to cross verify right ?

Perhaps the one thing that needs to be taken care in addition to
reporting throttling is setting the chip's throttled parameter to true.
This should do right ? I don't see the need to call throttle_check() here.

Regards
Preeti U Murthy


>   return 0;
>  }
> @@ -527,6 +549,8 @@ static int init_chip_info(void)
>   for (i = 0; i < nr_chips; i++) {
>   chips[i].id = chip[i];

Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

2015-05-04 Thread Preeti U Murthy
Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> The On-Chip-Controller (OCC) can throttle cpu frequency by reducing
> the max allowed frequency for a chip if the chip exceeds its power or
> temperature limits. As Pmax capping is a chip-level condition, report
> this throttling behavior at chip level; also, do not set the global
> 'throttled' on Pmax capping, but set the per-chip throttled variable
> instead. Report unthrottling if Pmax is restored after throttling.
> 
> This patch adds a structure to store the chip id and throttled state
> of the chip.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index ebef0d8..d0c18c9 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -27,6 +27,7 @@
>  #include <linux/smp.h>
>  #include <linux/of.h>
>  #include <linux/reboot.h>
> +#include <linux/slab.h>
> 
>  #include <asm/cputhreads.h>
>  #include <asm/firmware.h>
> @@ -42,6 +43,13 @@
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static bool rebooting, throttled;
> 
> +static struct chip {
> + unsigned int id;
> + bool throttled;
> +} *chips;
> +
> +static int nr_chips;
> +
>  /*
>   * Note: The set of pstates consists of contiguous integers, the
>   * smallest of which is indicated by powernv_pstate_info.min, the
> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>  {
>   unsigned long pmsr;
> - int pmsr_pmax, pmsr_lp;
> + int pmsr_pmax, pmsr_lp, i;
> 
>   pmsr = get_pmspr(SPRN_PMSR);
> 
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == cpu_to_chip_id(cpu))
> + break;
> +
>   /* Check for Pmax Capping */
>   pmsr_pmax = (s8)PMSR_MAX(pmsr);
>   if (pmsr_pmax != powernv_pstate_info.max) {
> - throttled = true;
> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
> - pr_info("Max allowed Pstate is capped\n");
> + if (chips[i].throttled)
> + goto next;
> + chips[i].throttled = true;
> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
> + } else if (chips[i].throttled) {
> + chips[i].throttled = false;

Is this check on pmax sufficient to indicate that the chip is unthrottled?

> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
>   }
> 
>   /*
>* Check for Psafe by reading LocalPstate
>* or check if Psafe_mode_active is set in PMSR.
>*/
> +next:
>   pmsr_lp = (s8)PMSR_LP(pmsr);
>   if ((pmsr_lp < powernv_pstate_info.min) ||
>   (pmsr & PMSR_PSAFE_ENABLE)) {
> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>   .attr   = powernv_cpu_freq_attr,

What about the situation where, although the OCC is active, this
particular chip has been throttled and we end up repeatedly reporting
"pstate set to safe" and "frequency control disabled from OS"? Should
we not have a check on chips[i].throttled before reporting an anomaly
in these two scenarios as well, just like you have for pmsr_pmax?
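
Something like the below, perhaps (an untested sketch; 'i' is the chip
index computed at the top of throttle_check(), and I am assuming the
message strings stay as your patch already prints them):

	if ((pmsr & PMSR_PSAFE_ENABLE) && !chips[i].throttled) {
		throttled = true;
		pr_info("Pstate set to safe frequency\n");
	}

	if ((pmsr & PMSR_SPR_EM_DISABLE) && !chips[i].throttled) {
		throttled = true;
		pr_info("Frequency Control disabled from OS\n");
	}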

Regards
Preeti U Murthy



Re: [PATCH v3 5/6] cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is set

2015-05-04 Thread Preeti U Murthy
On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> On a reset cycle of the OCC, although the system exits the safe
> frequency state, the local pstate is not restored to Pmin or the last
> requested pstate. Now if the cpufreq governor initiates a pstate
> change, the local pstate will be in Psafe and we will report a false
> positive even though we are not throttled.
> 
> So in powernv_cpufreq_throttle_check() remove the condition which
> checks whether the local pstate is less than Pmin while checking for
> the Psafe frequency. If the cpus are forced to Psafe, the
> PMSR.psafe_mode_active bit will be set, and it will be cleared when
> the OCCs become active again. Let us just rely on this bit for
> reporting throttling.
> 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 9618813..0a59d5b 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -39,7 +39,6 @@
>  #define PMSR_PSAFE_ENABLE	(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE	(1UL << 31)
>  #define PMSR_MAX(x)	((x >> 32) & 0xFF)
> -#define PMSR_LP(x)	((x >> 48) & 0xFF)
>  #define OCC_RESET	0
>  #define OCC_LOAD	1
>  #define OCC_THROTTLE	2
> @@ -316,7 +315,7 @@ static void powernv_cpufreq_throttle_check(void *data)
>  {
>   unsigned int cpu = smp_processor_id();
>   unsigned long pmsr;
> - int pmsr_pmax, pmsr_lp, i;
> + int pmsr_pmax, i;
>  
>   pmsr = get_pmspr(SPRN_PMSR);
>  
> @@ -338,14 +337,9 @@ static void powernv_cpufreq_throttle_check(void *data)
>   chips[i].id, pmsr_pmax);
>   }
>  
> - /*
> -  * Check for Psafe by reading LocalPstate
> -  * or check if Psafe_mode_active is set in PMSR.
> -  */
> + /* Check if Psafe_mode_active is set in PMSR. */
>  next:
> - pmsr_lp = (s8)PMSR_LP(pmsr);
> - if ((pmsr_lp < powernv_pstate_info.min) ||
> - (pmsr & PMSR_PSAFE_ENABLE)) {
> + if (pmsr & PMSR_PSAFE_ENABLE) {
>   throttled = true;
>   pr_info("Pstate set to safe frequency\n");
>   }
> 

Reviewed-by: Preeti U Murthy 



Re: [PATCH v3 3/6] cpufreq: powernv: Register for OCC related opal_message notification

2015-05-04 Thread Preeti U Murthy
> + if (!throttled) {
> + throttled = true;
> + pr_crit("CPU Frequency is throttled\n");
> + }
> + pr_info("OCC: Reset\n");
> + break;
> + case OCC_LOAD:
> + pr_info("OCC: Loaded\n");
> + break;
> + case OCC_THROTTLE:
> + chip_id = be64_to_cpu(occ_msg->params[1]);
> + reason = be64_to_cpu(occ_msg->params[2]);
> +
> + if (occ_reset) {
> + occ_reset = false;
> + throttled = false;
> + pr_info("OCC: Active\n");
> + return 0;
> + }
> +
> + if (reason && reason <= 5)
> + pr_info("OCC: Chip %u Pmax reduced due to %s\n",
> + (unsigned int)chip_id,
> + throttle_reason[reason]);
> + else if (!reason)
> + pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
> + throttle_reason[reason]);
> + }
> + return 0;
> +}
> +
> +static struct notifier_block powernv_cpufreq_opal_nb = {
> + .notifier_call  = powernv_cpufreq_occ_msg,
> + .next   = NULL,
> + .priority   = 0,
> +};
> +
>  static void powernv_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
>   struct powernv_smp_call_data freq_data;
> @@ -481,6 +553,7 @@ static int __init powernv_cpufreq_init(void)
>   return rc;
> 
>   register_reboot_notifier(&powernv_cpufreq_reboot_nb);
> + opal_message_notifier_register(OPAL_MSG_OCC, &powernv_cpufreq_opal_nb);
>   return cpufreq_register_driver(&powernv_cpufreq_driver);
>  }
>  module_init(powernv_cpufreq_init);
> 

Looks good.

Reviewed-by: Preeti U Murthy 



  1   2   3   4   5   6   7   8   9   10   >