Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
On 06/26/2015 06:08 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
>>> 	if (state == TICK_BROADCAST_ENTER) {
>>> +		/*
>>> +		 * If the current CPU owns the hrtimer broadcast
>>> +		 * mechanism, it cannot go deep idle and we do not add
>>> +		 * the CPU to the broadcast mask. We don't have to go
>>> +		 * through the EXIT path as the local timer is not
>>> +		 * shutdown.
>>> +		 */
>>> +		ret = broadcast_needs_cpu(bc, cpu);
>>> +
>>> +		/*
>>> +		 * If the hrtimer broadcast check tells us that the
>>> +		 * cpu cannot go deep idle, or if the broadcast device
>>> +		 * is in periodic mode, we just return.
>>> +		 */
>>> +		if (ret || tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>>> +			goto out;
>>
>> The check on PERIODIC mode is good, but I don't see the point of moving
>> broadcast_needs_cpu() up above. broadcast_shutdown_local() calls
>> broadcast_needs_cpu() internally.
>>
>> Besides, by jumping to 'out', we will miss programming the broadcast
>> hrtimer in tick_broadcast_set_event() below, if the cpu happens to be
>> the broadcast cpu (which is why it was not allowed to go to deep idle).
>
> Hmm. Need to think a bit more about this convoluted maze ...

I think you cover all cases just by having that check on periodic mode.
This covers nohz_full=n, highres=n, TICK_ONESHOT=y and
GENERIC_CLOCKEVENTS_BROADCAST=y. broadcast_needs_cpu() should remain
where it was, though. And of course, the additional patch on
tick_broadcast_device.evtdev == NULL in BROADCAST_ON.

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
On 06/26/2015 05:20 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
>>> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>>>> What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
>>>> TICK_ONESHOT=n (HZ_PERIODIC=y)? Have you tested this?
>>>>
>>>> This will hang the kernel at boot if you are using the hrtimer mode
>>>> of broadcast. This is because the local timers of all cpus are shut
>>>> down when the cpuidle driver registers itself, on finding out that
>>>> there are idle states where local tick devices stop. The broadcast
>>>> tick device is then in charge of waking up the cpus at every period.
>>>> In hrtimer mode of broadcast, there is no such real device and we hang.
>>>
>>> Hmm, no. tick-broadcast-hrtimer.o depends on TICK_ONESHOT=y. So this
>>> is covered by the check for the broadcast device, which is NULL.
>>>
>>> But there is another variant:
>>>
>>> GENERIC_CLOCKEVENTS_BROADCAST=y, TICK_ONESHOT=y and 'highres=off
>>> nohz=off' on the kernel command line.
>>
>> Can this happen at all? It is during tick_init_highres() or
>> tick_nohz_switch_to_nohz() that we switch to oneshot mode, not
>> otherwise AFAICT.
>
> And how does that matter? If 'highres=off nohz=off' is on the kernel
> command line none of the switchovers happens. So the system stays in
> periodic mode and the broadcast hrtimer thing is registered, right?

Yes, we are good here. I overlooked the fact that we could disable high
resolution/nohz just before boot.

>> I was actually talking of the following scenario. In periodic mode,
>> where GENERIC_CLOCKEVENTS_BROADCAST=y, the arch can execute
>> tick_setup_hrtimer_broadcast(), which will return nothing as you point
>> out above. So there is no broadcast clockevent device.
>>
>> When the cpuidle driver registers with the cpuidle core however,
>> cpuidle_setup_broadcast_timer() is executed on every cpu if it finds
>> that there is an idle state where ticks stop.
>>
>> cpuidle_setup_broadcast_timer()
>>   tick_broadcast_enable()
>>     tick_broadcast_control(BROADCAST_ON)
>>       bc = tick_broadcast_device.evtdev, which is NULL in this case
>>
>>       TICK_BROADCAST_ON:
>>         checks for periodic mode of the broadcast device - succeeds
>>         although we haven't registered a broadcast device, because the
>>         value of TICKDEV_PERIODIC is 0, the default value of td.mode.
>>
>>         clockevents_shutdown(dev)
>>
>> At this point all cpus stop.
>
> Right. That's a different one, if tick_broadcast_device.evtdev == NULL.
>
> We should not stop any cpu local timer in that case. Combined with the
> patch I sent, we prevent the idle stuff from going into a state where
> the cpu local timers stop.

Right. We need a check above too.

Regards
Preeti U Murthy

> Thanks,
>
> 	tglx
Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index d39f32cdd1b5..281ce29d295e 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -662,46 +662,39 @@ static void broadcast_shutdown_local(struct clock_event_device *bc,
>  	clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
>  }
>  
> -/**
> - * tick_broadcast_oneshot_control - Enter/exit broadcast oneshot mode
> - * @state:	The target state (enter/exit)
> - *
> - * The system enters/leaves a state, where affected devices might stop
> - * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
> - *
> - * Called with interrupts disabled, so clockevents_lock is not
> - * required here because the local clock event device cannot go away
> - * under us.
> - */
> -int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
> +int __tick_broadcast_oneshot_control(enum tick_broadcast_state state)
>  {
>  	struct clock_event_device *bc, *dev;
> -	struct tick_device *td;
>  	int cpu, ret = 0;
>  	ktime_t now;
>  
> -	/*
> -	 * Periodic mode does not care about the enter/exit of power
> -	 * states
> -	 */
> -	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> -		return 0;
> +	if (!tick_broadcast_device.evtdev)
> +		return -EBUSY;
>  
> -	/*
> -	 * We are called with preemption disabled from the depth of the
> -	 * idle code, so we can't be moved away.
> -	 */
> -	td = this_cpu_ptr(&tick_cpu_device);
> -	dev = td->evtdev;
> -
> -	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
> -		return 0;
> +	dev = this_cpu_ptr(&tick_cpu_device)->evtdev;
>  
>  	raw_spin_lock(&tick_broadcast_lock);
>  	bc = tick_broadcast_device.evtdev;
>  	cpu = smp_processor_id();
>  
>  	if (state == TICK_BROADCAST_ENTER) {
> +		/*
> +		 * If the current CPU owns the hrtimer broadcast
> +		 * mechanism, it cannot go deep idle and we do not add
> +		 * the CPU to the broadcast mask. We don't have to go
> +		 * through the EXIT path as the local timer is not
> +		 * shutdown.
> +		 */
> +		ret = broadcast_needs_cpu(bc, cpu);
> +
> +		/*
> +		 * If the hrtimer broadcast check tells us that the
> +		 * cpu cannot go deep idle, or if the broadcast device
> +		 * is in periodic mode, we just return.
> +		 */
> +		if (ret || tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> +			goto out;

The check on PERIODIC mode is good, but I don't see the point of moving
broadcast_needs_cpu() up above. broadcast_shutdown_local() calls
broadcast_needs_cpu() internally.

Besides, by jumping to 'out', we will miss programming the broadcast
hrtimer in tick_broadcast_set_event() below, if the cpu happens to be
the broadcast cpu (which is why it was not allowed to go to deep idle).

> +
>  		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
>  			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
>  			broadcast_shutdown_local(bc, dev);
> @@ -717,16 +710,6 @@ int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
>  			    dev->next_event.tv64 < bc->next_event.tv64)
>  				tick_broadcast_set_event(bc, cpu, dev->next_event);
>  		}
> -		/*
> -		 * If the current CPU owns the hrtimer broadcast
> -		 * mechanism, it cannot go deep idle and we remove the
> -		 * CPU from the broadcast mask. We don't have to go
> -		 * through the EXIT path as the local timer is not
> -		 * shutdown.
> -		 */
> -		ret = broadcast_needs_cpu(bc, cpu);
> -		if (ret)
> -			cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
>  	} else {
>  		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
>  			clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
> @@ -938,6 +921,13 @@ bool tick_broadcast_oneshot_available(void)
>  	return bc ? bc->features & CLOCK_EVT_FEAT_ONESHOT : false;
>  }

Regards
Preeti U Murthy
Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
On 06/26/2015 01:17 PM, Thomas Gleixner wrote:
> On Fri, 26 Jun 2015, Preeti U Murthy wrote:
>> What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
>> TICK_ONESHOT=n (HZ_PERIODIC=y)? Have you tested this?
>>
>> This will hang the kernel at boot if you are using the hrtimer mode of
>> broadcast. This is because the local timers of all cpus are shut down
>> when the cpuidle driver registers itself, on finding out that there
>> are idle states where local tick devices stop. The broadcast tick
>> device is then in charge of waking up the cpus at every period. In
>> hrtimer mode of broadcast, there is no such real device and we hang.
>
> Hmm, no. tick-broadcast-hrtimer.o depends on TICK_ONESHOT=y. So this
> is covered by the check for the broadcast device, which is NULL.
>
> But there is another variant:
>
> GENERIC_CLOCKEVENTS_BROADCAST=y, TICK_ONESHOT=y and 'highres=off
> nohz=off' on the kernel command line.

Can this happen at all? It is during tick_init_highres() or
tick_nohz_switch_to_nohz() that we switch to oneshot mode, not otherwise
AFAICT.

I was actually talking of the following scenario. In periodic mode,
where GENERIC_CLOCKEVENTS_BROADCAST=y, the arch can execute
tick_setup_hrtimer_broadcast(), which will return nothing as you point
out above. So there is no broadcast clockevent device.

When the cpuidle driver registers with the cpuidle core however,
cpuidle_setup_broadcast_timer() is executed on every cpu if it finds
that there is an idle state where ticks stop.

cpuidle_setup_broadcast_timer()
  tick_broadcast_enable()
    tick_broadcast_control(BROADCAST_ON)
      bc = tick_broadcast_device.evtdev, which is NULL in this case

      TICK_BROADCAST_ON:
        checks for periodic mode of the broadcast device - succeeds
        although we haven't registered a broadcast device, because the
        value of TICKDEV_PERIODIC is 0, the default value of td.mode.

        clockevents_shutdown(dev)

At this point all cpus stop.
Regards
Preeti U Murthy
Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
On 06/25/2015 09:00 PM, Sudeep Holla wrote:
>
> On 25/06/15 14:55, Thomas Gleixner wrote:
>> On Thu, 25 Jun 2015, Sudeep Holla wrote:
>>
>>> tick_broadcast_enter returns 0 when the CPU can switch to the
>>> broadcast timer and non-zero otherwise. However, when
>>> GENERIC_CLOCKEVENTS_BROADCAST and TICK_ONESHOT are disabled,
>>> tick_broadcast_oneshot_control returns 0, which indicates to the
>>> CPUIdle framework that the CPU can enter deeper idle states even
>>> when the CPU local timer will be shut down. If the target state
>>> needs broadcast but no broadcast timer is available, then the CPU
>>> cannot resume from that idle state.
>>>
>>> This patch returns an error when there's no broadcast timer support
>>> available, so that the CPUIdle framework prevents the CPU from
>>> entering any idle states that lose the local timer.
>>
>> That's wrong and breaks stuff which does not require the broadcast
>> nonsense.
>
> OK, sorry for not considering that case.
>
>> If TICK_ONESHOT is disabled, then everything is in periodic mode and
>> tick_broadcast_enter() rightfully returns 0. Ditto for 'highres=off'
>> on the command line.
>>
>> But there is a case which is not correctly handled right now. That's
>> what you are trying to solve in the wrong way.
>
> Correct, I was trying to solve exactly the case mentioned below.
>
>> If
>>
>>   GENERIC_CLOCKEVENTS_BROADCAST=n
>>
>> or
>>
>>   GENERIC_CLOCKEVENTS_BROADCAST=y and no broadcast device is available,
>>
>> AND the cpu local tick device has the C3STOP flag set,
>>
>> then we have no way to tell the idle code that going deep is not
>> allowed.
>>
>> So we need to be smarter than blindly changing a return value.
>> Completely untested patch below.
>
> Agreed, thanks for the quick patch. I have tested it and it works fine.
> You can add
>
> Tested-by: Sudeep Holla

What about the case where GENERIC_CLOCKEVENTS_BROADCAST=y and
TICK_ONESHOT=n (HZ_PERIODIC=y)? Have you tested this?

This will hang the kernel at boot if you are using the hrtimer mode of
broadcast. This is because the local timers of all cpus are shut down
when the cpuidle driver registers itself, on finding out that there are
idle states where local tick devices stop. The broadcast tick device is
then in charge of waking up the cpus at every period. In hrtimer mode
of broadcast, there is no such real device and we hang.

There was a patch sent out recently to fix this on powerpc.
https://lkml.org/lkml/2015/6/24/42

Regards
Preeti U Murthy

> Regards,
> Sudeep
Re: [PATCH] clockevents: return error from tick_broadcast_oneshot_control if !GENERIC_CLOCKEVENTS_BROADCAST
> #ifdef CONFIG_HOTPLUG_CPU
>  /*
>   * Transfer the do_timer job away from a dying cpu.
> diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
> index 42fdf4958bcc..a4a8d4e9baa1 100644
> --- a/kernel/time/tick-sched.h
> +++ b/kernel/time/tick-sched.h
> @@ -71,4 +71,14 @@ extern void tick_cancel_sched_timer(int cpu);
>  static inline void tick_cancel_sched_timer(int cpu) { }
>  #endif
>  
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
> +extern int __tick_broadcast_oneshot_control(enum tick_broadcast_state state);
> +#else
> +static inline int
> +__tick_broadcast_oneshot_control(enum tick_broadcast_state state)
> +{
> +	return -EBUSY;
> +}
> +#endif
> +
>  #endif

Reviewed-by: Preeti U Murthy
Re: [PATCH] tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode
On 06/25/2015 05:36 AM, Rafael J. Wysocki wrote:
> On Thu, Jun 25, 2015 at 12:06 AM, Benjamin Herrenschmidt wrote:
>> On Wed, 2015-06-24 at 15:50 +0200, Rafael J. Wysocki wrote:
>>> 4.2 material I suppose?
>>
>> And stable. Without this, if you configure without TICK_ONESHOT, the
>> machine will hang.
>
> OK, which -stable? All of them or any specific series?

This needs to go into stable/linux-3.19.y, stable/linux-4.0.y and
stable/linux-4.1.y.

Thanks

Regards
Preeti U Murthy

> Rafael
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH RESEND] nohz: Affining unpinned timers
On 06/23/2015 02:01 PM, Vatika Harlalka wrote:
> The problem addressed in this patch is about affining unpinned timers.
> Adaptive or Full Dynticks CPUs should not be disturbed by unnecessary
> jitter due to the firing of such timers on them.
> This patch will affine timers to online CPUs which are not Full
> Dynticks in FULL_NOHZ configured systems. It will not bring about
> functional changes if NOHZ_FULL is not configured, because
> is_housekeeping_cpu() always returns true in CONFIG_NO_HZ_FULL=n.
>
> Signed-off-by: Vatika Harlalka
> ---

Reviewed-by: Preeti U Murthy

> Changes: Patch sent to the right recipients.
>
>  include/linux/tick.h | 9 ++++++++-
>  kernel/sched/core.c  | 9 +++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index f8492da5..145bcba 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -133,13 +133,20 @@ static inline bool tick_nohz_full_cpu(int cpu)
>  
>  	return cpumask_test_cpu(cpu, tick_nohz_full_mask);
>  }
> -
> +static inline int housekeeping_any_cpu(void)
> +{
> +	return cpumask_any_and(housekeeping_mask, cpu_online_mask);
> +}
>  extern void __tick_nohz_full_check(void);
>  extern void tick_nohz_full_kick(void);
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void tick_nohz_full_kick_all(void);
>  extern void __tick_nohz_task_switch(struct task_struct *tsk);
>  #else
> +static inline int housekeeping_any_cpu(void)
> +{
> +	return smp_processor_id();
> +}
>  static inline bool tick_nohz_full_enabled(void) { return false; }
>  static inline bool tick_nohz_full_cpu(int cpu) { return false; }
>  static inline void __tick_nohz_full_check(void) { }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 57bd333..acee856 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -599,18 +599,23 @@ int get_nohz_timer_target(int pinned)
>  	int i;
>  	struct sched_domain *sd;
>  
> -	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
> +	if (pinned || !get_sysctl_timer_migration() ||
> +	    (!idle_cpu(cpu) && is_housekeeping_cpu(cpu)))
>  		return cpu;
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, sd) {
>  		for_each_cpu(i, sched_domain_span(sd)) {
> -			if (!idle_cpu(i)) {
> +			if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) {
>  				cpu = i;
>  				goto unlock;
>  			}
>  		}
>  	}
> +
> +	if (!is_housekeeping_cpu(cpu))
> +		cpu = housekeeping_any_cpu();
> +
>  unlock:
>  	rcu_read_unlock();
>  	return cpu;
[PATCH] tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode
On some archs, the local clockevent device stops in deep cpuidle states. The
broadcast framework is used to wake up cpus in these idle states: either an
external clockevent device is used to send wakeup IPIs, or the hrtimer
broadcast framework kicks in in the absence of such a device. One cpu is
nominated as the broadcast cpu, and this cpu sends wakeup IPIs to sleeping
cpus at the appropriate time. This is the implementation in the oneshot mode
of broadcast.

In periodic mode of broadcast however, the presence of such cpuidle states
results in the cpuidle driver calling tick_broadcast_enable(), which shuts
down the local clockevent devices of all the cpus and appoints the tick
broadcast device as the clockevent device for each of them. This works on
those archs where the tick broadcast device is a real clockevent device. But
on archs which depend on the hrtimer mode of broadcast, the tick broadcast
device happens to be a pseudo device. The consequence is that the local
clockevent devices of all cpus are shut down and the kernel hangs at boot
time in periodic mode.

Let us thus not register the cpuidle states which have the
CPUIDLE_FLAG_TIMER_STOP flag set on archs which depend on the hrtimer mode
of broadcast in periodic mode. This patch takes care of doing this on
powerpc. The cpus would not have entered such deep cpuidle states in
periodic mode on powerpc anyway, so there is no loss here.
Signed-off-by: Preeti U Murthy
---
 drivers/cpuidle/cpuidle-powernv.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 5937207..3442764 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -60,6 +60,8 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }

+/* Register for fastsleep only in oneshot mode of broadcast */
+#ifdef CONFIG_TICK_ONESHOT
 static int fastsleep_loop(struct cpuidle_device *dev,
 			  struct cpuidle_driver *drv,
 			  int index)
@@ -83,7 +85,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,

 	return index;
 }
-
+#endif
 /*
  * States for dedicated partition case.
  */
@@ -209,7 +211,14 @@ static int powernv_add_idle_states(void)
 			powernv_states[nr_idle_states].flags = 0;
 			powernv_states[nr_idle_states].target_residency = 100;
 			powernv_states[nr_idle_states].enter = nap_loop;
-		} else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
+		}
+
+		/*
+		 * All cpuidle states with CPUIDLE_FLAG_TIMER_STOP set must come
+		 * within this config dependency check.
+		 */
+#ifdef CONFIG_TICK_ONESHOT
+		if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
 			flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
 			/* Add FASTSLEEP state */
 			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
@@ -218,7 +227,7 @@ static int powernv_add_idle_states(void)
 			powernv_states[nr_idle_states].target_residency = 30;
 			powernv_states[nr_idle_states].enter = fastsleep_loop;
 		}
-
+#endif
 		powernv_states[nr_idle_states].exit_latency =
 				((unsigned int)latency_ns[i]) / 1000;
--
Re: [PATCH] tick/idle/powerpc: Do not register idle states with CPUIDLE_FLAG_TIMER_STOP set in periodic mode
On 06/25/2015 05:36 AM, Rafael J. Wysocki wrote:
> On Thu, Jun 25, 2015 at 12:06 AM, Benjamin Herrenschmidt wrote:
>> On Wed, 2015-06-24 at 15:50 +0200, Rafael J. Wysocki wrote:
>>> 4.2 material I suppose?
>>
>> And stable. Without this, if you configure without TICK_ONESHOT, the
>> machine will hang.
>
> OK, which -stable? All of them or any specific series?
>
> Rafael

This needs to go into stable/linux-3.19.y, stable/linux-4.0.y and
stable/linux-4.1.y.

Thanks

Regards
Preeti U Murthy
Re: [PATCH v2] cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
On 06/18/2015 04:53 PM, Shilpasri G Bhat wrote:
> The idle cpus which stay in snooze for a long period can degrade the
> performance of the sibling cpus. If the cpu stays in snooze for more
> than the target residency of the next available idle state, then exit
> from snooze. This gives a chance to the cpuidle governor to re-evaluate
> the last idle state of the cpu to promote it to deeper idle states.
>
> Signed-off-by: Shilpasri G Bhat
> ---

Reviewed-by: Preeti U Murthy

> Changes from v1:
> - Modified commit message
>
>  drivers/cpuidle/cpuidle-powernv.c | 12 ++++++++++++
>  drivers/cpuidle/cpuidle-pseries.c | 11 +++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 5937207..1e3ef5e 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -29,18 +29,25 @@ struct cpuidle_driver powernv_idle_driver = {
>
>  static int max_idle_state;
>  static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>
>  static int snooze_loop(struct cpuidle_device *dev,
>  			struct cpuidle_driver *drv,
>  			int index)
>  {
> +	u64 snooze_exit_time;
> +
>  	local_irq_enable();
>  	set_thread_flag(TIF_POLLING_NRFLAG);
>
> +	snooze_exit_time = get_tb() + snooze_timeout;
>  	ppc64_runlatch_off();
>  	while (!need_resched()) {
>  		HMT_low();
>  		HMT_very_low();
> +		if (snooze_timeout_en && get_tb() > snooze_exit_time)
> +			break;
>  	}
>
>  	HMT_medium();
> @@ -252,6 +259,11 @@ static int powernv_idle_probe(void)
>  		cpuidle_state_table = powernv_states;
>  		/* Device tree can indicate more idle states */
>  		max_idle_state = powernv_add_idle_states();
> +		if (max_idle_state > 1) {
> +			snooze_timeout_en = true;
> +			snooze_timeout = powernv_states[1].target_residency *
> +					 tb_ticks_per_usec;
> +		}
>  	} else
>  		return -ENODEV;
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index bb9e2b6..07135e0 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -27,6 +27,8 @@ struct cpuidle_driver pseries_idle_driver = {
>
>  static int max_idle_state;
>  static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>
>  static inline void idle_loop_prolog(unsigned long *in_purr)
>  {
> @@ -58,14 +60,18 @@ static int snooze_loop(struct cpuidle_device *dev,
>  			int index)
>  {
>  	unsigned long in_purr;
> +	u64 snooze_exit_time;
>
>  	idle_loop_prolog(&in_purr);
>  	local_irq_enable();
>  	set_thread_flag(TIF_POLLING_NRFLAG);
> +	snooze_exit_time = get_tb() + snooze_timeout;
>
>  	while (!need_resched()) {
>  		HMT_low();
>  		HMT_very_low();
> +		if (snooze_timeout_en && get_tb() > snooze_exit_time)
> +			break;
>  	}
>
>  	HMT_medium();
> @@ -244,6 +250,11 @@ static int pseries_idle_probe(void)
>  	} else
>  		return -ENODEV;
>
> +	if (max_idle_state > 1) {
> +		snooze_timeout_en = true;
> +		snooze_timeout = cpuidle_state_table[1].target_residency *
> +				 tb_ticks_per_usec;
> +	}
>  	return 0;
>  }
> --
Re: [PATCH 0/5] Fix race in hrtimer broadcast and take care of dependencies
A reminder to pick this series up for stable-4.0.

On 05/13/2015 05:19 PM, Preeti U Murthy wrote:
> The intention was to backport only PATCH[5/5]: clockevents: Fix
> cpu_down() race for hrtimer based broadcasting,
>
> but this depends on commits upstream which did cleanup and
> reorganization of code. However these commits cannot be cherry-picked
> as is. There are a few minor changes upstream and hence these commits
> had to be backported too. Reasons for the backports are given below:
>
> 1. PATCH[1/5]: The upstream commit introduced a duplicate function
>    which was removed much later by commit:
>
>    9eed56e889d: clockevents: Clean up clockchips.h
>
>    We do not need to cherry-pick the entire above commit because the
>    backport was a simple one.
>
> 2. PATCH[3/5]: There are newer functions introduced upstream through
>
>    554ef3876c6: clockevents: Handle tick device's resume separately
>
>    Yet again we did not need to cherry-pick the entire above commit,
>    firstly because it introduces significant new code, secondly because
>    the backport was a simple one.
>
> 3. PATCH[4/5]: The function clockevent_set_mode() was renamed to
>    clockevent_set_state() upstream by commit:
>
>    77e32c89a7: clockevents: Manage device's state separately for the core
>
>    We do not need to cherry-pick this commit for the same reason as in (2).
>
> This series needs to be applied on top of stable-4.0.
> ---
>
> Preeti U Murthy (1):
>       clockevents: Fix cpu_down() race for hrtimer based broadcasting
>
> Thomas Gleixner (4):
>       clockevents: Remove CONFIG_GENERIC_CLOCKEVENTS_BUILD
>       tick: Move clocksource related stuff to timekeeping.h
>       tick: Simplify tick-internal.h
>       tick: Move core only declarations and functions to core
>
>  include/linux/clockchips.h   |  23 ++--
>  include/linux/tick.h         | 138 ++++++-------
>  kernel/cpu.c                 |   2 +
>  kernel/time/Kconfig          |   6 --
>  kernel/time/Makefile         |   6 +-
>  kernel/time/clockevents.c    |   3 -
>  kernel/time/hrtimer.c        |   2 -
>  kernel/time/jiffies.c        |   2 -
>  kernel/time/ntp.c            |   1
>  kernel/time/tick-broadcast.c |  19 +++--
>  kernel/time/tick-internal.h  | 166 --------------
>  kernel/time/tick-sched.c     |   7 ++
>  kernel/time/tick-sched.h     |  64 ++++++++
>  kernel/time/timekeeping.h    |   7 ++
>  kernel/time/timer_list.c     |   2 -
>  15 files changed, 190 insertions(+), 258 deletions(-)
>  create mode 100644 kernel/time/tick-sched.h
Re: [PATCH 7/8] nohz: Evaluate tick dependency once on context switch
On 06/12/2015 02:16 AM, Rik van Riel wrote:
> On 06/11/2015 01:36 PM, Frederic Weisbecker wrote:
>> The tick dependency is evaluated on every irq. This is a batch of checks
>> which determine whether it is safe to stop the tick or not. These checks
>> are often split into many details: posix cpu timers, scheduler, sched
>> clock, perf events. Each of these is made of smaller details: posix cpu
>> timers involve checking process wide timers then thread wide timers.
>> Perf involves checking freq events then more per cpu details.
>>
>> Checking these details every time we update the full dynticks state
>> brings avoidable overhead.
>>
>> So let's evaluate these dependencies once on context switch. Then the
>> further dependency checks will be performed through a single state check.
>>
>> This is a first step that can be later optimized by dividing task level
>> dependency, CPU level dependency and global dependency and updating
>> each at the right time.
>>
>> +static void tick_nohz_full_update_dependencies(void)
>> +{
>> +	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>> +
>> +	if (!posix_cpu_timers_can_stop_tick(current))
>> +		ts->tick_needed |= TICK_NEEDED_POSIX_CPU_TIMER;
>> +
>> +	if (!perf_event_can_stop_tick())
>> +		ts->tick_needed |= TICK_NEEDED_PERF_EVENT;
>> +
>> +	if (!sched_can_stop_tick())
>> +		ts->tick_needed |= TICK_NEEDED_SCHED;
>
> I see this getting kicked from task work and from ipi
> context, but does it get kicked on task wakeup, when
> we have a second runnable task on a CPU, but we decide
> not to preempt the currently running task to switch to
> it yet, but we will want to preempt the currently running
> task at a later point in time?

+1. This is not taken care of as far as I can see too.

Regards
Preeti U Murthy
Re: [PATCH v2 7/7] powerpc/powernv: nest pmu cpumask and cpu hotplug support
On 06/11/2015 10:47 AM, Madhavan Srinivasan wrote:
> Adds cpumask attribute to be used by each nest pmu since nest
> units are per-chip. Only one cpu (first online cpu) from each node/chip
> is designated to read counters.
>
> On cpu hotplug, the dying cpu is checked to see whether it is one of the
> designated cpus. If yes, the next online cpu from the same node/chip is
> designated as the new cpu to read counters.
>
> Cc: Michael Ellerman
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Anton Blanchard
> Cc: Sukadev Bhattiprolu
> Cc: Anshuman Khandual
> Cc: Stephane Eranian
> Cc: Preeti U Murthy
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Signed-off-by: Madhavan Srinivasan
> ---
> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
> +{
> +	int i;
> +
> +	if (new_cpu >= 0) {
> +		for (i = 0; per_nest_pmu_arr[i] != NULL; i++)
> +			perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
> +						 old_cpu, new_cpu);
> +	}
> +}
> +
> +static void nest_exit_cpu(int cpu)
> +{
> +	int i, nid, target = -1;
> +	const struct cpumask *l_cpumask;
> +	int src_chipid;
> +
> +	/*
> +	 * Check in the designated list for this cpu. Dont bother
> +	 * if not one of them.
> +	 */
> +	if (!cpumask_test_and_clear_cpu(cpu, &cpu_mask_nest_pmu))
> +		return;
> +
> +	/*
> +	 * Now that this cpu is one of the designated,
> +	 * find a new cpu a) which is not dying and

This comment is not right. nest_exit_cpu() is called in the hotplug path,
so another cpu cannot be dying in parallel. Hotplug operations are done
serially. The comment ought to be "a) which is online" instead.

> +	 * b) is in same node/chip.

A node is not the same as a chip, right? And you are interested in cpus
on the same chip alone. So shouldn't the above comment be "b) in the same
chip"?

> +	 */
> +	nid = cpu_to_node(cpu);
> +	src_chipid = topology_physical_package_id(cpu);

What is the relation between a node and a chip? Can't we have a function
which returns the cpumask of a chip straight away, since that is what you
seem to be more interested in? You can then simply choose the next cpu in
this cpumask rather than going through the below loop.

> +	l_cpumask = cpumask_of_node(nid);
> +	for_each_cpu(i, l_cpumask) {
> +		if (i == cpu)
> +			continue;
> +		if (src_chipid == topology_physical_package_id(i)) {
> +			target = i;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * Update the cpumask with the target cpu and
> +	 * migrate the context if needed
> +	 */
> +	if (target >= 0) {

You already check for target >= 0 here. So you don't need to check for
new_cpu >= 0 in nest_change_cpu_context() above?

> +		cpumask_set_cpu(target, &cpu_mask_nest_pmu);
> +		nest_change_cpu_context(cpu, target);
> +	}
> +}
> +
> +static void nest_init_cpu(int cpu)
> +{
> +	int i, src_chipid;
> +
> +	/*
> +	 * Search for any existing designated thread from
> +	 * the incoming cpu's node/chip. If found, do nothing.
> +	 */
> +	src_chipid = topology_physical_package_id(cpu);
> +	for_each_cpu(i, &cpu_mask_nest_pmu)
> +		if (src_chipid == topology_physical_package_id(i))
> +			return;
> +
> +	/*
> +	 * Make incoming cpu as a designated thread for
> +	 * this node/chip
> +	 */
> +	cpumask_set_cpu(cpu, &cpu_mask_nest_pmu);

Why can't we simply check if cpu is the first one coming online in the
chip and designate it as the cpu_mask_nest_pmu for that chip? If it is
not the first cpu, it means that another cpu in the same chip already
took over this duty and it needn't bother.

And shouldn't we also call nest_init() on this cpu, just like you do in
cpumask_chip() on all cpu_mask_nest_pmu cpus?

> +}
> +
> +static int nest_cpu_notifier(struct notifier_block *self,
> +			     unsigned long action, void *hcpu)
> +{
> +	long cpu = (long)hcpu;
> +
> +	switch (action & ~CPU_TASKS_FROZEN) {
> +	case CPU_DOWN_FAILED:

Why do we need to handle the DOWN_FAILED case? In DOWN_PREPARE, you have
ensured that the function moves on to another cpu. So even if the offline
failed, it's no issue. The duty is safely taken over.

> +	case CPU_STARTING:

I would suggest initializing nest in the CPU_ONLINE stage. This is because
the CPU_STARTING phase can fail. In that case, we will be unnecessarily
initializing nest prematurely. The CPU_ONLINE phase assures that the cpu
is successfully online and you can then initiate nest.

> +		nest_init_cpu(cpu);
> +		break;
> +	case CPU_DOWN_PREPARE:
> +		nest_exit_cpu(cpu);
> +		break;
> +	default
Re: [PATCH v2 7/7]powerpc/powernv: nest pmu cpumask and cpu hotplug support
On 06/11/2015 10:47 AM, Madhavan Srinivasan wrote:
> Adds cpumask attribute to be used by each nest pmu since nest
> units are per-chip. Only one cpu (first online cpu) from each
> node/chip is designated to read counters. On cpu hotplug, dying
> cpu is checked to see whether it is one of the designated cpus,
> if yes, next online cpu from the same node/chip is designated as
> new cpu to read counters.
>
> Cc: Michael Ellerman m...@ellerman.id.au
> Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
> Cc: Paul Mackerras pau...@samba.org
> Cc: Anton Blanchard an...@samba.org
> Cc: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Cc: Anshuman Khandual khand...@linux.vnet.ibm.com
> Cc: Stephane Eranian eran...@google.com
> Cc: Preeti U Murthy pre...@linux.vnet.ibm.com
> Cc: Ingo Molnar mi...@kernel.org
> Cc: Peter Zijlstra pet...@infradead.org
> Signed-off-by: Madhavan Srinivasan ma...@linux.vnet.ibm.com
> ---
> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
> +{
> +	int i;
> +
> +	if (new_cpu >= 0) {
> +		for (i = 0; per_nest_pmu_arr[i] != NULL; i++)
> +			perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
> +						 old_cpu, new_cpu);
> +	}
> +}
> +
> +static void nest_exit_cpu(int cpu)
> +{
> +	int i, nid, target = -1;
> +	const struct cpumask *l_cpumask;
> +	int src_chipid;
> +
> +	/*
> +	 * Check in the designated list for this cpu. Don't bother
> +	 * if not one of them.
> +	 */
> +	if (!cpumask_test_and_clear_cpu(cpu, &cpu_mask_nest_pmu))
> +		return;
> +
> +	/*
> +	 * Now that this cpu is one of the designated,
> +	 * find a new cpu a) which is not dying and

This comment is not right. nest_exit_cpu() is called in the hotplug path,
so another cpu cannot be dying in parallel; hotplug operations are done
serially. The comment ought to be "a) which is online" instead.

> +	 * b) is in same node/chip.

A node is not the same as a chip, right? And you are interested in cpus
on the same chip alone. So shouldn't the above comment be "b) in the same
chip"?

> +	 */
> +	nid = cpu_to_node(cpu);
> +	src_chipid = topology_physical_package_id(cpu);

What is the relation between a node and a chip? Can't we have a function
which returns the cpumask of a chip straight away, since that is what you
seem to be more interested in? You could then simply choose the next cpu
in this cpumask rather than going through the loop below.

> +	l_cpumask = cpumask_of_node(nid);
> +	for_each_cpu(i, l_cpumask) {
> +		if (i == cpu)
> +			continue;
> +		if (src_chipid == topology_physical_package_id(i)) {
> +			target = i;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * Update the cpumask with the target cpu and
> +	 * migrate the context if needed
> +	 */
> +	if (target >= 0) {

You already check for target >= 0 here. So you don't need to check for
new_cpu >= 0 in nest_change_cpu_context() above?

> +		cpumask_set_cpu(target, &cpu_mask_nest_pmu);
> +		nest_change_cpu_context(cpu, target);
> +	}
> +}
> +
> +static void nest_init_cpu(int cpu)
> +{
> +	int i, src_chipid;
> +
> +	/*
> +	 * Search for any existing designated thread from
> +	 * the incoming cpu's node/chip. If found, do nothing.
> +	 */
> +	src_chipid = topology_physical_package_id(cpu);
> +	for_each_cpu(i, &cpu_mask_nest_pmu)
> +		if (src_chipid == topology_physical_package_id(i))
> +			return;
> +
> +	/*
> +	 * Make incoming cpu as a designated thread for
> +	 * this node/chip
> +	 */
> +	cpumask_set_cpu(cpu, &cpu_mask_nest_pmu);

Why can't we simply check whether this cpu is the first one coming online
in the chip and designate it as the cpu_mask_nest_pmu cpu for that chip?
If it is not the first cpu, another cpu in the same chip has already taken
over this duty and it needn't bother.

And shouldn't we also call nest_init() on this cpu, just like you do in
cpumask_chip() on all cpu_mask_nest_pmu cpus?

> +}
> +
> +static int nest_cpu_notifier(struct notifier_block *self,
> +			     unsigned long action, void *hcpu)
> +{
> +	long cpu = (long)hcpu;
> +
> +	switch (action & ~CPU_TASKS_FROZEN) {
> +	case CPU_DOWN_FAILED:

Why do we need to handle the DOWN_FAILED case?
In DOWN_PREPARE, you have ensured that the duty moves on to another cpu.
So even if the offline failed, it's no issue: the duty is safely taken
over.

> +	case CPU_STARTING:

I would suggest initializing nest in the CPU_ONLINE stage instead. The
CPU_STARTING phase can fail, in which case we will have initialized nest
prematurely. The CPU_ONLINE phase assures that the cpu is successfully
online, and you can then initiate nest.

> +		nest_init_cpu(cpu);
> +		break;
> +	case CPU_DOWN_PREPARE:
> +		nest_exit_cpu(cpu);
> +		break;
> +	default
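The designated-reader handoff being reviewed above can be sketched in user space: mask off `CPU_TASKS_FROZEN` from the notifier action (the `&` that went missing in the quoted patch), and when the dying cpu is the designated reader for its chip, pick another online cpu on the same chip. `chip_of[]` and the `is_online` array are hypothetical stand-ins for `topology_physical_package_id()` and `cpu_online()`.

```c
#include <assert.h>

#define NR_CPUS          8
#define CPU_TASKS_FROZEN 0x10
#define CPU_DOWN_PREPARE 0x5

/* Hypothetical topology: cpus 0-3 on chip 0, cpus 4-7 on chip 1. */
static const int chip_of[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Find a replacement counter-reader: any other online cpu on the
 * same chip as the dying one. */
static int pick_new_reader(int dying, const int *is_online)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		if (i != dying && is_online[i] && chip_of[i] == chip_of[dying])
			return i;
	return -1;	/* whole chip going down: nobody left */
}

/* Notifier actions carry CPU_TASKS_FROZEN during suspend/resume;
 * handlers strip it before comparing. */
static unsigned long strip_frozen(unsigned long action)
{
	return action & ~CPU_TASKS_FROZEN;
}
```

When `pick_new_reader()` returns -1, there is no context migration to do, matching the `target >= 0` guard in the patch.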
Re: [PATCH 5/8] nohz: Restart the tick from irq exit
On 06/12/2015 06:08 PM, Frederic Weisbecker wrote:
> On Fri, Jun 12, 2015 at 09:32:45AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 11, 2015 at 07:36:05PM +0200, Frederic Weisbecker wrote:
>>> Restart the tick when necessary from the irq exit path. It makes nohz
>>> full more flexible and allow it to piggyback the tick restart on the
>>> scheduler IPI in the future instead of sending a dedicated IPI that
>>> often doubles the scheduler IPI on task wakeup. This will require
>>> careful review of resched_curr() callers.
>>
>> This seems to assume scheduler_ipi() callers use irq_exit(), this is
>> false.
>
> Indeed there will be that too. Note the current patch doesn't yet rely on
> scheduler_ipi(), we are still using the nohz ipis. But introducing the
> tick restart on irq exit prepares for later piggybacking on scheduler_ipi().
>
> I think this will involve changes on got_nohz_idle_kick(), renamed to
> got_nohz_kick() and include nohz full related checks to trigger the
> irq_enter()/exit() pair.

I may be saying something obvious here, nevertheless: I am not sure about
other archs, but at least on powerpc, after handling an interrupt we will
call irq_exit() and reevaluate starting of ticks. So in our case, even if
scheduler_ipi() callers do not call irq_exit(), it will be called after
handling the reschedule interrupt.

Regards
Preeti U Murthy
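The powerpc point above can be modelled in a few lines: even when `scheduler_ipi()` itself does not call `irq_exit()`, the architecture's interrupt return path does, and that is where the tick-restart decision is re-evaluated. The names below are illustrative, not the real kernel entry code.

```c
#include <assert.h>

/* Hypothetical per-cpu state. */
static int tick_stopped;
static int tick_needed;

/* Model of irq_exit(): re-evaluate the tick on every interrupt exit. */
static void irq_exit_model(void)
{
	if (tick_stopped && tick_needed)
		tick_stopped = 0;	/* restart the tick */
}

/* Model of the arch-level reschedule interrupt handler: the
 * scheduler_ipi() body runs without its own irq_exit(), but the
 * interrupt return path still goes through irq_exit(). */
static void handle_reschedule_interrupt(void)
{
	tick_needed = 1;	/* e.g. a second runnable task appeared */
	/* scheduler_ipi() body would run here */
	irq_exit_model();	/* arch interrupt path exits here */
}
```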
Re: [PATCH 5/8] nohz: Restart the tick from irq exit
On 06/11/2015 11:06 PM, Frederic Weisbecker wrote:
> Restart the tick when necessary from the irq exit path. It makes nohz
> full more flexible and allow it to piggyback the tick restart on the
> scheduler IPI in the future instead of sending a dedicated IPI that
> often doubles the scheduler IPI on task wakeup. This will require
> careful review of resched_curr() callers.

You can piggyback on the scheduler ipi when you add a timer/hrtimer or
add a new task to the runqueue of the nohz_full cpus, since we call
resched_curr() in these code paths. But what about the calls to kick
nohz_full cpus by perf events and posix cpu timers? These call sites seem
to be concerned with specifically waking up nohz_full cpus as far as I
can see. IOW, there is no scheduler ipi that we can fall back on in these
paths.

Regards
Preeti U Murthy
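The distinction being raised can be sketched as a toy model: kicks that accompany a `resched_curr()` can piggyback on the pending scheduler IPI, while perf and posix-cpu-timer kicks have no resched pending and would still need a dedicated IPI. The enum values and counters are purely illustrative.

```c
#include <assert.h>

enum kick_source {
	KICK_NEW_TASK,		/* resched_curr() already pending */
	KICK_TIMER_ADD,		/* resched_curr() already pending */
	KICK_PERF,		/* no resched IPI to ride on */
	KICK_POSIX_TIMER,	/* no resched IPI to ride on */
};

static int dedicated_ipis;
static int piggybacked;

/* Model: if a reschedule IPI is already pending, the tick restart can
 * ride on it; otherwise a dedicated IPI must be sent. */
static void kick_nohz_full_cpu(enum kick_source src, int resched_pending)
{
	(void)src;
	if (resched_pending)
		piggybacked++;
	else
		dedicated_ipis++;
}
```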
Re: [PATCH 4/8] nohz: Remove idle task special case
On 06/11/2015 11:06 PM, Frederic Weisbecker wrote:
> This is a leftover from old days to avoid conflicts with dynticks idle
> code. Now full dynticks and idle dynticks are better integrated and
> interact without known issue.

I am sorry, but I fail to understand why the check on the idle task was
there in the first place in the code paths below. It would help if you
could clarify this in the changelog as well.

> So lets remove it.
>
> Cc: Christoph Lameter
> Cc: Ingo Molnar
> Cc: John Stultz
> Cc: Peter Zijlstra
> Cc: Preeti U Murthy
> Cc: Rik van Riel
> Cc: Thomas Gleixner
> Cc: Viresh Kumar
> Signed-off-by: Frederic Weisbecker
> ---
>  kernel/time/tick-sched.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 812f7a3..324482f 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -208,10 +208,8 @@ void __tick_nohz_full_check(void)
>  	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
>  	if (tick_nohz_full_cpu(smp_processor_id())) {
> -		if (ts->tick_stopped && !is_idle_task(current)) {
> -			if (!can_stop_full_tick())

can_stop_full_tick() would have bailed out if the current task was idle,
since it checks for the number of tasks being greater than 1 to restart
the tick. So why was the is_idle_task() check introduced earlier?

> -				tick_nohz_restart_sched_tick(ts, ktime_get());
> -		}
> +		if (ts->tick_stopped && !can_stop_full_tick())
> +			tick_nohz_restart_sched_tick(ts, ktime_get());
>  	}
>  }
>
> @@ -710,7 +708,7 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
>  #ifdef CONFIG_NO_HZ_FULL
>  	int cpu = smp_processor_id();
>
> -	if (!tick_nohz_full_cpu(cpu) || is_idle_task(current))
> +	if (!tick_nohz_full_cpu(cpu))

If the current task was indeed idle, the check on ts->inidle would have
succeeded in tick_irq_exit() and we would not have reached this function
at all, isn't it? So here too I am unable to understand why we had it in
the first place.

Regards
Preeti U Murthy

>  		return;
>
>  	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
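The redundancy argument above can be checked mechanically with a small model: if `can_stop_full_tick()` already refuses to stop the tick only when more than one task is runnable, then adding an `is_idle_task()` guard never changes the restart decision. `nr_running` and `in_idle` are hypothetical per-cpu state, and the predicate below models only the runnable-count check that the review refers to.

```c
#include <assert.h>
#include <stdbool.h>

static int nr_running;
static bool in_idle;	/* models is_idle_task(current) */

/* Model of the one check relevant to the review: the tick can stop
 * when at most one task is runnable. */
static bool can_stop_full_tick_model(void)
{
	return nr_running <= 1;
}

/* Old code: extra is_idle_task() guard before the restart. */
static bool restart_tick_old(bool tick_stopped)
{
	return tick_stopped && !in_idle && !can_stop_full_tick_model();
}

/* Patched code: rely on can_stop_full_tick() alone. */
static bool restart_tick_new(bool tick_stopped)
{
	return tick_stopped && !can_stop_full_tick_model();
}
```

With the idle task running, `nr_running <= 1` holds, so both versions decline to restart the tick; with multiple runnable tasks the idle guard is moot, so the two versions agree everywhere.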
Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
On 06/02/2015 11:57 AM, Viresh Kumar wrote:
> On 02-06-15, 11:50, Preeti U Murthy wrote:
>> CPU0                              CPU1
>>
>> store*                            store*
>>
>> lock(policy 1)                    lock(policy 2)
>> cpufreq_set_policy()              cpufreq_set_policy()
>> EXIT():
>>   dbs_data->usage_count--
>>
>> INIT()                            EXIT()
>>   dbs_data exists                   dbs_data->usage_count-- == 0
>>                                     kfree(dbs_data)
>>   dbs_data->usage_count++
>>   *NULL dereference*
>
> When will INIT() follow EXIT() in set_policy() for the same governor?
> Maybe not, and so this sequence is hypothetical?

Ah, yes, this sequence is hypothetical.

> But even if this happens, it should be handled with
> dbs_data->mutex_lock, which is used at other places already.

dbs_data->mutex is used to serialize only the START, STOP and LIMIT
calls. It does not serialize EXIT/INIT with these operations, but that is
understandable. We need to note that EXIT can proceed in parallel with
START/STOP/LIMIT operations, which can result in NULL pointer
dereferences of dbs_data. Yet again, this is due to the reason that you
pointed out.

One such case is __cpufreq_remove_dev_finish(). It also drops policy
locks before calling into START/LIMIT. So this can proceed in parallel
with an EXIT operation from a store. So dbs_data->mutex does not help
serialize these two, and START can refer to a freed dbs_data.

Consider the scenario today, where CPUFREQ_HAVE_GOVERNOR_PER_POLICY is
set itself:

CPU0                                  CPU1

cpufreq_set_policy()
__cpufreq_governor
(CPUFREQ_GOV_POLICY_EXIT)
since the only usage becomes 0.
                                      __cpufreq_remove_dev_finish()
free(dbs_data)
                                      __cpufreq_governor
                                      (CPUFREQ_GOV_START)
                                      dbs_data->mutex <= NULL dereference

This is what we hit initially. So we will need the policy wide lock even
for those drivers that have a governor per policy, before calls to
__cpufreq_governor(), in order to avoid such scenarios.

So, your patch at
https://git.linaro.org/people/viresh.kumar/linux.git/patch/57714d5b1778f2f610bcc5c74d85b29ba1cc1995
can lead to the above races between different store operations
themselves, won't it?

An EXIT on one of the policy cpus, followed by a STOP on another, will
lead to a NULL dereference of dbs_data (for GOVERNOR_PER_POLICY). Today
stores are serialized through the policy lock.

Anyway, since taking the policy lock leads to an ABBA deadlock and
releasing it can lead to races like the above, shouldn't we try another
approach to serialization?

Regards
Preeti U Murthy
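The lifecycle being argued over can be reduced to a single-threaded model of the refcount: once `usage_count` drops to zero the structure is freed, so any later INIT/START that still holds the stale pointer is a use-after-free. The names mirror the mail; this is a hypothetical illustration, not the actual cpufreq code.

```c
#include <assert.h>
#include <stdlib.h>

struct dbs_data_model {
	int usage_count;
};

/* Model of GOV_POLICY_EXIT: drop a reference; the last one frees the
 * structure, so the caller must forget the pointer. */
static struct dbs_data_model *gov_exit(struct dbs_data_model *d)
{
	if (--d->usage_count == 0) {
		free(d);
		return NULL;
	}
	return d;
}

/* Model of GOV_POLICY_INIT: take a reference if the structure still
 * exists, otherwise allocate a fresh one. */
static struct dbs_data_model *gov_init(struct dbs_data_model *d)
{
	if (d) {
		d->usage_count++;
		return d;
	}
	d = calloc(1, sizeof(*d));
	d->usage_count = 1;
	return d;
}
```

The race in the diagrams is exactly a caller that observed "dbs_data exists" before another CPU's `gov_exit()` freed it, and then incremented `usage_count` through the dangling pointer.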
Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
On 06/02/2015 11:41 AM, Viresh Kumar wrote:
> On 02-06-15, 11:33, Preeti U Murthy wrote:
>> No, dbs_data is a governor wide data structure and not a policy wide
>
> Yeah, that's the common part which I was referring to. But normally
> its just read for policies in START/STOP, they just update per-cpu
> data for policy->cpus.
>
>> one, which is manipulated in START/STOP calls for drivers where the
>> CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag is not set.
>>
>> So even if we assume that we hold per-policy locks, the following race
>> is still present. Assume that we have just two cpus which do not have
>> governor-per-policy set.
>>
>> CPU0                     CPU1
>>
>> store*                   store*
>>
>> lock(policy 1)           lock(policy 2)
>> cpufreq_set_policy()     cpufreq_set_policy()
>> EXIT():
>>   dbs_data->usage_count--
>>
>> INIT()
>>   dbs_data exists
>
> You missed the usage_count++ here.

Ok, sorry about that. How about the below?

>> so return
>>                          EXIT()
>>                            dbs_data->usage_count-- == 0
>>                            kfree(dbs_data)
>
> And so this shouldn't happen. Else we
> are missing locking in governor's
> code, rather than cpufreq.c

CPU0                       CPU1

store*                     store*

lock(policy 1)             lock(policy 2)
cpufreq_set_policy()       cpufreq_set_policy()
EXIT():
  dbs_data->usage_count--

INIT()                     EXIT()
  dbs_data exists            dbs_data->usage_count-- == 0
                             kfree(dbs_data)
  dbs_data->usage_count++
  *NULL dereference*

The point is that there are potential race conditions; it's just a matter
of interleaving?

Regards
Preeti U Murthy
Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
On 06/02/2015 11:09 AM, Viresh Kumar wrote:
> On 02-06-15, 11:01, Preeti U Murthy wrote:
>> How will a policy lock help here at all, when cpus from multiple
>> policies are calling into __cpufreq_governor() ? How will a policy lock
>> serialize their entry into cpufreq_governor_dbs() ?
>
> So different policies don't really depend on each other. The only
> thing common to them are the governor's sysfs files (only if
> governor-per-policy isn't set, i.e. in your case). Those sysfs files
> and their kernel counterpart variables aren't touched unless all the
> policies have EXITED. All these START/STOP calls touch only the data
> relevant to those policies only.

No, dbs_data is a governor wide data structure and not a policy wide one,
which is manipulated in START/STOP calls for drivers where the
CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag is not set.

So even if we assume that we hold per-policy locks, the following race is
still present. Assume that we have just two cpus which do not have
governor-per-policy set.

CPU0                       CPU1

store*                     store*

lock(policy 1)             lock(policy 2)
cpufreq_set_policy()       cpufreq_set_policy()
EXIT():
  dbs_data->usage_count--

INIT()
  dbs_data exists, so return
                           EXIT()
                             dbs_data->usage_count-- == 0
                             kfree(dbs_data)
START()
  dereference dbs_data
  *NULL dereference*

Regards
Preeti U Murthy
Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
> On 01-06-15, 01:40, Preeti U Murthy wrote:
>
> I have to mention that this is somewhat inspired by:
>
> https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
>
> and I was waiting to finish some core-changes to make all this simple.
> I am fine to you trying to finish it though :)
>
>> The problem showed up when running hotplug operations and changing
>> governors in parallel. The crash would be at:
>>
>> [ 174.319645] Unable to handle kernel paging request for data at address
>> 0x
>> [ 174.319782] Faulting instruction address: 0xc053b3e0
>> cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
>>     pc: c053b3e0: __bitmap_weight+0x70/0x100
>>     lr: c085a338: need_load_eval+0x38/0xf0
>>     sp: c3fdbaf0
>>    msr: 90019033
>>    dar: 0
>>  dsisr: 4000
>>   current = 0xc3151a40
>>   paca    = 0xc7da0980   softe: 0   irq_happened: 0x01
>>     pid = 842, comm = kworker/1:2
>> enter ? for help
>> [c3fdbb40] c085a338 need_load_eval+0x38/0xf0
>> [c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
>> [c3fdbbe0] c00f489c process_one_work+0x24c/0x910
>> [c3fdbc90] c00f50dc worker_thread+0x17c/0x540
>> [c3fdbd20] c00fed70 kthread+0x120/0x140
>> [c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64
>>
>> While debugging the issue, other problems in this area were uncovered,
>> all of them necessitating serialized calls to __cpufreq_governor(). One
>> potential race condition that can happen today is the below:
>>
>> CPU0                              CPU1
>>
>> cpufreq_set_policy()
>> __cpufreq_governor
>> (CPUFREQ_GOV_POLICY_EXIT)
>>                                   __cpufreq_remove_dev_finish()
>> free(dbs_data)
>>                                   __cpufreq_governor
>>                                   (CPUFREQ_GOV_START)
>>                                   dbs_data->mutex <= NULL dereference
>>
>> The issue here is that calls to cpufreq_governor_dbs() are not serialized
>> and they can conflict with each other in numerous ways. One way to sort
>> this out would be to serialize all calls to cpufreq_governor_dbs() by
>> setting the governor busy if a call is in progress and blocking all
>> other calls. But this approach will not cover all loopholes. Take the
>> above scenario: CPU1 will still hit a NULL dereference if care is not
>> taken to check for a NULL dbs_data.
>>
>> To sort out such scenarios, we could filter out the sequence of events: a
>> CPUFREQ_GOV_START cannot be called without an INIT, if the previous
>> event was an EXIT. However, this results in analysing all possible
>> sequences of events and adding each of them as a filter. This results in
>> unmanageable code, and there is a high probability of missing out on a
>> race condition. Both the above approaches were tried out earlier [1].
>
> I agree.
>
>> Let us therefore look at the heart of the issue.
>
> Yeah, we should :)
>
>> It is not really about serializing calls to cpufreq_governor_dbs(); it
>> seems to be about serializing entire sequences of CPUFREQ_GOV*
>> operations. For instance, in cpufreq_set_policy(), we STOP, EXIT the old
>> policy and INIT and START the new policy. Between the EXIT and INIT,
>> there must not be anybody else starting the policy. And between INIT and
>> START, there must be nobody stopping the policy.
>
> Hmm..
>
>> A similar argument holds for the CPUFREQ_GOV* operations in
>> __cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence,
>> until each of these functions completes in totality, none of the others
>> should run in parallel. The interleaving of the individual calls to
>> cpufreq_governor_dbs() is resulting in invalid operations. This patch
>> therefore tries to serialize entire cpufreq functions calling
>> CPUFREQ_GOV* operations with respect to each other.
>
> We were forced to put band-aids until this time and I am really
> looking into getting this fixed at the root.
>
> The problem is that we drop policy locks before calling
> __cpufreq_governor() and that's the root cause of all these problems
> we are facing. We did that because we were getting warnings about
> circular locks (955ef4833574 ("cpufreq: Drop rwsem lock around
> CPUFREQ_GOV_POLICY_EXIT"))..
>
> I have explained that problem here (Never sent this upstr
Re: [RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
On 06/01/2015 12:49 PM, Viresh Kumar wrote:
> On 01-06-15, 01:40, Preeti U Murthy wrote:
>
> I have to mention that this is somewhat inspired by:
>
> https://git.linaro.org/people/viresh.kumar/linux.git/commit/1e37f1d6ae12f5896e4e216f986762c3050129a5
>
> and I was waiting to finish some core-changes to make all this simple.
> I am fine to you trying to finish it though :)

I am extremely sorry for not having pointed it out explicitly. I assumed
mentioning a reference to it would do, but in retrospect, I should have
stated it clearly. I will remember to check with the previous authors
about their progress on a previously posted patch before posting my
version of the same.

Thank you for pointing it out :)

Regards
Preeti U Murthy
[RFC PATCH] cpufreq/hotplug: Fix cpu-hotplug cpufreq race conditions
The problem showed up when running hotplug operations and changing
governors in parallel. The crash would be at:

[ 174.319645] Unable to handle kernel paging request for data at address 0x
[ 174.319782] Faulting instruction address: 0xc053b3e0
cpu 0x1: Vector: 300 (Data Access) at [c3fdb870]
    pc: c053b3e0: __bitmap_weight+0x70/0x100
    lr: c085a338: need_load_eval+0x38/0xf0
    sp: c3fdbaf0
   msr: 90019033
   dar: 0
 dsisr: 4000
  current = 0xc3151a40
  paca = 0xc7da0980   softe: 0   irq_happened: 0x01
  pid = 842, comm = kworker/1:2
enter ? for help
[c3fdbb40] c085a338 need_load_eval+0x38/0xf0
[c3fdbb70] c0856a10 od_dbs_timer+0x90/0x1e0
[c3fdbbe0] c00f489c process_one_work+0x24c/0x910
[c3fdbc90] c00f50dc worker_thread+0x17c/0x540
[c3fdbd20] c00fed70 kthread+0x120/0x140
[c3fdbe30] c0009678 ret_from_kernel_thread+0x5c/0x64

While debugging the issue, other problems in this area were uncovered,
all of them necessitating serialized calls to __cpufreq_governor(). One
potential race condition that can happen today is the below:

CPU0                                          CPU1
cpufreq_set_policy()
__cpufreq_governor(CPUFREQ_GOV_POLICY_EXIT)
                                              __cpufreq_remove_dev_finish()
free(dbs_data)
                                              __cpufreq_governor(CPUFREQ_GOV_START)
                                              dbs_data->mutex => NULL dereference

The issue here is that calls to cpufreq_governor_dbs() are not
serialized and they can conflict with each other in numerous ways. One
way to sort this out would be to serialize all calls to
cpufreq_governor_dbs() by setting the governor busy if a call is in
progress and blocking all other calls. But this approach will not cover
all loopholes. Take the above scenario: CPU1 will still hit a NULL
dereference if care is not taken to check for a NULL dbs_data.

To sort out such scenarios, we could filter out the sequence of events:
a CPUFREQ_GOV_START cannot be called without an INIT, if the previous
event was an EXIT. However, this results in analysing all possible
sequences of events and adding each of them as a filter, which leads to
unmanageable code.
There is a high probability of missing out on a race condition. Both of
the above approaches were tried out earlier [1].

Let us therefore look at the heart of the issue. It is not really about
serializing calls to cpufreq_governor_dbs(); it seems to be about
serializing the entire sequence of CPUFREQ_GOV* operations. For
instance, in cpufreq_set_policy(), we STOP, EXIT the old policy and
INIT and START the new policy. Between the EXIT and INIT, there must not
be anybody else starting the policy, and between INIT and START, there
must be nobody stopping the policy.

A similar argument holds for the CPUFREQ_GOV* operations in
__cpufreq_policy_dev_{prepare|finish} and cpufreq_add_policy(). Hence,
until each of these functions completes in totality, none of the others
should run in parallel. The interleaving of the individual calls to
cpufreq_governor_dbs() is resulting in invalid operations. This patch
therefore tries to serialize the entire cpufreq functions calling
CPUFREQ_GOV* operations with respect to each other.

However, there are a few concerns:

a. On those archs which do not have the CPUFREQ_HAVE_GOVERNOR_PER_POLICY
flag set, this patch results in softlockups. This may be because we are
setting the governor busy in a preempt-enabled section, which will block
other cpufreq operations across all cpus.

b. This has still not solved the kernel panic mentioned at the beginning
of the changelog. But it does resolve the kernel panic on a 3.18 stable
kernel: the 3.18 kernel survives parallel hotplug and cpufreq governor
change operations with this patch.

So the reason this patch was posted out as an RFC was to resolve the
above two issues with regard to this patch and to get the discussion
going on resolving potential race conditions in parallel cpufreq and
hotplug operations, which seem rampant and not easily solvable. There
are race conditions in parallel cpufreq operations themselves. This RFC
patch brings forth potential issues and possible approaches to
solutions.
[1] http://comments.gmane.org/gmane.linux.power-management.general/49337 Signed-off-by: Preeti U Murthy --- drivers/cpufreq/cpufreq.c | 68 +++-- include/linux/cpufreq.h |2 + 2 files changed, 67 insertions(+), 3 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 8ae655c..e03e738 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -121,6 +121,45 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy) } EXPORT_SYMBOL_GPL(get_governor_parent_kobj); +static bool is_governor_busy(struct cpufreq_policy *policy) +{ + if (have_governor_per_
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
On 05/30/2015 11:31 AM, Vaidyanathan Srinivasan wrote:
> * Preeti U Murthy [2015-05-29 19:17:17]:
>
> [snip]
>
>>> + if (max_idle_state > 1) {
>>> + snooze_timeout_en = true;
>>> + snooze_timeout = cpuidle_state_table[1].target_residency *
>>> + tb_ticks_per_usec;
>>> + }
>>
>> Any idea why we don't have snooze defined on the shared lpar
>> configuration ?
>
> In shared lpar case, spinning in guest context may potentially take
> away cycles from other lpars waiting to run on the same physical cpu.
>
> So the policy in shared lpar case is to let PowerVM hypervisor know
> immediately that the guest cpu is idle which will allow the hypervisor
> to use the cycles for other tasks/lpars.

Oh Ok! Thanks for the clarification!

Regards
Preeti U Murthy

> --Vaidy
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
Hi Shilpa,

The subject does not convey the purpose of this patch clearly IMO. I
would definitely suggest changing the subject to something like "Auto
promotion of snooze to deeper idle state" or similar.

On 05/29/2015 06:02 PM, Shilpasri G Bhat wrote:
> The idle cpus which stay in snooze for a long period can degrade the
> performance of the sibling cpus. If the cpu stays in snooze for more
> than the target residency of the next available idle state, then exit
> from snooze. This gives a chance to the cpuidle governor to re-evaluate
> the last idle state of the cpu to promote it to deeper idle states.
>
> Signed-off-by: Shilpasri G Bhat
> ---
> drivers/cpuidle/cpuidle-powernv.c | 12
> drivers/cpuidle/cpuidle-pseries.c | 11 +++
> 2 files changed, 23 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index bb9e2b6..07135e0 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -27,6 +27,8 @@ struct cpuidle_driver pseries_idle_driver = {
>
> static int max_idle_state;
> static struct cpuidle_state *cpuidle_state_table;
> +static u64 snooze_timeout;
> +static bool snooze_timeout_en;
>
> static inline void idle_loop_prolog(unsigned long *in_purr)
> {
> @@ -58,14 +60,18 @@ static int snooze_loop(struct cpuidle_device *dev,
> int index)
> {
> unsigned long in_purr;
> + u64 snooze_exit_time;
>
> idle_loop_prolog(&in_purr);
> local_irq_enable();
> set_thread_flag(TIF_POLLING_NRFLAG);
> + snooze_exit_time = get_tb() + snooze_timeout;
>
> while (!need_resched()) {
> HMT_low();
> HMT_very_low();
> + if (snooze_timeout_en && get_tb() > snooze_exit_time)
> + break;
> }
>
> HMT_medium();
> @@ -244,6 +250,11 @@ static int pseries_idle_probe(void)
> } else
> return -ENODEV;
>
> + if (max_idle_state > 1) {
> + snooze_timeout_en = true;
> + snooze_timeout = cpuidle_state_table[1].target_residency *
> + tb_ticks_per_usec;
> + }

Any idea why we don't have snooze defined on the shared lpar
configuration ?

Regards
Preeti U Murthy

> return 0;
> }
Re: [PATCH v2] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/28/2015 07:39 AM, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if > CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0. > However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) > entry in the cpuidle driver's table of states is overwritten with > the default "poll" entry by the core. The "state" defined by the > "poll" entry doesn't provide ->enter_dead and ->enter_freeze > callbacks and its exit_latency is 0. > > For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START > in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state" > will be skipped by the loop). > > It also is arguably unuseful to return states with exit_latency > equal to 0 from find_deepest_state(), so the function can be modified > to start the loop from index 0 and the "poll state" will be skipped by > it as a result of the check against latency_req. > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpuidle/cpuidle.c |6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > Index: linux-pm/drivers/cpuidle/cpuidle.c > === > --- linux-pm.orig/drivers/cpuidle/cpuidle.c > +++ linux-pm/drivers/cpuidle/cpuidle.c > @@ -65,7 +65,7 @@ int cpuidle_play_dead(void) > return -ENODEV; > > /* Find lowest-power state that supports long-term idle */ > - for (i = drv->state_count - 1; i >= CPUIDLE_DRIVER_STATE_START; i--) > + for (i = drv->state_count - 1; i >= 0; i--) > if (drv->states[i].enter_dead) > return drv->states[i].enter_dead(dev, i); > > @@ -79,9 +79,9 @@ static int find_deepest_state(struct cpu > bool freeze) > { > unsigned int latency_req = 0; > - int i, ret = freeze ? 
-1 : CPUIDLE_DRIVER_STATE_START - 1;
> + int i, ret = -ENXIO;
>
> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> + for (i = 0; i < drv->state_count; i++) {
> struct cpuidle_state *s = &drv->states[i];
> struct cpuidle_state_usage *su = &dev->states_usage[i];
>

Reviewed-by: Preeti U Murthy
Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/27/2015 07:27 PM, Rafael J. Wysocki wrote: > On Wed, May 27, 2015 at 2:25 PM, Daniel Lezcano > wrote: >> On 05/27/2015 01:31 PM, Preeti U Murthy wrote: >>> >>> On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote: >>>> >>>> From: Rafael J. Wysocki >>>> >>>> The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if >>>> CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0. >>>> However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) >>>> entry in the cpuidle driver's table of states is overwritten with >>>> the default "poll" entry by the core. The "state" defined by the >>>> "poll" entry doesn't provide ->enter_dead and ->enter_freeze >>>> callbacks and its exit_latency is 0. >>>> >>>> For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START >>>> in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state" >>>> will be skipped by the loop) and in find_deepest_state() (since >>>> exit_latency is 0, the "poll state" will become the default if the >>>> "s->exit_latency <= latency_req" check is replaced with >>>> "s->exit_latency < latency_req" which may only matter for drivers >>>> providing different states with the same exit_latency). >>>> >>>> Signed-off-by: Rafael J. Wysocki >>>> --- >>>> drivers/cpuidle/cpuidle.c |8 >>>> 1 file changed, 4 insertions(+), 4 deletions(-) >>> >>> >>> >>>> >>>> @@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu >>>> bool freeze) >>>> { >>>> unsigned int latency_req = 0; >>>> - int i, ret = freeze ? 
-1 : CPUIDLE_DRIVER_STATE_START - 1; >>>> + int i, ret = -ENXIO; >>>> >>>> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { >>>> + for (i = 0; i < drv->state_count; i++) { >>>> struct cpuidle_state *s = >states[i]; >>>> struct cpuidle_state_usage *su = >states_usage[i]; >>>> >>>> - if (s->disabled || su->disable || s->exit_latency <= >>>> latency_req >>>> + if (s->disabled || su->disable || s->exit_latency < >>>> latency_req >>> >>> >>> Prior to this patch, >>> >>> For drivers on which CPUIDLE_DRIVER_STATE_START takes a value 0 and >>> whose first idle state has an exit_latency of 0, find_deepest_state() >>> would return -1 if it failed to find a deeper idle state. >>> But as an effect of this patch, find_deepest_state() returns 0 in the >>> above circumstance. >> >> >> Except I am missing something, with an exit_latency = 0, the state will be >> never selected, because of the "s->exit_latency < latency_req" condition >> (strictly greater than). > > No, this is the condition to skip the state, so previously it wouldn't > be selected, but after the patch it will. > > So yes, the patch changes behavior for systems with > CONFIG_ARCH_HAS_CPU_RELAX unset. > >>> My concern is if these drivers do not intend to enter a polling state >>> during suspend, this will cause an issue, won't it? > > The change in behavior happens for architectures where > CONFIG_ARCH_HAS_CPU_RELAX is not set. In those cases the 0-index > state is supposed to be provided by the driver. Is there a reason to > expect that this may not be a genuine idle state? On PowerPC, we have the 0-index idle state, whose exit_latency is 0 and all that the CPU does in this state, is poll on need_resched(), except at a lower priority from the hardware's standpoint. Nevertheless, the CPU is busy polling. So, I would not consider it a genuine idle state. On a side note, we do not yet support suspend on Power servers, but we may in the future. Hence the concern. 
>> Definitively poll can cause thermal issues, especially when suspending. It >> is a dangerous state (let's imagine you close your laptop => suspend/poll >> and then put it in your bag for a travel). > > With ARCH_HAS_CPU_RELAX set the "poll state" is supposed to be thermally safe. > >> I don't think with the code above we can reach this situation but I agree >> this is something we have to take care carefully. >> >> Actually, I am in favour of re
Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if
> CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0.
> However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0)
> entry in the cpuidle driver's table of states is overwritten with
> the default "poll" entry by the core. The "state" defined by the
> "poll" entry doesn't provide ->enter_dead and ->enter_freeze
> callbacks and its exit_latency is 0.
>
> For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START
> in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state"
> will be skipped by the loop) and in find_deepest_state() (since
> exit_latency is 0, the "poll state" will become the default if the
> "s->exit_latency <= latency_req" check is replaced with
> "s->exit_latency < latency_req" which may only matter for drivers
> providing different states with the same exit_latency).
>
> Signed-off-by: Rafael J. Wysocki
> ---
> drivers/cpuidle/cpuidle.c | 8
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> @@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu
> bool freeze)
> {
> unsigned int latency_req = 0;
> - int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
> + int i, ret = -ENXIO;
>
> - for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
> + for (i = 0; i < drv->state_count; i++) {
> struct cpuidle_state *s = &drv->states[i];
> struct cpuidle_state_usage *su = &dev->states_usage[i];
>
> - if (s->disabled || su->disable || s->exit_latency <= latency_req
> + if (s->disabled || su->disable || s->exit_latency < latency_req

Prior to this patch, for drivers on which CPUIDLE_DRIVER_STATE_START
takes a value 0 and whose first idle state has an exit_latency of 0,
find_deepest_state() would return -1 if it failed to find a deeper idle
state. But as an effect of this patch, find_deepest_state() returns 0 in
the above circumstance.
My concern is if these drivers do not intend to enter a polling state
during suspend, this will cause an issue, won't it?

This also gets me wondering if polling state is an acceptable idle state
during suspend, given that the drivers with ARCH_HAS_CPU_RELAX permit
entry into it during suspend today. I would expect the cpus to be in a
hardware defined idle state.

Regards
Preeti U Murthy
Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote: From: Rafael J. Wysocki rafael.j.wyso...@intel.com The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0. However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) entry in the cpuidle driver's table of states is overwritten with the default poll entry by the core. The state defined by the poll entry doesn't provide -enter_dead and -enter_freeze callbacks and its exit_latency is 0. For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START in cpuidle_play_dead() (-enter_dead is NULL, so the poll state will be skipped by the loop) and in find_deepest_state() (since exit_latency is 0, the poll state will become the default if the s-exit_latency = latency_req check is replaced with s-exit_latency latency_req which may only matter for drivers providing different states with the same exit_latency). Signed-off-by: Rafael J. Wysocki rafael.j.wyso...@intel.com --- drivers/cpuidle/cpuidle.c |8 1 file changed, 4 insertions(+), 4 deletions(-) snip @@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu bool freeze) { unsigned int latency_req = 0; - int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1; + int i, ret = -ENXIO; - for (i = CPUIDLE_DRIVER_STATE_START; i drv-state_count; i++) { + for (i = 0; i drv-state_count; i++) { struct cpuidle_state *s = drv-states[i]; struct cpuidle_state_usage *su = dev-states_usage[i]; - if (s-disabled || su-disable || s-exit_latency = latency_req + if (s-disabled || su-disable || s-exit_latency latency_req Prior to this patch, For drivers on which CPUIDLE_DRIVER_STATE_START takes a value 0 and whose first idle state has an exit_latency of 0, find_deepest_state() would return -1 if it failed to find a deeper idle state. But as an effect of this patch, find_deepest_state() returns 0 in the above circumstance. 
But as an effect of this patch, find_deepest_state() returns 0 in the above circumstance. My concern is that if these drivers do not intend to enter a polling state during suspend, this will cause an issue, won't it? This also gets me wondering whether a polling state is an acceptable idle state during suspend at all, given that the drivers with ARCH_HAS_CPU_RELAX permit entry into it during suspend today. I would expect the CPUs to be in a hardware-defined idle state.

Regards
Preeti U Murthy
Re: [PATCH] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/27/2015 07:27 PM, Rafael J. Wysocki wrote:

On Wed, May 27, 2015 at 2:25 PM, Daniel Lezcano daniel.lezc...@linaro.org wrote:

On 05/27/2015 01:31 PM, Preeti U Murthy wrote:

On 05/27/2015 07:06 AM, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki rafael.j.wyso...@intel.com

The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if CONFIG_ARCH_HAS_CPU_RELAX is set; otherwise it is defined as 0. However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) entry in the cpuidle driver's table of states is overwritten with the default poll entry by the core. The state defined by the poll entry doesn't provide ->enter_dead and ->enter_freeze callbacks and its exit_latency is 0. For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START in cpuidle_play_dead() (->enter_dead is NULL, so the poll state will be skipped by the loop) and in find_deepest_state() (since exit_latency is 0, the poll state will become the default if the s->exit_latency <= latency_req check is replaced with s->exit_latency < latency_req, which may only matter for drivers providing different states with the same exit_latency).

Signed-off-by: Rafael J. Wysocki rafael.j.wyso...@intel.com
---
 drivers/cpuidle/cpuidle.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

[snip]

@@ -79,13 +79,13 @@ static int find_deepest_state(struct cpu
 			      bool freeze)
 {
 	unsigned int latency_req = 0;
-	int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
+	int i, ret = -ENXIO;

-	for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
+	for (i = 0; i < drv->state_count; i++) {
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];

-		if (s->disabled || su->disable || s->exit_latency <= latency_req
+		if (s->disabled || su->disable || s->exit_latency < latency_req

Prior to this patch, for drivers on which CPUIDLE_DRIVER_STATE_START takes the value 0 and whose first idle state has an exit_latency of 0, find_deepest_state() would return -1 if it failed to find a deeper idle state. But as an effect of this patch, find_deepest_state() returns 0 in the above circumstance.

Except I am missing something, with an exit_latency of 0 the state will never be selected, because of the s->exit_latency > latency_req condition (strictly greater than).

No, this is the condition to skip the state, so previously it wouldn't be selected, but after the patch it will. So yes, the patch changes behavior for systems with CONFIG_ARCH_HAS_CPU_RELAX unset.

My concern is that if these drivers do not intend to enter a polling state during suspend, this will cause an issue, won't it?

The change in behavior happens for architectures where CONFIG_ARCH_HAS_CPU_RELAX is not set. In those cases the 0-index state is supposed to be provided by the driver. Is there a reason to expect that this may not be a genuine idle state?

On PowerPC, we have the 0-index idle state, whose exit_latency is 0, and all that the CPU does in this state is poll on need_resched(), except at a lower priority from the hardware's standpoint. Nevertheless, the CPU is busy polling. So I would not consider it a genuine idle state. On a side note, we do not yet support suspend on Power servers, but we may in the future. Hence the concern.

Definitely poll can cause thermal issues, especially when suspending. It is a dangerous state (let's imagine you close your laptop => suspend/poll and then put it in your bag for a travel).

With ARCH_HAS_CPU_RELAX set the poll state is supposed to be thermally safe. I don't think with the code above we can reach this situation, but I agree this is something we have to take care of carefully. Actually, I am in favour of removing poll from the cpuidle driver altogether and polling only when cpuidle state selection fails under certain conditions. So I fully agree with your statement below.

I would expect the CPUs to be in a hardware-defined idle state.

Well, except for the degenerate case in which all of them are disabled, and for the broadcast-timer-stopping avoidance use case for find_deepest_state().

During suspend, the CPUs can very well enter states where the timer stops, since we stop timer interrupts anyway. So unless the idle states are explicitly disabled by the user/hardware for some reason, deeper idle states will still be available during suspend as far as I can see.

So there are two questions in my view:

(1) Should find_deepest_state() ever return states with exit_latency equal to 0? I would say no, since such an idle state would mostly be polling on a wakeup event. At least, there is one such case on PowerPC.

(2) If the answer to (1) is yes, should the poll state ever be returned by find_deepest_state()? In any case, find_deepest_state() will only return a state with exit_latency equal to 0
Re: [PATCH v2] cpuidle: Do not use CPUIDLE_DRIVER_STATE_START in cpuidle.c
On 05/28/2015 07:39 AM, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki rafael.j.wyso...@intel.com

The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if CONFIG_ARCH_HAS_CPU_RELAX is set; otherwise it is defined as 0. However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0) entry in the cpuidle driver's table of states is overwritten with the default poll entry by the core. The state defined by the poll entry doesn't provide ->enter_dead and ->enter_freeze callbacks and its exit_latency is 0.

For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START in cpuidle_play_dead() (->enter_dead is NULL, so the poll state will be skipped by the loop).

It also is arguably unuseful to return states with exit_latency equal to 0 from find_deepest_state(), so the function can be modified to start the loop from index 0 and the poll state will be skipped by it as a result of the check against latency_req.

Signed-off-by: Rafael J. Wysocki rafael.j.wyso...@intel.com
---
 drivers/cpuidle/cpuidle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -65,7 +65,7 @@ int cpuidle_play_dead(void)
 		return -ENODEV;

 	/* Find lowest-power state that supports long-term idle */
-	for (i = drv->state_count - 1; i >= CPUIDLE_DRIVER_STATE_START; i--)
+	for (i = drv->state_count - 1; i >= 0; i--)
 		if (drv->states[i].enter_dead)
 			return drv->states[i].enter_dead(dev, i);

@@ -79,9 +79,9 @@ static int find_deepest_state(struct cpu
 			      bool freeze)
 {
 	unsigned int latency_req = 0;
-	int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;
+	int i, ret = -ENXIO;

-	for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
+	for (i = 0; i < drv->state_count; i++) {
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];

Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Re: [PATCH linux-next] tracing/mm: Use raw_smp_processor_id() instead of smp_processor_id()
On 05/15/2015 06:57 PM, Shreyas B. Prabhu wrote:
> trace_mm_page_pcpu_drain, trace_kmem_cache_free and trace_mm_page_free
> can potentially be called from an offlined cpu. Since trace points use
> RCU and RCU should not be used from offlined cpus, we have checks to
> filter out such calls.
>
> But these checks use smp_processor_id(), and since these trace calls can
> happen from preemptible sections, this throws a warning when running
> with CONFIG_DEBUG_PREEMPT set.
>
> Now consider what happens if the task gets migrated after calling smp_processor_id():
> - From an online cpu to another online cpu - no impact.
> - From an online cpu to an offline cpu - should never happen.
> - From an offline cpu to an online cpu - once a cpu has been
>   offlined, it returns to cpu_idle_loop, discovers it is offline and calls
>   arch_cpu_idle_dead. All this happens with preemption disabled, so
>   this scenario too should never happen.
>
> Thus running with PREEMPT set has no impact on the condition. So use
> raw_smp_processor_id() so that the warnings are suppressed.
>
> Signed-off-by: Shreyas B. Prabhu
> ---
>  include/trace/events/kmem.h | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 6cd975f..f7554fd 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -146,7 +146,16 @@ DEFINE_EVENT_CONDITION(kmem_free, kmem_cache_free,
>
>  	TP_ARGS(call_site, ptr),
>
> -	TP_CONDITION(cpu_online(smp_processor_id()))
> +	/*
> +	 * This trace can potentially be called from an offlined cpu.
> +	 * Since trace points use RCU and RCU should not be used from
> +	 * offline cpus, filter such calls out.
> +	 * While this trace can be called from a preemptible section,
> +	 * it has no impact on the condition since tasks can migrate
> +	 * only from online cpus to other online cpus. Thus it's safe
> +	 * to use raw_smp_processor_id.
> +	 */
> +	TP_CONDITION(cpu_online(raw_smp_processor_id()))
>  );
>
>  TRACE_EVENT_CONDITION(mm_page_free,
> @@ -155,7 +164,17 @@ TRACE_EVENT_CONDITION(mm_page_free,
>
>  	TP_ARGS(page, order),
>
> -	TP_CONDITION(cpu_online(smp_processor_id())),
> +
> +	/*
> +	 * This trace can potentially be called from an offlined cpu.
> +	 * Since trace points use RCU and RCU should not be used from
> +	 * offline cpus, filter such calls out.
> +	 * While this trace can be called from a preemptible section,
> +	 * it has no impact on the condition since tasks can migrate
> +	 * only from online cpus to other online cpus. Thus it's safe
> +	 * to use raw_smp_processor_id.
> +	 */
> +	TP_CONDITION(cpu_online(raw_smp_processor_id())),
>
>  	TP_STRUCT__entry(
>  		__field(unsigned long, pfn)
> @@ -263,7 +282,16 @@ TRACE_EVENT_CONDITION(mm_page_pcpu_drain,
>
>  	TP_ARGS(page, order, migratetype),
>
> -	TP_CONDITION(cpu_online(smp_processor_id())),
> +	/*
> +	 * This trace can potentially be called from an offlined cpu.
> +	 * Since trace points use RCU and RCU should not be used from
> +	 * offline cpus, filter such calls out.
> +	 * While this trace can be called from a preemptible section,
> +	 * it has no impact on the condition since tasks can migrate
> +	 * only from online cpus to other online cpus. Thus it's safe
> +	 * to use raw_smp_processor_id.
> +	 */
> +	TP_CONDITION(cpu_online(raw_smp_processor_id())),
>
>  	TP_STRUCT__entry(
>  		__field(unsigned long, pfn)

Reviewed-by: Preeti U Murthy
Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures
On 05/14/2015 04:29 AM, Kevin Hilman wrote:
> "Rafael J. Wysocki" writes:
>
> [...]
>
>> Second, quite honestly, I don't see a connection to genpd here.
>
> The connection with genpd is because the *reason* the timer was
> shutdown/stopped is because it shares power with the CPU, which is why
> the timer stops when the CPU hits certain low power states. IOW, it's
> in the same power domain as the CPU.
>
>> What you seem to be saying is "maybe we can eliminate the need to check the
>> return value of tick_broadcast_enter() in the idle loop if we proactively
>> disable the TIMER_STOP idle states of a CPU when we start to use that CPU's
>> timer as a broadcast one".
>>
>> So this seems to be about the timekeeping rather than power domains, because
>> that's where the broadcast thing is done. So the code setting up the CPU's
>> timer for broadcast would pretty much need to pause cpuidle, go through the
>> CPU's idle states and disable the TIMER_STOP ones. And do the reverse when
>> the timer is not going to be used for broadcast any more.
>
> Or..., modify the timer subsystem to use runtime PM on the timer devices,
> create a genpd that includes the timer device, and use
> pm_genpd_attach_cpuidle() to attach that genpd so that whenever that
> timer is runtime PM active, the deeper C-states cannot be hit.

I think you are missing a point here. If such a solution were possible, the tick broadcast framework would not have been designed to support deep cpu idle states. One reason we cannot go this way, of course, is that not all archs may support genpd, as was pointed out. But the second reason IMO is that a timer is runtime PM active as long as there is some deferred work, either in the near or far future. The point behind the broadcast framework is to let these CPUs go to deeper idle states when the timers are in the "far" future.
We can potentially save power by doing so, and we don't need to keep the entire power domain active just because the timer is supposed to fire 5 minutes from now, which is precisely what happens if we go the genpd way. Hence I don't think we can trivially club timers with genpd unless we have a way to power the timer PM domain down depending on when it is supposed to fire, in which case we will merely be replicating the cpuidle governor code.

Regards
Preeti U Murthy

>> So question is whether or not this is actually really more
>> straightforward than checking the return value of
>> tick_broadcast_enter() in the idle loop after all.
>
> Unfortunately this problem doesn't only affect timers.
>
> Daniel's broader point is that $SUBJECT series only handles this for the
> timer, but there's actually a more general problem to solve for *any*
> device that shares a power domain with a CPU (e.g. CPU-local
> timers, interrupt controllers, performance monitoring units, floating
> point units, etc. etc.)
>
> If we keep adding checks to the idle loop for all those devices, we're
> heading for a mess. (In fact, this is exactly what CPUidle drivers in
> lots of vendor trees are doing, and it is indeed quite messy, and very
> vendor specific.)
>
> Also, solving this more general problem was the primary motivation for
> adding the genpd _attach_cpuidle() feature in the first place, so why not
> use that?
>
> Longer term, IMO, these dependencies between CPUs and all these "extras"
> logic that share a power domain should be modeled by a genpd. If all
> those devices are using runtime PM, including the CPUs, and they are
> grouped into a genpd, then we can very easily know at the genpd level
> whether or not the CPU could be powered down, and to what level. This
> longer-term solution is what I want to discuss at LPC this year in my
> "Unify idle management of CPUs and IO devices" topic[1].
> (Also FYI,
> using a genpd to model a CPU and connected logic is part of the
> motivation behind the recent proposals to add support for multiple
> states to genpd by Axel Haslam.)
>
> Anyways I digress...
>
> In the short term, while your patches look fine to me, the objection I
> have is that it's only a band-aid fix that handles timers, but none of
> the other "extras" that might share a power rail with the CPU. So,
> until we have the long-term stuff sorted out, the better
> short-term solution IMO is the _attach_cpuidle() one above.
>
> Kevin
>
> [1] http://wiki.linuxplumbersconf.org/2015:energy-aware_scheduling
>
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures
On 05/10/2015 04:45 AM, Rafael J. Wysocki wrote:
> On Saturday, May 09, 2015 10:33:05 PM Rafael J. Wysocki wrote:
>> On Saturday, May 09, 2015 10:11:41 PM Rafael J. Wysocki wrote:
>>> On Saturday, May 09, 2015 11:19:16 AM Preeti U Murthy wrote:
>>>> Hi Rafael,
>>>>
>>>> On 05/08/2015 07:48 PM, Rafael J. Wysocki wrote:
>>>
>>> [cut]
>>>
>>>>>> +	/* Take note of the planned idle state. */
>>>>>> +	idle_set_state(smp_processor_id(), target_state);
>>>>>
>>>>> And I wouldn't do this either.
>>>>>
>>>>> The behavior here is pretty much as though the driver demoted the state
>>>>> chosen by the governor and we don't call idle_set_state() again in those cases.
>>>>
>>>> Why is this wrong?
>>>
>>> It is not "wrong", but incomplete, because demotions done by the cpuidle
>>> driver should also be taken into account in the same way.
>>>
>>> But I'm seeing that the recent patch of mine that made cpuidle_enter_state()
>>> call default_idle_call() was a mistake, because it might confuse
>>> find_idlest_cpu() significantly as to what state the CPU is in. I'll drop
>>> that one for now.
>>
>> OK, done.
>>
>> So after I've dropped it I think we need to do three things:
>> (1) Move the idle_set_state() calls to cpuidle_enter_state().
>> (2) Make cpuidle_enter_state() call default_idle_call() again, but this time
>>     do that *before* it has called idle_set_state() for target_state.
>> (3) Introduce demotion as per my last patch.
>>
>> Let me cut patches for that.
>
> Done as per the above and the patches follow in replies to this message.
>
> All on top of the current linux-next branch of the linux-pm.git tree.

The patches look good. Based and tested these patches on top of linux-pm/linux-next. (They are not yet in the branch as far as I can see.)
All patches in this series Reviewed and Tested-by: Preeti U Murthy
Re: [PATCH 0/3] cpuidle: updates related to tick_broadcast_enter() failures
On 05/10/2015 04:45 AM, Rafael J. Wysocki wrote:
> On Saturday, May 09, 2015 10:33:05 PM Rafael J. Wysocki wrote:
>> On Saturday, May 09, 2015 10:11:41 PM Rafael J. Wysocki wrote:
>>> On Saturday, May 09, 2015 11:19:16 AM Preeti U Murthy wrote:
>>>> Hi Rafael,
>>>>
>>>> On 05/08/2015 07:48 PM, Rafael J. Wysocki wrote:
>>>
>>> [cut]
>>>
>>>>>> +	/* Take note of the planned idle state. */
>>>>>> +	idle_set_state(smp_processor_id(), target_state);
>>>>>
>>>>> And I wouldn't do this either.
>>>>>
>>>>> The behavior here is pretty much as though the driver demoted the state
>>>>> chosen by the governor and we don't call idle_set_state() again in those cases.
>>>>
>>>> Why is this wrong?
>>>
>>> It is not "wrong", but incomplete, because demotions done by the cpuidle
>>> driver should also be taken into account in the same way.
>>>
>>> But I'm seeing that the recent patch of mine that made cpuidle_enter_state()
>>> call default_idle_call() was a mistake, because it might confuse
>>> find_idlest_cpu() significantly as to what state the CPU is in. I'll drop
>>> that one for now.
>>
>> OK, done.
>>
>> So after I've dropped it I think we need to do three things:
>> (1) Move the idle_set_state() calls to cpuidle_enter_state().
>> (2) Make cpuidle_enter_state() call default_idle_call() again, but this time
>>     do that *before* it has called idle_set_state() for target_state.
>> (3) Introduce demotion as per my last patch.
>>
>> Let me cut patches for that.
>
> Done as per the above and the patches follow in replies to this message.
>
> All on top of the current linux-next branch of the linux-pm.git tree.

I don't see the patches on linux-pm/linux-next.

Regards
Preeti U Murthy
Re: [PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
>state_count; i++) { > + for (i = CPUIDLE_DRIVER_STATE_START; i < limit; i++) { > struct cpuidle_state *s = &drv->states[i]; > struct cpuidle_state_usage *su = &dev->states_usage[i]; > > if (s->disabled || su->disable || s->exit_latency <= latency_req > - || (freeze && !s->enter_freeze) > + || (freeze && !s->enter_freeze) > + || (s->flags & flags_to_avoid)) > continue; > > latency_req = s->exit_latency; > @@ -100,7 +102,7 @@ static int find_deepest_state(struct cpu > int cpuidle_find_deepest_state(struct cpuidle_driver *drv, > struct cpuidle_device *dev) > { > - return find_deepest_state(drv, dev, false); > + return find_deepest_state(drv, dev, false, drv->state_count, 0); > } > > static void enter_freeze_proper(struct cpuidle_driver *drv, > @@ -139,7 +141,7 @@ int cpuidle_enter_freeze(struct cpuidle_ >* that interrupts won't be enabled when it exits and allows the tick to >* be frozen safely. >*/ > - index = find_deepest_state(drv, dev, true); > + index = find_deepest_state(drv, dev, true, drv->state_count, 0); > if (index >= 0) > enter_freeze_proper(drv, dev, index); > > @@ -168,8 +170,13 @@ int cpuidle_enter_state(struct cpuidle_d >* CPU as a broadcast timer, this call may fail if it is not available. 
>*/ > if (broadcast && tick_broadcast_enter()) { > - default_idle_call(); > - return -EBUSY; > + index = find_deepest_state(drv, dev, false, index, > +CPUIDLE_FLAG_TIMER_STOP); > + if (index < 0) { > + default_idle_call(); > + return -EBUSY; > + } > + target_state = &drv->states[index]; > } > > trace_cpu_idle_rcuidle(index, dev->cpu); Regards Preeti U Murthy
[tip:locking/core] kernel: Replace reference to ASSIGN_ONCE() with WRITE_ONCE() in comment
Commit-ID: 663fdcbee0a656cdaef934e7f50e6c2670373bc9 Gitweb: http://git.kernel.org/tip/663fdcbee0a656cdaef934e7f50e6c2670373bc9 Author: Preeti U Murthy AuthorDate: Thu, 30 Apr 2015 17:27:21 +0530 Committer: Ingo Molnar CommitDate: Fri, 8 May 2015 12:28:53 +0200 kernel: Replace reference to ASSIGN_ONCE() with WRITE_ONCE() in comment Looks like commit : 43239cbe79fc ("kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)") left behind a reference to ASSIGN_ONCE(). Update this to WRITE_ONCE(). Signed-off-by: Preeti U Murthy Signed-off-by: Peter Zijlstra (Intel) Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Thomas Gleixner Cc: borntrae...@de.ibm.com Cc: d...@stgolabs.net Cc: paul...@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20150430115721.22278.94082.st...@preeti.in.ibm.com Signed-off-by: Ingo Molnar --- include/linux/compiler.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 8677225..a7c0941 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -450,7 +450,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s * with an explicit memory barrier or atomic instruction that provides the * required ordering. * - * If possible use READ_ONCE/ASSIGN_ONCE instead. + * If possible use READ_ONCE()/WRITE_ONCE() instead. */ #define __ACCESS_ONCE(x) ({ \ __maybe_unused typeof(x) __var = (__force typeof(x)) 0; \
Re: [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/08/2015 02:20 AM, Rafael J. Wysocki wrote: > On Thursday, May 07, 2015 11:17:21 PM Preeti U Murthy wrote: >> When a CPU has to enter an idle state where tick stops, it makes a call >> to tick_broadcast_enter(). The call will fail if this CPU is the >> broadcast CPU. Today, under such a circumstance, the arch cpuidle code >> handles this CPU. This is not convincing because not only are we not >> aware what the arch cpuidle code does, but we also do not account for >> the idle state residency time and usage of such a CPU. >> >> This scenario can be handled better by simply asking the cpuidle >> governor to choose an idle state where in ticks do not stop. To >> accommodate this change move the setting of runqueue idle state from the >> core to the cpuidle driver, else the rq->idle_state will be set wrong. >> >> Signed-off-by: Preeti U Murthy >> --- >> Changes from V1: https://lkml.org/lkml/2015/5/7/24 >> Rebased on the latest linux-pm/bleeding-edge >> >> drivers/cpuidle/cpuidle.c | 21 + >> drivers/cpuidle/governors/ladder.c | 13 ++--- >> drivers/cpuidle/governors/menu.c |6 +- >> include/linux/cpuidle.h|6 +++--- >> include/linux/sched.h | 16 >> kernel/sched/core.c| 17 + >> kernel/sched/fair.c|2 +- >> kernel/sched/idle.c|8 +--- >> kernel/sched/sched.h | 24 >> 9 files changed, 70 insertions(+), 43 deletions(-) >> >> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >> index 8c24f95..b7e86f4 100644 >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -21,6 +21,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> #include "cpuidle.h" >> @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, >> struct cpuidle_driver *drv, >> * CPU as a broadcast timer, this call may fail if it is not available. >> */ >> if (broadcast && tick_broadcast_enter()) { >> -default_idle_call(); >> -return -EBUSY; >> +index = cpuidle_select(drv, dev, !broadcast); > > No, you can't do that. 
> > This code path may be used by suspend-to-idle and that should not call > cpuidle_select(). > > What's needed here seems to be a fallback mechanism like "choose the > deepest state shallower than X and such that it won't stop the tick". > You don't really need to run a full governor for that. Agreed. Makes the patch a lot simpler as well. I have sent out V3 doing this. Thank you Regards Preeti U Murthy
[PATCH V3] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only do we not know what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply choosing an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Changes from V2: https://lkml.org/lkml/2015/5/7/78 Introduce a function in cpuidle core to select an idle state where ticks do not stop rather than going through the governors. Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge branch drivers/cpuidle/cpuidle.c | 45 +++-- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c |2 +- kernel/sched/idle.c |6 -- kernel/sched/sched.h | 24 6 files changed, 77 insertions(+), 33 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..d1af760 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include <linux/module.h> #include <linux/suspend.h> #include <linux/tick.h> +#include <linux/sched.h> #include <trace/events/power.h> #include "cpuidle.h" @@ -146,6 +147,36 @@ int cpuidle_enter_freeze(struct cpuidle_driver *drv, struct cpuidle_device *dev) return index; } +/* + * find_tick_valid_state - select a state where tick does not stop + * @dev: cpuidle device for this cpu + * @drv: cpuidle driver for this cpu + */ +static int find_tick_valid_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv) +{ + int i, ret = -1; + + for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) { + struct cpuidle_state *s = &drv->states[i]; + struct cpuidle_state_usage *su = &dev->states_usage[i]; + + /* +* We do not 
explicitly check for latency requirement +* since it is safe to assume that only shallower idle +* states will have the CPUIDLE_FLAG_TIMER_STOP bit +* cleared and they will invariably meet the latency +* requirement. +*/ + if (s->disabled || su->disable || + (s->flags & CPUIDLE_FLAG_TIMER_STOP)) + continue; + + ret = i; + } + return ret; +} + /** * cpuidle_enter_state - enter the state and update stats * @dev: cpuidle device for this cpu @@ -168,10 +199,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = find_tick_valid_state(dev, drv); + if (index < 0) { + default_idle_call(); + return -EBUSY; + } + target_state = &drv->states[index]; } + /* Take note of the planned idle state. */ + idle_set_state(smp_processor_id(), target_state); + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -180,6 +218,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. 
*/ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); diff --git a/include/linux/sched.h b/include/linux/sched.h index 26a2e61..fef8359 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -45,6 +45,7 @@ struct sched_param { #include <linux/rcupdate.h> #include <linux/rculist.h> #include <linux/rtmutex.h> +#include <linux/cpuidle.h> #include <linux/time.h> #include <linux/param.h> @@ -893,6 +894,21 @@ enum cpu_idle_type { CPU_MAX_IDLE_TYPES }; +#ifdef CONFIG_CPU_IDLE +extern void idle_set_state(int cpu, struct cpuidle_state *idle_state); +extern struct cpuidle_state *idle_get_state(int cpu); +#else +static inline void idle_set_state(int cpu, + struct cpuidle_state *idle_state) +{ +} + +static inline struct cpuidle_state *idle_get_state(int cpu) +{ + return NULL; +} +#endif + /* * Increase resolution of cpu_capacity calculations */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fe22f75..8e1cc50 100644 ---
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote: > On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote: >> On 05/05/2015 02:11 PM, Preeti U Murthy wrote: >>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: >>>> Hi Preeti, >>>> >>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote: >>>>> Hi Shilpa, >>>>> >>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: >>>>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE >>>>>> notification by executing *throttle_check() on any one of the cpu on >>>>>> the chip. This is a sanity check to verify if we were indeed >>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification. >>>>>> >>>>>> We cannot call *throttle_check() directly from the notification >>>>>> handler because we could be handling chip1's notification in chip2. So >>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled >>>>>> in the notification handler, so use a worker thread to smp_call >>>>>> throttle_check() on any of the cpu in the chipmask. >>>>> >>>>> I see that the first patch takes care of reporting *per-chip* throttling >>>>> for pmax capping condition. But where are we taking care of reporting >>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ? >>>>> >>>> >>>> IMO let us not have "psafe" and "freq control disabled" states managed >>>> per-chip. >>>> Because when the above two conditions occur it is likely to happen across >>>> all >>>> chips during an OCC reset cycle. So I am setting 'throttled' to false on >>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking >>>> *throttle_check(). >>> >>> Alright like I pointed in the previous reply, a comment to indicate that >>> psafe and freq control disabled conditions will fail when occ is >>> inactive and that all chips face the consequence of this will help. >> >> From your explanation on the thread of the first patch of this series, >> this will not be required. 
>> >> So, >> Reviewed-by: Preeti U Murthy > > OK, so is the whole series reviewed now? Yes the whole series has been reviewed. Regards Preeti U Murthy
Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/07/2015 11:09 PM, Preeti U Murthy wrote: > When a CPU has to enter an idle state where tick stops, it makes a call > to tick_broadcast_enter(). The call will fail if this CPU is the > broadcast CPU. Today, under such a circumstance, the arch cpuidle code > handles this CPU. This is not convincing because not only are we not > aware what the arch cpuidle code does, but we also do not account for > the idle state residency time and usage of such a CPU. > > This scenario can be handled better by simply asking the cpuidle > governor to choose an idle state where in ticks do not stop. To > accommodate this change move the setting of runqueue idle state from the > core to the cpuidle driver, else the rq->idle_state will be set wrong. > > Signed-off-by: Preeti U Murthy > --- > Rebased on the latest linux-pm/bleeding-edge Kindly ignore this. I have sent the rebase as V2. [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully The below patch is not updated. Regards Preeti U Murthy > > drivers/cpuidle/cpuidle.c | 21 + > drivers/cpuidle/governors/ladder.c | 13 ++--- > drivers/cpuidle/governors/menu.c |6 +- > include/linux/cpuidle.h|6 +++--- > include/linux/sched.h | 16 > kernel/sched/core.c| 17 + > kernel/sched/fair.c|2 +- > kernel/sched/idle.c|8 +--- > kernel/sched/sched.h | 24 > 9 files changed, 70 insertions(+), 43 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > index 61c417b..8f5657e 100644 > --- a/drivers/cpuidle/cpuidle.c > +++ b/drivers/cpuidle/cpuidle.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > > #include "cpuidle.h" > @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, >* local timer will be shut down. If a local timer is used from another >* CPU as a broadcast timer, this call may fail if it is not available. 
>*/ > - if (broadcast && tick_broadcast_enter()) > - return -EBUSY; > + if (broadcast && tick_broadcast_enter()) { > + index = cpuidle_select(drv, dev, !broadcast); > + if (index < 0) > + return -EBUSY; > + target_state = >states[index]; > + } > + > + /* Take note of the planned idle state. */ > + idle_set_state(smp_processor_id(), target_state); > > trace_cpu_idle_rcuidle(index, dev->cpu); > time_start = ktime_get(); > @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > time_end = ktime_get(); > trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); > > + /* The cpu is no longer idle or about to enter idle. */ > + idle_set_state(smp_processor_id(), NULL); > + > if (broadcast) { > if (WARN_ON_ONCE(!irqs_disabled())) > local_irq_disable(); > @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > * > * @drv: the cpuidle driver > * @dev: the cpuidle device > + * @timer_stop_valid: allow selection of idle state where tick stops > * > * Returns the index of the idle state. 
> */ > -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > +int cpuidle_select(struct cpuidle_driver *drv, > + struct cpuidle_device *dev, int timer_stop_valid) > { > - return cpuidle_curr_governor->select(drv, dev); > + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); > } > > /** > diff --git a/drivers/cpuidle/governors/ladder.c > b/drivers/cpuidle/governors/ladder.c > index 401c010..c437322 100644 > --- a/drivers/cpuidle/governors/ladder.c > +++ b/drivers/cpuidle/governors/ladder.c > @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct > ladder_device *ldev, > * ladder_select_state - selects the next state to enter > * @drv: cpuidle driver > * @dev: the CPU > + * @timer_stop_valid: allow selection of idle state where tick stops > */ > static int ladder_select_state(struct cpuidle_driver *drv, > - struct cpuidle_device *dev) > + struct cpuidle_device *dev, int > timer_stop_valid) > { > struct ladder_device *ldev = this_cpu_ptr(_devices); > struct ladder_device_state *last_state; > @@ -86,6 +87,7 @@ static int ladder_s
Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/07/2015 03:26 PM, Sudeep Holla wrote: > Hi Preeti, > > On 07/05/15 06:26, Preeti U Murthy wrote: >> When a CPU has to enter an idle state where tick stops, it makes a call >> to tick_broadcast_enter(). The call will fail if this CPU is the >> broadcast CPU. Today, under such a circumstance, the arch cpuidle code >> handles this CPU. This is not convincing because not only are we not >> aware what the arch cpuidle code does, but we also do not account for >> the idle state residency time and usage of such a CPU. >> >> This scenario can be handled better by simply asking the cpuidle >> governor to choose an idle state where in ticks do not stop. To >> accommodate this change move the setting of runqueue idle state from the >> core to the cpuidle driver, else the rq->idle_state will be set wrong. >> >> Signed-off-by: Preeti U Murthy >> --- >> Based on linux-pm/bleeding-edge > > I am unable to apply this patch cleanly on linux-pm/bleeding-edge > I think it conflicts with few patches that Rafael posted recently > which are in the branch now. Thanks so much for pointing out this! I have resent the patch as V2 after the rebase on the updated linux-pm/bleeding-edge branch Regards Preeti U Murthy > > Regards, > Sudeep
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state where in ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Changes from V1: https://lkml.org/lkml/2015/5/7/24 Rebased on the latest linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c |6 +- include/linux/cpuidle.h|6 +++--- include/linux/sched.h | 16 kernel/sched/core.c| 17 + kernel/sched/fair.c|2 +- kernel/sched/idle.c|8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 8c24f95..b7e86f4 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -168,10 +169,17 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * CPU as a broadcast timer, this call may fail if it is not available. */ if (broadcast && tick_broadcast_enter()) { - default_idle_call(); - return -EBUSY; + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) { + default_idle_call(); + return -EBUSY; + } + target_state = >states[index]; } + /* Take note of the planned idle state. 
*/ + idle_set_state(smp_processor_id(), target_state); + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -180,6 +188,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -215,12 +226,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /** diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && 
(drv->states[last_idx + 1].flags & CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRI
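The governor-side change above boils down to one rule: when the failed broadcast entry means the local tick must keep running (`timer_stop_valid == 0`), any state flagged `CPUIDLE_FLAG_TIMER_STOP` is excluded from selection. A minimal userspace sketch of that rule (plain C, not kernel code; the flag value and the demo state table are illustrative):

```c
#include <assert.h>

#define CPUIDLE_FLAG_TIMER_STOP 0x1   /* illustrative value */

struct idle_state {
    const char *name;
    unsigned int flags;
    unsigned int exit_latency;  /* us */
};

static const struct idle_state demo_states[] = {
    { "snooze", 0, 0 },
    { "nap",    0, 10 },
    { "sleep",  CPUIDLE_FLAG_TIMER_STOP, 100 },
};

/* Pick the deepest usable state; with timer_stop_valid == 0, states that
 * stop the tick are skipped, mirroring the
 * !(!timer_stop_valid && (flags & CPUIDLE_FLAG_TIMER_STOP)) test the patch
 * adds to the ladder and menu governors. */
static int select_state(const struct idle_state *s, int n,
                        unsigned int latency_req, int timer_stop_valid)
{
    int i, best = -1;

    for (i = 0; i < n; i++) {
        if (!timer_stop_valid && (s[i].flags & CPUIDLE_FLAG_TIMER_STOP))
            continue;
        if (s[i].exit_latency > latency_req)
            continue;
        best = i;
    }
    return best;
}
```

With a generous latency budget the deepest state wins, but on the broadcast CPU the selector falls back to the deepest tick-keeping state instead of failing outright.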
[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware of what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state wherein ticks do not stop. To accommodate this change, move the setting of the runqueue idle state from the core to the cpuidle driver, else rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Rebased on the latest linux-pm/bleeding-edge
drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c | 6 +- include/linux/cpuidle.h | 6 +++--- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c | 2 +- kernel/sched/idle.c | 8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 61c417b..8f5657e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include <linux/module.h> #include <linux/suspend.h> #include <linux/tick.h> +#include <linux/sched.h> #include <trace/events/power.h> #include "cpuidle.h" @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * local timer will be shut down. If a local timer is used from another * CPU as a broadcast timer, this call may fail if it is not available. */ - if (broadcast && tick_broadcast_enter()) - return -EBUSY; + if (broadcast && tick_broadcast_enter()) { + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) + return -EBUSY; + target_state = &drv->states[index]; + } + + /* Take note of the planned idle state. */ + idle_set_state(smp_processor_id(), target_state); trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && (drv->states[last_idx
+ 1].flags & CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRIVER_STATE_START &&
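The control-flow change in cpuidle_enter_state() can be summarised outside the kernel as: if the chosen state needs the broadcast timer and tick_broadcast_enter() fails, re-run governor selection with tick-stopping states excluded, instead of punting to arch code. A toy model of that flow (all stand-ins here are hypothetical, not kernel APIs):

```c
#include <assert.h>
#include <errno.h>

static int broadcast_cpu = 1;  /* pretend CPU 1 owns the broadcast hrtimer */

/* Fails exactly when the caller is the broadcast CPU, the way
 * tick_broadcast_enter() fails for the hrtimer-broadcast owner. */
static int fake_tick_broadcast_enter(int cpu)
{
    return cpu == broadcast_cpu ? -EBUSY : 0;
}

/* Governor stub: index 1 is a deep, tick-stopping state; index 0 keeps
 * the tick running. With timer_stop_valid == 0 only index 0 qualifies. */
static int fake_cpuidle_select(int timer_stop_valid)
{
    return timer_stop_valid ? 1 : 0;
}

/* Shape of the patched cpuidle_enter_state(): on broadcast failure,
 * re-select with tick-stop states excluded rather than returning -EBUSY
 * straight away. */
static int enter_state(int cpu, int index, int broadcast)
{
    if (broadcast && fake_tick_broadcast_enter(cpu)) {
        index = fake_cpuidle_select(0 /* !broadcast */);
        if (index < 0)
            return -EBUSY;  /* would fall back to default_idle_call() */
    }
    return index;  /* state actually entered */
}
```

An ordinary CPU still enters the deep state; the broadcast CPU transparently lands in a shallower, tick-keeping state, and its residency is accounted like any other.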
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/05/2015 02:11 PM, Preeti U Murthy wrote: > On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: >> Hi Preeti, >> >> On 05/05/2015 09:30 AM, Preeti U Murthy wrote: >>> Hi Shilpa, >>> >>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: >>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE >>>> notification by executing *throttle_check() on any one of the cpu on >>>> the chip. This is a sanity check to verify if we were indeed >>>> throttled/unthrottled after receiving OCC_THROTTLE notification. >>>> >>>> We cannot call *throttle_check() directly from the notification >>>> handler because we could be handling chip1's notification in chip2. So >>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled >>>> in the notification handler, so use a worker thread to smp_call >>>> throttle_check() on any of the cpu in the chipmask. >>> >>> I see that the first patch takes care of reporting *per-chip* throttling >>> for pmax capping condition. But where are we taking care of reporting >>> "pstate set to safe" and "freq control disabled" scenarios per-chip ? >>> >> >> IMO let us not have "psafe" and "freq control disabled" states managed >> per-chip. >> Because when the above two conditions occur it is likely to happen across all >> chips during an OCC reset cycle. So I am setting 'throttled' to false on >> OCC_ACTIVE and re-verifying if it actually is the case by invoking >> *throttle_check(). > > Alright like I pointed in the previous reply, a comment to indicate that > psafe and freq control disabled conditions will fail when occ is > inactive and that all chips face the consequence of this will help. >From your explanation on the thread of the first patch of this series, this will not be required. 
So, Reviewed-by: Preeti U Murthy Regards Preeti U Murthy > >> >>>> >>>> Signed-off-by: Shilpasri G Bhat >>>> --- >>>> drivers/cpufreq/powernv-cpufreq.c | 28 ++-- >>>> 1 file changed, 26 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c >>>> b/drivers/cpufreq/powernv-cpufreq.c >>>> index 9268424..9618813 100644 >>>> --- a/drivers/cpufreq/powernv-cpufreq.c >>>> +++ b/drivers/cpufreq/powernv-cpufreq.c >>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset; >>>> static struct chip { >>>>unsigned int id; >>>>bool throttled; >>>> + cpumask_t mask; >>>> + struct work_struct throttle; >>>> } *chips; >>>> >>>> static int nr_chips; >>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void) >>>>return powernv_pstate_info.max - powernv_pstate_info.nominal; >>>> } >>>> >>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu) >>>> +static void powernv_cpufreq_throttle_check(void *data) >>>> { >>>> + unsigned int cpu = smp_processor_id(); >>>>unsigned long pmsr; >>>>int pmsr_pmax, pmsr_lp, i; >>>> >>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct >>>> cpufreq_policy *policy, >>>>return 0; >>>> >>>>if (!throttled) >>>> - powernv_cpufreq_throttle_check(smp_processor_id()); >>>> + powernv_cpufreq_throttle_check(NULL); >>>> >>>>freq_data.pstate_id = powernv_freqs[new_index].driver_data; >>>> >>>> @@ -418,6 +421,14 @@ static struct notifier_block >>>> powernv_cpufreq_reboot_nb = { >>>>.notifier_call = powernv_cpufreq_reboot_notifier, >>>> }; >>>> >>>> +void powernv_cpufreq_work_fn(struct work_struct *work) >>>> +{ >>>> + struct chip *chip = container_of(work, struct chip, throttle); >>>> + >>>> + smp_call_function_any(>mask, >>>> +powernv_cpufreq_throttle_check, NULL, 0); >>>> +} >>>> + >>>> static char throttle_reason[][30] = { >>>>"No throttling", >>>>"Power Cap", >>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct >>>> notifier_block *nb, >>&g
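The worker function quoted above recovers its struct chip from the embedded work_struct via container_of(). The pointer arithmetic behind that macro can be checked in plain userspace C (the struct layout here is only a mock of the driver's):

```c
#include <assert.h>
#include <stddef.h>

struct work_struct { int pending; };  /* mock of the kernel type */

/* Same idea as the kernel macro: subtract the member's offset from the
 * member pointer to get back to the enclosing object. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Mirrors the shape of the powernv 'chip' struct in the patch. */
struct chip {
    unsigned int id;
    int throttled;
    struct work_struct throttle;
};

static struct chip demo_chip = { .id = 7 };

/* What powernv_cpufreq_work_fn() does on entry: map the work item that
 * was scheduled back to the chip that owns it. */
static struct chip *chip_from_work(struct work_struct *work)
{
    return container_of(work, struct chip, throttle);
}
```

This is why scheduling `&chips[i].throttle` is enough: the handler needs no extra argument to know which chip's cpumask to use for the smp_call.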
Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote: > > > On 05/05/2015 02:08 PM, Preeti U Murthy wrote: >> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote: >>> Hi Preeti, >>> >>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote: >>>> Hi Shilpa, >>>> >>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: >>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the >>>>> max allowed frequency for that chip if the chip exceeds its power or >>>>> temperature limits. As Pmax capping is a chip level condition report >>>>> this throttling behavior at chip level and also do not set the global >>>>> 'throttled' on Pmax capping instead set the per-chip throttled >>>>> variable. Report unthrottling if Pmax is restored after throttling. >>>>> >>>>> This patch adds a structure to store chip id and throttled state of >>>>> the chip. >>>>> >>>>> Signed-off-by: Shilpasri G Bhat >>>>> --- >>>>> drivers/cpufreq/powernv-cpufreq.c | 59 >>>>> --- >>>>> 1 file changed, 55 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c >>>>> b/drivers/cpufreq/powernv-cpufreq.c >>>>> index ebef0d8..d0c18c9 100644 >>>>> --- a/drivers/cpufreq/powernv-cpufreq.c >>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c >>>>> @@ -27,6 +27,7 @@ >>>>> #include >>>>> #include >>>>> #include >>>>> +#include >>>>> >>>>> #include >>>>> #include >>>>> @@ -42,6 +43,13 @@ >>>>> static struct cpufreq_frequency_table >>>>> powernv_freqs[POWERNV_MAX_PSTATES+1]; >>>>> static bool rebooting, throttled; >>>>> >>>>> +static struct chip { >>>>> + unsigned int id; >>>>> + bool throttled; >>>>> +} *chips; >>>>> + >>>>> +static int nr_chips; >>>>> + >>>>> /* >>>>> * Note: The set of pstates consists of contiguous integers, the >>>>> * smallest of which is indicated by powernv_pstate_info.min, the >>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void) >>>>> static void powernv_cpufreq_throttle_check(unsigned int cpu) >>>>> { >>>>> unsigned long pmsr; >>>>> - int 
pmsr_pmax, pmsr_lp; >>>>> + int pmsr_pmax, pmsr_lp, i; >>>>> >>>>> pmsr = get_pmspr(SPRN_PMSR); >>>>> >>>>> + for (i = 0; i < nr_chips; i++) >>>>> + if (chips[i].id == cpu_to_chip_id(cpu)) >>>>> + break; >>>>> + >>>>> /* Check for Pmax Capping */ >>>>> pmsr_pmax = (s8)PMSR_MAX(pmsr); >>>>> if (pmsr_pmax != powernv_pstate_info.max) { >>>>> - throttled = true; >>>>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax); >>>>> - pr_info("Max allowed Pstate is capped\n"); >>>>> + if (chips[i].throttled) >>>>> + goto next; >>>>> + chips[i].throttled = true; >>>>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu, >>>>> + chips[i].id, pmsr_pmax); >>>>> + } else if (chips[i].throttled) { >>>>> + chips[i].throttled = false; >>>> >>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ? >>> >>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is >>> sufficient to decide throttling/unthrottling when OCC is active for that >>> chip. >> >> Ok then we can perhaps exit after detecting unthrottling here. > > This won't work for older firmwares which do not clear "Frequency control > enabled bit" on OCC reset cycle. So let us check for remaining two conditions > on > unthrottling as well. ok. > >>> >>>> >>>>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu, >>>>> + chips[i].id, pmsr_pmax); >>>>> } >>>>> >>>>> /* >>&g
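One subtlety in the hunk quoted above is the `(s8)` cast: pstates on these POWER systems are negative integers, so the 8-bit field extracted from PMSR must be sign-extended before it can be compared with powernv_pstate_info.max. An illustration in userspace C (the field position is made up; the real PMSR_MAX() macro lives in the powernv headers):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: pretend the max-pstate field sits in bits 32..39. */
#define PMSR_MAX(v) (((v) >> 32) & 0xFF)

/* Sign-extend the field and compare it with the nominal maximum, the same
 * check powernv_cpufreq_throttle_check() performs for Pmax capping. */
static int pmax_capped(uint64_t pmsr, int max_pstate)
{
    int8_t pmsr_pmax = (int8_t)PMSR_MAX(pmsr);  /* e.g. 0xFD -> -3 */
    return pmsr_pmax != max_pstate;
}
```

Without the sign-extension, a capped pstate such as -3 would read as 253 and the comparison against the (negative) nominal maximum would always report throttling.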
Re: [PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/07/2015 03:26 PM, Sudeep Holla wrote: Hi Preeti, On 07/05/15 06:26, Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware of what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state wherein ticks do not stop. To accommodate this change, move the setting of the runqueue idle state from the core to the cpuidle driver, else rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Based on linux-pm/bleeding-edge I am unable to apply this patch cleanly on linux-pm/bleeding-edge. I think it conflicts with a few patches that Rafael posted recently which are in the branch now. Thanks so much for pointing this out! I have resent the patch as V2 after the rebase on the updated linux-pm/bleeding-edge branch. Regards Preeti U Murthy Regards, Sudeep
Re: [RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
On 05/07/2015 11:09 PM, Preeti U Murthy wrote: When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware of what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state wherein ticks do not stop. To accommodate this change, move the setting of the runqueue idle state from the core to the cpuidle driver, else rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Rebased on the latest linux-pm/bleeding-edge Kindly ignore this. I have sent the rebase as V2. [PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully The below patch is not updated. Regards Preeti U Murthy
[PATCH V2] cpuidle: Handle tick_broadcast_enter() failure gracefully
Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote: On 05/05/2015 02:08 PM, Preeti U Murthy wrote: On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote: Hi Preeti, On 05/05/2015 09:21 AM, Preeti U Murthy wrote: Hi Shilpa, On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the max allowed frequency for that chip if the chip exceeds its power or temperature limits. As Pmax capping is a chip level condition report this throttling behavior at chip level and also do not set the global 'throttled' on Pmax capping instead set the per-chip throttled variable. Report unthrottling if Pmax is restored after throttling. This patch adds a structure to store chip id and throttled state of the chip. Signed-off-by: Shilpasri G Bhat shilpa.b...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 59 --- 1 file changed, 55 insertions(+), 4 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index ebef0d8..d0c18c9 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -27,6 +27,7 @@ #include linux/smp.h #include linux/of.h #include linux/reboot.h +#include linux/slab.h #include asm/cputhreads.h #include asm/firmware.h @@ -42,6 +43,13 @@ static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; static bool rebooting, throttled; +static struct chip { + unsigned int id; + bool throttled; +} *chips; + +static int nr_chips; + /* * Note: The set of pstates consists of contiguous integers, the * smallest of which is indicated by powernv_pstate_info.min, the @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void) static void powernv_cpufreq_throttle_check(unsigned int cpu) { unsigned long pmsr; - int pmsr_pmax, pmsr_lp; + int pmsr_pmax, pmsr_lp, i; pmsr = get_pmspr(SPRN_PMSR); + for (i = 0; i nr_chips; i++) + if (chips[i].id == cpu_to_chip_id(cpu)) + break; + /* Check for Pmax Capping */ pmsr_pmax = (s8)PMSR_MAX(pmsr); 
if (pmsr_pmax != powernv_pstate_info.max) { - throttled = true; - pr_info(CPU %d Pmax is reduced to %d\n, cpu, pmsr_pmax); - pr_info(Max allowed Pstate is capped\n); + if (chips[i].throttled) + goto next; + chips[i].throttled = true; + pr_info(CPU %d on Chip %u has Pmax reduced to %d\n, cpu, + chips[i].id, pmsr_pmax); + } else if (chips[i].throttled) { + chips[i].throttled = false; Is this check on pmax sufficient to indicate that the chip is unthrottled ? Unthrottling due to Pmax uncapping here is specific to a chip. So it is sufficient to decide throttling/unthrottling when OCC is active for that chip. Ok then we can perhaps exit after detecting unthrottling here. This won't work for older firmwares which do not clear Frequency control enabled bit on OCC reset cycle. So let us check for remaining two conditions on unthrottling as well. ok. + pr_info(CPU %d on Chip %u has Pmax restored to %d\n, cpu, + chips[i].id, pmsr_pmax); } /* * Check for Psafe by reading LocalPstate * or check if Psafe_mode_active is set in PMSR. */ +next: pmsr_lp = (s8)PMSR_LP(pmsr); if ((pmsr_lp powernv_pstate_info.min) || (pmsr PMSR_PSAFE_ENABLE)) { @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = { .attr = powernv_cpu_freq_attr, What about the situation where although occ is active, this particular chip has been throttled and we end up repeatedly reporting pstate set to safe and frequency control disabled from OS ? Should we not have a check on (chips[i].throttled) before reporting an anomaly for these two scenarios as well just like you have for pmsr_pmax ? We will not have Psafe and frequency control disabled repeatedly printed because of global variable 'throttled', which is set to true on passing any of these two conditions. It is quite unlikely behavior to have only one chip in Psafe or frequency control disabled state. These two conditions are most likely to happen during an OCC reset cycle which will occur across all chips. 
Let us then add a comment to indicate that Psafe and frequency control disabled conditions will fail *only if OCC is inactive* and not otherwise and that this is a system wide phenomenon. I agree that adding a comment here will clear global vs local throttling scenarios, but this will contradict the architectural design of OCC wherein it can independently go to Psafe and frequency control disabled state. It is the implementation in FSP today that has made the above two states global. My point is adding a comment here may be confusing if at all for the future firmwares this implementation is changed. Having said
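The chips[i].throttled bookkeeping discussed in this thread is essentially a per-chip edge detector: report a throttle condition only when it changes, so repeated readings of the same PMSR state do not flood the log. A compact userspace model of that latch (chip IDs and the report counter are illustrative):

```c
#include <assert.h>

struct chip { unsigned int id; int throttled; };

static struct chip chips[] = { { 0, 0 }, { 8, 0 } };
static int nr_chips = 2;
static int reports;  /* counts pr_info-style messages emitted */

/* Latch the throttled state per chip, emitting a "report" only on a
 * transition, as the chips[i].throttled checks in the series do.
 * Returns the chip's current throttled state (-1 for an unknown chip). */
static int update_throttle(unsigned int chip_id, int capped)
{
    int i;

    for (i = 0; i < nr_chips; i++) {
        if (chips[i].id != chip_id)
            continue;
        if (capped && !chips[i].throttled) {
            chips[i].throttled = 1;
            reports++;          /* "Pmax reduced" */
        } else if (!capped && chips[i].throttled) {
            chips[i].throttled = 0;
            reports++;          /* "Pmax restored" */
        }
        return chips[i].throttled;
    }
    return -1;
}
```

Seeing the same capped reading twice produces a single report; only the transition back to uncapped generates another.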
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/05/2015 02:11 PM, Preeti U Murthy wrote: On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: Hi Preeti, On 05/05/2015 09:30 AM, Preeti U Murthy wrote: Hi Shilpa, On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: Re-evaluate the chip's throttled state on recieving OCC_THROTTLE notification by executing *throttle_check() on any one of the cpu on the chip. This is a sanity check to verify if we were indeed throttled/unthrottled after receiving OCC_THROTTLE notification. We cannot call *throttle_check() directly from the notification handler because we could be handling chip1's notification in chip2. So initiate an smp_call to execute *throttle_check(). We are irq-disabled in the notification handler, so use a worker thread to smp_call throttle_check() on any of the cpu in the chipmask. I see that the first patch takes care of reporting *per-chip* throttling for pmax capping condition. But where are we taking care of reporting pstate set to safe and freq control disabled scenarios per-chip ? IMO let us not have psafe and freq control disabled states managed per-chip. Because when the above two conditions occur it is likely to happen across all chips during an OCC reset cycle. So I am setting 'throttled' to false on OCC_ACTIVE and re-verifying if it actually is the case by invoking *throttle_check(). Alright like I pointed in the previous reply, a comment to indicate that psafe and freq control disabled conditions will fail when occ is inactive and that all chips face the consequence of this will help. From your explanation on the thread of the first patch of this series, this will not be required. 
So, Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com Regards Preeti U Murthy Signed-off-by: Shilpasri G Bhat shilpa.b...@linux.vnet.ibm.com --- drivers/cpufreq/powernv-cpufreq.c | 28 ++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c index 9268424..9618813 100644 --- a/drivers/cpufreq/powernv-cpufreq.c +++ b/drivers/cpufreq/powernv-cpufreq.c @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset; static struct chip { unsigned int id; bool throttled; + cpumask_t mask; + struct work_struct throttle; } *chips; static int nr_chips; @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void) return powernv_pstate_info.max - powernv_pstate_info.nominal; } -static void powernv_cpufreq_throttle_check(unsigned int cpu) +static void powernv_cpufreq_throttle_check(void *data) { + unsigned int cpu = smp_processor_id(); unsigned long pmsr; int pmsr_pmax, pmsr_lp, i; @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy, return 0; if (!throttled) - powernv_cpufreq_throttle_check(smp_processor_id()); + powernv_cpufreq_throttle_check(NULL); freq_data.pstate_id = powernv_freqs[new_index].driver_data; @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = { .notifier_call = powernv_cpufreq_reboot_notifier, }; +void powernv_cpufreq_work_fn(struct work_struct *work) +{ + struct chip *chip = container_of(work, struct chip, throttle); + + smp_call_function_any(chip-mask, +powernv_cpufreq_throttle_check, NULL, 0); +} + static char throttle_reason[][30] = { No throttling, Power Cap, @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb, struct opal_msg *occ_msg = msg; uint64_t token; uint64_t chip_id, reason; + int i; if (msg_type != OPAL_MSG_OCC) return 0; @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb, occ_reset = false; throttled = false; pr_info(OCC: 
Active\n); + + for (i = 0; i nr_chips; i++) + schedule_work(chips[i].throttle); + return 0; } @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb, else if (!reason) pr_info(OCC: Chip %u %s\n, (unsigned int)chip_id, throttle_reason[reason]); + else + return 0; Why the else section ? The code can never reach here, can it ? When reason 5 , we dont want to handle it. Of course! My bad! + + for (i = 0; i nr_chips; i++) + if (chips[i].id == chip_id) + schedule_work(chips[i].throttle); } Should we not do this only when we get unthrottled so as to cross verify if it is indeed the case ? In case of throttling notification, opal's verdict is final and there is no need to cross verify right ? Two reasons for invoking
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote: On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote: On 05/05/2015 02:11 PM, Preeti U Murthy wrote: On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: Hi Preeti, On 05/05/2015 09:30 AM, Preeti U Murthy wrote: Hi Shilpa, On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: Re-evaluate the chip's throttled state on recieving OCC_THROTTLE notification by executing *throttle_check() on any one of the cpu on the chip. This is a sanity check to verify if we were indeed throttled/unthrottled after receiving OCC_THROTTLE notification. We cannot call *throttle_check() directly from the notification handler because we could be handling chip1's notification in chip2. So initiate an smp_call to execute *throttle_check(). We are irq-disabled in the notification handler, so use a worker thread to smp_call throttle_check() on any of the cpu in the chipmask. I see that the first patch takes care of reporting *per-chip* throttling for pmax capping condition. But where are we taking care of reporting pstate set to safe and freq control disabled scenarios per-chip ? IMO let us not have psafe and freq control disabled states managed per-chip. Because when the above two conditions occur it is likely to happen across all chips during an OCC reset cycle. So I am setting 'throttled' to false on OCC_ACTIVE and re-verifying if it actually is the case by invoking *throttle_check(). Alright like I pointed in the previous reply, a comment to indicate that psafe and freq control disabled conditions will fail when occ is inactive and that all chips face the consequence of this will help. From your explanation on the thread of the first patch of this series, this will not be required. So, Reviewed-by: Preeti U Murthy pre...@linux.vnet.ibm.com OK, so is the whole series reviewed now? Yes the whole series has been reviewed. 
Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
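The deferral pattern discussed in this thread — an irq-disabled notification handler that may run on the wrong chip, so it only queues per-chip work, and a worker that later re-evaluates the throttle state on a CPU of that chip — can be modeled in userspace C. This is a hedged sketch: `occ_msg_dispatch` and `throttle_work_fn` are hypothetical stand-ins for the driver's notifier and `powernv_cpufreq_work_fn`, with `work_pending` standing in for `schedule_work()`.

```c
#include <stdbool.h>

#define MAX_CHIPS 4

/* Minimal model of the per-chip bookkeeping from the patch; the real
 * driver queues a struct work_struct and smp_calls onto a CPU of the
 * chip -- here "work_pending" just records the deferral. */
struct chip {
	unsigned int id;
	bool throttled;
	bool work_pending;
};

static struct chip chips[MAX_CHIPS];
static int nr_chips;

/* Notification-handler model: it may be handling chip1's message on
 * chip2, so it only queues work for the matching chip instead of
 * touching the throttle state directly. Returns true on a match. */
static bool occ_msg_dispatch(unsigned int chip_id)
{
	for (int i = 0; i < nr_chips; i++) {
		if (chips[i].id == chip_id) {
			chips[i].work_pending = true;
			return true;
		}
	}
	return false;
}

/* Worker model: re-evaluates (here: simply clears) the throttled
 * state for exactly the chip whose work was queued. */
static void throttle_work_fn(struct chip *chip)
{
	chip->work_pending = false;
	chip->throttled = false;	/* stand-in for *throttle_check() */
}
```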
[RESEND PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state wherein ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com --- Rebased on the latest linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c | 6 +- include/linux/cpuidle.h | 6 +++--- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c | 2 +- kernel/sched/idle.c | 8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 61c417b..8f5657e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include <linux/module.h> #include <linux/suspend.h> #include <linux/tick.h> +#include <linux/sched.h> #include <trace/events/power.h> #include "cpuidle.h" @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * local timer will be shut down. If a local timer is used from another * CPU as a broadcast timer, this call may fail if it is not available. */ - if (broadcast && tick_broadcast_enter()) - return -EBUSY; + if (broadcast && tick_broadcast_enter()) { + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) + return -EBUSY; + target_state = &drv->states[index]; + } + + /* Take note of the planned idle state. 
*/ + idle_set_state(smp_processor_id(), target_state); trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /** diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && (drv->states[last_idx + 1].flags & 
CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRIVER_STATE_START && (drv->states[last_idx].disabled
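The ladder-governor hunk above adds one more condition to the promotion test: a next-deeper state flagged `CPUIDLE_FLAG_TIMER_STOP` is only eligible when `timer_stop_valid` allows tick-stopping. A hedged userspace sketch of just that guard (`can_promote` and `struct idle_state` are illustrative names, not the kernel's):

```c
#include <stdbool.h>

#define CPUIDLE_FLAG_TIMER_STOP 0x1

struct idle_state {
	bool disabled;
	unsigned int flags;
	unsigned int exit_latency;	/* us */
};

/* Sketch of the promotion guard added by the patch: a state whose
 * CPUIDLE_FLAG_TIMER_STOP flag is set is eligible only when
 * timer_stop_valid says the tick may be stopped. */
static bool can_promote(const struct idle_state *next,
			unsigned int last_residency,
			unsigned int promotion_time,
			unsigned int latency_req,
			bool timer_stop_valid)
{
	if (next->disabled)
		return false;
	if (last_residency <= promotion_time)
		return false;
	if (!timer_stop_valid && (next->flags & CPUIDLE_FLAG_TIMER_STOP))
		return false;
	return next->exit_latency <= latency_req;
}
```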
[PATCH] cpuidle: Handle tick_broadcast_enter() failure gracefully
When a CPU has to enter an idle state where tick stops, it makes a call to tick_broadcast_enter(). The call will fail if this CPU is the broadcast CPU. Today, under such a circumstance, the arch cpuidle code handles this CPU. This is not convincing because not only are we not aware what the arch cpuidle code does, but we also do not account for the idle state residency time and usage of such a CPU. This scenario can be handled better by simply asking the cpuidle governor to choose an idle state wherein ticks do not stop. To accommodate this change move the setting of runqueue idle state from the core to the cpuidle driver, else the rq->idle_state will be set wrong. Signed-off-by: Preeti U Murthy --- Based on linux-pm/bleeding-edge drivers/cpuidle/cpuidle.c | 21 + drivers/cpuidle/governors/ladder.c | 13 ++--- drivers/cpuidle/governors/menu.c | 6 +- include/linux/cpuidle.h | 6 +++--- include/linux/sched.h | 16 kernel/sched/core.c | 17 + kernel/sched/fair.c | 2 +- kernel/sched/idle.c | 8 +--- kernel/sched/sched.h | 24 9 files changed, 70 insertions(+), 43 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 61c417b..8f5657e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -21,6 +21,7 @@ #include <linux/module.h> #include <linux/suspend.h> #include <linux/tick.h> +#include <linux/sched.h> #include <trace/events/power.h> #include "cpuidle.h" @@ -167,8 +168,15 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * local timer will be shut down. If a local timer is used from another * CPU as a broadcast timer, this call may fail if it is not available. */ - if (broadcast && tick_broadcast_enter()) - return -EBUSY; + if (broadcast && tick_broadcast_enter()) { + index = cpuidle_select(drv, dev, !broadcast); + if (index < 0) + return -EBUSY; + target_state = &drv->states[index]; + } + + /* Take note of the planned idle state. 
*/ + idle_set_state(smp_processor_id(), target_state); trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); @@ -178,6 +186,9 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, time_end = ktime_get(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); + /* The cpu is no longer idle or about to enter idle. */ + idle_set_state(smp_processor_id(), NULL); + if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled())) local_irq_disable(); @@ -213,12 +224,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, * * @drv: the cpuidle driver * @dev: the cpuidle device + * @timer_stop_valid: allow selection of idle state where tick stops * * Returns the index of the idle state. */ -int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) +int cpuidle_select(struct cpuidle_driver *drv, + struct cpuidle_device *dev, int timer_stop_valid) { - return cpuidle_curr_governor->select(drv, dev); + return cpuidle_curr_governor->select(drv, dev, timer_stop_valid); } /** diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 401c010..c437322 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -62,9 +62,10 @@ static inline void ladder_do_selection(struct ladder_device *ldev, * ladder_select_state - selects the next state to enter * @drv: cpuidle driver * @dev: the CPU + * @timer_stop_valid: allow selection of idle state where tick stops */ static int ladder_select_state(struct cpuidle_driver *drv, - struct cpuidle_device *dev) + struct cpuidle_device *dev, int timer_stop_valid) { struct ladder_device *ldev = this_cpu_ptr(&ladder_devices); struct ladder_device_state *last_state; @@ -86,6 +87,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, !drv->states[last_idx + 1].disabled && !dev->states_usage[last_idx + 1].disable && last_residency > last_state->threshold.promotion_time && + !(!timer_stop_valid && (drv->states[last_idx
+ 1].flags & CPUIDLE_FLAG_TIMER_STOP)) && drv->states[last_idx + 1].exit_latency <= latency_req) { last_state->stats.promotion_count++; last_state->stats.demotion_count = 0; @@ -99,11 +101,14 @@ static int ladder_select_state(struct cpuidle_driver *drv, if (last_idx > CPUIDLE_DRIVER_STATE_START && (dr
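The core change in `cpuidle_enter_state()` above is a fallback: when `tick_broadcast_enter()` fails (this CPU owns the broadcast timer), re-run state selection with tick-stopping states excluded instead of bailing out to the arch code. A deliberately simplified userspace sketch of that control flow — `pick_state` models a hypothetical governor, `-16` stands in for `-EBUSY`, and index `-1` models "no usable state":

```c
#include <stdbool.h>

/* Hypothetical governor model: prefer the deep (tick-stopping) state
 * when allowed, otherwise fall back to the shallow one. */
static int pick_state(bool tick_stop_ok, int deep_idx, int shallow_idx)
{
	return tick_stop_ok ? deep_idx : shallow_idx;
}

/* Sketch of the patched cpuidle_enter_state() flow: on broadcast-enter
 * failure, reselect with tick-stopping states excluded (mirrors
 * "index = cpuidle_select(drv, dev, !broadcast)"), and only give up
 * with -EBUSY if nothing shallower is available. */
static int enter_state(bool broadcast_enter_fails, int deep_idx, int shallow_idx)
{
	int index = pick_state(true, deep_idx, shallow_idx);

	if (broadcast_enter_fails) {
		index = pick_state(false, deep_idx, shallow_idx);
		if (index < 0)
			return -16;	/* -EBUSY */
	}
	return index;
}
```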
Re: [PATCH v3 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: > If frequency is throttled due to OCC reset then cpus will be in Psafe > frequency, so restore the frequency on all cpus to policy->cur when > OCCs are active again. And if frequency is throttled due to Pmax > capping then restore the frequency of all the cpus in the chip on > unthrottling. > > Signed-off-by: Shilpasri G Bhat > --- > drivers/cpufreq/powernv-cpufreq.c | 31 +-- > 1 file changed, 29 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 0a59d5b..b2915bc 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -51,6 +51,7 @@ static struct chip { > bool throttled; > cpumask_t mask; > struct work_struct throttle; > + bool restore; > } *chips; > > static int nr_chips; > @@ -418,9 +419,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = > { > void powernv_cpufreq_work_fn(struct work_struct *work) > { > struct chip *chip = container_of(work, struct chip, throttle); > + unsigned int cpu; > + cpumask_var_t mask; > > smp_call_function_any(&chip->mask, > powernv_cpufreq_throttle_check, NULL, 0); > + > + if (!chip->restore) > + return; > + > + chip->restore = false; > + cpumask_copy(mask, &chip->mask); > + for_each_cpu_and(cpu, mask, cpu_online_mask) { > + int index, tcpu; > + struct cpufreq_policy policy; > + > + cpufreq_get_policy(&policy, cpu); > + cpufreq_frequency_table_target(&policy, policy.freq_table, > + policy.cur, > + CPUFREQ_RELATION_C, &index); > + powernv_cpufreq_target_index(&policy, index); > + for_each_cpu(tcpu, policy.cpus) > + cpumask_clear_cpu(tcpu, mask); > + } > } > > static char throttle_reason[][30] = { > @@ -473,8 +494,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block > *nb, > throttled = false; > pr_info("OCC: Active\n"); > > - for (i = 0; i < nr_chips; i++) > + for (i = 0; i < nr_chips; i++) { > + chips[i].restore = true; > schedule_work(&chips[i].throttle); > + } > > return 0; > } > @@ -490,8 +513,11 
@@ static int powernv_cpufreq_occ_msg(struct notifier_block > *nb, > return 0; > > for (i = 0; i < nr_chips; i++) > - if (chips[i].id == chip_id) > + if (chips[i].id == chip_id) { > + if (!reason) > + chips[i].restore = true; > schedule_work(&chips[i].throttle); > + } > } > return 0; > } > @@ -545,6 +571,7 @@ static int init_chip_info(void) > chips[i].throttled = false; > cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i])); > INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn); > + chips[i].restore = false; > } > > return 0; > Reviewed-by: Preeti U Murthy
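The restore loop in the patch above visits each online CPU of the chip, performs one frequency restore per cpufreq policy, then clears every CPU of that policy from the working mask so no policy is restored twice. A bitmask sketch of that "act once per group, clear the group" idiom, with a hypothetical fixed policy of 4 sibling CPUs standing in for `policy.cpus`:

```c
/* Bitmask sketch of the unthrottle-restore loop: visit each online
 * CPU in the chip mask, act once per frequency "policy" (modeled here
 * as a fixed, hypothetical group of 4 sibling CPUs), and clear the
 * whole group so no policy is restored twice.
 * Returns how many policies were touched. */
#define GROUP 4

static int restore_per_policy(unsigned long chip_mask, unsigned long online_mask)
{
	unsigned long mask = chip_mask & online_mask;
	int restored = 0;

	for (int cpu = 0; cpu < (int)(8 * sizeof(unsigned long)); cpu++) {
		if (!(mask & (1UL << cpu)))
			continue;
		restored++;	/* one restore per policy */
		/* clear all CPUs sharing this (hypothetical) policy */
		mask &= ~(((1UL << GROUP) - 1) << (cpu / GROUP * GROUP));
	}
	return restored;
}
```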
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote: > Hi Preeti, > > On 05/05/2015 09:30 AM, Preeti U Murthy wrote: >> Hi Shilpa, >> >> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: >>> Re-evaluate the chip's throttled state on receiving OCC_THROTTLE >>> notification by executing *throttle_check() on any one of the cpus on >>> the chip. This is a sanity check to verify if we were indeed >>> throttled/unthrottled after receiving OCC_THROTTLE notification. >>> >>> We cannot call *throttle_check() directly from the notification >>> handler because we could be handling chip1's notification in chip2. So >>> initiate an smp_call to execute *throttle_check(). We are irq-disabled >>> in the notification handler, so use a worker thread to smp_call >>> throttle_check() on any of the cpus in the chipmask. >> >> I see that the first patch takes care of reporting *per-chip* throttling >> for pmax capping condition. But where are we taking care of reporting >> "pstate set to safe" and "freq control disabled" scenarios per-chip ? >> > > IMO let us not have "psafe" and "freq control disabled" states managed > per-chip. > Because when the above two conditions occur it is likely to happen across all > chips during an OCC reset cycle. So I am setting 'throttled' to false on > OCC_ACTIVE and re-verifying if it actually is the case by invoking > *throttle_check(). Alright, like I pointed out in the previous reply, a comment to indicate that psafe and freq control disabled conditions will fail when occ is inactive and that all chips face the consequence of this will help. 
> >>> >>> Signed-off-by: Shilpasri G Bhat >>> --- >>> drivers/cpufreq/powernv-cpufreq.c | 28 ++-- >>> 1 file changed, 26 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/cpufreq/powernv-cpufreq.c >>> b/drivers/cpufreq/powernv-cpufreq.c >>> index 9268424..9618813 100644 >>> --- a/drivers/cpufreq/powernv-cpufreq.c >>> +++ b/drivers/cpufreq/powernv-cpufreq.c >>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset; >>> static struct chip { >>> unsigned int id; >>> bool throttled; >>> + cpumask_t mask; >>> + struct work_struct throttle; >>> } *chips; >>> >>> static int nr_chips; >>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void) >>> return powernv_pstate_info.max - powernv_pstate_info.nominal; >>> } >>> >>> -static void powernv_cpufreq_throttle_check(unsigned int cpu) >>> +static void powernv_cpufreq_throttle_check(void *data) >>> { >>> + unsigned int cpu = smp_processor_id(); >>> unsigned long pmsr; >>> int pmsr_pmax, pmsr_lp, i; >>> >>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct >>> cpufreq_policy *policy, >>> return 0; >>> >>> if (!throttled) >>> - powernv_cpufreq_throttle_check(smp_processor_id()); >>> + powernv_cpufreq_throttle_check(NULL); >>> >>> freq_data.pstate_id = powernv_freqs[new_index].driver_data; >>> >>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb >>> = { >>> .notifier_call = powernv_cpufreq_reboot_notifier, >>> }; >>> >>> +void powernv_cpufreq_work_fn(struct work_struct *work) >>> +{ >>> + struct chip *chip = container_of(work, struct chip, throttle); >>> + >>> + smp_call_function_any(&chip->mask, >>> + powernv_cpufreq_throttle_check, NULL, 0); >>> +} >>> + >>> static char throttle_reason[][30] = { >>> "No throttling", >>> "Power Cap", >>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct >>> notifier_block *nb, >>> struct opal_msg *occ_msg = msg; >>> uint64_t token; >>> uint64_t chip_id, reason; >>> + int i; >>> >>> if (msg_type != OPAL_MSG_OCC) 
>>> return 0; >>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct >>> notifier_block *nb, >>> occ_reset = false; >>> throttled = false; >>> pr_info("OCC: Active\n"); >>> + >>> + for (
[tip:timers/core] tick-broadcast: Fix the printing of broadcast masks
Commit-ID: 1ef09cd713c90781b683a0b4e0a874803c172b1d Gitweb: http://git.kernel.org/tip/1ef09cd713c90781b683a0b4e0a874803c172b1d Author: Preeti U Murthy pre...@linux.vnet.ibm.com AuthorDate: Tue, 28 Apr 2015 14:15:20 +0530 Committer: Thomas Gleixner t...@linutronix.de CommitDate: Tue, 5 May 2015 10:35:58 +0200 tick-broadcast: Fix the printing of broadcast masks Today the number of bits of the broadcast masks that is output into /proc/timer_list is sizeof(unsigned long). This means that on machines with a larger number of CPUs, the bitmasks of CPUs beyond this range do not appear. Fix this by using bitmap printing through "%*pb" instead, so as to output the broadcast masks for the range of nr_cpu_ids into /proc/timer_list. Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com Cc: pet...@infradead.org Cc: linuxppc-...@ozlabs.org Cc: john.stu...@linaro.org Link: http://lkml.kernel.org/r/20150428084520.3314.62668.st...@preeti.in.ibm.com Signed-off-by: Thomas Gleixner t...@linutronix.de --- kernel/time/timer_list.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c index 66f39bb..18b074b 100644 --- a/kernel/time/timer_list.c +++ b/kernel/time/timer_list.c @@ -276,11 +276,11 @@ static void timer_list_show_tickdevices_header(struct seq_file *m) { #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST print_tickdevice(m, tick_get_broadcast_device(), -1); - SEQ_printf(m, "tick_broadcast_mask: %08lx\n", - cpumask_bits(tick_get_broadcast_mask())[0]); + SEQ_printf(m, "tick_broadcast_mask: %*pb\n", + cpumask_pr_args(tick_get_broadcast_mask())); #ifdef CONFIG_TICK_ONESHOT - SEQ_printf(m, "tick_broadcast_oneshot_mask: %08lx\n", - cpumask_bits(tick_get_broadcast_oneshot_mask())[0]); + SEQ_printf(m, "tick_broadcast_oneshot_mask: %*pb\n", + cpumask_pr_args(tick_get_broadcast_oneshot_mask())); #endif SEQ_printf(m, "\n"); #endif
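The bug fixed above is that `cpumask_bits(mask)[0]` prints only the first `unsigned long` of the bitmap, so CPUs beyond that word silently vanish from /proc/timer_list; `%*pb` walks all `nr_cpu_ids` bits. A userspace stand-in for the fixed behavior (the output format here is illustrative, not the kernel's `%*pb` format):

```c
#include <stdio.h>

#define NR_CPUS 128
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define NWORDS ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* The old code printed only word 0 of the cpumask ("%08lx"), so CPUs
 * past the first long never appeared; this walks every word of the
 * bitmap, most-significant first, the way "%*pb" covers the whole
 * mask. */
static void format_mask(const unsigned long mask[NWORDS],
			char *buf, size_t len)
{
	size_t off = 0;

	for (int i = NWORDS - 1; i >= 0 && off < len; i--)
		off += snprintf(buf + off, len - off, "%0*lx%s",
				BITS_PER_LONG / 4, mask[i], i ? "," : "");
}
```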
Re: [PATCH v3 1/6] cpufreq: powernv: Handle throttling due to Pmax capping at chip level
On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote: > Hi Preeti, > > On 05/05/2015 09:21 AM, Preeti U Murthy wrote: >> Hi Shilpa, >> >> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote: >>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the >>> max allowed frequency for that chip if the chip exceeds its power or >>> temperature limits. As Pmax capping is a chip level condition report >>> this throttling behavior at chip level and also do not set the global >>> 'throttled' on Pmax capping instead set the per-chip throttled >>> variable. Report unthrottling if Pmax is restored after throttling. >>> >>> This patch adds a structure to store chip id and throttled state of >>> the chip. >>> >>> Signed-off-by: Shilpasri G Bhat >>> --- >>> drivers/cpufreq/powernv-cpufreq.c | 59 >>> --- >>> 1 file changed, 55 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/cpufreq/powernv-cpufreq.c >>> b/drivers/cpufreq/powernv-cpufreq.c >>> index ebef0d8..d0c18c9 100644 >>> --- a/drivers/cpufreq/powernv-cpufreq.c >>> +++ b/drivers/cpufreq/powernv-cpufreq.c >>> @@ -27,6 +27,7 @@ >>> #include <linux/smp.h> >>> #include <linux/of.h> >>> #include <linux/reboot.h> >>> +#include <linux/slab.h> >>> >>> #include <asm/cputhreads.h> >>> #include <asm/firmware.h> >>> @@ -42,6 +43,13 @@ >>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1]; >>> static bool rebooting, throttled; >>> >>> +static struct chip { >>> + unsigned int id; >>> + bool throttled; >>> +} *chips; >>> + >>> +static int nr_chips; >>> + >>> /* >>> * Note: The set of pstates consists of contiguous integers, the >>> * smallest of which is indicated by powernv_pstate_info.min, the >>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void) >>> static void powernv_cpufreq_throttle_check(unsigned int cpu) >>> { >>> unsigned long pmsr; >>> - int pmsr_pmax, pmsr_lp; >>> + int pmsr_pmax, pmsr_lp, i; >>> >>> pmsr = get_pmspr(SPRN_PMSR); >>> >>> + for (i = 0; i < nr_chips; i++) >>> + if (chips[i].id == cpu_to_chip_id(cpu)) >>> + break; >>> + >>> /* Check for Pmax Capping 
*/ >>> pmsr_pmax = (s8)PMSR_MAX(pmsr); >>> if (pmsr_pmax != powernv_pstate_info.max) { >>> - throttled = true; >>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax); >>> - pr_info("Max allowed Pstate is capped\n"); >>> + if (chips[i].throttled) >>> + goto next; >>> + chips[i].throttled = true; >>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu, >>> + chips[i].id, pmsr_pmax); >>> + } else if (chips[i].throttled) { >>> + chips[i].throttled = false; >> >> Is this check on pmax sufficient to indicate that the chip is unthrottled ? > > Unthrottling due to Pmax uncapping here is specific to a chip. So it is > sufficient to decide throttling/unthrottling when OCC is active for that chip. Ok then we can perhaps exit after detecting unthrottling here. > >> >>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu, >>> + chips[i].id, pmsr_pmax); >>> } >>> >>> /* >>> * Check for Psafe by reading LocalPstate >>> * or check if Psafe_mode_active is set in PMSR. >>> */ >>> +next: >>> pmsr_lp = (s8)PMSR_LP(pmsr); >>> if ((pmsr_lp < powernv_pstate_info.min) || >>> (pmsr & PMSR_PSAFE_ENABLE)) { >>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = { >>> .attr = powernv_cpu_freq_attr, >> >> What about the situation where although occ is active, this particular >> chip has been throttled and we end up repeatedly reporting "pstate set >> to safe" and "frequency control disabled from OS" ? Should we not have a >> check on (chips[i].throttled) before reporting an anomaly for these two >> scenarios as well just like you have for pmsr_pmax ? > > We will not have "Psafe" and "frequency control disabled" repeatedly printed > because of global variable 'throttled', which is set to true on passing any of > these two conditions. > > It is quite unlikely behavior to have only one chip in "Psafe" or "frequency > control disabled" state. 
These two conditions are most likely to happen during > an OCC reset cycle which will occur across all chips. Let us then add a comment to indicate that Psafe and frequency control disabled conditions will fail *only if OCC is inactive* and not otherwise and that this is a system wide phenomenon. Regards Preeti U Murthy > > Thanks and Regards, > Shilpa >
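The per-chip Pmax logic under review above is edge-triggered: report "capped" only on the first observation and "restored" only when leaving the throttled state, so repeated checks stay silent. A hedged userspace sketch of that state machine (`pmax_check` and the enum are illustrative names, not the driver's):

```c
#include <stdbool.h>

enum pmax_event { PMAX_NONE, PMAX_CAPPED, PMAX_RESTORED };

struct chip_state {
	bool throttled;
};

/* Sketch of the per-chip Pmax handling from the patch: a mismatch
 * between the observed pmsr_pmax and the nominal maximum reports
 * "capped" exactly once, a return to the maximum reports "restored"
 * exactly once, and everything else is silent -- avoiding the
 * repeated pr_info spam the review discusses. */
static enum pmax_event pmax_check(struct chip_state *chip,
				  int pmsr_pmax, int pstate_max)
{
	if (pmsr_pmax != pstate_max) {
		if (chip->throttled)
			return PMAX_NONE;	/* already reported */
		chip->throttled = true;
		return PMAX_CAPPED;
	}
	if (chip->throttled) {
		chip->throttled = false;
		return PMAX_RESTORED;
	}
	return PMAX_NONE;
}
```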
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> Hi Preeti,
>
> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>> Hi Shilpa,
>>
>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>> Re-evaluate the chip's throttled state on receiving the OCC_THROTTLE
>>> notification by executing *throttle_check() on any one of the cpus on
>>> the chip. This is a sanity check to verify if we were indeed
>>> throttled/unthrottled after receiving the OCC_THROTTLE notification.
>>>
>>> We cannot call *throttle_check() directly from the notification
>>> handler because we could be handling chip1's notification in chip2. So
>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>> in the notification handler, so use a worker thread to smp_call
>>> throttle_check() on any of the cpus in the chipmask.
>>
>> I see that the first patch takes care of reporting *per-chip* throttling
>> for the pmax capping condition. But where are we taking care of reporting
>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>
>> IMO let us not have the psafe and freq control disabled states managed
>> per-chip. Because when the above two conditions occur, it is likely to
>> happen across all chips during an OCC reset cycle.
>
> So I am setting 'throttled' to false on OCC_ACTIVE and re-verifying if
> it actually is the case by invoking *throttle_check().

Alright. Like I pointed out in the previous reply, a comment to indicate
that the psafe and freq control disabled conditions will fail when the OCC
is inactive, and that all chips face the consequence of this, will help.
>>> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
>>> ---
>>>  drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>> index 9268424..9618813 100644
>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>  static struct chip {
>>>  	unsigned int id;
>>>  	bool throttled;
>>> +	cpumask_t mask;
>>> +	struct work_struct throttle;
>>>  } *chips;
>>>
>>>  static int nr_chips;
>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>>  	return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>  }
>>>
>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>  {
>>> +	unsigned int cpu = smp_processor_id();
>>>  	unsigned long pmsr;
>>>  	int pmsr_pmax, pmsr_lp, i;
>>>
>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>>>  		return 0;
>>>
>>>  	if (!throttled)
>>> -		powernv_cpufreq_throttle_check(smp_processor_id());
>>> +		powernv_cpufreq_throttle_check(NULL);
>>>
>>>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>
>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>>>  	.notifier_call = powernv_cpufreq_reboot_notifier,
>>>  };
>>>
>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>> +{
>>> +	struct chip *chip = container_of(work, struct chip, throttle);
>>> +
>>> +	smp_call_function_any(&chip->mask,
>>> +			      powernv_cpufreq_throttle_check, NULL, 0);
>>> +}
>>> +
>>>  static char throttle_reason[][30] = {
>>>  	"No throttling",
>>>  	"Power Cap",
>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>  	struct opal_msg *occ_msg = msg;
>>>  	uint64_t token;
>>>  	uint64_t chip_id, reason;
>>> +	int i;
>>>
>>>  	if (msg_type != OPAL_MSG_OCC)
>>>  		return 0;
>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>  		occ_reset = false;
>>>  		throttled = false;
>>>  		pr_info("OCC: Active\n");
>>> +
>>> +		for (i = 0; i < nr_chips; i++)
>>> +			schedule_work(&chips[i].throttle);
>>> +
>>>  		return 0;
>>>  	}
>>>
>>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>  	else if (!reason)
>>>  		pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>>>  			throttle_reason[reason]);
>>> +	else
>>> +		return 0;
>>
>> Why the else section ? The code can never reach here, can it ?
>
> When reason > 5, we don't want to handle it.

Of course! My bad!

>>> +
>>> +	for (i = 0; i < nr_chips; i++)
>>> +		if (chips[i].id == chip_id)
>>> +			schedule_work(&chips[i].throttle);
>>>  	}
>>
>> Should we not do this only when we get unthrottled, so as to cross
>> verify if it is indeed the case ? In case of a throttling notification,
>> opal's verdict is final and there is no need to cross verify, right ?
>
> Two reasons for invoking *throttle_check() on throttling:
> 1) We just got to know the reason and not the Pmax value we are getting
>    throttled to.
> 2) It could be a spurious message caused due to late/lost delivery.

My point
Re: [PATCH v3 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> If frequency is throttled due to OCC reset then cpus will be in Psafe
> frequency, so restore the frequency on all cpus to policy->cur when
> OCCs are active again. And if frequency is throttled due to Pmax
> capping then restore the frequency of all the cpus in the chip on
> unthrottling.
>
> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 0a59d5b..b2915bc 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -51,6 +51,7 @@ static struct chip {
>  	bool throttled;
>  	cpumask_t mask;
>  	struct work_struct throttle;
> +	bool restore;
>  } *chips;
>
>  static int nr_chips;
> @@ -418,9 +419,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>  void powernv_cpufreq_work_fn(struct work_struct *work)
>  {
>  	struct chip *chip = container_of(work, struct chip, throttle);
> +	unsigned int cpu;
> +	cpumask_var_t mask;
>
>  	smp_call_function_any(&chip->mask,
>  			      powernv_cpufreq_throttle_check, NULL, 0);
> +
> +	if (!chip->restore)
> +		return;
> +
> +	chip->restore = false;
> +	cpumask_copy(mask, &chip->mask);
> +	for_each_cpu_and(cpu, mask, cpu_online_mask) {
> +		int index, tcpu;
> +		struct cpufreq_policy policy;
> +
> +		cpufreq_get_policy(&policy, cpu);
> +		cpufreq_frequency_table_target(&policy, policy.freq_table,
> +					       policy.cur,
> +					       CPUFREQ_RELATION_C, &index);
> +		powernv_cpufreq_target_index(&policy, index);
> +		for_each_cpu(tcpu, policy.cpus)
> +			cpumask_clear_cpu(tcpu, mask);
> +	}
>  }
>
>  static char throttle_reason[][30] = {
> @@ -473,8 +494,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>  		throttled = false;
>  		pr_info("OCC: Active\n");
>
> -		for (i = 0; i < nr_chips; i++)
> +		for (i = 0; i < nr_chips; i++) {
> +			chips[i].restore = true;
>  			schedule_work(&chips[i].throttle);
> +		}
>  		return 0;
>  	}
>
> @@ -490,8 +513,11 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>  		return 0;
>
>  	for (i = 0; i < nr_chips; i++)
> -		if (chips[i].id == chip_id)
> +		if (chips[i].id == chip_id) {
> +			if (!reason)
> +				chips[i].restore = true;
>  			schedule_work(&chips[i].throttle);
> +		}
>  	}
>  	return 0;
>  }
> @@ -545,6 +571,7 @@ static int init_chip_info(void)
>  		chips[i].throttled = false;
>  		cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
>  		INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
> +		chips[i].restore = false;
>  	}
>
>  	return 0;

Reviewed-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> Re-evaluate the chip's throttled state on receiving the OCC_THROTTLE
> notification by executing *throttle_check() on any one of the cpus on
> the chip. This is a sanity check to verify if we were indeed
> throttled/unthrottled after receiving the OCC_THROTTLE notification.
>
> We cannot call *throttle_check() directly from the notification
> handler because we could be handling chip1's notification in chip2. So
> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> in the notification handler, so use a worker thread to smp_call
> throttle_check() on any of the cpus in the chipmask.

I see that the first patch takes care of reporting *per-chip* throttling
for the pmax capping condition. But where are we taking care of reporting
"pstate set to safe" and "freq control disabled" scenarios per-chip ?

> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 9268424..9618813 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>  static struct chip {
>  	unsigned int id;
>  	bool throttled;
> +	cpumask_t mask;
> +	struct work_struct throttle;
>  } *chips;
>
>  static int nr_chips;
> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>  	return powernv_pstate_info.max - powernv_pstate_info.nominal;
>  }
>
> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
> +static void powernv_cpufreq_throttle_check(void *data)
>  {
> +	unsigned int cpu = smp_processor_id();
>  	unsigned long pmsr;
>  	int pmsr_pmax, pmsr_lp, i;
>
> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>  		return 0;
>
>  	if (!throttled)
> -		powernv_cpufreq_throttle_check(smp_processor_id());
> +		powernv_cpufreq_throttle_check(NULL);
>
>  	freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>
> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>  	.notifier_call = powernv_cpufreq_reboot_notifier,
>  };
>
> +void powernv_cpufreq_work_fn(struct work_struct *work)
> +{
> +	struct chip *chip = container_of(work, struct chip, throttle);
> +
> +	smp_call_function_any(&chip->mask,
> +			      powernv_cpufreq_throttle_check, NULL, 0);
> +}
> +
>  static char throttle_reason[][30] = {
>  	"No throttling",
>  	"Power Cap",
> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>  	struct opal_msg *occ_msg = msg;
>  	uint64_t token;
>  	uint64_t chip_id, reason;
> +	int i;
>
>  	if (msg_type != OPAL_MSG_OCC)
>  		return 0;
> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>  		occ_reset = false;
>  		throttled = false;
>  		pr_info("OCC: Active\n");
> +
> +		for (i = 0; i < nr_chips; i++)
> +			schedule_work(&chips[i].throttle);
> +
>  		return 0;
>  	}
>
> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>  	else if (!reason)
>  		pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>  			throttle_reason[reason]);
> +	else
> +		return 0;

Why the else section ? The code can never reach here, can it ?

> +
> +	for (i = 0; i < nr_chips; i++)
> +		if (chips[i].id == chip_id)
> +			schedule_work(&chips[i].throttle);
>  	}

Should we not do this only when we get unthrottled, so as to cross verify
if it is indeed the case ? In case of a throttling notification, opal's
verdict is final and there is no need to cross verify, right ?

Perhaps the one thing that needs to be taken care of, in addition to
reporting throttling, is setting the chip's throttled parameter to true.
This should do, right ? I don't see the need to call throttle_check()
here.

Regards
Preeti U Murthy

>  	return 0;
>  }
> @@ -527,6 +549,8 @@ static int init_chip_info(void)
>  	for (i = 0; i < nr_chips; i++) {
>  		chips[i].id = chip[i];
Re: [PATCH v3 1/6] cpufreq: powernv: Handle throttling due to Pmax capping at chip level
Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
> max allowed frequency for that chip if the chip exceeds its power or
> temperature limits. As Pmax capping is a chip level condition, report
> this throttling behavior at chip level and also do not set the global
> 'throttled' on Pmax capping; instead set the per-chip throttled
> variable. Report unthrottling if Pmax is restored after throttling.
>
> This patch adds a structure to store the chip id and throttled state of
> the chip.
>
> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index ebef0d8..d0c18c9 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -27,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -42,6 +43,13 @@
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static bool rebooting, throttled;
>
> +static struct chip {
> +	unsigned int id;
> +	bool throttled;
> +} *chips;
> +
> +static int nr_chips;
> +
>  /*
>   * Note: The set of pstates consists of contiguous integers, the
>   * smallest of which is indicated by powernv_pstate_info.min, the
> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>  static void powernv_cpufreq_throttle_check(unsigned int cpu)
>  {
>  	unsigned long pmsr;
> -	int pmsr_pmax, pmsr_lp;
> +	int pmsr_pmax, pmsr_lp, i;
>
>  	pmsr = get_pmspr(SPRN_PMSR);
>
> +	for (i = 0; i < nr_chips; i++)
> +		if (chips[i].id == cpu_to_chip_id(cpu))
> +			break;
> +
>  	/* Check for Pmax Capping */
>  	pmsr_pmax = (s8)PMSR_MAX(pmsr);
>  	if (pmsr_pmax != powernv_pstate_info.max) {
> -		throttled = true;
> -		pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
> -		pr_info("Max allowed Pstate is capped\n");
> +		if (chips[i].throttled)
> +			goto next;
> +		chips[i].throttled = true;
> +		pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
> +			chips[i].id, pmsr_pmax);
> +	} else if (chips[i].throttled) {
> +		chips[i].throttled = false;

Is this check on pmax sufficient to indicate that the chip is
unthrottled ?

> +		pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
> +			chips[i].id, pmsr_pmax);
>  	}
>
>  	/*
>  	 * Check for Psafe by reading LocalPstate
>  	 * or check if Psafe_mode_active is set in PMSR.
>  	 */
> +next:
>  	pmsr_lp = (s8)PMSR_LP(pmsr);
>  	if ((pmsr_lp < powernv_pstate_info.min) ||
>  	    (pmsr & PMSR_PSAFE_ENABLE)) {
> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  	.attr		= powernv_cpu_freq_attr,

What about the situation where, although the OCC is active, this
particular chip has been throttled and we end up repeatedly reporting
"pstate set to safe" and "frequency control disabled from OS" ? Should we
not have a check on (chips[i].throttled) before reporting an anomaly for
these two scenarios as well, just like you have for pmsr_pmax ?

Regards
Preeti U Murthy
Re: [PATCH v3 5/6] cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is set
On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> On a reset cycle of OCC, although the system retires from the safe
> frequency state, the local pstate is not restored to Pmin or the last
> requested pstate. Now if the cpufreq governor initiates a pstate
> change, the local pstate will be in Psafe and we will be reporting a
> false positive when we are not throttled.
>
> So in powernv_cpufreq_throttle_check() remove the condition which
> checks if the local pstate is less than Pmin while checking for Psafe
> frequency. If the cpus are forced to Psafe then the
> PMSR.psafe_mode_active bit will be set. So, when OCCs become active
> this bit will be cleared. Let us just rely on this bit for reporting
> throttling.
>
> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 9618813..0a59d5b 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -39,7 +39,6 @@
>  #define PMSR_PSAFE_ENABLE	(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE	(1UL << 31)
>  #define PMSR_MAX(x)		((x >> 32) & 0xFF)
> -#define PMSR_LP(x)		((x >> 48) & 0xFF)
>  #define OCC_RESET		0
>  #define OCC_LOAD		1
>  #define OCC_THROTTLE		2
> @@ -316,7 +315,7 @@ static void powernv_cpufreq_throttle_check(void *data)
>  {
>  	unsigned int cpu = smp_processor_id();
>  	unsigned long pmsr;
> -	int pmsr_pmax, pmsr_lp, i;
> +	int pmsr_pmax, i;
>
>  	pmsr = get_pmspr(SPRN_PMSR);
>
> @@ -338,14 +337,9 @@ static void powernv_cpufreq_throttle_check(void *data)
>  			chips[i].id, pmsr_pmax);
>  	}
>
> -	/*
> -	 * Check for Psafe by reading LocalPstate
> -	 * or check if Psafe_mode_active is set in PMSR.
> -	 */
> +	/* Check if Psafe_mode_active is set in PMSR. */
>  next:
> -	pmsr_lp = (s8)PMSR_LP(pmsr);
> -	if ((pmsr_lp < powernv_pstate_info.min) ||
> -	    (pmsr & PMSR_PSAFE_ENABLE)) {
> +	if (pmsr & PMSR_PSAFE_ENABLE) {
>  		throttled = true;
>  		pr_info("Pstate set to safe frequency\n");
>  	}

Reviewed-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
Re: [PATCH v3 3/6] cpufreq: powernv: Register for OCC related opal_message notification
d = true;
> +			pr_crit("CPU Frequency is throttled\n");
> +		}
> +		pr_info("OCC: Reset\n");
> +		break;
> +	case OCC_LOAD:
> +		pr_info("OCC: Loaded\n");
> +		break;
> +	case OCC_THROTTLE:
> +		chip_id = be64_to_cpu(occ_msg->params[1]);
> +		reason = be64_to_cpu(occ_msg->params[2]);
> +
> +		if (occ_reset) {
> +			occ_reset = false;
> +			throttled = false;
> +			pr_info("OCC: Active\n");
> +			return 0;
> +		}
> +
> +		if (reason && reason <= 5)
> +			pr_info("OCC: Chip %u Pmax reduced due to %s\n",
> +				(unsigned int)chip_id,
> +				throttle_reason[reason]);
> +		else if (!reason)
> +			pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
> +				throttle_reason[reason]);
> +	}
> +	return 0;
> +}
> +
> +static struct notifier_block powernv_cpufreq_opal_nb = {
> +	.notifier_call	= powernv_cpufreq_occ_msg,
> +	.next		= NULL,
> +	.priority	= 0,
> +};
> +
>  static void powernv_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
>  	struct powernv_smp_call_data freq_data;
> @@ -481,6 +553,7 @@ static int __init powernv_cpufreq_init(void)
>  		return rc;
>
>  	register_reboot_notifier(&powernv_cpufreq_reboot_nb);
> +	opal_message_notifier_register(OPAL_MSG_OCC, &powernv_cpufreq_opal_nb);
>  	return cpufreq_register_driver(&powernv_cpufreq_driver);
>  }
>  module_init(powernv_cpufreq_init);

Looks good.

Reviewed-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>