On Mon, Apr 19, 2021 at 11:56:30AM +0100, Vincent Donnefort wrote:
> On Thu, Apr 15, 2021 at 03:32:11PM +0100, Valentin Schneider wrote:
> > On 15/04/21 10:59, Peter Zijlstra wrote:
> > > Can't make sense of what I did.. I've removed that hunk. Patch now looks
> > > like this.
> > >
> >
> > Small nit below, but regardless feel free to apply to the whole lot:
> > Reviewed-by: Valentin Schneider <valentin.schnei...@arm.com>
> >
> > @VincentD, ISTR you had tested the initial version of this with your fancy
> > shmancy hotplug rollback stresser. Feel like doing this
>
> I indeed wrote a test to verify all the rollback cases, up and down.
>
> It seems I encounter an intermitent issue while running several iterations of
> that test ... but I need more time to debug and figure-out where it is
> blocking.

Found the issue:

$ cat hotplug/states:
219: sched:active
220: online

CPU0:

$ echo 219 > hotplug/fail
$ echo 0 > online

=> cpu_active = 1 cpu_dying = 1

which means that later on, for another CPU hotunplug, in
__balance_push_cpu_stop(), the fallback rq for a kthread can select that
CPU0, but __migrate_task() would fail and we end up in an infinite loop,
trying to migrate that task to CPU0.

The problem is that for a failure in sched:active, as "online" has no
callback, there will be no call to cpuhp_invoke_callback(). Hence, the
cpu_dying bit would not be reset.

Maybe cpuhp_reset_state() and cpuhp_set_state() would then be a better place
to switch the dying bit?

> >
> > >
> > > So instead, make sure balance_push is enabled between
> > > sched_cpu_deactivate() and sched_cpu_activate() (eg. when
> > > !cpu_active()), and gate it's utility with cpu_dying().
> >
> > I'd word that "is enabled below sched_cpu_activate()", since
> > sched_cpu_deactivate() is now out of the picture.
> >
> > [...]
> > > @@ -7639,6 +7639,9 @@ static DEFINE_PER_CPU(struct cpu_stop_wo
> > >
> > >  /*
> > >   * Ensure we only run per-cpu kthreads once the CPU goes !active.
> > > + *
> > > + * This is active/set between sched_cpu_deactivate() / sched_cpu_activate().
> >
> > Ditto
> >
> > > + * But only effective when the hotplug motion is down.
> > >   */
> > >  static void balance_push(struct rq *rq)
> > >  {
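
For anyone who wants to poke at the rollback case reported in this mail
without a test box, here is a minimal userspace sketch of it. It is a toy
model, not kernel code: struct toy_cpu, the toy_*() helpers and the
flip_in_callback switch are all made up for illustration. It only assumes
what is stated above, namely that "sched:active" (219) has a callback,
"online" (220) has none, and a failure is injected at 219 while going down.
With flip_in_callback set, the dying bit is only touched when a callback is
invoked; with it clear, the bit is switched where the direction is set and
reset, which is the placement suggested above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only. The names echo the thread; none of this is kernel code. */
#define ST_SCHED_ACTIVE 219   /* "sched:active": has a callback */
#define ST_ONLINE       220   /* "online": no callback */
#define NO_FAIL         (-1)

struct toy_cpu {
    int  state;   /* current hotplug state */
    int  fail;    /* state at which a failure is injected, or NO_FAIL */
    bool active;  /* stands in for cpu_active() */
    bool dying;   /* stands in for cpu_dying() */
};

static bool toy_has_callback(int state)
{
    return state == ST_SCHED_ACTIVE;
}

/* Hypothetical stand-in for a callback invocation; returns non-zero on failure. */
static int toy_invoke_callback(struct toy_cpu *c, int state, bool bringup,
                               bool flip_in_callback)
{
    if (flip_in_callback)
        c->dying = !bringup;      /* dying bit follows the callback direction */
    if (c->fail == state) {
        c->fail = NO_FAIL;        /* injected failure fires once */
        return -1;
    }
    if (state == ST_SCHED_ACTIVE)
        c->active = bringup;
    return 0;
}

/* Walk down to target; on failure, roll back up to the starting state. */
static int toy_cpu_down(struct toy_cpu *c, int target, bool flip_in_callback)
{
    int prev = c->state;

    assert(target < c->state);
    if (!flip_in_callback)
        c->dying = true;          /* switched where the direction is set */

    while (c->state > target) {
        int s = c->state;

        if (toy_has_callback(s) &&
            toy_invoke_callback(c, s, false, flip_in_callback) != 0) {
            if (!flip_in_callback)
                c->dying = false; /* switched back where the state is reset */
            /* Rollback: states without a callback invoke nothing at all. */
            for (int up = s + 1; up <= prev; up++) {
                if (toy_has_callback(up))
                    toy_invoke_callback(c, up, true, flip_in_callback);
            }
            c->state = prev;
            return -1;
        }
        c->state--;
    }
    return 0;
}
```

Starting from { .state = ST_ONLINE, .fail = ST_SCHED_ACTIVE, .active = true }
and calling toy_cpu_down() with a target below 219, the flip_in_callback
variant ends with active and dying both set, since the rollback only crosses
"online" and never invokes a callback. The other variant ends with dying
cleared. The real state machine has many more states and callers, so this
only illustrates the ordering problem.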