On Wed, Jul 01, 2020 at 05:54:11PM -0700, Tim Chen wrote: > > > On 7/1/20 4:28 PM, Joel Fernandes wrote: > > On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote: > >> From: Peter Zijlstra <pet...@infradead.org> > >> > >> Instead of only selecting a local task, select a task for all SMT > >> siblings for every reschedule on the core (irrespective which logical > >> CPU does the reschedule). > >> > >> There could be races in core scheduler where a CPU is trying to pick > >> a task for its sibling in core scheduler, when that CPU has just been > >> offlined. We should not schedule any tasks on the CPU in this case. > >> Return an idle task in pick_next_task for this situation. > >> > >> NOTE: there is still potential for siblings rivalry. > >> NOTE: this is far too complicated; but thus far I've failed to > >> simplify it further. > >> > >> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> > >> Signed-off-by: Julien Desfossez <jdesfos...@digitalocean.com> > >> Signed-off-by: Vineeth Remanan Pillai <vpil...@digitalocean.com> > >> Signed-off-by: Aaron Lu <aaron...@linux.alibaba.com> > >> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com> > > > > Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the > > below patch's Link tag. Patch description below describes the issues fixed > > and it applies on top of this patch. > > > > ------8<---------- > > > > From: "Joel Fernandes (Google)" <j...@joelfernandes.org> > > Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection > > logic > > > > The selection logic does not run correctly if the current CPU is not in the > > cpu_smt_mask (which it is not because the CPU is offlined when the stopper > > finishes running and needs to switch to idle). There are also other issues > > fixed by the patch I think such as: if some other sibling set core_pick to > > something, however the selection logic on current cpu resets it before > > selecting. In this case, we need to run the task selection logic again to > > make sure it picks something if there is something to run. It might end up > > picking the wrong task. Yet another issue was, if the stopper thread is an
"It might end up picking the wrong task" needs to be: "We might end up picking a different task but that's Ok". > > unconstrained pick, then rq->core_pick is set. The next time task selection > > logic runs when stopper needs to switch to idle, the current CPU is not in > > the smt_mask. This causes the previous ->core_pick to be picked again which > > happens to be the unconstrained task! so the stopper keeps getting selected > > forever. > > > > That and there are a few more safe guards and checks around checking/setting > > rq->core_pick. To test it, I ran rcutorture and made it tag all torture > > threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit > > the > > issue. Now it runs for an hour or so without issue. (Torture testing debug > > changes: https://bit.ly/38htfqK ). > > > > Various fixes were tried causing varying degrees of crashes. Finally I > > found > > that it is easiest to just add current CPU to the smt_mask's copy always. > > This is so that task selection logic always runs on the current CPU which > > called schedule(). > > > It looks good to me. Thank you for your review! Could I add your Reviewed-by tag to the patch? - Joel > Thanks. > > Tim