On Wed, 23 Oct 2024 at 11:36, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 23, 2024 at 11:03:11AM +0200, Marco Elver wrote:
> > On Wed, 23 Oct 2024 at 10:54, Marco Elver <[email protected]> wrote:
> > > On Tue, Oct 22, 2024 at 09:57PM +0200, Marco Elver wrote:
> > > > On Tue, 22 Oct 2024 at 21:12, Peter Zijlstra <[email protected]>
> > > > wrote:
> > > [...]
> > > > > So KCSAn is trying to tell me these two paths run concurrently on the
> > > > > same 'p' ?!? That would be a horrible bug -- both these call chains
> > > > > should be holding rq->__lock (for task_rq(p)).
> > > >
> > > > Yes correct.
> > > >
> > > > And just to confirm this is no false positive, the way KCSAN works
> > > > _requires_ the race to actually happen before it reports anything;
> > > > this can also be seen in Alexander's report with just 1 stack trace
> > > > where it saw the value transition from 0 to 1 (TASK_ON_RQ_QUEUED) but
> > > > didn't know who did the write because kernel/sched was uninstrumented.
> > >
> > > Got another version of the splat with CONFIG_KCSAN_VERBOSE=y. Lockdep
> > > seems to think that both threads here are holding rq->__lock.
> >
> > Gotta read more carefully, one instance is ffffa2e57dc2f398 another is
> > ffffa2e57dd2f398. If I read it right, then they're not actually the
> > same lock.
>
> Yeah, as explained in the diagram below, the moment the ->on_rq = 0
> store goes through, we no longer own the task. And since
> ASSERT_EXCLUSIVE_WRITER is after that, we go splat.
>
> The below patch changes this order and switches to using
> smp_store_release() and ensures to not reference the task after it.
>
> I've boot tested it, but not much else.
>
> Could you please give this a go (on top of -rc3)?
>
> This also explains the SCHED_WARN_ON() Kent saw, that is subject to the
> same race.
>
> ---
>  kernel/sched/fair.c  | 21 ++++++++++++++-------
>  kernel/sched/sched.h | 34 ++++++++++++++++++++++++++++++++--
>  2 files changed, 46 insertions(+), 9 deletions(-)

[...]

Tested-by: Marco Elver <[email protected]>

Previously syzkaller would give us that report within ~1h of fuzzing. Have
been fuzzing with your patch applied for 3h now, and this report has not
resurfaced.
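
For readers following the thread, the ordering point quoted above (do all
owner-only work first, then make the ->on_rq = 0 store the last access to
the task) can be sketched in userspace C11. This is only an illustration
under assumptions, not the actual patch: struct task, owner_checks,
block_task_sketch() and run_demo() are hypothetical names, the assert()
merely stands in for the role ASSERT_EXCLUSIVE_WRITER plays in the kernel,
and atomic_store_explicit(..., memory_order_release) is used as a rough
userspace analogue of smp_store_release().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct task {
    atomic_int on_rq;   /* 1 while queued and owned here, 0 once handed off */
    int owner_checks;   /* plain field that only the current owner may touch */
};

/*
 * Perform every access that needs ownership first, then publish
 * on_rq = 0 with release semantics as the very last access to *p.
 */
static void block_task_sketch(struct task *p)
{
    /* Owner-only check: must run while we still own the task. */
    assert(atomic_load_explicit(&p->on_rq, memory_order_relaxed) == 1);
    p->owner_checks++;

    /* Hand-off point: earlier writes are ordered before this store. */
    atomic_store_explicit(&p->on_rq, 0, memory_order_release);

    /* Do not dereference p past this point; another CPU may own it now. */
}

/* Single-threaded demo of the ordering; returns 0 when state is as expected. */
static int run_demo(void)
{
    struct task t = { .owner_checks = 0 };

    atomic_init(&t.on_rq, 1);
    block_task_sketch(&t);

    /* An acquire load pairs with the release store in the sketch above. */
    if (atomic_load_explicit(&t.on_rq, memory_order_acquire) != 0)
        return 1;
    return t.owner_checks == 1 ? 0 : 2;
}
```

The demo is single-threaded, so it only shows the intended order of
operations; it does not reproduce the race itself. If the owner-only check
were moved after the release store, a second thread that observed
on_rq == 0 could already be using the task.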
