On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> Hi Mark,

Hi Will,

> Cheers for the report. These things tend to be a pain to debug, but I've had
> a go.

Thanks for taking a look!

> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote:
> The interesting thing here is on the exit path:
>
> > Freed by task 10882:
> >  save_stack mm/kasan/kasan.c:447 [inline]
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
> >  kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
> >  slab_free_hook mm/slub.c:1393 [inline]
> >  slab_free_freelist_hook mm/slub.c:1414 [inline]
> >  slab_free mm/slub.c:2968 [inline]
> >  kmem_cache_free+0x88/0x270 mm/slub.c:2990
> >  __mmdrop+0x164/0x248 kernel/fork.c:604
>
> ^^ This should never run, because there's an mmgrab() about 8 lines above
> the mmput() in exit_mm.
>
> >  mmdrop+0x50/0x60 kernel/fork.c:615
> >  __mmput kernel/fork.c:981 [inline]
> >  mmput+0x270/0x338 kernel/fork.c:992
> >  exit_mm kernel/exit.c:544 [inline]
>
> Looking at exit_mm:
>
> 	mmgrab(mm);
> 	BUG_ON(mm != current->active_mm);
> 	/* more a memory barrier than a real lock */
> 	task_lock(current);
> 	current->mm = NULL;
> 	up_read(&mm->mmap_sem);
> 	enter_lazy_tlb(mm, current);
> 	task_unlock(current);
> 	mm_update_next_owner(mm);
> 	mmput(mm);
>
> Then the comment already rings some alarm bells: our spin_lock (as used
> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered
> due to being an atomic_inc) can be reordered with respect to the assignment
> of NULL to current->mm.
>
> If the exit()ing task had recently migrated from another CPU, then that
> CPU could concurrently run context_switch() and take this path:
>
> 	if (!prev->mm) {
> 		prev->active_mm = NULL;
> 		rq->prev_mm = oldmm;
> 	}

IIUC, on the prior context_switch, next->mm == NULL, so we set
next->active_mm to prev->mm. Then, in this context_switch we set
oldmm = prev->active_mm (where prev is next from the prior context
switch).

... right?

> which then means finish_task_switch will call mmdrop():
>
> 	struct mm_struct *mm = rq->prev_mm;
> 	[...]
> 	if (mm) {
> 		membarrier_mm_sync_core_before_usermode(mm);
> 		mmdrop(mm);
> 	}

... then here we use what was prev->active_mm in the most recent
context switch.

So AFAICT, we're never concurrently accessing a task_struct::mm field
here, only prev::{mm,active_mm} while prev is current...

[...]

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 995453d9fb55..f91e8d56b03f 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -534,8 +534,9 @@ static void exit_mm(void)
>  	}
>  	mmgrab(mm);
>  	BUG_ON(mm != current->active_mm);
> -	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/* Ensure we've grabbed the mm before setting current->mm to NULL */
> +	smp_mb__after_spin_lock();
>  	current->mm = NULL;

... and thus I don't follow why we would need to order these with
anything more than a compiler barrier (if we're preemptible here).

What have I completely misunderstood? ;)

Thanks,
Mark.
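
P.S. To make sure we're talking about the same reordering, below is a
rough userspace model of the concern as I understand it, using C11
atomics to stand in for the kernel primitives. All the fake_* and toy_*
names are mine, and this is an illustration of the suspected window,
not a reliable reproducer: the relaxed increment and the store are to
different locations, so on a weakly-ordered machine another thread may
observe the NULL store first; on a strongly-ordered machine a run will
almost certainly not fire.

	#include <stdatomic.h>
	#include <stdio.h>
	#include <pthread.h>

	static atomic_int fake_mm_count;   /* stands in for mm->mm_count */
	static atomic_int fake_current_mm; /* 1: "current->mm" set, 0: NULL */
	static atomic_int fake_lock;       /* toy spinlock, acquire/release only */

	static void toy_spin_lock(void)
	{
		int expected = 0;

		/* ACQUIRE semantics only, as with the kernel's spin_lock() */
		while (!atomic_compare_exchange_weak_explicit(&fake_lock,
				&expected, 1, memory_order_acquire,
				memory_order_relaxed))
			expected = 0;
	}

	static void toy_spin_unlock(void)
	{
		atomic_store_explicit(&fake_lock, 0, memory_order_release);
	}

	/* the exit_mm() side */
	static void *exiter(void *unused)
	{
		/* mmgrab(): a relaxed increment with no ordering of its own */
		atomic_fetch_add_explicit(&fake_mm_count, 1,
					  memory_order_relaxed);

		toy_spin_lock();
		/*
		 * current->mm = NULL. The ACQUIRE above only forbids this
		 * store moving before the lock; nothing forbids the
		 * increment above becoming visible after this store does.
		 */
		atomic_store_explicit(&fake_current_mm, 0,
				      memory_order_relaxed);
		toy_spin_unlock();
		return NULL;
	}

	/* the context_switch()/finish_task_switch() side */
	static void *switcher(void *unused)
	{
		/* if (!prev->mm) ... leading to mmdrop(rq->prev_mm) */
		if (atomic_load_explicit(&fake_current_mm,
					 memory_order_relaxed) == 0) {
			/*
			 * If this thread observes "current->mm == NULL"
			 * before it observes the mmgrab() increment, this
			 * decrement can take the count to zero and free the
			 * mm, with the in-flight increment then landing on
			 * freed memory.
			 */
			if (atomic_fetch_sub_explicit(&fake_mm_count, 1,
						      memory_order_relaxed) == 1)
				printf("count hit zero: would free the mm here\n");
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		atomic_init(&fake_mm_count, 1);   /* the active_mm reference */
		atomic_init(&fake_current_mm, 1);
		atomic_init(&fake_lock, 0);

		pthread_create(&a, NULL, exiter, NULL);
		pthread_create(&b, NULL, switcher, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);

		return 0;
	}

(Build with -pthread.) My question above still stands, though: I don't
yet see how the switcher side can be looking at this task's mm unless
prev is current, in which case the window shouldn't exist.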