On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
> >     if (unlikely(mm_is_oom_victim(mm))) {
> >             /*
> >              * Manually reap the mm to free as much memory as possible.
> > -            * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -            * this mm from further consideration.  Taking mm->mmap_sem for
> > -            * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -            * reaper will not run on this mm again after mmap_sem is
> > -            * dropped.
> > -            *
> >              * Nothing can be holding mm->mmap_sem here and the above call
> >              * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> >              * __oom_reap_task_mm() will not block.
> >              *
> > +            * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
> >              * This needs to be done before calling munlock_vma_pages_all(),
> >              * which clears VM_LOCKED, otherwise the oom reaper cannot
> > -            * reliably test it.
> > +            * reliably test for it.  If the oom reaper races with
> > +            * munlock_vma_pages_all(), this can result in a kernel oops if
> > +            * a pmd is zapped, for example, after follow_page_mask() has
> > +            * checked pmd_none().
> >              */
> >             mutex_lock(&oom_lock);
> >             __oom_reap_task_mm(mm);
> >             mutex_unlock(&oom_lock);
> 
> I don't like holding oom_lock for the full teardown of an mm; an OOM victim's
> mm might have multiple TB of memory, which could take a long time.
> 

This patch does not introduce any deltas for oom_lock here; that can certainly 
be changed on top of this patch.  I'm not attempting to address any oom_lock 
issue here, so it should pose no roadblock for you.

I only propose this patch now because it fixes millions of processes being 
oom killed unnecessarily, it was in -mm before being NACKed over trivial 
fixes that have since been squashed into it, and it is actually tested.

> >  
> > -           set_bit(MMF_OOM_SKIP, &mm->flags);
> > +           /*
> > +            * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > +            * guarantee that the oom reaper will not run on this mm again
> > +            * after mmap_sem is dropped.
> > +            */
> >             down_write(&mm->mmap_sem);
> >             up_write(&mm->mmap_sem);
> >     }
> 
> 
> 
> > -#define MAX_OOM_REAP_RETRIES 10
> >  static void oom_reap_task(struct task_struct *tsk)
> >  {
> > -   int attempts = 0;
> >     struct mm_struct *mm = tsk->signal->oom_mm;
> >  
> > -   /* Retry the down_read_trylock(mmap_sem) a few times */
> > -   while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > -           schedule_timeout_idle(HZ/10);
> > +   /*
> > +    * If this mm has either been fully unmapped, or the oom reaper has
> > +    * given up on it, nothing left to do except drop the refcount.
> > +    */
> > +   if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > +           goto drop;
> >  
> > -   if (attempts <= MAX_OOM_REAP_RETRIES ||
> > -       test_bit(MMF_OOM_SKIP, &mm->flags))
> > -           goto done;
> > +   /*
> > +    * If this mm has already been reaped, doing so again will not likely
> > +    * free additional memory.
> > +    */
> > +   if (!test_bit(MMF_UNSTABLE, &mm->flags))
> > +           oom_reap_task_mm(tsk, mm);
> 
> This is still wrong.  If preempted immediately after set_bit(MMF_UNSTABLE,
> &mm->flags) in __oom_reap_task_mm() called from exit_mmap(), oom_reap_task()
> can give up before reclaiming any memory.

If a single thread is holding onto the mm, has reached exit_mmap(), and is 
in the process of starting oom reaping itself, there is no advantage to the 
oom reaper also trying to reap it.  The thread in exit_mmap() will take care 
of it, __oom_reap_task_mm() does not block, and oom_free_timeout_ms allows 
enough time for that memory freeing to occur.  The oom reaper will not set 
MMF_OOM_SKIP until the timeout has expired.
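
For clarity, here is a rough sketch of the reaper-side flow I'm describing.  
Only MMF_UNSTABLE, MMF_OOM_SKIP, oom_reap_task_mm() and oom_free_timeout_ms 
come from the series; the oom_start field and the requeue helper are made-up 
names used just to show the ordering, not actual code from the patch:

/*
 * Illustrative sketch only, not the patch itself: shows that the reaper
 * skips re-reaping an MMF_UNSTABLE mm and only sets MMF_OOM_SKIP once
 * oom_free_timeout_ms has elapsed.
 */
static void oom_reap_task_sketch(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->signal->oom_mm;

	/* Fully unmapped, or already given up on: just drop the reference. */
	if (test_bit(MMF_OOM_SKIP, &mm->flags))
		goto drop;

	/*
	 * If exit_mmap() (or a previous pass here) has already set
	 * MMF_UNSTABLE, the memory is being freed; reaping again gains
	 * nothing, so only the timeout check below matters.
	 */
	if (!test_bit(MMF_UNSTABLE, &mm->flags))
		oom_reap_task_mm(tsk, mm);

	/*
	 * Declare the victim's memory unreclaimable, and thereby allow
	 * another oom kill, only once oom_free_timeout_ms has elapsed
	 * since the victim was selected (oom_start is a made-up field).
	 */
	if (time_after_eq(jiffies, tsk->signal->oom_start +
			  msecs_to_jiffies(oom_free_timeout_ms)))
		set_bit(MMF_OOM_SKIP, &mm->flags);
	else
		requeue_oom_reaper(tsk);	/* hypothetical helper */
	return;
drop:
	put_task_struct(tsk);
}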

As I said before, you could make a case for extending the timeout once 
MMF_UNSTABLE has been set.  In practice, we haven't encountered a case 
where that matters, but that's trivial to do if you would prefer.
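
Something along these lines would do it (again only a sketch; the 
mm->oom_free_expire field name is an assumption, not taken from the patch):

/*
 * Sketch only: one way to extend the grace period once MMF_UNSTABLE is
 * set.  mm->oom_free_expire stands in for whatever jiffies deadline the
 * oom reaper compares against before it sets MMF_OOM_SKIP.
 */
static void mark_mm_unstable(struct mm_struct *mm)
{
	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
		return;		/* someone else is already reaping */

	/* Unmapping has actually begun; restart the clock. */
	mm->oom_free_expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
}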
