On Tue 25-07-17 20:26:19, Andrea Arcangeli wrote:
> On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> > That problem is real though as reported by David.
> 
> I'm not against fixing it, I just think it's not a major concern, and
> the solution doesn't seem optimal as measured by Kirill.
> 
> I'm just skeptical it's the best to solve that tiny race, 99.9% of the
> time such down_write is unnecessary.
> 
> > it is not only about exit_mmap. __mmput calls into exit_aio and that can
> > wait for completion and there is no way to guarantee this will finish in
> > finite time.
> 
> exit_aio blocking is actually the only good point for wanting this
> concurrency where exit_mmap->unmap_vmas and
> oom_reap_task->unmap_page_range have to run concurrently on the same
> mm.

Yes, exit_aio is the only blocking call I know of currently. But I would
like this to be as robust as possible and so I do not want to rely on
the current implementation. This can change in future and I can
guarantee that nobody will think about the oom path when adding
something to the final __mmput path.

> exit_mmap would have no issue, if there was enough time in the
> lifetime CPU to allocate the memory, sure the memory will also be
> freed in finite amount of time by exit_mmap.

I am not sure I understand. Say that any call prior to unmap_vmas blocks
on a lock which is held by another call path which cannot proceed with
the allocation...
 
> In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
> solve that, so depending on the runtime it may have been better not to
> wait all memory of the process to be freed before moving to the next
> task, but only a couple of seconds before the OOM reaper moves to a
> new candidate. Again this is only a tradeoff between solving the OOM
> faster vs risk of false positives OOM.

I really do not want to rely on any timing. This just too fragile. Once
we have killed a task then we shouldn't pick another victim until it
passed exit_mmap or the oom_reaper did its job. Otherwise we just risk
false positives while we have already disrupted the workload.
 
> If it wasn't because of exit_aio (which may have to wait I/O
> completion), changing the OOM reaper to return "false" if
> mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
> have been enough (and depending on the runtime it may have solved OOM
> faster in NUMA) and there would be absolutely no need to run OOM
> reaper and exit_mmap concurrently on the same mm. However there's such
> exit_aio..
> 
> Raw I/O mempools never require memory allocations, although aio if it
> involves a filesystem to complete may run into filesystem or buffering
> locks which are known to loop forever or depend on other tasks stuck
> in kernel allocations, so I didn't go down that chain too long.

Exactly. We simply cannot assume anything here because veryfying this
basically impossible.
 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..615133762b99 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
>       /* Use -1 here to ensure all VMAs in the mm are unmapped */
>       unmap_vmas(&tlb, vma, 0, -1);
>  
> +     if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +             /* wait the OOM reaper to stop working on this mm */
> +             down_write(&mm->mmap_sem);
> +             up_write(&mm->mmap_sem);
> +     }
>       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>       tlb_finish_mmu(&tlb, 0, -1);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f030c1c..2a7000995784 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>       struct mmu_gather tlb;
>       struct vm_area_struct *vma;
>       bool ret = true;
> +     bool mmgot = true;
>  
>       /*
>        * We have to make sure to not race with the victim exit path
> @@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
> struct mm_struct *mm)
>        * and delayed __mmput doesn't matter that much
>        */
>       if (!mmget_not_zero(mm)) {
> -             up_read(&mm->mmap_sem);
>               trace_skip_task_reaping(tsk->pid);
> -             goto unlock_oom;
> +             /*
> +              * MMF_OOM_SKIP is set by exit_mmap when the OOM
> +              * reaper can't work on the mm anymore.
> +              */
> +             if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> +                     up_read(&mm->mmap_sem);
> +                     goto unlock_oom;
> +             }
> +             mmgot = false;
>       }

This will work more or less the same to what we have currently.

[victim]                [oom reaper]                            [oom killer]
do_exit                 __oom_reap_task_mm
  mmput
    __mmput
                          mmget_not_zero
                            test_and_set_bit(MMF_OOM_SKIP)
                                                                
oom_evaluate_task
                                                                   # select 
next victim 
                          # reap the mm
      unmap_vmas

so we can select a next victim while the current one is still not
completely torn down.

> > > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> > >    mmap_sem and then the arch code will release the mmap_sem despite
> > >    it was already released by handle_mm_fault? Anonymous memory faults
> > >    aren't common to return VM_FAULT_RETRY but an userfault
> > >    can. Shouldn't there be a block that prevents overwriting if
> > >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > > 
> > >   if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > >                           && test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > >           ret = VM_FAULT_SIGBUS;
> > 
> > I am not sure I understand what you mean and how this is related to the
> > patch?
> 
> It's not related to the patch but it involves the OOM reaper as it
> only happens when MMF_UNSTABLE is set which is set only by the OOM
> reaper. I was simply reading the OOM reaper code and following up what
> MMF_UNSTABLE does and it ringed a bell.

I hope 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom
reaped memory") will clarify this code. If not please start a new thread
so that we do not conflate different things together.

-- 
Michal Hocko
SUSE Labs

Reply via email to