On Tue 08-08-17 19:48:55, Andrea Arcangeli wrote: [...] > The bug corrected by this patch 1/2 I pointed it out last week while > reviewing other oom reaper fixes so that looks fine. > > However I'd prefer to dump MMF_UNSTABLE for good instead of adding > more of it. It can be replaced with unmap_page_range in > __oom_reap_task_mm with a function that arms a special migration entry > so that no branchs are added to the fast paths and it's all hidden > inside is_migration_entry slow paths.
This sounds like an interesting idea but I would like to address the _correctness_ issue first and optimize on top of it. If for nothing else backporting a follow up fix sounds easier than a complete rework. There are quite some callers of is_migration_entry and the patch won't be trivial either. So can we focus on the fix first please? [...] > Overall OOM killing to me was reliable also before the oom reaper was > introduced. Yeah, this is the case in my experience as well but there are others claiming otherwise and implementation wise the code was really fragile enough to support their claims. Unbound lockup on TIF_MEMDIE task just asks for troubles, especially when we have no idea what the oom victim might be doing. Things are very simple when the victim was kicked out from the userspace but this all gets very hairy when it was somewhere in the kernel waiting for locks. It seems that we are mostly lucky in the global oom situations. We have seen lockups with memcgs and had to move the memcg oom handling to a lockless PF context. Those two were not too different except the memcg was easier to hit. [...] > A couple of years ago I could trivially trigger OOM deadlocks on > various ext4 paths that loops or use GFP_NOFAIL, but that was just a > matter of letting GFP_NOIO/NOFS/NOFAIL kind of allocation go through > memory reserves below the low watermark. You would have to identify the dependency chain to do this properly, otherwise you simply consume memory reserves and you are back to square one. > It is also fine to kill a few more processes in fact. I strongly disagree. It might be acceptable to kill more tasks if there is absolutely no other choice. OOM killing is a very disruptive action and we shoud _really_ reduce it to absolute minimum. [...] > The main point of the oom reaper nowadays is to free memory fast > enough so a second task isn't killed as a false positive, but it's not > like anybody will notice much of a difference if a second task is > killed, it wasn't commonly happening either. No, you seem to misunderstand. Adding a kernel thread to optimize a glacial kind of slow path would be really hard to justify. The sole purpose of the oom reaper is _reliability_. We do not select another task from an oom domain if there is an existing oom victim alive. So we do not need the reaper to prevent another victim selection. All we need this async context for is to _guarantee_ that somebody tries to reclaim as much memory of the victim as possible and then allow the oom killer to continue if the OOM situation is not resolve. Because that endless waiting for a sync context is what causes those lockups. > Certainly it's preferable to get two tasks killed than corrupted core > dumps or corrupted memory, so if oom reaper will stay we need to > document how we guarantee it's mutually exclusive against core dumping corrupted anonymous memory in the core dump was deemed acceptable trade off to get a more reliable oom handling. If there is a strong usecase for the reliable core dump then we can work on it, of course but the system stability is at the first place IMHO. -- Michal Hocko SUSE Labs