Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On 05/15, Sultan Alsawaf wrote:
>
> On Wed, May 15, 2019 at 04:58:32PM +0200, Oleg Nesterov wrote:
> > Could you explain in detail what exactly did you do and what do you see in
> > dmesg?
> >
> > Just in case, lockdep complains only once, print_circular_bug() does
> > debug_locks_off(), so if it has already reported another false positive,
> > __lock_acquire() will simply return after that.
> >
> > Oleg.
>
> This is what I did:
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 774ab79d3ec7..009e7d431a88 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3078,6 +3078,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> 	int class_idx;
> 	u64 chain_key;
>
> +	BUG_ON(!debug_locks || !prove_locking);
> 	if (unlikely(!debug_locks))
> 		return 0;
>
> diff --git a/lib/debug_locks.c b/lib/debug_locks.c
> index 124fdf238b3d..4003a18420fb 100644
> --- a/lib/debug_locks.c
> +++ b/lib/debug_locks.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(debug_locks_silent);
>  */
> int debug_locks_off(void)
> {
> +	return 0;
> 	if (debug_locks && __debug_locks_off()) {
> 		if (!debug_locks_silent) {
> 			console_verbose();

OK, this means that debug_locks_off() always returns 0, as if debug_locks had
already been cleared. Thus print_deadlock_bug() will do nothing; it does

	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
		return 0;

IOW, this means that even if lockdep finds a problem, the problem won't be
reported.

> [1.492128] BUG: key not in .data!
> [1.492141] BUG: key not in .data!
> [1.492152] BUG: key not in .data!
> [1.492228] BUG: key not in .data!
> [1.492238] BUG: key not in .data!
> [1.492248] BUG: key not in .data!

I guess this is lockdep_init_map(), which does printk("BUG:") itself, but due
to your change above it doesn't do WARN(1) and thus there is no call trace.

Oleg.

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Wed, 15 May 2019 11:52:57 -0700 Sultan Alsawaf wrote:
> On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> > I'm confused why you did this?
>
> Oleg said that debug_locks_off() could've been called and thus prevented
> lockdep complaints about simple_lmk from appearing. To eliminate any
> possibility of that, I disabled debug_locks_off().

But I believe that when lockdep discovers an issue, the data from then on is
not reliable, which is why we turn it off. Just commenting out the disabling
makes lockdep unreliable and is not a proper way to test your code: it can
then miss locking issues after one was discovered. Thus, you are not properly
testing the locking in your code.

-- Steve
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> I'm confused why you did this?

Oleg said that debug_locks_off() could've been called and thus prevented
lockdep complaints about simple_lmk from appearing. To eliminate any
possibility of that, I disabled debug_locks_off().

Oleg also said that __lock_acquire() could return early if lock debugging were
somehow turned off after lockdep reported one bug. To mitigate any possibility
of that as well, I threw in the BUG_ON() for good measure.

I think at this point it's pretty clear that lockdep truly isn't complaining
about simple_lmk's locking pattern, and that lockdep's lack of complaints
isn't due to it being mysteriously turned off...

Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Wed, 15 May 2019 10:27:28 -0700 Sultan Alsawaf wrote:
> On Wed, May 15, 2019 at 04:58:32PM +0200, Oleg Nesterov wrote:
> > Could you explain in detail what exactly did you do and what do you see in
> > dmesg?
> >
> > Just in case, lockdep complains only once, print_circular_bug() does
> > debug_locks_off(), so if it has already reported another false positive,
> > __lock_acquire() will simply return after that.
> >
> > Oleg.
>
> This is what I did:
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 774ab79d3ec7..009e7d431a88 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3078,6 +3078,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> 	int class_idx;
> 	u64 chain_key;
>
> +	BUG_ON(!debug_locks || !prove_locking);
> 	if (unlikely(!debug_locks))
> 		return 0;
>
> diff --git a/lib/debug_locks.c b/lib/debug_locks.c
> index 124fdf238b3d..4003a18420fb 100644
> --- a/lib/debug_locks.c
> +++ b/lib/debug_locks.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(debug_locks_silent);
>  */
> int debug_locks_off(void)
> {
> +	return 0;

I'm confused why you did this?

-- Steve

> 	if (debug_locks && __debug_locks_off()) {
> 		if (!debug_locks_silent) {
> 			console_verbose();
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On 05/13, Sultan Alsawaf wrote:
>
> On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> > I am starting to think I am ;)
> >
> > If you have task1 != task2 this code
> >
> > 	task_lock(task1);
> > 	task_lock(task2);
> >
> > should trigger print_deadlock_bug(); task1->alloc_lock and
> > task2->alloc_lock are the "same" lock from lockdep pov, held_lock's will
> > have the same hlock_class().
>
> Okay, I've stubbed out debug_locks_off(), and lockdep is now complaining
> about a bunch of false positives so it is _really_ enabled this time.

Could you explain in detail what exactly did you do and what do you see in
dmesg?

Just in case, lockdep complains only once, print_circular_bug() does
debug_locks_off(), so if it has already reported another false positive,
__lock_acquire() will simply return after that.

Oleg.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 14, 2019 at 12:44:53PM -0400, Steven Rostedt wrote:
> OK, this has gotten my attention.
>
> This thread is quite long, do you have a git repo I can look at, and
> also where is the first task_lock() taken before the
> find_lock_task_mm()?
>
> -- Steve

Hi Steve,

This is the git repo I work on:
https://github.com/kerneltoast/android_kernel_google_wahoo

With the newest simple_lmk iteration being this commit:
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/6b145b8c28b39f7047393169117f72ea7387d91c

This repo is based off the 4.4 kernel that Google ships on the Pixel 2/2XL.

simple_lmk iterates through the entire task list more than once and locks
potential victims using find_lock_task_mm(). It keeps these potential victims
locked across the multiple times that the task list is iterated.

The locking pattern that Oleg said should make lockdep complain is this:
iterating through the entire task list more than once can lead to locking the
same task that was locked earlier with find_lock_task_mm(), and thus deadlock.
There is a check in simple_lmk that avoids locking potential victims that were
already found, which avoids the deadlock, but lockdep doesn't know about the
check (which is done with vtsk_is_duplicate()) and should therefore complain.

Lockdep does not complain though.

Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, 13 May 2019 09:45:55 -0700 Sultan Alsawaf wrote:
> On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> > I am starting to think I am ;)
> >
> > If you have task1 != task2 this code
> >
> > 	task_lock(task1);
> > 	task_lock(task2);
> >
> > should trigger print_deadlock_bug(), task1->alloc_lock and
> > task2->alloc_lock are the "same" lock from lockdep pov, held_lock's will
> > have the same hlock_class().

OK, this has gotten my attention.

This thread is quite long, do you have a git repo I can look at, and also
where is the first task_lock() taken before the find_lock_task_mm()?

-- Steve
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> I am starting to think I am ;)
>
> If you have task1 != task2 this code
>
> 	task_lock(task1);
> 	task_lock(task2);
>
> should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock
> are the "same" lock from lockdep pov, held_lock's will have the same
> hlock_class().

Okay, I've stubbed out debug_locks_off(), and lockdep is now complaining about
a bunch of false positives, so it is _really_ enabled this time. I grepped for
lockdep last time to try and find a concise way to show over email that
lockdep didn't complain, but that backfired. Here is better evidence:

taimen:/ # dmesg | grep simple_lmk
[ 58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 KiB
[ 58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 KiB
[ 58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 KiB
[ 58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 KiB
[ 58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 KiB
[ 58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 KiB
[ 58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 KiB
[ 58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 KiB
[ 58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 KiB
[ 58.762060] simple_lmk: Killing ndroid.settings with adj 700 to free 74708 KiB
[ 58.763238] simple_lmk: Killing roid.apps.turbo with adj 700 to free 54660 KiB
[ 58.873337] simple_lmk: Killing d.process.acore with adj 700 to free 48540 KiB
[ 58.873513] simple_lmk: Killing d.process.media with adj 500 to free 46188 KiB
[ 58.873713] simple_lmk: Killing putmethod.latin with adj 200 to free 67304 KiB
[ 59.014046] simple_lmk: Killing android.vending with adj 201 to free 54900 KiB
[ 59.017623] simple_lmk: Killing rak.mibandtools with adj 201 to free 44552 KiB
[ 59.018423] simple_lmk: Killing eport.trebuchet with adj 100 to free 106208 KiB
[ 59.223633] simple_lmk: Killing id.printspooler with adj 900 to free 39664 KiB
[ 59.223762] simple_lmk: Killing gle.android.gms with adj 100 to free 64176 KiB
[ 70.955204] simple_lmk: Killing .apps.translate with adj 906 to free 47564 KiB
[ 70.955633] simple_lmk: Killing cloudprint:sync with adj 906 to free 31932 KiB
[ 70.955787] simple_lmk: Killing oid.apps.photos with adj 904 to free 50204 KiB
[ 71.060789] simple_lmk: Killing ecamera.android with adj 906 to free 32232 KiB
[ 71.061074] simple_lmk: Killing webview_service with adj 906 to free 26028 KiB
[ 71.061199] simple_lmk: Killing com.whatsapp with adj 904 to free 49484 KiB
[ 71.190625] simple_lmk: Killing rbandroid.sleep with adj 906 to free 54724 KiB
[ 71.190775] simple_lmk: Killing android.vending with adj 906 to free 39848 KiB
[ 71.191303] simple_lmk: Killing m.facebook.orca with adj 904 to free 72296 KiB
[ 71.342042] simple_lmk: Killing droid.deskclock with adj 902 to free 49284 KiB
[ 71.342240] simple_lmk: Killing .apps.wellbeing with adj 900 to free 47632 KiB
[ 71.342529] simple_lmk: Killing le.modemservice with adj 800 to free 33648 KiB
[ 71.482391] simple_lmk: Killing d.process.media with adj 800 to free 40676 KiB
[ 71.482511] simple_lmk: Killing rdog.challegram with adj 700 to free 71920 KiB
taimen:/ #

The first simple_lmk message appears at 58.349917. And based on the
timestamps, it's clear that simple_lmk called task_lock() for multiple
different tasks, which is the pattern you think should cause lockdep to
complain.

But here is the full dmesg starting from that point:
[ 58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 KiB
[ 58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 KiB
[ 58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 KiB
[ 58.432654] binder_alloc: 2284: binder_alloc_buf failed to map pages in userspace, no vma
[ 58.432671] binder: 1206:1218 transaction failed 29189/-3, size 76-0 line 3189
[ 58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 KiB
[ 58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 KiB
[ 58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 KiB
[ 58.590785] binder: undelivered transaction 58370, process died.
[ 58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 KiB
[ 58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 KiB
[ 58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 KiB
[ 58.736678] binder: undelivered transaction 58814, process died.
[ 58.736733] binder: release 3099:3128 transaction 57832 in, still active
[ 58.736744] binder: send failed reply for transaction 57832 to 1876:3090
[ 58.736761] binder: undelivered TRANSACTION_COMPLETE
[
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On 05/09, Sultan Alsawaf wrote:
>
> On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> > Impossible ;) I bet lockdep should report the deadlock as soon as
> > find_victims() calls find_lock_task_mm() when you already have a locked
> > victim.
>
> I hope you're not a betting man ;)

I am starting to think I am ;)

If you have task1 != task2 this code

	task_lock(task1);
	task_lock(task2);

should trigger print_deadlock_bug(); task1->alloc_lock and task2->alloc_lock
are the "same" lock from lockdep pov, held_lock's will have the same
hlock_class().

> CONFIG_PROVE_LOCKING=y

OK,

> And a printk added in vtsk_is_duplicate() to print when a duplicate is
> detected,

in this case find_lock_task_mm() won't be called, and this is what saves us
from the actual deadlock.

> and my phone's memory cut in half to make simple_lmk do something, this is
> what I observed:

> taimen:/ # dmesg | grep lockdep
> [0.00] \x09RCU lockdep checking is enabled.

this reports that CONFIG_PROVE_RCU is enabled ;)

> taimen:/ # dmesg | grep simple_lmk
> [ 23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 kiB
> [ 23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 kiB

yes, looks like simple_lmk has at least 2 locked victims. And I have no idea
why you do not see anything else in dmesg. Maybe debug_locks_off() was already
called. But see above, "grep lockdep" won't work. Perhaps you can do
"grep -e WARNING -e BUG -e locking".

Oleg.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> Impossible ;) I bet lockdep should report the deadlock as soon as
> find_victims() calls find_lock_task_mm() when you already have a locked
> victim.

I hope you're not a betting man ;)

With the following configured:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_DEBUG_SPINLOCK_BITE_ON_BUG is not set
CONFIG_DEBUG_SPINLOCK_PANIC_ON_BUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set

And a printk added in vtsk_is_duplicate() to print when a duplicate is
detected, and my phone's memory cut in half to make simple_lmk do something,
this is what I observed:

taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ # dmesg | grep simple_lmk
[ 23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 kiB
[ 23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 kiB
[ 23.248457] simple_lmk: Killing .apps.translate with adj 904 to free 45884 kiB
[ 23.248720] simple_lmk: Killing ndroid.settings with adj 904 to free 42868 kiB
[ 23.313417] simple_lmk: DUPLICATE VTSK!
[ 23.313440] simple_lmk: Killing ndroid.keychain with adj 906 to free 33680 kiB
[ 23.313513] simple_lmk: Killing com.whatsapp with adj 904 to free 51436 kiB
[ 34.646695] simple_lmk: DUPLICATE VTSK!
[ 34.646717] simple_lmk: Killing ndroid.apps.gcs with adj 906 to free 37956 kiB
[ 34.646792] simple_lmk: Killing droid.apps.maps with adj 904 to free 63600 kiB
taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ #

> As for
> https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad
>
> Well,
>
> 	mmdrop(mm);
> 	simple_lmk_mm_freed(mm);
>
> looks racy because mmdrop(mm) can free this mm_struct. Yes,
> simple_lmk_mm_freed() does not dereference this pointer, but the same memory
> can be re-allocated as another ->mm for a new task which can be found by
> find_victims(), and _in theory_ this all can happen in between, so the
> "victims[i].mm == mm" check can be a false positive.
>
> And this also means that simple_lmk_mm_freed() should clear victims[i].mm
> when it detects "victims[i].mm == mm", otherwise we have the same
> theoretical race; victims_to_kill is only cleared when the last victim goes
> away.

Fair point. Putting simple_lmk_mm_freed(mm) right before mmdrop(mm), and
sprinkling in a cmpxchg in simple_lmk_mm_freed(), should fix that up.

> Another nit... you can drop tasklist_lock right after the 1st "find_victims"
> loop.

True!

> And it seems that you do not really need to walk the "victims" array twice
> after that; you can do everything in a single loop, but this is cosmetic.

Won't this result in potentially holding the task lock way longer than
necessary for multiple tasks that aren't going to be killed?

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On 05/07, Sultan Alsawaf wrote:
>
> On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote:
> > Did you test this patch with lockdep enabled?
> >
> > If I read the patch correctly, lockdep should complain.
> > vtsk_is_duplicate() ensures that we do not take the same ->alloc_lock
> > twice or more, but lockdep can't know this.
>
> Yeah, lockdep is fine with this, at least on 4.4.

Impossible ;) I bet lockdep should report the deadlock as soon as
find_victims() calls find_lock_task_mm() when you already have a locked
victim. Nevermind, I guess this code won't run with lockdep enabled...

As for
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad

Well,

	mmdrop(mm);
	simple_lmk_mm_freed(mm);

looks racy because mmdrop(mm) can free this mm_struct. Yes,
simple_lmk_mm_freed() does not dereference this pointer, but the same memory
can be re-allocated as another ->mm for a new task which can be found by
find_victims(), and _in theory_ this all can happen in between, so the
"victims[i].mm == mm" check can be a false positive.

And this also means that simple_lmk_mm_freed() should clear victims[i].mm when
it detects "victims[i].mm == mm", otherwise we have the same theoretical race;
victims_to_kill is only cleared when the last victim goes away.

Another nit... you can drop tasklist_lock right after the 1st "find_victims"
loop.

And it seems that you do not really need to walk the "victims" array twice
after that; you can do everything in a single loop, but this is cosmetic.

Oleg.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
From: Sultan Alsawaf
Date: Tue, May 7, 2019 at 9:53 AM
To: Suren Baghdasaryan
Cc: Christian Brauner, Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel
Colascione, Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim
Murray, Michal Hocko, linux-mm, Arve Hjønnevåg, Ingo Molnar, Steven Rostedt,
Oleg Nesterov, Joel Fernandes, Andy Lutomirski, kernel-team

> On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> > Hi Sultan,
> > Looks like you are posting this patch for devices that do not use the
> > userspace LMKD solution due to them using older kernels or due to their
> > vendors sticking to an in-kernel solution. If so, I see a couple of
> > logistical issues with this patch. I don't see it being adopted in the
> > upstream 5.x kernel since it re-implements a deprecated mechanism, even
> > though vendors still use it. Vendors, on the other hand, will not adopt
> > it until you show evidence that it works way better than what the
> > lowmemorykiller driver does now. You would have to provide measurable
> > data and explain your tests before they would consider spending time on
> > this.
>
> Yes, this is mostly for the devices already produced that are forced to
> suffer with poor memory management. I can't even convince vendors to fix
> kernel memory leaks, so there's no way I'd be able to convince them to try
> this patch, especially since it seems like you're having trouble convincing
> vendors to stop using lowmemorykiller in the first place. And thankfully,
> convincing vendors isn't my job :)
>
> > On the implementation side, I'm not convinced at all that this would work
> > better on all devices and in all circumstances. We had cases where a new
> > mechanism would show very good results until one usecase completely broke
> > it. The bulk killing of processes that you are doing in your patch was a
> > very good example of such a decision, which later on we had to rethink.
> > That's why baking these policies into the kernel is very problematic.
> > Another problem I see with the implementation is that it ties process
> > killing to the reclaim scan depth. It's very similar to how vmpressure
> > works, and vmpressure in my experience is very unpredictable.
>
> Could you elaborate a bit on why bulk killing isn't good?

Yes. Several months ago we got reports that LMKD got very aggressive, killing
more processes in situations which did not require that many kills to recover
from memory pressure. After investigation we tracked it to the bulk killing,
which would kill too many processes during a memory usage spike. When killing
gradually, those kills would be avoided, so we had to get back to a more
balanced approach. The conceptual issue with bulk killing is that you can't
cancel it when the device recovers quicker than you expected.

> > > > I linked in the previous message, Google made a rather large set of
> > > > modifications to the supposedly-defunct lowmemorykiller.c not one
> > > > month ago. What's going on?
>
> > If you look into that commit, it adds the ability to report kill stats.
> > If that were a change in how the driver works, it would be rejected.
>
> Fair, though it was quite strange seeing something that was supposedly
> totally abandoned receiving a large chunk of code for reporting stats.
>
> Thanks,
> Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> From: Christian Brauner
> Date: Tue, May 7, 2019 at 3:58 AM
> To: Sultan Alsawaf
> Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione, Todd
> Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim Murray, Michal
> Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg, Ingo Molnar, Steven
> Rostedt, Oleg Nesterov, Joel Fernandes, Andy Lutomirski, kernel-team
>
> > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > > > Given that any "new" android device that gets shipped "soon" should
> > > > be using 4.9.y or newer, is this a real issue?
> > >
> > > It's certainly a real issue for those who can't buy brand new Android
> > > devices without software bugs every six months :)
>
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use the
> userspace LMKD solution due to them using older kernels or due to their
> vendors sticking to an in-kernel solution. If so, I see a couple of
> logistical issues with this patch. I don't see it being adopted in the
> upstream 5.x kernel since it re-implements a deprecated mechanism, even
> though vendors still use it. Vendors, on the other hand, will not adopt it
> until you show evidence that it works way better than what the
> lowmemorykiller driver does now. You would have to provide measurable data
> and explain your tests before they would consider spending time on this.
> On the implementation side, I'm not convinced at all that this would work
> better on all devices and in all circumstances. We had cases where a new
> mechanism would show very good results until one usecase completely broke
> it. The bulk killing of processes that you are doing in your patch was a
> very good example of such a decision, which later on we had to rethink.
> That's why baking these policies into the kernel is very problematic.
> Another problem I see with the implementation is that it ties process
> killing to the reclaim scan depth. It's very similar to how vmpressure
> works, and vmpressure in my experience is very unpredictable.

Yeah it does seem conceptually similar, good point.

> > > Regardless, even if PSI were backported, a full-fledged LMKD using it
> > > has yet to be made, so it wouldn't be of much use now.
> >
> > This is work that is ongoing and requires kernel changes to make it
> > feasible. One of the things that I have been working on for quite a while
> > is the whole file descriptor for processes thing that is important for
> > LMKD (even though I never thought about this use-case when I started
> > pitching this). Joel and Daniel have joined in and are working on making
> > LMKD possible.
> > What I find odd is that every couple of weeks different solutions to the
> > low memory problem are pitched. There is simple_lmk, there is LMKD, and
> > there was a patchset that wanted to speed up memory reclaim at process
> > kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> > That all seems - though related - rather uncoordinated.
>
> I'm not sure why pidfd_wait and expedited reclaim are seen as an
> uncoordinated effort. All of them are done to improve userspace LMKD.

Christian, pidfd_wait and expedited reclaim are both coordinated efforts and
solve different problems related to LMK. simple_lmk is an entirely different
effort that we already hesitated about when it was first posted; now we
hesitate again due to the issues Suren and others mentioned.

I think it is a better idea for Sultan to spend his time on using/improving
PSI/LMKD than spending it on simple_lmk. It could also be a good topic to
discuss in the Android track of the Linux Plumbers Conference.

thanks,

- Joel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 10:17:11AM -0700, Sultan Alsawaf wrote:
> On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote:
> > > It's even more odd that although a userspace solution is touted as the
> > > proper way to go on LKML, almost no Android OEMs are using it, and even
> > > in that commit I linked in the previous message, Google made a rather
> > > large set of modifications to the supposedly-defunct lowmemorykiller.c
> > > not one month ago. What's going on?
> >
> > "almost no"? Again, Android Go is doing that, right?
>
> I'd check for myself, but I can't seem to find kernel source for an Android
> Go device...
>
> This seems more confusing though. Why would the ultra-low-end devices use
> LMKD while other devices use the broken lowmemorykiller driver?

It's probably because the Android Go devices got a lot more "help" from people
at Google than did the other devices you are looking at. Also, despite the
older kernel version, they are probably running a newer version of Android
userspace, specially tuned just for lower-memory devices. So those 3.18.y
based Android Go devices are newer than the 4.4.y based "full Android" devices
on the market, and even some 4.9.y based devices. Yes, it is strange :)

> > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.
> >
> > Qualcomm should never be used as an example of a company that has any
> > idea of what to do in their kernel :)
>
> Agreed, but nearly all OEMs that use Qualcomm chipsets roll with Qualcomm's
> kernel decisions, so Qualcomm has a bit of influence here.

Yes, because almost no OEM wants to mess with their kernel; they just take
QCOM's kernel and run with it. But don't take that for some sort of "best
design practice" summary at all.

> > > If PSI were backported to 4.4, or even 3.18, would it really be used?
> >
> > Why wouldn't it, if it worked properly?
>
> For the same mysterious reason that Qualcomm and others cling to
> lowmemorykiller, I presume. This is part of what's been confusing me for
> quite some time...

QCOM's 4.4.y based kernel work was done 3-4 years ago, if not older. They
didn't know that this was not the "right way" to do things. The Google
developers have been working for the past few years to do it correctly, but
they can not go back in time to change old repos, sorry.

Now that I understand you just want to work on your local device, that makes
more sense. But I think you will have a better result trying to do a 4.4
backport of PSI combined with the userspace stuff than trying to worry about
your driver in 5.2 or newer. Or you can forward-port your kernel to 4.9, or
better yet, 4.14. That would probably be a much better thing to do overall, as
4.4 is really old now.

Good luck!

greg k-h
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 12:58:27PM +0200, Christian Brauner wrote: > This is work that is ongoing and requires kernel changes to make it > feasible. One of the things that I have been working on for quite a > while is the whole file descriptor for processes thing that is important > for LMKD (Even though I never thought about this use-case when I started > pitching this.). Joel and Daniel have joined in and are working on > making LMKD possible. > What I find odd is that every couple of weeks different solutions to the > low memory problem are pitched. There is simple_lkml, there is LMKD, and > there was a patchset that wanted to speed up memory reclaim at process > kill-time by adding a new flag to the new pidfd_send_signal() syscall. > That all seems - though related - rather uncoordinated. Now granted, > coordinated is usually not how kernel development necessarily works but > it would probably be good to have some sort of direction and from what I > have seen LMKD seems to be the most coordinated effort. But that might > just be my impression. LMKD is just Android's userspace low-memory-killer daemon. It's been around for a while. This patch (simple_lmk) is meant to serve as an immediate solution for the devices that'll never see a single line of PSI code running on them, which amounts to... well, most Android devices currently in existence. I'm more of a cowboy who made this patch after waiting a few years for memory management improvements on Android that never happened. Though it looks like it's going to happen soon(ish?) for super new devices that'll have the privilege of shipping with PSI in use. 
On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote: > > It's even more odd that although a userspace solution is touted as the > > proper > > way to go on LKML, almost no Android OEMs are using it, and even in that > > commit > > I linked in the previous message, Google made a rather large set of > > modifications to the supposedly-defunct lowmemorykiller.c not one month ago. > > What's going on? > > "almost no"? Again, Android Go is doing that, right? I'd check for myself, but I can't seem to find kernel source for an Android Go device... This seems more confusing though. Why would the ultra-low-end devices use LMKD while other devices use the broken lowmemorykiller driver? > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. > > Qualcomm should never be used as an example of a company that has any > idea of what to do in their kernel :) Agreed, but nearly all OEMs that use Qualcomm chipsets roll with Qualcomm's kernel decisions, so Qualcomm has a bit of influence here. > > If PSI were backported to 4.4, or even 3.18, would it really be used? > > Why wouldn't it, if it worked properly? For the same mysterious reason that Qualcomm and others cling to lowmemorykiller, I presume. This is part of what's been confusing me for quite some time... > Please see the work that went into PSI and the patches around it. > There's also a lwn.net article last week about the further work ongoing > in this area. With all of that, you should see how in-kernel memory > killers are NOT the way to go. > > > Regardless, even if PSI were backported, a full-fledged LMKD using it has > > yet to > > be made, so it wouldn't be of much use now. > > "LMKD"? Again, PSI is in the 4.9.y android-common tree, so the > userspace side should be in AOSP, right? LMKD as in Android's low-memory-killer daemon. It is in AOSP, but it looks like it's still a work in progress. 
Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote: > Hi Sultan, > Looks like you are posting this patch for devices that do not use > userspace LMKD solution due to them using older kernels or due to > their vendors sticking to in-kernel solution. If so, I see couple > logistical issues with this patch. I don't see it being adopted in > upstream kernel 5.x since it re-implements a deprecated mechanism even > though vendors still use it. Vendors on the other hand, will not adopt > it until you show evidence that it works way better than what > lowmemorykilled driver does now. You would have to provide measurable > data and explain your tests before they would consider spending time > on this. Yes, this is mostly for the devices already produced that are forced to suffer with poor memory management. I can't even convince vendors to fix kernel memory leaks, so there's no way I'd be able to convince them of trying this patch, especially since it seems like you're having trouble convincing vendors to stop using lowmemorykiller in the first place. And thankfully, convincing vendors isn't my job :) > On the implementation side I'm not convinced at all that this would > work better on all devices and in all circumstances. We had cases when > a new mechanism would show very good results until one usecase > completely broke it. Bulk killing of processes that you are doing in > your patch was a very good example of such a decision which later on > we had to rethink. That's why baking these policies into kernel is > very problematic. Another problem I see with the implementation that > it ties process killing with the reclaim scan depth. It's very similar > to how vmpressure works and vmpressure in my experience is very > unpredictable. Could you elaborate a bit on why bulk killing isn't good? > > > I linked in the previous message, Google made a rather large set of > > > modifications to the supposedly-defunct lowmemorykiller.c not one month > > > ago. 
> > > What's going on? > > If you look into that commit, it adds ability to report kill stats. If > that was a change in how that driver works it would be rejected. Fair, though it was quite strange seeing something that was supposedly totally abandoned receiving a large chunk of code for reporting stats. Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote: > From: Christian Brauner > Date: Tue, May 7, 2019 at 3:58 AM > To: Sultan Alsawaf > Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione, > Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim > Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg, > Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy > Lutomirski, kernel-team > > > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote: > > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > > > > Given that any "new" android device that gets shipped "soon" should be > > > > using 4.9.y or newer, is this a real issue? > > > > > > It's certainly a real issue for those who can't buy brand new Android > > > devices > > > without software bugs every six months :) > > > > > Hi Sultan, > Looks like you are posting this patch for devices that do not use > userspace LMKD solution due to them using older kernels or due to > their vendors sticking to in-kernel solution. If so, I see couple > logistical issues with this patch. I don't see it being adopted in > upstream kernel 5.x since it re-implements a deprecated mechanism even > though vendors still use it. Vendors on the other hand, will not adopt > it until you show evidence that it works way better than what > lowmemorykilled driver does now. You would have to provide measurable > data and explain your tests before they would consider spending time > on this. > On the implementation side I'm not convinced at all that this would > work better on all devices and in all circumstances. We had cases when > a new mechanism would show very good results until one usecase > completely broke it. Bulk killing of processes that you are doing in > your patch was a very good example of such a decision which later on > we had to rethink. That's why baking these policies into kernel is > very problematic. 
Another problem I see with the implementation that > it ties process killing with the reclaim scan depth. It's very similar > to how vmpressure works and vmpressure in my experience is very > unpredictable. > > > > > And if it is, I'm sure that asking for those patches to be backported to > > > > 4.4.y would be just fine, have you asked? > > > > > > > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > > > > use the in-kernel memory killer, but instead use the userspace solution > > > > today. So trying to get another in-kernel memory killer solution added > > > > anywhere seems quite odd. > > > > > > It's even more odd that although a userspace solution is touted as the > > > proper > > > way to go on LKML, almost no Android OEMs are using it, and even in that > > > commit > > > > That's probably because without proper kernel changes this is rather > > tricky to use safely (see below). > > > > > I linked in the previous message, Google made a rather large set of > > > modifications to the supposedly-defunct lowmemorykiller.c not one month > > > ago. > > > What's going on? > > If you look into that commit, it adds ability to report kill stats. If > that was a change in how that driver works it would be rejected. > > > > > > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI > > > were > > > backported to 4.4, or even 3.18, would it really be used? I don't really > > > understand the aversion to an in-kernel memory killer on LKML despite the > > > rest > > > of the industry's attraction to it. Perhaps there's some inherently great > > > cost > > > in using the userspace solution that I'm unaware of? > > Vendors are cautious about adopting userspace solution and it is a > process to address all concerns but we are getting there. > > > > Regardless, even if PSI were backported, a full-fledged LMKD using it has > > > yet to > > > be made, so it wouldn't be of much use now. 
> > > > This is work that is ongoing and requires kernel changes to make it > > feasible. One of the things that I have been working on for quite a > > while is the whole file descriptor for processes thing that is important > > for LMKD (Even though I never thought about this use-case when I started > > pitching this.). Joel and Daniel have joined in and are working on > > making LMKD possible. > > What I find odd is that every couple of weeks different solutions to the > > low memory problem are pitched. There is simple_lmk, there is LMKD, and > > there was a patchset that wanted to speed up memory reclaim at process > > kill-time by adding a new flag to the new pidfd_send_signal() syscall. > > That all seems - though related - rather uncoordinated. > > I'm not sure why pidfd_wait and expedited reclaim is seen as > uncoordinated effort. All of them are done to improve userspace LMKD. If so that wasn't very obvious and there was some disagreement there as well whether this would be the right solution. In any case, the point is that LMKD seems to
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote: > I am not going to comment the intent, but to be honest I am skeptical too. The general sentiment has been that this is a really bad idea, but I'm just a frustrated Android user who wants his phone to not require mountains of zRAM only to still manage memory poorly. Until I can go out and buy a non-Pixel phone that uses PSI to make these decisions (and does a good job of it), I'm going to stick to my hacky little driver for my personal devices. Many others who like to hack their Android devices to make them last longer will probably find value in this as well, since there are millions of people who use devices that'll never see any of PSI ported to their ancient 3.x kernels. And yes, I know this would never be accepted to upstream in a million years. I mostly wanted some code review and guidance, since mm code is pretty tricky :) > On 05/06, Sultan Alsawaf wrote: > > > > +static unsigned long find_victims(struct victim_info *varr, int *vindex, > > + int vmaxlen, int min_adj, int max_adj) > > +{ > > + unsigned long pages_found = 0; > > + int old_vindex = *vindex; > > + struct task_struct *tsk; > > + > > + for_each_process(tsk) { > > + struct task_struct *vtsk; > > + unsigned long tasksize; > > + short oom_score_adj; > > + > > + /* Make sure there's space left in the victim array */ > > + if (*vindex == vmaxlen) > > + break; > > + > > + /* Don't kill current, kthreads, init, or duplicates */ > > + if (same_thread_group(tsk, current) || > > + tsk->flags & PF_KTHREAD || > > + is_global_init(tsk) || > > + vtsk_is_duplicate(varr, *vindex, tsk)) > > + continue; > > + > > + vtsk = find_lock_task_mm(tsk); > > Did you test this patch with lockdep enabled? > > If I read the patch correctly, lockdep should complain. vtsk_is_duplicate() > ensures that we do not take the same ->alloc_lock twice or more, but lockdep > can't know this. Yeah, lockdep is fine with this, at least on 4.4.
> > +static void scan_and_kill(unsigned long pages_needed) > > +{ > > + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq); > > + struct victim_info victims[MAX_VICTIMS]; > > + int i, nr_to_kill = 0, nr_victims = 0; > > + unsigned long pages_found = 0; > > + atomic_t victim_count; > > + > > + /* > > + * Hold the tasklist lock so tasks don't disappear while scanning. This > > + * is preferred to holding an RCU read lock so that the list of tasks > > + * is guaranteed to be up to date. Keep preemption disabled until the > > + * SIGKILLs are sent so the victim kill process isn't interrupted. > > + */ > > + read_lock(&tasklist_lock); > > + preempt_disable(); > > read_lock() disables preemption, every task_lock() too, so this looks > unnecessary. Good point. > > + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) { > > + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS, > > + adj_prio[i], adj_prio[i - 1]); > > + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS) > > + break; > > + } > > + > > + /* > > + * Calculate the number of tasks that need to be killed and quickly > > + * release the references to those that'll live. > > + */ > > + for (i = 0, pages_found = 0; i < nr_victims; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + /* The victims' mm lock is taken in find_victims; release it */ > > + if (pages_found >= pages_needed) { > > + task_unlock(vtsk); > > + continue; > > + } > > + > > + /* > > + * Grab a reference to the victim so it doesn't disappear after > > + * the tasklist lock is released. > > + */ > > + get_task_struct(vtsk); > > The comment doesn't look correct. The victim can't disappear until > task_unlock() > below, it can't pass exit_mm(). I was always unsure about this and decided to hold a reference to the task_struct to be safe. Thanks for clearing that up.
> > + pages_found += victim->size; > > + nr_to_kill++; > > + } > > + read_unlock(&tasklist_lock); > > + > > + /* Kill the victims */ > > + victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill); > > + for (i = 0; i < nr_to_kill; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm, > > + vtsk->signal->oom_score_adj, > > + victim->size << (PAGE_SHIFT - 10)); > > + > > + /* Configure the victim's mm to notify us when it's freed */ > > + vtsk->mm->slmk_waitq = &victim_waitq; > > +
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
From: Christian Brauner Date: Tue, May 7, 2019 at 3:58 AM To: Sultan Alsawaf Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione, Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg, Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy Lutomirski, kernel-team > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote: > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > > > Given that any "new" android device that gets shipped "soon" should be > > > using 4.9.y or newer, is this a real issue? > > > > It's certainly a real issue for those who can't buy brand new Android > > devices > > without software bugs every six months :) > > Hi Sultan, Looks like you are posting this patch for devices that do not use userspace LMKD solution due to them using older kernels or due to their vendors sticking to in-kernel solution. If so, I see couple logistical issues with this patch. I don't see it being adopted in upstream kernel 5.x since it re-implements a deprecated mechanism even though vendors still use it. Vendors on the other hand, will not adopt it until you show evidence that it works way better than what lowmemorykilled driver does now. You would have to provide measurable data and explain your tests before they would consider spending time on this. On the implementation side I'm not convinced at all that this would work better on all devices and in all circumstances. We had cases when a new mechanism would show very good results until one usecase completely broke it. Bulk killing of processes that you are doing in your patch was a very good example of such a decision which later on we had to rethink. That's why baking these policies into kernel is very problematic. Another problem I see with the implementation that it ties process killing with the reclaim scan depth. 
It's very similar to how vmpressure works and vmpressure in my experience is very unpredictable. > > > And if it is, I'm sure that asking for those patches to be backported to > > > 4.4.y would be just fine, have you asked? > > > > > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > > > use the in-kernel memory killer, but instead use the userspace solution > > > today. So trying to get another in-kernel memory killer solution added > > > anywhere seems quite odd. > > > > It's even more odd that although a userspace solution is touted as the > > proper > > way to go on LKML, almost no Android OEMs are using it, and even in that > > commit > > That's probably because without proper kernel changes this is rather > tricky to use safely (see below). > > > I linked in the previous message, Google made a rather large set of > > modifications to the supposedly-defunct lowmemorykiller.c not one month ago. > > What's going on? If you look into that commit, it adds ability to report kill stats. If that was a change in how that driver works it would be rejected. > > > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were > > backported to 4.4, or even 3.18, would it really be used? I don't really > > understand the aversion to an in-kernel memory killer on LKML despite the > > rest > > of the industry's attraction to it. Perhaps there's some inherently great > > cost > > in using the userspace solution that I'm unaware of? Vendors are cautious about adopting userspace solution and it is a process to address all concerns but we are getting there. > > Regardless, even if PSI were backported, a full-fledged LMKD using it has > > yet to > > be made, so it wouldn't be of much use now. > > This is work that is ongoing and requires kernel changes to make it > feasible. 
One of the things that I have been working on for quite a > while is the whole file descriptor for processes thing that is important > for LMKD (Even though I never thought about this use-case when I started > pitching this.). Joel and Daniel have joined in and are working on > making LMKD possible. > What I find odd is that every couple of weeks different solutions to the > low memory problem are pitched. There is simple_lmk, there is LMKD, and > there was a patchset that wanted to speed up memory reclaim at process > kill-time by adding a new flag to the new pidfd_send_signal() syscall. > That all seems - though related - rather uncoordinated. I'm not sure why pidfd_wait and expedited reclaim is seen as uncoordinated effort. All of them are done to improve userspace LMKD. > Now granted, > coordinated is usually not how kernel development necessarily works but > it would probably be good to have some sort of direction and from what I > have seen LMKD seems to be the most coordinated effort. But that might > just be my impression. > > Christian Thanks, Suren.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
I am not going to comment the intent, but to be honest I am skeptical too. On 05/06, Sultan Alsawaf wrote: > > +static unsigned long find_victims(struct victim_info *varr, int *vindex, > + int vmaxlen, int min_adj, int max_adj) > +{ > + unsigned long pages_found = 0; > + int old_vindex = *vindex; > + struct task_struct *tsk; > + > + for_each_process(tsk) { > + struct task_struct *vtsk; > + unsigned long tasksize; > + short oom_score_adj; > + > + /* Make sure there's space left in the victim array */ > + if (*vindex == vmaxlen) > + break; > + > + /* Don't kill current, kthreads, init, or duplicates */ > + if (same_thread_group(tsk, current) || > + tsk->flags & PF_KTHREAD || > + is_global_init(tsk) || > + vtsk_is_duplicate(varr, *vindex, tsk)) > + continue; > + > + vtsk = find_lock_task_mm(tsk); Did you test this patch with lockdep enabled? If I read the patch correctly, lockdep should complain. vtsk_is_duplicate() ensures that we do not take the same ->alloc_lock twice or more, but lockdep can't know this. > +static void scan_and_kill(unsigned long pages_needed) > +{ > + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq); > + struct victim_info victims[MAX_VICTIMS]; > + int i, nr_to_kill = 0, nr_victims = 0; > + unsigned long pages_found = 0; > + atomic_t victim_count; > + > + /* > + * Hold the tasklist lock so tasks don't disappear while scanning. This > + * is preferred to holding an RCU read lock so that the list of tasks > + * is guaranteed to be up to date. Keep preemption disabled until the > + * SIGKILLs are sent so the victim kill process isn't interrupted. > + */ > + read_lock(&tasklist_lock); > + preempt_disable(); read_lock() disables preemption, every task_lock() too, so this looks unnecessary.
> + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) { > + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS, > + adj_prio[i], adj_prio[i - 1]); > + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS) > + break; > + } > + > + /* > + * Calculate the number of tasks that need to be killed and quickly > + * release the references to those that'll live. > + */ > + for (i = 0, pages_found = 0; i < nr_victims; i++) { > + struct victim_info *victim = &victims[i]; > + struct task_struct *vtsk = victim->tsk; > + > + /* The victims' mm lock is taken in find_victims; release it */ > + if (pages_found >= pages_needed) { > + task_unlock(vtsk); > + continue; > + } > + > + /* > + * Grab a reference to the victim so it doesn't disappear after > + * the tasklist lock is released. > + */ > + get_task_struct(vtsk); The comment doesn't look correct. The victim can't disappear until task_unlock() below, it can't pass exit_mm(). > + pages_found += victim->size; > + nr_to_kill++; > + } > + read_unlock(&tasklist_lock); > + > + /* Kill the victims */ > + victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill); > + for (i = 0; i < nr_to_kill; i++) { > + struct victim_info *victim = &victims[i]; > + struct task_struct *vtsk = victim->tsk; > + > + pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm, > + vtsk->signal->oom_score_adj, > + victim->size << (PAGE_SHIFT - 10)); > + > + /* Configure the victim's mm to notify us when it's freed */ > + vtsk->mm->slmk_waitq = &victim_waitq; > + vtsk->mm->slmk_counter = &victim_count; > + > + /* Accelerate the victim's death by forcing the kill signal */ > + do_send_sig_info(SIGKILL, SIG_INFO_TYPE, vtsk, true); this should be PIDTYPE_TGID > + > + /* Finally release the victim's mm lock */ > + task_unlock(vtsk); > + } > + preempt_enable_no_resched(); See above. And I don't understand how can _no_resched() really help...
> + > + /* Try to speed up the death process now that we can schedule again */ > + for (i = 0; i < nr_to_kill; i++) { > + struct task_struct *vtsk = victims[i].tsk; > + > + /* Increase the victim's priority to make it die faster */ > + set_user_nice(vtsk, MIN_NICE); > + > + /* Allow the victim to run on any CPU */ > + set_cpus_allowed_ptr(vtsk, cpu_all_mask); > + > + /* Finally release the victim reference acquired earlier */ > + put_task_struct(vtsk); > + } > + > + /* Wait
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon 06-05-19 19:16:22, Sultan Alsawaf wrote: > This is a complete low memory killer solution for Android that is small > and simple. Processes are killed according to the priorities that > Android gives them, so that the least important processes are always > killed first. Processes are killed until memory deficits are satisfied, > as observed from kswapd struggling to free up pages. Simple LMK stops > killing processes when kswapd finally goes back to sleep. > > The only tunables are the desired amount of memory to be freed per > reclaim event and desired frequency of reclaim events. Simple LMK tries > to free at least the desired amount of memory per reclaim and waits > until all of its victims' memory is freed before proceeding to kill more > processes. Why do we need something like that in the kernel? I really do not like an idea of having two OOM killer implementations in the kernel. As already pointed out newer kernels can do PSI and older kernels can live with an out of tree code to achieve what they need. I do not see why we really need this code in the upstream kernel. -- Michal Hocko SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote: > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > > Given that any "new" android device that gets shipped "soon" should be > > using 4.9.y or newer, is this a real issue? > > It's certainly a real issue for those who can't buy brand new Android devices > without software bugs every six months :) Heh. But, your "new code" isn't going to be going into any existing device, or any device that will come out this year. The soonest it would be would be next year, and by then, 4.9.y is fine. > > And if it is, I'm sure that asking for those patches to be backported to > > 4.4.y would be just fine, have you asked? > > > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > > use the in-kernel memory killer, but instead use the userspace solution > > today. So trying to get another in-kernel memory killer solution added > > anywhere seems quite odd. > > It's even more odd that although a userspace solution is touted as the proper > way to go on LKML, almost no Android OEMs are using it, and even in that > commit > I linked in the previous message, Google made a rather large set of > modifications to the supposedly-defunct lowmemorykiller.c not one month ago. > What's going on? "almost no"? Again, Android Go is doing that, right? And yes, there is still some 4.4 android-common work happening in this area, see this patch that just got merged: https://android-review.googlesource.com/c/kernel/common/+/953354 So, for 4.4.y based devices, that should be enough, right? > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. Qualcomm should never be used as an example of a company that has any idea of what to do in their kernel :) > If PSI were backported to 4.4, or even 3.18, would it really be used? Why wouldn't it, if it worked properly? 
> I don't really understand the aversion to an in-kernel memory killer > on LKML despite the rest of the industry's attraction to it. Perhaps > there's some inherently great cost in using the userspace solution > that I'm unaware of? Please see the work that went into PSI and the patches around it. There's also a lwn.net article last week about the further work ongoing in this area. With all of that, you should see how in-kernel memory killers are NOT the way to go. > Regardless, even if PSI were backported, a full-fledged LMKD using it has yet > to > be made, so it wouldn't be of much use now. "LMKD"? Again, PSI is in the 4.9.y android-common tree, so the userspace side should be in AOSP, right? thanks, greg k-h
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote: > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > > Given that any "new" android device that gets shipped "soon" should be > > using 4.9.y or newer, is this a real issue? > > It's certainly a real issue for those who can't buy brand new Android devices > without software bugs every six months :) > > > And if it is, I'm sure that asking for those patches to be backported to > > 4.4.y would be just fine, have you asked? > > > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > > use the in-kernel memory killer, but instead use the userspace solution > > today. So trying to get another in-kernel memory killer solution added > > anywhere seems quite odd. > > It's even more odd that although a userspace solution is touted as the proper > way to go on LKML, almost no Android OEMs are using it, and even in that > commit That's probably because without proper kernel changes this is rather tricky to use safely (see below). > I linked in the previous message, Google made a rather large set of > modifications to the supposedly-defunct lowmemorykiller.c not one month ago. > What's going on? > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were > backported to 4.4, or even 3.18, would it really be used? I don't really > understand the aversion to an in-kernel memory killer on LKML despite the rest > of the industry's attraction to it. Perhaps there's some inherently great cost > in using the userspace solution that I'm unaware of? > > Regardless, even if PSI were backported, a full-fledged LMKD using it has yet > to > be made, so it wouldn't be of much use now. This is work that is ongoing and requires kernel changes to make it feasible. 
One of the things that I have been working on for quite a while is the whole file descriptor for processes thing that is important for LMKD (Even though I never thought about this use-case when I started pitching this.). Joel and Daniel have joined in and are working on making LMKD possible. What I find odd is that every couple of weeks different solutions to the low memory problem are pitched. There is simple_lmk, there is LMKD, and there was a patchset that wanted to speed up memory reclaim at process kill-time by adding a new flag to the new pidfd_send_signal() syscall. That all seems - though related - rather uncoordinated. Now granted, coordinated is usually not how kernel development necessarily works but it would probably be good to have some sort of direction and from what I have seen LMKD seems to be the most coordinated effort. But that might just be my impression. Christian
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > Given that any "new" android device that gets shipped "soon" should be > using 4.9.y or newer, is this a real issue? It's certainly a real issue for those who can't buy brand new Android devices without software bugs every six months :) > And if it is, I'm sure that asking for those patches to be backported to > 4.4.y would be just fine, have you asked? > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > use the in-kernel memory killer, but instead use the userspace solution > today. So trying to get another in-kernel memory killer solution added > anywhere seems quite odd. It's even more odd that although a userspace solution is touted as the proper way to go on LKML, almost no Android OEMs are using it, and even in that commit I linked in the previous message, Google made a rather large set of modifications to the supposedly-defunct lowmemorykiller.c not one month ago. What's going on? Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were backported to 4.4, or even 3.18, would it really be used? I don't really understand the aversion to an in-kernel memory killer on LKML despite the rest of the industry's attraction to it. Perhaps there's some inherently great cost in using the userspace solution that I'm unaware of? Regardless, even if PSI were backported, a full-fledged LMKD using it has yet to be made, so it wouldn't be of much use now. Thanks, Sultan [1] https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/configs/sdm845_defconfig?h=LA.UM.7.3.r1-07400-sdm845.0#n492
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 12:27:21AM -0700, Sultan Alsawaf wrote: > On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote: > > Um, why can't "all" Android devices take the same patches that the Pixel > > phones are using today? They should all be in the public android-common > > kernel repositories that all Android devices should be syncing with on a > > weekly/monthly basis anyway, right? > > > > thanks, > > > > greg k-h > > Hi Greg, > > I only see PSI present in the android-common kernels for 4.9 and above. The > vast > majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to > expect OEMs to toil with backporting PSI themselves to get decent memory > management. Given that any "new" android device that gets shipped "soon" should be using 4.9.y or newer, is this a real issue? And if it is, I'm sure that asking for those patches to be backported to 4.4.y would be just fine, have you asked? Note that the Android Go devices I know of, running 3.18.y kernels, do NOT use the in-kernel memory killer, but instead use the userspace solution today. So trying to get another in-kernel memory killer solution added anywhere seems quite odd. thanks, greg k-h
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote: > Um, why can't "all" Android devices take the same patches that the Pixel > phones are using today? They should all be in the public android-common > kernel repositories that all Android devices should be syncing with on a > weekly/monthly basis anyway, right? > > thanks, > > greg k-h Hi Greg, I only see PSI present in the android-common kernels for 4.9 and above. The vast majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to expect OEMs to toil with backporting PSI themselves to get decent memory management. But even if they did backport PSI, it wouldn't help too much because a PSI-enabled LMKD solution is not ready yet. It looks like a PSI-based LMKD is still under heavy development and won't be ready for all Android devices for quite some time. Additionally, it looks like the supposedly-dead lowmemorykiller.c is still being actively tweaked by Google [1], which does not instill confidence that a definitive LMK solution a la PSI is coming any time soon. Thanks, Sultan [1] https://android.googlesource.com/kernel/common/+/152bacdd85c46f0c76b00c4acc253e414513634c
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, May 06, 2019 at 07:16:22PM -0700, Sultan Alsawaf wrote: > This is a complete low memory killer solution for Android that is small > and simple. Processes are killed according to the priorities that > Android gives them, so that the least important processes are always > killed first. Processes are killed until memory deficits are satisfied, > as observed from kswapd struggling to free up pages. Simple LMK stops > killing processes when kswapd finally goes back to sleep. > > The only tunables are the desired amount of memory to be freed per > reclaim event and desired frequency of reclaim events. Simple LMK tries > to free at least the desired amount of memory per reclaim and waits > until all of its victims' memory is freed before proceeding to kill more > processes. > > Signed-off-by: Sultan Alsawaf > --- > Hello everyone, > > I've addressed some of the concerns that were brought up with the first > version > of the Simple LMK patch. I understand that a kernel-based solution like this > that contains policy decisions for a specific userspace is not the way to go, > but the Android ecosystem still has a pressing need for a low memory killer > that > works well. > > Most Android devices still use the ancient and deprecated lowmemorykiller.c > kernel driver; Simple LMK seeks to replace that, at the very least until PSI > and > a userspace daemon utilizing PSI are ready for *all* Android devices, and not > just the privileged Pixel phone line. Um, why can't "all" Android devices take the same patches that the Pixel phones are using today? They should all be in the public android-common kernel repositories that all Android devices should be syncing with on a weekly/monthly basis anyway, right? thanks, greg k-h
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
This is a complete low memory killer solution for Android that is small and simple. Processes are killed according to the priorities that Android gives them, so that the least important processes are always killed first. Processes are killed until memory deficits are satisfied, as observed from kswapd struggling to free up pages. Simple LMK stops killing processes when kswapd finally goes back to sleep.

The only tunables are the desired amount of memory to be freed per reclaim event and desired frequency of reclaim events. Simple LMK tries to free at least the desired amount of memory per reclaim and waits until all of its victims' memory is freed before proceeding to kill more processes.

Signed-off-by: Sultan Alsawaf
---
Hello everyone,

I've addressed some of the concerns that were brought up with the first version of the Simple LMK patch. I understand that a kernel-based solution like this that contains policy decisions for a specific userspace is not the way to go, but the Android ecosystem still has a pressing need for a low memory killer that works well.

Most Android devices still use the ancient and deprecated lowmemorykiller.c kernel driver; Simple LMK seeks to replace that, at the very least until PSI and a userspace daemon utilizing PSI are ready for *all* Android devices, and not just the privileged Pixel phone line.

The feedback I received last time was quite helpful. This Simple LMK patch works significantly better than the first, improving memory management by a large margin while being immune to transient spikes in memory usage (since the signal to start killing processes is determined by how hard kswapd tries to free up pages, which is something that occurs over a span of time and not a single point in time).

I'd love to hear some feedback on this new patch. I do encourage those who are interested to take it for a spin on an Android device. This patch has been tested successfully on Android 4.4 and 4.9 kernels.
For the sake of review here, I have adapted the patch to 5.1.

Thanks,
Sultan

 drivers/android/Kconfig      |  33
 drivers/android/Makefile     |   1 +
 drivers/android/simple_lmk.c | 315 +++
 include/linux/mm_types.h     |   4 +
 include/linux/simple_lmk.h   |  11 ++
 kernel/fork.c                |  13 ++
 mm/vmscan.c                  |  12 ++
 7 files changed, 389 insertions(+)
 create mode 100644 drivers/android/simple_lmk.c
 create mode 100644 include/linux/simple_lmk.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 6fdf2abe4..bdd429338 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -54,6 +54,39 @@ config ANDROID_BINDER_IPC_SELFTEST
 	  exhaustively with combinations of various buffer sizes and
 	  alignments.
 
+config ANDROID_SIMPLE_LMK
+	bool "Simple Android Low Memory Killer"
+	depends on !ANDROID_LOW_MEMORY_KILLER && !MEMCG
+	---help---
+	  This is a complete low memory killer solution for Android that is
+	  small and simple. Processes are killed according to the priorities
+	  that Android gives them, so that the least important processes are
+	  always killed first. Processes are killed until memory deficits are
+	  satisfied, as observed from kswapd struggling to free up pages. Simple
+	  LMK stops killing processes when kswapd finally goes back to sleep.
+
+if ANDROID_SIMPLE_LMK
+
+config ANDROID_SIMPLE_LMK_AGGRESSION
+	int "Reclaim frequency selection"
+	range 1 3
+	default 1
+	help
+	  This value determines how frequently Simple LMK will perform memory
+	  reclaims. A lower value corresponds to less frequent reclaims, which
+	  maximizes memory usage. The range of values has a logarithmic
+	  correlation; 2 is twice as aggressive as 1, and 3 is twice as
+	  aggressive as 2, which makes 3 four times as aggressive as 1.
+
+config ANDROID_SIMPLE_LMK_MINFREE
+	int "Minimum MiB of memory to free per reclaim"
+	range 8 512
+	default 64
+	help
+	  Simple LMK will try to free at least this much memory per reclaim.
+
+endif
+
 endif # if ANDROID
 
 endmenu
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index c7856e320..7c91293b6 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -3,3 +3,4 @@ ccflags-y += -I$(src)			# needed for trace events
 obj-$(CONFIG_ANDROID_BINDERFS)		+= binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)	+= binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
+obj-$(CONFIG_ANDROID_SIMPLE_LMK)	+= simple_lmk.o
diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
new file mode 100644
index 0..a2ffb57b5
--- /dev/null
+++ b/drivers/android/simple_lmk.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Sultan Alsawaf .
+ */
+
+#define pr_fmt(fmt)
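The kill policy described in the cover letter above - pick the lowest-priority (highest adj) processes first, and keep selecting victims until at least the configured minimum amount of memory would be freed - can be sketched in plain userspace C. This is an illustrative model only, not code from the patch; `struct task`, `pick_victims()`, and the adj/RSS fields are hypothetical simplifications of what the kernel side does:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of Simple LMK's victim selection: each task has an
 * Android-style adj priority (higher = less important) and a resident
 * set size in MiB. Victims are chosen in descending adj order until at
 * least minfree_mib MiB would be freed. */
struct task {
	int adj;	/* Android oom priority; higher = kill first */
	int rss_mib;	/* resident memory in MiB */
};

static int cmp_adj_desc(const void *a, const void *b)
{
	return ((const struct task *)b)->adj - ((const struct task *)a)->adj;
}

/* Sorts tasks[] by descending adj in place and returns how many tasks
 * from the front must be killed to free at least minfree_mib MiB. */
static size_t pick_victims(struct task *tasks, size_t n, int minfree_mib)
{
	int freed = 0;
	size_t i;

	qsort(tasks, n, sizeof(*tasks), cmp_adj_desc);
	for (i = 0; i < n && freed < minfree_mib; i++)
		freed += tasks[i].rss_mib;
	return i;
}
```

With the default CONFIG_ANDROID_SIMPLE_LMK_MINFREE of 64 MiB, a 30 MiB process at adj 900 and a 50 MiB process at adj 600 would both be selected before any adj 0 (foreground) process is touched.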
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote: > On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > > wrote: > > > So I dislike the idea of allocating new inodes from the procfs super > > > block. I would like to avoid pinning the whole pidfd concept exclusively > > > to proc. The idea is that the pidfd API will be usable through procfs > > > via open("/proc/<pid>") because that is what users expect and really > > > wanted to have for a long time. So it makes sense to have this working. > > > But it should really be usable without it. That's why translate_pid() > > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > > though that's crazy - and still be able to use pidfds. This is also a > > > point akpm asked about when I did the pidfd_send_signal work. > > > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > > crazy idea that I was discussing with Joel the other day is to just > > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > > system call that returned, out of thin air and independent of the > > mount table, a procfs root directory file descriptor for the caller's > > PID namespace and suitable for use with openat(2). > > Even if this works I'm pretty sure that Al and a lot of others will not > be happy about this. A syscall to get an fd to /proc? That's not going > to happen and I don't see the need for a separate syscall just for that. > (I do see the point of making CONFIG_PROCFS=y the default btw.) I think his point here was that he wanted a handle to procfs no matter where it was mounted, and could then later use openat on that. Agreed that it may be unnecessary unless there is a use case for it, especially since the /proc directory being the de facto mountpoint for procfs is a universal convention.
> Inode allocation from the procfs mount for the file descriptors Joel > wants is not correct. They're not really procfs file descriptors so this > is a nack. We can't just hook into proc that way. I was not particular about using the procfs mount for the FDs, but that's the only way I knew how to do it until you pointed out anon_inode (my grep skills missed that), so thank you! > > C'mon: /proc is used by everyone today and almost every program breaks > if it's not around. The string "/proc" is already de facto kernel ABI. > Let's just drop the pretense of /proc being optional and bake it into > the kernel proper, then give programs a way to get to /proc that isn't > tied to any particular mount configuration. This way, we don't need a > translate_pid(), since callers can just use procfs to do the same > thing. (That is, if I understand correctly what translate_pid does.) > I'm not sure what you think translate_pid() is doing since you're not > saying what you think it does. > Examples from the old patchset: > translate_pid(pid, ns, -1) - get pid in our pid namespace > translate_pid(pid, -1, ns) - get pid in other pid namespace > translate_pid(1, ns, -1) - get pid of init task for namespace > translate_pid(pid, -1, ns) > 0 - is pid reachable from ns? > translate_pid(1, ns1, ns2) > 0 - is ns1 inside ns2? > translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2? > translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2? > > Allowing this syscall to yield pidfds as proper regular file fds and > also taking pidfds as argument is an excellent way to kill a few > problems at once: > - cheap pid namespace introspection > - creates a bridge between the "old" pid-based api and the "new" pidfd api This second point would indeed solve the problem of getting a new pidfd given a pid, without depending on /proc/ at all. So kudos for that, and I am glad you are making it return pidfds (but correct me if I misunderstood what you're planning to do with translate_pid).
It also obviates any need for dealing with procfs mount points. > - allows us to get proper non-directory file descriptors for any pids we > like Here I got a bit lost. AIUI a pidfd is a directory fd. Why would we want it to not be a directory fd? That would be ambiguous with what pidfd_send_signal expects. Also, would it be a bad idea to extend translate_pid to also do what we want for the pidfd_wait syscall? So translate_pid in this case would return an fd that is just used for the pid's death status. All of these extensions seem to mean translate_pid should probably take a fourth parameter that tells it the target translation type? The way I am hypothesizing translate_pid, it should probably support: - translation to a pid in some ns - translation of a pid to a pidfd - translation of a pid to a "wait" fd which returns the death/reap process status. If that makes sense, that would also avoid the need for the new syscall we are adding. > The additional advantage is that people are already happy to
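As a rough userspace illustration of the "handle to procfs" idea being debated above: a directory fd to the procfs root lets all pid-related lookups go through openat(2), independent of the mount table. The get_procfs_root() syscall itself was never merged, so in this sketch the conventional /proc mount point stands in for it; the helper names are hypothetical:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a handle to the procfs root. The proposed get_procfs_root()
 * would hand this back without consulting the mount table; here we
 * simply open the conventional /proc mount point. */
static int open_procfs_root(void)
{
	return open("/proc", O_PATH | O_DIRECTORY | O_CLOEXEC);
}

/* With such a handle, a per-process directory fd (the "pidfd" of this
 * era of the discussion) is just an openat(2) relative to it, with no
 * dependence on where procfs happens to be mounted. */
static int open_pid_dir(int procfs_root, pid_t pid)
{
	char name[32];

	snprintf(name, sizeof(name), "%d", (int)pid);
	return openat(procfs_root, name, O_DIRECTORY | O_CLOEXEC);
}
```

For example, `open_pid_dir(open_procfs_root(), getpid())` yields a directory fd for the caller's own /proc entry, suitable for passing to pidfd_send_signal() on kernels of that era.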
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner > wrote: > > So I dislike the idea of allocating new inodes from the procfs super > > block. I would like to avoid pinning the whole pidfd concept exclusively > > to proc. The idea is that the pidfd API will be usable through procfs > > via open("/proc/<pid>") because that is what users expect and really > > wanted to have for a long time. So it makes sense to have this working. > > But it should really be usable without it. That's why translate_pid() > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > though that's crazy - and still be able to use pidfds. This is also a > > point akpm asked about when I did the pidfd_send_signal work. > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > crazy idea that I was discussing with Joel the other day is to just > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > system call that returned, out of thin air and independent of the > mount table, a procfs root directory file descriptor for the caller's > PID namespace and suitable for use with openat(2). Even if this works I'm pretty sure that Al and a lot of others will not be happy about this. A syscall to get an fd to /proc? That's not going to happen and I don't see the need for a separate syscall just for that. (I do see the point of making CONFIG_PROCFS=y the default btw.) Inode allocation from the procfs mount for the file descriptors Joel wants is not correct. They're not really procfs file descriptors so this is a nack. We can't just hook into proc that way. > > C'mon: /proc is used by everyone today and almost every program breaks > if it's not around. The string "/proc" is already de facto kernel ABI.
> Let's just drop the pretense of /proc being optional and bake it into > the kernel proper, then give programs a way to get to /proc that isn't > tied to any particular mount configuration. This way, we don't need a > translate_pid(), since callers can just use procfs to do the same > thing. (That is, if I understand correctly what translate_pid does.) I'm not sure what you think translate_pid() is doing since you're not saying what you think it does. Examples from the old patchset: translate_pid(pid, ns, -1) - get pid in our pid namespace translate_pid(pid, -1, ns) - get pid in other pid namespace translate_pid(1, ns, -1) - get pid of init task for namespace translate_pid(pid, -1, ns) > 0 - is pid reachable from ns? translate_pid(1, ns1, ns2) > 0 - is ns1 inside ns2? translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2? translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2? Allowing this syscall to yield pidfds as proper regular file fds and also taking pidfds as argument is an excellent way to kill a few problems at once: - cheap pid namespace introspection - creates a bridge between the "old" pid-based api and the "new" pidfd api - allows us to get proper non-directory file descriptors for any pids we like The additional advantage is that people are already happy to add this syscall, so simply extending it and routing it through the pidfd tree or Eric's tree is reasonable. (It should probably grow a flag argument. I need to start prototyping this.) > > We still need a pidfd_clone() for atomicity reasons, but that's a > separate story. My goal is to be able to write a library that Yes, on my todo list, and I have a ported patch based on prior work rotting somewhere on my git server. > transparently creates and manages a helper child process even in a > "hostile" process environment in which some other uncoordinated thread > is constantly doing a waitpid(-1) (e.g., the JVM).
> > > So instead of going through proc we should probably do what David has > > been doing in the mount API and come to rely on anon_inode. So > > something like: > > > > fd = anon_inode_getfd("pidfd", _fops, file_priv_data, flags); > > > > and stash information such as pid namespace etc. in a pidfd struct or > > something that we can then stash in file->private_data of the new file. > > This also lets us avoid all this open coding done here. > > Another advantage is that anon_inodes is its own kernel-internal > > filesystem. > > Sure. That works too. Great.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 19, 2019 at 11:14:17PM +0100, Christian Brauner wrote:
[snip]
> > ---8<---
> >
> > From: Joel Fernandes
> > Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend
> >
> > Signed-off-by: Joel Fernandes
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/syscalls.h               |  1 +
> >  include/uapi/asm-generic/unistd.h      |  4 +-
> >  kernel/signal.c                        | 62 ++
> >  kernel/sys_ni.c                        |  3 ++
> >  6 files changed, 71 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 1f9607ed087c..2a63f1896b63 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -433,3 +433,4 @@
> >  425	i386	io_uring_setup	sys_io_uring_setup	__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter	sys_io_uring_enter	__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register	__ia32_sys_io_uring_register
> > +428	i386	pidfd_wait	sys_pidfd_wait	__ia32_sys_pidfd_wait
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 92ee0b4378d4..cf2e08a8053b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -349,6 +349,7 @@
> >  425	common	io_uring_setup	__x64_sys_io_uring_setup
> >  426	common	io_uring_enter	__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	pidfd_wait	__x64_sys_pidfd_wait
> >
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..62160970ed3f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
> >  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
> >                                        siginfo_t __user *info,
> >                                        unsigned int flags);
> > +asmlinkage long sys_pidfd_wait(int pidfd);
> >
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..137aa8662230 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_wait 428
> > +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
> >
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index b7953934aa99..ebb550b87044 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3550,6 +3550,68 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >
> > +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +	/*
> > +	 * This is just a test string, it will contain the actual
> > +	 * status of the pidfd in the future.
> > +	 */
> > +	char buf[] = "status";
> > +
> > +	return copy_to_iter(buf, strlen(buf)+1, to);
> > +}
> > +
> > +static const struct file_operations pidfd_wait_file_ops = {
> > +	.read_iter = pidfd_wait_read_iter,
> > +};
> > +
> > +static struct inode *pidfd_wait_get_inode(struct super_block *sb)
> > +{
> > +	struct inode *inode = new_inode(sb);
> > +
> > +	inode->i_ino = get_next_ino();
> > +	inode_init_owner(inode, NULL, S_IFREG);
> > +
> > +	inode->i_op	= _dir_inode_operations;
> > +	inode->i_fop	= _wait_file_ops;
> > +
> > +	return inode;
> > +}
> > +
> > +SYSCALL_DEFINE1(pidfd_wait, int, pidfd)
> > +{
> > +	struct fd f;
> > +	struct inode *inode;
> > +	struct file *file;
> > +	int new_fd;
> > +	struct pid_namespace *pid_ns;
> > +	struct super_block *sb;
> > +	struct vfsmount *mnt;
> > +
> > +	f = fdget_raw(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	sb = file_inode(f.file)->i_sb;
> > +	pid_ns = sb->s_fs_info;
> > +
> > +	inode = pidfd_wait_get_inode(sb);
> > +
> > +	mnt = pid_ns->proc_mnt;
> > +
> > +	file = alloc_file_pseudo(inode, mnt,
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 18, 2019 at 07:50:52PM -0400, Joel Fernandes wrote: > On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote: > > On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote: > > > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner > > > wrote: > > > > > > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner > > > > > > wrote: > > > > > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner > > > > > > > > > > wrote: > > > > > > > > > > [..] > > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) > > > > > > > > > > > > though? Why not just use > > > > > > > > > > > > standard poll/epoll interface on the proc fd like > > > > > > > > > > > > Daniel was suggesting. > > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is > > > > > > > > > > > > essentially pinned > > > > > > > > > > > > even though the proc number may be reused. Then the > > > > > > > > > > > > caller can just poll. > > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any > > > > > > > > > > > > waiters on process > > > > > > > > > > > > death (A quick look shows task_struct can be mapped to > > > > > > > > > > > > its struct pid) and > > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. > > > > > > > > > > > > No new syscall is > > > > > > > > > > > > needed then, let me know if I missed something? 
> > > > > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll > > > > > > > > > > > solution? > > > > > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here > > > > > > > > > > was Daniel's > > > > > > > > > > reasoning about avoiding a notification about process death > > > > > > > > > > through proc > > > > > > > > > > directory fd: > > > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after > > > > > > > > > > all. > > > > > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > > > > > I would strongly advocate for > > > > > > > > > non-blocking version or at least to have a non-blocking > > > > > > > > > option. > > > > > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > > > according to the caller's desire --- users can pass options > > > > > > > > they want > > > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit > > > > > > > > should > > > > > > > > work just like waiting for bytes to appear on a pipe. Process > > > > > > > > exit > > > > > > > > status is just another blob of bytes that a process might > > > > > > > > receive. A > > > > > > > > process exit handle ought to be just another information > > > > > > > > source. 
The > > > > > > > > reason the unix process API is so awful is that for whatever > > > > > > > > reason > > > > > > > > the original designers treated processes as some kind of > > > > > > > > special kind > > > > > > > > of resource instead of fitting them into the otherwise > > > > > > > > general-purpose > > > > > > > > unix data-handling API. Let's not repeat that mistake. > > > > > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > > > // register eventfd to receive death notification > > > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > > > // kill the process > > > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > > > // tend to other things > > > > > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not > > > > > > > > wire up > > > > > > > > an eventfd. > > > > > > > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > > > original proposal was based on being able to poll a file under > > > > > > /proc/pid and then you changed your mind to have a separate syscall > > > >
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote: > On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote: > > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner > > wrote: > > > > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner > > > > > wrote: > > > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner > > > > > > > > > wrote: > > > > > > > > > [..] > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? > > > > > > > > > > > Why not just use > > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel > > > > > > > > > > > was suggesting. > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is > > > > > > > > > > > essentially pinned > > > > > > > > > > > even though the proc number may be reused. Then the > > > > > > > > > > > caller can just poll. > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any > > > > > > > > > > > waiters on process > > > > > > > > > > > death (A quick look shows task_struct can be mapped to > > > > > > > > > > > its struct pid) and > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No > > > > > > > > > > > new syscall is > > > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll > > > > > > > > > > solution? > > > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. 
Here > > > > > > > > > was Daniel's > > > > > > > > > reasoning about avoiding a notification about process death > > > > > > > > > through proc > > > > > > > > > directory fd: > > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after > > > > > > > > > all. > > > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > > > I would strongly advocate for > > > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > > according to the caller's desire --- users can pass options they > > > > > > > want > > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit > > > > > > > should > > > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > > > status is just another blob of bytes that a process might > > > > > > > receive. A > > > > > > > process exit handle ought to be just another information source. > > > > > > > The > > > > > > > reason the unix process API is so awful is that for whatever > > > > > > > reason > > > > > > > the original designers treated processes as some kind of special > > > > > > > kind > > > > > > > of resource instead of fitting them into the otherwise > > > > > > > general-purpose > > > > > > > unix data-handling API. Let's not repeat that mistake. 
> > > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > > // register eventfd to receive death notification > > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > > // kill the process > > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > > // tend to other things > > > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire > > > > > > > up > > > > > > > an eventfd. > > > > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > > original proposal was based on being able to poll a file under > > > > > /proc/pid and then you changed your mind to have a separate syscall > > > > > which I assumed would be a blocking one to wait for process exit. > > > > > Maybe you can describe the new interface you are thinking about in > > > > > terms of userspace usage like I did above? Several lines of code would > > > > > explain more than paragraphs of text. > > > > > > > >
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun, Mar 17, 2019 at 10:11:10AM -0700, Daniel Colascione wrote: > On Sun, Mar 17, 2019 at 9:35 AM Serge E. Hallyn wrote: > > > > On Sun, Mar 17, 2019 at 12:42:40PM +0100, Christian Brauner wrote: > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote: > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote: > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner > > > > > wrote: > > > > > > > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote: > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner > > > > > > > > > wrote: > > > > > > > > > [..] > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? > > > > > > > > > > > Why not just use > > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel > > > > > > > > > > > was suggesting. > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is > > > > > > > > > > > essentially pinned > > > > > > > > > > > even though the proc number may be reused. Then the > > > > > > > > > > > caller can just poll. > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any > > > > > > > > > > > waiters on process > > > > > > > > > > > death (A quick look shows task_struct can be mapped to > > > > > > > > > > > its struct pid) and > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No > > > > > > > > > > > new syscall is > > > > > > > > > > > needed then, let me know if I missed something? > > > > > > > > > > > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll > > > > > > > > > > solution? > > > > > > > > > > > > > > > > > > Hmm, going through earlier threads, I believe so now. 
Here > > > > > > > > > was Daniel's > > > > > > > > > reasoning about avoiding a notification about process death > > > > > > > > > through proc > > > > > > > > > directory fd: > > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > > > > > > > > > > > > > > > May be a dedicated syscall for this would be cleaner after > > > > > > > > > all. > > > > > > > > > > > > > > > > Ah, I wish I've seen that discussion before... > > > > > > > > syscall makes sense and it can be non-blocking and we can use > > > > > > > > select/poll/epoll if we use eventfd. > > > > > > > > > > > > > > Thanks for taking a look. > > > > > > > > > > > > > > > I would strongly advocate for > > > > > > > > non-blocking version or at least to have a non-blocking option. > > > > > > > > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking > > > > > > > according to the caller's desire --- users can pass options they > > > > > > > want > > > > > > > to poll(2) or whatever. There's no need for any kind of special > > > > > > > configuration knob or non-blocking option. We already *have* a > > > > > > > non-blocking option that works universally for everything. > > > > > > > > > > > > > > As I mentioned in the linked thread, waiting for process exit > > > > > > > should > > > > > > > work just like waiting for bytes to appear on a pipe. Process exit > > > > > > > status is just another blob of bytes that a process might > > > > > > > receive. A > > > > > > > process exit handle ought to be just another information source. > > > > > > > The > > > > > > > reason the unix process API is so awful is that for whatever > > > > > > > reason > > > > > > > the original designers treated processes as some kind of special > > > > > > > kind > > > > > > > of resource instead of fitting them into the otherwise > > > > > > > general-purpose > > > > > > > unix data-handling API. Let's not repeat that mistake. 
> > > > > > > > > > > > > > > Something like this: > > > > > > > > > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > > > > > > > // register eventfd to receive death notification > > > > > > > > pidfd_wait(pid_to_kill, evfd); > > > > > > > > // kill the process > > > > > > > > pidfd_send_signal(pid_to_kill, ...) > > > > > > > > // tend to other things > > > > > > > > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire > > > > > > > up > > > > > > > an eventfd. > > > > > > > > > > > > > > > > > Ok, I probably misunderstood your post linked by Joel. I though your > > > > > original proposal was based on being able to poll a file under > > > > > /proc/pid and then you changed your mind to have a separate syscall > > > > > which I assumed would be a blocking one to wait for process exit. > > > > > Maybe you can describe the new interface you are thinking about in > > > > > terms of userspace usage like I did above? Several lines of code would > > > > > explain more than paragraphs of text. > > > > > > > > Hey, Thanks
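The "waitqueue in struct pid" idea quoted throughout this thread can be sketched in kernel-side pseudo-code roughly as follows. This is only an illustration of the mechanism being discussed, not code from any posted patch; every name below (the field, the hook, the poll handler) is invented for the sketch, and initialization/locking are omitted.

```c
/*
 * Pseudo-code sketch of the "add a waitqueue to struct pid" proposal:
 * hang a waitqueue off struct pid, wake it from the exit path, and let
 * the ->poll() handler of an fd that pins that struct pid wait on it.
 * All names are illustrative only.
 */
struct pid {
	/* ... existing fields ... */
	wait_queue_head_t wait_exit;	/* woken when the task dies */
};

/* exit path (would be called from somewhere in do_exit()) */
static void notify_pid_exit(struct task_struct *tsk)
{
	struct pid *pid = get_task_pid(tsk, PIDTYPE_PID);

	wake_up_all(&pid->wait_exit);
	put_pid(pid);
}

/* ->poll() of the fd that pins the struct pid */
static __poll_t pidfd_poll(struct file *file, poll_table *pts)
{
	struct pid *pid = file->private_data;

	poll_wait(file, &pid->wait_exit, pts);
	/* readable once no task is attached to this struct pid anymore */
	return pid_task(pid, PIDTYPE_PID) ? 0 : EPOLLIN | EPOLLHUP;
}
```

Because the fd pins the struct pid, a reused PID number cannot be confused with the original task, which is the race-freedom property the thread keeps coming back to.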
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner wrote:
> >
> > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes wrote:
> > > > >
> > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > > > [..]
> > > > > > > why do we want to add a new syscall (pidfd_wait), though? Why not
> > > > > > > just use the standard poll/epoll interface on the proc fd, as
> > > > > > > Daniel was suggesting? AFAIK, once the proc file is opened, the
> > > > > > > struct pid is essentially pinned, even though the proc number may
> > > > > > > be reused. Then the caller can just poll. We can add a waitqueue
> > > > > > > to struct pid and wake up any waiters on process death (a quick
> > > > > > > look shows task_struct can be mapped to its struct pid), and also
> > > > > > > possibly optimize it using Steve's TIF flag idea. No new syscall
> > > > > > > is needed then; let me know if I missed something?
> > > > > >
> > > > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > > > >
> > > > > Hmm, going through earlier threads, I believe so now. Here was
> > > > > Daniel's reasoning about avoiding a notification about process death
> > > > > through a proc directory fd:
> > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > >
> > > > > Maybe a dedicated syscall for this would be cleaner after all.
> > > >
> > > > Ah, I wish I'd seen that discussion before...
> > > > A syscall makes sense, it can be non-blocking, and we can use
> > > > select/poll/epoll if we use an eventfd.
> > >
> > > Thanks for taking a look.
> > >
> > > > I would strongly advocate for a non-blocking version, or at least
> > > > having a non-blocking option.
> > >
> > > Waiting for FD readiness is *already* blocking or non-blocking
> > > according to the caller's desire --- users can pass the options they
> > > want to poll(2) or whatever. There's no need for any kind of special
> > > configuration knob or non-blocking option. We already *have* a
> > > non-blocking option that works universally for everything.
> > >
> > > As I mentioned in the linked thread, waiting for process exit should
> > > work just like waiting for bytes to appear on a pipe. Process exit
> > > status is just another blob of bytes that a process might receive. A
> > > process exit handle ought to be just another information source. The
> > > reason the unix process API is so awful is that, for whatever reason,
> > > the original designers treated processes as some special kind of
> > > resource instead of fitting them into the otherwise general-purpose
> > > unix data-handling API. Let's not repeat that mistake.
> > >
> > > > Something like this:
> > > >
> > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > // register eventfd to receive death notification
> > > > pidfd_wait(pid_to_kill, evfd);
> > > > // kill the process
> > > > pidfd_send_signal(pid_to_kill, ...)
> > > > // tend to other things
> > >
> > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > an eventfd.
> >
> Ok, I probably misunderstood your post linked by Joel. I thought your
> original proposal was based on being able to poll a file under /proc/pid,
> and that you then changed your mind to have a separate syscall, which I
> assumed would be a blocking one to wait for process exit. Maybe you can
> describe the new interface you are thinking about in terms of userspace
> usage, as I did above? Several lines of code would explain more than
> paragraphs of text.

Hey, thanks Suren for the eventfd idea. I agree with Daniel on this. The
idea from Daniel here is to wait for process death and exit events by just
referring to a stable fd, independent of whatever is going on in /proc.

What is needed is something like this (in highly pseudo-code form):

pidfd = opendir("/proc/",..);
wait_fd = pidfd_wait(pidfd);
read or poll wait_fd (non-blocking or blocking, whichever)

wait_fd will block until the task has either died or been reaped. In both of
these cases, it can return a suitable string such as "dead" or "reaped",
although an integer with some predefined meaning is also OK.

What that guarantees is: even if the task's PID has been reused, or the task
has already died, or already died and been reaped, none of these events can
race with the code above, and the information passed to the user is race-free
and stable / guaranteed.

An eventfd seems not to fit well, because AFAICS passing the raw PID to
eventfd as in your example would still race, since the PID could have been
reused by another process by the time the eventfd is created. Also, Andy's
idea in [1] seems to use poll flags to communicate various things
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan wrote: > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes > wrote: > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: > > [..] > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just > > > > use > > > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > > > AFAIK, once the proc file is opened, the struct pid is essentially > > > > pinned > > > > even though the proc number may be reused. Then the caller can just > > > > poll. > > > > We can add a waitqueue to struct pid, and wake up any waiters on process > > > > death (A quick look shows task_struct can be mapped to its struct pid) > > > > and > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > > > needed then, let me know if I missed something? > > > > > > Huh, I thought that Daniel was against the poll/epoll solution? > > > > Hmm, going through earlier threads, I believe so now. Here was Daniel's > > reasoning about avoiding a notification about process death through proc > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html > > > > May be a dedicated syscall for this would be cleaner after all. > > Ah, I wish I've seen that discussion before... > syscall makes sense and it can be non-blocking and we can use > select/poll/epoll if we use eventfd. Thanks for taking a look. > I would strongly advocate for > non-blocking version or at least to have a non-blocking option. Waiting for FD readiness is *already* blocking or non-blocking according to the caller's desire --- users can pass options they want to poll(2) or whatever. There's no need for any kind of special configuration knob or non-blocking option. We already *have* a non-blocking option that works universally for everything. As I mentioned in the linked thread, waiting for process exit should work just like waiting for bytes to appear on a pipe. 
Process exit status is just another blob of bytes that a process might receive. A process exit handle ought to be just another information source. The reason the unix process API is so awful is that for whatever reason the original designers treated processes as some special kind of resource instead of fitting them into the otherwise general-purpose unix data-handling API. Let's not repeat that mistake. > Something like this: > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > // register eventfd to receive death notification > pidfd_wait(pid_to_kill, evfd); > // kill the process > pidfd_send_signal(pid_to_kill, ...) > // tend to other things Now you've lost me. pidfd_wait should return a *new* FD, not wire up an eventfd. Why? Because the new type of FD can report process exit *status* information (via read(2) after a readability signal) as well as the binary yes-or-no signal *that* a process exited, and this capability is useful if you want the pidfd interface to be a good general-purpose process management facility to replace the awful wait() family of functions. You can't get an exit status from an eventfd. Wiring up an eventfd the way you've proposed also complicates wait-causality information, complicating both tracing and any priority inheritance we might want in the future (because all the wakeups get mixed into the eventfd and you can't unscramble an egg). And for what? What do we gain by using an eventfd? Is the reason that exit.c would be able to use eventfd_signal instead of poking a waitqueue directly? How is that better? With an eventfd, you've increased path length on process exit *and* complicated the API for no reason. > ... > // wait for the process to die > poll_wait(evfd, ...); > > This simplifies userspace Not relative to an exit handle it doesn't. >, allows it to wait for multiple events using > epoll So does a process exit status handle.
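The exit-status-handle semantics argued for here can be illustrated with the interfaces Linux later grew: pidfd_open(2) (5.3) and waitid(P_PIDFD, ...) (5.4). The pidfd_wait call debated in this thread never landed under that name, so this is a sketch of the *semantics*, not of any API proposed here; the syscall-number fallback is an assumption for older headers, and the code requires a 5.4+ kernel:

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: x86-64/arm64 number, for old headers */
#endif
#ifndef P_PIDFD
#define P_PIDFD 3            /* assumption: idtype value, for old headers */
#endif

/* Returns the termination signal observed for a SIGKILLed child (9),
 * or -1 on error. */
int pidfd_exit_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pause();             /* child just waits to be killed */
        _exit(0);
    }

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    kill(pid, SIGKILL);

    /* The pidfd becomes readable when the process exits... */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    if (poll(&pfd, 1, -1) != 1)
        return -1;

    /* ...and the exit *status* can then be collected through it,
     * which an eventfd could never report. */
    siginfo_t info = { 0 };
    if (waitid(P_PIDFD, pidfd, &info, WEXITED) != 0)
        return -1;

    close(pidfd);
    return info.si_status;   /* SIGKILL for a killed child */
}
```

Note that the poll tells you *that* the process died and the subsequent waitid gives you *why* --- exactly the two pieces of information the eventfd design collapses into a bare counter.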
> and I think kernel implementation will be also quite simple > because it already implements eventfd_signal() that takes care of > waitqueue handling. What if there are multiple eventfds registered for the death of a process? In any case, you need some mechanism to find, upon process death, a list of waiters, then wake each of them up. That's either a global search or a search in some list rooted in a task-related structure (either struct task or one of its friends). Using an eventfd here adds nothing, since upon death, you need this list search regardless, and as I mentioned above, eventfd-wiring just makes the API worse. > If pidfd_send_signal could be extended to have an optional eventfd > parameter then we would not even have to add a new syscall. There is nothing wrong with adding a new system call. I don't know why there's this idea circulating that adding system calls is something we should bend over backwards to avoid. It's cheap, and support-wise, kernel interface is kernel interface. Sending a signal has *nothing* to do with wiring up some kind of notification and there's no
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote: [..] > > why do we want to add a new syscall (pidfd_wait) though? Why not just use > > standard poll/epoll interface on the proc fd like Daniel was suggesting. > > AFAIK, once the proc file is opened, the struct pid is essentially pinned > > even though the proc number may be reused. Then the caller can just poll. > > We can add a waitqueue to struct pid, and wake up any waiters on process > > death (A quick look shows task_struct can be mapped to its struct pid) and > > also possibly optimize it using Steve's TIF flag idea. No new syscall is > > needed then, let me know if I missed something? > > Huh, I thought that Daniel was against the poll/epoll solution? Hmm, going through earlier threads, I believe so now. Here was Daniel's reasoning about avoiding a notification about process death through proc directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html May be a dedicated syscall for this would be cleaner after all. > I have no clear opinion on what is better at the moment since I have > been mostly concerned with getting pidfd_send_signal() into shape and > was reluctant to put more ideas/work into this if it gets shutdown. > Once we have pidfd_send_signal() the wait discussion makes sense. Ok. It looks like that is almost in though (fingers crossed :)). thanks, - Joel ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Fri, Mar 15, 2019 at 02:13:24PM -0400, Joel Fernandes wrote: > On Fri, Mar 15, 2019 at 07:03:07PM +0100, Christian Brauner wrote: > > On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote: > > > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt > > > wrote: > > > > > > > > On Thu, 14 Mar 2019 13:49:11 -0700 > > > > Sultan Alsawaf wrote: > > > > > > > > > Perhaps I'm missing something, but if you want to know when a process > > > > > has died > > > > > after sending a SIGKILL to it, then why not just make the SIGKILL > > > > > optionally > > > > > block until the process has died completely? It'd be rather trivial > > > > > to just > > > > > store a pointer to an onstack completion inside the victim process' > > > > > task_struct, > > > > > and then complete it in free_task(). > > > > > > > > How would you implement such a method in userspace? kill() doesn't take > > > > any parameters but the pid of the process you want to send a signal to, > > > > and the signal to send. This would require a new system call, and be > > > > quite a bit of work. > > > > > > That's what the pidfd work is for. Please read the original threads > > > about the motivation and design of that facility. > > > > > > > If you can solve this with an ebpf program, I > > > > strongly suggest you do that instead. > > > > > > Regarding process death notification: I will absolutely not support > > > putting aBPF and perf trace events on the critical path of core system > > > memory management functionality. Tracing and monitoring facilities are > > > great for learning about the system, but they were never intended to > > > be load-bearing. The proposed eBPF process-monitoring approach is just > > > a variant of the netlink proposal we discussed previously on the pidfd > > > threads; it has all of its drawbacks. We really need a core system > > > call --- really, we've needed robust process management since the > > > creation of unix --- and I'm glad that we're finally getting it. 
> > > Adding new system calls is not expensive; going to great lengths to > > > avoid adding one is like calling a helicopter to avoid crossing the > > > street. I don't think we should present an abuse of the debugging and > > > performance monitoring infrastructure as an alternative to a robust > > > and desperately-needed bit of core functionality that's neither hard > > > to add nor complex to implement nor expensive to use. > > > > > > Regarding the proposal for a new kernel-side lmkd: when possible, the > > > kernel should provide mechanism, not policy. Putting the low memory > > > killer back into the kernel after we've spent significant effort > > > making it possible for userspace to do that job. Compared to kernel > > > code, more easily understood, more easily debuggable, more easily > > > updated, and much safer. If we *can* move something out of the kernel, > > > we should. This patch moves us in exactly the wrong direction. Yes, we > > > need *something* that sits synchronously astride the page allocation > > > path and does *something* to stop a busy beaver allocator that eats > > > all the available memory before lmkd, even mlocked and realtime, can > > > respond. The OOM killer is adequate for this very rare case. > > > > > > With respect to kill timing: Tim is right about the need for two > > > levels of policy: first, a high-level process prioritization and > > > memory-demand balancing scheme (which is what OOM score adjustment > > > code in ActivityManager amounts to); and second, a low-level > > > process-killing methodology that maximizes sustainable memory reclaim > > > and minimizes unwanted side effects while killing those processes that > > > should be dead. Both of these policies belong in userspace --- because > > > they *can* be in userspace --- and userspace needs only a few tools, > > > most of which already exist, to do a perfectly adequate job. > > > > > > We do want killed processes to die promptly. 
That's why I support > > > boosting a process's priority somehow when lmkd is about to kill it. > > > The precise way in which we do that --- involving not only actual > > > priority, but scheduler knobs, cgroup assignment, core affinity, and > > > so on --- is a complex topic best left to userspace. lmkd already has > > > all the knobs it needs to implement whatever priority boosting policy > > > it wants. > > > > > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > > > nobody beats me to it, after pidfd_send_signal lands --- you can > > > > Daniel, > > > > I've just been talking to Joel. > > I actually "expected" you to work pidfd_wait() after prior > > conversations we had on the pidfd_send_signal() patchsets. :) That's why > > I got a separate git tree on kernel.org since I expect a lot more work > > to come. I hope that Linus still decides to pull pidfd_send_signal() > > before Sunday (For the ones who have missed the link in a prior response > > of mine: > >
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Fri, Mar 15, 2019 at 07:03:07PM +0100, Christian Brauner wrote: > On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote: > > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt wrote: > > > > > > On Thu, 14 Mar 2019 13:49:11 -0700 > > > Sultan Alsawaf wrote: > > > > > > > Perhaps I'm missing something, but if you want to know when a process > > > > has died > > > > after sending a SIGKILL to it, then why not just make the SIGKILL > > > > optionally > > > > block until the process has died completely? It'd be rather trivial to > > > > just > > > > store a pointer to an onstack completion inside the victim process' > > > > task_struct, > > > > and then complete it in free_task(). > > > > > > How would you implement such a method in userspace? kill() doesn't take > > > any parameters but the pid of the process you want to send a signal to, > > > and the signal to send. This would require a new system call, and be > > > quite a bit of work. > > > > That's what the pidfd work is for. Please read the original threads > > about the motivation and design of that facility. > > > > > If you can solve this with an ebpf program, I > > > strongly suggest you do that instead. > > > > Regarding process death notification: I will absolutely not support > > putting aBPF and perf trace events on the critical path of core system > > memory management functionality. Tracing and monitoring facilities are > > great for learning about the system, but they were never intended to > > be load-bearing. The proposed eBPF process-monitoring approach is just > > a variant of the netlink proposal we discussed previously on the pidfd > > threads; it has all of its drawbacks. We really need a core system > > call --- really, we've needed robust process management since the > > creation of unix --- and I'm glad that we're finally getting it. 
> > Adding new system calls is not expensive; going to great lengths to > > avoid adding one is like calling a helicopter to avoid crossing the > > street. I don't think we should present an abuse of the debugging and > > performance monitoring infrastructure as an alternative to a robust > > and desperately-needed bit of core functionality that's neither hard > > to add nor complex to implement nor expensive to use. > > > > Regarding the proposal for a new kernel-side lmkd: when possible, the > > kernel should provide mechanism, not policy. Putting the low memory > > killer back into the kernel after we've spent significant effort > > making it possible for userspace to do that job. Compared to kernel > > code, more easily understood, more easily debuggable, more easily > > updated, and much safer. If we *can* move something out of the kernel, > > we should. This patch moves us in exactly the wrong direction. Yes, we > > need *something* that sits synchronously astride the page allocation > > path and does *something* to stop a busy beaver allocator that eats > > all the available memory before lmkd, even mlocked and realtime, can > > respond. The OOM killer is adequate for this very rare case. > > > > With respect to kill timing: Tim is right about the need for two > > levels of policy: first, a high-level process prioritization and > > memory-demand balancing scheme (which is what OOM score adjustment > > code in ActivityManager amounts to); and second, a low-level > > process-killing methodology that maximizes sustainable memory reclaim > > and minimizes unwanted side effects while killing those processes that > > should be dead. Both of these policies belong in userspace --- because > > they *can* be in userspace --- and userspace needs only a few tools, > > most of which already exist, to do a perfectly adequate job. > > > > We do want killed processes to die promptly. 
That's why I support > > boosting a process's priority somehow when lmkd is about to kill it. > > The precise way in which we do that --- involving not only actual > > priority, but scheduler knobs, cgroup assignment, core affinity, and > > so on --- is a complex topic best left to userspace. lmkd already has > > all the knobs it needs to implement whatever priority boosting policy > > it wants. > > > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > > nobody beats me to it, after pidfd_send_signal lands --- you can > > Daniel, > > I've just been talking to Joel. > I actually "expected" you to work pidfd_wait() after prior > conversations we had on the pidfd_send_signal() patchsets. :) That's why > I got a separate git tree on kernel.org since I expect a lot more work > to come. I hope that Linus still decides to pull pidfd_send_signal() > before Sunday (For the ones who have missed the link in a prior response > of mine: > https://lkml.org/lkml/2019/3/12/439 > > This is the first merge window I sent this PR. > > The pidfd tree has a branch for-next that is already tracked by Stephen > in linux-next since the 5.0 merge window. The patches for > pidfd_send_signal() sit in the pidfd
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote: > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt wrote: > > > > On Thu, 14 Mar 2019 13:49:11 -0700 > > Sultan Alsawaf wrote: > > > > > Perhaps I'm missing something, but if you want to know when a process has > > > died > > > after sending a SIGKILL to it, then why not just make the SIGKILL > > > optionally > > > block until the process has died completely? It'd be rather trivial to > > > just > > > store a pointer to an onstack completion inside the victim process' > > > task_struct, > > > and then complete it in free_task(). > > > > How would you implement such a method in userspace? kill() doesn't take > > any parameters but the pid of the process you want to send a signal to, > > and the signal to send. This would require a new system call, and be > > quite a bit of work. > > That's what the pidfd work is for. Please read the original threads > about the motivation and design of that facility. > > > If you can solve this with an ebpf program, I > > strongly suggest you do that instead. > > Regarding process death notification: I will absolutely not support > putting eBPF and perf trace events on the critical path of core system > memory management functionality. Tracing and monitoring facilities are > great for learning about the system, but they were never intended to > be load-bearing. The proposed eBPF process-monitoring approach is just > a variant of the netlink proposal we discussed previously on the pidfd > threads; it has all of its drawbacks. We really need a core system > call --- really, we've needed robust process management since the > creation of unix --- and I'm glad that we're finally getting it. > Adding new system calls is not expensive; going to great lengths to > avoid adding one is like calling a helicopter to avoid crossing the > street. 
I don't think we should present an abuse of the debugging and > performance monitoring infrastructure as an alternative to a robust > and desperately-needed bit of core functionality that's neither hard > to add nor complex to implement nor expensive to use. > > Regarding the proposal for a new kernel-side lmkd: when possible, the > kernel should provide mechanism, not policy. Putting the low memory > killer back into the kernel, after we've spent significant effort > making it possible for userspace to do that job, would be a step backward. Compared to kernel > code, userspace code is more easily understood, more easily debugged, more easily > updated, and much safer. If we *can* move something out of the kernel, > we should. This patch moves us in exactly the wrong direction. Yes, we > need *something* that sits synchronously astride the page allocation > path and does *something* to stop a busy beaver allocator that eats > all the available memory before lmkd, even mlocked and realtime, can > respond. The OOM killer is adequate for this very rare case. > > With respect to kill timing: Tim is right about the need for two > levels of policy: first, a high-level process prioritization and > memory-demand balancing scheme (which is what OOM score adjustment > code in ActivityManager amounts to); and second, a low-level > process-killing methodology that maximizes sustainable memory reclaim > and minimizes unwanted side effects while killing those processes that > should be dead. Both of these policies belong in userspace --- because > they *can* be in userspace --- and userspace needs only a few tools, > most of which already exist, to do a perfectly adequate job. > > We do want killed processes to die promptly. That's why I support > boosting a process's priority somehow when lmkd is about to kill it. > The precise way in which we do that --- involving not only actual > priority, but scheduler knobs, cgroup assignment, core affinity, and > so on --- is a complex topic best left to userspace. 
lmkd already has > all the knobs it needs to implement whatever priority boosting policy > it wants. > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > nobody beats me to it, after pidfd_send_signal lands --- you can Daniel, I've just been talking to Joel. I actually "expected" you to work pidfd_wait() after prior conversations we had on the pidfd_send_signal() patchsets. :) That's why I got a separate git tree on kernel.org since I expect a lot more work to come. I hope that Linus still decides to pull pidfd_send_signal() before Sunday (For the ones who have missed the link in a prior response of mine: https://lkml.org/lkml/2019/3/12/439 This is the first merge window I sent this PR. The pidfd tree has a branch for-next that is already tracked by Stephen in linux-next since the 5.0 merge window. The patches for pidfd_send_signal() sit in the pidfd branch. I'd be happy to share the tree with you and Joel (We can rename it if you prefer I don't care). I would really like to centralize this work so that we sort of have a "united front" and end up with a coherent api and can send PRs from a
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, 14 Mar 2019 21:36:43 -0700 Daniel Colascione wrote: > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt wrote: > > > > On Thu, 14 Mar 2019 13:49:11 -0700 > > Sultan Alsawaf wrote: > > > > > Perhaps I'm missing something, but if you want to know when a process has > > > died > > > after sending a SIGKILL to it, then why not just make the SIGKILL > > > optionally > > > block until the process has died completely? It'd be rather trivial to > > > just > > > store a pointer to an onstack completion inside the victim process' > > > task_struct, > > > and then complete it in free_task(). > > > > How would you implement such a method in userspace? kill() doesn't take > > any parameters but the pid of the process you want to send a signal to, > > and the signal to send. This would require a new system call, and be > > quite a bit of work. > > That's what the pidfd work is for. Please read the original threads > about the motivation and design of that facility. I wasn't Cc'd on the original work, so I haven't read them. > > > If you can solve this with an ebpf program, I > > strongly suggest you do that instead. > > We do want killed processes to die promptly. That's why I support > boosting a process's priority somehow when lmkd is about to kill it. > The precise way in which we do that --- involving not only actual > priority, but scheduler knobs, cgroup assignment, core affinity, and > so on --- is a complex topic best left to userspace. lmkd already has > all the knobs it needs to implement whatever priority boosting policy > it wants. > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > nobody beats me to it, after pidfd_send_signal lands --- you can > imagine a general-purpose priority inheritance mechanism expediting > process death when a high-priority process waits on a pidfd_wait > handle for a condemned process. 
You know you're on the right track > design-wise when you start seeing this kind of elegant constructive > interference between seemingly-unrelated features. What we don't need > is some kind of blocking SIGKILL alternative or backdoor event > delivery system. > > We definitely don't want to have to wait for a process's parent to > reap it. Instead, we want to wait for it to become a zombie. That's > why I designed my original exithand patch to fire death notification > upon transition to the zombie state, not upon process table removal, > and I expect pidfd_wait (or whatever we call it) to act the same way. > > In any case, there's a clear path forward here --- general-purpose, > cheap, and elegant --- and we should just focus on doing that instead > of more complex proposals with few advantages. If you add new pidfd system calls, then a new way to send a signal and block until the process dies, or whatever, is more acceptable than adding a new signal that changes the semantics of sending signals, which is what I was against. I do agree with Joel about bloating task_struct too. If anything, have a wait queue you add, where you can allocate a descriptor with the task dying and task killing, and just search this queue on dying. We could add a TIF flag to the task as well to let the exiting of this task know it should do such an operation. -- Steve
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote: [snip] > > If you can solve this with an ebpf program, I > > strongly suggest you do that instead. > > Regarding process death notification: I will absolutely not support > putting aBPF and perf trace events on the critical path of core system > memory management functionality. Tracing and monitoring facilities are > great for learning about the system, but they were never intended to > be load-bearing. The proposed eBPF process-monitoring approach is just > a variant of the netlink proposal we discussed previously on the pidfd > threads; it has all of its drawbacks. We really need a core system > call --- really, we've needed robust process management since the > creation of unix --- and I'm glad that we're finally getting it. > Adding new system calls is not expensive; going to great lengths to > avoid adding one is like calling a helicopter to avoid crossing the > street. I don't think we should present an abuse of the debugging and > performance monitoring infrastructure as an alternative to a robust > and desperately-needed bit of core functionality that's neither hard > to add nor complex to implement nor expensive to use. The eBPF-based solution to this would be just as simple while avoiding any kernel changes. I don't know why you think it is not load-bearing. However, I agree the proc/pidfd approach is better and can be simpler for some users who don't want to deal with eBPF - especially since something like this has many usecases. I was just suggesting the eBPF solution as a better alternative to the task_struct surgery idea from Sultan since that sounded to me quite hackish (that could just be my opinion). > Regarding the proposal for a new kernel-side lmkd: when possible, the > kernel should provide mechanism, not policy. Putting the low memory > killer back into the kernel after we've spent significant effort > making it possible for userspace to do that job. 
Compared to kernel > code, more easily understood, more easily debuggable, more easily > updated, and much safer. If we *can* move something out of the kernel, > we should. This patch moves us in exactly the wrong direction. Yes, we > need *something* that sits synchronously astride the page allocation > path and does *something* to stop a busy beaver allocator that eats > all the available memory before lmkd, even mlocked and realtime, can > respond. The OOM killer is adequate for this very rare case. > > With respect to kill timing: Tim is right about the need for two > levels of policy: first, a high-level process prioritization and > memory-demand balancing scheme (which is what OOM score adjustment > code in ActivityManager amounts to); and second, a low-level > process-killing methodology that maximizes sustainable memory reclaim > and minimizes unwanted side effects while killing those processes that > should be dead. Both of these policies belong in userspace --- because > they *can* be in userspace --- and userspace needs only a few tools, > most of which already exist, to do a perfectly adequate job. > > We do want killed processes to die promptly. That's why I support > boosting a process's priority somehow when lmkd is about to kill it. > The precise way in which we do that --- involving not only actual > priority, but scheduler knobs, cgroup assignment, core affinity, and > so on --- is a complex topic best left to userspace. lmkd already has > all the knobs it needs to implement whatever priority boosting policy > it wants. > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > nobody beats me to it, after pidfd_send_signal lands --- you can > imagine a general-purpose priority inheritance mechanism expediting > process death when a high-priority process waits on a pidfd_wait > handle for a condemned process. 
You know you're on the right track > design-wise when you start seeing this kind of elegant constructive > interference between seemingly-unrelated features. What we don't need > is some kind of blocking SIGKILL alternative or backdoor event > delivery system. > > We definitely don't want to have to wait for a process's parent to > reap it. Instead, we want to wait for it to become a zombie. That's > why I designed my original exithand patch to fire death notification > upon transition to the zombie state, not upon process table removal, > and I expect pidfd_wait (or whatever we call it) to act the same way. Agreed. Looking forward to the patches. :) thanks, - Joel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 11:16:41PM -0400, Steven Rostedt wrote: > How would you implement such a method in userspace? kill() doesn't take > any parameters but the pid of the process you want to send a signal to, > and the signal to send. This would require a new system call, and be > quite a bit of work. If you can solve this with an ebpf program, I > strongly suggest you do that instead. This can be done by introducing a new signal number that provides SIGKILL functionality while blocking (maybe SIGKILLBLOCK?). Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 10:54:48PM -0400, Joel Fernandes wrote: > I'm not sure if that makes much semantic sense for how the signal handling is > supposed to work. Imagine a parent sends SIGKILL to its child, and then does > a wait(2). Because the SIGKILL blocks in your idea, then the wait cannot > execute, and because the wait cannot execute, the zombie task will not get > reaped and so the SIGKILL sender never gets unblocked and the whole thing > just gets locked up. No? I don't know, it just feels incorrect. Block until the victim becomes a zombie instead. > Further, in your idea adding stuff to task_struct will simply bloat it - when > this task can easily be handled using eBPF without making any kernel changes. > Either by probing sched_process_free or sched_process_exit tracepoints. > Scheduler maintainers generally frown on adding stuff to task_struct > pointlessly; there's a good reason, since bloating it affects the performance > etc, and something like this would probably never be ifdef'd out behind a > CONFIG. Adding something to task_struct is just the easiest way to test things for experimentation. This can be avoided in my suggestion by passing the pointer to a completion via the relevant functions, and then completing it at the time the victim transitions to a zombie state. I understand it's possible to use eBPF for this, but it seems kind of messy since this functionality is something that I think others would want provided by the kernel (i.e., anyone using PSI to implement their own OOM killer daemon similar to LMKD). Thanks, Sultan
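For readers unfamiliar with the kernel's completion primitive Sultan keeps referring to, here is a hedged userspace analogue built from a mutex and condition variable (my illustration only --- the kernel's struct completion is a different implementation, and `fake_free_task` merely stands in for the victim's teardown path):

```c
#include <pthread.h>

/* Userspace analogue of the kernel's struct completion. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
};

/* Called by the "exit path": marks the completion done and wakes waiters. */
static void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Called by the "killer": blocks until the completion fires. */
static void wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Stand-in for the victim's teardown (free_task() in Sultan's sketch). */
static void *fake_free_task(void *arg)
{
    complete(arg);
    return NULL;
}

/* Returns 1 once the "victim" has completed, -1 on error. */
int completion_demo(void)
{
    struct completion c = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
    };
    pthread_t t;
    if (pthread_create(&t, NULL, fake_free_task, &c) != 0)
        return -1;
    wait_for_completion(&c);  /* a "blocking SIGKILL" would block here */
    pthread_join(t, NULL);
    return c.done;
}
```

Joel's deadlock objection maps directly onto this sketch: if the thread that must call `complete()` can only run after the waiter returns, `wait_for_completion()` never finishes.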
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, 14 Mar 2019 13:49:11 -0700 Sultan Alsawaf wrote: > Perhaps I'm missing something, but if you want to know when a process has died > after sending a SIGKILL to it, then why not just make the SIGKILL optionally > block until the process has died completely? It'd be rather trivial to just > store a pointer to an onstack completion inside the victim process' > task_struct, > and then complete it in free_task(). How would you implement such a method in userspace? kill() doesn't take any parameters but the pid of the process you want to send a signal to, and the signal to send. This would require a new system call, and be quite a bit of work. If you can solve this with an ebpf program, I strongly suggest you do that instead. -- Steve
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 01:49:11PM -0700, Sultan Alsawaf wrote: > On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote: > > About the 100ms latency, I wonder whether it is that high because of > > the way Android's lmkd is observing that a process has died. There is > > a gap between when a process memory is freed and when it disappears > > from the process-table. Once a process is SIGKILLed, it becomes a > > zombie. Its memory is freed instantly during the SIGKILL delivery (I > > traced this so that's how I know), but until it is reaped by its > > parent thread, it will still exist in /proc/<pid>. So if testing the > > existence of /proc/<pid> is how Android is observing that the process > > died, then there can be a large latency where it takes a very long > > time for the parent to actually reap the child way after its memory > > was long freed. A quicker way to know if a process's memory is freed > > before it is reaped could be to read back /proc/<pid>/maps of > > the victim in userspace, and that file will be empty for zombie > > processes. So then one does not need to wait for the parent to reap it. I > > wonder how much of that 100ms you mentioned is actually the "Waiting > > while Parent is reaping the child", than "memory freeing time". So > > yeah for this second problem, the procfds work will help. > > > > By the way another approach that can provide a quick and asynchronous > > notification of when the process memory is freed, is to monitor > > sched_process_exit trace event using eBPF. You can tell eBPF the PID > > that you want to monitor before the SIGKILL. As soon as the process > > dies and its memory is freed, the eBPF program can send a notification > > to user space (using the perf_events polling infra). The > > sched_process_exit fires just after the mmput() happens so it is quite > > close to when the memory is reclaimed. This also doesn't need any > > kernel changes. 
> > I could come up with a prototype for this and benchmark it on Android, if you want. Just let me know.
>
> Perhaps I'm missing something, but if you want to know when a process has died after sending a SIGKILL to it, then why not just make the SIGKILL optionally block until the process has died completely? It'd be rather trivial to just store a pointer to an onstack completion inside the victim process' task_struct, and then complete it in free_task().

I'm not sure that makes much semantic sense for how the signal handling is supposed to work. Imagine a parent sends SIGKILL to its child, and then does a wait(2). Because the SIGKILL blocks in your idea, the wait cannot execute, and because the wait cannot execute, the zombie task will not get reaped, so the SIGKILL sender never gets unblocked and the whole thing just gets locked up. No? I don't know, it just feels incorrect.

Further, in your idea, adding stuff to task_struct will simply bloat it, when this task can easily be handled using eBPF without making any kernel changes, either by probing the sched_process_free or sched_process_exit tracepoints. Scheduler maintainers generally frown upon adding stuff to task_struct pointlessly, and there's a good reason: bloating it affects performance, and something like this would probably never be ifdef'd out behind a CONFIG.

thanks,

- Joel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> About the 100ms latency, I wonder whether it is that high because of the way Android's lmkd is observing that a process has died. There is a gap between when a process's memory is freed and when it disappears from the process table. Once a process is SIGKILLed, it becomes a zombie. Its memory is freed instantly during the SIGKILL delivery (I traced this so that's how I know), but until it is reaped by its parent thread, it will still exist in /proc/<pid>. So if testing the existence of /proc/<pid> is how Android is observing that the process died, then there can be a large latency where it takes a very long time for the parent to actually reap the child, way after its memory was long freed. A quicker way to know if a process's memory is freed before it is reaped could be to read back /proc/<pid>/maps in userspace for the victim, and that file will be empty for zombie processes. So then one does not need to wait for the parent to reap it. I wonder how much of that 100ms you mentioned is actually the "waiting while the parent is reaping the child" rather than the "memory freeing time". So yeah for this second problem, the procfds work will help.
>
> By the way, another approach that can provide a quick and asynchronous notification of when the process memory is freed is to monitor the sched_process_exit trace event using eBPF. You can tell eBPF the PID that you want to monitor before the SIGKILL. As soon as the process dies and its memory is freed, the eBPF program can send a notification to user space (using the perf_events polling infra). The sched_process_exit event fires just after the mmput() happens, so it is quite close to when the memory is reclaimed. This also doesn't need any kernel changes. I could come up with a prototype for this and benchmark it on Android, if you want. Just let me know.
Perhaps I'm missing something, but if you want to know when a process has died after sending a SIGKILL to it, then why not just make the SIGKILL optionally block until the process has died completely? It'd be rather trivial to just store a pointer to an onstack completion inside the victim process' task_struct, and then complete it in free_task().

Thanks,
Sultan
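[Editorial aside: Sultan's idea can be sketched as kernel pseudocode. This is untested and for illustration only; the exit_done field and the kill_and_wait() helper are invented names, not existing kernel APIs:]

```c
/* Untested sketch of "SIGKILL that blocks until the victim's memory is
 * freed". exit_done and kill_and_wait() are invented for illustration. */

struct task_struct {
	/* ... existing fields ... */
	struct completion *exit_done;	/* set by a killer that wants to wait */
};

void free_task(struct task_struct *tsk)
{
	/* ... existing teardown ... */
	if (tsk->exit_done)
		complete(tsk->exit_done);	/* all memory is gone by now */
	free_task_struct(tsk);
}

static void kill_and_wait(struct task_struct *victim)
{
	DECLARE_COMPLETION_ONSTACK(done);

	victim->exit_done = &done;
	send_sig(SIGKILL, victim, 1);
	wait_for_completion(&done);	/* unblocks from free_task() */
}
```

Joel's follow-up in this thread points out the semantic problem with blocking the sender when the killer is also the reaping parent.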
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
Hi Tim, Thanks for the detailed and excellent write-up. It will serve as a good future reference for low memory killer requirements. I made some comments below on the "how to kill" part. On Tue, Mar 12, 2019 at 10:17 AM Tim Murray wrote: > > On Tue, Mar 12, 2019 at 9:37 AM Sultan Alsawaf wrote: > > > > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote: > > > The only way to control the OOM behavior pro-actively is to throttle > > > allocation speed. We have memcg high limit for that purpose. Along with > > > PSI, I can imagine a reasonably working user space early oom > > > notifications and reasonable acting upon that. > > > > The issue with pro-active memory management that prompted me to create this > > was > > poor memory utilization. All of the alternative means of reclaiming pages > > in the > > page allocator's slow path turn out to be very useful for maximizing memory > > utilization, which is something that we would have to forgo by relying on a > > purely pro-active solution. I have not had a chance to look at PSI yet, but > > unless a PSI-enabled solution allows allocations to reach the same point as > > when > > the OOM killer is invoked (which is contradictory to what it sets out to > > do), > > then it cannot take advantage of all of the alternative memory-reclaim means > > employed in the slowpath, and will result in killing a process before it is > > _really_ necessary. > > There are two essential parts of a lowmemorykiller implementation: > when to kill and how to kill. > > There are a million possible approaches to decide when to kill an > unimportant process. They usually trade off between the same two > failure modes depending on the workload. > > If you kill too aggressively, a transient spike that could be > imperceptibly absorbed by evicting some file pages or moving some > pages to ZRAM will result in killing processes, which then get started > up later and have a performance/battery cost. 
> > If you don't kill aggressively enough, you will encounter a workload > that thrashes the page cache, constantly evicting and reloading file > pages and moving things in and out of ZRAM, which makes the system > unusable when a process should have been killed instead. > > As far as I've seen, any methodology that uses single points in time > to decide when to kill without completely biasing toward one or the > other is susceptible to both. The minfree approach used by > lowmemorykiller/lmkd certainly is; it is both too aggressive for some > workloads and not aggressive enough for other workloads. My guess is > that simple LMK won't kill on transient spikes but will be extremely > susceptible to page cache thrashing. This is not an improvement; page > cache thrashing manifests as the entire system running very slowly. > > What you actually want from lowmemorykiller/lmkd on Android is to only > kill once it becomes clear that the system will continue to try to > reclaim memory to the extent that it could impact what the user > actually cares about. That means tracking how much time is spent in > reclaim/paging operations and the like, and that's exactly what PSI > does. lmkd has had support for using PSI as a replacement for > vmpressure for use as a wakeup trigger (to check current memory levels > against the minfree thresholds) since early February. It works fine; > unsurprisingly it's better than vmpressure at avoiding false wakeups. > > Longer term, there's a lot of work to be done in lmkd to turn PSI into > a kill trigger and remove minfree entirely. It's tricky (mainly > because of the "when to kill another process" problem discussed > later), but I believe it's feasible. > > How to kill is similarly messy. The latency of reclaiming memory post > SIGKILL can be severe (usually tens of milliseconds, occasionally > >100ms). 
> The latency we see on Android usually isn't because those threads are blocked in uninterruptible sleep, it's because times of memory pressure are also usually times of significant CPU contention, and these are overwhelmingly CFS threads, some of which may be assigned a very low priority. lmkd now sets priorities and resets cpusets upon killing a process, and we have seen improved reclaim latency because of this. The oom reaper might be a good approach to avoid this latency (I think some in-kernel lowmemorykiller implementations rely on it), but we can't use it from userspace. Something for future consideration.

This makes sense. If the process receiving the SIGKILL does not get CPU time, then the kernel may not be able to execute the unconditional signal-handling paths in the context of the victim process to free the memory. I don't see how the proc-fds approach will solve this, though. Say you have process L (which is lmkd) sending a SIGKILL to process V (the victim). Unless V executes the signal-handling code in kernel context and is scheduled at high enough priority to get CPU time, I don't think the
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 7:43 PM Tim Murray wrote:
>
> On Tue, Mar 12, 2019 at 10:45 AM Sultan Alsawaf wrote:
> >
> > On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> > > Knowing whether a SIGKILL'd process has finished reclaiming is as far as I know not possible without something like procfds. That's where the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both attempt to wait up to 100ms for reclaim to finish by checking for the continued existence of the thread that received the SIGKILL, but this really means that they wait up to 100ms for the _thread_ to finish, which doesn't tell you anything about the memory used by that process. If those threads terminate early and lowmemorykiller/lmkd get a signal to kill again, then there may be two processes competing for CPU time to reclaim memory. That doesn't reclaim any faster and may be an unnecessary kill.
> > > ...
> > > - offer a way to wait for process termination so lmkd can tell when reclaim has finished and know when killing another process is appropriate
> >
> > Should be pretty easy with something like this:
>
> Yeah, that's in the spirit of what I was suggesting, but there are a lot of edge cases around how to get that data out efficiently and PID reuse (it's a real issue--often the Android apps that are causing memory pressure are also constantly creating/destroying threads).
>
> I believe procfds or a similar mechanism will be a good solution to this.

Fwiw, I am working on this and have sent a PR for inclusion in 5.1: https://lore.kernel.org/lkml/20190312135245.27591-1-christ...@brauner.io/

There's also a tree to track this work.

Christian
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> Knowing whether a SIGKILL'd process has finished reclaiming is as far as I know not possible without something like procfds. That's where the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both attempt to wait up to 100ms for reclaim to finish by checking for the continued existence of the thread that received the SIGKILL, but this really means that they wait up to 100ms for the _thread_ to finish, which doesn't tell you anything about the memory used by that process. If those threads terminate early and lowmemorykiller/lmkd get a signal to kill again, then there may be two processes competing for CPU time to reclaim memory. That doesn't reclaim any faster and may be an unnecessary kill.
> ...
> - offer a way to wait for process termination so lmkd can tell when reclaim has finished and know when killing another process is appropriate

Should be pretty easy with something like this:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1..6ac478af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1199,6 +1199,7 @@ struct task_struct {
 	unsigned long lowest_stack;
 	unsigned long prev_lowest_stack;
 #endif
+	ktime_t sigkill_time;
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa2..0ae182777 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -435,6 +435,8 @@ void put_task_stack(struct task_struct *tsk)
 
 void free_task(struct task_struct *tsk)
 {
+	ktime_t sigkill_time = tsk->sigkill_time;
+	pid_t pid = tsk->pid;
 #ifndef CONFIG_THREAD_INFO_IN_TASK
 	/*
 	 * The task is finally done with both the stack and thread_info,
@@ -455,6 +457,9 @@ void free_task(struct task_struct *tsk)
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
 	free_task_struct(tsk);
+	if (sigkill_time)
+		printk("%d killed after %lld us\n", pid,
+		       ktime_us_delta(ktime_get(), sigkill_time));
 }
 EXPORT_SYMBOL(free_task);
@@ -1881,6 +1886,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
 #endif
+	p->sigkill_time = 0;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/kernel/signal.c b/kernel/signal.c
index 5d53183e2..1142c8811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1168,6 +1168,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
 	}
 out_set:
+	if (sig == SIGKILL)
+		t->sigkill_time = ktime_get();
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> I have not had a chance to look at PSI yet, but unless a PSI-enabled solution allows allocations to reach the same point as when the OOM killer is invoked (which is contradictory to what it sets out to do), then it cannot take advantage of all of the alternative memory-reclaim means employed in the slowpath, and will result in killing a process before it is _really_ necessary.

One more note. The above is true, but you can also hit one of the thrashing reclaim behaviors and reclaim the last few pages again and again with the whole system really sluggish. That is what PSI is trying to help with.

--
Michal Hocko
SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > The only way to control the OOM behavior pro-actively is to throttle allocation speed. We have memcg high limit for that purpose. Along with PSI, I can imagine a reasonably working user space early oom notification and reasonably acting upon that.
>
> The issue with pro-active memory management that prompted me to create this was poor memory utilization. All of the alternative means of reclaiming pages in the page allocator's slow path turn out to be very useful for maximizing memory utilization, which is something that we would have to forgo by relying on a purely pro-active solution. I have not had a chance to look at PSI yet, but unless a PSI-enabled solution allows allocations to reach the same point as when the OOM killer is invoked (which is contradictory to what it sets out to do), then it cannot take advantage of all of the alternative memory-reclaim means employed in the slowpath, and will result in killing a process before it is _really_ necessary.

If you really want to reach the real OOM situation, then you can very well rely on the in-kernel OOM killer. The only reason you want a customized OOM killer is the task classification, and that is a different story. User-space hints on victim selection have been a topic for quite a while. It never got to any conclusion, as interested parties always lost interest because it got hairy quickly.

> > If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.
>
> In theory we could instantly zap any process that is not trapped in the kernel at the time that the OOM killer is invoked without any consequences though, no?

No, this is not so simple. Have a look at the oom_reaper and the hoops it has to go through.

--
Michal Hocko
SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> The only way to control the OOM behavior pro-actively is to throttle allocation speed. We have memcg high limit for that purpose. Along with PSI, I can imagine a reasonably working user space early oom notification and reasonably acting upon that.

The issue with pro-active memory management that prompted me to create this was poor memory utilization. All of the alternative means of reclaiming pages in the page allocator's slow path turn out to be very useful for maximizing memory utilization, which is something that we would have to forgo by relying on a purely pro-active solution. I have not had a chance to look at PSI yet, but unless a PSI-enabled solution allows allocations to reach the same point as when the OOM killer is invoked (which is contradictory to what it sets out to do), then it cannot take advantage of all of the alternative memory-reclaim means employed in the slowpath, and will result in killing a process before it is _really_ necessary.

> If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.

In theory we could instantly zap any process that is not trapped in the kernel at the time that the OOM killer is invoked without any consequences though, no?

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue 12-03-19 16:33:15, Michal Hocko wrote:
> On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > > Yeah, killing speed is a well-known problem which we are considering in LMKD. For example the recent LMKD change to assign process being killed to a cpuset cgroup containing big cores cuts the kill time considerably. This is not ideal and we are thinking about better ways to expedite the cleanup process.
> > >
> > > If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.
> >
> > Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps. It just needs someone to do the work.
>
> They can and should as much as possible. No question about that. But not all of them can, and that is why nobody should be relying on that. That is the whole point of having the oom_reaper and async oom victim tear down.

Let me clarify a bit. LMK obviously doesn't need any guarantee like the core OOM killer, because it is more of a pro-active measure than a last resort. I merely wanted to say that relying on a design which assumes anything about the time a victim needs to exit is flawed, and it will fail under different workloads. On the other hand, this might work well enough on very specific workloads to be usable. I am not questioning that. The point is that this is not generic enough to be accepted into the upstream kernel.

--
Michal Hocko
SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > Yeah, killing speed is a well-known problem which we are considering in LMKD. For example the recent LMKD change to assign process being killed to a cpuset cgroup containing big cores cuts the kill time considerably. This is not ideal and we are thinking about better ways to expedite the cleanup process.
> >
> > If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.
>
> Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps. It just needs someone to do the work.

They can and should as much as possible. No question about that. But not all of them can, and that is why nobody should be relying on that. That is the whole point of having the oom_reaper and async oom victim tear down.

--
Michal Hocko
SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > Yeah, killing speed is a well-known problem which we are considering in LMKD. For example the recent LMKD change to assign process being killed to a cpuset cgroup containing big cores cuts the kill time considerably. This is not ideal and we are thinking about better ways to expedite the cleanup process.
>
> If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.

Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps. It just needs someone to do the work.
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> On Mon, Mar 11, 2019 at 1:46 PM Sultan Alsawaf wrote:
> >
> > On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > > The idea seems interesting although I need to think about this a bit more. Killing processes based on failed page allocation might backfire during transient spikes in memory usage.
> >
> > This issue could be alleviated if tasks could be killed and have their pages reaped faster. Currently, Linux takes a _very_ long time to free a task's memory after an initial privileged SIGKILL is sent to a task, even with the task's priority being set to the highest possible (so unwanted scheduler preemption starving dying tasks of CPU time is not the issue at play here). I've frequently measured the difference in time between when a SIGKILL is sent for a task and when free_task() is called for that task to be hundreds of milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD suffers from as well, and perhaps any OOM killer implementation in Linux, since you cannot evaluate the effect you've had on memory pressure by killing a process for at least several tens of milliseconds.
>
> Yeah, killing speed is a well-known problem which we are considering in LMKD. For example the recent LMKD change to assign process being killed to a cpuset cgroup containing big cores cuts the kill time considerably. This is not ideal and we are thinking about better ways to expedite the cleanup process.

If your design relies on the speed of killing, then it is fundamentally flawed AFAICT. You cannot assume anything about how quickly a task dies. It might be blocked in an uninterruptible sleep or performing an operation which takes some time. Sure, oom_reaper might help here but still.

The only way to control the OOM behavior pro-actively is to throttle allocation speed. We have memcg high limit for that purpose. Along with PSI, I can imagine a reasonably working user space early oom notification and reasonably acting upon that.

--
Michal Hocko
SUSE Labs
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 03:15:35PM -0700, Suren Baghdasaryan wrote:
> This is what LMKD currently is - a userspace RT process. My point was that this page allocation queue that you implemented can't be implemented in userspace, at least not without extensive communication with the kernel.

Oh, that's easy to address. My page allocation queue and the decision on when to kill a process are orthogonal. In fact, the page allocation queue could be touched up a bit to factor in the issues Michal mentioned, and it can be implemented as an improvement to the existing OOM killer. The point of it is just to ensure that page allocation requests that have gone OOM are given priority over other allocation requests when free pages start to trickle in. Userspace doesn't need to know about the page allocation queue, and the queue is not necessary to implement the method of determining when to kill processes that I've proposed. It's an optimization, not a necessity.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 05:11:25PM -0400, Joel Fernandes wrote:
> But the point is that a transient temporary memory spike should not be a signal to kill _any_ process. The reaction to kill shouldn't be so spontaneous that unwanted tasks are killed because the system went into panic mode. It should be averaged out, which I believe is what PSI does.

In my patch from the first email, I implemented the decision to kill a process at the same time that the existing kernel OOM killer decides to kill a process. If the kernel's OOM killer were susceptible to killing processes due to transient memory spikes, then I think there would have been several complaints about this behavior regardless of which userspace or architecture is in use. I think the existing OOM killer has this done right. The decision to kill a process occurs after the page allocator has tried _very_ hard to satisfy a page allocation via alternative means, such as utilizing compaction, flushing file-backed pages to disk via kswapd, and direct reclaim. Once all of those means have failed, it is quite reasonable to kill a process to free memory. Trying to wait out the memory starvation at this point would be futile.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 01:46:26PM -0700, Sultan Alsawaf wrote:
> On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > The idea seems interesting although I need to think about this a bit more. Killing processes based on failed page allocation might backfire during transient spikes in memory usage.
>
> This issue could be alleviated if tasks could be killed and have their pages reaped faster.

But the point is that a transient temporary memory spike should not be a signal to kill _any_ process. The reaction to kill shouldn't be so spontaneous that unwanted tasks are killed because the system went into panic mode. It should be averaged out, which I believe is what PSI does.

thanks,

- Joel
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> The idea seems interesting although I need to think about this a bit more. Killing processes based on failed page allocation might backfire during transient spikes in memory usage.

This issue could be alleviated if tasks could be killed and have their pages reaped faster. Currently, Linux takes a _very_ long time to free a task's memory after an initial privileged SIGKILL is sent to a task, even with the task's priority being set to the highest possible (so unwanted scheduler preemption starving dying tasks of CPU time is not the issue at play here). I've frequently measured the difference in time between when a SIGKILL is sent for a task and when free_task() is called for that task to be hundreds of milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD suffers from as well, and perhaps any OOM killer implementation in Linux, since you cannot evaluate the effect you've had on memory pressure by killing a process for at least several tens of milliseconds.

> AFAICT the biggest issue with using this approach in userspace is that it's not practically implementable without heavy in-kernel support. How to implement such interaction between kernel and userspace would be an interesting discussion which I would be happy to participate in.

You could signal a lightweight userspace process that has maximum scheduler priority and have it kill the tasks it'd like.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 06:43:20PM +0100, Michal Hocko wrote:
> I am sorry but we are not going to maintain two different OOM implementations in the kernel. From a quick look the implementation is quite a hack which is not really suitable for anything but a very specific usecase. E.g. reusing a freed page for a waiting allocation sounds like an interesting idea but it doesn't really work for many reasons. E.g. any NUMA affinity is broken, zone protection doesn't work either. Not to mention how the code hooks into the allocator hot paths. This is simply no no.
>
> Last but not least people have worked really hard to provide means (PSI) to do what you need in the userspace.

Hi Michal,

Thanks for the feedback. I had no doubt that this would be vehemently rejected on the mailing list, but I wanted feedback/opinions on it and thus sent it as an RFC. At best I thought perhaps the mechanisms I've employed might serve as inspiration for LMKD improvements in Android, since this hacky OOM killer I've devised does work quite well for the very specific usecase it is set out to address. The NUMA affinity and zone protection bits are helpful insights too. I'll take a look at PSI which Joel mentioned as well.

Thanks,
Sultan Alsawaf
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun 10-03-19 13:34:03, Sultan Alsawaf wrote: > From: Sultan Alsawaf > > This is a complete low memory killer solution for Android that is small > and simple. It kills the largest, least-important processes it can find > whenever a page allocation has completely failed (right after direct > reclaim). Processes are killed according to the priorities that Android > gives them, so that the least important processes are always killed > first. Killing larger processes is preferred in order to free the most > memory possible in one go. > > Simple LMK is integrated deeply into the page allocator in order to > catch exactly when a page allocation fails and exactly when a page is > freed. Failed page allocations that have invoked Simple LMK are placed > on a queue and wait for Simple LMK to satisfy them. When a page is about > to be freed, the failed page allocations are given priority over normal > page allocations by Simple LMK to see if they can immediately use the > freed page. > > Additionally, processes are continuously killed by failed small-order > page allocations until they are satisfied. I am sorry but we are not going to maintain two different OOM implementations in the kernel. From a quick look the implementation is quite a hack which is not really suitable for anything but a very specific usecase. E.g. reusing a freed page for a waiting allocation sounds like an interesting idea but it doesn't really work for many reasons. E.g. any NUMA affinity is broken, zone protection doesn't work either. Not to mention how the code hooks into the allocator hot paths. This is simply no no. Last but not least people have worked really hard to provide means (PSI) to do what you need in the userspace. 
> Signed-off-by: Sultan Alsawaf > --- > drivers/android/Kconfig | 28 > drivers/android/Makefile | 1 + > drivers/android/simple_lmk.c | 301 +++ > include/linux/sched.h | 3 + > include/linux/simple_lmk.h | 11 ++ > kernel/fork.c | 3 + > mm/page_alloc.c | 13 ++ > 7 files changed, 360 insertions(+) > create mode 100644 drivers/android/simple_lmk.c > create mode 100644 include/linux/simple_lmk.h > > diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig > index 6fdf2abe4..7469d049d 100644 > --- a/drivers/android/Kconfig > +++ b/drivers/android/Kconfig > @@ -54,6 +54,34 @@ config ANDROID_BINDER_IPC_SELFTEST > exhaustively with combinations of various buffer sizes and > alignments. > > +config ANDROID_SIMPLE_LMK > + bool "Simple Android Low Memory Killer" > + depends on !MEMCG > + ---help--- > + This is a complete low memory killer solution for Android that is > + small and simple. It is integrated deeply into the page allocator to > + know exactly when a page allocation hits OOM and exactly when a page > + is freed. Processes are killed according to the priorities that > + Android gives them, so that the least important processes are always > + killed first. > + > +if ANDROID_SIMPLE_LMK > + > +config ANDROID_SIMPLE_LMK_MINFREE > + int "Minimum MiB of memory to free per reclaim" > + default "64" > + help > + Simple LMK will free at least this many MiB of memory per reclaim. > + > +config ANDROID_SIMPLE_LMK_KILL_TIMEOUT > + int "Kill timeout in milliseconds" > + default "50" > + help > + Simple LMK will only perform memory reclaim at most once per this > + amount of time. 
> + > +endif # if ANDROID_SIMPLE_LMK > + > endif # if ANDROID > > endmenu > diff --git a/drivers/android/Makefile b/drivers/android/Makefile > index c7856e320..7c91293b6 100644 > --- a/drivers/android/Makefile > +++ b/drivers/android/Makefile > @@ -3,3 +3,4 @@ ccflags-y += -I$(src) # needed for trace > events > obj-$(CONFIG_ANDROID_BINDERFS) += binderfs.o > obj-$(CONFIG_ANDROID_BINDER_IPC) += binder.o binder_alloc.o > obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o > +obj-$(CONFIG_ANDROID_SIMPLE_LMK) += simple_lmk.o > diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c > new file mode 100644 > index 0..8a441650a > --- /dev/null > +++ b/drivers/android/simple_lmk.c > @@ -0,0 +1,301 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (C) 2019 Sultan Alsawaf . > + */ > + > +#define pr_fmt(fmt) "simple_lmk: " fmt > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define MIN_FREE_PAGES (CONFIG_ANDROID_SIMPLE_LMK_MINFREE * SZ_1M / > PAGE_SIZE) > + > +struct oom_alloc_req { > + struct page *page; > + struct completion done; > + struct list_head lh; > + unsigned int order; > + int migratetype; > +}; > + > +struct victim_info { > + struct task_struct *tsk; > + unsigned long size; > +}; > + > +enum { > + DISABLED, > + STARTING, > + READY, > + KILLING > +};
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 12:32:33PM -0400, Joel Fernandes wrote: > On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote: > [...] > > > > /* Perform scheduler related setup. Assign this task to a CPU. */ > > retval = sched_fork(clone_flags, p); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 3eb01dedf..fd0d697c6 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -67,6 +67,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page, > > } > > } > > > > +#ifdef CONFIG_ANDROID_SIMPLE_LMK > > + if (simple_lmk_page_in(page, order, migratetype)) > > + return; > > +#endif > > + > > list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); > > out: > > zone->free_area[order].nr_free++; > > @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int > > order, > > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) > > goto nopage; > > > > +#ifdef CONFIG_ANDROID_SIMPLE_LMK > > + page = simple_lmk_oom_alloc(order, ac->migratetype); > > + if (page) > > + prep_new_page(page, order, gfp_mask, alloc_flags); > > + goto got_pg; > > +#endif > > + > > Hacking generic MM code with an Android-specific callback is probably a major > issue with your patch. > > Also I CC'd -mm maintainers and lists since your patch > touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I > added them this time. I see you CC'd linux-mm on your initial patch, so I apologize. Ignore this part of my reply. Thanks. > Have you looked at the recent PSI work that Suren and Johannes have been > doing [1]? As I understand, userspace lmkd may be migrated to use that at > some > point. Suren can provide more details. I am sure AOSP contributions to make > LMKd better by using the PSI backend would be appreciated. Please consider > collaborating on that and help out, thanks. Check the cover-letter of that > patch [1] where LMKd is mentioned. 
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote: [...] > > /* Perform scheduler related setup. Assign this task to a CPU. */ > retval = sched_fork(clone_flags, p); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3eb01dedf..fd0d697c6 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -67,6 +67,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page, > } > } > > +#ifdef CONFIG_ANDROID_SIMPLE_LMK > + if (simple_lmk_page_in(page, order, migratetype)) > + return; > +#endif > + > list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); > out: > zone->free_area[order].nr_free++; > @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int > order, > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) > goto nopage; > > +#ifdef CONFIG_ANDROID_SIMPLE_LMK > + page = simple_lmk_oom_alloc(order, ac->migratetype); > + if (page) > + prep_new_page(page, order, gfp_mask, alloc_flags); > + goto got_pg; > +#endif > + Hacking generic MM code with an Android-specific callback is probably a major issue with your patch. Also I CC'd -mm maintainers and lists since your patch touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I added them this time. Have you looked at the recent PSI work that Suren and Johannes have been doing [1]? As I understand, userspace lmkd may be migrated to use that at some point. Suren can provide more details. I am sure AOSP contributions to make LMKd better by using the PSI backend would be appreciated. Please consider collaborating on that and help out, thanks. Check the cover-letter of that patch [1] where LMKd is mentioned. thanks, - Joel [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1951257.html
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun, Mar 10, 2019 at 10:03:35PM +0100, Greg Kroah-Hartman wrote: > On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote: > > From: Sultan Alsawaf > > > > This is a complete low memory killer solution for Android that is small > > and simple. It kills the largest, least-important processes it can find > > whenever a page allocation has completely failed (right after direct > > reclaim). Processes are killed according to the priorities that Android > > gives them, so that the least important processes are always killed > > first. Killing larger processes is preferred in order to free the most > > memory possible in one go. > > > > Simple LMK is integrated deeply into the page allocator in order to > > catch exactly when a page allocation fails and exactly when a page is > > freed. Failed page allocations that have invoked Simple LMK are placed > > on a queue and wait for Simple LMK to satisfy them. When a page is about > > to be freed, the failed page allocations are given priority over normal > > page allocations by Simple LMK to see if they can immediately use the > > freed page. > > > > Additionally, processes are continuously killed by failed small-order > > page allocations until they are satisfied. > > > > Signed-off-by: Sultan Alsawaf > > Wait, why? We just removed the in-kernel android memory killer, we > don't want to add another one back again, right? Android Go devices > work just fine with the userspace memory killer code, and those are "low > memory" by design. > > Why do we need kernel code here at all? > > thanks, > > greg k-h Hi Greg, Thanks for replying. It has not been my experience, or the experience of many others, that Android's userspace low memory killer works "just fine." On my Pixel 3 XL with a meager 4GB of memory, the userspace killer has had issues with killing too many processes, which has resulted in a noticeably poor user experience for all Pixel owners. 
From the looks of lmkd on the master branch, there still isn't really any definitive solution for this, aside from a 100ms delay in between process kills. I think that the userspace low memory killer is more complex than necessary, especially since in the kernel we can detect exactly when we run out of memory and react far more quickly than any userspace daemon. The original reasoning behind why the old kernel low memory killer was removed is also a bit vague to me. It just seemed to be abandonware, and all of a sudden a userspace daemon was touted as the solution. This driver is like an Android-flavored version of the kernel's oom killer, and has proven quite effective for me on my Pixel. Processes are killed exactly when a page allocation fails, so memory use is maximized. There is no complexity to try and estimate how full memory is either. Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote: > From: Sultan Alsawaf > > This is a complete low memory killer solution for Android that is small > and simple. It kills the largest, least-important processes it can find > whenever a page allocation has completely failed (right after direct > reclaim). Processes are killed according to the priorities that Android > gives them, so that the least important processes are always killed > first. Killing larger processes is preferred in order to free the most > memory possible in one go. > > Simple LMK is integrated deeply into the page allocator in order to > catch exactly when a page allocation fails and exactly when a page is > freed. Failed page allocations that have invoked Simple LMK are placed > on a queue and wait for Simple LMK to satisfy them. When a page is about > to be freed, the failed page allocations are given priority over normal > page allocations by Simple LMK to see if they can immediately use the > freed page. > > Additionally, processes are continuously killed by failed small-order > page allocations until they are satisfied. > > Signed-off-by: Sultan Alsawaf Wait, why? We just removed the in-kernel android memory killer, we don't want to add another one back again, right? Android Go devices work just fine with the userspace memory killer code, and those are "low memory" by design. Why do we need kernel code here at all? thanks, greg k-h