Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-16 Thread Oleg Nesterov
On 05/15, Sultan Alsawaf wrote:
>
> On Wed, May 15, 2019 at 04:58:32PM +0200, Oleg Nesterov wrote:
> > Could you explain in detail what exactly you did and what you see in dmesg?
> >
> > Just in case, lockdep complains only once: print_circular_bug() does
> > debug_locks_off(), so if it has already reported another false positive,
> > __lock_acquire() will simply return after that.
> >
> > Oleg.
>
> This is what I did:
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 774ab79d3ec7..009e7d431a88 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3078,6 +3078,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> int class_idx;
> u64 chain_key;
>
> +   BUG_ON(!debug_locks || !prove_locking);
> if (unlikely(!debug_locks))
> return 0;
>
> diff --git a/lib/debug_locks.c b/lib/debug_locks.c
> index 124fdf238b3d..4003a18420fb 100644
> --- a/lib/debug_locks.c
> +++ b/lib/debug_locks.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(debug_locks_silent);
>   */
>  int debug_locks_off(void)
>  {
> +   return 0;
> if (debug_locks && __debug_locks_off()) {
> if (!debug_locks_silent) {
> console_verbose();

OK, this means that debug_locks_off() always returns 0, as if debug_locks was
already cleared.

Thus print_deadlock_bug() will do nothing; it does

	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
		return 0;

IOW, this means that even if lockdep finds a problem, the problem won't be
reported.
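
For reference, this is roughly the chain involved (paraphrased from memory of
the 4.x lockdep sources, not a verbatim copy; details may differ by version):

	/* lib/debug_locks.c + kernel/locking/lockdep.c, paraphrased.
	 * With debug_locks_off() stubbed to "return 0;", this helper
	 * also returns 0, so every report function bails out before
	 * printing anything. */
	static inline int debug_locks_off_graph_unlock(void)
	{
		int ret = debug_locks_off();	/* stubbed: always 0 */

		arch_spin_unlock(&lockdep_lock);

		return ret;
	}

	static int print_deadlock_bug(struct task_struct *curr,
				      struct held_lock *prev,
				      struct held_lock *next)
	{
		if (!debug_locks_off_graph_unlock() || debug_locks_silent)
			return 0;	/* <-- always taken with the stub */

		/* ... the actual report would be printed here ... */
		return 0;
	}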

> [1.492128] BUG: key  not in .data!
> [1.492141] BUG: key  not in .data!
> [1.492152] BUG: key  not in .data!
> [1.492228] BUG: key  not in .data!
> [1.492238] BUG: key  not in .data!
> [1.492248] BUG: key  not in .data!

I guess this is lockdep_init_map(), which does the printk("BUG:") itself, but
due to your change above it doesn't do WARN(1) and thus there is no call trace.

Oleg.



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-15 Thread Steven Rostedt
On Wed, 15 May 2019 11:52:57 -0700
Sultan Alsawaf  wrote:

> On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> > I'm confused why you did this?  
> 
> Oleg said that debug_locks_off() could've been called and thus prevented
> lockdep complaints about simple_lmk from appearing. To eliminate any
> possibility of that, I disabled debug_locks_off().

But I believe that when lockdep discovers an issue, the data from then
on is not reliable, which is why we turn it off. Just commenting
out the disabling makes lockdep unreliable, and is not a proper way to
test your code.

Yes, it can then miss locking issues after one was discovered. Thus,
you are not properly testing the locking in your code.

-- Steve


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-15 Thread Sultan Alsawaf
On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> I'm confused why you did this?

Oleg said that debug_locks_off() could've been called and thus prevented
lockdep complaints about simple_lmk from appearing. To eliminate any possibility
of that, I disabled debug_locks_off().

Oleg also said that __lock_acquire() could return early if lock debugging were
somehow turned off after lockdep reported one bug. To mitigate any possibility
of that as well, I threw in the BUG_ON() for good measure.

I think at this point it's pretty clear that lockdep truly isn't complaining
about simple_lmk's locking pattern, and that lockdep's lack of complaints isn't
due to it being mysteriously turned off...

Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-15 Thread Steven Rostedt
On Wed, 15 May 2019 10:27:28 -0700
Sultan Alsawaf  wrote:

> On Wed, May 15, 2019 at 04:58:32PM +0200, Oleg Nesterov wrote:
> > Could you explain in detail what exactly you did and what you see in dmesg?
> > 
> > Just in case, lockdep complains only once: print_circular_bug() does
> > debug_locks_off(), so if it has already reported another false positive,
> > __lock_acquire() will simply return after that.
> > 
> > Oleg.  
> 
> This is what I did:
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 774ab79d3ec7..009e7d431a88 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3078,6 +3078,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> int class_idx;
> u64 chain_key;
> 
> +   BUG_ON(!debug_locks || !prove_locking);
> if (unlikely(!debug_locks))
> return 0;
> 
> diff --git a/lib/debug_locks.c b/lib/debug_locks.c
> index 124fdf238b3d..4003a18420fb 100644
> --- a/lib/debug_locks.c
> +++ b/lib/debug_locks.c
> @@ -37,6 +37,7 @@ EXPORT_SYMBOL_GPL(debug_locks_silent);
>   */
>  int debug_locks_off(void)
>  {
> +   return 0;

I'm confused why you did this?

-- Steve

> if (debug_locks && __debug_locks_off()) {
> if (!debug_locks_silent) {
> console_verbose();
> 
>


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-15 Thread Oleg Nesterov
On 05/13, Sultan Alsawaf wrote:
>
> On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> > I am starting to think I am ;)
> >
> > If you have task1 != task2 this code
> >
> > task_lock(task1);
> > task_lock(task2);
> >
> > should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock
> > are the "same" lock from lockdep pov, held_lock's will have the same
> > hlock_class().
>
> Okay, I've stubbed out debug_locks_off(), and lockdep is now complaining about
> a bunch of false positives so it is _really_ enabled this time.

Could you explain in detail what exactly you did and what you see in dmesg?

Just in case, lockdep complains only once: print_circular_bug() does
debug_locks_off(), so if it has already reported another false positive,
__lock_acquire() will simply return after that.

Oleg.



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-14 Thread Sultan Alsawaf
On Tue, May 14, 2019 at 12:44:53PM -0400, Steven Rostedt wrote:
> OK, this has gotten my attention.
> 
> This thread is quite long, do you have a git repo I can look at, and
> also where is the first task_lock() taken before the
> find_lock_task_mm()?
> 
> -- Steve

Hi Steve,

This is the git repo I work on: 
https://github.com/kerneltoast/android_kernel_google_wahoo

With the newest simple_lmk iteration being this commit: 
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/6b145b8c28b39f7047393169117f72ea7387d91c

This repo is based off the 4.4 kernel that Google ships on the Pixel 2/2XL.

simple_lmk iterates through the entire task list more than once and locks
potential victims using find_lock_task_mm(). It keeps these potential victims
locked across the multiple times that the task list is iterated.

The locking pattern that Oleg said should cause lockdep to complain is that
iterating through the entire task list more than once can lead to locking the
same task that was locked earlier with find_lock_task_mm(), and thus deadlock.
But there is a check in simple_lmk that skips potential victims that were
already found, which avoids the deadlock. Lockdep doesn't know about that check
(done with vtsk_is_duplicate()), however, and should therefore still complain.
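
To make the pattern concrete, here is a minimal standalone sketch (my
illustration, not simple_lmk's actual code) of what lockdep would be expected
to flag; every task_struct's alloc_lock belongs to the same lockdep class:

	/* Illustration only: lock two different tasks' alloc_locks back
	 * to back. Both locks share one lockdep class, so lockdep should
	 * see the second task_lock() as a possible recursive deadlock
	 * even though task1 != task2. */
	static void nested_task_lock_example(struct task_struct *task1,
					     struct task_struct *task2)
	{
		task_lock(task1);	/* spin_lock(&task1->alloc_lock) */
		task_lock(task2);	/* same class: expect print_deadlock_bug() */

		/* ... inspect both tasks here ... */

		task_unlock(task2);
		task_unlock(task1);
	}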

Lockdep does not complain though.

Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-14 Thread Steven Rostedt
On Mon, 13 May 2019 09:45:55 -0700
Sultan Alsawaf  wrote:

> On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> > I am starting to think I am ;)
> > 
> > If you have task1 != task2 this code
> > 
> > task_lock(task1);
> > task_lock(task2);
> > 
> > should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock
> > are the "same" lock from lockdep pov, held_lock's will have the same
> > hlock_class().

OK, this has gotten my attention.

This thread is quite long, do you have a git repo I can look at, and
also where is the first task_lock() taken before the
find_lock_task_mm()?

-- Steve



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-13 Thread Sultan Alsawaf
On Fri, May 10, 2019 at 05:10:25PM +0200, Oleg Nesterov wrote:
> I am starting to think I am ;)
> 
> If you have task1 != task2 this code
> 
>   task_lock(task1);
>   task_lock(task2);
> 
> should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock
> are the "same" lock from lockdep pov, held_lock's will have the same
> hlock_class().

Okay, I've stubbed out debug_locks_off(), and lockdep is now complaining about a
bunch of false positives so it is _really_ enabled this time. I grepped for
lockdep last time to try and find a concise way to show over email that lockdep
didn't complain, but that backfired. Here is better evidence:

taimen:/ # dmesg | grep simple_lmk
[   58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 KiB
[   58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 KiB
[   58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 KiB
[   58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 KiB
[   58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 KiB
[   58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 KiB
[   58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 KiB
[   58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 KiB
[   58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 KiB
[   58.762060] simple_lmk: Killing ndroid.settings with adj 700 to free 74708 KiB
[   58.763238] simple_lmk: Killing roid.apps.turbo with adj 700 to free 54660 KiB
[   58.873337] simple_lmk: Killing d.process.acore with adj 700 to free 48540 KiB
[   58.873513] simple_lmk: Killing d.process.media with adj 500 to free 46188 KiB
[   58.873713] simple_lmk: Killing putmethod.latin with adj 200 to free 67304 KiB
[   59.014046] simple_lmk: Killing android.vending with adj 201 to free 54900 KiB
[   59.017623] simple_lmk: Killing rak.mibandtools with adj 201 to free 44552 KiB
[   59.018423] simple_lmk: Killing eport.trebuchet with adj 100 to free 106208 KiB
[   59.223633] simple_lmk: Killing id.printspooler with adj 900 to free 39664 KiB
[   59.223762] simple_lmk: Killing gle.android.gms with adj 100 to free 64176 KiB
[   70.955204] simple_lmk: Killing .apps.translate with adj 906 to free 47564 KiB
[   70.955633] simple_lmk: Killing cloudprint:sync with adj 906 to free 31932 KiB
[   70.955787] simple_lmk: Killing oid.apps.photos with adj 904 to free 50204 KiB
[   71.060789] simple_lmk: Killing ecamera.android with adj 906 to free 32232 KiB
[   71.061074] simple_lmk: Killing webview_service with adj 906 to free 26028 KiB
[   71.061199] simple_lmk: Killing com.whatsapp with adj 904 to free 49484 KiB
[   71.190625] simple_lmk: Killing rbandroid.sleep with adj 906 to free 54724 KiB
[   71.190775] simple_lmk: Killing android.vending with adj 906 to free 39848 KiB
[   71.191303] simple_lmk: Killing m.facebook.orca with adj 904 to free 72296 KiB
[   71.342042] simple_lmk: Killing droid.deskclock with adj 902 to free 49284 KiB
[   71.342240] simple_lmk: Killing .apps.wellbeing with adj 900 to free 47632 KiB
[   71.342529] simple_lmk: Killing le.modemservice with adj 800 to free 33648 KiB
[   71.482391] simple_lmk: Killing d.process.media with adj 800 to free 40676 KiB
[   71.482511] simple_lmk: Killing rdog.challegram with adj 700 to free 71920 KiB
taimen:/ #

The first simple_lmk message appears at 58.349917. And based on the timestamps,
it's clear that simple_lmk called task_lock() for multiple different tasks,
which is the pattern you think should cause lockdep to complain. But here is the
full dmesg starting from that point:

[   58.349917] simple_lmk: Killing droid.deskclock with adj 906 to free 47548 KiB
[   58.354748] simple_lmk: Killing .android.dialer with adj 906 to free 36576 KiB
[   58.355030] simple_lmk: Killing rbandroid.sleep with adj 904 to free 50016 KiB
[   58.432654] binder_alloc: 2284: binder_alloc_buf failed to map pages in userspace, no vma
[   58.432671] binder: 1206:1218 transaction failed 29189/-3, size 76-0 line 3189
[   58.582833] simple_lmk: Killing oadcastreceiver with adj 904 to free 43044 KiB
[   58.587731] simple_lmk: Killing .apps.wellbeing with adj 902 to free 48128 KiB
[   58.588084] simple_lmk: Killing android.carrier with adj 902 to free 43636 KiB
[   58.590785] binder: undelivered transaction 58370, process died.
[   58.671857] simple_lmk: Killing ndroid.keychain with adj 902 to free 39992 KiB
[   58.675622] simple_lmk: Killing gs.intelligence with adj 900 to free 49572 KiB
[   58.675770] simple_lmk: Killing le.modemservice with adj 800 to free 41976 KiB
[   58.736678] binder: undelivered transaction 58814, process died.
[   58.736733] binder: release 3099:3128 transaction 57832 in, still active
[   58.736744] binder: send failed reply for transaction 57832 to 1876:3090
[   58.736761] binder: undelivered TRANSACTION_COMPLETE
[   

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-10 Thread Oleg Nesterov
On 05/09, Sultan Alsawaf wrote:
>
> On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> > Impossible ;) I bet lockdep should report the deadlock as soon as
> > find_victims() calls find_lock_task_mm() when you already have a locked
> > victim.
>
> I hope you're not a betting man ;)

I am starting to think I am ;)

If you have task1 != task2 this code

	task_lock(task1);
	task_lock(task2);

should trigger print_deadlock_bug(), task1->alloc_lock and task2->alloc_lock are
the "same" lock from lockdep pov, held_lock's will have the same hlock_class().

> CONFIG_PROVE_LOCKING=y

OK,

> And a printk added in vtsk_is_duplicate() to print when a duplicate is detected,

in this case find_lock_task_mm() won't be called, and this is what saves us from
the actual deadlock.


> and my phone's memory cut in half to make simple_lmk do something, this is
> what I observed:
> taimen:/ # dmesg | grep lockdep
> [0.00] \x09RCU lockdep checking is enabled.

this reports that CONFIG_PROVE_RCU is enabled ;)

> taimen:/ # dmesg | grep simple_lmk
> [   23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 kiB
> [   23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 kiB

yes, looks like simple_lmk has at least 2 locked victims. And I have no idea why
you do not see anything else in dmesg. Maybe debug_locks_off() was already
called.

But see above, "grep lockdep" won't work.  Perhaps you can do
"grep -e WARNING -e BUG -e locking".

Oleg.



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-09 Thread Sultan Alsawaf
On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> Impossible ;) I bet lockdep should report the deadlock as soon as
> find_victims() calls find_lock_task_mm() when you already have a locked victim.

I hope you're not a betting man ;)

With the following configured:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_DEBUG_SPINLOCK_BITE_ON_BUG is not set
CONFIG_DEBUG_SPINLOCK_PANIC_ON_BUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set

And a printk added in vtsk_is_duplicate() to print when a duplicate is detected,
and my phone's memory cut in half to make simple_lmk do something, this is what
I observed:
taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ # dmesg | grep simple_lmk
[   23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 kiB
[   23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 kiB
[   23.248457] simple_lmk: Killing .apps.translate with adj 904 to free 45884 kiB
[   23.248720] simple_lmk: Killing ndroid.settings with adj 904 to free 42868 kiB
[   23.313417] simple_lmk: DUPLICATE VTSK!
[   23.313440] simple_lmk: Killing ndroid.keychain with adj 906 to free 33680 kiB
[   23.313513] simple_lmk: Killing com.whatsapp with adj 904 to free 51436 kiB
[   34.646695] simple_lmk: DUPLICATE VTSK!
[   34.646717] simple_lmk: Killing ndroid.apps.gcs with adj 906 to free 37956 kiB
[   34.646792] simple_lmk: Killing droid.apps.maps with adj 904 to free 63600 kiB
taimen:/ # dmesg | grep lockdep
[0.00] \x09RCU lockdep checking is enabled.
taimen:/ # 

> As for 
> https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad
> Well,
> 
>   mmdrop(mm);
>   simple_lmk_mm_freed(mm);
> 
> looks racy because mmdrop(mm) can free this mm_struct. Yes,
> simple_lmk_mm_freed() does not dereference this pointer, but the same memory
> can be re-allocated as another ->mm for the new task which can be found by
> find_victims(), and _in theory_ this all can happen in between, so the
> "victims[i].mm == mm" can be a false positive.
> 
> And this also means that simple_lmk_mm_freed() should clear victims[i].mm when
> it detects "victims[i].mm == mm", otherwise we have the same theoretical race,
> victims_to_kill is only cleared when the last victim goes away.

Fair point. Putting simple_lmk_mm_freed(mm) right before mmdrop(mm), and
sprinkling in a cmpxchg in simple_lmk_mm_freed() should fix that up.
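
Something like this rough, untested sketch, perhaps (using the names described
in this thread; the real code may well differ):

	/* Hypothetical fix: cmpxchg() atomically detects the match and
	 * clears the slot, so a recycled mm_struct pointer can never
	 * match a stale entry, and each slot is claimed exactly once. */
	static void simple_lmk_mm_freed(struct mm_struct *mm)
	{
		int i;

		for (i = 0; i < victims_to_kill; i++) {
			if (cmpxchg(&victims[i].mm, mm, NULL) == mm) {
				if (atomic_dec_and_test(&victim_count))
					wake_up(&victim_waitq);
				break;
			}
		}
	}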

> Another nit... you can drop tasklist_lock right after the 1st "find_victims" loop.

True!

> And it seems that you do not really need to walk the "victims" array twice
> after that, you can do everything in a single loop, but this is cosmetic.

Won't this result in potentially holding the task lock way longer than necessary
for multiple tasks that aren't going to be killed?

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-09 Thread Oleg Nesterov
On 05/07, Sultan Alsawaf wrote:
>
> On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote:
>
> > Did you test this patch with lockdep enabled?
> >
> > If I read the patch correctly, lockdep should complain. vtsk_is_duplicate()
> > ensures that we do not take the same ->alloc_lock twice or more, but lockdep
> > can't know this.
>
> Yeah, lockdep is fine with this, at least on 4.4.

Impossible ;) I bet lockdep should report the deadlock as soon as find_victims()
calls find_lock_task_mm() when you already have a locked victim.

Nevermind, I guess this code won't run with lockdep enabled...


As for 
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad
Well,

mmdrop(mm);
simple_lmk_mm_freed(mm);

looks racy because mmdrop(mm) can free this mm_struct. Yes,
simple_lmk_mm_freed() does not dereference this pointer, but the same memory
can be re-allocated as another ->mm for the new task which can be found by
find_victims(), and _in theory_ this all can happen in between, so the
"victims[i].mm == mm" can be a false positive.

And this also means that simple_lmk_mm_freed() should clear victims[i].mm when
it detects "victims[i].mm == mm", otherwise we have the same theoretical race,
victims_to_kill is only cleared when the last victim goes away.


Another nit... you can drop tasklist_lock right after the 1st "find_victims" loop.

And it seems that you do not really need to walk the "victims" array twice after
that, you can do everything in a single loop, but this is cosmetic.

Oleg.



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Suren Baghdasaryan
From: Sultan Alsawaf 
Date: Tue, May 7, 2019 at 9:53 AM
To: Suren Baghdasaryan
Cc: Christian Brauner, Greg Kroah-Hartman, open list:ANDROID DRIVERS,
Daniel Colascione, Todd Kjos, Kees Cook, Peter Zijlstra, Martijn
Coenen, LKML, Tim Murray, Michal Hocko, linux-mm, Arve Hjønnevåg, Ingo
Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy
Lutomirski, kernel-team

> On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> > Hi Sultan,
> > Looks like you are posting this patch for devices that do not use the
> > userspace LMKD solution due to them using older kernels or due to
> > their vendors sticking to an in-kernel solution. If so, I see a couple
> > of logistical issues with this patch. I don't see it being adopted in
> > upstream kernel 5.x since it re-implements a deprecated mechanism even
> > though vendors still use it. Vendors, on the other hand, will not adopt
> > it until you show evidence that it works way better than what the
> > lowmemorykiller driver does now. You would have to provide measurable
> > data and explain your tests before they would consider spending time
> > on this.
>
> Yes, this is mostly for the devices already produced that are forced to suffer
> with poor memory management. I can't even convince vendors to fix kernel
> memory leaks, so there's no way I'd be able to convince them of trying this
> patch, especially since it seems like you're having trouble convincing vendors
> to stop using lowmemorykiller in the first place. And thankfully, convincing
> vendors isn't my job :)
>
> > On the implementation side I'm not convinced at all that this would
> > work better on all devices and in all circumstances. We had cases when
> > a new mechanism would show very good results until one use case
> > completely broke it. The bulk killing of processes that you are doing in
> > your patch was a very good example of such a decision which later on
> > we had to rethink. That's why baking these policies into the kernel is
> > very problematic. Another problem I see with the implementation is that
> > it ties process killing to the reclaim scan depth. It's very similar
> > to how vmpressure works, and vmpressure in my experience is very
> > unpredictable.
>
> Could you elaborate a bit on why bulk killing isn't good?

Yes. Several months ago we got reports that LMKD got very aggressive,
killing more processes in situations which did not require that many
kills to recover from memory pressure. After investigation we tracked
it to the bulk killing, which would kill too many processes during a
memory usage spike. When killing gradually, those kills would be avoided,
so we had to go back to a more balanced approach. The conceptual
issue with bulk killing is that you can't cancel it when the device
recovers quicker than you expected.

> > > > I linked in the previous message, Google made a rather large set of
> > > > modifications to the supposedly-defunct lowmemorykiller.c not one month
> > > > ago. What's going on?
> >
> > If you look into that commit, it adds the ability to report kill stats. If
> > that were a change in how that driver works, it would be rejected.
>
> Fair, though it was quite strange seeing something that was supposedly totally
> abandoned receiving a large chunk of code for reporting stats.
>
> Thanks,
> Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Joel Fernandes
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> From: Christian Brauner 
> Date: Tue, May 7, 2019 at 3:58 AM
> To: Sultan Alsawaf
> Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione,
> Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim
> Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg,
> Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy
> Lutomirski, kernel-team
> 
> > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > > > Given that any "new" android device that gets shipped "soon" should be
> > > > using 4.9.y or newer, is this a real issue?
> > >
> > > It's certainly a real issue for those who can't buy brand new Android
> > > devices without software bugs every six months :)
> > >
> 
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use the
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to an in-kernel solution. If so, I see a couple
> of logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors, on the other hand, will not adopt
> it until you show evidence that it works way better than what the
> lowmemorykiller driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.
> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one use case
> completely broke it. The bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into the kernel is
> very problematic. Another problem I see with the implementation is that
> it ties process killing to the reclaim scan depth. It's very similar
> to how vmpressure works, and vmpressure in my experience is very
> unpredictable.

Yeah it does seem conceptually similar, good point.
 
> > > Regardless, even if PSI were backported, a full-fledged LMKD using it has
> > > yet to be made, so it wouldn't be of much use now.
> >
> > This is work that is ongoing and requires kernel changes to make it
> > feasible. One of the things that I have been working on for quite a
> > while is the whole file descriptor for processes thing that is important
> > for LMKD (Even though I never thought about this use-case when I started
> > pitching this.). Joel and Daniel have joined in and are working on
> > making LMKD possible.
> > What I find odd is that every couple of weeks different solutions to the
> > low memory problem are pitched. There is simple_lmk, there is LMKD, and
> > there was a patchset that wanted to speed up memory reclaim at process
> > kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> > That all seems - though related - rather uncoordinated.
> 
> I'm not sure why pidfd_wait and expedited reclaim are seen as an
> uncoordinated effort. All of them are done to improve userspace LMKD.

Christian, pidfd_wait and expedited reclaim are both coordinated efforts and
solve different problems related to LMK. simple_lmk is an entirely different
effort that we already hesitated about when it was first posted, and now we
hesitate again due to the issues Suren and others mentioned.

I think it is a better idea for Sultan to spend his time on using/improving
PSI/LMKD than spending it on simple_lmk. It could also be a good topic to
discuss in the Android track of the Linux Plumbers Conference.

thanks,

 - Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Greg Kroah-Hartman
On Tue, May 07, 2019 at 10:17:11AM -0700, Sultan Alsawaf wrote:
> On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote:
> > > It's even more odd that although a userspace solution is touted as the
> > > proper way to go on LKML, almost no Android OEMs are using it, and even
> > > in that commit I linked in the previous message, Google made a rather
> > > large set of modifications to the supposedly-defunct lowmemorykiller.c
> > > not one month ago. What's going on?
> > 
> > "almost no"?  Again, Android Go is doing that, right?
> 
> I'd check for myself, but I can't seem to find kernel source for an Android Go
> device...
> 
> This seems more confusing though. Why would the ultra-low-end devices use LMKD
> while other devices use the broken lowmemorykiller driver?

It's probably because the Android Go devices got a lot more "help" from
people at Google than did the other devices you are looking at.  Also,
despite the older kernel version, they are probably running a newer
version of Android userspace, specially tuned just for lower memory
devices.

So those 3.18.y based Android Go devices are newer than the 4.4.y based
"full Android" devices on the market, and even some 4.9.y based devices.

Yes, it is strange :)

> > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.
> > 
> > Qualcomm should never be used as an example of a company that has any
> > idea of what to do in their kernel :)
> 
> Agreed, but nearly all OEMs that use Qualcomm chipsets roll with Qualcomm's
> kernel decisions, so Qualcomm has a bit of influence here.

Yes, because almost no OEM wants to mess with their kernel, they just
take QCOM's kernel and run with it.  But don't take that for some sort
of "best design practice" summary at all.

> > > If PSI were backported to 4.4, or even 3.18, would it really be used?
> > 
> > Why wouldn't it, if it worked properly?
> 
> For the same mysterious reason that Qualcomm and others cling to
> lowmemorykiller, I presume. This is part of what's been confusing me for quite
> some time...

QCOM's 4.4.y based kernel work was done 3-4 years ago, if not older.
They didn't know that this was not the "right way" to do things.  The
Google developers have been working for the past few years to do it
correctly, but they cannot go back in time to change old repos, sorry.

Now that I understand you just want to work on your local device, that
makes more sense.  But I think you will have a better result trying to
do a 4.4 backport of PSI combined with the userspace stuff, than to try
to worry about your driver in 5.2 or newer.

Or you can forward-port your kernel to 4.9, or better yet, 4.14.  That
would probably be a much better thing to do overall as 4.4 is really old
now.

Good luck!

greg k-h


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 12:58:27PM +0200, Christian Brauner wrote:
> This is work that is ongoing and requires kernel changes to make it
> feasible. One of the things that I have been working on for quite a
> while is the whole file descriptor for processes thing that is important
> for LMKD (Even though I never thought about this use-case when I started
> pitching this.). Joel and Daniel have joined in and are working on
> making LMKD possible.
> What I find odd is that every couple of weeks different solutions to the
> low memory problem are pitched. There is simple_lmk, there is LMKD, and
> there was a patchset that wanted to speed up memory reclaim at process
> kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> That all seems - though related - rather uncoordinated. Now granted,
> coordinated is usually not how kernel development necessarily works but
> it would probably be good to have some sort of direction and from what I
> have seen LMKD seems to be the most coordinated effort. But that might
> just be my impression.

LMKD is just Android's userspace low-memory-killer daemon. It's been around for
a while.

This patch (simple_lmk) is meant to serve as an immediate solution for the
devices that'll never see a single line of PSI code running on them, which
amounts to... well, most Android devices currently in existence. I'm more of a
cowboy who made this patch after waiting a few years for memory management
improvements on Android that never happened. Though it looks like it's going to
happen soon(ish?) for super new devices that'll have the privilege of shipping
with PSI in use.

On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote:
> > It's even more odd that although a userspace solution is touted as the
> > proper way to go on LKML, almost no Android OEMs are using it, and even in
> > that commit I linked in the previous message, Google made a rather large
> > set of modifications to the supposedly-defunct lowmemorykiller.c not one
> > month ago. What's going on?
> 
> "almost no"?  Again, Android Go is doing that, right?

I'd check for myself, but I can't seem to find kernel source for an Android Go
device...

This seems more confusing though. Why would the ultra-low-end devices use LMKD
while other devices use the broken lowmemorykiller driver?

> > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.
> 
> Qualcomm should never be used as an example of a company that has any
> idea of what to do in their kernel :)

Agreed, but nearly all OEMs that use Qualcomm chipsets roll with Qualcomm's
kernel decisions, so Qualcomm has a bit of influence here.

> > If PSI were backported to 4.4, or even 3.18, would it really be used?
> 
> Why wouldn't it, if it worked properly?

For the same mysterious reason that Qualcomm and others cling to
lowmemorykiller, I presume. This is part of what's been confusing me for quite
some time...

> Please see the work that went into PSI and the patches around it.
> There's also a lwn.net article last week about the further work ongoing
> in this area.  With all of that, you should see how in-kernel memory
> killers are NOT the way to go.
> 
> > Regardless, even if PSI were backported, a full-fledged LMKD using it has
> > yet to be made, so it wouldn't be of much use now.
> 
> "LMKD"?  Again, PSI is in the 4.9.y android-common tree, so the
> userspace side should be in AOSP, right?

LMKD as in Android's low-memory-killer daemon. It is in AOSP, but it looks like
it's still a work in progress.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use the
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to an in-kernel solution. If so, I see a couple
> of logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors, on the other hand, will not adopt
> it until you show evidence that it works way better than what the
> lowmemorykiller driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.

Yes, this is mostly for the devices already produced that are forced to suffer
with poor memory management. I can't even convince vendors to fix kernel
memory leaks, so there's no way I'd be able to convince them of trying this
patch, especially since it seems like you're having trouble convincing vendors
to stop using lowmemorykiller in the first place. And thankfully, convincing
vendors isn't my job :)

> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one use case
> completely broke it. The bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into the kernel is
> very problematic. Another problem I see with the implementation is that
> it ties process killing to the reclaim scan depth. It's very similar
> to how vmpressure works, and vmpressure in my experience is very
> unpredictable.

Could you elaborate a bit on why bulk killing isn't good?

> > > I linked in the previous message, Google made a rather large set of
> > > modifications to the supposedly-defunct lowmemorykiller.c not one month
> > > ago. What's going on?
> 
> If you look into that commit, it adds the ability to report kill stats. If
> that were a change in how that driver works, it would be rejected.

Fair, though it was quite strange seeing something that was supposedly totally
abandoned receiving a large chunk of code for reporting stats.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Christian Brauner
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> From: Christian Brauner 
> Date: Tue, May 7, 2019 at 3:58 AM
> To: Sultan Alsawaf
> Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione,
> Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim
> Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg,
> Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy
> Lutomirski, kernel-team
> 
> > On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> > > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > > > Given that any "new" android device that gets shipped "soon" should be
> > > > using 4.9.y or newer, is this a real issue?
> > >
> > > It's certainly a real issue for those who can't buy brand new Android
> > > devices without software bugs every six months :)
> > >
> 
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use the
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to an in-kernel solution. If so, I see a couple
> of logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors, on the other hand, will not adopt
> it until you show evidence that it works way better than what the
> lowmemorykiller driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.
> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one use case
> completely broke it. The bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into the kernel is
> very problematic. Another problem I see with the implementation is that
> it ties process killing to the reclaim scan depth. It's very similar
> to how vmpressure works, and vmpressure in my experience is very
> unpredictable.
> 
> > > > And if it is, I'm sure that asking for those patches to be backported to
> > > > 4.4.y would be just fine, have you asked?
> > > >
> > > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> > > > use the in-kernel memory killer, but instead use the userspace solution
> > > > today.  So trying to get another in-kernel memory killer solution added
> > > > anywhere seems quite odd.
> > >
> > > It's even more odd that although a userspace solution is touted as the
> > > proper way to go on LKML, almost no Android OEMs are using it, and even
> > > in that commit
> >
> > That's probably because without proper kernel changes this is rather
> > tricky to use safely (see below).
> >
> > > I linked in the previous message, Google made a rather large set of
> > > modifications to the supposedly-defunct lowmemorykiller.c not one month
> > > ago. What's going on?
> 
> If you look into that commit, it adds the ability to report kill stats. If
> that were a change in how that driver works, it would be rejected.
> 
> > >
> > > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI
> > > were backported to 4.4, or even 3.18, would it really be used? I don't
> > > really understand the aversion to an in-kernel memory killer on LKML
> > > despite the rest of the industry's attraction to it. Perhaps there's some
> > > inherently great cost in using the userspace solution that I'm unaware of?
> 
> Vendors are cautious about adopting a userspace solution, and it is a
> process to address all concerns, but we are getting there.
> 
> > > Regardless, even if PSI were backported, a full-fledged LMKD using it has
> > > yet to be made, so it wouldn't be of much use now.
> >
> > This is work that is ongoing and requires kernel changes to make it
> > feasible. One of the things that I have been working on for quite a
> > while is the whole file descriptor for processes thing that is important
> > for LMKD (Even though I never thought about this use-case when I started
> > pitching this.). Joel and Daniel have joined in and are working on
> > making LMKD possible.
> > What I find odd is that every couple of weeks different solutions to the
> > low memory problem are pitched. There is simple_lmk, there is LMKD, and
> > there was a patchset that wanted to speed up memory reclaim at process
> > kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> > That all seems - though related - rather uncoordinated.
> 
> I'm not sure why pidfd_wait and expedited reclaim are seen as an
> uncoordinated effort. All of them are done to improve userspace LMKD.

If so, that wasn't very obvious, and there was some disagreement there as
well about whether this would be the right solution.
In any case, the point is that LMKD seems to 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote:
> I am not going to comment on the intent, but to be honest I am skeptical too.

The general sentiment has been that this is a really bad idea, but I'm just a
frustrated Android user who wants his phone to not require mountains of zRAM
only to still manage memory poorly. Until I can go out and buy a non-Pixel phone
that uses PSI to make these decisions (and does a good job of it), I'm going to
stick to my hacky little driver for my personal devices. Many others who like to
hack their Android devices to make them last longer will probably find value in
this as well, since there are millions of people who use devices that'll never
see any of PSI ported to their ancient 3.x kernels.

And yes, I know this would never be accepted to upstream in a million years. I
mostly wanted some code review and guidance, since mm code is pretty tricky :)

> On 05/06, Sultan Alsawaf wrote:
> >
> > +static unsigned long find_victims(struct victim_info *varr, int *vindex,
> > + int vmaxlen, int min_adj, int max_adj)
> > +{
> > +   unsigned long pages_found = 0;
> > +   int old_vindex = *vindex;
> > +   struct task_struct *tsk;
> > +
> > +   for_each_process(tsk) {
> > +   struct task_struct *vtsk;
> > +   unsigned long tasksize;
> > +   short oom_score_adj;
> > +
> > +   /* Make sure there's space left in the victim array */
> > +   if (*vindex == vmaxlen)
> > +   break;
> > +
> > +   /* Don't kill current, kthreads, init, or duplicates */
> > +   if (same_thread_group(tsk, current) ||
> > +   tsk->flags & PF_KTHREAD ||
> > +   is_global_init(tsk) ||
> > +   vtsk_is_duplicate(varr, *vindex, tsk))
> > +   continue;
> > +
> > +   vtsk = find_lock_task_mm(tsk);
> 
> Did you test this patch with lockdep enabled?
> 
> If I read the patch correctly, lockdep should complain. vtsk_is_duplicate()
> ensures that we do not take the same ->alloc_lock twice or more, but lockdep
> can't know this.

Yeah, lockdep is fine with this, at least on 4.4.

> > +static void scan_and_kill(unsigned long pages_needed)
> > +{
> > +   static DECLARE_WAIT_QUEUE_HEAD(victim_waitq);
> > +   struct victim_info victims[MAX_VICTIMS];
> > +   int i, nr_to_kill = 0, nr_victims = 0;
> > +   unsigned long pages_found = 0;
> > +   atomic_t victim_count;
> > +
> > +   /*
> > +* Hold the tasklist lock so tasks don't disappear while scanning. This
> > +* is preferred to holding an RCU read lock so that the list of tasks
> > +* is guaranteed to be up to date. Keep preemption disabled until the
> > +* SIGKILLs are sent so the victim kill process isn't interrupted.
> > +*/
> > +   read_lock(&tasklist_lock);
> > +   preempt_disable();
> 
> read_lock() disables preemption, every task_lock() too, so this looks
> unnecessary.

Good point.

> > +   for (i = 1; i < ARRAY_SIZE(adj_prio); i++) {
> > +   pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS,
> > +   adj_prio[i], adj_prio[i - 1]);
> > +   if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS)
> > +   break;
> > +   }
> > +
> > +   /*
> > +* Calculate the number of tasks that need to be killed and quickly
> > +* release the references to those that'll live.
> > +*/
> > +   for (i = 0, pages_found = 0; i < nr_victims; i++) {
> > +   struct victim_info *victim = &victims[i];
> > +   struct task_struct *vtsk = victim->tsk;
> > +
> > +   /* The victims' mm lock is taken in find_victims; release it */
> > +   if (pages_found >= pages_needed) {
> > +   task_unlock(vtsk);
> > +   continue;
> > +   }
> > +
> > +   /*
> > +* Grab a reference to the victim so it doesn't disappear after
> > +* the tasklist lock is released.
> > +*/
> > +   get_task_struct(vtsk);
> 
> The comment doesn't look correct. The victim can't disappear until
> task_unlock() below; it can't pass exit_mm().

I was always unsure about this and decided to hold a reference to the
task_struct to be safe. Thanks for clearing that up.

> > +   pages_found += victim->size;
> > +   nr_to_kill++;
> > +   }
> > +   read_unlock(&tasklist_lock);
> > +
> > +   /* Kill the victims */
> > +   victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill);
> > +   for (i = 0; i < nr_to_kill; i++) {
> > +   struct victim_info *victim = &victims[i];
> > +   struct task_struct *vtsk = victim->tsk;
> > +
> > +   pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm,
> > +   vtsk->signal->oom_score_adj,
> > +   victim->size << (PAGE_SHIFT - 10));
> > +
> > +   /* Configure the victim's mm to notify us when it's freed */
> > +   vtsk->mm->slmk_waitq = &victim_waitq;
> > +   

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Suren Baghdasaryan
From: Christian Brauner 
Date: Tue, May 7, 2019 at 3:58 AM
To: Sultan Alsawaf
Cc: Greg Kroah-Hartman, open list:ANDROID DRIVERS, Daniel Colascione,
Todd Kjos, Kees Cook, Peter Zijlstra, Martijn Coenen, LKML, Tim
Murray, Michal Hocko, Suren Baghdasaryan, linux-mm, Arve Hjønnevåg,
Ingo Molnar, Steven Rostedt, Oleg Nesterov, Joel Fernandes, Andy
Lutomirski, kernel-team

> On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> > On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > > Given that any "new" android device that gets shipped "soon" should be
> > > using 4.9.y or newer, is this a real issue?
> >
> > It's certainly a real issue for those who can't buy brand new Android
> > devices without software bugs every six months :)
> >

Hi Sultan,
Looks like you are posting this patch for devices that do not use the
userspace LMKD solution due to them using older kernels or due to
their vendors sticking to an in-kernel solution. If so, I see a couple
of logistical issues with this patch. I don't see it being adopted in
upstream kernel 5.x since it re-implements a deprecated mechanism even
though vendors still use it. Vendors, on the other hand, will not adopt
it until you show evidence that it works way better than what the
lowmemorykiller driver does now. You would have to provide measurable
data and explain your tests before they would consider spending time
on this.
On the implementation side I'm not convinced at all that this would
work better on all devices and in all circumstances. We had cases when
a new mechanism would show very good results until one use case
completely broke it. The bulk killing of processes that you are doing in
your patch was a very good example of such a decision which later on
we had to rethink. That's why baking these policies into the kernel is
very problematic. Another problem I see with the implementation is that
it ties process killing to the reclaim scan depth. It's very similar
to how vmpressure works, and vmpressure in my experience is very
unpredictable.

> > > And if it is, I'm sure that asking for those patches to be backported to
> > > 4.4.y would be just fine, have you asked?
> > >
> > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> > > use the in-kernel memory killer, but instead use the userspace solution
> > > today.  So trying to get another in-kernel memory killer solution added
> > > anywhere seems quite odd.
> >
> > It's even more odd that although a userspace solution is touted as the
> > proper way to go on LKML, almost no Android OEMs are using it, and even in
> > that commit
>
> That's probably because without proper kernel changes this is rather
> tricky to use safely (see below).
>
> > I linked in the previous message, Google made a rather large set of
> > modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> > What's going on?

If you look into that commit, it adds the ability to report kill stats. If
that were a change in how that driver works, it would be rejected.

> >
> > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were
> > backported to 4.4, or even 3.18, would it really be used? I don't really
> > understand the aversion to an in-kernel memory killer on LKML despite the
> > rest of the industry's attraction to it. Perhaps there's some inherently
> > great cost in using the userspace solution that I'm unaware of?

Vendors are cautious about adopting a userspace solution, and it is a
process to address all concerns, but we are getting there.

> > Regardless, even if PSI were backported, a full-fledged LMKD using it has
> > yet to be made, so it wouldn't be of much use now.
>
> This is work that is ongoing and requires kernel changes to make it
> feasible. One of the things that I have been working on for quite a
> while is the whole file descriptor for processes thing that is important
> for LMKD (Even though I never thought about this use-case when I started
> pitching this.). Joel and Daniel have joined in and are working on
> making LMKD possible.
> What I find odd is that every couple of weeks different solutions to the
> low memory problem are pitched. There is simple_lmk, there is LMKD, and
> there was a patchset that wanted to speed up memory reclaim at process
> kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> That all seems - though related - rather uncoordinated.

I'm not sure why pidfd_wait and expedited reclaim are seen as an
uncoordinated effort. All of them are done to improve userspace LMKD.

> Now granted,
> coordinated is usually not how kernel development necessarily works but
> it would probably be good to have some sort of direction and from what I
> have seen LMKD seems to be the most coordinated effort. But that might
> just be my impression.
>
> Christian

Thanks,
Suren.

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Oleg Nesterov
I am not going to comment on the intent, but to be honest I am skeptical too.

On 05/06, Sultan Alsawaf wrote:
>
> +static unsigned long find_victims(struct victim_info *varr, int *vindex,
> +   int vmaxlen, int min_adj, int max_adj)
> +{
> + unsigned long pages_found = 0;
> + int old_vindex = *vindex;
> + struct task_struct *tsk;
> +
> + for_each_process(tsk) {
> + struct task_struct *vtsk;
> + unsigned long tasksize;
> + short oom_score_adj;
> +
> + /* Make sure there's space left in the victim array */
> + if (*vindex == vmaxlen)
> + break;
> +
> + /* Don't kill current, kthreads, init, or duplicates */
> + if (same_thread_group(tsk, current) ||
> + tsk->flags & PF_KTHREAD ||
> + is_global_init(tsk) ||
> + vtsk_is_duplicate(varr, *vindex, tsk))
> + continue;
> +
> + vtsk = find_lock_task_mm(tsk);

Did you test this patch with lockdep enabled?

If I read the patch correctly, lockdep should complain. vtsk_is_duplicate()
ensures that we do not take the same ->alloc_lock twice or more, but lockdep
can't know this.
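
(For readers without the patch in front of them, vtsk_is_duplicate() is
presumably something like the following reconstruction, based on how it is
described in this thread; the actual helper may differ:)

	/* Reconstruction, not the actual patch: a task is a "duplicate"
	 * if a task from the same thread group was already selected as a
	 * victim, i.e. its ->alloc_lock is already held. */
	static bool vtsk_is_duplicate(struct victim_info *varr, int vlen,
				      struct task_struct *vtsk)
	{
		int i;

		for (i = 0; i < vlen; i++) {
			if (same_thread_group(varr[i].tsk, vtsk))
				return true;
		}

		return false;
	}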

> +static void scan_and_kill(unsigned long pages_needed)
> +{
> + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq);
> + struct victim_info victims[MAX_VICTIMS];
> + int i, nr_to_kill = 0, nr_victims = 0;
> + unsigned long pages_found = 0;
> + atomic_t victim_count;
> +
> + /*
> +  * Hold the tasklist lock so tasks don't disappear while scanning. This
> +  * is preferred to holding an RCU read lock so that the list of tasks
> +  * is guaranteed to be up to date. Keep preemption disabled until the
> +  * SIGKILLs are sent so the victim kill process isn't interrupted.
> +  */
> + read_lock(&tasklist_lock);
> + preempt_disable();

read_lock() disables preemption, every task_lock() too, so this looks
unnecessary.

> + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) {
> + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS,
> + adj_prio[i], adj_prio[i - 1]);
> + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS)
> + break;
> + }
> +
> + /*
> +  * Calculate the number of tasks that need to be killed and quickly
> +  * release the references to those that'll live.
> +  */
> + for (i = 0, pages_found = 0; i < nr_victims; i++) {
> + struct victim_info *victim = &victims[i];
> + struct task_struct *vtsk = victim->tsk;
> +
> + /* The victims' mm lock is taken in find_victims; release it */
> + if (pages_found >= pages_needed) {
> + task_unlock(vtsk);
> + continue;
> + }
> +
> + /*
> +  * Grab a reference to the victim so it doesn't disappear after
> +  * the tasklist lock is released.
> +  */
> + get_task_struct(vtsk);

The comment doesn't look correct. The victim can't disappear until task_unlock()
below; it can't pass exit_mm().
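
To spell out why (a heavily trimmed paraphrase of exit_mm() from kernel/exit.c,
written from memory; details vary by kernel version):

	/* The exiting task must take its own alloc_lock before clearing
	 * ->mm. While the killer holds that lock via find_lock_task_mm(),
	 * the victim cannot pass this point, so vtsk and vtsk->mm stay
	 * valid without an extra reference. */
	static void exit_mm(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->mm;

		mm_release(tsk, mm);
		if (!mm)
			return;

		task_lock(tsk);		/* blocks while the killer holds it */
		tsk->mm = NULL;
		task_unlock(tsk);

		mmput(mm);
	}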

> + pages_found += victim->size;
> + nr_to_kill++;
> + }
> + read_unlock(&tasklist_lock);
> +
> + /* Kill the victims */
> + victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill);
> + for (i = 0; i < nr_to_kill; i++) {
> + struct victim_info *victim = &victims[i];
> + struct task_struct *vtsk = victim->tsk;
> +
> + pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm,
> + vtsk->signal->oom_score_adj,
> + victim->size << (PAGE_SHIFT - 10));
> +
> + /* Configure the victim's mm to notify us when it's freed */
> + vtsk->mm->slmk_waitq = &victim_waitq;
> + vtsk->mm->slmk_counter = &victim_count;
> +
> + /* Accelerate the victim's death by forcing the kill signal */
> + do_send_sig_info(SIGKILL, SIG_INFO_TYPE, vtsk, true);
   
this should be PIDTYPE_TGID

> +
> + /* Finally release the victim's mm lock */
> + task_unlock(vtsk);
> + }
> + preempt_enable_no_resched();

See above. And I don't understand how _no_resched() can really help...

> +
> + /* Try to speed up the death process now that we can schedule again */
> + for (i = 0; i < nr_to_kill; i++) {
> + struct task_struct *vtsk = victims[i].tsk;
> +
> + /* Increase the victim's priority to make it die faster */
> + set_user_nice(vtsk, MIN_NICE);
> +
> + /* Allow the victim to run on any CPU */
> + set_cpus_allowed_ptr(vtsk, cpu_all_mask);
> +
> + /* Finally release the victim reference acquired earlier */
> + put_task_struct(vtsk);
> + }
> +
> + /* Wait 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Michal Hocko
On Mon 06-05-19 19:16:22, Sultan Alsawaf wrote:
> This is a complete low memory killer solution for Android that is small
> and simple. Processes are killed according to the priorities that
> Android gives them, so that the least important processes are always
> killed first. Processes are killed until memory deficits are satisfied,
> as observed from kswapd struggling to free up pages. Simple LMK stops
> killing processes when kswapd finally goes back to sleep.
> 
> The only tunables are the desired amount of memory to be freed per
> reclaim event and desired frequency of reclaim events. Simple LMK tries
> to free at least the desired amount of memory per reclaim and waits
> until all of its victims' memory is freed before proceeding to kill more
> processes.

Why do we need something like that in the kernel? I really do not like
an idea of having two OOM killer implementations in the kernel. As
already pointed out newer kernels can do PSI and older kernels can live
with an out of tree code to achieve what they need. I do not see why we
really need this code in the upstream kernel.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Greg Kroah-Hartman
On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > Given that any "new" android device that gets shipped "soon" should be
> > using 4.9.y or newer, is this a real issue?
> 
> It's certainly a real issue for those who can't buy brand new Android devices
> without software bugs every six months :)

Heh.

But, your "new code" isn't going to be going into any existing device,
or any device that will come out this year.  The soonest it would be
would be next year, and by then, 4.9.y is fine.

> > And if it is, I'm sure that asking for those patches to be backported to
> > 4.4.y would be just fine, have you asked?
> >
> > Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> > use the in-kernel memory killer, but instead use the userspace solution
> > today.  So trying to get another in-kernel memory killer solution added
> > anywhere seems quite odd.
> 
> It's even more odd that although a userspace solution is touted as the proper
> way to go on LKML, almost no Android OEMs are using it, and even in that
> commit I linked in the previous message, Google made a rather large set of
> modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> What's going on?

"almost no"?  Again, Android Go is doing that, right?

And yes, there is still some 4.4 android-common work happening in this
area, see this patch that just got merged:
https://android-review.googlesource.com/c/kernel/common/+/953354

So, for 4.4.y based devices, that should be enough, right?

> Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.

Qualcomm should never be used as an example of a company that has any
idea of what to do in their kernel :)

> If PSI were backported to 4.4, or even 3.18, would it really be used?

Why wouldn't it, if it worked properly?

> I don't really understand the aversion to an in-kernel memory killer
> on LKML despite the rest of the industry's attraction to it. Perhaps
> there's some inherently great cost in using the userspace solution
> that I'm unaware of?

Please see the work that went into PSI and the patches around it.
There's also an lwn.net article from last week about the further work ongoing
in this area.  With all of that, you should see how in-kernel memory
killers are NOT the way to go.

> Regardless, even if PSI were backported, a full-fledged LMKD using it has
> yet to be made, so it wouldn't be of much use now.

"LMKD"?  Again, PSI is in the 4.9.y android-common tree, so the
userspace side should be in AOSP, right?

thanks,

greg k-h


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Christian Brauner
On Tue, May 07, 2019 at 01:12:36AM -0700, Sultan Alsawaf wrote:
> On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> > Given that any "new" android device that gets shipped "soon" should be
> > using 4.9.y or newer, is this a real issue?
> 
> It's certainly a real issue for those who can't buy brand new Android devices
> without software bugs every six months :)
> 
> > And if it is, I'm sure that asking for those patches to be backported to
> > 4.4.y would be just fine, have you asked?
> >
> > Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> > use the in-kernel memory killer, but instead use the userspace solution
> > today.  So trying to get another in-kernel memory killer solution added
> > anywhere seems quite odd.
> 
> It's even more odd that although a userspace solution is touted as the proper
> way to go on LKML, almost no Android OEMs are using it, and even in that commit

That's probably because without proper kernel changes this is rather
tricky to use safely (see below).

> I linked in the previous message, Google made a rather large set of
> modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> What's going on?
> 
> Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were
> backported to 4.4, or even 3.18, would it really be used? I don't really
> understand the aversion to an in-kernel memory killer on LKML despite the rest
> of the industry's attraction to it. Perhaps there's some inherently great cost
> in using the userspace solution that I'm unaware of?
> 
> Regardless, even if PSI were backported, a full-fledged LMKD using it has
> yet to be made, so it wouldn't be of much use now.

This is work that is ongoing and requires kernel changes to make it
feasible. One of the things that I have been working on for quite a
while is the whole file descriptor for processes thing that is important
for LMKD (Even though I never thought about this use-case when I started
pitching this.). Joel and Daniel have joined in and are working on
making LMKD possible.
What I find odd is that every couple of weeks different solutions to the
low memory problem are pitched. There is simple_lmk, there is LMKD, and
there was a patchset that wanted to speed up memory reclaim at process
kill-time by adding a new flag to the new pidfd_send_signal() syscall.
That all seems - though related - rather uncoordinated. Now granted,
coordinated is usually not how kernel development necessarily works but
it would probably be good to have some sort of direction and from what I
have seen LMKD seems to be the most coordinated effort. But that might
just be my impression.

Christian


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote:
> Given that any "new" android device that gets shipped "soon" should be
> using 4.9.y or newer, is this a real issue?

It's certainly a real issue for those who can't buy brand new Android devices
without software bugs every six months :)

> And if it is, I'm sure that asking for those patches to be backported to
> 4.4.y would be just fine, have you asked?
>
> Note that I know of Android Go devices, running 3.18.y kernels, do NOT
> use the in-kernel memory killer, but instead use the userspace solution
> today.  So trying to get another in-kernel memory killer solution added
> anywhere seems quite odd.

It's even more odd that although a userspace solution is touted as the proper
way to go on LKML, almost no Android OEMs are using it, and even in that commit
I linked in the previous message, Google made a rather large set of
modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
What's going on?

Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were
backported to 4.4, or even 3.18, would it really be used? I don't really
understand the aversion to an in-kernel memory killer on LKML despite the rest
of the industry's attraction to it. Perhaps there's some inherently great cost
in using the userspace solution that I'm unaware of?

Regardless, even if PSI were backported, a full-fledged LMKD using it has yet to
be made, so it wouldn't be of much use now.

Thanks,
Sultan

[1] https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/configs/sdm845_defconfig?h=LA.UM.7.3.r1-07400-sdm845.0#n492


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Greg Kroah-Hartman
On Tue, May 07, 2019 at 12:27:21AM -0700, Sultan Alsawaf wrote:
> On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote:
> > Um, why can't "all" Android devices take the same patches that the Pixel
> > phones are using today?  They should all be in the public android-common
> > kernel repositories that all Android devices should be syncing with on a
> > weekly/monthly basis anyway, right?
> > 
> > thanks,
> > 
> > greg k-h
> 
> Hi Greg,
> 
> I only see PSI present in the android-common kernels for 4.9 and above. The
> vast majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to
> expect OEMs to toil with backporting PSI themselves to get decent memory
> management.

Given that any "new" android device that gets shipped "soon" should be
using 4.9.y or newer, is this a real issue?

And if it is, I'm sure that asking for those patches to be backported to
4.4.y would be just fine, have you asked?

Note that I know of Android Go devices, running 3.18.y kernels, do NOT
use the in-kernel memory killer, but instead use the userspace solution
today.  So trying to get another in-kernel memory killer solution added
anywhere seems quite odd.

thanks,

greg k-h


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Sultan Alsawaf
On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote:
> Um, why can't "all" Android devices take the same patches that the Pixel
> phones are using today?  They should all be in the public android-common
> kernel repositories that all Android devices should be syncing with on a
> weekly/monthly basis anyway, right?
> 
> thanks,
> 
> greg k-h

Hi Greg,

I only see PSI present in the android-common kernels for 4.9 and above. The vast
majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to
expect OEMs to toil with backporting PSI themselves to get decent memory
management.

But even if they did backport PSI, it wouldn't help too much because a
PSI-enabled LMKD solution is not ready yet. It looks like a PSI-based LMKD is
still under heavy development and won't be ready for all Android devices for
quite some time.

Additionally, it looks like the supposedly-dead lowmemorykiller.c is still being
actively tweaked by Google [1], which does not instill confidence that a
definitive LMK solution a la PSI is coming any time soon.

Thanks,
Sultan

[1] https://android.googlesource.com/kernel/common/+/152bacdd85c46f0c76b00c4acc253e414513634c


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-07 Thread Greg Kroah-Hartman
On Mon, May 06, 2019 at 07:16:22PM -0700, Sultan Alsawaf wrote:
> This is a complete low memory killer solution for Android that is small
> and simple. Processes are killed according to the priorities that
> Android gives them, so that the least important processes are always
> killed first. Processes are killed until memory deficits are satisfied,
> as observed from kswapd struggling to free up pages. Simple LMK stops
> killing processes when kswapd finally goes back to sleep.
> 
> The only tunables are the desired amount of memory to be freed per
> reclaim event and desired frequency of reclaim events. Simple LMK tries
> to free at least the desired amount of memory per reclaim and waits
> until all of its victims' memory is freed before proceeding to kill more
> processes.
> 
> Signed-off-by: Sultan Alsawaf 
> ---
> Hello everyone,
> 
> I've addressed some of the concerns that were brought up with the first
> version of the Simple LMK patch. I understand that a kernel-based solution
> like this that contains policy decisions for a specific userspace is not the
> way to go, but the Android ecosystem still has a pressing need for a low
> memory killer that works well.
> 
> Most Android devices still use the ancient and deprecated lowmemorykiller.c
> kernel driver; Simple LMK seeks to replace that, at the very least until PSI
> and a userspace daemon utilizing PSI are ready for *all* Android devices, and not
> just the privileged Pixel phone line.

Um, why can't "all" Android devices take the same patches that the Pixel
phones are using today?  They should all be in the public android-common
kernel repositories that all Android devices should be syncing with on a
weekly/monthly basis anyway, right?

thanks,

greg k-h


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-05-06 Thread Sultan Alsawaf
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.

The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.

Signed-off-by: Sultan Alsawaf 
---
Hello everyone,

I've addressed some of the concerns that were brought up with the first version
of the Simple LMK patch. I understand that a kernel-based solution like this
that contains policy decisions for a specific userspace is not the way to go,
but the Android ecosystem still has a pressing need for a low memory killer that
works well.

Most Android devices still use the ancient and deprecated lowmemorykiller.c
kernel driver; Simple LMK seeks to replace that, at the very least until PSI and
a userspace daemon utilizing PSI are ready for *all* Android devices, and not
just the privileged Pixel phone line.

The feedback I received last time was quite helpful. This Simple LMK patch works
significantly better than the first, improving memory management by a large
margin while being immune to transient spikes in memory usage (since the signal
to start killing processes is determined by how hard kswapd tries to free up
pages, which is something that occurs over a span of time and not a single point
in time).
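
In rough pseudo-C, the control flow described above is (everything except
scan_and_kill() is illustrative, not taken verbatim from the patch):

	/* invoked when kswapd is observed struggling to free pages */
	while (kswapd_still_struggling()) {
		/* kill lowest-priority tasks until MINFREE MiB are covered */
		scan_and_kill(CONFIG_ANDROID_SIMPLE_LMK_MINFREE << (20 - PAGE_SHIFT));
		/* then block until every victim's memory is actually freed */
	}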

I'd love to hear some feedback on this new patch. I do encourage those who are
interested to take it for a spin on an Android device. This patch has been
tested successfully on Android 4.4 and 4.9 kernels. For the sake of review here,
I have adapted the patch to 5.1.

Thanks,
Sultan

 drivers/android/Kconfig      |  33 +
 drivers/android/Makefile     |   1 +
 drivers/android/simple_lmk.c | 315 +++
 include/linux/mm_types.h     |   4 +
 include/linux/simple_lmk.h   |  11 ++
 kernel/fork.c                |  13 ++
 mm/vmscan.c                  |  12 ++
 7 files changed, 389 insertions(+)
 create mode 100644 drivers/android/simple_lmk.c
 create mode 100644 include/linux/simple_lmk.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 6fdf2abe4..bdd429338 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -54,6 +54,39 @@ config ANDROID_BINDER_IPC_SELFTEST
  exhaustively with combinations of various buffer sizes and
  alignments.
 
+config ANDROID_SIMPLE_LMK
+   bool "Simple Android Low Memory Killer"
+   depends on !ANDROID_LOW_MEMORY_KILLER && !MEMCG
+   ---help---
+ This is a complete low memory killer solution for Android that is
+ small and simple. Processes are killed according to the priorities
+ that Android gives them, so that the least important processes are
+ always killed first. Processes are killed until memory deficits are
+ satisfied, as observed from kswapd struggling to free up pages. Simple
+ LMK stops killing processes when kswapd finally goes back to sleep.
+
+if ANDROID_SIMPLE_LMK
+
+config ANDROID_SIMPLE_LMK_AGGRESSION
+   int "Reclaim frequency selection"
+   range 1 3
+   default 1
+   help
+ This value determines how frequently Simple LMK will perform memory
+ reclaims. A lower value corresponds to less frequent reclaims, which
+ maximizes memory usage. The range of values has a logarithmic
+ correlation; 2 is twice as aggressive as 1, and 3 is twice as
+ aggressive as 2, which makes 3 four times as aggressive as 1.
+
+config ANDROID_SIMPLE_LMK_MINFREE
+   int "Minimum MiB of memory to free per reclaim"
+   range 8 512
+   default 64
+   help
+ Simple LMK will try to free at least this much memory per reclaim.
+
+endif
+
 endif # if ANDROID
 
 endmenu
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index c7856e320..7c91293b6 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -3,3 +3,4 @@ ccflags-y += -I$(src)   # needed for trace events
 obj-$(CONFIG_ANDROID_BINDERFS) += binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)   += binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
+obj-$(CONFIG_ANDROID_SIMPLE_LMK)   += simple_lmk.o
diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
new file mode 100644
index 0..a2ffb57b5
--- /dev/null
+++ b/drivers/android/simple_lmk.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Sultan Alsawaf .
+ */
+
+#define pr_fmt(fmt) 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Joel Fernandes
On Wed, Mar 20, 2019 at 12:10:23AM +0100, Christian Brauner wrote:
> On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner  
> > wrote:
> > > So I dislike the idea of allocating new inodes from the procfs super
> > > block. I would like to avoid pinning the whole pidfd concept exclusively
> > > to proc. The idea is that the pidfd API will be useable through procfs
> > > via open("/proc/") because that is what users expect and really
> > > wanted to have for a long time. So it makes sense to have this working.
> > > But it should really be useable without it. That's why translate_pid()
> > > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > > though that's crazy - and still be able to use pidfds. This is also a
> > > point akpm asked about when I did the pidfd_send_signal work.
> > 
> > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> > crazy idea that I was discussing with Joel the other day is to just
> > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> > system call that returned, out of thin air and independent of the
> > mount table, a procfs root directory file descriptor for the caller's
> > PID namespace and suitable for use with openat(2).
> 
> Even if this works I'm pretty sure that Al and a lot of others will not
> be happy about this. A syscall to get an fd to /proc? That's not going
> to happen and I don't see the need for a separate syscall just for that.
> (I do see the point of making CONFIG_PROCFS=y the default btw.)

I think his point here was that he wanted a handle to procfs no matter where
it was mounted and then can later use openat on that. Agreed that it may be
unnecessary unless there is a use case for it, and especially if the /proc
directory being the de facto mount point for procfs is a universal convention.

> Inode allocation from the procfs mount for the file descriptors Joel
> wants is not correct. They're not really procfs file descriptors, so this
> is a nack. We can't just hook into proc that way.

I was not particular about using procfs mount for the FDs but that's the only
way I knew how to do it until you pointed out anon_inode (my grep skills
missed that), so thank you!

> > C'mon: /proc is used by everyone today and almost every program breaks
> > if it's not around. The string "/proc" is already de facto kernel ABI.
> > Let's just drop the pretense of /proc being optional and bake it into
> > the kernel proper, then give programs a way to get to /proc that isn't
> > tied to any particular mount configuration. This way, we don't need a
> > translate_pid(), since callers can just use procfs to do the same
> > thing. (That is, if I understand correctly what translate_pid does.)
> 
> I'm not sure what you think translate_pid() is doing since you're not
> saying what you think it does.
> Examples from the old patchset:
> translate_pid(pid, ns, -1)  - get pid in our pid namespace
> translate_pid(pid, -1, ns)  - get pid in other pid namespace
> translate_pid(1, ns, -1)- get pid of init task for namespace
> translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
> translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
> translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
> translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2?
> 
> Allowing this syscall to yield pidfds as proper regular file fds and
> also taking pidfds as argument is an excellent way to kill a few
> problems at once:
> - cheap pid namespace introspection
> - creates a bridge between the "old" pid-based api and the "new" pidfd api

This second point would solve the problem of getting a new pidfd given a pid
indeed, without depending on /proc/<pid> at all. So kudos for that and I am
glad you are making it return pidfds (but correct me if I misunderstood what
you're planning to do with translate_fd). It also obviates any need for
dealing with procfs mount points.

> - allows us to get proper non-directory file descriptors for any pids we
>   like

Here I got a bit lost. AIUI pidfd is a directory fd. Why would we want it to
not be a directory fd? That would be ambiguous with what pidfd_send_signal
expects.

Also would it be a bad idea to extend translate_pid to also do what we want
for the pidfd_wait syscall?  So translate_fd in this case would return an fd
that is just used for the pid's death status.

All of these extensions seem to mean translate_pid should probably take a
fourth parameter that tells it the target translation type?

The way I am hypothesizing translate_pid, it should probably be
- translation to a pid in some ns
- translation of a pid to a pidfd
- translation of a pid to a "wait" fd which returns the death/reap process 
status.

If that makes sense, that would also avoid the need for a new syscall we are adding.
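
In userspace terms, such a four-argument translate_pid() might look like this
(entirely hypothetical; the TRANSLATE_* constants are made up to illustrate
the three translation types listed above):

	int pid2  = translate_pid(pid, src_ns_fd, dst_ns_fd, TRANSLATE_PID);	/* pid in other ns */
	int pidfd = translate_pid(pid, src_ns_fd, -1, TRANSLATE_PIDFD);		/* pid -> pidfd */
	int wfd   = translate_pid(pid, src_ns_fd, -1, TRANSLATE_WAITFD);	/* death-status fd */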

> The additional advantage is that people are already happy to add this
> syscall so simply extending it and routing it through the pidfd tree or
> Eric's tree is reasonable.

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Christian Brauner
On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote:
> On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner  
> wrote:
> > So I dislike the idea of allocating new inodes from the procfs super
> > block. I would like to avoid pinning the whole pidfd concept exclusively
> > to proc. The idea is that the pidfd API will be useable through procfs
> > via open("/proc/") because that is what users expect and really
> > wanted to have for a long time. So it makes sense to have this working.
> > But it should really be useable without it. That's why translate_pid()
> > and pidfd_clone() are on the table.  What I'm saying is, once the pidfd
> > api is "complete" you should be able to set CONFIG_PROCFS=N - even
> > though that's crazy - and still be able to use pidfds. This is also a
> > point akpm asked about when I did the pidfd_send_signal work.
> 
> I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
> crazy idea that I was discussing with Joel the other day is to just
> make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
> system call that returned, out of thin air and independent of the
> mount table, a procfs root directory file descriptor for the caller's
> PID namespace and suitable for use with openat(2).

Even if this works I'm pretty sure that Al and a lot of others will not
be happy about this. A syscall to get an fd to /proc? That's not going
to happen and I don't see the need for a separate syscall just for that.
(I do see the point of making CONFIG_PROCFS=y the default btw.)

Inode allocation from the procfs mount for the file descriptors Joel
wants is not correct. They're not really procfs file descriptors, so this
is a nack. We can't just hook into proc that way.

> 
> C'mon: /proc is used by everyone today and almost every program breaks
> if it's not around. The string "/proc" is already de facto kernel ABI.
> Let's just drop the pretense of /proc being optional and bake it into
> the kernel proper, then give programs a way to get to /proc that isn't
> tied to any particular mount configuration. This way, we don't need a
> translate_pid(), since callers can just use procfs to do the same
> thing. (That is, if I understand correctly what translate_pid does.)

I'm not sure what you think translate_pid() is doing since you're not
saying what you think it does.
Examples from the old patchset:
translate_pid(pid, ns, -1)  - get pid in our pid namespace
translate_pid(pid, -1, ns)  - get pid in other pid namespace
translate_pid(1, ns, -1)- get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2?

Allowing this syscall to yield pidfds as proper regular file fds and
also taking pidfds as argument is an excellent way to kill a few
problems at once:
- cheap pid namespace introspection
- creates a bridge between the "old" pid-based api and the "new" pidfd api
- allows us to get proper non-directory file descriptors for any pids we
  like

The additional advantage is that people are already happy to add this
syscall so simply extending it and routing it through the pidfd tree or
Eric's tree is reasonable. (It should probably grow a flag argument. I
need to start prototyping this.)

> 
> We still need a pidfd_clone() for atomicity reasons, but that's a
> separate story. My goal is to be able to write a library that

Yes, on my todo list, and I have a ported patch based on prior work
rotting somewhere on my git server.

> transparently creates and manages a helper child process even in a
> "hostile" process environment in which some other uncoordinated thread
> is constantly doing a waitpid(-1) (e.g., the JVM).
> 
> > So instead of going through proc we should probably do what David has
> > been doing in the mount API and come to rely on anone_inode. So
> > something like:
> >
> > fd = anon_inode_getfd("pidfd", _fops, file_priv_data, flags);
> >
> > and stash information such as pid namespace etc. in a pidfd struct or
> > something that we then can stash in file->private_data of the new file.
> > This also lets us avoid all this open coding done here.
> > Another advantage is that anon_inodes is its own kernel-internal
> > filesystem.
> 
> Sure. That works too.

Great.
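
A minimal sketch of the anon_inode shape being agreed on here, with all names
hypothetical rather than taken from an actual patch:

	struct pidfd_priv {
		struct pid *pid;
		struct pid_namespace *ns;
	};

	static const struct file_operations pidfd_fops = {
		/* poll/read/release for the pidfd would go here */
	};

	int pidfd_create(struct pid *pid, struct pid_namespace *ns, int flags)
	{
		struct pidfd_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

		if (!priv)
			return -ENOMEM;
		priv->pid = get_pid(pid);	/* keep the struct pid pinned */
		priv->ns = ns;
		return anon_inode_getfd("pidfd", &pidfd_fops, priv, flags);
	}

anon_inode_getfd() stashes the third argument in file->private_data, which is
exactly where the pid/namespace information would live.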


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Joel Fernandes
On Tue, Mar 19, 2019 at 11:14:17PM +0100, Christian Brauner wrote:
[snip] 
> > 
> > ---8<---
> > 
> > From: Joel Fernandes 
> > Subject: [PATCH] Partial skeleton prototype of pidfd_wait frontend
> > 
> > Signed-off-by: Joel Fernandes 
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/syscalls.h   |  1 +
> >  include/uapi/asm-generic/unistd.h  |  4 +-
> >  kernel/signal.c| 62 ++
> >  kernel/sys_ni.c|  3 ++
> >  6 files changed, 71 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 1f9607ed087c..2a63f1896b63 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -433,3 +433,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register	__ia32_sys_io_uring_register
> > +428	i386	pidfd_wait		sys_pidfd_wait		__ia32_sys_pidfd_wait
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 92ee0b4378d4..cf2e08a8053b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -349,6 +349,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	pidfd_wait		__x64_sys_pidfd_wait
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..62160970ed3f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -988,6 +988,7 @@ asmlinkage long sys_rseq(struct rseq __user *rseq, 
> > uint32_t rseq_len,
> >  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
> >siginfo_t __user *info,
> >unsigned int flags);
> > +asmlinkage long sys_pidfd_wait(int pidfd);
> >  
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h 
> > b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..137aa8662230 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_wait 428
> > +__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
> >  
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >  
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index b7953934aa99..ebb550b87044 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3550,6 +3550,68 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> > return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > +static ssize_t pidfd_wait_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > +   /*
> > +* This is just a test string, it will contain the actual
> > +* status of the pidfd in the future.
> > +*/
> > +   char buf[] = "status";
> > +
> > +   return copy_to_iter(buf, strlen(buf)+1, to);
> > +}
> > +
> > +static const struct file_operations pidfd_wait_file_ops = {
> > +   .read_iter  = pidfd_wait_read_iter,
> > +};
> > +
> > +static struct inode *pidfd_wait_get_inode(struct super_block *sb)
> > +{
> > +   struct inode *inode = new_inode(sb);
> > +
> > +   inode->i_ino = get_next_ino();
> > +   inode_init_owner(inode, NULL, S_IFREG);
> > +
> > +   inode->i_op = &simple_dir_inode_operations;
> > +   inode->i_fop= &pidfd_wait_file_ops;
> > +
> > +   return inode;
> > +}
> > +
> > +SYSCALL_DEFINE1(pidfd_wait, int, pidfd)
> > +{
> > +   struct fd f;
> > +   struct inode *inode;
> > +   struct file *file;
> > +   int new_fd;
> > +   struct pid_namespace *pid_ns;
> > +   struct super_block *sb;
> > +   struct vfsmount *mnt;
> > +
> > +   f = fdget_raw(pidfd);
> > +   if (!f.file)
> > +   return -EBADF;
> > +
> > +   sb = file_inode(f.file)->i_sb;
> > +   pid_ns = sb->s_fs_info;
> > +
> > +   inode = pidfd_wait_get_inode(sb);
> > +
> > +   mnt = pid_ns->proc_mnt;
> > +
> > +   file = alloc_file_pseudo(inode, mnt, 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-19 Thread Christian Brauner
On Mon, Mar 18, 2019 at 07:50:52PM -0400, Joel Fernandes wrote:
> On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote:
> > On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote:
> > > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner  
> > > wrote:
> > > >
> > > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner 
> > > > > > > > > > wrote:
> > > > > > > > > > [..]
> > > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) 
> > > > > > > > > > > > though? Why not just use
> > > > > > > > > > > > standard poll/epoll interface on the proc fd like 
> > > > > > > > > > > > Daniel was suggesting.
> > > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > > > > essentially pinned
> > > > > > > > > > > > even though the proc number may be reused. Then the 
> > > > > > > > > > > > caller can just poll.
> > > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any 
> > > > > > > > > > > > waiters on process
> > > > > > > > > > > > death (A quick look shows task_struct can be mapped to 
> > > > > > > > > > > > its struct pid) and
> > > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. 
> > > > > > > > > > > > No new syscall is
> > > > > > > > > > > > needed then, let me know if I missed something?
> > > > > > > > > > >
> > > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll 
> > > > > > > > > > > solution?
> > > > > > > > > >
> > > > > > > > > > Hmm, going through earlier threads, I believe so now. Here 
> > > > > > > > > > was Daniel's
> > > > > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > > > > through proc
> > > > > > > > > > directory fd: 
> > > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > > > > >
> > > > > > > > > > May be a dedicated syscall for this would be cleaner after 
> > > > > > > > > > all.
> > > > > > > > >
> > > > > > > > > Ah, I wish I've seen that discussion before...
> > > > > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > > > > select/poll/epoll if we use eventfd.
> > > > > > > >
> > > > > > > > Thanks for taking a look.
> > > > > > > >
> > > > > > > > > I would strongly advocate for
> > > > > > > > > non-blocking version or at least to have a non-blocking 
> > > > > > > > > option.
> > > > > > > >
> > > > > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > > > > according to the caller's desire --- users can pass options 
> > > > > > > > they want
> > > > > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > > > > configuration knob or non-blocking option. We already *have* a
> > > > > > > > non-blocking option that works universally for everything.
> > > > > > > >
> > > > > > > > As I mentioned in the linked thread, waiting for process exit 
> > > > > > > > should
> > > > > > > > work just like waiting for bytes to appear on a pipe. Process 
> > > > > > > > exit
> > > > > > > > status is just another blob of bytes that a process might 
> > > > > > > > receive. A
> > > > > > > > process exit handle ought to be just another information 
> > > > > > > > source. The
> > > > > > > > reason the unix process API is so awful is that for whatever 
> > > > > > > > reason
> > > > > > > > the original designers treated processes as some kind of 
> > > > > > > > special kind
> > > > > > > > of resource instead of fitting them into the otherwise 
> > > > > > > > general-purpose
> > > > > > > > unix data-handling API. Let's not repeat that mistake.
> > > > > > > >
> > > > > > > > > Something like this:
> > > > > > > > >
> > > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > > > > > // register eventfd to receive death notification
> > > > > > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > > > > > // kill the process
> > > > > > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > > > > > // tend to other things
> > > > > > > >
> > > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not 
> > > > > > > > wire up
> > > > > > > > an eventfd.
> > > > > > > >
> > > > > >
> > > > > > Ok, I probably misunderstood your post linked by Joel. I though your
> > > > > > original proposal was based on being able to poll a file under
> > > > > > /proc/pid and then you changed your mind to have a separate syscall
> > > > 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-18 Thread Joel Fernandes
On Mon, Mar 18, 2019 at 01:29:51AM +0100, Christian Brauner wrote:
> On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote:
> > On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner  
> > wrote:
> > >
> > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner 
> > > > > > > > > wrote:
> > > > > > > > > [..]
> > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? 
> > > > > > > > > > > Why not just use
> > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel 
> > > > > > > > > > > was suggesting.
> > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > > > essentially pinned
> > > > > > > > > > > even though the proc number may be reused. Then the 
> > > > > > > > > > > caller can just poll.
> > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any 
> > > > > > > > > > > waiters on process
> > > > > > > > > > > death (A quick look shows task_struct can be mapped to 
> > > > > > > > > > > its struct pid) and
> > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No 
> > > > > > > > > > > new syscall is
> > > > > > > > > > > needed then, let me know if I missed something?
> > > > > > > > > >
> > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll 
> > > > > > > > > > solution?
> > > > > > > > >
> > > > > > > > > Hmm, going through earlier threads, I believe so now. Here 
> > > > > > > > > was Daniel's
> > > > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > > > through proc
> > > > > > > > > directory fd: 
> > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > > > >
> > > > > > > > > May be a dedicated syscall for this would be cleaner after 
> > > > > > > > > all.
> > > > > > > >
> > > > > > > > Ah, I wish I've seen that discussion before...
> > > > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > > > select/poll/epoll if we use eventfd.
> > > > > > >
> > > > > > > Thanks for taking a look.
> > > > > > >
> > > > > > > > I would strongly advocate for
> > > > > > > > non-blocking version or at least to have a non-blocking option.
> > > > > > >
> > > > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > > > according to the caller's desire --- users can pass options they 
> > > > > > > want
> > > > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > > > configuration knob or non-blocking option. We already *have* a
> > > > > > > non-blocking option that works universally for everything.
> > > > > > >
> > > > > > > As I mentioned in the linked thread, waiting for process exit 
> > > > > > > should
> > > > > > > work just like waiting for bytes to appear on a pipe. Process exit
> > > > > > > status is just another blob of bytes that a process might 
> > > > > > > receive. A
> > > > > > > process exit handle ought to be just another information source. 
> > > > > > > The
> > > > > > > reason the unix process API is so awful is that for whatever 
> > > > > > > reason
> > > > > > > the original designers treated processes as some kind of special 
> > > > > > > kind
> > > > > > > of resource instead of fitting them into the otherwise 
> > > > > > > general-purpose
> > > > > > > unix data-handling API. Let's not repeat that mistake.
> > > > > > >
> > > > > > > > Something like this:
> > > > > > > >
> > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > > > > // register eventfd to receive death notification
> > > > > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > > > > // kill the process
> > > > > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > > > > // tend to other things
> > > > > > >
> > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire 
> > > > > > > up
> > > > > > > an eventfd.
> > > > > > >
> > > > >
> > > > > Ok, I probably misunderstood your post linked by Joel. I though your
> > > > > original proposal was based on being able to poll a file under
> > > > > /proc/pid and then you changed your mind to have a separate syscall
> > > > > which I assumed would be a blocking one to wait for process exit.
> > > > > Maybe you can describe the new interface you are thinking about in
> > > > > terms of userspace usage like I did above? Several lines of code would
> > > > > explain more than paragraphs of text.
> > > >
> > > > 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-17 Thread Christian Brauner
On Sun, Mar 17, 2019 at 08:40:19AM -0700, Daniel Colascione wrote:
> On Sun, Mar 17, 2019 at 4:42 AM Christian Brauner  
> wrote:
> >
> > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner 
> > > >  wrote:
> > > > >
> > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner 
> > > > > > > > wrote:
> > > > > > > > [..]
> > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? 
> > > > > > > > > > Why not just use
> > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel 
> > > > > > > > > > was suggesting.
> > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > > essentially pinned
> > > > > > > > > > even though the proc number may be reused. Then the caller 
> > > > > > > > > > can just poll.
> > > > > > > > > > We can add a waitqueue to struct pid, and wake up any 
> > > > > > > > > > waiters on process
> > > > > > > > > > death (A quick look shows task_struct can be mapped to its 
> > > > > > > > > > struct pid) and
> > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No 
> > > > > > > > > > new syscall is
> > > > > > > > > > needed then, let me know if I missed something?
> > > > > > > > >
> > > > > > > > > Huh, I thought that Daniel was against the poll/epoll 
> > > > > > > > > solution?
> > > > > > > >
> > > > > > > > Hmm, going through earlier threads, I believe so now. Here was 
> > > > > > > > Daniel's
> > > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > > through proc
> > > > > > > > directory fd: 
> > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > > >
> > > > > > > > May be a dedicated syscall for this would be cleaner after all.
> > > > > > >
> > > > > > > Ah, I wish I've seen that discussion before...
> > > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > > select/poll/epoll if we use eventfd.
> > > > > >
> > > > > > Thanks for taking a look.
> > > > > >
> > > > > > > I would strongly advocate for
> > > > > > > non-blocking version or at least to have a non-blocking option.
> > > > > >
> > > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > > according to the caller's desire --- users can pass options they 
> > > > > > want
> > > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > > configuration knob or non-blocking option. We already *have* a
> > > > > > non-blocking option that works universally for everything.
> > > > > >
> > > > > > As I mentioned in the linked thread, waiting for process exit should
> > > > > > work just like waiting for bytes to appear on a pipe. Process exit
> > > > > > status is just another blob of bytes that a process might receive. A
> > > > > > process exit handle ought to be just another information source. The
> > > > > > reason the unix process API is so awful is that for whatever reason
> > > > > > the original designers treated processes as some kind of special 
> > > > > > kind
> > > > > > of resource instead of fitting them into the otherwise 
> > > > > > general-purpose
> > > > > > unix data-handling API. Let's not repeat that mistake.
> > > > > >
> > > > > > > Something like this:
> > > > > > >
> > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > > > // register eventfd to receive death notification
> > > > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > > > // kill the process
> > > > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > > > // tend to other things
> > > > > >
> > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > > > > an eventfd.
> > > > > >
> > > >
> > > > Ok, I probably misunderstood your post linked by Joel. I though your
> > > > original proposal was based on being able to poll a file under
> > > > /proc/pid and then you changed your mind to have a separate syscall
> > > > which I assumed would be a blocking one to wait for process exit.
> > > > Maybe you can describe the new interface you are thinking about in
> > > > terms of userspace usage like I did above? Several lines of code would
> > > > explain more than paragraphs of text.
> > >
> > > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The
> > > idea from Daniel here is to wait for process death and exit events by just
> > > referring to a stable fd, independent of whatever is going on in /proc.
> > >
> > > What is needed is something like this (in highly pseudo-code form):
> > >
> > > pidfd = opendir("/proc/<pid>",..);

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-17 Thread Serge E. Hallyn
On Sun, Mar 17, 2019 at 10:11:10AM -0700, Daniel Colascione wrote:
> On Sun, Mar 17, 2019 at 9:35 AM Serge E. Hallyn  wrote:
> >
> > On Sun, Mar 17, 2019 at 12:42:40PM +0100, Christian Brauner wrote:
> > > On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > > > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner 
> > > > > > > > > wrote:
> > > > > > > > > [..]
> > > > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? 
> > > > > > > > > > > Why not just use
> > > > > > > > > > > standard poll/epoll interface on the proc fd like Daniel 
> > > > > > > > > > > was suggesting.
> > > > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > > > essentially pinned
> > > > > > > > > > > even though the proc number may be reused. Then the 
> > > > > > > > > > > caller can just poll.
> > > > > > > > > > > We can add a waitqueue to struct pid, and wake up any 
> > > > > > > > > > > waiters on process
> > > > > > > > > > > death (A quick look shows task_struct can be mapped to 
> > > > > > > > > > > its struct pid) and
> > > > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No 
> > > > > > > > > > > new syscall is
> > > > > > > > > > > needed then, let me know if I missed something?
> > > > > > > > > >
> > > > > > > > > > Huh, I thought that Daniel was against the poll/epoll 
> > > > > > > > > > solution?
> > > > > > > > >
> > > > > > > > > Hmm, going through earlier threads, I believe so now. Here 
> > > > > > > > > was Daniel's
> > > > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > > > through proc
> > > > > > > > > directory fd: 
> > > > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > > > >
> > > > > > > > > May be a dedicated syscall for this would be cleaner after 
> > > > > > > > > all.
> > > > > > > >
> > > > > > > > Ah, I wish I've seen that discussion before...
> > > > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > > > select/poll/epoll if we use eventfd.
> > > > > > >
> > > > > > > Thanks for taking a look.
> > > > > > >
> > > > > > > > I would strongly advocate for
> > > > > > > > non-blocking version or at least to have a non-blocking option.
> > > > > > >
> > > > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > > > according to the caller's desire --- users can pass options they 
> > > > > > > want
> > > > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > > > configuration knob or non-blocking option. We already *have* a
> > > > > > > non-blocking option that works universally for everything.
> > > > > > >
> > > > > > > As I mentioned in the linked thread, waiting for process exit 
> > > > > > > should
> > > > > > > work just like waiting for bytes to appear on a pipe. Process exit
> > > > > > > status is just another blob of bytes that a process might 
> > > > > > > receive. A
> > > > > > > process exit handle ought to be just another information source. 
> > > > > > > The
> > > > > > > reason the unix process API is so awful is that for whatever 
> > > > > > > reason
> > > > > > > the original designers treated processes as some kind of special 
> > > > > > > kind
> > > > > > > of resource instead of fitting them into the otherwise 
> > > > > > > general-purpose
> > > > > > > unix data-handling API. Let's not repeat that mistake.
> > > > > > >
> > > > > > > > Something like this:
> > > > > > > >
> > > > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > > > > // register eventfd to receive death notification
> > > > > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > > > > // kill the process
> > > > > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > > > > // tend to other things
> > > > > > >
> > > > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire 
> > > > > > > up
> > > > > > > an eventfd.
> > > > > > >
> > > > >
> > > > > Ok, I probably misunderstood your post linked by Joel. I though your
> > > > > original proposal was based on being able to poll a file under
> > > > > /proc/pid and then you changed your mind to have a separate syscall
> > > > > which I assumed would be a blocking one to wait for process exit.
> > > > > Maybe you can describe the new interface you are thinking about in
> > > > > terms of userspace usage like I did above? Several lines of code would
> > > > > explain more than paragraphs of text.
> > > >
> > > > Hey, Thanks 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-17 Thread Serge E. Hallyn
On Sun, Mar 17, 2019 at 12:42:40PM +0100, Christian Brauner wrote:
> On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> > On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner  
> > > wrote:
> > > >
> > > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > > > > > [..]
> > > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why 
> > > > > > > > > not just use
> > > > > > > > > standard poll/epoll interface on the proc fd like Daniel was 
> > > > > > > > > suggesting.
> > > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > > essentially pinned
> > > > > > > > > even though the proc number may be reused. Then the caller 
> > > > > > > > > can just poll.
> > > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters 
> > > > > > > > > on process
> > > > > > > > > death (A quick look shows task_struct can be mapped to its 
> > > > > > > > > struct pid) and
> > > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new 
> > > > > > > > > syscall is
> > > > > > > > > needed then, let me know if I missed something?
> > > > > > > >
> > > > > > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > > > > > >
> > > > > > > Hmm, going through earlier threads, I believe so now. Here was 
> > > > > > > Daniel's
> > > > > > > reasoning about avoiding a notification about process death 
> > > > > > > through proc
> > > > > > > directory fd: 
> > > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > > >
> > > > > > > May be a dedicated syscall for this would be cleaner after all.
> > > > > >
> > > > > > Ah, I wish I've seen that discussion before...
> > > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > > select/poll/epoll if we use eventfd.
> > > > >
> > > > > Thanks for taking a look.
> > > > >
> > > > > > I would strongly advocate for
> > > > > > non-blocking version or at least to have a non-blocking option.
> > > > >
> > > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > > according to the caller's desire --- users can pass options they want
> > > > > to poll(2) or whatever. There's no need for any kind of special
> > > > > configuration knob or non-blocking option. We already *have* a
> > > > > non-blocking option that works universally for everything.
> > > > >
> > > > > As I mentioned in the linked thread, waiting for process exit should
> > > > > work just like waiting for bytes to appear on a pipe. Process exit
> > > > > status is just another blob of bytes that a process might receive. A
> > > > > process exit handle ought to be just another information source. The
> > > > > reason the unix process API is so awful is that for whatever reason
> > > > > the original designers treated processes as some kind of special kind
> > > > > of resource instead of fitting them into the otherwise general-purpose
> > > > > unix data-handling API. Let's not repeat that mistake.
> > > > >
> > > > > > Something like this:
> > > > > >
> > > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > > // register eventfd to receive death notification
> > > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > > // kill the process
> > > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > > // tend to other things
> > > > >
> > > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > > > an eventfd.
> > > > >
> > > 
> > > Ok, I probably misunderstood your post linked by Joel. I thought your
> > > original proposal was based on being able to poll a file under
> > > /proc/pid and then you changed your mind to have a separate syscall
> > > which I assumed would be a blocking one to wait for process exit.
> > > Maybe you can describe the new interface you are thinking about in
> > > terms of userspace usage like I did above? Several lines of code would
> > > explain more than paragraphs of text.
> > 
> > Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The 
> > idea
> > from Daniel here is to wait for process death and exit events by just
> > referring to a stable fd, independent of whatever is going on in /proc.
> > 
> > What is needed is something like this (in highly pseudo-code form):
> > 
> > pidfd = opendir("/proc/",..);
> > wait_fd = pidfd_wait(pidfd);
> > read or poll wait_fd (non-blocking or blocking whichever)
> > 
> > wait_fd will block until the task has either died or been reaped. In both these
> > cases, it can return a suitable string such as "dead" or "reaped" although 
> > an
> > integer with some predefined meaning is also Ok.
> 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-17 Thread Christian Brauner
On Sat, Mar 16, 2019 at 09:53:06PM -0400, Joel Fernandes wrote:
> On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> > On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner  
> > wrote:
> > >
> > > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan  
> > > > wrote:
> > > > >
> > > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > > > > [..]
> > > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why 
> > > > > > > > not just use
> > > > > > > > standard poll/epoll interface on the proc fd like Daniel was 
> > > > > > > > suggesting.
> > > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > > essentially pinned
> > > > > > > > even though the proc number may be reused. Then the caller can 
> > > > > > > > just poll.
> > > > > > > > We can add a waitqueue to struct pid, and wake up any waiters 
> > > > > > > > on process
> > > > > > > > death (A quick look shows task_struct can be mapped to its 
> > > > > > > > struct pid) and
> > > > > > > > also possibly optimize it using Steve's TIF flag idea. No new 
> > > > > > > > syscall is
> > > > > > > > needed then, let me know if I missed something?
> > > > > > >
> > > > > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > > > > >
> > > > > > Hmm, going through earlier threads, I believe so now. Here was 
> > > > > > Daniel's
> > > > > > reasoning about avoiding a notification about process death through 
> > > > > > proc
> > > > > > directory fd: 
> > > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > > >
> > > > > > Maybe a dedicated syscall for this would be cleaner after all.
> > > > >
> > > > > Ah, I wish I'd seen that discussion before...
> > > > > syscall makes sense and it can be non-blocking and we can use
> > > > > select/poll/epoll if we use eventfd.
> > > >
> > > > Thanks for taking a look.
> > > >
> > > > > I would strongly advocate for
> > > > > non-blocking version or at least to have a non-blocking option.
> > > >
> > > > Waiting for FD readiness is *already* blocking or non-blocking
> > > > according to the caller's desire --- users can pass options they want
> > > > to poll(2) or whatever. There's no need for any kind of special
> > > > configuration knob or non-blocking option. We already *have* a
> > > > non-blocking option that works universally for everything.
> > > >
> > > > As I mentioned in the linked thread, waiting for process exit should
> > > > work just like waiting for bytes to appear on a pipe. Process exit
> > > > status is just another blob of bytes that a process might receive. A
> > > > process exit handle ought to be just another information source. The
> > > > reason the unix process API is so awful is that for whatever reason
> > > > the original designers treated processes as a special kind of resource
> > > > instead of fitting them into the otherwise general-purpose
> > > > unix data-handling API. Let's not repeat that mistake.
> > > >
> > > > > Something like this:
> > > > >
> > > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > > // register eventfd to receive death notification
> > > > > pidfd_wait(pid_to_kill, evfd);
> > > > > // kill the process
> > > > > pidfd_send_signal(pid_to_kill, ...)
> > > > > // tend to other things
> > > >
> > > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > > an eventfd.
> > > >
> > 
> > Ok, I probably misunderstood your post linked by Joel. I thought your
> > original proposal was based on being able to poll a file under
> > /proc/pid and then you changed your mind to have a separate syscall
> > which I assumed would be a blocking one to wait for process exit.
> > Maybe you can describe the new interface you are thinking about in
> > terms of userspace usage like I did above? Several lines of code would
> > explain more than paragraphs of text.
> 
> Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea
> from Daniel here is to wait for process death and exit events by just
> referring to a stable fd, independent of whatever is going on in /proc.
> 
> What is needed is something like this (in highly pseudo-code form):
> 
> pidfd = opendir("/proc/",..);
> wait_fd = pidfd_wait(pidfd);
> read or poll wait_fd (non-blocking or blocking whichever)
> 
> wait_fd will block until the task has either died or been reaped. In both these
> cases, it can return a suitable string such as "dead" or "reaped" although an
> integer with some predefined meaning is also Ok.
> 
> What that guarantees is, even if the task's PID has been reused, or the task
> has already died or already died + reaped, all of these events cannot race
> with the code above and the information passed to the user is race-free and
> stable / guaranteed.
> 
> An eventfd 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-16 Thread Joel Fernandes
On Sat, Mar 16, 2019 at 12:37:18PM -0700, Suren Baghdasaryan wrote:
> On Sat, Mar 16, 2019 at 11:57 AM Christian Brauner  
> wrote:
> >
> > On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> > > On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes 
> > > >  wrote:
> > > > >
> > > > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > > > [..]
> > > > > > > why do we want to add a new syscall (pidfd_wait) though? Why not 
> > > > > > > just use
> > > > > > > standard poll/epoll interface on the proc fd like Daniel was 
> > > > > > > suggesting.
> > > > > > > AFAIK, once the proc file is opened, the struct pid is 
> > > > > > > essentially pinned
> > > > > > > even though the proc number may be reused. Then the caller can 
> > > > > > > just poll.
> > > > > > > We can add a waitqueue to struct pid, and wake up any waiters on 
> > > > > > > process
> > > > > > > death (A quick look shows task_struct can be mapped to its struct 
> > > > > > > pid) and
> > > > > > > also possibly optimize it using Steve's TIF flag idea. No new 
> > > > > > > syscall is
> > > > > > > needed then, let me know if I missed something?
> > > > > >
> > > > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > > > >
> > > > > Hmm, going through earlier threads, I believe so now. Here was 
> > > > > Daniel's
> > > > > reasoning about avoiding a notification about process death through 
> > > > > proc
> > > > > directory fd: 
> > > > > http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > > > >
> > > > > Maybe a dedicated syscall for this would be cleaner after all.
> > > >
> > > > Ah, I wish I'd seen that discussion before...
> > > > syscall makes sense and it can be non-blocking and we can use
> > > > select/poll/epoll if we use eventfd.
> > >
> > > Thanks for taking a look.
> > >
> > > > I would strongly advocate for
> > > > non-blocking version or at least to have a non-blocking option.
> > >
> > > Waiting for FD readiness is *already* blocking or non-blocking
> > > according to the caller's desire --- users can pass options they want
> > > to poll(2) or whatever. There's no need for any kind of special
> > > configuration knob or non-blocking option. We already *have* a
> > > non-blocking option that works universally for everything.
> > >
> > > As I mentioned in the linked thread, waiting for process exit should
> > > work just like waiting for bytes to appear on a pipe. Process exit
> > > status is just another blob of bytes that a process might receive. A
> > > process exit handle ought to be just another information source. The
> > > reason the unix process API is so awful is that for whatever reason
> > > the original designers treated processes as a special kind of resource
> > > instead of fitting them into the otherwise general-purpose
> > > unix data-handling API. Let's not repeat that mistake.
> > >
> > > > Something like this:
> > > >
> > > > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > > > // register eventfd to receive death notification
> > > > pidfd_wait(pid_to_kill, evfd);
> > > > // kill the process
> > > > pidfd_send_signal(pid_to_kill, ...)
> > > > // tend to other things
> > >
> > > Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> > > an eventfd.
> > >
> 
> Ok, I probably misunderstood your post linked by Joel. I thought your
> original proposal was based on being able to poll a file under
> /proc/pid and then you changed your mind to have a separate syscall
> which I assumed would be a blocking one to wait for process exit.
> Maybe you can describe the new interface you are thinking about in
> terms of userspace usage like I did above? Several lines of code would
> explain more than paragraphs of text.

Hey, Thanks Suren for the eventfd idea. I agree with Daniel on this. The idea
from Daniel here is to wait for process death and exit events by just
referring to a stable fd, independent of whatever is going on in /proc.

What is needed is something like this (in highly pseudo-code form):

pidfd = opendir("/proc/",..);
wait_fd = pidfd_wait(pidfd);
read or poll wait_fd (non-blocking or blocking whichever)

wait_fd will block until the task has either died or been reaped. In both these
cases, it can return a suitable string such as "dead" or "reaped" although an
integer with some predefined meaning is also Ok.
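
To make that concrete, here is a rough C sketch of the same flow. To be
clear, pidfd_wait() does not exist yet, so the syscall number and the
read payload below are made-up placeholders; the point is only how the
handle composes with plain poll(2)/read(2):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

#define __NR_pidfd_wait 0xffff  /* placeholder: no such syscall yet */

int main(void)
{
    /* Opening /proc/<pid> pins the struct pid (1234 is an example). */
    int pidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY);
    if (pidfd < 0)
        return 1;

    /* Hypothetical: returns a new FD that becomes readable on death. */
    int wait_fd = syscall(__NR_pidfd_wait, pidfd);
    if (wait_fd < 0)
        return 1;

    struct pollfd pfd = { .fd = wait_fd, .events = POLLIN };
    poll(&pfd, 1, -1);  /* blocking; pass 0 for a non-blocking check */

    char buf[16];
    ssize_t n = read(wait_fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("victim is %s\n", buf);  /* e.g. "dead" or "reaped" */
    }

    close(wait_fd);
    close(pidfd);
    return 0;
}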

What that guarantees is, even if the task's PID has been reused, or the task
has already died or already died + reaped, all of these events cannot race
with the code above and the information passed to the user is race-free and
stable / guaranteed.

An eventfd seems to not fit well, because AFAICS passing the raw PID to
eventfd as in your example would still race since the PID could have been
reused by another process by the time the eventfd is created.

Also Andy's idea in [1] seems to use poll flags to communicate various things

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-16 Thread Christian Brauner
On Sat, Mar 16, 2019 at 11:00:10AM -0700, Daniel Colascione wrote:
> On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan  wrote:
> >
> > On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes  
> > wrote:
> > >
> > > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > > [..]
> > > > > why do we want to add a new syscall (pidfd_wait) though? Why not just 
> > > > > use
> > > > > standard poll/epoll interface on the proc fd like Daniel was 
> > > > > suggesting.
> > > > > AFAIK, once the proc file is opened, the struct pid is essentially 
> > > > > pinned
> > > > > even though the proc number may be reused. Then the caller can just 
> > > > > poll.
> > > > > We can add a waitqueue to struct pid, and wake up any waiters on 
> > > > > process
> > > > > death (A quick look shows task_struct can be mapped to its struct 
> > > > > pid) and
> > > > > also possibly optimize it using Steve's TIF flag idea. No new syscall 
> > > > > is
> > > > > needed then, let me know if I missed something?
> > > >
> > > > Huh, I thought that Daniel was against the poll/epoll solution?
> > >
> > > Hmm, going through earlier threads, I believe so now. Here was Daniel's
> > > reasoning about avoiding a notification about process death through proc
> > > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> > >
> > > Maybe a dedicated syscall for this would be cleaner after all.
> >
> > Ah, I wish I'd seen that discussion before...
> > syscall makes sense and it can be non-blocking and we can use
> > select/poll/epoll if we use eventfd.
> 
> Thanks for taking a look.
> 
> > I would strongly advocate for
> > non-blocking version or at least to have a non-blocking option.
> 
> Waiting for FD readiness is *already* blocking or non-blocking
> according to the caller's desire --- users can pass options they want
> to poll(2) or whatever. There's no need for any kind of special
> configuration knob or non-blocking option. We already *have* a
> non-blocking option that works universally for everything.
> 
> As I mentioned in the linked thread, waiting for process exit should
> work just like waiting for bytes to appear on a pipe. Process exit
> status is just another blob of bytes that a process might receive. A
> process exit handle ought to be just another information source. The
> reason the unix process API is so awful is that for whatever reason
> the original designers treated processes as a special kind of resource
> instead of fitting them into the otherwise general-purpose
> unix data-handling API. Let's not repeat that mistake.
> 
> > Something like this:
> >
> > evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > // register eventfd to receive death notification
> > pidfd_wait(pid_to_kill, evfd);
> > // kill the process
> > pidfd_send_signal(pid_to_kill, ...)
> > // tend to other things
> 
> Now you've lost me. pidfd_wait should return a *new* FD, not wire up
> an eventfd.
> 
> Why? Because the new type of FD can report process exit *status*
> information (via read(2) after readability signal) as well as this
> binary yes-or-no signal *that* a process exited, and this capability
> is useful if you want the pidfd interface to be a good
> general-purpose process management facility to replace the awful
> wait() family of functions. You can't get an exit status from an
> eventfd. Wiring up an eventfd the way you've proposed also complicates
> wait-causality information, complicating both tracing and any priority
> inheritance we might want in the future (because all the wakeups get
> mixed into the eventfd and you can't unscramble an egg). And for what?
> What do we gain by using an eventfd? Is the reason that exit.c would
> be able to use eventfd_signal instead of poking a waitqueue directly?
> How is that better? With an eventfd, you've increased path length on
> process exit *and* complicated the API for no reason.
> 
> > ...
> > // wait for the process to die
> > poll_wait(evfd, ...);
> >
> > This simplifies userspace
> 
> Not relative to an exit handle it doesn't.
> 
> >, allows it to wait for multiple events using
> > epoll
> 
> So does a process exit status handle.
> 
> > and I think kernel implementation will be also quite simple
> > because it already implements eventfd_signal() that takes care of
> > waitqueue handling.
> 
> What if there are multiple eventfds registered for the death of a
> process? In any case, you need some mechanism to find, upon process
> death, a list of waiters, then wake each of them up. That's either a
> global search or a search in some list rooted in a task-related
> structure (either struct task or one of its friends). Using an eventfd
> here adds nothing, since upon death, you need this list search
> regardless, and as I mentioned above, eventfd-wiring just makes the
> API worse.
> 
> > If pidfd_send_signal could be extended to have an optional eventfd
> > parameter then we would not even have to add a new syscall.
> 
> There is nothing wrong with 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-16 Thread Daniel Colascione
On Sat, Mar 16, 2019 at 10:31 AM Suren Baghdasaryan  wrote:
>
> On Fri, Mar 15, 2019 at 11:49 AM Joel Fernandes  
> wrote:
> >
> > On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
> > [..]
> > > > why do we want to add a new syscall (pidfd_wait) though? Why not just 
> > > > use
> > > > standard poll/epoll interface on the proc fd like Daniel was suggesting.
> > > > AFAIK, once the proc file is opened, the struct pid is essentially 
> > > > pinned
> > > > even though the proc number may be reused. Then the caller can just 
> > > > poll.
> > > > We can add a waitqueue to struct pid, and wake up any waiters on process
> > > > death (A quick look shows task_struct can be mapped to its struct pid) 
> > > > and
> > > > also possibly optimize it using Steve's TIF flag idea. No new syscall is
> > > > needed then, let me know if I missed something?
> > >
> > > Huh, I thought that Daniel was against the poll/epoll solution?
> >
> > Hmm, going through earlier threads, I believe so now. Here was Daniel's
> > reasoning about avoiding a notification about process death through proc
> > directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html
> >
> > Maybe a dedicated syscall for this would be cleaner after all.
>
> Ah, I wish I'd seen that discussion before...
> syscall makes sense and it can be non-blocking and we can use
> select/poll/epoll if we use eventfd.

Thanks for taking a look.

> I would strongly advocate for
> non-blocking version or at least to have a non-blocking option.

Waiting for FD readiness is *already* blocking or non-blocking
according to the caller's desire --- users can pass options they want
to poll(2) or whatever. There's no need for any kind of special
configuration knob or non-blocking option. We already *have* a
non-blocking option that works universally for everything.
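
Concretely, the same FD serves both styles and the caller chooses at
wait time with nothing more than the poll(2) timeout; a trivial helper,
nothing pidfd-specific about it:

#include <poll.h>

/*
 * timeout_ms = -1 blocks indefinitely, 0 returns immediately (a
 * non-blocking check), > 0 bounds the wait in milliseconds.
 */
static int fd_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    return poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN);
}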

As I mentioned in the linked thread, waiting for process exit should
work just like waiting for bytes to appear on a pipe. Process exit
status is just another blob of bytes that a process might receive. A
process exit handle ought to be just another information source. The
reason the unix process API is so awful is that for whatever reason
the original designers treated processes as a special kind of resource
instead of fitting them into the otherwise general-purpose
unix data-handling API. Let's not repeat that mistake.

> Something like this:
>
> evfd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> // register eventfd to receive death notification
> pidfd_wait(pid_to_kill, evfd);
> // kill the process
> pidfd_send_signal(pid_to_kill, ...)
> // tend to other things

Now you've lost me. pidfd_wait should return a *new* FD, not wire up
an eventfd.

Why? Because the new type of FD can report process exit *status*
information (via read(2) after readability signal) as well as this
binary yes-or-no signal *that* a process exited, and this capability
is useful if you want the pidfd interface to be a good
general-purpose process management facility to replace the awful
wait() family of functions. You can't get an exit status from an
eventfd. Wiring up an eventfd the way you've proposed also complicates
wait-causality information, complicating both tracing and any priority
inheritance we might want in the future (because all the wakeups get
mixed into the eventfd and you can't unscramble an egg). And for what?
What do we gain by using an eventfd? Is the reason that exit.c would
be able to use eventfd_signal instead of poking a waitqueue directly?
How is that better? With an eventfd, you've increased path length on
process exit *and* complicated the API for no reason.
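
To illustrate what the exit-handle shape buys you, consuming the status
could look like this (the payload format is invented for the example;
assume the kernel writes the usual wait-style status word):

#include <poll.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void report_exit(int exit_fd)
{
    struct pollfd pfd = { .fd = exit_fd, .events = POLLIN };
    int status;

    poll(&pfd, 1, -1);  /* readability means the process died */
    if (read(exit_fd, &status, sizeof(status)) == sizeof(status)) {
        if (WIFEXITED(status))
            printf("exited, code %d\n", WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("killed by signal %d\n", WTERMSIG(status));
    }
}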

> ...
> // wait for the process to die
> poll_wait(evfd, ...);
>
> This simplifies userspace

Not relative to an exit handle it doesn't.

>, allows it to wait for multiple events using
> epoll

So does a process exit status handle.

> and I think kernel implementation will be also quite simple
> because it already implements eventfd_signal() that takes care of
> waitqueue handling.

What if there are multiple eventfds registered for the death of a
process? In any case, you need some mechanism to find, upon process
death, a list of waiters, then wake each of them up. That's either a
global search or a search in some list rooted in a task-related
structure (either struct task or one of its friends). Using an eventfd
here adds nothing, since upon death, you need this list search
regardless, and as I mentioned above, eventfd-wiring just makes the
API worse.

> If pidfd_send_signal could be extended to have an optional eventfd
> parameter then we would not even have to add a new syscall.

There is nothing wrong with adding a new system call. I don't know why
there's this idea circulating that adding system calls is something we
should bend over backwards to avoid. It's cheap, and support-wise,
kernel interface is kernel interface. Sending a signal has *nothing*
to do with wiring up some kind of notification and there's no 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Fri, Mar 15, 2019 at 07:24:28PM +0100, Christian Brauner wrote:
[..]
> > why do we want to add a new syscall (pidfd_wait) though? Why not just use
> > standard poll/epoll interface on the proc fd like Daniel was suggesting.
> > AFAIK, once the proc file is opened, the struct pid is essentially pinned
> > even though the proc number may be reused. Then the caller can just poll.
> > We can add a waitqueue to struct pid, and wake up any waiters on process
> > death (A quick look shows task_struct can be mapped to its struct pid) and
> > also possibly optimize it using Steve's TIF flag idea. No new syscall is
> > needed then, let me know if I missed something?
> 
> Huh, I thought that Daniel was against the poll/epoll solution?

Hmm, going through earlier threads, I believe so now. Here was Daniel's
reasoning about avoiding a notification about process death through proc
directory fd: http://lkml.iu.edu/hypermail/linux/kernel/1811.0/00232.html

Maybe a dedicated syscall for this would be cleaner after all.

> I have no clear opinion on what is better at the moment since I have
> been mostly concerned with getting pidfd_send_signal() into shape and
> was reluctant to put more ideas/work into this if it gets shutdown.
> Once we have pidfd_send_signal() the wait discussion makes sense.

Ok. It looks like that is almost in though (fingers crossed :)).

thanks,

 - Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Christian Brauner
On Fri, Mar 15, 2019 at 02:13:24PM -0400, Joel Fernandes wrote:
> On Fri, Mar 15, 2019 at 07:03:07PM +0100, Christian Brauner wrote:
> > On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
> > > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt  
> > > wrote:
> > > >
> > > > On Thu, 14 Mar 2019 13:49:11 -0700
> > > > Sultan Alsawaf  wrote:
> > > >
> > > > > Perhaps I'm missing something, but if you want to know when a process 
> > > > > has died
> > > > > after sending a SIGKILL to it, then why not just make the SIGKILL 
> > > > > optionally
> > > > > block until the process has died completely? It'd be rather trivial 
> > > > > to just
> > > > > store a pointer to an onstack completion inside the victim process' 
> > > > > task_struct,
> > > > > and then complete it in free_task().
> > > >
> > > > How would you implement such a method in userspace? kill() doesn't take
> > > > any parameters but the pid of the process you want to send a signal to,
> > > > and the signal to send. This would require a new system call, and be
> > > > quite a bit of work.
> > > 
> > > That's what the pidfd work is for. Please read the original threads
> > > about the motivation and design of that facility.
> > > 
> > > > If you can solve this with an ebpf program, I
> > > > strongly suggest you do that instead.
> > > 
> > > Regarding process death notification: I will absolutely not support
> > > putting eBPF and perf trace events on the critical path of core system
> > > memory management functionality. Tracing and monitoring facilities are
> > > great for learning about the system, but they were never intended to
> > > be load-bearing. The proposed eBPF process-monitoring approach is just
> > > a variant of the netlink proposal we discussed previously on the pidfd
> > > threads; it has all of its drawbacks. We really need a core system
> > > call  --- really, we've needed robust process management since the
> > > creation of unix --- and I'm glad that we're finally getting it.
> > > Adding new system calls is not expensive; going to great lengths to
> > > avoid adding one is like calling a helicopter to avoid crossing the
> > > street. I don't think we should present an abuse of the debugging and
> > > performance monitoring infrastructure as an alternative to a robust
> > > and desperately-needed bit of core functionality that's neither hard
> > > to add nor complex to implement nor expensive to use.
> > > 
> > > Regarding the proposal for a new kernel-side lmkd: when possible, the
> > > kernel should provide mechanism, not policy. Putting the low memory
> > > killer back into the kernel after we've spent significant effort
> > > making it possible for userspace to do that job. Compared to kernel
> > > code, more easily understood, more easily debuggable, more easily
> > > updated, and much safer. If we *can* move something out of the kernel,
> > > we should. This patch moves us in exactly the wrong direction. Yes, we
> > > need *something* that sits synchronously astride the page allocation
> > > path and does *something* to stop a busy beaver allocator that eats
> > > all the available memory before lmkd, even mlocked and realtime, can
> > > respond. The OOM killer is adequate for this very rare case.
> > > 
> > > With respect to kill timing: Tim is right about the need for two
> > > levels of policy: first, a high-level process prioritization and
> > > memory-demand balancing scheme (which is what OOM score adjustment
> > > code in ActivityManager amounts to); and second, a low-level
> > > process-killing methodology that maximizes sustainable memory reclaim
> > > and minimizes unwanted side effects while killing those processes that
> > > should be dead. Both of these policies belong in userspace --- because
> > > they *can* be in userspace --- and userspace needs only a few tools,
> > > most of which already exist, to do a perfectly adequate job.
> > > 
> > > We do want killed processes to die promptly. That's why I support
> > > boosting a process's priority somehow when lmkd is about to kill it.
> > > The precise way in which we do that --- involving not only actual
> > > priority, but scheduler knobs, cgroup assignment, core affinity, and
> > > so on --- is a complex topic best left to userspace. lmkd already has
> > > all the knobs it needs to implement whatever priority boosting policy
> > > it wants.
> > > 
> > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> > > nobody beats me to it, after pidfd_send_signal lands --- you can
> > 
> > Daniel,
> > 
> > I've just been talking to Joel.
> > I actually "expected" you to work on pidfd_wait() after prior
> > conversations we had on the pidfd_send_signal() patchsets. :) That's why
> > I got a separate git tree on kernel.org since I expect a lot more work
> > to come. I hope that Linus still decides to pull pidfd_send_signal()
> > before Sunday (For the ones who have missed the link in a prior response
> > of mine:
> > 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Fri, Mar 15, 2019 at 07:03:07PM +0100, Christian Brauner wrote:
> On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
> > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt  wrote:
> > >
> > > On Thu, 14 Mar 2019 13:49:11 -0700
> > > Sultan Alsawaf  wrote:
> > >
> > > > Perhaps I'm missing something, but if you want to know when a process 
> > > > has died
> > > > after sending a SIGKILL to it, then why not just make the SIGKILL 
> > > > optionally
> > > > block until the process has died completely? It'd be rather trivial to 
> > > > just
> > > > store a pointer to an onstack completion inside the victim process' 
> > > > task_struct,
> > > > and then complete it in free_task().
> > >
> > > How would you implement such a method in userspace? kill() doesn't take
> > > any parameters but the pid of the process you want to send a signal to,
> > > and the signal to send. This would require a new system call, and be
> > > quite a bit of work.
> > 
> > That's what the pidfd work is for. Please read the original threads
> > about the motivation and design of that facility.
> > 
> > > If you can solve this with an ebpf program, I
> > > strongly suggest you do that instead.
> > 
> > Regarding process death notification: I will absolutely not support
> > putting eBPF and perf trace events on the critical path of core system
> > memory management functionality. Tracing and monitoring facilities are
> > great for learning about the system, but they were never intended to
> > be load-bearing. The proposed eBPF process-monitoring approach is just
> > a variant of the netlink proposal we discussed previously on the pidfd
> > threads; it has all of its drawbacks. We really need a core system
> > call  --- really, we've needed robust process management since the
> > creation of unix --- and I'm glad that we're finally getting it.
> > Adding new system calls is not expensive; going to great lengths to
> > avoid adding one is like calling a helicopter to avoid crossing the
> > street. I don't think we should present an abuse of the debugging and
> > performance monitoring infrastructure as an alternative to a robust
> > and desperately-needed bit of core functionality that's neither hard
> > to add nor complex to implement nor expensive to use.
> > 
> > Regarding the proposal for a new kernel-side lmkd: when possible, the
> > kernel should provide mechanism, not policy. Putting the low memory
> > killer back into the kernel would throw away the significant effort
> > we've spent making it possible for userspace to do that job. Compared
> > to kernel code, userspace code is more easily understood, more easily
> > debugged, more easily updated, and much safer. If we *can* move
> > something out of the kernel,
> > we should. This patch moves us in exactly the wrong direction. Yes, we
> > need *something* that sits synchronously astride the page allocation
> > path and does *something* to stop a busy beaver allocator that eats
> > all the available memory before lmkd, even mlocked and realtime, can
> > respond. The OOM killer is adequate for this very rare case.
> > 
> > With respect to kill timing: Tim is right about the need for two
> > levels of policy: first, a high-level process prioritization and
> > memory-demand balancing scheme (which is what OOM score adjustment
> > code in ActivityManager amounts to); and second, a low-level
> > process-killing methodology that maximizes sustainable memory reclaim
> > and minimizes unwanted side effects while killing those processes that
> > should be dead. Both of these policies belong in userspace --- because
> > they *can* be in userspace --- and userspace needs only a few tools,
> > most of which already exist, to do a perfectly adequate job.
> > 
> > We do want killed processes to die promptly. That's why I support
> > boosting a process's priority somehow when lmkd is about to kill it.
> > The precise way in which we do that --- involving not only actual
> > priority, but scheduler knobs, cgroup assignment, core affinity, and
> > so on --- is a complex topic best left to userspace. lmkd already has
> > all the knobs it needs to implement whatever priority boosting policy
> > it wants.
> > 
> > Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> > nobody beats me to it, after pidfd_send_signal lands --- you can
> 
> Daniel,
> 
> I've just been talking to Joel.
> I actually "expected" you to work on pidfd_wait() after prior
> conversations we had on the pidfd_send_signal() patchsets. :) That's why
> I got a separate git tree on kernel.org since I expect a lot more work
> to come. I hope that Linus still decides to pull pidfd_send_signal()
> before Sunday (For the ones who have missed the link in a prior response
> of mine:
> https://lkml.org/lkml/2019/3/12/439
> 
> This is the first merge window in which I've sent this PR.
> 
> The pidfd tree has a branch for-next that has been tracked by Stephen
> in linux-next since the 5.0 merge window. The patches for
> pidfd_send_signal() sit in the pidfd 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Christian Brauner
On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
> On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt  wrote:
> >
> > On Thu, 14 Mar 2019 13:49:11 -0700
> > Sultan Alsawaf  wrote:
> >
> > > Perhaps I'm missing something, but if you want to know when a process has 
> > > died
> > > after sending a SIGKILL to it, then why not just make the SIGKILL 
> > > optionally
> > > block until the process has died completely? It'd be rather trivial to 
> > > just
> > > store a pointer to an onstack completion inside the victim process' 
> > > task_struct,
> > > and then complete it in free_task().
> >
> > How would you implement such a method in userspace? kill() doesn't take
> > any parameters but the pid of the process you want to send a signal to,
> > and the signal to send. This would require a new system call, and be
> > quite a bit of work.
> 
> That's what the pidfd work is for. Please read the original threads
> about the motivation and design of that facility.
> 
> > If you can solve this with an ebpf program, I
> > strongly suggest you do that instead.
> 
> Regarding process death notification: I will absolutely not support
> putting eBPF and perf trace events on the critical path of core system
> memory management functionality. Tracing and monitoring facilities are
> great for learning about the system, but they were never intended to
> be load-bearing. The proposed eBPF process-monitoring approach is just
> a variant of the netlink proposal we discussed previously on the pidfd
> threads; it has all of its drawbacks. We really need a core system
> call  --- really, we've needed robust process management since the
> creation of unix --- and I'm glad that we're finally getting it.
> Adding new system calls is not expensive; going to great lengths to
> avoid adding one is like calling a helicopter to avoid crossing the
> street. I don't think we should present an abuse of the debugging and
> performance monitoring infrastructure as an alternative to a robust
> and desperately-needed bit of core functionality that's neither hard
> to add nor complex to implement nor expensive to use.
> 
> Regarding the proposal for a new kernel-side lmkd: when possible, the
> kernel should provide mechanism, not policy. Putting the low memory
> killer back into the kernel would throw away the significant effort
> we've spent making it possible for userspace to do that job. Compared
> to kernel code, userspace code is more easily understood, more easily
> debugged, more easily updated, and much safer. If we *can* move
> something out of the kernel,
> we should. This patch moves us in exactly the wrong direction. Yes, we
> need *something* that sits synchronously astride the page allocation
> path and does *something* to stop a busy beaver allocator that eats
> all the available memory before lmkd, even mlocked and realtime, can
> respond. The OOM killer is adequate for this very rare case.
> 
> With respect to kill timing: Tim is right about the need for two
> levels of policy: first, a high-level process prioritization and
> memory-demand balancing scheme (which is what OOM score adjustment
> code in ActivityManager amounts to); and second, a low-level
> process-killing methodology that maximizes sustainable memory reclaim
> and minimizes unwanted side effects while killing those processes that
> should be dead. Both of these policies belong in userspace --- because
> they *can* be in userspace --- and userspace needs only a few tools,
> most of which already exist, to do a perfectly adequate job.
> 
> We do want killed processes to die promptly. That's why I support
> boosting a process's priority somehow when lmkd is about to kill it.
> The precise way in which we do that --- involving not only actual
> priority, but scheduler knobs, cgroup assignment, core affinity, and
> so on --- is a complex topic best left to userspace. lmkd already has
> all the knobs it needs to implement whatever priority boosting policy
> it wants.
> 
> Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> nobody beats me to it, after pidfd_send_signal lands --- you can

Daniel,

I've just been talking to Joel.
I actually "expected" you to work on pidfd_wait() after prior
conversations we had on the pidfd_send_signal() patchsets. :) That's why
I got a separate git tree on kernel.org since I expect a lot more work
to come. I hope that Linus still decides to pull pidfd_send_signal()
before Sunday (For the ones who have missed the link in a prior response
of mine:
https://lkml.org/lkml/2019/3/12/439

This is the first merge window in which I've sent this PR.

The pidfd tree has a branch for-next that has been tracked by Stephen
in linux-next since the 5.0 merge window. The patches for
pidfd_send_signal() sit in the pidfd branch.
I'd be happy to share the tree with you and Joel (We can rename it if
you prefer I don't care).
I would really like to centralize this work so that we sort of have a
"united front" and end up with a coherent api and can send PRs from a

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Steven Rostedt
On Thu, 14 Mar 2019 21:36:43 -0700
Daniel Colascione  wrote:

> On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt  wrote:
> >
> > On Thu, 14 Mar 2019 13:49:11 -0700
> > Sultan Alsawaf  wrote:
> >  
> > > Perhaps I'm missing something, but if you want to know when a process has 
> > > died
> > > after sending a SIGKILL to it, then why not just make the SIGKILL 
> > > optionally
> > > block until the process has died completely? It'd be rather trivial to 
> > > just
> > > store a pointer to an onstack completion inside the victim process' 
> > > task_struct,
> > > and then complete it in free_task().  
> >
> > How would you implement such a method in userspace? kill() doesn't take
> > any parameters but the pid of the process you want to send a signal to,
> > and the signal to send. This would require a new system call, and be
> > quite a bit of work.  
> 
> That's what the pidfd work is for. Please read the original threads
> about the motivation and design of that facility.

I wasn't Cc'd on the original threads, so I haven't read them.

> 
> > If you can solve this with an ebpf program, I
> > strongly suggest you do that instead.  
> 
> We do want killed processes to die promptly. That's why I support
> boosting a process's priority somehow when lmkd is about to kill it.
> The precise way in which we do that --- involving not only actual
> priority, but scheduler knobs, cgroup assignment, core affinity, and
> so on --- is a complex topic best left to userspace. lmkd already has
> all the knobs it needs to implement whatever priority boosting policy
> it wants.
> 
> Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> nobody beats me to it, after pidfd_send_signal lands --- you can
> imagine a general-purpose priority inheritance mechanism expediting
> process death when a high-priority process waits on a pidfd_wait
> handle for a condemned process. You know you're on the right track
> design-wise when you start seeing this kind of elegant constructive
> interference between seemingly-unrelated features. What we don't need
> is some kind of blocking SIGKILL alternative or backdoor event
> delivery system.
> 
> We definitely don't want to have to wait for a process's parent to
> reap it. Instead, we want to wait for it to become a zombie. That's
> why I designed my original exithand patch to fire death notification
> upon transition to the zombie state, not upon process table removal,
> and I expect pidfd_wait (or whatever we call it) to act the same way.
> 
> In any case, there's a clear path forward here --- general-purpose,
> cheap, and elegant --- and we should just focus on doing that instead
> of more complex proposals with few advantages.

If you add new pidfd system calls, then a new way to send a signal and
block until the process dies is more acceptable than adding a new
signal that changes the semantics of sending signals, which is what I
was against.

I do agree with Joel about bloating task_struct too. If anything, add a
wait queue where you can allocate a descriptor tying the dying task to
the killing task, and just search this queue when a task dies. We could
add a TIF flag to the task as well to let the exit path of this task
know it should do such an operation.

-- Steve


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-15 Thread Joel Fernandes
On Thu, Mar 14, 2019 at 09:36:43PM -0700, Daniel Colascione wrote:
[snip] 
> > If you can solve this with an ebpf program, I
> > strongly suggest you do that instead.
> 
> Regarding process death notification: I will absolutely not support
> putting eBPF and perf trace events on the critical path of core system
> memory management functionality. Tracing and monitoring facilities are
> great for learning about the system, but they were never intended to
> be load-bearing. The proposed eBPF process-monitoring approach is just
> a variant of the netlink proposal we discussed previously on the pidfd
> threads; it has all of its drawbacks. We really need a core system
> call  --- really, we've needed robust process management since the
> creation of unix --- and I'm glad that we're finally getting it.
> Adding new system calls is not expensive; going to great lengths to
> avoid adding one is like calling a helicopter to avoid crossing the
> street. I don't think we should present an abuse of the debugging and
> performance monitoring infrastructure as an alternative to a robust
> and desperately-needed bit of core functionality that's neither hard
> to add nor complex to implement nor expensive to use.

The eBPF-based solution to this would be just as simple while avoiding any
kernel changes. I don't know why you think it is not load-bearing. However, I
agree the proc/pidfd approach is better and can be simpler for some users who
don't want to deal with eBPF - especially since something like this has many
usecases. I was just suggesting the eBPF solution as a better alternative to
the task_struct surgery idea from Sultan since that sounded to me quite
hackish (that could just be my opinion).

> Regarding the proposal for a new kernel-side lmkd: when possible, the
> kernel should provide mechanism, not policy. Putting the low memory
> killer back into the kernel would throw away the significant effort
> we've spent making it possible for userspace to do that job. Compared
> to kernel code, userspace code is more easily understood, more easily
> debugged, more easily updated, and much safer. If we *can* move
> something out of the kernel,
> we should. This patch moves us in exactly the wrong direction. Yes, we
> need *something* that sits synchronously astride the page allocation
> path and does *something* to stop a busy beaver allocator that eats
> all the available memory before lmkd, even mlocked and realtime, can
> respond. The OOM killer is adequate for this very rare case.
> 
> With respect to kill timing: Tim is right about the need for two
> levels of policy: first, a high-level process prioritization and
> memory-demand balancing scheme (which is what OOM score adjustment
> code in ActivityManager amounts to); and second, a low-level
> process-killing methodology that maximizes sustainable memory reclaim
> and minimizes unwanted side effects while killing those processes that
> should be dead. Both of these policies belong in userspace --- because
> they *can* be in userspace --- and userspace needs only a few tools,
> most of which already exist, to do a perfectly adequate job.
> 
> We do want killed processes to die promptly. That's why I support
> boosting a process's priority somehow when lmkd is about to kill it.
> The precise way in which we do that --- involving not only actual
> priority, but scheduler knobs, cgroup assignment, core affinity, and
> so on --- is a complex topic best left to userspace. lmkd already has
> all the knobs it needs to implement whatever priority boosting policy
> it wants.
> 
> Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> nobody beats me to it, after pidfd_send_signal lands --- you can
> imagine a general-purpose priority inheritance mechanism expediting
> process death when a high-priority process waits on a pidfd_wait
> handle for a condemned process. You know you're on the right track
> design-wise when you start seeing this kind of elegant constructive
> interference between seemingly-unrelated features. What we don't need
> is some kind of blocking SIGKILL alternative or backdoor event
> delivery system.
> 
> We definitely don't want to have to wait for a process's parent to
> reap it. Instead, we want to wait for it to become a zombie. That's
> why I designed my original exithand patch to fire death notification
> upon transition to the zombie state, not upon process table removal,
> and I expect pidfd_wait (or whatever we call it) to act the same way.

Agreed. Looking forward to the patches. :)

thanks,

 - Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 11:16:41PM -0400, Steven Rostedt wrote:
> How would you implement such a method in userspace? kill() doesn't take
> any parameters but the pid of the process you want to send a signal to,
> and the signal to send. This would require a new system call, and be
> quite a bit of work. If you can solve this with an ebpf program, I
> strongly suggest you do that instead.

This can be done by introducing a new signal number that provides SIGKILL
functionality while blocking (maybe SIGKILLBLOCK?).
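
Usage would be nothing more than this (SIGKILLBLOCK being entirely
hypothetical; the number below is only an illustration):

#include <signal.h>

#define SIGKILLBLOCK 33  /* hypothetical; no such signal exists today */

static void kill_sync(pid_t pid)
{
    kill(pid, SIGKILLBLOCK);  /* would return only once pid is a zombie */
}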

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 10:54:48PM -0400, Joel Fernandes wrote:
> I'm not sure if that makes much semantic sense for how the signal handling is
> supposed to work. Imagine a parent sends SIGKILL to its child, and then does
> a wait(2). Because the SIGKILL blocks in your idea, then the wait cannot
> execute, and because the wait cannot execute, the zombie task will not get
> reaped and so the SIGKILL sender never gets unblocked and the whole thing
> just gets locked up. No? I don't know, it just feels incorrect.

Block until the victim becomes a zombie instead.

> Further, in your idea adding stuff to task_struct will simply bloat it - when
> this task can easily be handled using eBPF without making any kernel changes.
> Either by probing sched_process_free or sched_process_exit tracepoints.
> Scheduler maintainers generally frown on adding stuff to task_struct
> pointlessly; there's a good reason, since bloating it affects the performance
> etc, and something like this would probably never be ifdef'd out behind a
> CONFIG.

Adding something to task_struct is just the easiest way to test things for
experimentation. This can be avoided in my suggestion by passing the pointer to
a completion via the relevant functions, and then completing it at the time the
victim transitions to a zombie state. I understand it's possible to use eBPF for
this, but it seems kind of messy since this functionality is something that I
think others would want provided by the kernel (i.e., anyone using PSI to
implement their own OOM killer daemon similar to LMKD).

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2019 13:49:11 -0700
Sultan Alsawaf  wrote:

> Perhaps I'm missing something, but if you want to know when a process has died
> after sending a SIGKILL to it, then why not just make the SIGKILL optionally
> block until the process has died completely? It'd be rather trivial to just
> store a pointer to an onstack completion inside the victim process' 
> task_struct,
> and then complete it in free_task().

How would you implement such a method in userspace? kill() doesn't take
any parameters but the pid of the process you want to send a signal to,
and the signal to send. This would require a new system call, and be
quite a bit of work. If you can solve this with an ebpf program, I
strongly suggest you do that instead.

-- Steve


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Joel Fernandes
On Thu, Mar 14, 2019 at 01:49:11PM -0700, Sultan Alsawaf wrote:
> On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> > About the 100ms latency, I wonder whether it is that high because of
> > the way Android's lmkd is observing that a process has died. There is
> > a gap between when a process memory is freed and when it disappears
> > from the process-table.  Once a process is SIGKILLed, it becomes a
> > zombie. Its memory is freed instantly during the SIGKILL delivery (I
> > traced this so that's how I know), but until it is reaped by its
> > parent thread, it will still exist in /proc/ . So if testing the
> > existence of /proc/ is how Android is observing that the process
> > died, then there can be a large latency where it takes a very long
> > time for the parent to actually reap the child way after its memory
> > was long freed. A quicker way to know if a process's memory is freed
> > before it is reaped could be to read back /proc//maps in
> > userspace of the victim , and that file will be empty for zombie
> > processes. So then one does not need to wait for the parent to reap it. I
> > wonder how much of that 100ms you mentioned is actually the "Waiting
> > while Parent is reaping the child", than "memory freeing time". So
> > yeah for this second problem, the procfds work will help.
> >
> > By the way another approach that can provide a quick and asynchronous
> > notification of when the process memory is freed, is to monitor
> > sched_process_exit trace event using eBPF. You can tell eBPF the PID
> > that you want to monitor before the SIGKILL. As soon as the process
> > dies and its memory is freed, the eBPF program can send a notification
> > to user space (using the perf_events polling infra). The
> > sched_process_exit fires just after the mmput() happens so it is quite
> > close to when the memory is reclaimed. This also doesn't need any
> > kernel changes. I could come up with a prototype for this and
> > benchmark it on Android, if you want. Just let me know.
> 
> Perhaps I'm missing something, but if you want to know when a process has died
> after sending a SIGKILL to it, then why not just make the SIGKILL optionally
> block until the process has died completely? It'd be rather trivial to just
> store a pointer to an onstack completion inside the victim process' 
> task_struct,
> and then complete it in free_task().

I'm not sure if that makes much semantic sense for how the signal handling is
supposed to work. Imagine a parent sends SIGKILL to its child, and then does
a wait(2). Because the SIGKILL blocks in your idea, then the wait cannot
execute, and because the wait cannot execute, the zombie task will not get
reaped and so the SIGKILL sender never gets unblocked and the whole thing
just gets locked up. No? I don't know, it just feels incorrect.

Further, in your idea adding stuff to task_struct will simply bloat it - when
this task can easily be handled using eBPF without making any kernel changes.
Either by probing sched_process_free or sched_process_exit tracepoints.
Scheduler maintainers generally frown on adding stuff to task_struct
pointlessly; there's a good reason, since bloating it affects the performance
etc, and something like this would probably never be ifdef'd out behind a
CONFIG.

thanks,

 - Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Sultan Alsawaf
On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote:
> About the 100ms latency, I wonder whether it is that high because of
> the way Android's lmkd is observing that a process has died. There is
> a gap between when a process memory is freed and when it disappears
> from the process-table.  Once a process is SIGKILLed, it becomes a
> zombie. Its memory is freed instantly during the SIGKILL delivery (I
> traced this so that's how I know), but until it is reaped by its
> parent thread, it will still exist in /proc/ . So if testing the
> existence of /proc/ is how Android is observing that the process
> died, then there can be a large latency where it takes a very long
> time for the parent to actually reap the child way after its memory
> was long freed. A quicker way to know if a process's memory is freed
> before it is reaped could be to read back /proc//maps in
> userspace of the victim , and that file will be empty for zombie
> processes. So then one does not need to wait for the parent to reap it. I
> wonder how much of that 100ms you mentioned is actually the "Waiting
> while Parent is reaping the child", than "memory freeing time". So
> yeah for this second problem, the procfds work will help.
>
> By the way another approach that can provide a quick and asynchronous
> notification of when the process memory is freed, is to monitor
> sched_process_exit trace event using eBPF. You can tell eBPF the PID
> that you want to monitor before the SIGKILL. As soon as the process
> dies and its memory is freed, the eBPF program can send a notification
> to user space (using the perf_events polling infra). The
> sched_process_exit fires just after the mmput() happens so it is quite
> close to when the memory is reclaimed. This also doesn't need any
> kernel changes. I could come up with a prototype for this and
> benchmark it on Android, if you want. Just let me know.

Perhaps I'm missing something, but if you want to know when a process has died
after sending a SIGKILL to it, then why not just make the SIGKILL optionally
block until the process has died completely? It'd be rather trivial to just
store a pointer to an onstack completion inside the victim process' task_struct,
and then complete it in free_task().
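
Roughly like this (untested sketch; the field name is invented and the
real thing would need locking around the pointer):

#include <linux/completion.h>
#include <linux/sched/signal.h>

/* task_struct would grow one pointer, say: struct completion *exit_done; */

static void kill_and_wait(struct task_struct *victim)
{
    DECLARE_COMPLETION_ONSTACK(done);

    victim->exit_done = &done;  /* illustration only; racy as written */
    send_sig(SIGKILL, victim, 0);
    wait_for_completion(&done); /* completed from free_task() */
}

/*
 * ... and free_task() would gain:
 *
 *     if (tsk->exit_done)
 *         complete(tsk->exit_done);
 */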

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-14 Thread Joel Fernandes
Hi Tim,
Thanks for the detailed and excellent write-up. It will serve as a
good future reference for low memory killer requirements. I made some
comments below on the "how to kill" part.

On Tue, Mar 12, 2019 at 10:17 AM Tim Murray  wrote:
>
> On Tue, Mar 12, 2019 at 9:37 AM Sultan Alsawaf  wrote:
> >
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > The only way to control the OOM behavior pro-actively is to throttle
> > > allocation speed. We have memcg high limit for that purpose. Along with
> > > PSI, I can imagine a reasonably working user space early oom
> > > notifications and reasonable acting upon that.
> >
> > The issue with pro-active memory management that prompted me to create
> > this was poor memory utilization. All of the alternative means of
> > reclaiming pages in the page allocator's slow path turn out to be very
> > useful for maximizing memory utilization, which is something that we
> > would have to forgo by relying on a purely pro-active solution. I have
> > not had a chance to look at PSI yet, but unless a PSI-enabled solution
> > allows allocations to reach the same point as when the OOM killer is
> > invoked (which is contradictory to what it sets out to do), then it
> > cannot take advantage of all of the alternative memory-reclaim means
> > employed in the slowpath, and will result in killing a process before
> > it is _really_ necessary.
>
> There are two essential parts of a lowmemorykiller implementation:
> when to kill and how to kill.
>
> There are a million possible approaches to decide when to kill an
> unimportant process. They usually trade off between the same two
> failure modes depending on the workload.
>
> If you kill too aggressively, a transient spike that could be
> imperceptibly absorbed by evicting some file pages or moving some
> pages to ZRAM will result in killing processes, which then get started
> up later and have a performance/battery cost.
>
> If you don't kill aggressively enough, you will encounter a workload
> that thrashes the page cache, constantly evicting and reloading file
> pages and moving things in and out of ZRAM, which makes the system
> unusable when a process should have been killed instead.
>
> As far as I've seen, any methodology that uses single points in time
> to decide when to kill without completely biasing toward one or the
> other is susceptible to both. The minfree approach used by
> lowmemorykiller/lmkd certainly is; it is both too aggressive for some
> workloads and not aggressive enough for other workloads. My guess is
> that simple LMK won't kill on transient spikes but will be extremely
> susceptible to page cache thrashing. This is not an improvement; page
> cache thrashing manifests as the entire system running very slowly.
>
> What you actually want from lowmemorykiller/lmkd on Android is to only
> kill once it becomes clear that the system will continue to try to
> reclaim memory to the extent that it could impact what the user
> actually cares about. That means tracking how much time is spent in
> reclaim/paging operations and the like, and that's exactly what PSI
> does. lmkd has had support for using PSI as a replacement for
> vmpressure for use as a wakeup trigger (to check current memory levels
> against the minfree thresholds) since early February. It works fine;
> unsurprisingly it's better than vmpressure at avoiding false wakeups.
>
> Longer term, there's a lot of work to be done in lmkd to turn PSI into
> a kill trigger and remove minfree entirely. It's tricky (mainly
> because of the "when to kill another process" problem discussed
> later), but I believe it's feasible.
>
> How to kill is similarly messy. The latency of reclaiming memory post
> SIGKILL can be severe (usually tens of milliseconds, occasionally
> >100ms). The latency we see on Android usually isn't because those
> threads are blocked in uninterruptible sleep, it's because times of
> memory pressure are also usually times of significant CPU contention
> and these are overwhelmingly CFS threads, some of which may be
> assigned a very low priority. lmkd now sets priorities and resets
> cpusets upon killing a process, and we have seen improved reclaim
> latency because of this. oom reaper might be a good approach to avoid
> this latency (I think some in-kernel lowmemorykiller implementations
> rely on it), but we can't use it from userspace. Something for future
> consideration.
>

This makes sense. If the process receiving the SIGKILL does not get CPU
time, then the kernel may not be able to execute the unconditional
signal handling paths in the context of the victim process to free the memory.

I don't see how the proc-fds approach will solve this though. Say you have
process L (which is LMKd) which sends a SIGKILL to process V (which is
a victim). Now L sends SIGKILL to V. Unless V executes the
signal-handling code in kernel context and is scheduled at high enough
priority to get CPU time, I don't think the 

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Christian Brauner
On Tue, Mar 12, 2019 at 7:43 PM Tim Murray  wrote:
>
> On Tue, Mar 12, 2019 at 10:45 AM Sultan Alsawaf  wrote:
> >
> > On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> > > Knowing whether a SIGKILL'd process has finished reclaiming is as far
> > > as I know not possible without something like procfds. That's where
> > > the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both
> > > attempt to wait up to 100ms for reclaim to finish by checking for the
> > > continued existence of the thread that received the SIGKILL, but this
> > > really means that they wait up to 100ms for the _thread_ to finish,
> > > which doesn't tell you anything about the memory used by that process.
> > > If those threads terminate early and lowmemorykiller/lmkd get a signal
> > > to kill again, then there may be two processes competing for CPU time
> > > to reclaim memory. That doesn't reclaim any faster and may be an
> > > unnecessary kill.
> > > ...
> > > - offer a way to wait for process termination so lmkd can tell when
> > > reclaim has finished and know when killing another process is
> > > appropriate
> >
> > Should be pretty easy with something like this:
>
> Yeah, that's in the spirit of what I was suggesting, but there are a lot
> of edge cases around how to get that data out efficiently and PID
> reuse (it's a real issue--often the Android apps that are causing
> memory pressure are also constantly creating/destroying threads).
>
> I believe procfds or a similar mechanism will be a good solution to this.

Fwiw, I am working on this and have sent a PR for inclusion in 5.1:
https://lore.kernel.org/lkml/20190312135245.27591-1-christ...@brauner.io/
There's also a tree to track this work.

Christian


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Sultan Alsawaf
On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> Knowing whether a SIGKILL'd process has finished reclaiming is as far
> as I know not possible without something like procfds. That's where
> the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both
> attempt to wait up to 100ms for reclaim to finish by checking for the
> continued existence of the thread that received the SIGKILL, but this
> really means that they wait up to 100ms for the _thread_ to finish,
> which doesn't tell you anything about the memory used by that process.
> If those threads terminate early and lowmemorykiller/lmkd get a signal
> to kill again, then there may be two processes competing for CPU time
> to reclaim memory. That doesn't reclaim any faster and may be an
> unnecessary kill.
> ...
> - offer a way to wait for process termination so lmkd can tell when
> reclaim has finished and know when killing another process is
> appropriate

Should be pretty easy with something like this:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1..6ac478af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1199,6 +1199,7 @@ struct task_struct {
unsigned long   lowest_stack;
unsigned long   prev_lowest_stack;
 #endif
+   ktime_t sigkill_time;
 
/*
 * New fields for task_struct should be added above here, so that
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa2..0ae182777 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -435,6 +435,8 @@ void put_task_stack(struct task_struct *tsk)
 
 void free_task(struct task_struct *tsk)
 {
+   ktime_t sigkill_time = tsk->sigkill_time;
+   pid_t pid = tsk->pid;
 #ifndef CONFIG_THREAD_INFO_IN_TASK
/*
 * The task is finally done with both the stack and thread_info,
@@ -455,6 +457,9 @@ void free_task(struct task_struct *tsk)
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
free_task_struct(tsk);
+   if (sigkill_time)
+   printk("%d killed after %lld us\n", pid,
+  ktime_us_delta(ktime_get(), sigkill_time));
 }
 EXPORT_SYMBOL(free_task);
 
@@ -1881,6 +1886,7 @@ static __latent_entropy struct task_struct *copy_process(
p->sequential_io= 0;
p->sequential_io_avg= 0;
 #endif
+   p->sigkill_time = 0;
 
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
diff --git a/kernel/signal.c b/kernel/signal.c
index 5d53183e2..1142c8811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1168,6 +1168,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
}
 
 out_set:
+   if (sig == SIGKILL)
+   t->sigkill_time = ktime_get();
signalfd_notify(t, sig);
sigaddset(&pending->signal, sig);


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> I have not had a chance to look at PSI yet, but
> unless a PSI-enabled solution allows allocations to reach the same point
> as when the OOM killer is invoked (which is contradictory to what it sets
> out to do),
> then it cannot take advantage of all of the alternative memory-reclaim means
> employed in the slowpath, and will result in killing a process before it is
> _really_ necessary.

One more note. The above is true, but you can also hit one of the
thrashing reclaim behaviors and reclaim the last few pages again and again
while the whole system is really sluggish. That is what PSI is trying to
help with.
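
For reference, the PSI trigger interface gives userspace exactly this kind
of sustained-stall signal; a minimal sketch following the sample in
Documentation/accounting/psi.txt (illustrative, error handling elided):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Fire when tasks stall on memory >= 150ms within any 1s window. */
	const char trig[] = "some 150000 1000000";
	struct pollfd pfd;

	pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0 || write(pfd.fd, trig, strlen(trig) + 1) < 0)
		return 1;

	pfd.events = POLLPRI;
	while (poll(&pfd, 1, -1) > 0) {
		if (pfd.revents & POLLPRI)
			printf("sustained memory stall detected\n");
	}
	return 0;
}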
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > The only way to control the OOM behavior pro-actively is to throttle
> > allocation speed. We have memcg high limit for that purpose. Along with
> > PSI, I can imagine reasonably working user space early oom
> > notifications and reasonable actions upon them.
> 
> The issue with pro-active memory management that prompted me to create
> this was poor memory utilization. All of the alternative means of
> reclaiming pages in the page allocator's slow path turn out to be very
> useful for maximizing memory utilization, which is something that we
> would have to forgo by relying on a purely pro-active solution. I have
> not had a chance to look at PSI yet, but unless a PSI-enabled solution
> allows allocations to reach the same point as when the OOM killer is
> invoked (which is contradictory to what it sets out to do), then it
> cannot take advantage of all of the alternative memory-reclaim means
> employed in the slowpath, and will result in killing a process before
> it is _really_ necessary.

If you really want to reach the real OOM situation then you can very
well rely on the in-kernel OOM killer. The only reason you want a
customized OOM killer is the task classification. And that is a
different story. User space hints on victim selection have been a
topic for quite a while. It never got to any conclusion, as interested
parties always lost interest because it got hairy quickly.

> > If your design relies on the speed of killing then it is fundamentally
> > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > It might be blocked in an uninterruptible sleep or performing an
> > operation which takes some time. Sure, oom_reaper might help here but
> > still.
> 
> In theory we could instantly zap any process that is not trapped in the kernel
> at the time that the OOM killer is invoked without any consequences though, no?

No, this is not so simple. Have a look at the oom_reaper and the hoops it
has to go through.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Sultan Alsawaf
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> The only way to control the OOM behavior pro-actively is to throttle
> allocation speed. We have memcg high limit for that purpose. Along with
> PSI, I can imagine reasonably working user space early oom
> notifications and reasonable actions upon them.

The issue with pro-active memory management that prompted me to create this was
poor memory utilization. All of the alternative means of reclaiming pages in the
page allocator's slow path turn out to be very useful for maximizing memory
utilization, which is something that we would have to forgo by relying on a
purely pro-active solution. I have not had a chance to look at PSI yet, but
unless a PSI-enabled solution allows allocations to reach the same point as when
the OOM killer is invoked (which is contradictory to what it sets out to do),
then it cannot take advantage of all of the alternative memory-reclaim means
employed in the slowpath, and will result in killing a process before it is
_really_ necessary.

> If your design relies on the speed of killing then it is fundamentally
> flawed AFAICT. You cannot assume anything about how quickly a task dies.
> It might be blocked in an uninterruptible sleep or performing an
> operation which takes some time. Sure, oom_reaper might help here but
> still.

In theory we could instantly zap any process that is not trapped in the kernel
at the time that the OOM killer is invoked without any consequences though, no?

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 16:33:15, Michal Hocko wrote:
> On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > > Yeah, killing speed is a well-known problem which we are considering
> > > > in LMKD. For example the recent LMKD change to assign process being
> > > > killed to a cpuset cgroup containing big cores cuts the kill time
> > > > considerably. This is not ideal and we are thinking about better ways
> > > > to expedite the cleanup process.
> > > 
> > > If your design relies on the speed of killing then it is fundamentally
> > > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > > It might be blocked in an uninterruptible sleep or performing an
> > > operation which takes some time. Sure, oom_reaper might help here but
> > > still.
> > 
> > Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps.  It just
> > needs someone to do the work.
> 
> They can and should be converted as much as possible. No question about
> that. But not all of them can, and that is why nobody should be relying
> on that. That is the whole point of having the oom_reaper and async oom
> victim tear down.

Let me clarify a bit. LMK obviously doesn't need any guarantee like the
core oom killer because it is more of a pro-active measure than a last
resort. I merely wanted to say that relying on a design which assumes
anything about the time a victim needs to exit is flawed, and it will fail
under different workloads. On the other hand this might work well enough
on very specific workloads to be usable. I am not questioning that. The
point is that this is not generic enough to be accepted into the upstream
kernel.
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Tue 12-03-19 08:25:41, Matthew Wilcox wrote:
> On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > > Yeah, killing speed is a well-known problem which we are considering
> > > in LMKD. For example the recent LMKD change to assign process being
> > > killed to a cpuset cgroup containing big cores cuts the kill time
> > > considerably. This is not ideal and we are thinking about better ways
> > > to expedite the cleanup process.
> > 
> > If your design relies on the speed of killing then it is fundamentally
> > flawed AFAICT. You cannot assume anything about how quickly a task dies.
> > It might be blocked in an uninterruptible sleep or performing an
> > operation which takes some time. Sure, oom_reaper might help here but
> > still.
> 
> Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps.  It just
> needs someone to do the work.

They can and should be converted as much as possible. No question about
that. But not all of them can, and that is why nobody should be relying on
that. That is the whole point of having the oom_reaper and async oom victim
tear down.

-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Matthew Wilcox
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> > Yeah, killing speed is a well-known problem which we are considering
> > in LMKD. For example the recent LMKD change to assign process being
> > killed to a cpuset cgroup containing big cores cuts the kill time
> > considerably. This is not ideal and we are thinking about better ways
> > to expedite the cleanup process.
> 
> If your design relies on the speed of killing then it is fundamentally
> flawed AFAICT. You cannot assume anything about how quickly a task dies.
> It might be blocked in an uninterruptible sleep or performing an
> operation which takes some time. Sure, oom_reaper might help here but
> still.

Many UNINTERRUPTIBLE sleeps can be converted to KILLABLE sleeps.  It just
needs someone to do the work.


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-12 Thread Michal Hocko
On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote:
> On Mon, Mar 11, 2019 at 1:46 PM Sultan Alsawaf  wrote:
> >
> > On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > > The idea seems interesting although I need to think about this a bit
> > > more. Killing processes based on failed page allocation might backfire
> > > during transient spikes in memory usage.
> >
> > This issue could be alleviated if tasks could be killed and have their
> > pages reaped faster. Currently, Linux takes a _very_ long time to free
> > a task's memory after an initial privileged SIGKILL is sent to a task,
> > even with the task's priority being set to the highest possible (so
> > unwanted scheduler preemption starving dying tasks of CPU time is not
> > the issue at play here). I've frequently measured the difference in
> > time between when a SIGKILL is sent for a task and when free_task() is
> > called for that task to be hundreds of milliseconds, which is
> > incredibly long. AFAIK, this is a problem that LMKD suffers from as
> > well, and perhaps any OOM killer implementation in Linux, since you
> > cannot evaluate the effect you've had on memory pressure by killing a
> > process for at least several tens of milliseconds.
> 
> Yeah, killing speed is a well-known problem which we are considering
> in LMKD. For example the recent LMKD change to assign process being
> killed to a cpuset cgroup containing big cores cuts the kill time
> considerably. This is not ideal and we are thinking about better ways
> to expedite the cleanup process.

If your design relies on the speed of killing then it is fundamentally
flawed AFAICT. You cannot assume anything about how quickly a task dies.
It might be blocked in an uninterruptible sleep or performing an
operation which takes some time. Sure, oom_reaper might help here but
still.

The only way to control the OOM behavior pro-actively is to throttle
allocation speed. We have memcg high limit for that purpose. Along with
PSI, I can imagine reasonably working user space early oom
notifications and reasonable actions upon them.
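
(For reference, the memcg high limit mentioned here is the cgroup v2
memory.high knob; a trivial sketch, with the cgroup path made up:)

#include <stdio.h>

int main(void)
{
	/* Throttle allocations in this group once usage exceeds 512M;
	 * reclaim then slows the group down instead of invoking OOM. */
	FILE *f = fopen("/sys/fs/cgroup/apps/memory.high", "w");

	if (!f)
		return 1;
	fputs("512M", f);
	fclose(f);
	return 0;
}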
-- 
Michal Hocko
SUSE Labs


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 03:15:35PM -0700, Suren Baghdasaryan wrote:
> This is what LMKD currently is - a userspace RT process.
> My point was that this page allocation queue that you implemented
> can't be implemented in userspace, at least not without extensive
> communication with the kernel.

Oh, that's easy to address. My page allocation queue and the decision on when to
kill a process are orthogonal. In fact, the page allocation queue could be
touched up a bit to factor in the issues Michal mentioned, and it can be
implemented as an improvement to the existing OOM killer. The point of it is
just to ensure that page allocation requests that have gone OOM are given
priority over other allocation requests when free pages start to trickle in.

Userspace doesn't need to know about the page allocation queue, and the queue is
not necessary to implement the method of determining when to kill processes that
I've proposed. It's an optimization, not a necessity.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 05:11:25PM -0400, Joel Fernandes wrote:
> But the point is that a transient memory spike should not be a signal to
> kill _any_ process.  The reaction to kill shouldn't be so spontaneous
> that unwanted tasks are killed because the system went into panic mode.
> It should be averaged out, which I believe is what PSI does.

In my patch from the first email, I implemented the decision to kill a process
at the same time that the existing kernel OOM killer decides to kill a process.
If the kernel's OOM killer were susceptible to killing processes due to
transient memory spikes, then I think there would have been several complaints
about this behavior regardless of which userspace or architecture is in use.
I think the existing OOM killer has this done right.

The decision to kill a process occurs after the page allocator has tried _very_
hard to satisfy a page allocation via alternative means, such as utilizing
compaction, flushing file-backed pages to disk via kswapd, and direct reclaim.
Once all of those means have failed, it is quite reasonable to kill a process to
free memory. Trying to wait out the memory starvation at this point would be
futile.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Mon, Mar 11, 2019 at 01:46:26PM -0700, Sultan Alsawaf wrote:
> On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> > The idea seems interesting although I need to think about this a bit
> > more. Killing processes based on failed page allocation might backfire
> > during transient spikes in memory usage.
> 
> This issue could be alleviated if tasks could be killed and have their pages
> reaped faster.

But the point is that a transient memory spike should not be a signal to
kill _any_ process.  The reaction to kill shouldn't be so spontaneous
that unwanted tasks are killed because the system went into panic mode.
It should be averaged out, which I believe is what PSI does.

thanks,

- Joel



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> The idea seems interesting although I need to think about this a bit
> more. Killing processes based on failed page allocation might backfire
> during transient spikes in memory usage.

This issue could be alleviated if tasks could be killed and have their pages
reaped faster. Currently, Linux takes a _very_ long time to free a task's memory
after an initial privileged SIGKILL is sent to a task, even with the task's
priority being set to the highest possible (so unwanted scheduler preemption
starving dying tasks of CPU time is not the issue at play here). I've
frequently measured the difference in time between when a SIGKILL is sent for a
task and when free_task() is called for that task to be hundreds of
milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD
suffers from as well, and perhaps any OOM killer implementation in Linux, since
you cannot evaluate the effect you've had on memory pressure by killing a process
for at least several tens of milliseconds.

> AFAICT the biggest issue with using this approach in userspace is that
> it's not practically implementable without heavy in-kernel support.
> How to implement such interaction between kernel and userspace would
> be an interesting discussion which I would be happy to participate in.

You could signal a lightweight userspace process that has maximum scheduler
priority and have it kill the tasks it'd like.
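
Something along these lines (illustrative sketch):

#include <sched.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = {
		.sched_priority = sched_get_priority_max(SCHED_FIFO),
	};

	/* Run the killer daemon at top RT priority so it still gets CPU
	 * promptly under heavy memory pressure. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");

	/* ... wait for the kernel's notification, pick a victim, then: */
	/* kill(victim_pid, SIGKILL); */
	return 0;
}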

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Sultan Alsawaf
On Mon, Mar 11, 2019 at 06:43:20PM +0100, Michal Hocko wrote:
> I am sorry but we are not going to maintain two different OOM
> implementations in the kernel. From a quick look the implementation is
> quite a hack which is not really suitable for anything but a very
> specific usecase. E.g. reusing a freed page for a waiting allocation
> sounds like an interesting idea but it doesn't really work for many
> reasons. E.g. any NUMA affinity is broken, zone protection doesn't work
> either. Not to mention how the code hooks into the allocator hot paths.
> This is simply no no.
> 
> Last but not least people have worked really hard to provide means (PSI)
> to do what you need in the userspace.

Hi Michal,

Thanks for the feedback. I had no doubt that this would be vehemently rejected
on the mailing list, but I wanted feedback/opinions on it and thus sent it as
an RFC. At best I thought perhaps the mechanisms I've employed might serve as
inspiration for LMKD improvements in Android, since this hacky OOM killer I've
devised does work quite well for the very specific usecase it is set out to
address. The NUMA affinity and zone protection bits are helpful insights too.

I'll take a look at PSI which Joel mentioned as well.

Thanks,
Sultan Alsawaf


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Michal Hocko
On Sun 10-03-19 13:34:03, Sultan Alsawaf wrote:
> From: Sultan Alsawaf 
> 
> This is a complete low memory killer solution for Android that is small
> and simple. It kills the largest, least-important processes it can find
> whenever a page allocation has completely failed (right after direct
> reclaim). Processes are killed according to the priorities that Android
> gives them, so that the least important processes are always killed
> first. Killing larger processes is preferred in order to free the most
> memory possible in one go.
> 
> Simple LMK is integrated deeply into the page allocator in order to
> catch exactly when a page allocation fails and exactly when a page is
> freed. Failed page allocations that have invoked Simple LMK are placed
> on a queue and wait for Simple LMK to satisfy them. When a page is about
> to be freed, the failed page allocations are given priority over normal
> page allocations by Simple LMK to see if they can immediately use the
> freed page.
> 
> Additionally, processes are continuously killed by failed small-order
> page allocations until they are satisfied.

I am sorry but we are not going to maintain two different OOM
implementations in the kernel. From a quick look the implementation is
quite a hack which is not really suitable for anything but a very
specific usecase. E.g. reusing a freed page for a waiting allocation
sounds like an interesting idea but it doesn't really work for many
reasons. E.g. any NUMA affinity is broken, zone protection doesn't work
either. Not to mention how the code hooks into the allocator hot paths.
This is simply no no.

Last but not least people have worked really hard to provide means (PSI)
to do what you need in the userspace.
 
> Signed-off-by: Sultan Alsawaf 
> ---
>  drivers/android/Kconfig  |  28 
>  drivers/android/Makefile |   1 +
>  drivers/android/simple_lmk.c | 301 +++
>  include/linux/sched.h|   3 +
>  include/linux/simple_lmk.h   |  11 ++
>  kernel/fork.c|   3 +
>  mm/page_alloc.c  |  13 ++
>  7 files changed, 360 insertions(+)
>  create mode 100644 drivers/android/simple_lmk.c
>  create mode 100644 include/linux/simple_lmk.h
> 
> diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
> index 6fdf2abe4..7469d049d 100644
> --- a/drivers/android/Kconfig
> +++ b/drivers/android/Kconfig
> @@ -54,6 +54,34 @@ config ANDROID_BINDER_IPC_SELFTEST
> exhaustively with combinations of various buffer sizes and
> alignments.
>  
> +config ANDROID_SIMPLE_LMK
> + bool "Simple Android Low Memory Killer"
> + depends on !MEMCG
> + ---help---
> +   This is a complete low memory killer solution for Android that is
> +   small and simple. It is integrated deeply into the page allocator to
> +   know exactly when a page allocation hits OOM and exactly when a page
> +   is freed. Processes are killed according to the priorities that
> +   Android gives them, so that the least important processes are always
> +   killed first.
> +
> +if ANDROID_SIMPLE_LMK
> +
> +config ANDROID_SIMPLE_LMK_MINFREE
> + int "Minimum MiB of memory to free per reclaim"
> + default "64"
> + help
> +   Simple LMK will free at least this many MiB of memory per reclaim.
> +
> +config ANDROID_SIMPLE_LMK_KILL_TIMEOUT
> + int "Kill timeout in milliseconds"
> + default "50"
> + help
> +   Simple LMK will only perform memory reclaim at most once per this
> +   amount of time.
> +
> +endif # if ANDROID_SIMPLE_LMK
> +
>  endif # if ANDROID
>  
>  endmenu
> diff --git a/drivers/android/Makefile b/drivers/android/Makefile
> index c7856e320..7c91293b6 100644
> --- a/drivers/android/Makefile
> +++ b/drivers/android/Makefile
> @@ -3,3 +3,4 @@ ccflags-y += -I$(src) # needed for trace events
>  obj-$(CONFIG_ANDROID_BINDERFS)   += binderfs.o
>  obj-$(CONFIG_ANDROID_BINDER_IPC) += binder.o binder_alloc.o
>  obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
> +obj-$(CONFIG_ANDROID_SIMPLE_LMK) += simple_lmk.o
> diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
> new file mode 100644
> index 0..8a441650a
> --- /dev/null
> +++ b/drivers/android/simple_lmk.c
> @@ -0,0 +1,301 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Sultan Alsawaf .
> + */
> +
> +#define pr_fmt(fmt) "simple_lmk: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define MIN_FREE_PAGES (CONFIG_ANDROID_SIMPLE_LMK_MINFREE * SZ_1M / PAGE_SIZE)
> +
> +struct oom_alloc_req {
> + struct page *page;
> + struct completion done;
> + struct list_head lh;
> + unsigned int order;
> + int migratetype;
> +};
> +
> +struct victim_info {
> + struct task_struct *tsk;
> + unsigned long size;
> +};
> +
> +enum {
> + DISABLED,
> + STARTING,
> + READY,
> + KILLING
> +};

Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Mon, Mar 11, 2019 at 12:32:33PM -0400, Joel Fernandes wrote:
> On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> [...]
> >  
> > /* Perform scheduler related setup. Assign this task to a CPU. */
> > retval = sched_fork(clone_flags, p);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3eb01dedf..fd0d697c6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -67,6 +67,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page,
> > }
> > }
> >  
> > +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> > +   if (simple_lmk_page_in(page, order, migratetype))
> > +   return;
> > +#endif
> > +
> > list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
> >  out:
> > zone->free_area[order].nr_free++;
> > @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
> > goto nopage;
> >  
> > +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> > +   page = simple_lmk_oom_alloc(order, ac->migratetype);
> > +   if (page)
> > +   prep_new_page(page, order, gfp_mask, alloc_flags);
> > +   goto got_pg;
> > +#endif
> > +
> 
> Hacking generic MM code with an Android-specific callback is probably a major
> issue with your patch.
>
> Also I CC'd -mm maintainers and lists since your patch
> touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I
> added them this time.

I see you CC'd linux-mm on your initial patch, so I apologize. Ignore this
part of my reply. Thanks.



> Have you looked at the recent PSI work that Suren and Johannes have been
> doing [1]?  As I understand, userspace lmkd may be migrated to use that at some
> point.  Suren can provide more details. I am sure AOSP contributions to make
> LMKd better by using the PSI backend would be appreciated. Please consider
> collaborating on that and help out, thanks. Check the cover-letter of that
> patch [1] where LMKd is mentioned.
 


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-11 Thread Joel Fernandes
On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
[...]
>  
>   /* Perform scheduler related setup. Assign this task to a CPU. */
>   retval = sched_fork(clone_flags, p);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3eb01dedf..fd0d697c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -67,6 +67,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -967,6 +968,11 @@ static inline void __free_one_page(struct page *page,
>   }
>   }
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + if (simple_lmk_page_in(page, order, migratetype))
> + return;
> +#endif
> +
>   list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
>  out:
>   zone->free_area[order].nr_free++;
> @@ -4427,6 +4433,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
>   goto nopage;
>  
> +#ifdef CONFIG_ANDROID_SIMPLE_LMK
> + page = simple_lmk_oom_alloc(order, ac->migratetype);
> + if (page)
> + prep_new_page(page, order, gfp_mask, alloc_flags);
> + goto got_pg;
> +#endif
> +

Hacking generic MM code with an Android-specific callback is probably a major
issue with your patch. Also I CC'd -mm maintainers and lists since your patch
touches page_alloc.c. Always run get_maintainer.pl before sending a patch. I
added them this time.

Have you looked at the recent PSI work that Suren and Johannes have been
doing [1]?  As I understand, userspace lmkd may be migrated to use that at some
point.  Suren can provide more details. I am sure AOSP contributions to make
LMKd better by using the PSI backend would be appreciated. Please consider
collaborating on that and help out, thanks. Check the cover-letter of that
patch [1] where LMKd is mentioned.

thanks,

 - Joel

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1951257.html



Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-10 Thread Sultan Alsawaf
On Sun, Mar 10, 2019 at 10:03:35PM +0100, Greg Kroah-Hartman wrote:
> On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > This is a complete low memory killer solution for Android that is small
> > and simple. It kills the largest, least-important processes it can find
> > whenever a page allocation has completely failed (right after direct
> > reclaim). Processes are killed according to the priorities that Android
> > gives them, so that the least important processes are always killed
> > first. Killing larger processes is preferred in order to free the most
> > memory possible in one go.
> > 
> > Simple LMK is integrated deeply into the page allocator in order to
> > catch exactly when a page allocation fails and exactly when a page is
> > freed. Failed page allocations that have invoked Simple LMK are placed
> > on a queue and wait for Simple LMK to satisfy them. When a page is about
> > to be freed, the failed page allocations are given priority over normal
> > page allocations by Simple LMK to see if they can immediately use the
> > freed page.
> > 
> > Additionally, processes are continuously killed by failed small-order
> > page allocations until they are satisfied.
> > 
> > Signed-off-by: Sultan Alsawaf 
> 
> Wait, why?  We just removed the in-kernel android memory killer, we
> don't want to add another one back again, right?  Android Go devices
> work just fine with the userspace memory killer code, and those are "low
> memory" by design.
> 
> Why do we need kernel code here at all?
> 
> thanks,
> 
> greg k-h

Hi Greg,

Thanks for replying. It has not been my experience and the experience of many
others that Android's userspace low memory killer works "just fine." On
my Pixel 3 XL with a meager 4GB of memory, the userspace killer has had issues
with killing too many processes, which has resulted in a noticeably poor user
experience for all Pixel owners. From the looks of lmkd on the master branch,
there still isn't really any definitive solution for this, aside from a 100ms
delay in between process kills.

I think that the userspace low memory killer is more complex than necessary,
especially since in the kernel we can detect exactly when we run out of memory
and react far more quickly than any userspace daemon.

The original reasoning behind why the old kernel low memory killer was removed
is also a bit vague to me. It just seemed to be abandonware, and all of a sudden
a userspace daemon was touted as the solution.

This driver is like an Android-flavored version of the kernel's oom killer, and
has proven quite effective for me on my Pixel. Processes are killed exactly when
a page allocation fails, so memory use is maximized. There is no complexity to
try and estimate how full memory is either.

Thanks,
Sultan


Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

2019-03-10 Thread Greg Kroah-Hartman
On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> From: Sultan Alsawaf 
> 
> This is a complete low memory killer solution for Android that is small
> and simple. It kills the largest, least-important processes it can find
> whenever a page allocation has completely failed (right after direct
> reclaim). Processes are killed according to the priorities that Android
> gives them, so that the least important processes are always killed
> first. Killing larger processes is preferred in order to free the most
> memory possible in one go.
> 
> Simple LMK is integrated deeply into the page allocator in order to
> catch exactly when a page allocation fails and exactly when a page is
> freed. Failed page allocations that have invoked Simple LMK are placed
> on a queue and wait for Simple LMK to satisfy them. When a page is about
> to be freed, the failed page allocations are given priority over normal
> page allocations by Simple LMK to see if they can immediately use the
> freed page.
> 
> Additionally, processes are continuously killed by failed small-order
> page allocations until they are satisfied.
> 
> Signed-off-by: Sultan Alsawaf 

Wait, why?  We just removed the in-kernel android memory killer, we
don't want to add another one back again, right?  Android Go devices
work just fine with the userspace memory killer code, and those are "low
memory" by design.

Why do we need kernel code here at all?

thanks,

greg k-h