On Mon, Apr 09, 2018 at 09:34:07AM +0200, Michal Hocko wrote:
> On Sat 07-04-18 21:27:09, Matthew Wilcox wrote:
> > > > - Steal time from other processes to free memory (KSWAPD_RECLAIM)
> > >
> > > What does that mean? If I drop the flag, do I not steal? Well I do because
> > > they will hit direct reclaim sooner...
> > If they allocate memory, sure. A process which stays in its working
> > set won't, unless it's preempted by kswapd.
> Well, I was probably not clear here. KSWAPD_RECLAIM is not something you
> want to drop because this is a cooperative flag. If you do not use it
> then you are effectively pushing others to the direct reclaim because
> the kswapd won't be woken up and won't do the background work. Your
> wording makes it sound like a good thing to drop.
If memory is low, *somebody* has to reclaim. As I understand it, kswapd
was originally introduced because networking might do many allocations
from interrupt context, and so was unable to do its own reclaiming. On a
machine which was used only for routing, there was no userspace process to
do the reclaiming, so it ran out of memory. But if you're an HPC person
who's expecting their long-running tasks to be synchronised and not be
unnecessarily disturbed, having kswapd preempting your task is awful.
I'm not arguing in favour of removing kswapd or anything like that,
but if you're not willing/able to reclaim memory yourself, then you're
necessarily stealing time from other tasks in order to have reclaim
> > > What does that mean and how is it different from NOWAIT? Is this about
> > > the low watermark and if yes do we want to teach users about this and
> > > make the whole thing even more complicated? Does it wake
> > > kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
> > > RETRY_MAYFAIL, NOFAIL?
> > LOW doesn't quite fit into the eagerness scale with the other flags;
> > instead it's composable with them. So you can specify NOWAIT | LOW,
> > NORETRY | LOW, NOFAIL | LOW, etc. All I have in mind is something
> > like this:
> > 	if (alloc_flags & ALLOC_HIGH)
> > 		min -= min / 2;
> > +	if (alloc_flags & ALLOC_LOW)
> > +		min += min / 2;
> > The idea is that a GFP_KERNEL | __GFP_LOW allocation cannot force a
> > GFP_KERNEL allocation into an OOM situation because it cannot take
> > the last pages of memory before the watermark.
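To make the arithmetic concrete, here is a minimal userspace sketch of that watermark adjustment. It is not kernel code: the flag values and the helpers effective_min() and watermark_ok() are made up for illustration, ALLOC_LOW is only the proposal under discussion, and the real check also accounts for lowmem reserves and allocation order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for this sketch only. */
#define ALLOC_HIGH 0x1u
#define ALLOC_LOW  0x2u	/* proposed flag, assumed here */

/* Minimum watermark after applying the HIGH/LOW adjustments. */
static long effective_min(long min, unsigned int alloc_flags)
{
	assert(min >= 0);
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & ALLOC_LOW)
		min += min / 2;
	return min;
}

/*
 * Simplified stand-in for the watermark check: would the zone stay
 * above the adjusted mark after handing out nr_pages?
 */
static bool watermark_ok(long free_pages, long nr_pages, long min,
			 unsigned int alloc_flags)
{
	return free_pages - nr_pages > effective_min(min, alloc_flags);
}
```

With min = 100, a plain allocation is refused once it would leave 100 or fewer free pages, while a LOW allocation is already refused at 150, so it cannot take the last pages above the normal watermark.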
> So what are we going to do if the LOW watermark cannot succeed?
Depends on the other flags. GFP_NOWAIT | GFP_LOW will just return NULL
(somewhat more readily than a plain GFP_NOWAIT would). GFP_NORETRY |
GFP_LOW will do one pass through reclaim. If it gets enough pages
to drag the zone above the watermark, then it'll succeed, otherwise
return NULL. NOFAIL | LOW will keep retrying forever. GFP_KERNEL |
GFP_LOW ... hmm, that'll OOM-kill another process more eagerly than
a regular GFP_KERNEL allocation would. We'll need a little tweak so
GFP_LOW implies __GFP_RETRY_MAYFAIL.
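As a rough sketch of the combinations just listed, the following userspace C models what happens when the LOW watermark check fails. Everything here is an assumption for illustration: the F_* bits are not the kernel's values, low_failure_policy() is a made-up helper, and the LOW case already includes the "implies RETRY_MAYFAIL" tweak suggested above.

```c
#include <assert.h>

/* Hypothetical flag bits, for this sketch only. */
#define F_DIRECT_RECLAIM 0x1u	/* cleared for NOWAIT-style requests */
#define F_NORETRY        0x2u
#define F_NOFAIL         0x4u
#define F_LOW            0x8u	/* proposed flag, assumed here */

enum on_failure {
	FAIL_IMMEDIATELY,	/* return NULL without reclaiming */
	ONE_RECLAIM_PASS,	/* reclaim once, then return NULL */
	RETRY_WHILE_PROGRESS,	/* RETRY_MAYFAIL-like behaviour */
	RETRY_FOREVER,		/* NOFAIL */
	MAY_OOM_KILL,		/* plain GFP_KERNEL-like behaviour */
};

static enum on_failure low_failure_policy(unsigned int flags)
{
	/* NOFAIL only makes sense if the caller can reclaim. */
	assert(!(flags & F_NOFAIL) || (flags & F_DIRECT_RECLAIM));

	if (!(flags & F_DIRECT_RECLAIM))
		return FAIL_IMMEDIATELY;
	if (flags & F_NOFAIL)
		return RETRY_FOREVER;
	if (flags & F_NORETRY)
		return ONE_RECLAIM_PASS;
	if (flags & F_LOW)
		return RETRY_WHILE_PROGRESS;	/* the suggested tweak */
	return MAY_OOM_KILL;
}
```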
> > It can still make a
> > GFP_KERNEL allocation *more likely* to hit OOM (just like any other kind
> > of allocation can), but it can't do it by itself.
> So who would be a user of __GFP_LOW?
vmalloc and Steven's ringbuffer. If I write a kernel module that tries
to vmalloc 1TB of space, it'll OOM-kill everything on the machine trying
to get enough memory to fill the page array. Probably everyone using
__GFP_RETRY_MAYFAIL today, to be honest. It's more likely to accomplish
what they want -- trying slightly less hard to get memory than GFP_KERNEL
> > I've been wondering about combining the DIRECT_RECLAIM, NORETRY,
> > RETRY_MAYFAIL and NOFAIL flags together into a single field:
> > 0 => RECLAIM_NEVER, /* !DIRECT_RECLAIM */
> > 1 => RECLAIM_ONCE, /* NORETRY */
> > 2 => RECLAIM_PROGRESS, /* RETRY_MAYFAIL */
> > 3 => RECLAIM_FOREVER, /* NOFAIL */
> > The existence of __GFP_RECLAIM makes this a bit tricky. I honestly don't
> > know what this code is asking for:
> I am not sure I follow here. Is the RECLAIM_ an internal thing to the
No, I'm talking about changing the __GFP flags like this:
@@ -24,10 +24,8 @@ struct vm_area_struct;
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
+#define ___GFP_ACCOUNT 0x100u
#define ___GFP_NOWARN 0x200u
-#define ___GFP_RETRY_MAYFAIL 0x400u
-#define ___GFP_NOFAIL 0x800u
-#define ___GFP_NORETRY 0x1000u
#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
@@ -35,8 +33,10 @@ struct vm_area_struct;
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_ATOMIC 0x80000u
-#define ___GFP_ACCOUNT 0x100000u
-#define ___GFP_DIRECT_RECLAIM 0x400000u
+#define ___GFP_RECLAIM_NEVER 0x00000u
+#define ___GFP_RECLAIM_ONCE 0x10000u
+#define ___GFP_RECLAIM_PROGRESS 0x20000u
+#define ___GFP_RECLAIM_FOREVER 0x30000u
#define ___GFP_WRITE 0x800000u
#define ___GFP_KSWAPD_RECLAIM 0x1000000u
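For illustration, a two-bit field like this could be read and written as below. This is only a userspace sketch: RECLAIM_SHIFT, RECLAIM_MASK and both helpers are hypothetical names, and the bit position is simply taken from the values in the diff as posted.

```c
#include <assert.h>

/* Hypothetical names; position follows the 0x10000u..0x30000u values. */
#define RECLAIM_SHIFT 16u
#define RECLAIM_MASK  (0x3u << RECLAIM_SHIFT)

enum reclaim_level {
	RECLAIM_NEVER    = 0,	/* !DIRECT_RECLAIM */
	RECLAIM_ONCE     = 1,	/* NORETRY */
	RECLAIM_PROGRESS = 2,	/* RETRY_MAYFAIL */
	RECLAIM_FOREVER  = 3,	/* NOFAIL */
};

/* Extract the reclaim level from a flags word. */
static enum reclaim_level gfp_reclaim_level(unsigned int gfp)
{
	return (enum reclaim_level)((gfp & RECLAIM_MASK) >> RECLAIM_SHIFT);
}

/* Return gfp with its reclaim level replaced by 'level'. */
static unsigned int gfp_set_reclaim_level(unsigned int gfp,
					  enum reclaim_level level)
{
	assert((unsigned int)level <= 3u);
	return (gfp & ~RECLAIM_MASK) | ((unsigned int)level << RECLAIM_SHIFT);
}
```

Because the levels are ordered, "how hard may this allocation try" becomes a plain comparison such as gfp_reclaim_level(gfp) >= RECLAIM_PROGRESS rather than a test of several independent bits.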
> > kernel/power/swap.c: __get_free_page(__GFP_RECLAIM |
> > __GFP_HIGH);
> > but I suspect I'll have to find out. There's about 60 places to look at.
> Well, it would be more understandable if this was written as
> (GFP_KERNEL | __GFP_HIGH) & ~(__GFP_FS|__GFP_IO)
Yeah, I think it's really (GFP_NOIO | __GFP_HIGH)
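A quick userspace check of that equivalence, as a sketch. The bit values are the ones visible in the diff context; the SK_ names are made up to avoid reserved identifiers, and the composite definitions (GFP_KERNEL as RECLAIM | IO | FS, GFP_NOIO as RECLAIM alone) are assumed to match gfp.h at the time.

```c
#include <assert.h>

/* Bit values as shown in the diff; SK_ names are local to this sketch. */
#define SK_HIGH           0x20u
#define SK_IO             0x40u
#define SK_FS             0x80u
#define SK_DIRECT_RECLAIM 0x400000u
#define SK_KSWAPD_RECLAIM 0x1000000u

/* Composite masks, assumed to mirror the kernel's definitions. */
#define SK_RECLAIM    (SK_DIRECT_RECLAIM | SK_KSWAPD_RECLAIM)
#define SK_GFP_KERNEL (SK_RECLAIM | SK_IO | SK_FS)
#define SK_GFP_NOIO   (SK_RECLAIM)

/* The mask as written in kernel/power/swap.c. */
static unsigned int swap_c_form(void)
{
	return SK_RECLAIM | SK_HIGH;
}

/* The "GFP_KERNEL minus FS and IO" spelling. */
static unsigned int stripped_kernel_form(void)
{
	return (SK_GFP_KERNEL | SK_HIGH) & ~(SK_FS | SK_IO);
}

/* The GFP_NOIO spelling. */
static unsigned int noio_form(void)
{
	return SK_GFP_NOIO | SK_HIGH;
}

static void check_equivalence(void)
{
	assert(swap_c_form() == stripped_kernel_form());
	assert(swap_c_form() == noio_form());
}
```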
> > I also want to add __GFP_KILL (to be part of the GFP_KERNEL definition).
> What does __GFP_KILL mean?
Allows OOM killing. So it's the inverse of the __GFP_RETRY_MAYFAIL bit.
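Roughly, converting a mask would look like the sketch below. Both bit values and all three helpers are hypothetical, and it deliberately ignores the other flags that influence whether the OOM killer may run; it only shows the one bit flipping sense.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define OLD_RETRY_MAYFAIL 0x1u	/* subtractive: forbids OOM killing */
#define NEW_KILL          0x2u	/* additive: permits OOM killing */

/* Today: killing is allowed unless RETRY_MAYFAIL is set. */
static bool old_may_oom_kill(unsigned int old_gfp)
{
	return !(old_gfp & OLD_RETRY_MAYFAIL);
}

/* Proposed: killing is allowed only if KILL is set. */
static bool new_may_oom_kill(unsigned int new_gfp)
{
	return (new_gfp & NEW_KILL) != 0;
}

/* Translate an old-style mask into the additive scheme. */
static unsigned int old_to_new(unsigned int old_gfp)
{
	unsigned int new_gfp = old_gfp & ~OLD_RETRY_MAYFAIL;

	assert(!(old_gfp & NEW_KILL));	/* input must be old-style */
	if (old_may_oom_kill(old_gfp))
		new_gfp |= NEW_KILL;
	return new_gfp;
}
```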
> > That way, each bit that you set in the GFP mask increases the things the
> > page allocator can do to get memory for you. At the moment, RETRY_MAYFAIL
> > subtracts the ability to kill other tasks, which is unusual.
> Well, it is not all that great because some flags add capabilities while
> some remove them but, well, life is hard when you try to extend an
> interface which was not all that great from the very beginning.
That's the story of Linux ;-)