On Fri, May 10, 2019 at 05:32:15PM -0700, Shakeel Butt wrote:
> From: Roman Gushchin <[email protected]>
> Date: Wed, May 8, 2019 at 1:30 PM
> To: Andrew Morton, Shakeel Butt
> Cc: <[email protected]>, <[email protected]>,
> <[email protected]>, Johannes Weiner, Michal Hocko, Rik van Riel,
> Christoph Lameter, Vladimir Davydov, <[email protected]>, Roman
> Gushchin
>
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The following investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the main reason: slab objects.
> >
> > The underlying problem is quite simple: any page charged to a cgroup holds
> > a reference to it, so the cgroup can't be reclaimed unless all charged pages
> > are gone. If a slab object is actively used by other cgroups, it won't be
> > reclaimed, and will prevent the origin cgroup from being reclaimed.
> >
> > Slab objects, and first of all vfs cache, are shared between cgroups which
> > are using the same underlying fs, and what's even more important, they're
> > shared between multiple generations of the same workload. So if something
> > is running periodically, every time in a new cgroup (like how systemd
> > works), we do accumulate multiple dying cgroups.
> >
> > Strictly speaking pagecache isn't different here, but there is a key
> > difference: we disable protection and apply some extra pressure on LRUs of
> > dying cgroups,
>
> How do you apply extra pressure on dying cgroups? cgroup-v2 does not
> have memory.force_empty.

I mean the following part of get_scan_count():

	/*
	 * If the cgroup's already been deleted, make sure to
	 * scrape out the remaining cache.
	 */
	if (!scan && !mem_cgroup_online(memcg))
		scan = min(lruvec_size, SWAP_CLUSTER_MAX);

It seems to work well, so that pagecache alone doesn't pin too many
dying cgroups. The price we're paying is some excessive IO here,
which could be avoided if we were able to recharge the pagecache.
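
To make the effect of that check concrete, here is a minimal user-space
sketch of it. This is not kernel code: struct model_memcg, model_min(),
model_scan_count() and MODEL_SWAP_CLUSTER_MAX (with its value of 32) are
hypothetical stand-ins made up for illustration, and the sketch only
models the single branch quoted above, not the rest of get_scan_count().

```c
#include <stdbool.h>

/* Hypothetical stand-in for SWAP_CLUSTER_MAX; the value is illustrative. */
#define MODEL_SWAP_CLUSTER_MAX 32UL

/* Hypothetical, simplified view of one cgroup's LRU state. */
struct model_memcg {
	bool online;               /* false once the cgroup has been deleted */
	unsigned long lruvec_size; /* pages still sitting on the LRU list */
};

static unsigned long model_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Mirror the quoted check: if the regular calculation produced a zero
 * scan target and the cgroup is offline, scan a small batch anyway so
 * that the remaining cache is eventually scraped out.
 */
static unsigned long model_scan_count(const struct model_memcg *memcg,
				      unsigned long scan)
{
	if (!scan && !memcg->online)
		scan = model_min(memcg->lruvec_size, MODEL_SWAP_CLUSTER_MAX);
	return scan;
}
```

In this model, an offline cgroup with 10 pages left and a zero scan
target gets a target of 10, an online cgroup in the same state stays at
0, and a non-zero target is passed through unchanged.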
Btw, thank you very much for looking into the patchset. I'll address
all comments and send v4 soon.

Thanks!
Roman