Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-07-06 Thread Paul E. McKenney
On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > On 6/22/23 10:53, Qi Zheng wrote:
> > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > >   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > >  
> > > - if (!down_read_trylock(&shrinker_rwsem))
> > > - goto out;
> > > -
> > > - list_for_each_entry(shrinker, &shrinker_list, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > >   struct shrink_control sc = {
> > >   .gfp_mask = gfp_mask,
> > >   .nid = nid,
> > >   .memcg = memcg,
> > >   };
> > >  
> > > + if (!shrinker_try_get(shrinker))
> > > + continue;
> > > + rcu_read_unlock();
> > 
> > I don't think you can do this unlock?

Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object.  And presumably this is exactly
what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete().  Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts.  (On these parts of the puzzle, I defer to those more familiar
with this code path.  And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)

Why is this important?  Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.
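
To make this concrete, here is a minimal sketch of the loop shape under
discussion, using the shrinker_try_get()/shrinker_put() names from the
patch (illustrative only, not the patch itself):

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		if (!shrinker_try_get(shrinker))
			continue;	/* shrinker being torn down, skip it */
		rcu_read_unlock();	/* the reference now pins this element */

		/* ... do the actual shrinking work, which may block ... */

		rcu_read_lock();
		/*
		 * Drop the reference only after re-entering the read-side
		 * critical section, so this element (and hence its
		 * ->list.next pointer) remains valid for the next step of
		 * the iteration.
		 */
		shrinker_put(shrinker);
	}
	rcu_read_unlock();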

There are a few other ways to make this sort of thing work:

1.  Defer the shrinker_put() to the beginning of the loop.
You would need a flag initially set to zero, and then set to
one just before (or just after) the rcu_read_lock() above.
You would also need another shrinker_old pointer to track the
old pointer.  Then at the top of the loop, if the flag is set,
invoke shrinker_put() on shrinker_old.  This ensures that the
previous shrinker structure stays around long enough to allow
the loop to find the next shrinker structure in the list.

This approach is attractive when the removal code path
can invoke shrinker_put() after the grace period ends.  (See
the first sketch after this list.)

2.  Make shrinker_put() invoke call_rcu() when ->refcount reaches
zero, and have the callback function free the object.  This of
course requires adding an rcu_head structure to the shrinker
structure, which might or might not be a reasonable course of
action.  If adding that rcu_head is reasonable, this simplifies
the logic quite a bit.  (See the second sketch below.)

3.  For the shrinker-structure-removal code path, remove the shrinker
structure, then remove the initial count from ->refcount,
and then keep doing grace periods until ->refcount is zero,
then do one more.  Of course, if the result of removing the
initial count was zero, then only a single additional grace
period is required.

This would need to be carefully commented, as it is a bit
unconventional.
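
A rough sketch of option 1, using a shrinker_old pointer whose NULL-ness
doubles as the flag (names invented here purely for illustration):

	struct shrinker *shrinker_old = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		if (shrinker_old) {
			/*
			 * The deferred reference kept the previous element
			 * alive while the cursor advanced past it; it can
			 * be dropped now.
			 */
			shrinker_put(shrinker_old);
			shrinker_old = NULL;
		}
		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		/* ... do the work ... */

		rcu_read_lock();
		shrinker_old = shrinker;	/* defer the put to the next pass */
	}
	rcu_read_unlock();
	if (shrinker_old)
		shrinker_put(shrinker_old);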

There are probably many other ways, but just to give an idea of a few
other ways to do this.
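
And a sketch of option 2, assuming an rcu_head field (here called "rcu")
were added to struct shrinker -- again purely illustrative:

	static void shrinker_free_rcu(struct rcu_head *head)
	{
		struct shrinker *shrinker =
			container_of(head, struct shrinker, rcu);

		kfree(shrinker);
	}

	void shrinker_put(struct shrinker *shrinker)
	{
		if (refcount_dec_and_test(&shrinker->refcount))
			call_rcu(&shrinker->rcu, shrinker_free_rcu);
	}

With this, shrinker_put() may be invoked from any context, and the
memory is guaranteed to outlive any reader still traversing the list.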

> > > +
> > >   ret = do_shrink_slab(&sc, shrinker, priority);
> > >   if (ret == SHRINK_EMPTY)
> > >   ret = 0;
> > >   freed += ret;
> > > - /*
> > > -  * Bail out if someone want to register a new shrinker to
> > > -  * prevent the registration from being stalled for long periods
> > > -  * by parallel ongoing shrinking.
> > > -  */
> > > - if (rwsem_is_contended(&shrinker_rwsem)) {
> > > - freed = freed ? : 1;
> > > - break;
> > > - }
> > > - }
> > >  
> > > - up_read(&shrinker_rwsem);
> > > -out:
> > > + rcu_read_lock();
> > 
> > That new rcu_read_lock() won't help AFAIK, the whole
> > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> > safe.
> 
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
> 
> This is a different pattern, as has been explained by Qi, and I
> think it *might* be safe.
> 
> *However.*
> 
> Right now I don't have time to go through a novel RCU list iteration
> pattern one step at a time to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-07-06 Thread Qi Zheng




On 2023/7/4 11:45, Qi Zheng wrote:



On 2023/7/4 00:39, Paul E. McKenney wrote:

On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:
@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,

  if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
  return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
-    if (!down_read_trylock(&shrinker_rwsem))
-    goto out;
-
-    list_for_each_entry(shrinker, &shrinker_list, list) {
+    rcu_read_lock();
+    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
  struct shrink_control sc = {
  .gfp_mask = gfp_mask,
  .nid = nid,
  .memcg = memcg,
  };
+    if (!shrinker_try_get(shrinker))
+    continue;
+    rcu_read_unlock();


I don't think you can do this unlock?


Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object.  And presumably this is exactly
what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete().  Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts.  (On these parts of the puzzle, I defer to those more familiar
with this code path.  And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)


Yeah, I think I've done it like above. A more detailed timing diagram is
below.



Why is this important?  Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.


There are a few other ways to make this sort of thing work:

1.    Defer the shrinker_put() to the beginning of the loop.
You would need a flag initially set to zero, and then set to
one just before (or just after) the rcu_read_lock() above.
You would also need another shrinker_old pointer to track the
old pointer.  Then at the top of the loop, if the flag is set,
invoke shrinker_put() on shrinker_old.    This ensures that the
previous shrinker structure stays around long enough to allow
the loop to find the next shrinker structure in the list.

This approach is attractive when the removal code path
can invoke shrinker_put() after the grace period ends.

2.    Make shrinker_put() invoke call_rcu() when ->refcount reaches
zero, and have the callback function free the object.  This of
course requires adding an rcu_head structure to the shrinker
structure, which might or might not be a reasonable course of
action.  If adding that rcu_head is reasonable, this simplifies
the logic quite a bit.

3.    For the shrinker-structure-removal code path, remove the shrinker
structure, then remove the initial count from ->refcount,
and then keep doing grace periods until ->refcount is zero,
then do one more.  Of course, if the result of removing the
initial count was zero, then only a single additional grace
period is required.

This would need to be carefully commented, as it is a bit
unconventional.


Thanks for such a detailed addition!



There are probably many other ways, but just to give an idea of a few
other ways to do this.


+
  ret = do_shrink_slab(&sc, shrinker, priority);
  if (ret == SHRINK_EMPTY)
  ret = 0;
  freed += ret;
-    /*
- * Bail out if someone want to register a new shrinker to
- * prevent the registration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
-    if (rwsem_is_contended(&shrinker_rwsem)) {
-    freed = freed ? : 1;
-    break;
-    }
-    }
-    up_read(&shrinker_rwsem);
-out:
+    rcu_read_lock();


That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single
rcu_read_lock() to be safe.


Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside rcu_read_lock() section bleeding into the RCU
critical section because rcu_read_lock() by itself is not a memory
barrier.

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-07-06 Thread Qi Zheng




On 2023/7/4 00:39, Paul E. McKenney wrote:

On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
  
-	if (!down_read_trylock(&shrinker_rwsem))
-   goto out;
-
-   list_for_each_entry(shrinker, &shrinker_list, list) {
+   rcu_read_lock();
+   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
.memcg = memcg,
};
  
+		if (!shrinker_try_get(shrinker))
+   continue;
+   rcu_read_unlock();


I don't think you can do this unlock?


Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object.  And presumably this is exactly
what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete().  Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts.  (On these parts of the puzzle, I defer to those more familiar
with this code path.  And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)


Yeah, I think I've done it like above. A more detailed timing diagram is
below.



Why is this important?  Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.

There are a few other ways to make this sort of thing work:

1.  Defer the shrinker_put() to the beginning of the loop.
You would need a flag initially set to zero, and then set to
one just before (or just after) the rcu_read_lock() above.
You would also need another shrinker_old pointer to track the
old pointer.  Then at the top of the loop, if the flag is set,
invoke shrinker_put() on shrinker_old.  This ensures that the
previous shrinker structure stays around long enough to allow
the loop to find the next shrinker structure in the list.

This approach is attractive when the removal code path
can invoke shrinker_put() after the grace period ends.

2.  Make shrinker_put() invoke call_rcu() when ->refcount reaches
zero, and have the callback function free the object.  This of
course requires adding an rcu_head structure to the shrinker
structure, which might or might not be a reasonable course of
action.  If adding that rcu_head is reasonable, this simplifies
the logic quite a bit.

3.  For the shrinker-structure-removal code path, remove the shrinker
structure, then remove the initial count from ->refcount,
and then keep doing grace periods until ->refcount is zero,
then do one more.  Of course, if the result of removing the
initial count was zero, then only a single additional grace
period is required.

This would need to be carefully commented, as it is a bit
unconventional.


Thanks for such a detailed addition!



There are probably many other ways, but just to give an idea of a few
other ways to do this.


+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
-   /*
-* Bail out if someone want to register a new shrinker to
-* prevent the registration from being stalled for long periods
-* by parallel ongoing shrinking.
-*/
-   if (rwsem_is_contended(&shrinker_rwsem)) {
-   freed = freed ? : 1;
-   break;
-   }
-   }
  
-	up_read(&shrinker_rwsem);
-out:
+   rcu_read_lock();


That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.


Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the algorithm.

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-07-06 Thread Qi Zheng

Hi Dave,

On 2023/6/24 19:08, Qi Zheng wrote:

Hi Dave,

On 2023/6/24 06:19, Dave Chinner wrote:

On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:

On 2023/6/23 14:29, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...


In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated.  This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.


Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure


Yeah, what I imagined before was to define a more concise struct
mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
memcg on system boot:

#ifndef CONFIG_MEMCG

struct shrinker_info {
	struct rcu_head rcu;
	atomic_long_t *nr_deferred;
	unsigned long *map;
	int map_nr_max;
};

struct mem_cgroup_per_node {
	struct shrinker_info __rcu *shrinker_info;
};

struct mem_cgroup {
	struct mem_cgroup_per_node *nodeinfo[];
};

#endif


These days I tried doing this:

1. CONFIG_MEMCG && !mem_cgroup_disabled()

   track all global shrinkers with root_mem_cgroup.

2. CONFIG_MEMCG && mem_cgroup_disabled()

   the root_mem_cgroup is also allocated in this case, so still use
   root_mem_cgroup to track all global shrinkers.

3. !CONFIG_MEMCG

   allocate a dummy memcg during system startup (after cgroup_init())
   and use it to track all global shrinkers

This works, but it requires modifying the startup order of some subsystems,
because some shrinkers will be registered before root_mem_cgroup is
allocated, such as:

1. rcu-kfree shrinker in rcu_init()
2. super block shrinkers in vfs_caches_init()

And cgroup_init() also depends on some file system infrastructure, so
I made some changes (rough and unorganized):

diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..6a12d3d0064e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4706,7 +4706,7 @@ static void __init init_mount_tree(void)

 void __init mnt_init(void)
 {
-   int err;
+   //int err;

mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, 
NULL);

@@ -4725,15 +4725,7 @@ void __init mnt_init(void)
if (!mount_hashtable || !mountpoint_hashtable)
panic("Failed to allocate mount hash table\n");

-   kernfs_init();
-
-   err = sysfs_init();
-   if (err)
-   printk(KERN_WARNING "%s: sysfs_init error: %d\n",
-   __func__, err);
-   fs_kobj = kobject_create_and_add("fs", NULL);
-   if (!fs_kobj)
-   printk(KERN_WARNING "%s: kobj create error\n", __func__);
shmem_init();
init_rootfs();
init_mount_tree();
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7d9c2a63b7cd..d87c67f6f66e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -119,6 +119,7 @@ static inline void call_rcu_hurry(struct rcu_head 
*head, rcu_callback_t func)


 /* Internal to kernel */
 void rcu_init(void);
+void rcu_shrinker_init(void);
 extern int rcu_scheduler_active;
 void rcu_sched_clock_irq(int user);
 void rcu_report_dead(unsigned int cpu);
diff --git a/init/main.c b/init/main.c
index ad920fac325c..4190fc6d10ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1049,14 +1049,22 @@ void start_kernel(void)
security_init();
dbg_late_init();
net_ns_init();
+   kernfs_init();
+   if (sysfs_init())
+   printk(KERN_WARNING "%s: sysfs_init error\n",
+   __func__);
+   fs_kobj = kobject_create_and_add("fs", NULL);
+   if (!fs_kobj)
+   printk(KERN_WARNING "%s: kobj create error\n", __func__);
+   proc_root_init();
+   cgroup_init();
vfs_caches_init();
pagecache_init();
signals_init();
seq_file_init();
-   proc_root_init();
nsfs_init();
cpuset_init();
-   cgroup_init();
+   rcu_shrinker_init();
taskstats_init_early();
delayacct_init();

diff --git 

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Qi Zheng




On 2023/6/24 19:08, Qi Zheng wrote:

Hi Dave,

On 2023/6/24 06:19, Dave Chinner wrote:

On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:

On 2023/6/23 14:29, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...


In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated.  This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.


Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure


Yeah, what I imagined before was to define a more concise struct
mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
memcg on system boot:

#ifndef CONFIG_MEMCG

struct shrinker_info {
	struct rcu_head rcu;
	atomic_long_t *nr_deferred;
	unsigned long *map;
	int map_nr_max;
};

struct mem_cgroup_per_node {
	struct shrinker_info __rcu *shrinker_info;
};

struct mem_cgroup {
	struct mem_cgroup_per_node *nodeinfo[];
};

#endif

But I have a concern: if all global shrinkers are tracked with the
info->map of the root memcg, a shrinker->id needs to be assigned to each
of them, which will cause info->map_nr_max to become larger than before,
making the traversal of info->map slower.


But most of the system is 'sb-xxx' shrinker instances, they all have
the SHRINKER_MEMCG_AWARE flag, so it should have little impact on the
speed of traversing info->map. ;)
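
For reference, the per-memcg walk is roughly of this shape (simplified
from shrink_slab_memcg(); a larger info->map_nr_max directly lengthens
this bitmap scan):

	int i;

	for_each_set_bit(i, info->map, info->map_nr_max) {
		struct shrinker *shrinker = idr_find(&shrinker_idr, i);

		if (unlikely(!shrinker))
			continue;	/* racing with unregistration */

		/* ... do_shrink_slab(&sc, shrinker, priority) ... */
	}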






If the above pattern is not safe, I will go back to the unification +
IDR method.


And that is exactly how we got into this mess in the first place


I only found one similar pattern in the kernel:

fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table

But IIUC, the refcount here needs to be decremented while holding the
rcu lock, as I did above.

So regardless of whether we choose unification + IDR in the end, I still
want to confirm whether the pattern I implemented above is safe. :)


Also + RCU mailing list.



Thanks,
Qi



-Dave


Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Qi Zheng




On 2023/6/22 23:12, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

The shrinker_rwsem is a global read-write lock in
shrinkers subsystem, which protects most operations
such as slab shrink, registration and unregistration
of shrinkers, etc. This can easily cause problems in
the following cases.

1) When the memory pressure is high and there are many
filesystems mounted or unmounted at the same time,
slab shrink will be affected (down_read_trylock()
failed).

Such as the real workload mentioned by Kirill Tkhai:

```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```

2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (such as mount a fs),
then this writer will be blocked and cause all
subsequent shrinker-related operations to be blocked.

Even if there is no competitor when shrinking slab, there
may still be a problem. If we have a long shrinker list
and we do not reclaim enough memory with each shrinker,
then the down_read_trylock() may be called with high
frequency. Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop
in IPC (instructions per cycle).

We used to implement the lockless slab shrink with
SRCU [1], but then kernel test robot reported -88.8%
regression in stress-ng.ramfs.ops_per_sec test case [2],
so we reverted it [3].

This commit uses the refcount+RCU method [4] proposed
by Dave Chinner to re-implement the lockless global slab
shrink. The memcg slab shrink is handled in the subsequent
patch.

Currently, the shrinker instances can be divided into
the following three types:

a) global shrinker instance statically defined in the kernel,
such as workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel
modules, such as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of shrinker instance is never freed.
For case b, the memory of shrinker instance will be freed
after the module is unloaded. But we will call synchronize_rcu()
in free_module() to wait for RCU read-side critical section to
exit. For case c, the memory of shrinker instance will be
dynamically freed by calling kfree_rcu(). So we can use
rcu_read_{lock,unlock}() to ensure that the shrinker instance
is valid.
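
[ For case c, the usual shape is something like the following, with a
  hypothetical foo_cache container used purely for illustration:

	struct foo_cache {
		struct shrinker shrinker;
		struct rcu_head rcu;
		/* ... */
	};

	static void foo_cache_destroy(struct foo_cache *foo)
	{
		unregister_shrinker(&foo->shrinker);
		kfree_rcu(foo, rcu);	/* concurrent RCU readers stay safe */
	}
]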

The shrinker::refcount mechanism ensures that the shrinker
instance will not be run again after unregistration. So the
structure that records the pointer of shrinker instance can be
safely freed without waiting for the RCU read-side critical
section.

In this way, while we implement the lockless slab shrink, we
don't need to be blocked in unregister_shrinker() to wait for
the RCU read-side critical section.
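
[ A sketch of how the refcount hand-off could look, assuming a
  completion in struct shrinker; the exact names in the patch may
  differ:

	static inline bool shrinker_try_get(struct shrinker *shrinker)
	{
		return refcount_inc_not_zero(&shrinker->refcount);
	}

	static inline void shrinker_put(struct shrinker *shrinker)
	{
		if (refcount_dec_and_test(&shrinker->refcount))
			complete(&shrinker->done);
	}

	void unregister_shrinker(struct shrinker *shrinker)
	{
		/* Unpublish first, then drop the initial reference. */
		list_del_rcu(&shrinker->list);
		shrinker_put(shrinker);
		wait_for_completion(&shrinker->done);	/* all users gone */
		/*
		 * An RCU grace period must still elapse before the memory
		 * itself is freed (kfree_rcu()/synchronize_rcu()).
		 */
	}
]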

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

  setting to a 60 second run per stressor
  dispatching hogs: 9 ramfs
  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
  ramfs        880623     60.02      7.71    226.93     14671.45      3753.09
  ramfs:
   1 System Management Interrupt
  for a 60.03s run time:
 5762.40s available CPU time
7.71s user time   (  0.13%)
  226.93s system time (  3.94%)
  234.64s total time  (  4.07%)
  load average: 8.54 3.06 2.11
  passed: 9: ramfs (9)
  failed: 0
  skipped: 0
  successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

  setting to a 60 second run per stressor
  dispatching hogs: 9 ramfs
  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
  ramfs        847562     60.02      7.44    230.22     14120.66      3566.23
  ramfs:
   4 System Management Interrupts
  for a 60.12s run time:
 5771.95s available CPU time
7.44s user time   (  0.13%)
  230.22s system time (  3.99%)
  237.66s total time  (  4.12%)
  load average: 8.18 2.43 0.84
  passed: 9: ramfs (9)
  failed: 0
  skipped: 0
  successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the ops/s has hardly changed.

[1]. 
https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
[4]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/

Signed-off-by: Qi Zheng 
---
  include/linux/shrinker.h |  6 ++
  mm/vmscan.c  | 33 

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Alan Huang


> On Jun 23, 2023, at 12:42 AM, Qi Zheng  wrote:
> 
> 
> 
> On 2023/6/22 23:12, Vlastimil Babka wrote:
>> On 6/22/23 10:53, Qi Zheng wrote:
>>> The shrinker_rwsem is a global read-write lock in
>>> shrinkers subsystem, which protects most operations
>>> such as slab shrink, registration and unregistration
>>> of shrinkers, etc. This can easily cause problems in
>>> the following cases.
>>> 
>>> 1) When the memory pressure is high and there are many
>>>filesystems mounted or unmounted at the same time,
>>>slab shrink will be affected (down_read_trylock()
>>>failed).
>>> 
>>>Such as the real workload mentioned by Kirill Tkhai:
>>> 
>>>```
>>>One of the real workloads from my experience is start
>>>of an overcommitted node containing many starting
>>>containers after node crash (or many resuming containers
>>>after reboot for kernel update). In these cases memory
>>>pressure is huge, and the node goes round in long reclaim.
>>>```
>>> 
>>> 2) If a shrinker is blocked (such as the case mentioned
>>>in [1]) and a writer comes in (such as mount a fs),
>>>then this writer will be blocked and cause all
>>>subsequent shrinker-related operations to be blocked.
>>> 
>>> Even if there is no competitor when shrinking slab, there
>>> may still be a problem. If we have a long shrinker list
>>> and we do not reclaim enough memory with each shrinker,
>>> then the down_read_trylock() may be called with high
>>> frequency. Because of the poor multicore scalability of
>>> atomic operations, this can lead to a significant drop
>>> in IPC (instructions per cycle).
>>> 
>>> We used to implement the lockless slab shrink with
>>> SRCU [1], but then kernel test robot reported -88.8%
>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>> so we reverted it [3].
>>> 
>>> This commit uses the refcount+RCU method [4] proposed
>>> by Dave Chinner to re-implement the lockless global slab
>>> shrink. The memcg slab shrink is handled in the subsequent
>>> patch.
>>> 
>>> Currently, the shrinker instances can be divided into
>>> the following three types:
>>> 
>>> a) global shrinker instance statically defined in the kernel,
>>> such as workingset_shadow_shrinker.
>>> 
>>> b) global shrinker instance statically defined in the kernel
>>> modules, such as mmu_shrinker in x86.
>>> 
>>> c) shrinker instance embedded in other structures.
>>> 
>>> For case a, the memory of shrinker instance is never freed.
>>> For case b, the memory of shrinker instance will be freed
>>> after the module is unloaded. But we will call synchronize_rcu()
>>> in free_module() to wait for RCU read-side critical section to
>>> exit. For case c, the memory of shrinker instance will be
>>> dynamically freed by calling kfree_rcu(). So we can use
>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>> is valid.
>>> 
>>> The shrinker::refcount mechanism ensures that the shrinker
>>> instance will not be run again after unregistration. So the
>>> structure that records the pointer of shrinker instance can be
>>> safely freed without waiting for the RCU read-side critical
>>> section.
>>> 
>>> In this way, while we implement the lockless slab shrink, we
>>> don't need to be blocked in unregister_shrinker() to wait for
>>> the RCU read-side critical section.
>>> 
>>> The following are the test results:
>>> 
>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>> 
>>> 1) Before applying this patchset:
>>> 
>>>  setting to a 60 second run per stressor
>>>  dispatching hogs: 9 ramfs
>>>  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>>>                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
>>>  ramfs        880623     60.02      7.71    226.93     14671.45      3753.09
>>>  ramfs:
>>>   1 System Management Interrupt
>>>  for a 60.03s run time:
>>> 5762.40s available CPU time
>>>7.71s user time   (  0.13%)
>>>  226.93s system time (  3.94%)
>>>  234.64s total time  (  4.07%)
>>>  load average: 8.54 3.06 2.11
>>>  passed: 9: ramfs (9)
>>>  failed: 0
>>>  skipped: 0
>>>  successful run completed in 60.03s (1 min, 0.03 secs)
>>> 
>>> 2) After applying this patchset:
>>> 
>>>  setting to a 60 second run per stressor
>>>  dispatching hogs: 9 ramfs
>>>  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>>>                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
>>>  ramfs        847562     60.02      7.44    230.22     14120.66      3566.23
>>>  ramfs:
>>>   4 System Management Interrupts
>>>  for a 60.12s run time:
>>> 5771.95s available CPU time
>>>7.44s user time   (  0.13%)
>>>  230.22s system time (  3.99%)
>>>  237.66s total time  (  4.12%)
>>>  load average: 8.18 2.43 0.84
>>>  passed: 9: ramfs (9)
>>>  failed: 0
>>>  skipped: 0
>>>  successful run completed in 60.12s (1 min, 0.12 secs)

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Qi Zheng




On 2023/6/23 01:41, Alan Huang wrote:



On Jun 23, 2023, at 12:42 AM, Qi Zheng  wrote:



On 2023/6/22 23:12, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

The shrinker_rwsem is a global read-write lock in
shrinkers subsystem, which protects most operations
such as slab shrink, registration and unregistration
of shrinkers, etc. This can easily cause problems in
the following cases.

1) When the memory pressure is high and there are many
filesystems mounted or unmounted at the same time,
slab shrink will be affected (down_read_trylock()
failed).

Such as the real workload mentioned by Kirill Tkhai:

```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```

2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (such as mount a fs),
then this writer will be blocked and cause all
subsequent shrinker-related operations to be blocked.

Even if there is no competitor when shrinking slab, there
may still be a problem. If we have a long shrinker list
and we do not reclaim enough memory with each shrinker,
then the down_read_trylock() may be called with high
frequency. Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop
in IPC (instructions per cycle).

We used to implement the lockless slab shrink with
SRCU [1], but then kernel test robot reported -88.8%
regression in stress-ng.ramfs.ops_per_sec test case [2],
so we reverted it [3].

This commit uses the refcount+RCU method [4] proposed
by Dave Chinner to re-implement the lockless global slab
shrink. The memcg slab shrink is handled in the subsequent
patch.

Currently, the shrinker instances can be divided into
the following three types:

a) global shrinker instance statically defined in the kernel,
such as workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel
modules, such as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of shrinker instance is never freed.
For case b, the memory of shrinker instance will be freed
after the module is unloaded. But we will call synchronize_rcu()
in free_module() to wait for RCU read-side critical section to
exit. For case c, the memory of shrinker instance will be
dynamically freed by calling kfree_rcu(). So we can use
rcu_read_{lock,unlock}() to ensure that the shrinker instance
is valid.

The shrinker::refcount mechanism ensures that the shrinker
instance will not be run again after unregistration. So the
structure that records the pointer of shrinker instance can be
safely freed without waiting for the RCU read-side critical
section.

In this way, while we implement the lockless slab shrink, we
don't need to be blocked in unregister_shrinker() to wait for
the RCU read-side critical section.

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

  setting to a 60 second run per stressor
  dispatching hogs: 9 ramfs
  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
  ramfs        880623     60.02      7.71    226.93     14671.45      3753.09
  ramfs:
   1 System Management Interrupt
  for a 60.03s run time:
 5762.40s available CPU time
7.71s user time   (  0.13%)
  226.93s system time (  3.94%)
  234.64s total time  (  4.07%)
  load average: 8.54 3.06 2.11
  passed: 9: ramfs (9)
  failed: 0
  skipped: 0
  successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

  setting to a 60 second run per stressor
  dispatching hogs: 9 ramfs
  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
  ramfs        847562     60.02      7.44    230.22     14120.66      3566.23
  ramfs:
   4 System Management Interrupts
  for a 60.12s run time:
 5771.95s available CPU time
7.44s user time   (  0.13%)
  230.22s system time (  3.99%)
  237.66s total time  (  4.12%)
  load average: 8.18 2.43 0.84
  passed: 9: ramfs (9)
  failed: 0
  skipped: 0
  successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the ops/s has hardly changed.

[1]. 
https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
[4]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/

Signed-off-by: Qi Zheng 
---
  

Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Qi Zheng

Hi Dave,

On 2023/6/24 06:19, Dave Chinner wrote:

On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:

On 2023/6/23 14:29, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...


In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated.  This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.


Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure


Yeah, what I imagined before was to define a more concise struct
mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
memcg on system boot:

#ifndef CONFIG_MEMCG

struct shrinker_info {
struct rcu_head rcu;
atomic_long_t *nr_deferred;
unsigned long *map;
int map_nr_max;
};

struct mem_cgroup_per_node {
struct shrinker_info __rcu  *shrinker_info;
};

struct mem_cgroup {
struct mem_cgroup_per_node *nodeinfo[];
};

#endif

But I have a concern: if all global shrinkers are tracked with the
info->map of the root memcg, a shrinker->id needs to be assigned to each
of them, which will cause info->map_nr_max to become larger than before,
making the traversal of info->map slower.




If the above pattern is not safe, I will go back to the unification +
IDR method.


And that is exactly how we got into this mess in the first place


I only found one similar pattern in the kernel:

fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table

But IIUC, the refcount here needs to be decremented while holding the
rcu lock, as I did above.

So regardless of whether we choose unification + IDR in the end, I still
want to confirm whether the pattern I implemented above is safe. :)

Thanks,
Qi



-Dave


Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-27 Thread Qi Zheng




On 2023/6/23 14:29, Dave Chinner wrote:

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:

On 6/22/23 10:53, Qi Zheng wrote:

@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
  
-	if (!down_read_trylock(&shrinker_rwsem))
-   goto out;
-
-   list_for_each_entry(shrinker, &shrinker_list, list) {
+   rcu_read_lock();
+   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
.memcg = memcg,
};
  
+		if (!shrinker_try_get(shrinker))
+   continue;
+   rcu_read_unlock();


I don't think you can do this unlock?


+
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
-   /*
-* Bail out if someone want to register a new shrinker to
-* prevent the registration from being stalled for long periods
-* by parallel ongoing shrinking.
-*/
-   if (rwsem_is_contended(&shrinker_rwsem)) {
-   freed = freed ? : 1;
-   break;
-   }
-   }
  
-	up_read(&shrinker_rwsem);
-out:
+   rcu_read_lock();


That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.


Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside rcu_read_lock() section bleeding into the RCU
critical section because rcu_read_lock() by itself is not a memory
barrier.

Maybe Paul has seen this pattern often enough he could simply tell
us what conditions it is safe in. But for me to work that out from
first principles? I just don't have the time to do that right now.


Hi Paul, can you help to confirm this?




IIUC this is why Dave in [4] suggests unifying shrink_slab() with
shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.


Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...


In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated. This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.

If the above pattern is not safe, I will go back to the unification +
IDR method.

Thanks,
Qi



Cheers,

Dave.


Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-23 Thread Dave Chinner
On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
> On 2023/6/23 14:29, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > > On 6/22/23 10:53, Qi Zheng wrote:
> > Yes, I suggested the IDR route because radix tree lookups under RCU
> > with reference counted objects are a known safe pattern that we can
> > easily confirm is correct or not.  Hence I suggested the unification
> > + IDR route because it makes the life of reviewers so, so much
> > easier...
> 
> In fact, I originally planned to try the unification + IDR method you
> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
> the struct mem_cgroup is not even defined, and root_mem_cgroup and
> shrinker_info will not be allocated.  This required more code changes, so
> I ended up keeping the shrinker_list and implementing the above pattern.

Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure

> If the above pattern is not safe, I will go back to the unification +
> IDR method.

And that is exactly how we got into this mess in the first place

-Dave
-- 
Dave Chinner
da...@fromorbit.com


Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-23 Thread Dave Chinner
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
> > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >  
> > -   if (!down_read_trylock(&shrinker_rwsem))
> > -   goto out;
> > -
> > -   list_for_each_entry(shrinker, &shrinker_list, list) {
> > +   rcu_read_lock();
> > +   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > struct shrink_control sc = {
> > .gfp_mask = gfp_mask,
> > .nid = nid,
> > .memcg = memcg,
> > };
> >  
> > +   if (!shrinker_try_get(shrinker))
> > +   continue;
> > +   rcu_read_unlock();
> 
> I don't think you can do this unlock?
> 
> > +
> > ret = do_shrink_slab(&sc, shrinker, priority);
> > if (ret == SHRINK_EMPTY)
> > ret = 0;
> > freed += ret;
> > -   /*
> > -* Bail out if someone want to register a new shrinker to
> > -* prevent the registration from being stalled for long periods
> > -* by parallel ongoing shrinking.
> > -*/
> > -   if (rwsem_is_contended(&shrinker_rwsem)) {
> > -   freed = freed ? : 1;
> > -   break;
> > -   }
> > -   }
> >  
> > -   up_read(&shrinker_rwsem);
> > -out:
> > +   rcu_read_lock();
> 
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside rcu_read_lock() section bleeding into the RCU
critical section because rcu_read_lock() by itself is not a memory
barrier.

Maybe Paul has seen this pattern often enough he could simply tell
us what conditions it is safe in. But for me to work that out from
first principles? I just don't have the time to do that right now.

> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...
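
The known-safe shape being referred to is roughly this (sketch only,
using the shrinker_idr from the memcg path):

	rcu_read_lock();
	shrinker = idr_find(&shrinker_idr, id);	/* idr_find() is RCU-safe */
	if (shrinker && !shrinker_try_get(shrinker))
		shrinker = NULL;
	rcu_read_unlock();

	if (shrinker) {
		/* ... the reference keeps the object alive here ... */
		shrinker_put(shrinker);
	}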

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Intel-gfx] [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-22 Thread Vlastimil Babka
On 6/22/23 10:53, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
> 
> 1) When the memory pressure is high and there are many
>filesystems mounted or unmounted at the same time,
>slab shrink will be affected (down_read_trylock()
>failed).
> 
>Such as the real workload mentioned by Kirill Tkhai:
> 
>```
>One of the real workloads from my experience is start
>of an overcommitted node containing many starting
>containers after node crash (or many resuming containers
>after reboot for kernel update). In these cases memory
>pressure is huge, and the node goes round in long reclaim.
>```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>in [1]) and a writer comes in (such as mount a fs),
>then this writer will be blocked and cause all
>subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
> 
> We used to implement the lockless slab shrink with
> SRCU [1], but then kernel test robot reported -88.8%
> regression in stress-ng.ramfs.ops_per_sec test case [2],
> so we reverted it [3].
> 
> This commit uses the refcount+RCU method [4] proposed
> by Dave Chinner to re-implement the lockless global slab
> shrink. The memcg slab shrink is handled in the subsequent
> patch.
> 
> Currently, the shrinker instances can be divided into
> the following three types:
> 
> a) global shrinker instance statically defined in the kernel,
> such as workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel
> modules, such as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed.
> For case b, the memory of shrinker instance will be freed
> after the module is unloaded. But we will call synchronize_rcu()
> in free_module() to wait for RCU read-side critical section to
> exit. For case c, the memory of shrinker instance will be
> dynamically freed by calling kfree_rcu(). So we can use
> rcu_read_{lock,unlock}() to ensure that the shrinker instance
> is valid.
> 
> The shrinker::refcount mechanism ensures that the shrinker
> instance will not be run again after unregistration. So the
> structure that records the pointer of shrinker instance can be
> safely freed without waiting for the RCU read-side critical
> section.
> 
> In this way, while we implement the lockless slab shrink, we
> don't need to be blocked in unregister_shrinker() to wait for
> the RCU read-side critical section.
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
>  ramfs        880623     60.02      7.71    226.93     14671.45      3753.09
>  ramfs:
>   1 System Management Interrupt
>  for a 60.03s run time:
> 5762.40s available CPU time
>7.71s user time   (  0.13%)
>  226.93s system time (  3.94%)
>  234.64s total time  (  4.07%)
>  load average: 8.54 3.06 2.11
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.03s (1 min, 0.03 secs)
> 
> 2) After applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor   bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
>                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
>  ramfs        847562     60.02      7.44    230.22     14120.66      3566.23
>  ramfs:
>   4 System Management Interrupts
>  for a 60.12s run time:
> 5771.95s available CPU time
>7.44s user time   (  0.13%)
>  230.22s system time (  3.99%)
>  237.66s total time  (  4.12%)
>  load average: 8.18 2.43 0.84
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.12s (1 min, 0.12 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. 
> https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [4].