Re: [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
ly(!shrinker || !shrinker_try_get(shrinker))) > { > + clear_bit(offset, unit->map); > + rcu_read_unlock(); > continue; > } > + rcu_read_unlock(); > > /* Call non-slab shrinkers even though kmem is disabled > */ > if (!memcg_kmem_online() && > @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > set_shrinker_bit(memcg, nid, > shrinker_id); > } > freed += ret; > - > - if (rwsem_is_contended(&shrinker_rwsem)) { > - freed = freed ? : 1; > - goto unlock; > - } > + shrinker_put(shrinker); Ok, so why is this safe to call without holding the rcu read lock? The global shrinker has to hold the rcu_read_lock() whilst calling shrinker_put() to guarantee the validity of the list next pointer, but we don't hold off RCU here, so what guarantees that a racing global shrinker walk doesn't trip over this shrinker_put() call dropping the refcount to zero and the freeing occurring in a different context... > + /* > + * We have already exited the read-side of rcu critical section > + * before calling do_shrink_slab(), the shrinker_info may be > + * released in expand_one_shrinker_info(), so reacquire the > + * shrinker_info. > + */ > + index++; > + goto again; With that, what makes the use of shrinker_info in xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid? -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote: > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index eb342994675a..f06225f18531 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -4,6 +4,8 @@ > > #include <linux/atomic.h> > #include <linux/types.h> > +#include <linux/refcount.h> > +#include <linux/completion.h> > > #define SHRINKER_UNIT_BITS BITS_PER_LONG > > @@ -87,6 +89,10 @@ struct shrinker { > int seeks; /* seeks to recreate an obj */ > unsigned flags; > > + refcount_t refcount; > + struct completion done; > + struct rcu_head rcu; Documentation, please. What does the refcount protect, what does the completion provide, etc. > + > void *private_data; > > /* These are for internal use */ > @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, > const char *fmt, ...); > void shrinker_register(struct shrinker *shrinker); > void shrinker_free(struct shrinker *shrinker); > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > +{ > + return refcount_inc_not_zero(&shrinker->refcount); > +} > + > +static inline void shrinker_put(struct shrinker *shrinker) > +{ > + if (refcount_dec_and_test(&shrinker->refcount)) > + complete(&shrinker->done); > +} > + > #ifdef CONFIG_SHRINKER_DEBUG > extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, > const char *fmt, ...); > diff --git a/mm/shrinker.c b/mm/shrinker.c > index 1911c06b8af5..d318f5621862 100644 > --- a/mm/shrinker.c > +++ b/mm/shrinker.c > @@ -2,6 +2,7 @@ > #include <linux/memcontrol.h> > #include <linux/rwsem.h> > #include <linux/shrinker.h> > +#include <linux/rculist.h> > #include <trace/events/vmscan.h> > > #include "internal.h" > @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + > + /* > + * We can safely unlock the RCU lock here since we already > + * hold the refcount of the shrinker. > + */ > + rcu_read_unlock(); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > + > /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. > + * This shrinker may be deleted from shrinker_list and freed > + * after the shrinker_put() below, but this shrinker is still > + * used for the next traversal. So it is necessary to hold the > + * RCU lock first to prevent this shrinker from being freed, > + * which also ensures that the next shrinker that is traversed > + * will not be freed (even if it is deleted from shrinker_list > + * at the same time). > */ This needs to be moved to the head of the function, and document the whole list walk, get, put and completion parts of the algorithm that make it safe. There's more to this than "we hold a reference count", especially the tricky "we might see the shrinker before it is fully initialised" case. > void shrinker_free(struct shrinker *shrinker) > { > struct dentry *debugfs_entry = NULL; > @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker) > if (!shrinker) > return; > > + if (shrinker->flags & SHRINKER_REGISTERED) { > + shrinker_put(shrinker); > + wait_for_completion(&shrinker->done); > + } Needs a comment explaining why we need to wait here... > + > down_write(&shrinker_rwsem); > if (shrinker->flags & SHRINKER_REGISTERED) { > - list_del(&shrinker->list); > + /* > + * Lookups on the shrinker are over and will fail in the future, > + * so we can now remove it from the lists and free it. > + */ rather than here after the wait has been done and provided the guarantee that no shrinker is running or will run again... -Dave. -- Dave Chinner da...@fromorbit.com
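The teardown ordering Dave is arguing for (drop the registration reference, wait on the completion, only then unlink and free) can be modelled in userspace C. This is a sketch with illustrative names: C11 atomics stand in for the kernel's refcount_t, and an atomic flag stands in for struct completion. It is not kernel code, just the shape of the lifecycle argument.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the shrinker lifecycle under discussion. */
struct shrinker {
	atomic_int refcount;   /* stands in for refcount_t */
	atomic_bool done;      /* stands in for struct completion */
	bool unlinked;
};

static bool shrinker_try_get(struct shrinker *s)
{
	/* refcount_inc_not_zero(): only take a reference while count > 0 */
	int old = atomic_load(&s->refcount);
	while (old != 0)
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	return false;
}

static void shrinker_put(struct shrinker *s)
{
	/* refcount_dec_and_test(): the last put fires the completion */
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		atomic_store(&s->done, true);  /* complete(&s->done) */
}

static void shrinker_register(struct shrinker *s)
{
	/* registration holds the base reference */
	atomic_store(&s->refcount, 1);
}

static void shrinker_free(struct shrinker *s)
{
	shrinker_put(s);                 /* drop the registration reference */
	while (!atomic_load(&s->done))   /* wait_for_completion(&s->done) */
		;
	/* No walker holds a reference and try_get now fails, so it is
	 * safe to unlink and free (list_del_rcu() + call_rcu() in-kernel). */
	s->unlinked = true;
}
```

After shrinker_free() returns, shrinker_try_get() can never succeed again, which is exactly the guarantee the comment placement in the patch should be documenting.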
Re: [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
} > } > } > up_read(&shrinker_rwsem); > @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > { > struct shrinker_info *info; > unsigned long ret, freed = 0; > - int i; > + int offset, index = 0; > > if (!mem_cgroup_online(memcg)) > return 0; > @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, > if (unlikely(!info)) > goto unlock; > > - for_each_set_bit(i, info->map, info->map_nr_max) { > - struct shrink_control sc = { > - .gfp_mask = gfp_mask, > - .nid = nid, > - .memcg = memcg, > - }; > - struct shrinker *shrinker; > + for (; index < shriner_id_to_index(info->map_nr_max); index++) { > + struct shrinker_info_unit *unit; This adds another layer of indent to shrink_slab_memcg(). Please factor it first so that the code ends up being readable. Doing that first as a separate patch will also make the actual algorithm changes in this patch be much more obvious - this huge hunk of diff is pretty much impossible to review... -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in shrinkers subsystem, > which protects most operations such as slab shrink, registration and > unregistration of shrinkers, etc. This can easily cause problems in the > following cases. > This commit uses the refcount+RCU method [5] proposed by Dave Chinner > to re-implement the lockless global slab shrink. The memcg slab shrink is > handled in the subsequent patch. > --- > include/linux/shrinker.h | 17 ++ > mm/shrinker.c| 70 +--- > 2 files changed, 68 insertions(+), 19 deletions(-) There's no documentation in the code explaining how the lockless shrinker algorithm works. It's left to the reader to work out how this all goes together > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index eb342994675a..f06225f18531 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -4,6 +4,8 @@ > > #include <linux/atomic.h> > #include <linux/types.h> > +#include <linux/refcount.h> > +#include <linux/completion.h> > > #define SHRINKER_UNIT_BITS BITS_PER_LONG > > @@ -87,6 +89,10 @@ struct shrinker { > int seeks; /* seeks to recreate an obj */ > unsigned flags; > > + refcount_t refcount; > + struct completion done; > + struct rcu_head rcu; What does the refcount protect, why do we need the completion, etc? 
> + > void *private_data; > > /* These are for internal use */ > @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, > const char *fmt, ...); > void shrinker_register(struct shrinker *shrinker); > void shrinker_free(struct shrinker *shrinker); > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > +{ > + return refcount_inc_not_zero(&shrinker->refcount); > +} > + > +static inline void shrinker_put(struct shrinker *shrinker) > +{ > + if (refcount_dec_and_test(&shrinker->refcount)) > + complete(&shrinker->done); > +} > + > #ifdef CONFIG_SHRINKER_DEBUG > extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, > const char *fmt, ...); > diff --git a/mm/shrinker.c b/mm/shrinker.c > index 1911c06b8af5..d318f5621862 100644 > --- a/mm/shrinker.c > +++ b/mm/shrinker.c > @@ -2,6 +2,7 @@ > #include <linux/memcontrol.h> > #include <linux/rwsem.h> > #include <linux/shrinker.h> > +#include <linux/rculist.h> > #include <trace/events/vmscan.h> > > #include "internal.h" > @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + > + /* > + * We can safely unlock the RCU lock here since we already > + * hold the refcount of the shrinker. > + */ > + rcu_read_unlock(); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > + > /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. 
> + * This shrinker may be deleted from shrinker_list and freed > + * after the shrinker_put() below, but this shrinker is still > + * used for the next traversal. So it is necessary to hold the > + * RCU lock first to prevent this shrinker from being freed, > + * which also ensures that the next shrinker that is traversed > + * will not be freed (even if it is deleted from shrinker_list > + * at the same time). > */ This comment really should be at the head of the function, describing the algorithm used within the function itself. i.e. how reference counts are used w.r.t. the rcu_read_lock() usage to guarantee existence of the shrinker and the validity of the list walk. I'm not going to remember all these little details when I look at this code in another 6 months time, and having to work it out from first principles every time I look at the code will waste a lot of time... -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote: > On 7/27/23 17:55, Qi Zheng wrote: > >>> goto err; > >>> } > >>> + zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count; > >>> + zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan; > >>> + zmd->mblk_shrinker->seeks = DEFAULT_SEEKS; > >>> + zmd->mblk_shrinker->private_data = zmd; > >>> + > >>> + shrinker_register(zmd->mblk_shrinker); > >> > >> I fail to see how this new shrinker API is better... Why isn't there a > >> shrinker_alloc_and_register() function ? That would avoid adding all this > >> code > >> all over the place as the new API call would be very similar to the current > >> shrinker_register() call with static allocation. > > > > In some registration scenarios, memory needs to be allocated in advance. > > So we continue to use the previous prealloc/register_prepared() > > algorithm. The shrinker_alloc_and_register() is just a helper function > > that combines the two, and this increases the number of APIs that > > shrinker exposes to the outside, so I choose not to add this helper. > > And that results in more code in many places instead of less code + a simple > inline helper in the shrinker header file... It's not just a "simple helper" - it's a function that has to take 6 or 7 parameters with a return value that must be checked and handled. This was done in the first versions of the patch set - the amount of code in each caller does not go down and, IMO, was much harder to read and determine "this is obviously correct" than what we have now. > So not adding that super simple > helper is not exactly the best choice in my opinion. Each to their own - I much prefer the existing style/API over having to go look up a helper function every time I want to check some random shrinker has been set up correctly. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote: > On 2023/7/26 16:08, Dave Chinner wrote: > > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote: > > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker > > > *shrinker); > > > void shrinker_register(struct shrinker *shrinker); > > > void shrinker_unregister(struct shrinker *shrinker); > > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > > > +{ > > > + return READ_ONCE(shrinker->registered) && > > > +refcount_inc_not_zero(&shrinker->refcount); > > > +} > > > > Why do we care about shrinker->registered here? If we don't set > > the refcount to 1 until we have fully initialised everything, then > > the shrinker code can key entirely off the reference count and > > none of the lookup code needs to care about whether the shrinker is > > registered or not. > > The purpose of checking shrinker->registered here is to stop running > shrinker after calling shrinker_free(), which can prevent the following > situations from happening: > > CPU 0 CPU 1 > > shrinker_try_get() > >shrinker_try_get() > > shrinker_put() > shrinker_try_get() >shrinker_put() I don't see any race here? What is wrong with having multiple active users at once? > > > > This should use a completion, then it is always safe under > > rcu_read_lock(). This also gets rid of the shrinker_lock spin lock, > > which only exists because we can't take a blocking lock under > > rcu_read_lock(). i.e: > > > > > > void shrinker_put(struct shrinker *shrinker) > > { > > if (refcount_dec_and_test(&shrinker->refcount)) > > complete(&shrinker->done); > > } > > > > void shrinker_free() > > { > > . > > refcount_dec(&shrinker->refcount); > > I guess what you mean is shrinker_put(), because here may be the last > refcount. Yes, I did. > > wait_for_completion(&shrinker->done); > > /* > > * lookups on the shrinker will now all fail as refcount has > > * fallen to zero. We can now remove it from the lists and > > * free it. 
> > */ > > down_write(&shrinker_rwsem); > > list_del_rcu(&shrinker->list); > > up_write(&shrinker_rwsem); > > call_rcu(&shrinker->rcu_head, shrinker_free_rcu_cb); > > } > > > > > > > > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered); > > > void shrinker_register(struct shrinker *shrinker) > > > { > > > - down_write(&shrinker_rwsem); > > > - list_add_tail(&shrinker->list, &shrinker_list); > > > - shrinker->flags |= SHRINKER_REGISTERED; > > > + refcount_set(&shrinker->refcount, 1); > > > + > > > + spin_lock(&shrinker_lock); > > > + list_add_tail_rcu(&shrinker->list, &shrinker_list); > > > + spin_unlock(&shrinker_lock); > > > + > > > shrinker_debugfs_add(shrinker); > > > - up_write(&shrinker_rwsem); > > > + WRITE_ONCE(shrinker->registered, true); > > > } > > > EXPORT_SYMBOL(shrinker_register); > > > > This just looks wrong - you are trying to use WRITE_ONCE() as a > > release barrier to indicate that the shrinker is now set up fully. > > That's not necessary - the refcount is an atomic and along with the > > rcu locks they should provides all the barriers we need. i.e. > > The reason I used WRITE_ONCE() here is because the shrinker->registered > will be read and written concurrently (read in shrinker_try_get() and > written in shrinker_free()), which is why I added shrinker::registered > field instead of using SHRINKER_REGISTERED flag (this can reduce the > addition of WRITE_ONCE()/READ_ONCE()). Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to use the field like this. You need release/acquire memory ordering here. i.e. smp_store_release()/smp_load_acquire(). As it is, the refcount_inc_not_zero() provides a control dependency, as documented in include/linux/refcount.h, refcount_dec_and_test() provides release memory ordering. The only thing I think we may need is a write barrier before refcount_set(), such that if refcount_inc_not_zero() sees a non-zero value, it is guaranteed to see an initialised structure... i.e. refcounts provide all the existence and initialisation guarantees. 
Hence I don't see the need to use shrinker->registered like this and it can remain a bit flag protected by the shrinker_rwsem(). > > void shrinker_register(struct shrinker *shrinker) > > { > > down_write(&shrinker_rwsem); > > list_add_tail_rcu(&g
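The publication ordering Dave describes (initialise the structure fully, then set the refcount with release semantics, so that any reader whose refcount_inc_not_zero() succeeds is guaranteed to see an initialised object) can be sketched in userspace C11 atomics. This is an illustrative model, not the kernel implementation; the names are made up for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of "init fields, then release-store the refcount". A reader
 * that wins obj_try_get() (acquire on success) must then observe the
 * initialised fields: acquire pairs with the release store. */
struct obj {
	int field;          /* must be visible once refcount != 0 */
	atomic_int refcount;
};

static void obj_publish(struct obj *o, int value)
{
	o->field = value;                       /* plain initialisation ... */
	atomic_store_explicit(&o->refcount, 1,  /* ... then release-store */
			      memory_order_release);
}

static bool obj_try_get(struct obj *o)
{
	/* refcount_inc_not_zero() analogue: succeed only while count > 0 */
	int old = atomic_load_explicit(&o->refcount, memory_order_relaxed);
	while (old != 0)
		if (atomic_compare_exchange_weak_explicit(&o->refcount,
							  &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	return false;
}
```

A plain WRITE_ONCE()/READ_ONCE() pair gives none of this ordering, which is exactly the objection in the mail above: it needs smp_store_release()/smp_load_acquire() semantics, and the refcount can supply them.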
Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in shrinkers subsystem, > which protects most operations such as slab shrink, registration and > unregistration of shrinkers, etc. This can easily cause problems in the > following cases. > > 1) When the memory pressure is high and there are many filesystems >mounted or unmounted at the same time, slab shrink will be affected >(down_read_trylock() failed). > >Such as the real workload mentioned by Kirill Tkhai: > >``` >One of the real workloads from my experience is start >of an overcommitted node containing many starting >containers after node crash (or many resuming containers >after reboot for kernel update). In these cases memory >pressure is huge, and the node goes round in long reclaim. >``` > > 2) If a shrinker is blocked (such as the case mentioned >in [1]) and a writer comes in (such as mount a fs), >then this writer will be blocked and cause all >subsequent shrinker-related operations to be blocked. > > Even if there is no competitor when shrinking slab, there may still be a > problem. The down_read_trylock() may become a perf hotspot with frequent > calls to shrink_slab(). Because of the poor multicore scalability of > atomic operations, this can lead to a significant drop in IPC > (instructions per cycle). > > We used to implement the lockless slab shrink with SRCU [2], but then > kernel test robot reported -88.8% regression in > stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4]. > > This commit uses the refcount+RCU method [5] proposed by Dave Chinner > to re-implement the lockless global slab shrink. The memcg slab shrink is > handled in the subsequent patch. > > For now, all shrinker instances are converted to dynamically allocated and > will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to > ensure that the shrinker instance is valid. > > And the shrinker instance will not be run again after unregistration. 
So > the structure that records the pointer of shrinker instance can be safely > freed without waiting for the RCU read-side critical section. > > In this way, while we implement the lockless slab shrink, we don't need to > be blocked in unregister_shrinker(). > > The following are the test results: > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & > > 1) Before applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s > (secs)(secs)(secs) (real time) (usr+sys > time) > ramfs735238 60.00 12.37363.70 12253.05 > 1955.08 > for a 60.01s run time: >1440.27s available CPU time > 12.36s user time ( 0.86%) > 363.70s system time ( 25.25%) > 376.06s total time ( 26.11%) > load average: 10.79 4.47 1.69 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.01s (1 min, 0.01 secs) > > 2) After applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s > (secs)(secs)(secs) (real time) (usr+sys > time) > ramfs746677 60.00 12.22367.75 12443.70 > 1965.13 > for a 60.01s run time: >1440.26s available CPU time > 12.21s user time ( 0.85%) > 367.75s system time ( 25.53%) > 379.96s total time ( 26.38%) > load average: 8.37 2.48 0.86 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.01s (1 min, 0.01 secs) > > We can see that the ops/s has hardly changed. > > [1]. > https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/ > [2]. > https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/ > [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/ > [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/ > [5]. 
https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/ > > Signed-off-by: Qi Zheng > --- > include/linux/shrinker.h | 19 +++--- > mm/shrinker.c| 75 ++-- > mm/shrinker_debug.c | 52 +--- > 3 files changed, 104 insertions(+), 42 deletions(-) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 36977a70bebb..335da93cccee 100644 > --- a/in
Re: [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote: > Currently, the shrinker instances can be divided into the following three > types: > > a) global shrinker instance statically defined in the kernel, such as >workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel modules, such >as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. For case b, > the memory of shrinker instance will be freed after synchronize_rcu() when > the module is unloaded. For case c, the memory of shrinker instance will > be freed along with the structure it is embedded in. > > In preparation for implementing lockless slab shrink, we need to > dynamically allocate those shrinker instances in case c, then the memory > can be dynamically freed alone by calling kfree_rcu(). > > So this commit adds the following new APIs for dynamically allocating > shrinker, and add a private_data field to struct shrinker to record and > get the original embedded structure. > > 1. shrinker_alloc() > > Used to allocate shrinker instance itself and related memory, it will > return a pointer to the shrinker instance on success and NULL on failure. > > 2. shrinker_free_non_registered() > > Used to destroy the non-registered shrinker instance. This is a bit nasty > > 3. shrinker_register() > > Used to register the shrinker instance, which is same as the current > register_shrinker_prepared(). > > 4. shrinker_unregister() rename this "shrinker_free()" and key the two different freeing cases on the SHRINKER_REGISTERED bit rather than mostly duplicating the two. 
void shrinker_free(struct shrinker *shrinker) { struct dentry *debugfs_entry = NULL; int debugfs_id; if (!shrinker) return; down_write(&shrinker_rwsem); if (shrinker->flags & SHRINKER_REGISTERED) { list_del(&shrinker->list); debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id); } else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) { kfree_const(shrinker->name); } if (shrinker->flags & SHRINKER_MEMCG_AWARE) unregister_memcg_shrinker(shrinker); up_write(&shrinker_rwsem); if (debugfs_entry) shrinker_debugfs_remove(debugfs_entry, debugfs_id); kfree(shrinker->nr_deferred); kfree(shrinker); } EXPORT_SYMBOL_GPL(shrinker_free); -- Dave Chinner da...@fromorbit.com
Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote: > On 2023/6/23 14:29, Dave Chinner wrote: > > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: > > > On 6/22/23 10:53, Qi Zheng wrote: > > Yes, I suggested the IDR route because radix tree lookups under RCU > > with reference counted objects are a known safe pattern that we can > > easily confirm is correct or not. Hence I suggested the unification > > + IDR route because it makes the life of reviewers so, so much > > easier... > > In fact, I originally planned to try the unification + IDR method you > suggested at the beginning. But in the case of CONFIG_MEMCG disabled, > the struct mem_cgroup is not even defined, and root_mem_cgroup and > shrinker_info will not be allocated. This required more code changes, so > I ended up keeping the shrinker_list and implementing the above pattern. Yes. Go back and read what I originally said needed to be done first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs to exist that holds all of the global shrinkers. Then shrink_slab() is only ever passed a memcg that should be iterated. Yes, it needs changes external to the shrinker code itself to be made to work. And even if memcg's are not enabled, we can still use the memcg structures to ensure a common abstraction is used for the shrinker tracking infrastructure > If the above pattern is not safe, I will go back to the unification + > IDR method. And that is exactly how we got into this mess in the first place -Dave -- Dave Chinner da...@fromorbit.com
Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: > On 6/22/23 10:53, Qi Zheng wrote: > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, > > int nid, > > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > - goto out; > > - > > - list_for_each_entry(shrinker, &shrinker_list, list) { > > + rcu_read_lock(); > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > > struct shrink_control sc = { > > .gfp_mask = gfp_mask, > > .nid = nid, > > .memcg = memcg, > > }; > > > > + if (!shrinker_try_get(shrinker)) > > + continue; > > + rcu_read_unlock(); > > I don't think you can do this unlock? > > > + > > ret = do_shrink_slab(&sc, shrinker, priority); > > if (ret == SHRINK_EMPTY) > > ret = 0; > > freed += ret; > > - /* > > -* Bail out if someone want to register a new shrinker to > > -* prevent the registration from being stalled for long periods > > -* by parallel ongoing shrinking. > > -*/ > > - if (rwsem_is_contended(&shrinker_rwsem)) { > > - freed = freed ? : 1; > > - break; > > - } > > - } > > > > - up_read(&shrinker_rwsem); > > -out: > > + rcu_read_lock(); > > That new rcu_read_lock() won't help AFAIK, the whole > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be > safe. Yeah, that's the pattern we've been taught and the one we can look at and immediately say "this is safe". This is a different pattern, as has been explained by Qi, and I think it *might* be safe. *However.* Right now I don't have time to go through a novel RCU list iteration pattern one step at a time to determine the correctness of the algorithm. I'm mostly worried about list manipulations that can occur outside rcu_read_lock() section bleeding into the RCU critical section because rcu_read_lock() by itself is not a memory barrier. Maybe Paul has seen this pattern often enough he could simply tell us what conditions it is safe in. 
But for me to work that out from first principles? I just don't have the time to do that right now. > IIUC this is why Dave in [4] suggests unifying shrink_slab() with > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. Yes, I suggested the IDR route because radix tree lookups under RCU with reference counted objects are a known safe pattern that we can easily confirm is correct or not. Hence I suggested the unification + IDR route because it makes the life of reviewers so, so much easier... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
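The "novel" iteration pattern being debated above (take a reference under the RCU section, drop the lock to do the sleeping work, re-enter the section before stepping to the next node, then drop the reference) can be sketched single-threaded in userspace C. RCU itself cannot be modelled in a few lines, so the lock/unlock points are shown as comments; the names are illustrative, and this sketch deliberately shows only the walk shape, not a proof of its safety, which is exactly what is in question in the thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-threaded model of the walk shape under discussion. */
struct node {
	struct node *next;
	atomic_int refcount;
};

static bool node_try_get(struct node *n)
{
	/* refcount_inc_not_zero() analogue: skip nodes mid-teardown */
	int old = atomic_load(&n->refcount);
	while (old != 0)
		if (atomic_compare_exchange_weak(&n->refcount, &old, old + 1))
			return true;
	return false;
}

static void node_put(struct node *n)
{
	atomic_fetch_sub(&n->refcount, 1);  /* last put would complete() */
}

static int walk(struct node *head)
{
	int visited = 0;

	/* rcu_read_lock(); */
	for (struct node *n = head; n; n = n->next) {
		if (!node_try_get(n))
			continue;           /* node is being torn down */
		/* rcu_read_unlock();  -- claimed safe: we hold a ref on n */
		visited++;                  /* do_shrink_slab() would run here */
		/* rcu_read_lock();    -- re-enter before touching n->next */
		node_put(n);
	}
	/* rcu_read_unlock(); */
	return visited;
}
```

The contested point in the thread is precisely whether re-taking the lock before node_put() is sufficient to keep n->next valid for the step to the next node.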
Re: [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker
On Thu, Jun 22, 2023 at 04:53:08PM +0800, Qi Zheng wrote: > Introduce some helpers for dynamically allocating shrinker instance, > and their uses are as follows: > > 1. shrinker_alloc_and_init() > > Used to allocate and initialize a shrinker instance, the priv_data > parameter is used to pass the pointer of the previously embedded > structure of the shrinker instance. > > 2. shrinker_free() > > Used to free the shrinker instance when the registration of shrinker > fails. > > 3. unregister_and_free_shrinker() > > Used to unregister and free the shrinker instance, and the kfree() > will be changed to kfree_rcu() later. > > Signed-off-by: Qi Zheng > --- > include/linux/shrinker.h | 12 > mm/vmscan.c | 35 +++ > 2 files changed, 47 insertions(+) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 43e6fcabbf51..8e9ba6fa3fcc 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker > *shrinker); > extern void free_prealloced_shrinker(struct shrinker *shrinker); > extern void synchronize_shrinkers(void); > > +typedef unsigned long (*count_objects_cb)(struct shrinker *s, > + struct shrink_control *sc); > +typedef unsigned long (*scan_objects_cb)(struct shrinker *s, > + struct shrink_control *sc); > + > +struct shrinker *shrinker_alloc_and_init(count_objects_cb count, > + scan_objects_cb scan, long batch, > + int seeks, unsigned flags, > + void *priv_data); > +void shrinker_free(struct shrinker *shrinker); > +void unregister_and_free_shrinker(struct shrinker *shrinker); Hmmm. Not exactly how I envisioned this to be done. Ok, this will definitely work, but I don't think it is an improvement. It's certainly not what I was thinking of when I suggested dynamically allocating shrinkers. 
The main issue is that this doesn't simplify the API - it expands it and creates a minefield of old and new functions that have to be used in exactly the right order for the right things to happen. What I was thinking of was moving the entire shrinker setup code over to the prealloc/register_prepared() algorithm, where the setup is already separated from the activation of the shrinker. That is, we start by renaming prealloc_shrinker() to shrinker_alloc(), adding a flags field to tell it everything that it needs to alloc (i.e. the NUMA/MEMCG_AWARE flags) and having it returned a fully allocated shrinker ready to register. Initially this also contains an internal flag to say the shrinker was allocated so that unregister_shrinker() knows to free it. The caller then fills out the shrinker functions, seeks, etc. just like the do now, and then calls register_shrinker_prepared() to make the shrinker active when it wants to turn it on. When it is time to tear down the shrinker, no API needs to change. unregister_shrinker() does all the shutdown and frees all the internal memory like it does now. If the shrinker is also marked as allocated, it frees the shrinker via RCU, too. Once everything is converted to this API, we then remove register_shrinker(), rename register_shrinker_prepared() to shrinker_register(), rename unregister_shrinker to shrinker_unregister(), get rid of the internal "allocated" flag and always free the shrinker. At the end of the patchset, every shrinker should be set up in a manner like this: sb->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE|SHRINKER_NUMA_AWARE, "sb-%s", type->name); if (!sb->shrinker) return -ENOMEM; sb->shrinker->count_objects = super_cache_count; sb->shrinker->scan_objects = super_cache_scan; sb->shrinker->batch = 1024; sb->shrinker->private = sb; . shrinker_register(sb->shrinker); And teardown is just a call to shrinker_unregister(sb->shrinker) as it is now. i.e. 
the entire shrinker registration API is now just three functions, down from the current four, and much simpler than the seven functions this patch set results in... The other advantage of this is that it will break all the existing out of tree code and third party modules using the old API and will no longer work with a kernel using lockless slab shrinkers. They need to break (both at the source and binary levels) to stop bad things from happening due to using unconverted shrinkers in the new setup. -Dave. -- Dave Chinner da...@fromorbit.com
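The three-call API shape Dave sketches (alloc with a printf-style name, fill in the callbacks, register, and a single free call for teardown) can be modelled as a toy userspace implementation. This is a sketch to show the API surface only: the struct layout, locking, RCU freeing, and debugfs handling of the real kernel code are all omitted, and the names mirror Dave's proposal rather than any shipped interface.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the proposed three-function shrinker API surface. */
struct shrinker {
	unsigned flags;
	long batch;
	void *private_data;
	unsigned long (*count_objects)(struct shrinker *, void *);
	unsigned long (*scan_objects)(struct shrinker *, void *);
	char name[64];
	int registered;
};

/* shrinker_alloc(): allocate and name; caller fills in callbacks after. */
struct shrinker *shrinker_alloc(unsigned flags, const char *fmt, ...)
{
	struct shrinker *s = calloc(1, sizeof(*s));
	va_list ap;

	if (!s)
		return NULL;
	s->flags = flags;
	va_start(ap, fmt);
	vsnprintf(s->name, sizeof(s->name), fmt, ap);
	va_end(ap);
	return s;
}

/* shrinker_register(): make the fully set-up shrinker active. */
void shrinker_register(struct shrinker *s)
{
	s->registered = 1;
}

/* shrinker_free(): single teardown call for both registered and
 * never-registered shrinkers (the real version would unlink + RCU-free). */
void shrinker_free(struct shrinker *s)
{
	free(s);
}
```

Usage follows Dave's sb example: shrinker_alloc(flags, "sb-%s", type->name), assign count_objects/scan_objects/batch/private_data, then shrinker_register(); teardown is the one shrinker_free() call.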
Re: Build regressions/improvements in v5.18-rc1
On Mon, Apr 04, 2022 at 10:16:08AM +0200, Geert Uytterhoeven wrote: > On Mon, 4 Apr 2022, Geert Uytterhoeven wrote: > > Below is the list of build error/warning regressions/improvements in > > v5.18-rc1[1] compared to v5.17[2]. > > > > Summarized: > > - build errors: +36/-15 > > - build warnings: +5/-38 > > > > Happy fixing! ;-)

Well

> > + /kisskb/src/fs/xfs/xfs_buf.h: error: initializer element is not
> > constant: => 46:23

Looking at: http://kisskb.ellerman.id.au/kisskb/buildresult/14714961/

The build error is:

/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
  TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
  ^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags'
     __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
     ^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
    { XBF_UNMAPPED, "UNMAPPED" }
    ^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS'
     __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
     ^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
 #define XBF_UNMAPPED (1 << 31)/* do not map the buffer */

This doesn't make a whole lotta sense to me. It's blown up in a tracepoint macro in XFS that was not changed at all in 5.18-rc1, nor was any of the surrounding XFS code or contexts. Perhaps something outside XFS changed to cause this on these platforms?

Can you bisect this, please?

Cheers, Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote: > On 2021-09-01 6:03 p.m., Dave Chinner wrote: > > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: > > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig: > > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: > > > > > > > driver code is not really involved in updating the CPU mappings. > > > > > > > Maybe > > > > > > > it's something we need to do in the migration helpers. > > > > > > It looks like I'm totally misunderstanding what you are adding here > > > > > > then. Why do we need any special treatment at all for memory that > > > > > > has normal struct pages and is part of the direct kernel map? > > > > > The pages are like normal memory for purposes of mapping them in CPU > > > > > page tables and for coherent access from the CPU. > > > > That's the user page tables. What about the kernel direct map? > > > > If there is a normal kernel struct page backing there really should > > > > be no need for the pgmap. > > > I'm not sure. The physical address ranges are in the UEFI system address > > > map as special-purpose memory. Does Linux create the struct pages and > > > kernel direct map for that without a pgmap call? I didn't see that last > > > time I went digging through that code. > > > > > > > > > > > From an application > > > > > perspective, we want file-backed and anonymous mappings to be able to > > > > > use DEVICE_PUBLIC pages with coherent CPU access. The goal is to > > > > > optimize performance for GPU heavy workloads while minimizing the need > > > > > to migrate data back-and-forth between system memory and device > > > > > memory. > > > > I don't really understand that part. file backed pages are always > > > > allocated by the file system using the pagecache helpers, that is > > > > using the page allocator. Anonymouns memory also always comes from > > > > the page allocator. > > > I'm coming at this from my experience with DEVICE_PRIVATE. 
Both > > > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE > > > memory by the migrate_vma_* helpers for more efficient access by our > > > GPU. (*) It's part of the basic premise of HMM as I understand it. I > > > would expect the same thing to work for DEVICE_PUBLIC memory. > > > > > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't > > > currently work, but that's something I'm hoping to fix at some point. > > FWIW, I'd love to see the architecture documents that define how > > filesystems are supposed to interact with this device private > > memory. This whole "hand filesystem controlled memory to other > > devices" is a minefield that is trivial to get wrong and very > > difficult to fix - just look at the historical mess that RDMA > > to/from file backed and/or DAX pages has been. > > > > So, really, from my perspective as a filesystem engineer, I want to > > see an actual specification for how this new memory type is going to > > interact with filesystem and the page cache so everyone has some > > idea of how this is going to work and can point out how it doesn't > > work before code that simply doesn't work is pushed out into > > production systems and then merged > > OK. To be clear, that's not part of this patch series. And I have no > authority to push anything in this part of the kernel, so you have nothing > to fear. ;)

I know this isn't part of the series, but this patchset is laying the foundation for future work that will impact subsystems that currently have zero visibility and/or knowledge of these changes. There must be an overall architectural plan for this functionality, regardless of the current state of implementation. It's that overall architectural plan I'm asking about here, because I need to understand that before I can sanely comment on the page cache/filesystem aspect of the proposed functionality...
> FWIW, we already have the ability to map file-backed system memory pages > into device page tables with HMM and interval notifiers. But we cannot > currently migrate them to ZONE_DEVICE pages. Sure, but sharing page cache pages allocated and managed by the filesystem is not what you are talking about. You're talking about migrating page cache data to completely different memory allocated by a different memory manager that the filesystems currently have no knowledge of or have any
Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote: > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig: > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote: > >>>> driver code is not really involved in updating the CPU mappings. Maybe > >>>> it's something we need to do in the migration helpers. > >>> It looks like I'm totally misunderstanding what you are adding here > >>> then. Why do we need any special treatment at all for memory that > >>> has normal struct pages and is part of the direct kernel map? > >> The pages are like normal memory for purposes of mapping them in CPU > >> page tables and for coherent access from the CPU. > > That's the user page tables. What about the kernel direct map? > > If there is a normal kernel struct page backing there really should > > be no need for the pgmap. > > I'm not sure. The physical address ranges are in the UEFI system address > map as special-purpose memory. Does Linux create the struct pages and > kernel direct map for that without a pgmap call? I didn't see that last > time I went digging through that code. > > > > > >> From an application > >> perspective, we want file-backed and anonymous mappings to be able to > >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to > >> optimize performance for GPU heavy workloads while minimizing the need > >> to migrate data back-and-forth between system memory and device memory. > > I don't really understand that part. file backed pages are always > > allocated by the file system using the pagecache helpers, that is > > using the page allocator. Anonymouns memory also always comes from > > the page allocator. > > I'm coming at this from my experience with DEVICE_PRIVATE. Both > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE > memory by the migrate_vma_* helpers for more efficient access by our > GPU. (*) It's part of the basic premise of HMM as I understand it. 
I > would expect the same thing to work for DEVICE_PUBLIC memory. > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't > currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how filesystems are supposed to interact with this device private memory. This whole "hand filesystem controlled memory to other devices" is a minefield that is trivial to get wrong and very difficult to fix - just look at the historical mess that RDMA to/from file backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to see an actual specification for how this new memory type is going to interact with filesystem and the page cache so everyone has some idea of how this is going to work and can point out how it doesn't work before code that simply doesn't work is pushed out into production systems and then merged...

Cheers, Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 1/8] ext4/xfs: add page refcount helper
On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote: > From: Ralph Campbell > > There are several places where ZONE_DEVICE struct pages assume a reference > count == 1 means the page is idle and free. Instead of open coding this, > add a helper function to hide this detail. > > v2: > [AS]: rename dax_layout_is_idle_page func to dax_page_unused Did you even compile test this? > Signed-off-by: Ralph Campbell > Signed-off-by: Alex Sierra > --- > fs/dax.c| 4 ++-- > fs/ext4/inode.c | 5 + > fs/xfs/xfs_file.c | 4 +--- > include/linux/dax.h | 10 ++ > 4 files changed, 14 insertions(+), 9 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 26d5dcd2d69e..321f4ddc6643 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct > address_space *mapping, > for_each_mapped_pfn(entry, pfn) { > struct page *page = pfn_to_page(pfn); > > - WARN_ON_ONCE(trunc && page_ref_count(page) > 1); > + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page)); Because you still use dax_layout_is_idle_page() here, not dax_page_unused()... > WARN_ON_ONCE(page->mapping && page->mapping != mapping); > page->mapping = NULL; > page->index = 0; > @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry) > for_each_mapped_pfn(entry, pfn) { > struct page *page = pfn_to_page(pfn); > > - if (page_ref_count(page) > 1) > + if (!dax_layout_is_idle_page(page)) Here too. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: Track mmu notifiers in fs_reclaim_acquire/release
[ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0

OK, this patch has royally screwed something up if this path thinks it can enter memory reclaim. This path is inside a transaction, so it is running under PF_MEMALLOC_NOFS context, so should *never* enter memory reclaim. I'd suggest that whatever mods were made to fs_reclaim_acquire by this patch broke its basic functionality...

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 13cc653122b7..7536f0fd 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -57,6 +57,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, > > unsigned int order, int alloc_fla > > static struct lockdep_map __fs_reclaim_map = > > STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map); > > > > -static bool __need_fs_reclaim(gfp_t gfp_mask) > > +static bool __need_reclaim(gfp_t gfp_mask) > > { > > gfp_mask = current_gfp_context(gfp_mask);

This applies the per-task memory allocation context flags to the mask that is checked here.

> > @@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) > > if (current->flags & PF_MEMALLOC) > > return false; > > > > - /* We're only interested __GFP_FS allocations for now */ > > - if (!(gfp_mask & __GFP_FS)) > > - return false; > > - > > if (gfp_mask & __GFP_NOLOCKDEP) > > return false; > > > > @@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void) > > > > void fs_reclaim_acquire(gfp_t gfp_mask) > > { > > - if (__need_fs_reclaim(gfp_mask)) > > - __fs_reclaim_acquire(); > > + if (__need_reclaim(gfp_mask)) { > > + if (gfp_mask & __GFP_FS) > > + __fs_reclaim_acquire();

and they have not been applied in this path. There's your breakage.
For future reference, please post anything that changes NOFS allocation contexts or behaviours to linux-fsdevel, as filesystem developers need to know about proposed changes to infrastructure that is critical to the correct functioning of filesystems... Cheers, Dave. -- Dave Chinner da...@fromorbit.com ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [trivial PATCH] treewide: Align function definition open/close braces
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote: > Some functions definitions have either the initial open brace and/or > the closing brace outside of column 1. > > Move those braces to column 1. > > This allows various function analyzers like gnu complexity to work > properly for these modified functions. > > Miscellanea: > > o Remove extra trailing ; and blank line from xfs_agf_verify > > Signed-off-by: Joe Perches <j...@perches.com> > --- .... XFS bits look fine. Acked-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com
Re: [PATCH] drm/i915: Fix up usage of SHRINK_STOP
On Wed, Sep 25, 2013 at 02:00:02PM +0200, Daniel Vetter wrote:

In commit 81e49f811404f428a9d9a63295a0c267e802fa12
Author: Glauber Costa glom...@openvz.org
Date: Wed Aug 28 10:18:13 2013 +1000

    i915: bail out earlier when shrinker cannot acquire mutex

SHRINK_STOP was added to tell the core shrinker code to bail out and go to the next shrinker since the i915 shrinker couldn't acquire required locks. But the SHRINK_STOP return code was added to the ->count_objects callback and not the ->scan_objects callback as it should have been, resulting in tons of dmesg noise like

shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative objects to delete nr=-x

Fix discussed with Dave Chinner.

Acked-by: Dave Chinner dchin...@redhat.com

Cheers, Dave.

-- 
Dave Chinner
da...@fromorbit.com
Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit
On Wed, Sep 18, 2013 at 12:38:23PM +0200, Knut Petersen wrote: On 18.09.2013 11:10, Daniel Vetter wrote: Just now I prepared a patch changing the same function in vmscan.c Also, this needs to be rebased to the new shrinker api in 3.12, I simply haven't rolled my trees forward yet. Well, you should. Since commit 81e49f shrinker->count_objects might be set to SHRINK_STOP, causing shrink_slab_node() to complain loud and often:

[ 1908.234595] shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative objects to delete nr=-x

The kernel emitted a few thousand log lines like the one quoted above during the last few days on my system.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..d81f6e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -254,6 +254,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		total_scan = max_pass;
 	}
+	/* Always try to shrink a bit to make forward progress. */
+	if (shrinker->evicts_to_page_lru)
+		total_scan = max_t(long, total_scan, batch_size);
+

At that place the error message is already emitted.

 	/*
 	 * We need to avoid excessive windup on filesystem shrinkers
 	 * due to large numbers of GFP_NOFS allocations causing the

Have a look at the attached patch. It fixes my problem with the erroneous/misleading error messages, and I think it's right to just bail out early if SHRINK_STOP is found. Do you agree ?

No, that's wrong. ->count_objects should never pass SHRINK_STOP. Indeed, it should always return a count of objects in the cache, regardless of the context. SHRINK_STOP is for ->scan_objects to tell the shrinker it can't make any progress due to the context it is called in. This allows the shrinker to defer the work to another call in a different context. However, if ->count_objects doesn't return a count, the work that was supposed to be done cannot be deferred, and that is why ->count_objects should always return the number of objects in the cache.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit
[my keyboard my be on the fritz - it's not typing what I'm thinking...] On Thu, Sep 19, 2013 at 06:38:22AM +1000, Dave Chinner wrote: On Wed, Sep 18, 2013 at 12:38:23PM +0200, Knut Petersen wrote: On 18.09.2013 11:10, Daniel Vetter wrote: Just now I prepared a patch changing the same function in vmscan.c Also, this needs to be rebased to the new shrinker api in 3.12, I simply haven't rolled my trees forward yet. Well, you should. Since commit 81e49f shrinker-count_objects might be set to SHRINK_STOP, causing shrink_slab_node() to complain loud and often: [ 1908.234595] shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative objects to delete nr=-x The kernel emitted a few thousand log lines like the one quoted above during the last few days on my system. diff --git a/mm/vmscan.c b/mm/vmscan.c index 2cff0d4..d81f6e0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -254,6 +254,10 @@ unsigned long shrink_slab(struct shrink_control *shrink, total_scan = max_pass; } + /* Always try to shrink a bit to make forward progress. */ + if (shrinker-evicts_to_page_lru) + total_scan = max_t(long, total_scan, batch_size); + At that place the error message is already emitted. /* * We need to avoid excessive windup on filesystem shrinkers * due to large numbers of GFP_NOFS allocations causing the Have a look at the attached patch. It fixes my problem with the erroneous/misleading error messages, and I think it´s right to just bail out early if SHRINK_STOP is found. Do you agree ? No, that's wrong. -count_objects should never ass SHRINK_STOP. *pass Indeed, it should always return a count of objects in the cache, regardless of the context. SHRINK_STOP is for -scan_objects to tell the shrinker it can make *can't any progress due to the context it is called in. This allows the shirnker to defer the work to another call in a different context. 
However, if -count-objects doesn't return a count, the work that was supposed to be done cannot be deferred, and that is what *why -count_objects should always return the number of objects in the cache. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit
On Thu, Sep 19, 2013 at 08:57:04AM +0200, Daniel Vetter wrote: On Wed, Sep 18, 2013 at 10:38 PM, Dave Chinner da...@fromorbit.com wrote: No, that's wrong. ->count_objects should never pass SHRINK_STOP. Indeed, it should always return a count of objects in the cache, regardless of the context. SHRINK_STOP is for ->scan_objects to tell the shrinker it can't make any progress due to the context it is called in. This allows the shrinker to defer the work to another call in a different context. However, if ->count_objects doesn't return a count, the work that was supposed to be done cannot be deferred, and that is why ->count_objects should always return the number of objects in the cache.

So we should rework the locking in the drm/i915 shrinker to be able to always count objects? Thus far no one screamed yet that we're not really able to do that in all call contexts ...

It's not the end of the world if you count no objects. In an ideal world, you keep a count of the object sizes on the LRU when you add/remove the objects on the list; that way .count_objects doesn't need to walk or lock anything, which is what things like the inode and dentry caches do...

So should I revert 81e49f or will the early return 0; completely upset the core shrinker logic?

It looks to me like 81e49f changed the wrong function to return SHRINK_STOP. It should have changed i915_gem_inactive_scan() to return SHRINK_STOP when the locks could not be taken, not i915_gem_inactive_count(). What should happen is this:

	max_pass = count_objects()
	if (max_pass == 0)
		/* skip to next shrinker */

	/* calculate total_scan from max_pass and previous leftovers */
	while (total_scan) {
		freed = scan_objects(batch_size)
		if (freed == SHRINK_STOP)
			break;	/* can't make progress */
		total_scan -= batch_size;
	}
	/* save remaining total_scan for next pass */

i.e. SHRINK_STOP will abort the scan loop when nothing can be done.
Right now, if nothing can be done because the locks can't be taken, the scan loop will continue running until total_scan reaches zero. i.e. it does a whole lotta nothing. So right now, I'd revert 81e49f and then convert i915_gem_inactive_scan() to return SHRINK_STOP if it can't get locks, and everything should work just fine... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
[MMTests] IO metadata on XFS
On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote: > On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote: > > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: > > > Adding dri-devel and a few others because an i915 patch contributed to > > > the regression. > > > > > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > > > > times per write(2) call, IIRC), so with limited numbers of > > > > > > threads/limited CPU power it will result in lower performance. Where > > > > > > you have lots of CPU power, there will be little difference in > > > > > > performance... > > > > > > > > > > When I checked it it could only be called twice, and we'd already > > > > > optimize away the second call. I'd defintively like to track down > > > > > where > > > > > the performance changes happend, at least to a major version but even > > > > > better to a -rc or git commit. > > > > > > > > > > > > > By all means feel free to run the test yourself and run the bisection :) > > > > > > > > It's rare but on this occasion the test machine is idle so I started an > > > > automated git bisection. As you know the milage with an automated bisect > > > > varies so it may or may not find the right commit. Test machine is > > > > sandy so > > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > > > > is the report of interest. The script is doing a full search between > > > > v3.3 and > > > > v3.4 for a point where average files/sec for fsmark-single drops below > > > > 25000. > > > > I did not limit the search to fs/xfs on the off-chance that it is an > > > > apparently unrelated patch that caused the problem. > > > > > > > > > > It was obvious very quickly that there were two distinct regression so I > > > ran two bisections. 
One led to an XFS and the other led to an i915 patch > > > that enables RC6 to reduce power usage. > > > > > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] > > > > Doesn't seem to be the major cause of the regression. By itself, it > > has impact, but the majority comes from the XFS change... > > > > The fact it has an impact at all is weird but lets see what the DRI > folks think about it. > > > > [c999a223: xfs: introduce an allocation workqueue] > > > > Which indicates that there is workqueue scheduling issues, I think. > > The same amount of work is being done, but half of it is being > > pushed off into a workqueue to avoid stack overflow issues (*). I > > tested the above patch in anger on an 8p machine, similar to the > > machine you saw no regressions on, but the workload didn't drive it > > to being completely CPU bound (only about 90%) so the allocation > > work was probably always scheduled quickly. > > What test were you using?

fsmark, dbench, compilebench, and a few fio workloads. Also, xfstests times each test and I keep track of overall runtime, and none of those showed any performance differential, either... Indeed, running on a current 3.5-rc5 tree, my usual fsmark benchmarks are running at the same numbers I've been seeing since about 3.0 - somewhere around 18k files/s for a single thread, and 110-115k files/s for 8 threads. I just ran your variant, and I'm getting about 20k files/s for a single thread, which is about right because you're using smaller directories than I am (22500 files per dir vs 100k in my tests).

> > How many worker threads have been spawned on these machines > > that are showing the regression? > > 20 or 21 generally. An example list as spotted by top looks like

Pretty normal.

> > What is the context switch rate on the machines when the test is running? > Vanilla average context switch rate 4278.53 > Revert average context switch rate 1095

That seems about right, too.
> > Can you run latencytop to see > > if there is excessive starvation/wait times for allocation > > completion? > > I'm not sure what format you are looking for.

Where the context switches are coming from, and how long they are being stalled for. Just to get the context switch locations, you can use perf on the sched:sched_switch event, but that doesn't give you stall times.
Re: [MMTests] IO metadata on XFS
On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote: On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote: On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: Adding dri-devel and a few others because an i915 patch contributed to the regression. On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: It increases the CPU overhead (dirty_inode can be called up to 4 times per write(2) call, IIRC), so with limited numbers of threads/limited CPU power it will result in lower performance. Where you have lots of CPU power, there will be little difference in performance... When I checked it it could only be called twice, and we'd already optimize away the second call. I'd defintively like to track down where the performance changes happend, at least to a major version but even better to a -rc or git commit. By all means feel free to run the test yourself and run the bisection :) It's rare but on this occasion the test machine is idle so I started an automated git bisection. As you know the milage with an automated bisect varies so it may or may not find the right commit. Test machine is sandy so http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html is the report of interest. The script is doing a full search between v3.3 and v3.4 for a point where average files/sec for fsmark-single drops below 25000. I did not limit the search to fs/xfs on the off-chance that it is an apparently unrelated patch that caused the problem. It was obvious very quickly that there were two distinct regression so I ran two bisections. One led to a XFS and the other led to an i915 patch that enables RC6 to reduce power usage. [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] Doesn't seem to be the major cause of the regression. By itself, it has impact, but the majority comes from the XFS change... 
The fact it has an impact at all is weird but let's see what the DRI folks think about it. [c999a223: xfs: introduce an allocation workqueue] Which indicates that there are workqueue scheduling issues, I think. The same amount of work is being done, but half of it is being pushed off into a workqueue to avoid stack overflow issues (*). I tested the above patch in anger on an 8p machine, similar to the machine you saw no regressions on, but the workload didn't drive it to being completely CPU bound (only about 90%) so the allocation work was probably always scheduled quickly. What test were you using? fsmark, dbench, compilebench, and a few fio workloads. Also, xfstests times each test and I keep track of overall runtime, and none of those showed any performance differential, either... Indeed, running on a current 3.5-rc5 tree, my usual fsmark benchmarks are running at the same numbers I've been seeing since about 3.0 - somewhere around 18k files/s for a single thread, and 110-115k files/s for 8 threads. I just ran your variant, and I'm getting about 20k files/s for a single thread, which is about right because you're using smaller directories than I am (22500 files per dir vs 100k in my tests). How many worker threads have been spawned on these machines that are showing the regression? 20 or 21 generally. An example list as spotted by top looks like Pretty normal. What is the context switch rate on the machines when the test is running? Vanilla average context switch rate 4278.53 Revert average context switch rate 1095 That seems about right, too. Can you run latencytop to see if there is excessive starvation/wait times for allocation completion? I'm not sure what format you are looking for. Where the context switches are coming from, and how long they are being stalled for. Just to get the context switch locations, you can use perf on the sched:sched_switch event, but that doesn't give you stall times.
Local testing tells me that about 40% of the switches are from xfs_alloc_vextent, 55% are from the work threads, and the rest are CPU idling events, which is exactly as I'd expect. A perf top profile comparison might be informative, too... I'm not sure if this is what you really wanted. I thought an oprofile or perf report would have made more sense but I recorded perf top over time anyway and it's at the end of the mail. perf report and oprofile give you CPU usage across the run; it's not instantaneous, and that's where all the interesting information is. e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or it might be 100% for 1s - that's the behaviour run profiles cannot give you insight into... As it is, the output you posted is nothing unusual. For just these XFS tests I've uploaded a tarball of the logs
[MMTests] IO metadata on XFS
On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote: > Adding dri-devel and a few others because an i915 patch contributed to > the regression. > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote: > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote: > > > > It increases the CPU overhead (dirty_inode can be called up to 4 > > > > times per write(2) call, IIRC), so with limited numbers of > > > > threads/limited CPU power it will result in lower performance. Where > > > > you have lots of CPU power, there will be little difference in > > > > performance... > > > > > > When I checked it it could only be called twice, and we'd already > > > optimize away the second call. I'd defintively like to track down where > > > the performance changes happend, at least to a major version but even > > > better to a -rc or git commit. > > > > > > > By all means feel free to run the test yourself and run the bisection :) > > > > It's rare but on this occasion the test machine is idle so I started an > > automated git bisection. As you know the milage with an automated bisect > > varies so it may or may not find the right commit. Test machine is sandy so > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html > > is the report of interest. The script is doing a full search between v3.3 > > and > > v3.4 for a point where average files/sec for fsmark-single drops below > > 25000. > > I did not limit the search to fs/xfs on the off-chance that it is an > > apparently unrelated patch that caused the problem. > > > > It was obvious very quickly that there were two distinct regression so I > ran two bisections. One led to a XFS and the other led to an i915 patch > that enables RC6 to reduce power usage. > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default] Doesn't seem to be the major cause of the regression. By itself, it has impact, but the majority comes from the XFS change... 
> [c999a223: xfs: introduce an allocation workqueue]

Which indicates that there are workqueue scheduling issues, I think. The
same amount of work is being done, but half of it is being pushed off
into a workqueue to avoid stack overflow issues (*).

I tested the above patch in anger on an 8p machine, similar to the
machine you saw no regressions on, but the workload didn't drive it to
being completely CPU bound (only about 90%) so the allocation work was
probably always scheduled quickly.

How many worker threads have been spawned on these machines that are
showing the regression? What is the context switch rate on the machines
when the test is running? Can you run latencytop to see if there is
excessive starvation/wait time for allocation completion? A perf top
profile comparison might be informative, too...

(*) The stack usage below submit_bio() can be more than 5k (DM, MD,
SCSI, driver, memory allocation), so it's really not safe to do
allocation anywhere below about 3k of kernel stack being used. e.g. on
a relatively trivial storage setup without the above commit:

[142296.384921] flush-253:4 used greatest stack depth: 360 bytes left

Fundamentally, 8k stacks on x86-64 are too small for our increasingly
complex storage layers and the 100+ function deep call chains that
occur.

Cheers,

Dave.
--
Dave Chinner
david at fromorbit.com
[PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects
On Thu, Jul 14, 2011 at 11:48:08AM +0900, KOSAKI Motohiro wrote:
> > > If you are thinking the shrinker protocol is too complicated, doc update
> > > patch is really welcome.
> >
> > Slab shrinkers have a nasty, crappy interface and the shrink_slab()
> > code is full of bugs. Rather than telling people to "update the
> > documentation" because it's too complex, how about we fix the
> > interface and the bugs?
>
> If you can take your time for that, I don't hesitate to help you. ;)

It's on my list of things to do if nobody else steps up before I get
there ;)

Cheers,

Dave.
--
Dave Chinner
david at fromorbit.com
Re: [PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects
On Wed, Jul 13, 2011 at 05:19:22PM +0900, KOSAKI Motohiro wrote:
> (2011/07/13 16:41), Chris Wilson wrote:
> >> (snip)
> >> 	while (total_scan >= SHRINK_BATCH) {
> >> 		long this_scan = SHRINK_BATCH;
> >> 		int shrink_ret;
> >> 		int nr_before;
> >>
> >> 		nr_before = do_shrinker_shrink(shrinker, shrink, 0);
> >> 		shrink_ret = do_shrinker_shrink(shrinker, shrink, this_scan);
> >> 		if (shrink_ret == -1)
> >> 			break;
> >
> > And fifteen lines above that you have:
> > 	unsigned long max_pass = do_shrinker_shrink(shrinker, shrinker, 0);
> > 	...
> > 	shrinker->nr += f(max_pass);
> > 	if (shrinker->nr < 0) printk(KERN_ERR "...");
> >
> > That's the *error* I hit when I originally returned -1.
>
> You misunderstand the code. The third argument is critically important.
> Only if it's 0 (ie sc->nr_to_scan == 0), shrinker must not return negative.

And once again the shitty shrinker API bites a user.

> Thus, my patch checked nr_to_scan argument. and I've suggested looking at
> shrink_icache_memory().

Which is going away real soon - it's not the model of perfection that
you make it out to be. ;)

> If you are thinking the shrinker protocol is too complicated, doc update
> patch is really welcome.

Slab shrinkers have a nasty, crappy interface and the shrink_slab()
code is full of bugs. Rather than telling people to "update the
documentation" because it's too complex, how about we fix the interface
and the bugs?

Indeed, how hard is it to require a subsystem to supply two shrinker
methods, one to return the count of reclaimable objects, the other to
scan the reclaimable objects to reclaim them? After all, that's exactly
the interface I'm exposing to filesystems underneath the shrinker API
in the per-sb shrinker patchset that gets rid of shrink_icache_memory()
rather than propagating the insanity...

Cheers,

Dave.
--
Dave Chinner
david at fromorbit.com
[PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects
shrinker is not able to completely trash the cache like the previous
page-cache based version did. It's a very specific solution to the
problem of tuning a shrinker for good system behaviour, but it's the
only way I found that works.

Oh, and using a mutex to single thread cache reclaim rather than
spinlocks is usually a good idea from a scalability point of view,
because your shrinker can be called simultaneously on every CPU.
Spinlocks really, really hurt when that happens, and performance will
plummet when it happens because you burn CPU on locks rather than
reclaiming objects. Single threaded object reclaim is still the fastest
way to do reclaim if you have global lists and locks.

What I'm trying to say is that how you solve the shrinker balance
problem for your subsystem will be specific to how you need to hold
pages under memory pressure to maintain performance. Sorry I can't give
you a better answer than that, but that's what my experience with
caches and tuning shrinker behaviour indicates.

Cheers,

Dave.
--
Dave Chinner
david at fromorbit.com