Re: [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless

2023-08-07 Thread Dave Chinner
> + if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
> + clear_bit(offset, unit->map);
> + rcu_read_unlock();
>   continue;
>   }
> + rcu_read_unlock();
>  
>   /* Call non-slab shrinkers even though kmem is disabled 
> */
>   if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   set_shrinker_bit(memcg, nid, 
> shrinker_id);
>   }
>   freed += ret;
> -
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - goto unlock;
> - }
> + shrinker_put(shrinker);

Ok, so why is this safe to call without holding the rcu read lock?
The global shrinker has to hold the rcu_read_lock() whilst calling
shrinker_put() to guarantee the validity of the list next pointer,
but we don't hold off RCU here so what guarantees a racing global
shrinker walk doesn't trip over this shrinker_put() call dropping
the refcount to zero and freeing occurring in a different context...


> + /*
> +  * We have already exited the read-side of rcu critical section
> +  * before calling do_shrink_slab(), the shrinker_info may be
> +  * released in expand_one_shrinker_info(), so reacquire the
> +  * shrinker_info.
> +  */
> + index++;
> + goto again;

With that, what makes the use of shrinker_info in
xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.
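
For reference, a minimal sketch of what such documentation could look
like, based on how these fields are used later in this series (wording
is illustrative, not from the patch):

	/*
	 * @refcount: pins the shrinker while a lockless walker is calling
	 *            into it; lookups only succeed while it is non-zero.
	 * @done:     completed when @refcount drops to zero, so that
	 *            shrinker_free() can wait for all in-flight invocations
	 *            to drain before tearing the shrinker down.
	 * @rcu:      defers the final free until concurrent RCU list walkers
	 *            are guaranteed to have moved past this entry.
	 */
	refcount_t refcount;
	struct completion done;
	struct rcu_head rcu;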

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This needs to be moved to the head of the function, and document
the whole list walk, get, put and completion parts of the algorithm
that make it safe. There's more to this than "we hold a reference
count", especially the tricky "we might see the shrinker before it
is fully initialised" case


.
>  void shrinker_free(struct shrinker *shrinker)
>  {
>   struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>   if (!shrinker)
>   return;
>  
> + if (shrinker->flags & SHRINKER_REGISTERED) {
> + shrinker_put(shrinker);
> + wait_for_completion(&shrinker->done);
> + }

Needs a comment explaining why we need to wait here...
> +
>   down_write(&shrinker_rwsem);
>   if (shrinker->flags & SHRINKER_REGISTERED) {
> - list_del(&shrinker->list);
> + /*
> +  * Lookups on the shrinker are over and will fail in the future,
> +  * so we can now remove it from the lists and free it.
> +  */

 rather than here after the wait has been done and provided the
guarantee that no shrinker is running or will run again...
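
That is, roughly (the comment wording here is illustrative, not
prescriptive):

	if (shrinker->flags & SHRINKER_REGISTERED) {
		shrinker_put(shrinker);		/* drop the registration ref */
		/*
		 * Wait for any lockless lookup that raced with the final
		 * put above to drop its reference.  Once this returns the
		 * refcount is zero, no new lookup can succeed and nothing
		 * is still running this shrinker, so it is now safe to
		 * unlink it from the list and free it.
		 */
		wait_for_completion(&shrinker->done);
	}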

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}

2023-08-07 Thread Dave Chinner
  }
>   }
>   }
>   up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>  {
>   struct shrinker_info *info;
>   unsigned long ret, freed = 0;
> - int i;
> + int offset, index = 0;
>  
>   if (!mem_cgroup_online(memcg))
>   return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   if (unlikely(!info))
>   goto unlock;
>  
> - for_each_set_bit(i, info->map, info->map_nr_max) {
> - struct shrink_control sc = {
> - .gfp_mask = gfp_mask,
> - .nid = nid,
> - .memcg = memcg,
> - };
> - struct shrinker *shrinker;
> + for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
> + struct shrinker_info_unit *unit;

This adds another layer of indent to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm
changes in this patch be much more obvious - this huge hunk of
diff is pretty much impossible to review...
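
As a rough sketch of the sort of factoring being asked for (the helper
name and its details are assumed here, not taken from the series), the
per-unit work could be pulled out so the top-level loop stays at a
single indent level:

	static unsigned long shrink_slab_memcg_unit(struct shrinker_info_unit *unit,
						    int index, gfp_t gfp_mask, int nid,
						    struct mem_cgroup *memcg, int priority)
	{
		unsigned long freed = 0;
		int offset;

		for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
			int shrinker_id = index * SHRINKER_UNIT_BITS + offset;

			/* look up shrinker_id, run do_shrink_slab(), account freed */
		}
		return freed;
	}

leaving shrink_slab_memcg() itself to iterate the units and sum the
returned counts.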

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.

> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.

> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c| 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This comment really should be at the head of the function,
describing the algorithm used within the function itself. i.e. how
reference counts are used w.r.t. the rcu_read_lock() usage to
guarantee existence of the shrinker and the validity of the list
walk.

I'm not going to remember all these little details when I look at
this code in another 6 months time, and having to work it out from
first principles every time I look at the code will waste of a lot
of time...
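
A sketch of the kind of head-of-function comment being asked for
(wording is illustrative only, not a proposed patch):

	/*
	 * Lockless shrinker walk:
	 *
	 * The list is walked under rcu_read_lock(), so a shrinker cannot be
	 * freed out from under us while we are looking at it.  Before
	 * calling into a shrinker we take a reference via shrinker_try_get();
	 * a zero refcount means the shrinker is being torn down (or is not
	 * yet fully initialised) and must be skipped.  Holding that
	 * reference is what allows the RCU lock to be dropped across
	 * do_shrink_slab().
	 *
	 * Before dropping the reference with shrinker_put() we must re-enter
	 * the RCU read-side section: the put may be the final one, after
	 * which the shrinker can be unlinked and freed, yet we still need
	 * its ->list.next pointer to continue the walk.  Being inside
	 * rcu_read_lock() at that point guarantees that both this entry and
	 * the next one stay valid until we have advanced past them.
	 */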

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker

2023-07-27 Thread Dave Chinner
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>>   goto err;
> >>>   }
> >>>   +    zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +    zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +    zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +    zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +    shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this 
> >> code
> >> all over the place as the new API call would be very similar to the current
> >> shrinker_register() call with static allocation.
> > 
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
> 
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6
or 7 parameters with a return value that must be checked and
handled.
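
For illustration, a combined helper along those lines would need a
signature something like this (the exact parameter list is assumed, not
something anyone has actually proposed verbatim):

	struct shrinker *
	shrinker_alloc_and_register(unsigned int flags,
			unsigned long (*count_objects)(struct shrinker *,
						       struct shrink_control *),
			unsigned long (*scan_objects)(struct shrinker *,
						      struct shrink_control *),
			long batch, int seeks, void *private_data,
			const char *fmt, ...);

and every caller still has to check the return value and unwind on
failure, which is where the "harder to read" concern comes from.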

This was done in the first versions of the patch set - the amount of
code in each caller does not go down and, IMO, was much harder to
read and determine "this is obviously correct" than what we have
now.

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having
to go look up a helper function every time I want to check some
random shrinker has been set up correctly

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
> On 2023/7/26 16:08, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker 
> > > *shrinker);
> > >   void shrinker_register(struct shrinker *shrinker);
> > >   void shrinker_unregister(struct shrinker *shrinker);
> > > +static inline bool shrinker_try_get(struct shrinker *shrinker)
> > > +{
> > > + return READ_ONCE(shrinker->registered) &&
> > > +refcount_inc_not_zero(&shrinker->refcount);
> > > +}
> > 
> > Why do we care about shrinker->registered here? If we don't set
> > the refcount to 1 until we have fully initialised everything, then
> > the shrinker code can key entirely off the reference count and
> > none of the lookup code needs to care about whether the shrinker is
> > registered or not.
> 
> The purpose of checking shrinker->registered here is to stop running
> shrinker after calling shrinker_free(), which can prevent the following
> situations from happening:
> 
> CPU 0                                 CPU 1
>
> shrinker_try_get()
>
>                                       shrinker_try_get()
>
> shrinker_put()
> shrinker_try_get()
>                                       shrinker_put()

I don't see any race here? What is wrong with having multiple active
users at once?

> > 
> > This should use a completion, then it is always safe under
> > rcu_read_lock().  This also gets rid of the shrinker_lock spin lock,
> > which only exists because we can't take a blocking lock under
> > rcu_read_lock(). i.e:
> > 
> > 
> > void shrinker_put(struct shrinker *shrinker)
> > {
> > if (refcount_dec_and_test(&shrinker->refcount))
> > complete(&shrinker->done);
> > }
> > 
> > void shrinker_free()
> > {
> > .
> > refcount_dec(&shrinker->refcount);
> 
> I guess what you mean is shrinker_put(), because here may be the last
> refcount.

Yes, I did.

> > wait_for_completion(&shrinker->done);
> > /*
> >  * lookups on the shrinker will now all fail as refcount has
> >  * fallen to zero. We can now remove it from the lists and
> >  * free it.
> >  */
> > down_write(&shrinker_rwsem);
> > list_del_rcu(&shrinker->list);
> > up_write(&shrinker_rwsem);
> > call_rcu(&shrinker->rcu_head, shrinker_free_rcu_cb);
> > }
> > 
> > 
> > 
> > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
> > >   void shrinker_register(struct shrinker *shrinker)
> > >   {
> > > - down_write(&shrinker_rwsem);
> > > - list_add_tail(&shrinker->list, &shrinker_list);
> > > - shrinker->flags |= SHRINKER_REGISTERED;
> > > + refcount_set(&shrinker->refcount, 1);
> > > +
> > > + spin_lock(&shrinker_lock);
> > > + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > > + spin_unlock(&shrinker_lock);
> > > +
> > >   shrinker_debugfs_add(shrinker);
> > > - up_write(&shrinker_rwsem);
> > > + WRITE_ONCE(shrinker->registered, true);
> > >   }
> > >   EXPORT_SYMBOL(shrinker_register);
> > 
> > This just looks wrong - you are trying to use WRITE_ONCE() as a
> > release barrier to indicate that the shrinker is now set up fully.
> > That's not necessary - the refcount is an atomic and along with the
> > rcu locks they should provide all the barriers we need. i.e.
> 
> The reason I used WRITE_ONCE() here is because the shrinker->registered
> will be read and written concurrently (read in shrinker_try_get() and
> written in shrinker_free()), which is why I added shrinker::registered
> field instead of using SHRINKER_REGISTERED flag (this can reduce the
> addition of WRITE_ONCE()/READ_ONCE()).

Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
use the field like this. You need release/acquire memory ordering
here. i.e. smp_store_release()/smp_load_acquire().

As it is, the refcount_inc_not_zero() provides a control dependency,
as documented in include/linux/refcount.h, refcount_dec_and_test()
provides release memory ordering. The only thing I think we may need
is a write barrier before refcount_set(), such that if
refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
see an initialised structure...

i.e. refcounts provide all the existence and initialisation
guarantees. Hence I don't see the need to use shrinker->registered
like this and it can remain a bit flag protected by the
shrinker_rwsem().

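In other words, the registration side could publish the shrinker along
these lines (a sketch of the possible extra barrier being discussed
above, not code from the patch):

	void shrinker_register(struct shrinker *shrinker)
	{
		/* ... list linkage, debugfs registration, etc. ... */

		/*
		 * If a write barrier does turn out to be needed: ensure the
		 * fully initialised structure is visible before a non-zero
		 * refcount can be observed by shrinker_try_get().
		 */
		smp_wmb();
		refcount_set(&shrinker->refcount, 1);
	}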

> > void shrinker_register(struct shrinker *shrinker)
> > {
> > down_write(&shrinker_rwsem);
> > list_add_tail_rcu(&g

Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
> 
> 1) When the memory pressure is high and there are many filesystems
>mounted or unmounted at the same time, slab shrink will be affected
>(down_read_trylock() failed).
> 
>Such as the real workload mentioned by Kirill Tkhai:
> 
>```
>One of the real workloads from my experience is start
>of an overcommitted node containing many starting
>containers after node crash (or many resuming containers
>after reboot for kernel update). In these cases memory
>pressure is huge, and the node goes round in long reclaim.
>```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>in [1]) and a writer comes in (such as mount a fs),
>then this writer will be blocked and cause all
>subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there may still be a
> problem. The down_read_trylock() may become a perf hotspot with frequent
> calls to shrink_slab(). Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop in IPC
> (instructions per cycle).
> 
> We used to implement the lockless slab shrink with SRCU [2], but then
> kernel test robot reported -88.8% regression in
> stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
> 
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
> 
> For now, all shrinker instances are converted to dynamically allocated and
> will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to
> ensure that the shrinker instance is valid.
> 
> And the shrinker instance will not be run again after unregistration. So
> the structure that records the pointer of shrinker instance can be safely
> freed without waiting for the RCU read-side critical section.
> 
> In this way, while we implement the lockless slab shrink, we don't need to
> be blocked in unregister_shrinker().
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s  bogo ops/s
>                       (secs)     (secs)    (secs)  (real time)  (usr+sys time)
> ramfs        735238     60.00     12.37    363.70     12253.05     1955.08
> for a 60.01s run time:
>1440.27s available CPU time
>  12.36s user time   (  0.86%)
> 363.70s system time ( 25.25%)
> 376.06s total time  ( 26.11%)
> load average: 10.79 4.47 1.69
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> 2) After applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s  bogo ops/s
>                       (secs)     (secs)    (secs)  (real time)  (usr+sys time)
> ramfs        746677     60.00     12.22    367.75     12443.70     1965.13
> for a 60.01s run time:
>1440.26s available CPU time
>  12.21s user time   (  0.85%)
> 367.75s system time ( 25.53%)
> 379.96s total time  ( 26.38%)
> load average: 8.37 2.48 0.86
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. 
> https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/
> [2]. 
> https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [5]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng 
> ---
>  include/linux/shrinker.h | 19 +++---
>  mm/shrinker.c| 75 ++--
>  mm/shrinker_debug.c  | 52 +---
>  3 files changed, 104 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 36977a70bebb..335da93cccee 100644
> --- a/in

Re: [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-26 Thread Dave Chinner
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
> 
> a) global shrinker instance statically defined in the kernel, such as
>workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel modules, such
>as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
> 
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, then the memory
> can be dynamically freed alone by calling kfree_rcu().
> 
> So this commit adds the following new APIs for dynamically allocating
> shrinker, and add a private_data field to struct shrinker to record and
> get the original embedded structure.
> 
> 1. shrinker_alloc()
> 
> Used to allocate shrinker instance itself and related memory, it will
> return a pointer to the shrinker instance on success and NULL on failure.
> 
> 2. shrinker_free_non_registered()
> 
> Used to destroy the non-registered shrinker instance.

This is a bit nasty

> 
> 3. shrinker_register()
> 
> Used to register the shrinker instance, which is same as the current
> register_shrinker_prepared().
> 
> 4. shrinker_unregister()

rename this "shrinker_free()" and key the two different freeing
cases on the SHRINKER_REGISTERED bit rather than mostly duplicating
the two.

void shrinker_free(struct shrinker *shrinker)
{
struct dentry *debugfs_entry = NULL;
int debugfs_id;

if (!shrinker)
return;

down_write(&shrinker_rwsem);
if (shrinker->flags & SHRINKER_REGISTERED) {
list_del(&shrinker->list);
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
kfree_const(shrinker->name);
}

if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);

if (debugfs_entry)
shrinker_debugfs_remove(debugfs_entry, debugfs_id);

kfree(shrinker->nr_deferred);
kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-23 Thread Dave Chinner
On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
> On 2023/6/23 14:29, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > > On 6/22/23 10:53, Qi Zheng wrote:
> > Yes, I suggested the IDR route because radix tree lookups under RCU
> > with reference counted objects are a known safe pattern that we can
> > easily confirm is correct or not.  Hence I suggested the unification
> > + IDR route because it makes the life of reviewers so, so much
> > easier...
> 
> In fact, I originally planned to try the unification + IDR method you
> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
> the struct mem_cgroup is not even defined, and root_mem_cgroup and
> shrinker_info will not be allocated.  This required more code changes, so
> I ended up keeping the shrinker_list and implementing the above pattern.

Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure

> If the above pattern is not safe, I will go back to the unification +
> IDR method.

And that is exactly how we got into this mess in the first place

-Dave
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless

2023-06-23 Thread Dave Chinner
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
> > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, 
> > int nid,
> > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >  
> > -   if (!down_read_trylock(&shrinker_rwsem))
> > -   goto out;
> > -
> > -   list_for_each_entry(shrinker, &shrinker_list, list) {
> > +   rcu_read_lock();
> > +   list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > struct shrink_control sc = {
> > .gfp_mask = gfp_mask,
> > .nid = nid,
> > .memcg = memcg,
> > };
> >  
> > +   if (!shrinker_try_get(shrinker))
> > +   continue;
> > +   rcu_read_unlock();
> 
> I don't think you can do this unlock?
> 
> > +
> > ret = do_shrink_slab(&sc, shrinker, priority);
> > if (ret == SHRINK_EMPTY)
> > ret = 0;
> > freed += ret;
> > -   /*
> > -* Bail out if someone want to register a new shrinker to
> > -* prevent the registration from being stalled for long periods
> > -* by parallel ongoing shrinking.
> > -*/
> > -   if (rwsem_is_contended(&shrinker_rwsem)) {
> > -   freed = freed ? : 1;
> > -   break;
> > -   }
> > -   }
> >  
> > -   up_read(&shrinker_rwsem);
> > -out:
> > +   rcu_read_lock();
> 
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside rcu_read_lock() section bleeding into the RCU
critical section because rcu_read_lock() by itself is not a memory
barrier.

Maybe Paul has seen this pattern often enough he could simply tell
us what conditions it is safe in. But for me to work that out from
first principles? I just don't have the time to do that right now.

> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...
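
For reference, the known-safe lookup pattern being referred to is
roughly this (a sketch only - the idr name and refcount field are
assumptions):

	rcu_read_lock();
	shrinker = idr_find(&shrinker_idr, shrinker_id);
	if (shrinker && !refcount_inc_not_zero(&shrinker->refcount))
		shrinker = NULL;
	rcu_read_unlock();

	if (shrinker) {
		do_shrink_slab(&sc, shrinker, priority);
		shrinker_put(shrinker);
	}

The lookup structure itself pins nothing; the reference taken inside
the RCU read-side section is what keeps the object alive once the lock
is dropped.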

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker

2023-06-23 Thread Dave Chinner
On Thu, Jun 22, 2023 at 04:53:08PM +0800, Qi Zheng wrote:
> Introduce some helpers for dynamically allocating shrinker instance,
> and their uses are as follows:
> 
> 1. shrinker_alloc_and_init()
> 
> Used to allocate and initialize a shrinker instance, the priv_data
> parameter is used to pass the pointer of the previously embedded
> structure of the shrinker instance.
> 
> 2. shrinker_free()
> 
> Used to free the shrinker instance when the registration of shrinker
> fails.
> 
> 3. unregister_and_free_shrinker()
> 
> Used to unregister and free the shrinker instance, and the kfree()
> will be changed to kfree_rcu() later.
> 
> Signed-off-by: Qi Zheng 
> ---
>  include/linux/shrinker.h | 12 
>  mm/vmscan.c  | 35 +++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 43e6fcabbf51..8e9ba6fa3fcc 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker 
> *shrinker);
>  extern void free_prealloced_shrinker(struct shrinker *shrinker);
>  extern void synchronize_shrinkers(void);
>  
> +typedef unsigned long (*count_objects_cb)(struct shrinker *s,
> +   struct shrink_control *sc);
> +typedef unsigned long (*scan_objects_cb)(struct shrinker *s,
> +  struct shrink_control *sc);
> +
> +struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
> +  scan_objects_cb scan, long batch,
> +  int seeks, unsigned flags,
> +  void *priv_data);
> +void shrinker_free(struct shrinker *shrinker);
> +void unregister_and_free_shrinker(struct shrinker *shrinker);

H. Not exactly how I envisioned this to be done.

Ok, this will definitely work, but I don't think it is an
improvement. It's certainly not what I was thinking of when I
suggested dynamically allocating shrinkers.

The main issue is that this doesn't simplify the API - it expands it
and creates a minefield of old and new functions that have to be
used in exactly the right order for the right things to happen.

What I was thinking of was moving the entire shrinker setup code
over to the prealloc/register_prepared() algorithm, where the setup
is already separated from the activation of the shrinker.

That is, we start by renaming prealloc_shrinker() to
shrinker_alloc(), adding a flags field to tell it everything that it
needs to alloc (i.e. the NUMA/MEMCG_AWARE flags) and having it
returned a fully allocated shrinker ready to register. Initially
this also contains an internal flag to say the shrinker was
allocated so that unregister_shrinker() knows to free it.

The caller then fills out the shrinker functions, seeks, etc. just
like the do now, and then calls register_shrinker_prepared() to make
the shrinker active when it wants to turn it on.

When it is time to tear down the shrinker, no API needs to change.
unregister_shrinker() does all the shutdown and frees all the
internal memory like it does now. If the shrinker is also marked as
allocated, it frees the shrinker via RCU, too.

Once everything is converted to this API, we then remove
register_shrinker(), rename register_shrinker_prepared() to
shrinker_register(), rename unregister_shrinker to
shrinker_unregister(), get rid of the internal "allocated" flag
and always free the shrinker.

At the end of the patchset, every shrinker should be set
up in a manner like this:


sb->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE|SHRINKER_NUMA_AWARE,
"sb-%s", type->name);
if (!sb->shrinker)
return -ENOMEM;

sb->shrinker->count_objects = super_cache_count;
sb->shrinker->scan_objects = super_cache_scan;
sb->shrinker->batch = 1024;
sb->shrinker->private = sb;

.

shrinker_register(sb->shrinker);

And teardown is just a call to shrinker_unregister(sb->shrinker)
as it is now.

i.e. the entire shrinker regsitration API is now just three
functions, down from the current four, and much simpler than the
the seven functions this patch set results in...

The other advantage of this is that it will break all the existing
out of tree code and third party modules using the old API and will
no longer work with a kernel using lockless slab shrinkers. They
need to break (both at the source and binary levels) to stop bad
things from happening due to using uncoverted shrinkers in the new
setup.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Build regressions/improvements in v5.18-rc1

2022-04-04 Thread Dave Chinner
On Mon, Apr 04, 2022 at 10:16:08AM +0200, Geert Uytterhoeven wrote:
> On Mon, 4 Apr 2022, Geert Uytterhoeven wrote:
> > Below is the list of build error/warning regressions/improvements in
> > v5.18-rc1[1] compared to v5.17[2].
> > 
> > Summarized:
> >  - build errors: +36/-15
> >  - build warnings: +5/-38
> > 
> > Happy fixing! ;-)

Well

> >  + /kisskb/src/fs/xfs/xfs_buf.h: error: initializer element is not 
> > constant:  => 46:23

Looking at:

http://kisskb.ellerman.id.au/kisskb/buildresult/14714961/

The build error is:

/kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk'
  TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d "
  ^
/kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro 
'__print_flags'
 __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
 ^
/kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED'
  { XBF_UNMAPPED,  "UNMAPPED" }
^
/kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 
'XFS_BUF_FLAGS'
 __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
^
/kisskb/src/fs/xfs/./xfs_trace.h: In function 
'trace_raw_output_xfs_buf_flags_class':
/kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant
 #define XBF_UNMAPPED  (1 << 31)/* do not map the buffer */

This doesn't make a whole lotta sense to me. It's blown up in a
tracepoint macro in XFS that was not changed at all in 5.18-rc1, nor
was any of the surrounding XFS code or contexts.  Perhaps something
outside XFS changed to cause this on these platforms?

Can you bisect this, please?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> > > > > > > driver code is not really involved in updating the CPU mappings. 
> > > > > > > Maybe
> > > > > > > it's something we need to do in the migration helpers.
> > > > > > It looks like I'm totally misunderstanding what you are adding here
> > > > > > then.  Why do we need any special treatment at all for memory that
> > > > > > has normal struct pages and is part of the direct kernel map?
> > > > > The pages are like normal memory for purposes of mapping them in CPU
> > > > > page tables and for coherent access from the CPU.
> > > > That's the user page tables.  What about the kernel direct map?
> > > > If there is a normal kernel struct page backing there really should
> > > > be no need for the pgmap.
> > > I'm not sure. The physical address ranges are in the UEFI system address
> > > map as special-purpose memory. Does Linux create the struct pages and
> > > kernel direct map for that without a pgmap call? I didn't see that last
> > > time I went digging through that code.
> > > 
> > > 
> > > > >  From an application
> > > > > perspective, we want file-backed and anonymous mappings to be able to
> > > > > use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> > > > > optimize performance for GPU heavy workloads while minimizing the need
> > > > > to migrate data back-and-forth between system memory and device 
> > > > > memory.
> > > > I don't really understand that part.  file backed pages are always
> > > > allocated by the file system using the pagecache helpers, that is
> > > > using the page allocator.  Anonymouns memory also always comes from
> > > > the page allocator.
> > > I'm coming at this from my experience with DEVICE_PRIVATE. Both
> > > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> > > memory by the migrate_vma_* helpers for more efficient access by our
> > > GPU. (*) It's part of the basic premise of HMM as I understand it. I
> > > would expect the same thing to work for DEVICE_PUBLIC memory.
> > > 
> > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> > > currently work, but that's something I'm hoping to fix at some point.
> > FWIW, I'd love to see the architecture documents that define how
> > filesystems are supposed to interact with this device private
> > memory. This whole "hand filesystem controlled memory to other
> > devices" is a minefield that is trivial to get wrong iand very
> > difficult to fix - just look at the historical mess that RDMA
> > to/from file backed and/or DAX pages has been.
> > 
> > So, really, from my perspective as a filesystem engineer, I want to
> > see an actual specification for how this new memory type is going to
> > interact with filesystem and the page cache so everyone has some
> > idea of how this is going to work and can point out how it doesn't
> > work before code that simply doesn't work is pushed out into
> > production systems and then merged
> 
> OK. To be clear, that's not part of this patch series. And I have no
> authority to push anything in this part of the kernel, so you have nothing
> to fear. ;)

I know this isn't part of the series. but this patchset is laying
the foundation for future work that will impact subsystems that
currently have zero visibility and/or knowledge of these changes.
There must be an overall architectural plan for this functionality,
regardless of the current state of implementation. It's that overall
architectural plan I'm asking about here, because I need to
understand that before I can sanely comment on the page
cache/filesystem aspect of the proposed functionality...

> FWIW, we already have the ability to map file-backed system memory pages
> into device page tables with HMM and interval notifiers. But we cannot
> currently migrate them to ZONE_DEVICE pages.

Sure, but sharing page cache pages allocated and managed by the
filesystem is not what you are talking about. You're talking about
migrating page cache data to completely different memory allocated
by a different memory manager that the filesystems currently have no
knowledge of or have any 

Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

2021-09-01 Thread Dave Chinner
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> 
> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >>>> driver code is not really involved in updating the CPU mappings. Maybe
> >>>> it's something we need to do in the migration helpers.
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
> 
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.
> 
> 
> >
> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymouns memory also always comes from
> > the page allocator.
> 
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.
> 
> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how
filesystems are supposed to interact with this device private
memory. This whole "hand filesystem controlled memory to other
devices" is a minefield that is trivial to get wrong iand very
difficult to fix - just look at the historical mess that RDMA
to/from file backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to
see an actual specification for how this new memory type is going to
interact with filesystem and the page cache so everyone has some
idea of how this is going to work and can point out how it doesn't
work before code that simply doesn't work is pushed out into
production systems and then merged

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 1/8] ext4/xfs: add page refcount helper

2021-06-18 Thread Dave Chinner
On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote:
> From: Ralph Campbell 
> 
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> v2:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused

Did you even compile test this?

> Signed-off-by: Ralph Campbell 
> Signed-off-by: Alex Sierra 
> ---
>  fs/dax.c|  4 ++--
>  fs/ext4/inode.c |  5 +
>  fs/xfs/xfs_file.c   |  4 +---
>  include/linux/dax.h | 10 ++
>  4 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 26d5dcd2d69e..321f4ddc6643 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));

Because you still use dax_layout_is_idle_page() here, not
dax_page_unused()...

>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>   page->mapping = NULL;
>   page->index = 0;
> @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (page_ref_count(page) > 1)
> + if (!dax_layout_is_idle_page(page))

Here too.
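
For context, the renamed helper the changelog describes would presumably
be something like the sketch below; its definition is not visible in the
quoted hunks, which still call the old name:

	/* a ZONE_DEVICE page is idle/free once only the initial ref remains */
	static inline bool dax_page_unused(struct page *page)
	{
		return page_ref_count(page) == 1;
	}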

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] mm: Track mmu notifiers in fs_reclaim_acquire/release

2020-06-24 Thread Dave Chinner
 190.961530][  T369]xfs_setattr_nonsize+0x1a6/0xcd0

OK, this patch has royally screwed something up if this path thinks
it can enter memory reclaim. This path is inside a transaction, so
it is running under PF_MEMALLOC_NOFS context, so should *never*
enter memory reclaim.

I'd suggest that whatever mods were made to fs_reclaim_acquire by
this patch broke its basic functionality

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 13cc653122b7..7536f0fd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -57,6 +57,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, 
> > unsigned int order, int alloc_fla
> >  static struct lockdep_map __fs_reclaim_map =
> > STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> >  
> > -static bool __need_fs_reclaim(gfp_t gfp_mask)
> > +static bool __need_reclaim(gfp_t gfp_mask)
> >  {
> > gfp_mask = current_gfp_context(gfp_mask);

This is applies the per-task memory allocation context flags to the
mask that is checked here.

> > @@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
> > if (current->flags & PF_MEMALLOC)
> > return false;
> >  
> > -   /* We're only interested __GFP_FS allocations for now */
> > -   if (!(gfp_mask & __GFP_FS))
> > -   return false;
> > -
> > if (gfp_mask & __GFP_NOLOCKDEP)
> > return false;
> >  
> > @@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
> >  
> >  void fs_reclaim_acquire(gfp_t gfp_mask)
> >  {
> > -   if (__need_fs_reclaim(gfp_mask))
> > -   __fs_reclaim_acquire();
> > +   if (__need_reclaim(gfp_mask)) {
> > +   if (gfp_mask & __GFP_FS)
> > +   __fs_reclaim_acquire();

 and they have not been applied in this path. There's your
breakage.

For future reference, please post anything that changes NOFS
allocation contexts or behaviours to linux-fsdevel, as filesystem
developers need to know about proposed changes to infrastructure
that is critical to the correct functioning of filesystems...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-19 Thread Dave Chinner
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote:
> Some functions definitions have either the initial open brace and/or
> the closing brace outside of column 1.
> 
> Move those braces to column 1.
> 
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
> 
> Miscellanea:
> 
> o Remove extra trailing ; and blank line from xfs_agf_verify
> 
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
....

XFS bits look fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] drm/i915: Fix up usage of SHRINK_STOP

2013-09-26 Thread Dave Chinner
On Wed, Sep 25, 2013 at 02:00:02PM +0200, Daniel Vetter wrote:
 In
 
 commit 81e49f811404f428a9d9a63295a0c267e802fa12
 Author: Glauber Costa glom...@openvz.org
 Date:   Wed Aug 28 10:18:13 2013 +1000
 
 i915: bail out earlier when shrinker cannot acquire mutex
 
 SHRINK_STOP was added to tell the core shrinker code to bail out and
 go to the next shrinker since the i915 shrinker couldn't acquire
 required locks. But the SHRINK_STOP return code was added to the
 ->count_objects callback and not the ->scan_objects callback as it
 should have been, resulting in tons of dmesg noise like
 
 shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative objects to delete 
 nr=-x
 
 Fix discussed with Dave Chinner.

Acked-by: Dave Chinner dchin...@redhat.com

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit

2013-09-19 Thread Dave Chinner
On Wed, Sep 18, 2013 at 12:38:23PM +0200, Knut Petersen wrote:
 On 18.09.2013 11:10, Daniel Vetter wrote:
 
 Just now I prepared a patch changing the same function in vmscan.c
 Also, this needs to be rebased to the new shrinker api in 3.12, I
 simply haven't rolled my trees forward yet.
 
 Well, you should. Since commit 81e49f  shrinker->count_objects might be
 set to SHRINK_STOP, causing shrink_slab_node() to complain loud and often:
 
 [ 1908.234595] shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative objects 
 to delete nr=-x
 
 The kernel emitted a few thousand log lines like the one quoted above during 
 the
 last few days on my system.
 
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 2cff0d4..d81f6e0 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -254,6 +254,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
  total_scan = max_pass;
  }
 +/* Always try to shrink a bit to make forward progress. */
 +if (shrinker->evicts_to_page_lru)
 +total_scan = max_t(long, total_scan, batch_size);
 +
 At that place the error message is already emitted.
  /*
   * We need to avoid excessive windup on filesystem shrinkers
   * due to large numbers of GFP_NOFS allocations causing the
 
 Have a look at the attached patch. It fixes my problem with the 
 erroneous/misleading
 error messages, and I think it´s right to just bail out early if SHRINK_STOP 
 is found.
 
 Do you agree ?

No, that's wrong. ->count_objects should never ass SHRINK_STOP.
Indeed, it should always return a count of objects in the cache,
regardless of the context.

SHRINK_STOP is for ->scan_objects to tell the shrinker it can make
any progress due to the context it is called in. This allows the
shrinker to defer the work to another call in a different context.
However, if ->count_objects doesn't return a count, the work that
was supposed to be done cannot be deferred, and that is what
->count_objects should always return the number of objects in the
cache.
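
A minimal sketch of that split (the names and the example_evict()
helper are made up for illustration; this is not i915 code):

	static atomic_long_t example_nr_objects;
	static DEFINE_MUTEX(example_lock);

	static unsigned long example_count(struct shrinker *s,
					   struct shrink_control *sc)
	{
		/* always report the cache size - never SHRINK_STOP */
		return atomic_long_read(&example_nr_objects);
	}

	static unsigned long example_scan(struct shrinker *s,
					  struct shrink_control *sc)
	{
		unsigned long freed;

		if (!mutex_trylock(&example_lock))
			return SHRINK_STOP;	/* wrong context - defer the work */
		freed = example_evict(sc->nr_to_scan);
		mutex_unlock(&example_lock);
		return freed;
	}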

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit

2013-09-19 Thread Dave Chinner
[my keyboard my be on the fritz - it's not typing what I'm thinking...]

On Thu, Sep 19, 2013 at 06:38:22AM +1000, Dave Chinner wrote:
 On Wed, Sep 18, 2013 at 12:38:23PM +0200, Knut Petersen wrote:
  On 18.09.2013 11:10, Daniel Vetter wrote:
  
  Just now I prepared a patch changing the same function in vmscan.c
  Also, this needs to be rebased to the new shrinker api in 3.12, I
  simply haven't rolled my trees forward yet.
  
  Well, you should. Since commit 81e49f  shrinker->count_objects might be
  set to SHRINK_STOP, causing shrink_slab_node() to complain loud and often:
  
  [ 1908.234595] shrink_slab: i915_gem_inactive_scan+0x0/0x9c negative 
  objects to delete nr=-x
  
  The kernel emitted a few thousand log lines like the one quoted above 
  during the
  last few days on my system.
  
  diff --git a/mm/vmscan.c b/mm/vmscan.c
  index 2cff0d4..d81f6e0 100644
  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ -254,6 +254,10 @@ unsigned long shrink_slab(struct shrink_control 
  *shrink,
 total_scan = max_pass;
 }
  +  /* Always try to shrink a bit to make forward progress. */
  +  if (shrinker->evicts_to_page_lru)
  +  total_scan = max_t(long, total_scan, batch_size);
  +
  At that place the error message is already emitted.
 /*
  * We need to avoid excessive windup on filesystem shrinkers
  * due to large numbers of GFP_NOFS allocations causing the
  
  Have a look at the attached patch. It fixes my problem with the 
  erroneous/misleading
  error messages, and I think it´s right to just bail out early if 
  SHRINK_STOP is found.
  
  Do you agree ?
 
 No, that's wrong. ->count_objects should never ass SHRINK_STOP.

*pass

 Indeed, it should always return a count of objects in the cache,
 regardless of the context. 
 
 SHRINK_STOP is for ->scan_objects to tell the shrinker it can make

*can't

 any progress due to the context it is called in. This allows the
 shrinker to defer the work to another call in a different context.
 However, if ->count_objects doesn't return a count, the work that
 was supposed to be done cannot be deferred, and that is what

 *why

 ->count_objects should always return the number of objects in the
 cache.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Intel-gfx] [PATCH] [RFC] mm/shrinker: Add a shrinker flag to always shrink a bit

2013-09-19 Thread Dave Chinner
On Thu, Sep 19, 2013 at 08:57:04AM +0200, Daniel Vetter wrote:
 On Wed, Sep 18, 2013 at 10:38 PM, Dave Chinner da...@fromorbit.com wrote:
  No, that's wrong. ->count_objects should never ass SHRINK_STOP.
  Indeed, it should always return a count of objects in the cache,
  regardless of the context.
 
  SHRINK_STOP is for ->scan_objects to tell the shrinker it can make
  any progress due to the context it is called in. This allows the
  shrinker to defer the work to another call in a different context.
  However, if ->count_objects doesn't return a count, the work that
  was supposed to be done cannot be deferred, and that is what
  ->count_objects should always return the number of objects in the
  cache.
 
 So we should rework the locking in the drm/i915 shrinker to be able to
 always count objects? Thus far no one screamed yet that we're not
 really able to do that in all call contexts ...

It's not the end of the world if you count no objects. In an ideal
world, you keep a count of the object sizes on the LRU when you
add/remove the objects on the list, that way .count_objects doesn't
need to walk or lock anything, which is what things like the inode
and dentry caches do...
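
i.e. something along these lines (a sketch with made-up names, not the
actual i915 code):

	/* updated under whatever lock already protects LRU add/remove */
	static unsigned long example_lru_pages;

	/* on add:    example_lru_pages += obj_pages(obj); */
	/* on remove: example_lru_pages -= obj_pages(obj); */

	static unsigned long example_count(struct shrinker *s,
					   struct shrink_control *sc)
	{
		/* no list walk, no locks taken */
		return example_lru_pages;
	}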

 
 So should I revert 81e49f or will the early return 0; completely upset
 the core shrinker logic?

It looks to me like 81e49f changed the wrong function to return
SHRINK_STOP. It should have changed i915_gem_inactive_scan() to
return SHRINK_STOP when the locks could not be taken, not
i915_gem_inactive_count().

What should happen is this:

max_pass = count_objects()
if (max_pass == 0)
/* skip to next shrinker */

/* calculate total_scan from max_pass and previous leftovers */

while (total_scan) {
freed = scan_objects(batch_size)
if (freed == SHRINK_STOP)
break; /* can't make progress */
total_scan -= batch_size;
}

/* save remaining total_scan for next pass */


i.e. SHRINK_STOP will abort the scan loop when nothing can be done.
Right now, if nothing can be done because the locks can't be taken,
the scan loop will continue running until total_scan reaches zero.
i.e. it does a whole lotta nothing.

So right now, I'd revert 81e49f and then convert
i915_gem_inactive_scan() to return SHRINK_STOP if it can't get
locks, and everything should work just fine...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[MMTests] IO metadata on XFS

2012-07-04 Thread Dave Chinner
On Tue, Jul 03, 2012 at 11:59:51AM +0100, Mel Gorman wrote:
> On Tue, Jul 03, 2012 at 10:19:28AM +1000, Dave Chinner wrote:
> > On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> > > Adding dri-devel and a few others because an i915 patch contributed to
> > > the regression.
> > > 
> > > On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > > > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > > > times per write(2) call, IIRC), so with limited numbers of
> > > > > > threads/limited CPU power it will result in lower performance. Where
> > > > > > you have lots of CPU power, there will be little difference in
> > > > > > performance...
> > > > > 
> > > > > When I checked it it could only be called twice, and we'd already
> > > > > optimize away the second call.  I'd defintively like to track down 
> > > > > where
> > > > > the performance changes happend, at least to a major version but even
> > > > > better to a -rc or git commit.
> > > > > 
> > > > 
> > > > By all means feel free to run the test yourself and run the bisection :)
> > > > 
> > > > It's rare but on this occasion the test machine is idle so I started an
> > > > automated git bisection. As you know the milage with an automated bisect
> > > > varies so it may or may not find the right commit. Test machine is 
> > > > sandy so
> > > > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > > > is the report of interest. The script is doing a full search between 
> > > > v3.3 and
> > > > v3.4 for a point where average files/sec for fsmark-single drops below 
> > > > 25000.
> > > > I did not limit the search to fs/xfs on the off-chance that it is an
> > > > apparently unrelated patch that caused the problem.
> > > > 
> > > 
> > > It was obvious very quickly that there were two distinct regressions so I
> > > ran two bisections. One led to an XFS patch and the other led to an i915 patch
> > > that enables RC6 to reduce power usage.
> > > 
> > > [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]
> > 
> > Doesn't seem to be the major cause of the regression. By itself, it
> > has impact, but the majority comes from the XFS change...
> > 
> 
> The fact it has an impact at all is weird but lets see what the DRI
> folks think about it.
> 
> > > [c999a223: xfs: introduce an allocation workqueue]
> > 
> > Which indicates that there are workqueue scheduling issues, I think.
> > The same amount of work is being done, but half of it is being
> > pushed off into a workqueue to avoid stack overflow issues (*).  I
> > tested the above patch in anger on an 8p machine, similar to the
> > machine you saw no regressions on, but the workload didn't drive it
> > to being completely CPU bound (only about 90%) so the allocation
> > work was probably always scheduled quickly.
> 
> What test were you using?

fsmark, dbench, compilebench, and a few fio workloads. Also,
xfstests times each test and I keep track of overall runtime, and
none of those showed any performance differential, either...

Indeed, running on a current 3.5-rc5 tree, my usual fsmark
benchmarks are running at the same numbers I've been seeing since
about 3.0 - somewhere around 18k files/s for a single thread, and
110-115k files/s for 8 threads.

I just ran your variant, and I'm getting about 20kfile/s for a
single thread, which is about right because you're using smaller
directories than I am (22500 files per dir vs 100k in my tests).

> > How many worker threads have been spawned on these machines
> > that are showing the regression?
> 
> 20 or 21 generally. An example list as spotted by top looks like

Pretty normal.

> > What is the context switch rate on the machines when the test is running?
.
> Vanilla average context switch rate   4278.53
> Revert  average context switch rate   1095

That seems about right, too.

> > Can you run latencytop to see
> > if there are excessive starvation/wait times for allocation
> > completion?
> 
> I'm not sure what format you are looking for.

Where the context switches are coming from, and how long they are
being stalled for. Just to get the context switch locations, you
can use perf on the sched:sched_switch event, but that doesn't give
you stall times. Local testing tells me that about 40% of the
switches are from xfs_alloc_vextent, 55% are from the work threads,
and the rest are CPU idling events, which is exactly as I'd expect.

> > A perf top profile comparison might be informative,
> > too...
> 
> I'm not sure if this is what you really wanted. I thought an oprofile or
> perf report would have made more sense but I recorded perf top over time
> anyway and it's at the end of the mail.

perf report and oprofile give you CPU usage across the run, it's not
instantaneous and that's where all the interesting information is.
e.g. a 5% sample in a 20s profile might be 5% per second for 20s, or
it might be 100% for 1s - that's the behaviour run profiles cannot
give you insight into

As it is, the output you posted is nothing unusual.

> For just these XFS tests I've uploaded a tarball of the logs

[MMTests] IO metadata on XFS

2012-07-03 Thread Dave Chinner
On Mon, Jul 02, 2012 at 08:35:16PM +0100, Mel Gorman wrote:
> Adding dri-devel and a few others because an i915 patch contributed to
> the regression.
> 
> On Mon, Jul 02, 2012 at 03:32:15PM +0100, Mel Gorman wrote:
> > On Mon, Jul 02, 2012 at 02:32:26AM -0400, Christoph Hellwig wrote:
> > > > It increases the CPU overhead (dirty_inode can be called up to 4
> > > > times per write(2) call, IIRC), so with limited numbers of
> > > > threads/limited CPU power it will result in lower performance. Where
> > > > you have lots of CPU power, there will be little difference in
> > > > performance...
> > > 
> > > When I checked it, it could only be called twice, and we'd already
> > > optimize away the second call.  I'd definitely like to track down where
> > > the performance changes happened, at least to a major version but even
> > > better to a -rc or git commit.
> > > 
> > 
> > By all means feel free to run the test yourself and run the bisection :)
> > 
> > It's rare but on this occasion the test machine is idle so I started an
> > automated git bisection. As you know the mileage with an automated bisect
> > varies so it may or may not find the right commit. Test machine is sandy so
> > http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__io-metadata-xfs/sandy/comparison.html
> > is the report of interest. The script is doing a full search between v3.3 
> > and
> > v3.4 for a point where average files/sec for fsmark-single drops below 
> > 25000.
> > I did not limit the search to fs/xfs on the off-chance that it is an
> > apparently unrelated patch that caused the problem.
> > 
> 
> It was obvious very quickly that there were two distinct regressions so I
> ran two bisections. One led to an XFS patch and the other led to an i915 patch
> that enables RC6 to reduce power usage.
> 
> [aa464191: drm/i915: enable plain RC6 on Sandy Bridge by default]

Doesn't seem to be the major cause of the regression. By itself, it
has impact, but the majority comes from the XFS change...

> [c999a223: xfs: introduce an allocation workqueue]

Which indicates that there are workqueue scheduling issues, I think.
The same amount of work is being done, but half of it is being
pushed off into a workqueue to avoid stack overflow issues (*).  I
tested the above patch in anger on an 8p machine, similar to the
machine you saw no regressions on, but the workload didn't drive it
to being completely CPU bound (only about 90%) so the allocation
work was probably always scheduled quickly.

How many worker threads have been spawned on these machines
that are showing the regression? What is the context switch rate on
the machines when the test is running? Can you run latencytop to see
if there are excessive starvation/wait times for allocation
completion? A perf top profile comparison might be informative,
too...

(*) The stack usage below submit_bio() can be more than 5k (DM, MD,
SCSI, driver, memory allocation), so it's really not safe to do
allocation anywhere below about 3k of kernel stack being used. e.g.
on a relatively trivial storage setup without the above commit:

[142296.384921] flush-253:4 used greatest stack depth: 360 bytes left

Fundamentally, 8k stacks on x86-64 are too small for our
increasingly complex storage layers and the 100+ function deep call
chains that occur.
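
To illustrate the kind of hand-off c999a223 does, here is a minimal
sketch of deferring a stack-hungry operation to a workqueue and
waiting for it to complete. All of the deep_alloc_* names are
invented, and this is not the actual XFS implementation:

#include <linux/workqueue.h>
#include <linux/completion.h>

/* Everything the deferred work needs, plus somewhere to put the result. */
struct deep_alloc_args {
        struct work_struct work;
        struct completion done;
        /* ... input parameters for the deferred operation ... */
        int result;
};

/* Placeholder for the stack-hungry operation being deferred. */
static int do_deep_allocation(struct deep_alloc_args *args);

static void deep_alloc_worker(struct work_struct *work)
{
        struct deep_alloc_args *args =
                container_of(work, struct deep_alloc_args, work);

        /* Runs on a worker thread with almost the whole stack available. */
        args->result = do_deep_allocation(args);
        complete(&args->done);
}

static int deep_alloc_deferred(struct deep_alloc_args *args)
{
        /*
         * Called from deep in a call chain (e.g. somewhere below
         * submit_bio()): push the stack-hungry part to a workqueue and
         * sleep until it has finished.
         */
        INIT_WORK_ONSTACK(&args->work, deep_alloc_worker);
        init_completion(&args->done);
        queue_work(system_unbound_wq, &args->work);
        wait_for_completion(&args->done);
        destroy_work_on_stack(&args->work);

        return args->result;
}

The price of the hand-off is the extra context switching and
scheduling latency discussed above.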

Cheers,

Dave.

-- 
Dave Chinner
david at fromorbit.com


[PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects

2011-07-14 Thread Dave Chinner
On Thu, Jul 14, 2011 at 11:48:08AM +0900, KOSAKI Motohiro wrote:
> >> If you are thinking the shrinker protocol is too complicated, doc update
> >> patch is really welcome.
> > 
> > Slab shrinkers have a nasty, crappy interface and the shrink_slab()
> > code is full of bugs.  Rather that telling people to "update the
> > documentation" because it's too complex, how about we fix the
> > interface and the bugs?
> 
> If you can take your time for that, I don't hesitate to help you. ;)

It's on my list of things to do if nobody else steps before I get
there ;)

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com


[PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects

2011-07-13 Thread Dave Chinner
shrinker is not able to
completely trash the cache like the previous page-cache based
version did.  It's a very specific solution to the problem of tuning
a shrinker for good system behaviour, but it's the only way I found
that works

Oh, and using a mutex to single thread cache reclaim rather than
spinlocks is usually a good idea from a scalability point of view
because your shrinker can be called simultaneously on every CPU.
Spinlocks really, really hurt when that happens, and performance
will plummet when it happens because you burn CPU on locks rather
than reclaiming objects. Single threaded object reclaim is still the
fastest way to do reclaim if you have global lists and locks.
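
As a bare-bones illustration of that single-threading approach (the
my_* names are placeholders rather than code from any real driver):

#include <linux/mutex.h>

static DEFINE_MUTEX(my_reclaim_mutex);

/* Placeholder: free one object, return false when nothing is reclaimable. */
static bool my_cache_evict_one(void);

/*
 * Only one caller reclaims at a time; everyone else backs off instead
 * of piling up on a contended lock when the shrinker fires on every CPU.
 */
static unsigned long my_cache_reclaim(unsigned long nr_to_scan)
{
        unsigned long freed = 0;

        if (!mutex_trylock(&my_reclaim_mutex))
                return 0;       /* reclaim already in progress elsewhere */

        while (freed < nr_to_scan && my_cache_evict_one())
                freed++;

        mutex_unlock(&my_reclaim_mutex);
        return freed;
}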

What I'm trying to say is that how you solve the shrinker balance
problem for your subsystem will be specific to how you need to hold
pages under memory pressure to maintain performance. Sorry I can't
give you a better answer than that, but that's what my experience
with caches and tuning shrinker behaviour indicates.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com


[PATCH] i915: slab shrinker have to return -1 if it cant shrink any objects

2011-07-13 Thread Dave Chinner
On Wed, Jul 13, 2011 at 05:19:22PM +0900, KOSAKI Motohiro wrote:
> (2011/07/13 16:41), Chris Wilson wrote:
> >> (snip)
> >> while (total_scan >= SHRINK_BATCH) {
> >> long this_scan = SHRINK_BATCH;
> >> int shrink_ret;
> >> int nr_before;
> >>
> >> nr_before = do_shrinker_shrink(shrinker, shrink, 
> >> 0);
> >> shrink_ret = do_shrinker_shrink(shrinker, shrink,
> >> this_scan);
> >> if (shrink_ret == -1)
> >> break;
> >>
> > 
> > And fifteen lines above that you have:
> >   unsigned long max_pass = do_shrinker_shrink(shrinker, shrinker, 0);
> >   ...
> >   shrinker->nr += f(max_pass);
> >   if (shrinker->nr < 0) printk(KERN_ERR "...");
> > 
> > That's the *error* I hit when I originally returned -1.
> 
> You misunderstand the code. The third argument is critically important.
> Only if it's 0 (ie sc->nr_to_scan==0), shrinker must not return negative.

And once again the shitty shrinker API bites a user.

> Thus, my patch checked nr_to_scan argument. and I've suggested look at
> shrink_icache_memory().

Which is going away real soon - it's not the model of perfection
that you make it out to be. ;)

> If you are thinking the shrinker protocol is too complicated, doc update
> patch is really welcome.

Slab shrinkers have a nasty, crappy interface and the shrink_slab()
code is full of bugs.  Rather than telling people to "update the
documentation" because it's too complex, how about we fix the
interface and the bugs?

Indeed, how hard is it to require a subsystem to supply two shrinker
methods, one to return the count of reclaimable objects, the other
to scan the reclaimable objects to reclaim them? After all, that's
exactly the interface I'm exposing to filesystems underneath the
shrinker API in the per-sb shrinker patchset that gets rid of
shrink_icache_memory() rather than propagating the insanity
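
For reference, the split interface being argued for here is roughly
the following shape, sketched with placeholder foo_* names (it is
essentially what later became the ->count_objects/->scan_objects
pair):

#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t foo_nr_reclaimable;                        /* placeholder */
static bool foo_can_reclaim(void);                              /* placeholder */
static unsigned long foo_reclaim(unsigned long nr_to_scan);     /* placeholder */

/* Method 1: report how many objects could be reclaimed - cheap, no locks. */
static unsigned long foo_count_objects(struct shrinker *shrinker,
                                       struct shrink_control *sc)
{
        return atomic_long_read(&foo_nr_reclaimable);
}

/* Method 2: do the actual reclaim of up to sc->nr_to_scan objects. */
static unsigned long foo_scan_objects(struct shrinker *shrinker,
                                      struct shrink_control *sc)
{
        if (!foo_can_reclaim())         /* e.g. required lock unavailable */
                return SHRINK_STOP;

        return foo_reclaim(sc->nr_to_scan);     /* number of objects freed */
}

Separating counting from scanning means the "how much is there"
question never has to take the locks that the actual reclaim needs.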

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com