On Thu, Jun 13, 2024 at 08:22:19PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 13, 2024 at 02:32:27PM -0400, Kent Overstreet wrote:

[ . . . ]

After sleeping on it, backing up a level of abstraction, and thus perhaps
responding a bit more constructively...

And also, thank you very much for looking into this!

> > commit 3bb5d34d6d6e2d5433cce5d248fd81211afedec0
> > Author: Kent Overstreet <[email protected]>
> > Date:   Mon Jun 10 20:47:03 2024 -0400
> > 
> >     bcachefs: pending_rcu_items
> >     
> >     Generic (?) data structure for explicitly tracking pending SRCU items
> >     
> >     - Can we make this generic across other RCU flavors?

It should be possible, at least for current RCU and SRCU.  I think I owe
you SRCU variants of a couple of existing RCU functions.  Later today!
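
For your first question, one way this might become flavor-agnostic
(completely untested, and none of these names exist in the tree, so
please treat it as a sketch only) is a small ops table that hides which
flavor supplies the polled-grace-period primitives, with
pending_rcu_items carrying a pointer to it:

/* Sketch only; relies on <linux/srcu.h> and <linux/rcupdate.h>. */
struct pending_rcu_ops {
	/* Return a cookie, starting a grace period if needed. */
	unsigned long	(*start_poll)(void *arg);
	/* Has the grace period for this cookie ended? */
	bool		(*poll_state)(void *arg, unsigned long cookie);
	/* Queue a callback to run after a grace period. */
	void		(*call)(void *arg, struct rcu_head *rhp,
				rcu_callback_t func);
};

/* SRCU instance: @arg is the srcu_struct. */
static unsigned long pending_srcu_start_poll(void *arg)
{
	return start_poll_synchronize_srcu(arg);
}

static bool pending_srcu_poll_state(void *arg, unsigned long cookie)
{
	return poll_state_synchronize_srcu(arg, cookie);
}

static void pending_srcu_call(void *arg, struct rcu_head *rhp,
			      rcu_callback_t func)
{
	call_srcu(arg, rhp, func);
}

static const struct pending_rcu_ops pending_srcu_ops = {
	.start_poll	= pending_srcu_start_poll,
	.poll_state	= pending_srcu_poll_state,
	.call		= pending_srcu_call,
};

A vanilla-RCU instance would simply ignore @arg and wrap
start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and
call_rcu().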

> >     - Do we need to add a fallback to linked lists when darray allocation
> >       fails?

In general, as in kfree_rcu(), yes.  In your specific case here, maybe not.
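
For concreteness, here is roughly the kfree_rcu() approach transplanted
into this context (untested; struct pending_obj and the helper are
invented for illustration, and note that your patch removes the rcu_head
from struct bkey_cached, so this would mean keeping it): if the
pointer-array allocation fails, fall back to the object's own embedded
rcu_head, so that forward progress never depends on the allocator.

struct pending_obj {
	struct rcu_head	rcu;	/* used only on the fallback path */
	/* ... payload ... */
};

static void pending_obj_srcu_cb(struct rcu_head *rhp)
{
	struct pending_obj *obj = container_of(rhp, struct pending_obj, rcu);

	kfree(obj);	/* or whatever ->process() would have done */
}

static void pending_enqueue_or_fallback(struct pending_rcu_items *pending,
					struct pending_obj *obj)
{
	/* Which ->objs[] slot to use is elided here. */
	if (!darray_push_gfp(&pending->objs[0], obj, GFP_ATOMIC|__GFP_NOWARN))
		return;		/* queued in bulk, the common case */

	/* No memory: a per-object callback still guarantees progress. */
	call_srcu(pending->srcu, &obj->rcu, pending_obj_srcu_cb);
}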

> >     - We wish to avoid the overhead of start_poll_synchronize_srcu() if we
> >       already have items for a given sequence number, but it appears
> >       get_state_synchronize_srcu() and start_poll_synchronize_srcu() can
> >       actually return different sequence numbers within the same read-side
> >       critical section. Is this the correct way of doing this?

You can always buffer up items that have not yet been assigned a sequence
number, then after you have enough or after enough time has passed, do a
single get_state_synchronize_srcu() for the group.
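
Something along these lines is what I have in mind (untested; the
->unseq darray and the batch threshold are invented here, and the slot
bookkeeping is elided): newly freed objects are parked without a cookie,
and a single cookie covers the whole batch when it is flushed, so the
per-object cost stays at a darray push.

#define PENDING_BATCH	16	/* arbitrary, for illustration */

static void pending_flush_batch(struct pending_rcu_items *pending)
{
	unsigned long seq;

	lockdep_assert_held(&pending->lock);

	if (pending->unseq.nr < PENDING_BATCH)
		return;		/* keep buffering */

	/* One cookie covers every object parked since the last flush: */
	seq = get_state_synchronize_srcu(pending->srcu);

	/* File the whole batch under @seq; slot bookkeeping elided. */
	swap(pending->objs[0], pending->unseq);
	pending->seq = seq;

	/*
	 * Something must still drive the grace period forward, for
	 * example the existing rcu_armed/call_srcu() logic.
	 */
}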

> >     - Do we need to ensure the rearming RCU callback is correctly flushed
> >       when exiting? (Probably)

I believe so, and setting a flag then doing an srcu_barrier() should suffice.
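
Roughly like this (untested; ->shutting_down is invented for this
sketch, and pending_rcu_items_rcu_cb() would check it before re-arming):

static void pending_rcu_items_shutdown(struct pending_rcu_items *pending)
{
	unsigned long flags;

	spin_lock_irqsave(&pending->lock, flags);
	pending->shutting_down = true;	/* callback must not re-arm */
	spin_unlock_irqrestore(&pending->lock, flags);

	/* Wait for any already-queued callback invocation to finish: */
	srcu_barrier(pending->srcu);

	/* No new SRCU readers at exit time, so reclaim what remains: */
	darray_for_each(pending->objs[0], i)
		pending->process(pending, *i);
	darray_for_each(pending->objs[1], i)
		pending->process(pending, *i);
	pending_rcu_items_exit(pending);
}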

> >     Signed-off-by: Kent Overstreet <[email protected]>
> > 
> > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > index ebed60a43647..46f7fa549d59 100644
> > --- a/fs/bcachefs/btree_key_cache.c
> > +++ b/fs/bcachefs/btree_key_cache.c
> > @@ -83,18 +83,109 @@ static bool bkey_cached_evict(struct btree_key_cache *c,
> >     return ret;
> >  }
> >  
> > -static void __bkey_cached_free(struct rcu_head *rcu)
> > +#define RCU_SEQ_SKIP_SHIFT 2
> > +
> > +static void pending_rcu_items_exit(struct pending_rcu_items *pending)
> > +{
> > +   darray_exit(&pending->objs[0]);
> > +   darray_exit(&pending->objs[1]);
> > +}
> > +
> > +static void pending_rcu_items_init(struct pending_rcu_items *pending,
> > +                              struct srcu_struct *srcu,
> > +                              pending_rcu_item_process_fn process)
> > +{
> > +   memset(pending, 0, sizeof(*pending));
> > +   pending->srcu           = srcu;
> > +   pending->process        = process;
> > +   spin_lock_init(&pending->lock);
> > +}
> > +
> > +static bool pending_rcu_has_items(struct pending_rcu_items *pending)
> > +{
> > +   return pending->objs[0].nr || pending->objs[1].nr;
> > +}
> > +
> > +static void pending_rcu_items_process(struct pending_rcu_items *pending)
> >  {
> > -   struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
> > +   while (pending_rcu_has_items(pending) &&
> > +          poll_state_synchronize_srcu(pending->srcu, pending->seq)) {
> > +           darray_for_each(pending->objs[0], i)
> > +                   pending->process(pending, *i);

Given that you have interrupts and/or bh disabled, there needs to be a
third set of objects: those whose grace periods have already elapsed.

Unioning one of the ->objs[] sets onto this third set, leaving the source
->objs[] set empty, should be O(1) and should not require memory
allocation.  I might be missing something, but at first glance darray
does not support this.

The pages-of-pointers structure named kvfree_rcu_bulk_data, along with
the associated structures in kernel/rcu/tree.c, does this job, but (1)
it is open-coded and (2) it has to deal with the full range of what the
kernel can throw at it.  It also falls back to an embedded rcu_head
structure, which answers your second question in the commit log.

But perhaps this code could be generalized?  Or maybe darray can be
appropriately extended?
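
For example, a darray extension along these lines (invented here, not in
the tree) would provide the O(1) union for the common case where the
destination set has already been drained:

/*
 * Move all of @_src's storage to @_dst in O(1), leaving @_src empty.
 * Only valid when @_dst is already empty; the general case wants a
 * chunked structure along the lines of kvfree_rcu_bulk_data.
 */
#define darray_steal(_dst, _src)					\
do {									\
	BUG_ON((_dst)->nr);						\
	swap(*(_dst), *(_src));						\
} while (0)

The SRCU callback could then steal the expired ->objs[] set into a
hypothetical ->ready set under the lock and leave the actual freeing to
process context (see the workqueue sketch below).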

And one thing that I missed the first time around is that you are using
a single ->seq for two sets of objects (->objs[0] and ->objs[1]).
Each set needs its own sequence number.
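
Putting that together with the two-element ->seq[] idea from my earlier
reply (quoted further down), the lookup might be as simple as this
untested sketch, which assumes ->seq becomes unsigned long seq[2],
paired one-to-one with ->objs[]:

static darray_voidp *pending_rcu_items_get_objs(struct pending_rcu_items *pending,
						unsigned long seq)
{
	darray_voidp *empty = NULL;

	for (unsigned i = 0; i < ARRAY_SIZE(pending->objs); i++) {
		if (pending->objs[i].nr && pending->seq[i] == seq)
			return &pending->objs[i];  /* already tracking this GP */
		if (!pending->objs[i].nr)
			empty = &pending->objs[i];
	}

	/*
	 * Guaranteed if start_poll_synchronize_srcu() precedes
	 * pending_rcu_items_process(): at most two cookies can be
	 * unexpired at any time, and expired sets were just drained.
	 */
	BUG_ON(!empty);
	pending->seq[empty - pending->objs] = seq;
	return empty;
}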

> > +           darray_exit(&pending->objs[0]);
> > +           swap(pending->objs[0], pending->objs[1]);
> > +           pending->seq += 1 << RCU_SEQ_SKIP_SHIFT;
> > +   }
> > +}
> > +
> > +static void pending_rcu_items_rcu_cb(struct rcu_head *rcu)
> > +{
> > +   struct pending_rcu_items *pending =
> > +           container_of(rcu, struct pending_rcu_items, rcu);
> > +
> > +   unsigned long flags;
> > +   spin_lock_irqsave(&pending->lock, flags);
> > +   pending_rcu_items_process(pending);
> 
> SRCU callbacks execute in softirq context.  You therefore might need to
> push this call to pending_rcu_items_process() to a workqueue handler,
> at least if there are very many objects in need of processing.

Plus you have irqs disabled, which I somehow missed on the first pass.

Again, please union the objects whose grace period has ended into a set
that is ready to be processed, and then do the processing in a workqueue
handler, kthread, or similar.  This avoids causing real-time latency
problems, or, in extreme cases, softlockups or RCU CPU stall warnings.
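
Concretely, something like this untested sketch, where ->ready and
->work are invented fields and pending_rcu_items_init() would gain an
INIT_WORK(): the SRCU callback only moves expired sets onto ->ready and
does a queue_work(), and the handler frees objects with no locks held
and interrupts enabled.

static void pending_rcu_items_work(struct work_struct *work)
{
	struct pending_rcu_items *pending =
		container_of(work, struct pending_rcu_items, work);
	darray_voidp ready = {};
	unsigned long flags;

	spin_lock_irqsave(&pending->lock, flags);
	swap(ready, pending->ready);	/* O(1): steal the expired set */
	spin_unlock_irqrestore(&pending->lock, flags);

	darray_for_each(ready, i)	/* may sleep; no latency pressure */
		pending->process(pending, *i);
	darray_exit(&ready);
}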

> > -   kmem_cache_free(bch2_key_cache, ck);
> > +   pending->rcu_armed = pending_rcu_has_items(pending);
> > +   if (pending->rcu_armed)
> > +           call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> > +   spin_unlock_irqrestore(&pending->lock, flags);
> > +}
> > +
> > +static void pending_rcu_item_enqueue(struct pending_rcu_items *pending, void *obj)
> > +{
> > +   darray_voidp preallocated = {};
> > +retry:
> > +   spin_lock_irq(&pending->lock);
> > +   pending_rcu_items_process(pending);
> 
> Suppose that poll_state_synchronize_srcu() returns false so that
> start_poll_synchronize_srcu() does nothing...
> 
> ... then we get vCPU preemption here so that some number of SRCU grace
> periods elapse (which looks to be limited to just one of them, but that
> is enough because we might already have requested the next one courtesy of
> the previous call to either call_srcu() or start_poll_synchronize_srcu()) ...
> 
> > +   unsigned long seq = start_poll_synchronize_srcu(pending->srcu);
> 
> ... so that we get a seq that is 8 or 12 larger than pending->seq ...
> 
> > +   if (!pending_rcu_has_items(pending))
> > +           pending->seq = seq;
> 
> ... and we don't update pending->seq because there are still items there ...
> 
> > +   unsigned idx = (seq - pending->seq) >> RCU_SEQ_SKIP_SHIFT;
> > +   BUG_ON(idx > ARRAY_SIZE(pending->objs));

And more on this later today.

                                                        Thanx, Paul

> ... then can't this BUG_ON() trigger?
> 
> And why this ugly reliance on the cookie format?  Why not just have
> the pending_rcu_items structure's ->seq be a two-element array?
> 
> Then all you need to do is to check each ->seq[] element to see if it
> is equal to seq, while remembering the index of the first that is empty.
> If you find a ->seq[] match, just enqueue into the corresponding ->objs[]
> element.  Otherwise, you are guaranteed to have found an empty ->obj[],
> at least if you move the start_poll_synchronize_srcu() to precede
> the pending_rcu_items_process().
> 
> *Much* looser coupling with SRCU and rather simpler.
> 
> > +   darray_voidp *objs = pending->objs + idx;
> > +
> > +   if (!darray_room(*objs)) {
> > +           if (preallocated.size > objs->size) {
> > +                   memcpy(preallocated.data,
> > +                          objs->data,
> > +                          sizeof(objs->data[0]) * objs->nr);
> > +                   swap(preallocated.data, objs->data);
> > +                   swap(preallocated.size, objs->size);
> > +           } else if (darray_make_room_gfp(objs, 1, GFP_ATOMIC|__GFP_NOWARN)) {
> > +                   spin_unlock_irq(&pending->lock);
> > +                   darray_exit(&preallocated);
> > +                   darray_make_room_gfp(&preallocated, objs->nr + 1, GFP_KERNEL|__GFP_NOFAIL);
> > +                   goto retry;
> > +           }
> > +   }
> > +
> > +   BUG_ON(darray_push(objs, obj)); /* preallocated */
> > +
> > +   if (!pending->rcu_armed)
> > +           call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> 
> The start_poll_synchronize_srcu() above might have started a new
> SRCU grace period.  If so, this call_srcu() will start another one.
> Which should not be a problem.  If it does prove to be, you might use
> get_state_synchronize_srcu() instead, depending.
> 
> On the other hand, if this code is prone to bkey_cached_free() floods, the
> current combination of start_poll_synchronize_srcu() and call_srcu()
> might avoid OOM, given that this function checks for pending objects
> whose grace period has ended.
> 
> > +   pending->rcu_armed = true;
> > +   spin_unlock_irq(&pending->lock);
> > +
> > +   darray_exit(&preallocated);
> > +}
> > +
> > +static void __bkey_cached_free(struct pending_rcu_items *pending, void *obj)
> > +{
> > +   struct bch_fs *c = container_of(pending->srcu, struct bch_fs, btree_trans_barrier);
> > +
> > +   this_cpu_dec(*c->btree_key_cache.nr_pending);
> > +   kmem_cache_free(bch2_key_cache, obj);
> >  }
> >  
> >  static void bkey_cached_free(struct btree_key_cache *bc,
> >                          struct bkey_cached *ck)
> >  {
> > -   struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> > -
> >     kfree(ck->k);
> >     ck->k           = NULL;
> >     ck->u64s        = 0;
> > @@ -102,7 +193,15 @@ static void bkey_cached_free(struct btree_key_cache *bc,
> >     six_unlock_write(&ck->c.lock);
> >     six_unlock_intent(&ck->c.lock);
> >  
> > -   call_srcu(&c->btree_trans_barrier, &ck->rcu, __bkey_cached_free);
> > +   preempt_disable();
> > +   struct pending_rcu_items *pending = this_cpu_ptr(bc->pending);
> > +   preempt_enable();
> > +   /*
> > +    * pending is protected by a lock, we don't need preemption disabled -
> > +    * and it expects to run in sleepable context:
> > +    */
> > +   pending_rcu_item_enqueue(pending, ck);
> > +   this_cpu_inc(*bc->nr_pending);
> >  }
> >  
> >  static struct bkey_cached *
> > @@ -778,6 +877,13 @@ void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
> >  
> >     if (bc->table_init_done)
> >             rhashtable_destroy(&bc->table);
> > +
> > +   free_percpu(bc->nr_pending);
> > +
> > +   int cpu;
> > +   for_each_possible_cpu(cpu)
> > +           pending_rcu_items_exit(per_cpu_ptr(bc->pending, cpu));
> > +   free_percpu(bc->pending);
> >  }
> >  
> >  void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
> > @@ -789,6 +895,20 @@ int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
> >     struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> >     struct shrinker *shrink;
> >  
> > +#ifdef __KERNEL__
> > +   bc->pending = alloc_percpu(struct pending_rcu_items);
> > +   if (!bc->pending)
> > +           return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +#endif
> > +   int cpu;
> > +   for_each_possible_cpu(cpu)
> > +           pending_rcu_items_init(per_cpu_ptr(bc->pending, cpu),
> > +                                  &c->btree_trans_barrier, __bkey_cached_free);
> > +
> > +   bc->nr_pending = alloc_percpu(size_t);
> > +   if (!bc->nr_pending)
> > +           return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +
> >     if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
> >             return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> >  
> > @@ -815,13 +935,15 @@ void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *
> >     prt_printf(out, "keys:\t%lu\r\n",               atomic_long_read(&bc->nr_keys));
> >     prt_printf(out, "dirty:\t%lu\r\n",              atomic_long_read(&bc->nr_dirty));
> >     prt_printf(out, "table size:\t%u\r\n",          bc->table.tbl->size);
> > -
> > -   prt_printf(out, "\nshrinker:\n");
> > +   prt_newline(out);
> > +   prt_printf(out, "shrinker:\n");
> >     prt_printf(out, "requested_to_free:\t%lu\r\n",  bc->requested_to_free);
> >     prt_printf(out, "freed:\t%lu\r\n",              bc->freed);
> >     prt_printf(out, "skipped_dirty:\t%lu\r\n",      bc->skipped_dirty);
> >     prt_printf(out, "skipped_accessed:\t%lu\r\n",   bc->skipped_accessed);
> >     prt_printf(out, "skipped_lock_fail:\t%lu\r\n",  bc->skipped_lock_fail);
> > +   prt_newline(out);
> > +   prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
> >  }
> >  
> >  void bch2_btree_key_cache_exit(void)
> > diff --git a/fs/bcachefs/btree_key_cache_types.h b/fs/bcachefs/btree_key_cache_types.h
> > index e026c65f54e1..8d11cbac4c37 100644
> > --- a/fs/bcachefs/btree_key_cache_types.h
> > +++ b/fs/bcachefs/btree_key_cache_types.h
> > @@ -2,12 +2,29 @@
> >  #ifndef _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  #define _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  
> > +#include "darray.h"
> > +
> > +struct pending_rcu_items;
> > +typedef void (*pending_rcu_item_process_fn)(struct pending_rcu_items *, void *);
> > +
> > +struct pending_rcu_items {
> > +   struct srcu_struct              *srcu;
> > +   pending_rcu_item_process_fn     process;
> > +   spinlock_t                      lock;
> > +   bool                            rcu_armed;
> > +   unsigned long                   seq;
> > +   darray_voidp                    objs[2];
> > +   struct rcu_head                 rcu;
> > +};
> > +
> >  struct btree_key_cache {
> >     struct rhashtable       table;
> >     bool                    table_init_done;
> >  
> >     struct shrinker         *shrink;
> >     unsigned                shrink_iter;
> > +   struct pending_rcu_items __percpu *pending;
> > +   size_t __percpu         *nr_pending;
> >  
> >     atomic_long_t           nr_keys;
> >     atomic_long_t           nr_dirty;
> > diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
> > index 67f3efd9a007..56046f09ac65 100644
> > --- a/fs/bcachefs/btree_types.h
> > +++ b/fs/bcachefs/btree_types.h
> > @@ -396,8 +396,6 @@ struct bkey_cached {
> >     u64                     seq;
> >  
> >     struct bkey_i           *k;
> > -
> > -   struct rcu_head         rcu;
> >  };
> >  
> >  static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
> > diff --git a/fs/bcachefs/darray.h b/fs/bcachefs/darray.h
> > index 4b340d13caac..cdace10f4572 100644
> > --- a/fs/bcachefs/darray.h
> > +++ b/fs/bcachefs/darray.h
> > @@ -19,6 +19,7 @@ struct {                                                        \
> >  
> >  #define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
> >  
> > +typedef DARRAY(void *)     darray_voidp;
> >  typedef DARRAY(char)       darray_char;
> >  typedef DARRAY(char *) darray_str;
> >  
