On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 15:22, Boqun Feng wrote:
> [...]
> > > > Could you utilize this[1] to see a
> > > > comparison of the reader-side performance against RCU/SRCU?
> > > 
> > > Good point ! Let's see.
> > > 
> > > On an AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > hyperthreading disabled,
> > > CONFIG_PREEMPT=y,
> > > CONFIG_PREEMPT_RCU=y,
> > > CONFIG_PREEMPT_HAZPTR=y.
> > > 
> > > scale_type                 ns
> > > -----------------------
> > > hazptr-smp-mb             13.1   <- this implementation
> > > hazptr-barrier            11.5   <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize
> > > hazptr-smp-mb-hlist       12.7   <- replace per-task hp context and per-cpu overflow lists by hlist
> > > rcu                       17.0
> > > srcu                      20.0
> > > srcu-fast                  1.5
> > > rcu-tasks                  0.0
> > > rcu-trace                  1.7
> > > refcnt                  1148.0
> > > rwlock                  1190.0
> > > rwsem                   4199.3
> > > lock                   41070.6
> > > lock-irq               46176.3
> > > acqrel                     1.1
> > > 
> > > So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> > > appear to beat hazptr read-side performance.
> > > 
> > 
> > Could you also see the reader-side performance impact when the percpu
> > hazard pointer slots are used up? I.e. the worst case.
> 
> I've modified the code to populate "(void *)1UL" in the first 7 slots
> at bootup; here is the result:
> 
> hazptr-smp-mb-7-fail    16.3 ns
> 
> So we go from 13.1 ns to 16.3 ns when all slots but one are used.
> 
> And if we pre-populate all 8 slots for each CPU, thus forcing
> fallback to the overflow list:
> 
> hazptr-smp-mb-8-fail    67.1 ns
> 

Thank you! So involving locking seems to hurt performance more than the
per-CPU/per-task operations do. This may suggest that enabling
PREEMPT_HAZPTR by default has acceptable performance.
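
To make the comparison concrete, here is a minimal sketch of the two
read-side paths as I understand them. Every name below is made up for
illustration (it is not the patch's code), and the fast path assumes
the caller already prevents migration/reentrancy on the slot array
(preemption disabled, or a per-task context under CONFIG_PREEMPT_HAZPTR):

#include <linux/percpu.h>
#include <linux/atomic.h>

#define HAZPTR_NR_PERCPU_SLOTS  8

/* Illustrative types only. */
struct hazptr_slot {
        void *addr;
};

struct hazptr_percpu_slots {
        struct hazptr_slot slots[HAZPTR_NR_PERCPU_SLOTS];
};

static DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);

/* Fast path (the ~13 ns case): no lock, one smp_mb() to publish. */
static struct hazptr_slot *hazptr_try_acquire_percpu(void *addr)
{
        struct hazptr_percpu_slots *s = this_cpu_ptr(&hazptr_percpu_slots);
        int i;

        for (i = 0; i < HAZPTR_NR_PERCPU_SLOTS; i++) {
                if (!READ_ONCE(s->slots[i].addr)) {
                        WRITE_ONCE(s->slots[i].addr, addr);
                        /* Publish the slot before the caller revalidates the pointer. */
                        smp_mb();
                        return &s->slots[i];
                }
        }
        /*
         * All 8 slots busy (the ~67 ns case): the caller falls back to a
         * backup slot on the raw-spinlock-protected overflow list.
         */
        return NULL;
}

The 13.1 ns -> 67.1 ns jump would then essentially be the cost of that
locked overflow-list fallback.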

> > 
> > > [...]
> > > 
> > > > > +/*
> > > > > + * Perform piecewise iteration on overflow list waiting until "addr"
> > > > > + * is not present. Raw spinlock is released and taken between each
> > > > > + * list item and busy loop iteration. The overflow list generation
> > > > > + * is checked each time the lock is taken to validate that the list
> > > > > + * has not changed before resuming iteration or busy wait. If the
> > > > > + * generation has changed, retry the entire list traversal.
> > > > > + */
> > > > > +static
> > > > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > > > +{
> > > > > +     struct hazptr_backup_slot *backup_slot;
> > > > > +     uint64_t snapshot_gen;
> > > > > +
> > > > > +     raw_spin_lock(&overflow_list->lock);
> > > > > +retry:
> > > > > +     snapshot_gen = overflow_list->gen;
> > > > > +     list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > > > +             /* Busy-wait if node is found. */
> > > > > +             while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > > > +                     raw_spin_unlock(&overflow_list->lock);
> > > > > +                     cpu_relax();
> > > > 
> > > > I think we should prioritize the scan thread solution [2] instead of
> > > > busy-waiting in hazard pointer updaters, because when we have multiple
> > > > hazard pointer usages we would want to consolidate the scans from the
> > > > updater side.
> > > 
> > > I agree that batching scans with a worker thread is a logical next step.
> > > 
> > > > If so, the whole ->gen can be avoided.
> > > 
> > > How would it allow removing the generation trick without causing long
> > > raw spinlock latencies ?
> > > 
> > 
> > Because we won't need to busy-wait for the readers to go away, we can
> > check whether they are still there in the next scan.
> > 
> > so:
> > 
> >     list_for_each_entry(backup_slot, &overflow_list->head, node) {
> >             /* Busy-wait if node is found. */
> >             if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> >                     <mark addr as unable to free and move on>
> 
> But then you still iterate on a possibly large list of overflow nodes,
> with a raw spinlock held. That raw spinlock is taken by the scheduler
> on context switch. This can cause very long scheduler latency.
> 

That's fair.

> So breaking up the iteration into pieces is not just to handle
> busy-waiting, but also to make sure we don't increase the
> system latency by holding a raw spinlock (taken with rq lock
> held) for more than the little time needed to iterate to the next
> node.
> 
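
Just to make sure I'm reading the scheme right, the pattern would be
roughly the following (my own simplified sketch, reusing the field
names from the quoted patch but not its exact code; the patch also
cycles the lock between list items, which I omit here for brevity):

#include <linux/list.h>
#include <linux/spinlock.h>

/* Minimal illustrative definitions, only the fields used below. */
struct hazptr_backup_slot {
        struct hazptr_slot      slot;   /* ->addr is the protected pointer. */
        struct list_head        node;
};

struct overflow_list {
        raw_spinlock_t          lock;
        uint64_t                gen;    /* Bumped whenever the list changes. */
        struct list_head        head;
};

static void synchronize_overflow_list_sketch(struct overflow_list *overflow_list,
                                             void *addr)
{
        struct hazptr_backup_slot *backup_slot;
        uint64_t snapshot_gen;

        raw_spin_lock(&overflow_list->lock);
retry:
        snapshot_gen = overflow_list->gen;
        list_for_each_entry(backup_slot, &overflow_list->head, node) {
                /* Busy-wait with the lock dropped while the slot protects addr. */
                while (smp_load_acquire(&backup_slot->slot.addr) == addr) {
                        raw_spin_unlock(&overflow_list->lock);
                        cpu_relax();
                        raw_spin_lock(&overflow_list->lock);
                        /* The list may have changed while unlocked. */
                        if (overflow_list->gen != snapshot_gen)
                                goto retry;
                }
        }
        raw_spin_unlock(&overflow_list->lock);
}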

I agree that it helps reduce the latency, but with a scan thread in the
picture (where we no longer need to busy-wait), I feel we should use a
forward-progress-guaranteed approach for the updater-side scan. That
means we may need to explore solutions other than the generation
counter to keep the latency down, e.g. a fine-grained-locking hash list
for the overflow list.
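
Something along these lines is what I have in mind. This is only a
sketch: every name below is made up, the buckets are assumed to be
initialized elsewhere (raw_spin_lock_init()/INIT_LIST_HEAD()), readers
overflowing their per-CPU slots are assumed to insert their backup slot
(same struct hazptr_backup_slot as in the patch) into the bucket of the
address they protect, and the per-CPU/per-task slots would of course
also need to be scanned (omitted here):

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define OVERFLOW_HASH_BITS      6

struct overflow_bucket {
        raw_spinlock_t          lock;
        struct list_head        head;   /* of struct hazptr_backup_slot::node */
};

static struct overflow_bucket overflow_hash[1 << OVERFLOW_HASH_BITS];

/*
 * Is @addr still published in some overflow slot?  Only the bucket for
 * @addr is locked, so the raw spinlock hold time stays short no matter
 * how many overflow slots exist in total.
 */
static bool overflow_addr_protected(void *addr)
{
        struct overflow_bucket *b = &overflow_hash[hash_ptr(addr, OVERFLOW_HASH_BITS)];
        struct hazptr_backup_slot *bs;
        bool found = false;

        raw_spin_lock(&b->lock);
        list_for_each_entry(bs, &b->head, node) {
                if (smp_load_acquire(&bs->slot.addr) == addr) { /* Load B */
                        found = true;
                        break;
                }
        }
        raw_spin_unlock(&b->lock);
        return found;
}

/* Placeholder for whatever the retirement path ends up looking like. */
struct hazptr_retire_node {
        void                    *addr;
        void                    (*free)(void *addr);
        struct list_head        node;
};

/*
 * Scan-thread side: instead of busy-waiting on a reader, a retired
 * address that is still protected is simply deferred to the next scan
 * pass, so the scan's forward progress never depends on readers.
 */
static void hazptr_scan_once(struct list_head *pending, struct list_head *next_pass)
{
        struct hazptr_retire_node *rn, *tmp;

        list_for_each_entry_safe(rn, tmp, pending, node) {
                if (overflow_addr_protected(rn->addr))
                        list_move_tail(&rn->node, next_pass);   /* Re-check next scan. */
                else
                        rn->free(rn->addr);                     /* Safe to reclaim now. */
        }
}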

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
