On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 15:22, Boqun Feng wrote:
> [...]
> > > > Could you utilize this[1] to see a
> > > > comparison of the reader-side performance against RCU/SRCU?
> > >
> > > Good point! Let's see.
> > >
> > > On an AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > hyperthreading disabled,
> > > CONFIG_PREEMPT=y,
> > > CONFIG_PREEMPT_RCU=y,
> > > CONFIG_PREEMPT_HAZPTR=y.
> > >
> > > scale_type              ns
> > > ----------------------------
> > > hazptr-smp-mb         13.1   <- this implementation
> > > hazptr-barrier        11.5   <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> > > hazptr-smp-mb-hlist   12.7   <- replace per-task hp context and per-cpu overflow lists by hlist.
> > > rcu                   17.0
> > > srcu                  20.0
> > > srcu-fast              1.5
> > > rcu-tasks              0.0
> > > rcu-trace              1.7
> > > refcnt              1148.0
> > > rwlock              1190.0
> > > rwsem               4199.3
> > > lock               41070.6
> > > lock-irq           46176.3
> > > acqrel                 1.1
> > >
> > > So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> > > appear to beat hazptr read-side performance.
> > >
> >
> > Could you also see the reader-side performance impact when the percpu
> > hazard pointer slots are used up? I.e. the worst case.
>
> I've modified the code to populate "(void *)1UL" in the first 7 slots
> at bootup; here is the result:
>
> hazptr-smp-mb-7-fail 16.3 ns
>
> So we go from 13.1 ns to 16.3 ns when all but one of the slots are used.
>
> And if we pre-populate all 8 slots for each CPU, thus forcing a
> fallback to the overflow list:
>
> hazptr-smp-mb-8-fail 67.1 ns
>
Thank you! So involving locking seems to hurt performance more than the
per-CPU/per-task operations do. This may suggest that enabling
PREEMPT_HAZPTR by default has acceptable performance.
> >
> > > [...]
> > >
> > > > > +/*
> > > > > + * Perform piecewise iteration on overflow list waiting until "addr" is
> > > > > + * not present. Raw spinlock is released and taken between each list
> > > > > + * item and busy loop iteration. The overflow list generation is checked
> > > > > + * each time the lock is taken to validate that the list has not changed
> > > > > + * before resuming iteration or busy wait. If the generation has
> > > > > + * changed, retry the entire list traversal.
> > > > > + */
> > > > > +static
> > > > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > > > +{
> > > > > +	struct hazptr_backup_slot *backup_slot;
> > > > > +	uint64_t snapshot_gen;
> > > > > +
> > > > > +	raw_spin_lock(&overflow_list->lock);
> > > > > +retry:
> > > > > +	snapshot_gen = overflow_list->gen;
> > > > > +	list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > > > +		/* Busy-wait if node is found. */
> > > > > +		while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > > > +			raw_spin_unlock(&overflow_list->lock);
> > > > > +			cpu_relax();
> > > >
> > > > I think we should prioritize the scan thread solution [2] instead of
> > > > busy-waiting in hazard pointer updaters, because when we have multiple
> > > > hazard pointer usages we would want to consolidate the scans from the
> > > > updater side.
> > >
> > > I agree that batching scans with a worker thread is a logical next step.
> > >
> > > > If so, the whole ->gen can be avoided.
> > >
> > > How would it allow removing the generation trick without causing long
> > > raw spinlock latencies?
> > >
> >
> > Because we won't need to busy-wait for the readers to go away, we can
> > check whether they are still there in the next scan.
> >
> > so:
> >
> > 	list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > 		/* If node is found, don't busy-wait: */
> > 		if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > 			<mark addr as unable to free and move on>
>
> But then you still iterate on a possibly large list of overflow nodes,
> with a raw spinlock held. That raw spinlock is taken by the scheduler
> on context switch. This can cause very long scheduler latency.
>
That's fair.
> So breaking up the iteration into pieces is not just to handle
> busy-waiting, but also to make sure we don't increase the
> system latency by holding a raw spinlock (taken with rq lock
> held) for more than the little time needed to iterate to the next
> node.
>
I agree that it helps reduce the latency, but with a scan thread in the
picture (where we no longer need to busy-wait), the updater-side scan
should be done in a way that guarantees forward progress. That means we
may need to explore solutions other than the generation counter to keep
the latency down, e.g. a fine-grained locking hashlist for the overflow
list.
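To make that a bit more concrete, here is a very rough sketch of the kind
of thing I have in mind. Everything below is made up for illustration and
not taken from your patch: the bucket count, hashing the backup slots by
the protected address, and the hazptr_overflow_* names are all
assumptions; struct hazptr_backup_slot and slot.addr are from your patch,
but I assume its linkage member "node" becomes an hlist_node, and bucket
lock initialization is omitted:

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/barrier.h>

#define HAZPTR_OVERFLOW_HASH_BITS	6

struct hazptr_overflow_bucket {
	raw_spinlock_t		lock;	/* Protects only this bucket's chain. */
	struct hlist_head	head;
};

/* Bucket locks would be initialized at boot with raw_spin_lock_init(). */
static struct hazptr_overflow_bucket overflow_hash[1U << HAZPTR_OVERFLOW_HASH_BITS];

/*
 * Non-busy-waiting check used by the batched scan: return true if @addr
 * is not published in any overflow backup slot, false if it is still
 * protected and should simply be re-checked on the next scan.
 */
static bool hazptr_overflow_addr_clear(void *addr)
{
	struct hazptr_overflow_bucket *bucket =
		&overflow_hash[hash_ptr(addr, HAZPTR_OVERFLOW_HASH_BITS)];
	struct hazptr_backup_slot *backup_slot;
	bool busy = false;

	/* Only one short per-bucket chain is walked with a raw spinlock held. */
	raw_spin_lock(&bucket->lock);
	hlist_for_each_entry(backup_slot, &bucket->head, node) {
		if (smp_load_acquire(&backup_slot->slot.addr) == addr) {
			busy = true;	/* Still protected: defer to the next scan. */
			break;
		}
	}
	raw_spin_unlock(&bucket->lock);

	return !busy;
}

The read-side overflow path would then insert/remove its backup slot in
the bucket that the protected address hashes to, so neither the readers
nor the scan ever hold more than one short per-bucket raw spinlock, and a
still-protected address is simply left for the next scan instead of being
busy-waited on under the lock.
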
Regards,
Boqun
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com