<snip>

> 
> > > > In my mind, any LTS/backports get the simplest/highest-confidence
> > > > bugfix: using atomics.
> > > > The atomics are behind the "service stats" feature enable, so
> > > > impact is only when those are enabled.
> > > >
> > > > If there is still a performance hit, and there are *no* MT services
> > > > registered, we could check a static-global flag, and if there are no
> > > > MT services use the normal adds. Thoughts on such a solution to
> > > > reduce atomic perf impact only to apps with MT-services?
> > > The choice is not between atomic stats and non-atomic stats. The
> > > stats need to be incremented atomically in both cases (MT or no MT
> > > services), as the reader could be another thread. We need to ensure
> > > that the stores and loads are not split.
> > >
> > > Hence, we need atomic operations. We cannot avoid any performance hit
> > > here.
> > >
> > > I think the choice is between per-core stats vs global stats. IMO,
> > > we should go with per-core stats to be aligned with the norm.
> > +1
> >
> > And the per-core counters can be summed when read for statistics, so
> > it still looks like one counter per service. They are only updated
> > per-core to prevent cache thrashing.
> 
> Yes, understood and agree. Per-core counters are preferred, as relaxed
> atomic stores are enough to ensure non-tearing loads. Summation is done
> in the stats-requesting thread before reporting back.
> For backporting, per-core counters are a significant change. Is using
> atomics to fix the misbehaviour a better idea?
Agreed, the change will be simple.

This will result in some perf impact, which can be mitigated by disabling the
stats where possible.

> 
> My question below is still open: is the change below enough to fix the
> *functional* part of MT services?
> 
> Code today uses normal ADD/INCQ instructions (hence the tear/clobber bug
> with MT increments exists):
>    0x000055555649189d <+141>:   add    %rax,0x50(%rbx)
>    0x00005555564918a1 <+145>:   incq   0x48(%rbx)
> 
> For x86, my disassembly for the RELAXED and SEQ_CST orderings is the same,
> using the LOCK prefix ("proper atomics"):
>    0x000055555649189d <+141>:   lock add %rax,0x50(%rbx)
>    0x00005555564918a2 <+146>:   lock incq 0x48(%rbx)
> 
> My understanding is that since two threads can write the same address,
> SEQ_CST is required.
> Can somebody more familiar with C11 confirm that?
We need to use RELAXED. You can think of it as asking: 'do we need a
particular order for memory operations within a single thread?' In this case,
we do not need a particular order for incrementing stats in the single-thread
context; the atomic read-modify-write alone prevents torn or lost updates.

> 
> 
> > > > The code changes themselves are OK. I can send a patch with the fix
> > > > if there's agreement on the approach?
> > > >
> > > >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > > b/lib/eal/common/rte_service.c
> > > > index ef31b1f63c..a07c8fc2d7 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,9 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
> > > >                 uint64_t start = rte_rdtsc();
> > > >                 s->spec.callback(userdata);
> > > >                 uint64_t end = rte_rdtsc();
> > > > -               s->cycles_spent += end - start;
> > > > +               __atomic_fetch_add(&s->cycles_spent, (end-start), __ATOMIC_RELAXED);
> > > > +               __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > > >                 cs->calls_per_service[service_idx]++;
> > > > -               s->calls++;
> > > >         } else
> > > >                 s->spec.callback(userdata);
> > > > }
