<snip>
> > > > In my mind, any LTS/backports get the simplest/highest-confidence
> > > > bugfix: using atomics.
> > > > The atomics are behind the "service stats" feature enable, so the
> > > > impact is only felt when those are enabled.
> > > >
> > > > If there is still a performance hit, and there are *no* MT services
> > > > registered, we could check a static-global flag, and if there are no
> > > > MT services, use the normal adds. Thoughts on such a solution to
> > > > reduce the atomic perf impact only to apps with MT services?
> > > The choice is not between atomic stats and non-atomic stats. The
> > > stats need to be incremented atomically in both cases (MT or no MT
> > > services), as the reader could be another thread. We need to ensure
> > > that the stores and loads are not split.
> > >
> > > Hence, we need atomic operations. We cannot avoid a performance hit here.
> > >
> > > I think the choice is between per-core stats and global stats. IMO,
> > > we should go with per-core stats to be aligned with the norm.
> > +1
> >
> > And the per-core counters can be summed when read for statistics, so
> > it still looks like one counter per service. They are only updated
> > per-core to prevent cache thrashing.
> Yes, understood and agreed. Per-core counters are preferred, as relaxed
> atomic stores are enough to ensure non-tearing loads. Summation is done
> by the stats-requesting thread before reporting back.
> For backporting, per-core counters is a significant change. Is using
> atomics to fix the misbehaviour a better idea?
Agree, the change will be simple.
This will result in some perf impact, which can be mitigated by disabling the
stats where possible.
> My question below is still open: is the below enough to fix the *functional*
> part of MT services?
>
> Code today uses normal ADD/INCQ (and hence the MT increments bug of
> tear/clobber exists):
> 0x000055555649189d <+141>: add %rax,0x50(%rbx)
> 0x00005555564918a1 <+145>: incq 0x48(%rbx)
>
> For x86, my disassembly for the RELAXED and SEQ_CST orderings is the same,
> using the LOCK prefix ("proper atomics"):
> 0x000055555649189d <+141>: lock add %rax,0x50(%rbx)
> 0x00005555564918a2 <+146>: lock incq 0x48(%rbx)
>
> My understanding is that since 2x threads can write the same address,
> SEQ_CST is required.
> Can somebody more familiar with C11 confirm that?
We need to use RELAXED. You can think of it as: "do we need a particular
order for memory operations within a single thread?" In this case, we do not
need a particular order for incrementing the stats in the single-thread
context.
>
> > > > The code changes themselves are OK... I can send a patch with the fix
> > > > if there's agreement on the approach?
> > > >
> > > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> > > > index ef31b1f63c..a07c8fc2d7 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,9 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
> > > >  		uint64_t start = rte_rdtsc();
> > > >  		s->spec.callback(userdata);
> > > >  		uint64_t end = rte_rdtsc();
> > > > -		s->cycles_spent += end - start;
> > > > +		__atomic_fetch_add(&s->cycles_spent, (end - start), __ATOMIC_RELAXED);
> > > > +		__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > > >  		cs->calls_per_service[service_idx]++;
> > > > -		s->calls++;
> > > >  	} else
> > > >  		s->spec.callback(userdata);
> > > >  }