> From: Stephen Hemminger [mailto:[email protected]]
> Sent: Tuesday, 20 January 2026 20.27
> 
> On Tue, 20 Jan 2026 15:40:21 +0000
> Konstantin Ananyev <[email protected]> wrote:
> 
> > > > From: Stephen Hemminger [mailto:[email protected]]
> > > > Sent: Tuesday, 20 January 2026 15.34
> > > >
> > > > On Tue, 20 Jan 2026 09:49:44 +0100
> > > > Morten Brørup <[email protected]> wrote:
> > > >
> > > > > > From: Stephen Hemminger [mailto:[email protected]]
> > > > > > Sent: Monday, 19 January 2026 23.48
> > > > > >
> > > > > > On Fri, 16 Jan 2026 10:32:52 +0100
> > > > > > Morten Brørup <[email protected]> wrote:
> > > > > >
> > > > > > > > From: Stephen Hemminger
> [mailto:[email protected]]
> > > > > > > > Sent: Friday, 16 January 2026 07.46
> > > > > > > >
> > > > > > > > When building with LTO (Link Time Optimization), GCC
> performs
> > > > > > > > aggressive cross-compilation-unit inlining. This causes
> the
> > > > > > compiler
> > > > > > > > to analyze all code paths in
> __rte_ring_do_dequeue_elems(),
> > > > > > including
> > > > > > > > the 16-byte element path (__rte_ring_dequeue_elems_128),
> even
> > > > when
> > > > > > > > the runtime element size is only 4 bytes.
> > > > > > > >
> > > > > > > > The static analyzer sees that the 16-byte path would copy
> > > > > > > > 32 elements * 16 bytes = 512 bytes into a 128-byte buffer
> > > > > > > > (uint32_t[32]),
> > > > > > > > triggering -Wstringop-overflow warnings.
> > > > >
> > > > > The element size is not an inline function parameter, but
> fetched
> > > > from the "esize" field in the rte_soring structure, so the
> compiler
> > > > cannot see that the element size is 4 bytes. And thus it needs to
> > > > consider all possible element sizes.
> > > > >
> > > > > > > >
> > > > > > > > The existing #pragma GCC diagnostic suppression in
> > > > > > rte_ring_elem_pvt.h
> > > > > > > > doesn't help because with LTO the warning context shifts
> to the
> > > > > > test
> > > > > > > > file where the inlined code is instantiated.
> > > > > > > >
> > > > > > > > Fix by sizing all buffers passed to soring
> acquire/dequeue
> > > > > > functions
> > > > > > > > for the worst-case element size (16 bytes = 4 *
> > > > sizeof(uint32_t)).
> > > > > > > > This satisfies the static analyzer without changing
> runtime
> > > > > > behavior.
> > > > > > >
> > > > > > > Using wildly oversized buffers doesn't seem like a
> recommendable
> > > > > > solution.
> > > > > > > If the ring library is ever updated to support cache size
> > > > elements
> > > > > > (64 byte), the buffers would have to be oversize by factor
> 16.
> > > > > >
> > > > > > The analysis (from AI) is that compiler is getting confused.
> > > > >
> > > > > That would be my analysis too.
> > > > >
> > > > > > Since there is no good
> > > > > > way other than turning of LTO for the test to tell the
> compiler
> > > > >
> > > > > There is another way to tell the compiler: __rte_assume()
> > > >
> > > > Tried that but it doesn't work because doesn't get propagated
> deep
> > > > enough to impact here.
> > >
> > > Does this fix generally imply that when using LTO, using an SORING
> with elements
> > > smaller than 16 bytes requires oversize buffers?
> > > That's not good. :-(
> > >
> > > The SORING is still experimental.
> > > Maybe the element size and metadata size need to be passed as
> parameters to
> > > the SORING functions, like the RING functions take element size as
> parameter
> > > (except the functions that are hardcoded for using pointers as
> element size).
> >
> > Personally, I am not a big fan of such idea...
> > Wonder is that possible just to disable LTO for soring.o?
> > Another thought - if all the problems come from 128 bit version of
> enque/dequeue,
> > would using memcpy() instead  of specific functions help to mitigate
> the problem?
> >
> >
> 
> A much simpler and clear solution is to just get rid of
> __rte_always_inline
> and use inline instead. The compiler still inlines a lot but it can
> make its
> own decision.
> 
> The attribute always_inline is not always faster, in fact in real world
> applications it can make things slower because real applications get i-
> cache
> misses and lots of inline expansion makes it worse.

Here's another (untested) idea...

Maybe it would help to move the inlining around.
The SORING way of inlining differs from the RING way of inlining.

In RING, inlining starts at the outermost layer.

*rte_ring.h:*
static __rte_always_inline unsigned int
rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
                unsigned int *available)
{
        return rte_ring_dequeue_bulk_elem(r, obj_table, sizeof(void *),
                        n, available);
}

In SORING, the outermost layer is a normal function.

*rte_soring.h:*
__rte_experimental
uint32_t
rte_soring_dequeue_bulk(struct rte_soring *r, void *objs,
        uint32_t num, uint32_t *available);

*soring.c:*
RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_soring_dequeue_bulk, 25.03)
uint32_t
rte_soring_dequeue_bulk(struct rte_soring *r, void *objs, uint32_t num,
        uint32_t *available)
{
        return soring_dequeue(r, objs, NULL, num, RTE_RING_QUEUE_FIXED,
                        available);
}

Note that for RING, the element size is determined at the outermost layer and 
passed as a parameter to the underlying functions.
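
To illustrate the RING pattern, here is a minimal standalone sketch. It is not
DPDK code: the toy_ring structure and the toy_* functions are made up for this
example. The point is only that the always-inline wrapper hands sizeof(void *)
to the inner layer, so after inlining the compiler sees a constant element size
and can discard the branches for other sizes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy ring, not the DPDK rte_ring structure. */
struct toy_ring {
	uint32_t head;      /* next slot to read */
	uint32_t mask;      /* slot count - 1; slot count is a power of two */
	uint8_t slots[256]; /* raw element storage */
};

/* Inner layer: the element size arrives as a parameter. */
static inline __attribute__((always_inline)) unsigned int
toy_dequeue_elem(struct toy_ring *r, void *obj_table, unsigned int esize,
		unsigned int n)
{
	uint8_t *dst = obj_table;

	for (unsigned int i = 0; i < n; i++) {
		uint32_t idx = (r->head + i) & r->mask;

		if (esize == 16)        /* 16-byte element path */
			memcpy(dst + 16 * i, &r->slots[16 * idx], 16);
		else if (esize == 8)    /* 8-byte element path */
			memcpy(dst + 8 * i, &r->slots[8 * idx], 8);
		else                    /* generic path */
			memcpy(dst + (size_t)esize * i,
				&r->slots[(size_t)esize * idx], esize);
	}
	r->head += n;
	return n;
}

/* Outermost layer: fixes the element size as a build-time constant. */
static inline __attribute__((always_inline)) unsigned int
toy_dequeue_bulk(struct toy_ring *r, void **obj_table, unsigned int n)
{
	return toy_dequeue_elem(r, obj_table, sizeof(void *), n);
}
```

Because esize is a constant at every call site of toy_dequeue_bulk(), the
16-byte branch is dead code there and need not be analyzed against the
caller's buffer.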

I'm not sure that modifying SORING to the same style of inlining solves 
anything. The element size is a build-time constant for RING, but SORING would 
have to pass r->esize as the element size parameter, and that is not a 
build-time constant.
With such a change, my suggestion of adding __rte_assume(r->esize == 
sizeof(uint32_t)) in the application might work.
In my experience, __rte_assume() needs to be very close to the code using it, 
or it has no effect.
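
A rough sketch of what I mean, again with made-up names (toy_soring,
toy_soring_dequeue, app_dequeue_u32) and with TOY_ASSUME as a stand-in for
__rte_assume(), not its actual definition. It assumes GCC or Clang for
__builtin_unreachable(). Whether GCC really drops the 16-byte path under LTO
with this layout is untested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for __rte_assume(); not the DPDK definition. */
#define TOY_ASSUME(cond) \
	do { if (!(cond)) __builtin_unreachable(); } while (0)

/* Hypothetical toy structure, not the DPDK rte_soring structure. */
struct toy_soring {
	uint32_t esize;     /* element size, set at creation time */
	uint32_t head;      /* next slot to read */
	uint32_t mask;      /* slot count - 1; slot count is a power of two */
	uint8_t slots[512]; /* raw element storage */
};

/* Inlined inner layer taking the element size as a parameter. */
static inline __attribute__((always_inline)) uint32_t
toy_soring_dequeue(struct toy_soring *r, void *objs, uint32_t esize,
		uint32_t num)
{
	uint8_t *dst = objs;

	for (uint32_t i = 0; i < num; i++) {
		uint32_t idx = (r->head + i) & r->mask;

		if (esize == 16)        /* 16-byte element path */
			memcpy(dst + 16 * i, &r->slots[16 * idx], 16);
		else                    /* generic path */
			memcpy(dst + (size_t)esize * i,
				&r->slots[(size_t)esize * idx], esize);
	}
	r->head += num;
	return num;
}

/* Application side: the hint sits directly next to the inlined call. */
static inline uint32_t
app_dequeue_u32(struct toy_soring *r, uint32_t *objs, uint32_t num)
{
	/* Undefined behavior if the ring was created with another esize. */
	TOY_ASSUME(r->esize == sizeof(uint32_t));
	return toy_soring_dequeue(r, objs, r->esize, num);
}
```

The hint and the size-dependent branches end up in the same function after
inlining, which is what gives the compiler a chance to use it.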

Moving SORING to this style of inlining would be an ABI change, but not an API 
change.
