On Tue, 20 Jan 2026 15:40:21 +0000
Konstantin Ananyev <[email protected]> wrote:

> > > From: Stephen Hemminger [mailto:[email protected]]
> > > Sent: Tuesday, 20 January 2026 15.34
> > >
> > > On Tue, 20 Jan 2026 09:49:44 +0100
> > > Morten Brørup <[email protected]> wrote:
> > >  
> > > > > From: Stephen Hemminger [mailto:[email protected]]
> > > > > Sent: Monday, 19 January 2026 23.48
> > > > >
> > > > > On Fri, 16 Jan 2026 10:32:52 +0100
> > > > > Morten Brørup <[email protected]> wrote:
> > > > >  
> > > > > > > From: Stephen Hemminger [mailto:[email protected]]
> > > > > > > Sent: Friday, 16 January 2026 07.46
> > > > > > >
> > > > > > > When building with LTO (Link Time Optimization), GCC performs
> > > > > > > aggressive cross-compilation-unit inlining. This causes the
> > > > > > > compiler to analyze all code paths in
> > > > > > > __rte_ring_do_dequeue_elems(), including the 16-byte element
> > > > > > > path (__rte_ring_dequeue_elems_128), even when the runtime
> > > > > > > element size is only 4 bytes.
> > > > > > >
> > > > > > > The static analyzer sees that the 16-byte path would copy
> > > > > > > 32 elements * 16 bytes = 512 bytes into a 128-byte buffer
> > > > > > > (uint32_t[32]), triggering -Wstringop-overflow warnings.
> > > >
> > > > The element size is not an inline function parameter, but fetched
> > > > from the "esize" field in the rte_soring structure, so the compiler
> > > > cannot see that the element size is 4 bytes. And thus it needs to
> > > > consider all possible element sizes.
> > > >  
> > > > > > >
> > > > > > > The existing #pragma GCC diagnostic suppression in
> > > > > > > rte_ring_elem_pvt.h doesn't help because with LTO the warning
> > > > > > > context shifts to the test file where the inlined code is
> > > > > > > instantiated.
> > > > > > >
> > > > > > > Fix by sizing all buffers passed to soring acquire/dequeue
> > > > > > > functions for the worst-case element size
> > > > > > > (16 bytes = 4 * sizeof(uint32_t)). This satisfies the static
> > > > > > > analyzer without changing runtime behavior.
> > > > > >
> > > > > > Using wildly oversized buffers doesn't seem like a recommendable
> > > > > > solution. If the ring library is ever updated to support
> > > > > > cache-line-size elements (64 bytes), the buffers would have to be
> > > > > > oversized by a factor of 16.
> > > > >
> > > > > The analysis (from AI) is that the compiler is getting confused.
> > > >
> > > > That would be my analysis too.
> > > >  
> > > > > Since there is no good
> > > > > way other than turning off LTO for the test to tell the compiler
> > > >
> > > > There is another way to tell the compiler: __rte_assume()  
> > >
> > > Tried that, but it doesn't work because the assumption doesn't get
> > > propagated deep enough to have an effect here.
> > 
> > Does this fix generally imply that when using LTO, using an SORING with
> > elements smaller than 16 bytes requires oversize buffers?
> > That's not good. :-(
> > 
> > The SORING is still experimental.
> > Maybe the element size and metadata size need to be passed as parameters to
> > the SORING functions, like the RING functions take element size as parameter
> > (except the functions that are hardcoded for using pointers as element 
> > size).  
> 
> Personally, I am not a big fan of that idea...
> I wonder whether it is possible to just disable LTO for soring.o?
> Another thought: if all the problems come from the 128-bit version of
> enqueue/dequeue, would using memcpy() instead of the specific functions
> help to mitigate the problem?
> 
> 

A much simpler and clearer solution is to just get rid of __rte_always_inline
and use inline instead. The compiler still inlines a lot, but it can make its
own decisions.
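
As a rough illustration of the difference (a minimal sketch, not the ring
code itself): the macro and function names below are hypothetical stand-ins
made up for this example, and the always_inline attribute spelling assumes
GCC or Clang. Both helpers do the same copy; only the inlining request
differs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical macros for this sketch; not the DPDK definitions. */
#define demo_always_inline inline __attribute__((always_inline))
#define demo_inline        inline

/* Forced: asks the compiler to expand the body at every call site,
 * regardless of its own inlining heuristics. */
static demo_always_inline uint32_t
demo_copy_forced(void *dst, const void *src, uint32_t esize, uint32_t n)
{
	memcpy(dst, src, (size_t)esize * n);
	return n;
}

/* Hint only: the compiler is free to inline or to emit a call. */
static demo_inline uint32_t
demo_copy_hint(void *dst, const void *src, uint32_t esize, uint32_t n)
{
	memcpy(dst, src, (size_t)esize * n);
	return n;
}
```

The two variants are interchangeable from the caller's point of view, so
switching the attribute should not change results, only code generation.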

The always_inline attribute is not always faster. In fact, in real-world
applications it can make things slower, because real applications get i-cache
misses and lots of inline expansion makes that worse.
