> From: Bruce Richardson [mailto:[email protected]] > Sent: Wednesday, 22 October 2025 16.12 > > On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Brørup wrote: > > > From: Bruce Richardson [mailto:[email protected]] > > > Sent: Wednesday, 22 October 2025 11.08 > > > > > > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote: > > > > Refactored rte_pktmbuf_prefree_seg() for both performance and > > > readability. > > > > > > > > With the optimized RTE_MBUF_DIRECT() macro, the common likely > code > > > path > > > > now fits within one instruction cache line on x86-64 when built > with > > > GCC. > > > > > > > > Signed-off-by: Morten Brørup <[email protected]> > > > > > > Reviewed-by: Bruce Richardson <[email protected]> > > > > > > > [...] > > > > > > #define RTE_MBUF_DIRECT(mb) \ > > > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | > RTE_MBUF_F_EXTERNAL))) > > > > > > > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86) > > > > +/* Optimization for code size. > > > > + * GCC only optimizes single-bit MSB tests this way, so we do it > by > > > hand with multi-bit. > > > > + * > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are > both in > > > the MSB of the > > > > + * 64-bit ol_flags field, so we only compare this one byte > instead > > > of all 64 bits. > > > > + * On little endian architecture, the MSB of a 64-bit integer is > at > > > byte offest 7. > > > > + * > > > > + * Note: Tested using GCC version 16.0.0 20251019 > (experimental). > > > > + * > > > > + * Without this optimization, GCC generates 17 bytes of > > > instructions: > > > > + * movabs rax,0x6000000000000000 // 10 bytes > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes > > > > + * sete al // 3 bytes > > > > + * With this optimization, GCC generates only 7 bytes of > > > instructions: > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes > > > > + * sete al // 3 bytes > > > > + */ > > > > +#undef RTE_MBUF_DIRECT > > > > +#define RTE_MBUF_DIRECT(mb) \ > > > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, > ol_flags) + > > > 7] & \ > > > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 > * > > > 8)))) > > > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 > * > > > 8)) << (7 * 8) == > > > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL), > > > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in > > > MSB."); > > > > +#endif > > > > + > > > Couple of comments/thoughts/questions here. > > > > > > * This looks like a compiler limitation that should be fixed in > GCC. IF > > > we > > > put this optimization in, how will we know when/if we can remove > it > > > again > > > in future? I'm not sure we want this hanging around forever. > > > > Agree. > > There are plenty of hand crafted optimizations in DPDK, which are > already obsolete; > > it seems no one has found a good way of identifying them. Including > myself. > > > > > * Can the static_assert - which just checks flags are in the MSB - > be > > > * simplified to e.g. > > > "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0" > > > or "__builtin_ctzll(...) > (7 * CHAR_BIT)" > > > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard- > coded > > > 8. > > > > In v3, I have simplified both the static_assert and the optimized > macro as you suggested on Slack, > > with some minor improvements. > > > > > * Is it necessary to limit this to just GCC and x86? If it leads to > the > > > best code on x86, why not include for all compilers? What about > non- > > > x86 > > > LE platforms? > > > > I had already tested ARM64, where it didn't make a difference; now I > have added a note about it. > > I also tested ARM32, which doesn't benefit either, but I didn't add a > note about it. > > I also tested Loongarch (on Godbolt), which does benefit from it, so > I added it. > > > > Now, as I'm writing this email, Godbolt shows that RISC-V and POWER > could also benefit. > > Maybe we should just replace the standard macro with the optimized > macro. WDYT? > > > I think that's not a bad idea. At least everything would be consistent.
I'll post a v4. Then we can go back to v3 if it looks too weird. > > /Bruce

