> From: Bruce Richardson [mailto:[email protected]]
> Sent: Friday, 30 January 2026 11.53
>
> On Fri, Jan 30, 2026 at 10:46:16AM +0000, Morten Brørup wrote:
> > For CPU architectures without strict alignment requirements, operations on
> > 6-byte Ethernet addresses using three 2-byte operations were replaced by a
> > 4-byte and a 2-byte operation, i.e. two operations instead of three.
> >
> > Comparison functions are pure, so added __rte_pure.
> >
> > Removed superfluous parentheses. (No functional change.)
> >
> > Signed-off-by: Morten Brørup <[email protected]>
> > ---
> >  lib/net/rte_ether.h | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/net/rte_ether.h b/lib/net/rte_ether.h
> > index c9a0b536c3..5552d3c1f6 100644
> > --- a/lib/net/rte_ether.h
> > +++ b/lib/net/rte_ether.h
> > @@ -99,13 +99,19 @@ static_assert(alignof(struct rte_ether_addr) == 2,
> >   * True (1) if the given two ethernet address are the same;
> >   * False (0) otherwise.
> >   */
> > +__rte_pure
> >  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> >  				     const struct rte_ether_addr *ea2)
> >  {
> > +#if !defined(RTE_ARCH_STRICT_ALIGN)
> > +	return ((((const unaligned_uint32_t *)ea1)[0] ^ ((const unaligned_uint32_t *)ea2)[0]) |
> > +			(((const uint16_t *)ea1)[2] ^ ((const uint16_t *)ea2)[2])) == 0;
> > +#else
> >  	const uint16_t *w1 = (const uint16_t *)ea1;
> >  	const uint16_t *w2 = (const uint16_t *)ea2;
> >
> >  	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
> > +#endif
> >  }
>
> Is this actually faster?
It's a simple micro-optimization, so I haven't benchmarked it.

On x86, the compiled function is simplified and reduced in size from 34 to 24 bytes:

00000000004ed650 <review_rte_is_same_ether_addr>:
  4ed650:  0f b7 07                movzwl (%rdi),%eax
  4ed653:  0f b7 57 02             movzwl 0x2(%rdi),%edx
  4ed657:  66 33 06                xor    (%rsi),%ax
  4ed65a:  66 33 56 02             xor    0x2(%rsi),%dx
  4ed65e:  09 d0                   or     %edx,%eax
  4ed660:  0f b7 57 04             movzwl 0x4(%rdi),%edx
  4ed664:  66 33 56 04             xor    0x4(%rsi),%dx
  4ed668:  66 09 d0                or     %dx,%ax
  4ed66b:  0f 94 c0                sete   %al
  4ed66e:  0f b6 c0                movzbl %al,%eax
  4ed671:  c3                      ret
  4ed672:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
  4ed679:  00 00 00 00
  4ed67d:  0f 1f 00                nopl   (%rax)

00000000004ed680 <rte_is_same_ether_addr_improved>:
  4ed680:  0f b7 47 04             movzwl 0x4(%rdi),%eax
  4ed684:  66 33 46 04             xor    0x4(%rsi),%ax
  4ed688:  8b 17                   mov    (%rdi),%edx
  4ed68a:  33 16                   xor    (%rsi),%edx
  4ed68c:  0f b7 c0                movzwl %ax,%eax
  4ed68f:  09 c2                   or     %eax,%edx
  4ed691:  0f 94 c0                sete   %al
  4ed694:  0f b6 c0                movzbl %al,%eax
  4ed697:  c3                      ret
  4ed698:  0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  4ed69f:  00

For reference, memcpy() of 6 bytes (compile-time constant) also compiles to a 4-byte and a 2-byte operation, not three 2-byte operations.

> For architectures that support strict alignment,
> this looks like something that the compilers should be doing using proper
> cost-benefit evaluation based on target architecture, rather than us doing
> it in our code.

I agree with the high-level message in your comment. DPDK contains some manual optimizations from back in the day, and the evolution of compilers has made some of them obsolete.

In this case, GCC doesn't optimize it, so I did it manually. I haven't checked whether other compilers are clever enough to do it.
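
For anyone who wants to reproduce the comparison outside of DPDK, here is a minimal, self-contained sketch of the two variants. The struct and function names are made up for illustration (in DPDK the real struct and function live in rte_ether.h, and the unaligned 32-bit access uses unaligned_uint32_t from rte_common.h); a memcpy() into a local uint32_t stands in for that cast here:

#include <stdint.h>
#include <string.h>

/* Stand-in for struct rte_ether_addr: 6 bytes, 2-byte aligned. */
struct eth_addr {
	uint8_t addr_bytes[6];
} __attribute__((aligned(2)));

/* Current variant: three 2-byte loads per address. */
int
eth_addr_same_3x16(const struct eth_addr *ea1, const struct eth_addr *ea2)
{
	const uint16_t *w1 = (const uint16_t *)ea1;
	const uint16_t *w2 = (const uint16_t *)ea2;

	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
}

/* Patched variant: one 4-byte load plus one 2-byte load per address.
 * memcpy() into a local stands in for DPDK's unaligned_uint32_t cast;
 * with optimization enabled it becomes a single 32-bit load on x86. */
int
eth_addr_same_32_16(const struct eth_addr *ea1, const struct eth_addr *ea2)
{
	uint32_t d1, d2;

	memcpy(&d1, ea1->addr_bytes, sizeof(d1));
	memcpy(&d2, ea2->addr_bytes, sizeof(d2));

	return ((d1 ^ d2) |
		(((const uint16_t *)ea1)[2] ^ ((const uint16_t *)ea2)[2])) == 0;
}

Building this with e.g. "gcc -O3 -c" and disassembling with "objdump -d" should give code along the lines of the two listings above.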

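As a side note on the memcpy() remark: a tiny, hypothetical example (the helper name is made up, not part of the patch) of copying 6 bytes with a compile-time-constant length, which GCC on x86 typically expands inline as one 4-byte and one 2-byte move rather than three 2-byte moves or a library call:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: copy a 6-byte Ethernet
 * address.  With optimization enabled, GCC on x86 typically emits one
 * 4-byte and one 2-byte move here instead of calling memcpy(). */
void
eth_addr_copy6(uint8_t dst[6], const uint8_t src[6])
{
	memcpy(dst, src, 6);
}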
