On Fri, Jan 30, 2026 at 12:16:43PM +0100, Morten Brørup wrote: > > From: Bruce Richardson [mailto:[email protected]] > > Sent: Friday, 30 January 2026 11.53 > > > > On Fri, Jan 30, 2026 at 10:46:16AM +0000, Morten Brørup wrote: > > > For CPU architectures without strict alignment requirements, > > operations on > > > 6-byte Ethernet addresses using three 2-byte operations were replaced > > by a > > > 4-byte and a 2-byte operation, i.e. two operations instead of three. > > > > > > Comparison functions are pure, so added __rte_pure. > > > > > > Removed superfluous parentheses. (No functional change.) > > > > > > Signed-off-by: Morten Brørup <[email protected]> > > > --- > > > lib/net/rte_ether.h | 19 ++++++++++++++++++- > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > diff --git a/lib/net/rte_ether.h b/lib/net/rte_ether.h > > > index c9a0b536c3..5552d3c1f6 100644 > > > --- a/lib/net/rte_ether.h > > > +++ b/lib/net/rte_ether.h > > > @@ -99,13 +99,19 @@ static_assert(alignof(struct rte_ether_addr) == > > 2, > > > * True (1) if the given two ethernet address are the same; > > > * False (0) otherwise. > > > */ > > > +__rte_pure > > > static inline int rte_is_same_ether_addr(const struct rte_ether_addr > > *ea1, > > > const struct rte_ether_addr *ea2) > > > { > > > +#if !defined(RTE_ARCH_STRICT_ALIGN) > > > + return ((((const unaligned_uint32_t *)ea1)[0] ^ ((const > > unaligned_uint32_t *)ea2)[0]) | > > > + (((const uint16_t *)ea1)[2] ^ ((const uint16_t > > *)ea2)[2])) == 0; > > > +#else > > > const uint16_t *w1 = (const uint16_t *)ea1; > > > const uint16_t *w2 = (const uint16_t *)ea2; > > > > > > return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == > > 0; > > > +#endif > > > } > > > > Is this actually faster? > > It's a simple micro-optimization, so I haven't benchmarked it. > On x86, the compiled function is simplified and reduced in size from 34 to 24 > bytes: > > 00000000004ed650 <review_rte_is_same_ether_addr>: > 4ed650: 0f b7 07 movzwl (%rdi),%eax > 4ed653: 0f b7 57 02 movzwl 0x2(%rdi),%edx > 4ed657: 66 33 06 xor (%rsi),%ax > 4ed65a: 66 33 56 02 xor 0x2(%rsi),%dx > 4ed65e: 09 d0 or %edx,%eax > 4ed660: 0f b7 57 04 movzwl 0x4(%rdi),%edx > 4ed664: 66 33 56 04 xor 0x4(%rsi),%dx > 4ed668: 66 09 d0 or %dx,%ax > 4ed66b: 0f 94 c0 sete %al > 4ed66e: 0f b6 c0 movzbl %al,%eax > 4ed671: c3 ret > 4ed672: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1) > 4ed679: 00 00 00 00 > 4ed67d: 0f 1f 00 nopl (%rax) > > 00000000004ed680 <rte_is_same_ether_addr_improved>: > 4ed680: 0f b7 47 04 movzwl 0x4(%rdi),%eax > 4ed684: 66 33 46 04 xor 0x4(%rsi),%ax > 4ed688: 8b 17 mov (%rdi),%edx > 4ed68a: 33 16 xor (%rsi),%edx > 4ed68c: 0f b7 c0 movzwl %ax,%eax > 4ed68f: 09 c2 or %eax,%edx > 4ed691: 0f 94 c0 sete %al > 4ed694: 0f b6 c0 movzbl %al,%eax > 4ed697: c3 ret > 4ed698: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) > 4ed69f: 00 > > For reference, memcpy() of 6 bytes (compile time constant) also compiles to a > 4-byte and a 2-byte operation, not three 2-byte operations. > What about memcmp? Does it compile similarly? Before we start adding ifdefs like this to the code, I'd like to see some measured performance benefits from it. While the code may be 10 bytes shorter, does that actually translate into a measurable difference in some app?
/Bruce

