> From: Bruce Richardson [mailto:[email protected]]
> Sent: Friday, 30 January 2026 15.03
>
> On Fri, Jan 30, 2026 at 02:54:52PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:[email protected]]
> > > Sent: Friday, 30 January 2026 12.27
> > >
> > > On Fri, Jan 30, 2026 at 12:16:43PM +0100, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:[email protected]]
> > > > > Sent: Friday, 30 January 2026 11.53
> > > > >
> > > > > On Fri, Jan 30, 2026 at 10:46:16AM +0000, Morten Brørup wrote:
> > > > > > For CPU architectures without strict alignment requirements, operations on
> > > > > > 6-byte Ethernet addresses using three 2-byte operations were replaced by a
> > > > > > 4-byte and a 2-byte operation, i.e. two operations instead of three.
> > > > > >
> > > > > > Comparison functions are pure, so added __rte_pure.
> > > > > >
> > > > > > Removed superfluous parentheses. (No functional change.)
> > > > > >
> > > > > > Signed-off-by: Morten Brørup <[email protected]>
> > > > > > ---
> > > > > >  lib/net/rte_ether.h | 19 ++++++++++++++++++-
> > > > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/lib/net/rte_ether.h b/lib/net/rte_ether.h
> > > > > > index c9a0b536c3..5552d3c1f6 100644
> > > > > > --- a/lib/net/rte_ether.h
> > > > > > +++ b/lib/net/rte_ether.h
> > > > > > @@ -99,13 +99,19 @@ static_assert(alignof(struct rte_ether_addr) == 2,
> > > > > >   *   True  (1) if the given two ethernet address are the same;
> > > > > >   *   False (0) otherwise.
> > > > > >   */
> > > > > > +__rte_pure
> > > > > >  static inline int rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
> > > > > >  		const struct rte_ether_addr *ea2)
> > > > > >  {
> > > > > > +#if !defined(RTE_ARCH_STRICT_ALIGN)
> > > > > > +	return ((((const unaligned_uint32_t *)ea1)[0] ^ ((const unaligned_uint32_t *)ea2)[0]) |
> > > > > > +			(((const uint16_t *)ea1)[2] ^ ((const uint16_t *)ea2)[2])) == 0;
> > > > > > +#else
> > > > > >  	const uint16_t *w1 = (const uint16_t *)ea1;
> > > > > >  	const uint16_t *w2 = (const uint16_t *)ea2;
> > > > > >
> > > > > >  	return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^ w2[2])) == 0;
> > > > > > +#endif
> > > > > >  }
> > > > >
> > > > > Is this actually faster?
> > > >
> > > > It's a simple micro-optimization, so I haven't benchmarked it.
> > > > On x86, the compiled function is simplified and reduced in size from 34 to 24 bytes:
> > > >
> > > > 00000000004ed650 <review_rte_is_same_ether_addr>:
> > > >   4ed650:  0f b7 07              movzwl (%rdi),%eax
> > > >   4ed653:  0f b7 57 02           movzwl 0x2(%rdi),%edx
> > > >   4ed657:  66 33 06              xor    (%rsi),%ax
> > > >   4ed65a:  66 33 56 02           xor    0x2(%rsi),%dx
> > > >   4ed65e:  09 d0                 or     %edx,%eax
> > > >   4ed660:  0f b7 57 04           movzwl 0x4(%rdi),%edx
> > > >   4ed664:  66 33 56 04           xor    0x4(%rsi),%dx
> > > >   4ed668:  66 09 d0              or     %dx,%ax
> > > >   4ed66b:  0f 94 c0              sete   %al
> > > >   4ed66e:  0f b6 c0              movzbl %al,%eax
> > > >   4ed671:  c3                    ret
> > > >   4ed672:  66 66 2e 0f 1f 84 00  data16 cs nopw 0x0(%rax,%rax,1)
> > > >   4ed679:  00 00 00 00
> > > >   4ed67d:  0f 1f 00              nopl   (%rax)
> > > >
> > > > 00000000004ed680 <rte_is_same_ether_addr_improved>:
> > > >   4ed680:  0f b7 47 04           movzwl 0x4(%rdi),%eax
> > > >   4ed684:  66 33 46 04           xor    0x4(%rsi),%ax
> > > >   4ed688:  8b 17                 mov    (%rdi),%edx
> > > >   4ed68a:  33 16                 xor    (%rsi),%edx
> > > >   4ed68c:  0f b7 c0              movzwl %ax,%eax
> > > >   4ed68f:  09 c2                 or     %eax,%edx
> > > >   4ed691:  0f 94 c0              sete   %al
> > > >   4ed694:  0f b6 c0              movzbl %al,%eax
> > > >   4ed697:  c3                    ret
> > > >   4ed698:  0f 1f 84 00 00 00 00  nopl   0x0(%rax,%rax,1)
> > > >   4ed69f:  00
> > > >
> > > > For reference, memcpy() of 6 bytes (a compile-time constant) also compiles to a 4-byte and a 2-byte operation, not three 2-byte operations.
> > >
> > > What about memcmp? Does it compile similarly?
> >
> > memcmp(a,b,6) on Clang compiles into something very similar.
> > memcmp(a,b,6) on GCC compiles into something with a branch after the first 4-byte comparison, with the assumption (regarding static branch prediction) that they are likely to differ.
> > I guess GCC's counterproductive behavior was the reason for originally implementing a manual comparison instead of simply using memcmp().
> >
> > BTW, GCC is clever enough to compile 8-byte and 16-byte comparisons into code without branches.
> > I guess that's why rte_ipv6_addr_eq() is implemented using memcmp() [1].
> >
> > [1]: https://elixir.bootlin.com/dpdk/v25.11/source/lib/net/rte_ip6.h#L68
> >
> > > Before we start adding ifdefs like this to the code, I'd like to see some measured performance benefits from it. While the code may be 10 bytes shorter, does that actually translate into a measurable difference in some app?
> >
> > Excellent question!
> > Some quick rudimentary testing shows that it seems to be ~4 cycles slower than what it's replacing.
> > Reality beats expectations.
> >
> > I'll drop this patch.
>
> If you have the test-case already prepared, can you also check what memcmp() performs like? Replacing the whole function by memcmp() and punting the optimization to the compiler would be a nice, though small, code improvement.
Good thing you asked!

While setting up the test for memcmp(), I noticed that I had been benchmarking my improved function without "inline". With "inline" (like the original), it is ~1 cycle faster than the original, so I have restored the patch status to "New".

The memcmp() variant (this time not forgetting "inline") performs very close to the original.
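For reference, the memcmp() variant I benchmarked is essentially the sketch below; the function name is only a placeholder for the test, not a proposed API change:

#include <string.h>

#include <rte_ether.h>

/* Sketch: compare all 6 bytes and let the compiler expand the fixed-size memcmp(). */
__rte_pure
static inline int rte_is_same_ether_addr_memcmp(const struct rte_ether_addr *ea1,
		const struct rte_ether_addr *ea2)
{
	return memcmp(ea1, ea2, RTE_ETHER_ADDR_LEN) == 0;
}

Semantics match rte_is_same_ether_addr(): 1 if the two addresses are equal, 0 otherwise.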
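And the "quick rudimentary testing" is nothing more elaborate than a TSC-based loop along the lines of the sketch below; the function and variable names are placeholders, not the actual test code:

#include <stdint.h>
#include <stdio.h>

#include <rte_cycles.h>
#include <rte_ether.h>

/* Hypothetical measurement loop, illustrating the idea only. */
static void bench_is_same_ether_addr(const struct rte_ether_addr *a,
		const struct rte_ether_addr *b, unsigned int iterations)
{
	volatile int sink = 0;	/* prevent the compiler from optimizing the loop away */
	uint64_t start, end;
	unsigned int i;

	start = rte_rdtsc_precise();
	for (i = 0; i < iterations; i++)
		sink += rte_is_same_ether_addr(a, b);
	end = rte_rdtsc_precise();

	printf("%.1f cycles/call (sink=%d)\n", (double)(end - start) / iterations, sink);
}

The absolute numbers from such a loop obviously depend on CPU and compiler; the ~4 cycles and ~1 cycle deltas mentioned above are from my quick tests only.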

