> From: Bruce Richardson [mailto:[email protected]]
> Sent: Friday, 30 January 2026 15.33
> 
> On Fri, Jan 30, 2026 at 03:25:34PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:[email protected]]
> > > Sent: Friday, 30 January 2026 15.03
> > >
> > > On Fri, Jan 30, 2026 at 02:54:52PM +0100, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:[email protected]]
> > > > > Sent: Friday, 30 January 2026 12.27
> > > > >
> > > > > On Fri, Jan 30, 2026 at 12:16:43PM +0100, Morten Brørup wrote:
> > > > > > > From: Bruce Richardson [mailto:[email protected]]
> > > > > > > Sent: Friday, 30 January 2026 11.53
> > > > > > >
> > > > > > > > On Fri, Jan 30, 2026 at 10:46:16AM +0000, Morten Brørup wrote:
> > > > > > > > > For CPU architectures without strict alignment requirements,
> > > > > > > > > operations on 6-byte Ethernet addresses using three 2-byte
> > > > > > > > > operations were replaced by a 4-byte and a 2-byte operation,
> > > > > > > > > i.e. two operations instead of three.
> > > > > > > > >
> > > > > > > > > Comparison functions are pure, so added __rte_pure.
> > > > > > > > >
> > > > > > > > > Removed superfluous parentheses. (No functional change.)
> > > > > > > > >
> > > > > > > > > Signed-off-by: Morten Brørup <[email protected]>
> > > > > > > > ---
> > > > > > > >  lib/net/rte_ether.h | 19 ++++++++++++++++++-
> > > > > > > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/net/rte_ether.h b/lib/net/rte_ether.h
> > > > > > > > index c9a0b536c3..5552d3c1f6 100644
> > > > > > > > --- a/lib/net/rte_ether.h
> > > > > > > > +++ b/lib/net/rte_ether.h
> > > > > > > > @@ -99,13 +99,19 @@ static_assert(alignof(struct
> > > rte_ether_addr)
> > > > > ==
> > > > > > > 2,
> > > > > > > >   *  True  (1) if the given two ethernet address are the
> > > same;
> > > > > > > >   *  False (0) otherwise.
> > > > > > > >   */
> > > > > > > > +__rte_pure
> > > > > > > >  static inline int rte_is_same_ether_addr(const struct
> > > > > rte_ether_addr
> > > > > > > *ea1,
> > > > > > > >                                      const struct rte_ether_addr
> *ea2)
> > > > > > > >  {
> > > > > > > > +#if !defined(RTE_ARCH_STRICT_ALIGN)
> > > > > > > > +       return ((((const unaligned_uint32_t *)ea1)[0] ^
> ((const
> > > > > > > unaligned_uint32_t *)ea2)[0]) |
> > > > > > > > +                       (((const uint16_t *)ea1)[2] ^ ((const
> uint16_t
> > > > > > > *)ea2)[2])) == 0;
> > > > > > > > +#else
> > > > > > > >         const uint16_t *w1 = (const uint16_t *)ea1;
> > > > > > > >         const uint16_t *w2 = (const uint16_t *)ea2;
> > > > > > > >
> > > > > > > >         return ((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) | (w1[2] ^
> > > > > w2[2])) ==
> > > > > > > 0;
> > > > > > > > +#endif
> > > > > > > >  }
> > > > > > >
> > > > > > > Is this actually faster?
> > > > > >
> > > > > > It's a simple micro-optimization, so I haven't benchmarked it.
> > > > > > On x86, the compiled function is simplified and reduced in size
> > > > > > from 34 to 24 bytes:
> > > > > >
> > > > > > 00000000004ed650 <review_rte_is_same_ether_addr>:
> > > > > >   4ed650:   0f b7 07                movzwl (%rdi),%eax
> > > > > >   4ed653:   0f b7 57 02             movzwl 0x2(%rdi),%edx
> > > > > >   4ed657:   66 33 06                xor    (%rsi),%ax
> > > > > >   4ed65a:   66 33 56 02             xor    0x2(%rsi),%dx
> > > > > >   4ed65e:   09 d0                   or     %edx,%eax
> > > > > >   4ed660:   0f b7 57 04             movzwl 0x4(%rdi),%edx
> > > > > >   4ed664:   66 33 56 04             xor    0x4(%rsi),%dx
> > > > > >   4ed668:   66 09 d0                or     %dx,%ax
> > > > > >   4ed66b:   0f 94 c0                sete   %al
> > > > > >   4ed66e:   0f b6 c0                movzbl %al,%eax
> > > > > >   4ed671:   c3                      ret
> > > > > >   4ed672:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> > > > > >   4ed679:   00 00 00 00
> > > > > >   4ed67d:   0f 1f 00                nopl   (%rax)
> > > > > >
> > > > > > 00000000004ed680 <rte_is_same_ether_addr_improved>:
> > > > > >   4ed680:   0f b7 47 04             movzwl 0x4(%rdi),%eax
> > > > > >   4ed684:   66 33 46 04             xor    0x4(%rsi),%ax
> > > > > >   4ed688:   8b 17                   mov    (%rdi),%edx
> > > > > >   4ed68a:   33 16                   xor    (%rsi),%edx
> > > > > >   4ed68c:   0f b7 c0                movzwl %ax,%eax
> > > > > >   4ed68f:   09 c2                   or     %eax,%edx
> > > > > >   4ed691:   0f 94 c0                sete   %al
> > > > > >   4ed694:   0f b6 c0                movzbl %al,%eax
> > > > > >   4ed697:   c3                      ret
> > > > > >   4ed698:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> > > > > >   4ed69f:   00
> > > > > >
> > > > > > For reference, memcpy() of 6 bytes (compile-time constant) also
> > > > > > compiles to a 4-byte and a 2-byte operation, not three 2-byte
> > > > > > operations.
> > > > > >
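> > > > > > For illustration (hypothetical helper, not actual DPDK code):
> > > > > >
> > > > > >	#include <string.h>
> > > > > >
> > > > > >	/* fixed-size 6-byte copy: GCC and Clang emit one 4-byte
> > > > > >	 * and one 2-byte move, not three 2-byte moves */
> > > > > >	static inline void
> > > > > >	copy6(void *dst, const void *src)
> > > > > >	{
> > > > > >		memcpy(dst, src, 6);
> > > > > >	}
> > > > > >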
> > > > > What about memcmp? Does it compile similarly?
> > > >
> > > > memcmp(a,b,6) on Clang compiles into something very similar.
> > > > memcmp(a,b,6) on GCC compiles into something with a branch after the
> > > > first 4-byte comparison, with the assumption (regarding static
> > > > branch prediction) that they are likely to differ.
> > > > I guess GCC's counterproductive behavior was the reason for
> > > > originally implementing a manual comparison, instead of simply
> > > > using memcmp().
> > > >
> > > > BTW, GCC is clever enough to compile 8-byte and 16-byte comparisons
> > > > into code without branches.
> > > > I guess that's why rte_ipv6_addr_eq() is implemented using
> > > > memcpy() [1].
> > > >
> > > > [1]: https://elixir.bootlin.com/dpdk/v25.11/source/lib/net/rte_ip6.h#L68
> > > >
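> > > > The branch-free pattern there is roughly this (a sketch, not the
> > > > exact rte_ipv6_addr_eq() code; eq16 is a hypothetical name):
> > > >
> > > >	#include <stdint.h>
> > > >	#include <string.h>
> > > >
> > > >	static inline int
> > > >	eq16(const void *a, const void *b)
> > > >	{
> > > >		uint64_t x0, x1, y0, y1;
> > > >
> > > >		/* memcpy() sidesteps alignment and aliasing issues;
> > > >		 * the compiler turns these into plain 8-byte loads */
> > > >		memcpy(&x0, a, 8);
> > > >		memcpy(&x1, (const uint8_t *)a + 8, 8);
> > > >		memcpy(&y0, b, 8);
> > > >		memcpy(&y1, (const uint8_t *)b + 8, 8);
> > > >
> > > >		/* OR the XOR differences, compare once: no branches */
> > > >		return ((x0 ^ y0) | (x1 ^ y1)) == 0;
> > > >	}
> > > >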
> > > > > Before we start adding ifdefs like this to the code, I'd like to
> > > > > see some measured performance benefits from it. While the code
> > > > > may be 10 bytes shorter, does that actually translate into a
> > > > > measurable difference in some app?
> > > >
> > > > Excellent question!
> > > > Some quick rudimentary testing shows that it seems to be ~4 cycles
> > > > slower than what it's replacing.
> > > > Reality beats expectations.
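> > > >
> > > > For reference, the rudimentary measurement was along these lines (a
> > > > sketch, not the actual test code; N, addrs1 and addrs2 are
> > > > hypothetical):
> > > >
> > > >	volatile int sink; /* keeps the calls from being optimized away */
> > > >	uint64_t start = rte_rdtsc(); /* TSC counter, rte_cycles.h */
> > > >
> > > >	for (unsigned int i = 0; i < N; i++)
> > > >		sink = rte_is_same_ether_addr(&addrs1[i], &addrs2[i]);
> > > >
> > > >	printf("%.1f cycles/call\n", (double)(rte_rdtsc() - start) / N);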
> > > >
> > > > I'll drop this patch.
> > > >
> > > If you have the test case already prepared, can you also check how
> > > memcmp() performs? Replacing the whole function with memcmp() and
> > > punting the optimization to the compiler would be a nice, though
> > > small, code improvement.
> >
> > Good that you asked!
> >
> > While setting up the test for memcmp(), I noticed that I had been
> > testing my improved function without "inline".
> > With "inline" (like the original), it's ~1 cycle faster than the
> > original. I have restored the patch status to "New".
> >
> > The memcmp() test (not forgetting "inline") performs very close to
> > the original.
> >
> 
> If memcmp() performs like the original, I'd be tempted to forgo the
> 1-cycle benefit just to have the shortest, simplest code.

Yes, I tend to agree.
A modern C compiler should know how to compile memcmp(a,b,6) into something 
efficient on the architecture it is compiling for.
E.g. Clang would do it the way I proposed in this patch.

For GCC, the cost of the branch was probably eliminated by the branch predictor 
when running my test.
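
For reference, the memcmp() variant under discussion would look something
like this (a sketch, not what the patch currently contains):

	#include <string.h>

	__rte_pure
	static inline int
	rte_is_same_ether_addr(const struct rte_ether_addr *ea1,
			const struct rte_ether_addr *ea2)
	{
		/* fixed-size comparison; the compiler expands it inline */
		return memcmp(ea1, ea2, RTE_ETHER_ADDR_LEN) == 0;
	}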
