On Mon, 8 Dec 2025 at 06:24, Kajetan Staszkiewicz <[email protected]> wrote:
>
> On 2025-12-08 00:55, Konstantin Belousov wrote:
>
> > It is somewhat strange that with/without RSS results differ for UDP.
> > The mlx5en driver always enables hashing the packet into an rx queue.  And,
> > with a single UDP stream I would expect all packets to hit the same queue.
> With a single UDP stream and RSS disabled the DUT gets 2 CPU cores
> loaded. One runs at 100%; I understand this is where the interrupts for
> incoming packets land, and it handles receiving, forwarding and sending
> the packet (with direct ISR dispatch). The other sits around 15-20%; my
> best guess is that it's handling the completion interrupts for packets
> sent out through the other NIC.
>
> With a single UDP stream and RSS enabled the DUT gets only 1 CPU core
> loaded. I understand that thanks to RSS the outbound queue on mce1 is
> the same as the inbound queue on mce0, and thus the same CPU core
> handles the IRQs for both queues.
>
> > As consequence, with/without RSS should be same (low).
>
> It is low without RSS, but with RSS it's not just low, it's terrible.
>
> > Were it UDP encapsulating some other traffic, e.g. a tunnel that can
> > be further classified by the inner headers, like the inner headers of
> > vxlan, then more than one receive queue could be used.
>
> The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
> packets from port 1025 to port 12, with a 10-byte payload filled with
> 0x78. My goal is to test packets-per-second performance, so I've chosen
> this test as it creates very short packets.
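For reference, and working from memory rather than from the exact file in
your TRex tree, that script boils down to one continuous stream with a
fixed 4-tuple, roughly:

    # sketch of what stl/udp_1pkt_simple.py does; identifiers and addresses
    # are from memory and may differ from the version shipped with TRex
    from trex_stl_lib.api import *

    class STLS1(object):
        def create_stream(self):
            return STLStream(
                packet=STLPktBuilder(
                    # one fixed 4-tuple, 10-byte payload of 'x' (0x78)
                    pkt=Ether()/IP(src="16.0.0.1", dst="48.0.0.1")/
                        UDP(sport=1025, dport=12)/(10 * 'x')
                ),
                mode=STLTXCont())   # transmit continuously

        def get_streams(self, direction=0, **kwargs):
            return [self.create_stream()]

    # dynamic load, used by the TRex console/simulator
    def register():
        return STLS1()

Every packet carries the same addresses and ports, so any hash over the
headers maps the whole test onto a single receive queue.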
>
> > BTW, mce cards have a huge number of supported offloads, but all of
> > them are host-oriented; they would not help with forwarding.
>
> > Again, since a single iperf stream would hit a single send/receive
> > queue. Parallel iperfs between the same machines scale.
>
> It seems that parallel streams forwarded through the machine scale too.
> It's a single stream that kills it, and only with the RSS kernel option
> enabled.

RSS was never really designed for optimising a single flow by having it
consume two CPU cores.

It was designed for optimising a /whole lot of flows/ by directing them
to a consistent CPU mapping and, if used in conjunction with CPU
selection on the transmit side, by avoiding cross-CPU
locking/synchronisation entirely.
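
To make the "consistent CPU mapping" part concrete, here is a rough
sketch of the Toeplitz hash RSS uses to pick a bucket. The key and bucket
width below are made up for illustration, and whether UDP ports are
folded into the hash depends on the configured hash types:

    import socket, struct

    def toeplitz_hash(key: bytes, data: bytes) -> int:
        # Microsoft RSS Toeplitz hash: for every set bit of the input,
        # XOR in the 32-bit window of the key starting at that bit.
        key_int = int.from_bytes(key, "big")
        key_bits = len(key) * 8
        result = 0
        for i, byte in enumerate(data):
            for b in range(8):
                if byte & (0x80 >> b):
                    shift = key_bits - 32 - (i * 8 + b)
                    result ^= (key_int >> shift) & 0xFFFFFFFF
        return result

    def flow_bytes(src, dst, sport, dport):
        # IPv4 4-tuple in network byte order, the usual RSS hash input
        return (socket.inet_aton(src) + socket.inet_aton(dst) +
                struct.pack(">HH", sport, dport))

    key = bytes(range(40))    # stand-in for the 40-byte RSS secret key
    h = toeplitz_hash(key, flow_bytes("16.0.0.1", "48.0.0.1", 1025, 12))
    bucket = h & 0x3f         # e.g. 64 buckets, picked from the low bits

The hash is a pure function of the headers, so every packet of that
single TRex flow lands in the same bucket, hence the same queue and the
same CPU. Lots of flows spread across the buckets; one flow cannot.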

It doesn't help that the RSS defaults (i.e. only one netisr thread, not
hybrid dispatch mode IIRC, etc.) are not the best for lots of flows.
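
For anyone wanting to poke at those defaults, the knobs are roughly the
following. Names are from memory and defaults differ between releases, so
treat this as a sketch rather than a recipe:

    # kernel config: RSS on FreeBSD is a compile-time option
    options RSS

    # /boot/loader.conf -- illustrative values, not recommendations
    net.isr.maxthreads=-1     # one netisr thread per CPU, not a single one
    net.isr.bindthreads=1     # pin netisr threads to their CPUs
    net.isr.dispatch=hybrid   # direct dispatch where safe, otherwise queue
    net.inet.rss.bits=6       # 2^6 = 64 RSS buckets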

So in short, I think you're testing the wrong thing.



-adrian
