On Mon, 8 Dec 2025 at 06:24, Kajetan Staszkiewicz <[email protected]> wrote:
>
> On 2025-12-08 00:55, Konstantin Belousov wrote:
>
> > It is somewhat strange that with/without RSS results differ for UDP.
> > The mlx5en driver always enables hashing the packet into an rx queue.
> > And, with a single UDP stream, I would expect all packets to hit the
> > same queue.
>
> With a single UDP stream and RSS disabled, the DUT gets 2 CPU cores
> loaded. One is at 100%; I understand this is where the interrupts for
> incoming packets land, and it handles receiving, forwarding, and sending
> the packet (with direct ISR dispatch). Another is at around 15-20%; my
> best guess is that it's handling interrupts for confirmations of packets
> sent out through the other NIC.
>
> With a single UDP stream and RSS enabled, the DUT gets only 1 CPU core
> loaded. I understand that, thanks to RSS, the outbound queue on mce1 is
> the same as the inbound queue on mce0, and thus the same CPU core handles
> the irq for both queues.
>
> > As a consequence, with/without RSS should be the same (low).
>
> It is low for no RSS, but with RSS it's not just low, it's terrible.
>
> > Were it UDP that encapsulates some other traffic, e.g. a tunnel that
> > can be further classified by the internal headers, like the inner
> > headers of vxlan, then more than one receive queue could be used.
>
> The script stl/udp_1pkt_simple.py (provided with TRex) creates UDP
> packets from port 1025 to port 12, filled with 0x78, length 10 B. My
> goal is to test packets-per-second performance, so I've chosen this
> test as it creates very short packets.
>
> > BTW, mce cards have huge numbers of supported offloads, but all of
> > them are host-oriented; they would not help for the forwarding.
>
> > Again, since an iperf stream would hit a single send/receive queue.
> > Parallel iperfs between the same machines scale.
>
> It seems that parallel streams forwarded through the machine scale too.
> It's a single stream that kills it, and only with option RSS enabled.
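For reference, the stream that script defines is a single fixed UDP flow:
every packet carries the same 4-tuple. A minimal sketch of such a profile,
assuming the stock TRex STL Python API (trex_stl_lib) plus the Scapy
classes it re-exports; the IP addresses here are illustrative, not taken
from the thread:

    from trex_stl_lib.api import *

    class STLS1(object):
        def create_stream(self):
            # One continuous stream: a single UDP flow, sport 1025 ->
            # dport 12, padded with 10 bytes of 'x' (0x78), i.e. a very
            # short packet. Every packet has an identical 4-tuple.
            base_pkt = (Ether()
                        / IP(src="16.0.0.1", dst="48.0.0.1")
                        / UDP(sport=1025, dport=12))
            return STLStream(packet=STLPktBuilder(pkt=base_pkt / (10 * 'x')),
                             mode=STLTXCont())

        def get_streams(self, direction=0, **kwargs):
            return [self.create_stream()]

    # TRex loads the profile through this hook.
    def register():
        return STLS1()

Because the 4-tuple never changes, the RSS hash computed by the receiving
NIC resolves to a single value, i.e. a single rx queue and a single CPU.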
RSS was never really designed for optimising a single flow by having it
consume two CPU cores. It was designed for optimising a /whole lot of
flows/ by directing them to a consistent CPU mapping and, if used in
conjunction with CPU selection for the transmit side, avoiding cross-CPU
locking/synchronisation entirely.

It doesn't help that the RSS defaults (i.e. only one netisr, not hybrid
mode IIRC, etc.) are not the best for lots of flows.

So in short, I think you're testing the wrong thing.

-adrian
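To make the "consistent CPU mapping" concrete: the NIC computes a Toeplitz
hash over the flow's 4-tuple and uses it to pick an rx queue. A toy sketch,
assuming Microsoft's documented default 40-byte RSS key (FreeBSD generates
its own key, and real hardware indexes an indirection table with the low
bits of the hash rather than taking a plain modulo):

    import socket
    import struct

    # The well-known 40-byte example key from Microsoft's RSS documentation.
    RSS_KEY = bytes([
        0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
        0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
        0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
        0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
        0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
    ])

    def toeplitz_hash(key: bytes, data: bytes) -> int:
        """For each set bit of the input, XOR in the 32-bit window of the
        key aligned with that bit position (the standard Toeplitz hash)."""
        key_int = int.from_bytes(key, "big")
        key_bits = len(key) * 8
        result = 0
        for i, byte in enumerate(data):
            for bit in range(8):
                if byte & (0x80 >> bit):
                    shift = key_bits - 32 - (i * 8 + bit)
                    result ^= (key_int >> shift) & 0xFFFFFFFF
        return result

    def rss_queue(src_ip, dst_ip, src_port, dst_port, nqueues=8):
        """Hash the UDP/IPv4 4-tuple and map it onto an rx queue.
        The modulo stands in for the hardware indirection table."""
        data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                + struct.pack(">HH", src_port, dst_port))
        return toeplitz_hash(RSS_KEY, data) % nqueues

    # The single TRex stream repeats one 4-tuple, so the hash -- and with
    # it the rx queue and the CPU -- is the same for every packet:
    print(rss_queue("16.0.0.1", "48.0.0.1", 1025, 12))

Only distinct flows produce distinct hashes, which is why the parallel
streams in the test spread across cores while the single stream cannot.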
