On Fri, Mar 20, 2026 at 05:29:08PM -0700, Jakub Kicinski wrote:
> On Fri, 20 Mar 2026 11:37:36 -0700 Dipayaan Roy wrote:
> > On Sat, Mar 14, 2026 at 12:50:53PM -0700, Jakub Kicinski wrote:
> > > On Tue, 10 Mar 2026 21:00:49 -0700 Dipayaan Roy wrote:
> > > > On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
> > > > fragments for RX buffers results in a significant throughput
> > > > regression. Profiling reveals that this regression correlates with
> > > > high overhead in the fragment allocation and reference counting
> > > > paths on these specific platforms, rendering the
> > > > multi-buffer-per-page strategy counterproductive.
> > >
> > > Can you say more? We could technically take two references on the page
> > > right away if MTU is small and avoid some of the cost.
> >
> > There is a 15-20% shortfall in achieving line rate for MANA (180+ Gbps)
> > on a particular ARM64 SKU. The issue is specific to this processor SKU
> > and is not seen on other ARM64 SKUs (e.g., GB200) or x86 SKUs.
> > Critically, the regression only manifests beyond 16 TCP connections,
> > which strongly indicates it appears under high contention and traffic.
> >
> > no. of       | rx buf backed       | rx buf backed
> > connections  | with page fragments | with full page
> > -------------+---------------------+---------------
> > 4            | 139 Gbps            | 138 Gbps
> > 8            | 140 Gbps            | 162 Gbps
> > 16           | 186 Gbps            | 186 Gbps
>
> These results look a bit odd: 4 and 16 streams have the same perf in
> both columns, while all other cases indeed show a delta. What I was
> hoping for was a more precise attribution of the performance issue,
> like perf top showing that it's indeed the atomic ops on the refcount
> that stall.
>
> > 32           | 136 Gbps            | 183 Gbps
> > 48           | 159 Gbps            | 185 Gbps
> > 64           | 165 Gbps            | 184 Gbps
> > 128          | 170 Gbps            | 180 Gbps
> >
> > The HW team is still working to root-cause (RCA) this HW behaviour.
> > Regarding "We could technically take two references on the page right
> > away", are you suggesting moving the page reference counting logic
> > into the driver instead of relying on page pool?
>
> Yes, either that or adjust the page pool APIs.
> page_pool_alloc_frag_netmem() currently sets the refcount to BIAS
> which it then has to subtract later. So we get:
>
>   set(BIAS)
>   .. driver allocates chunks ..
>   sub(BIAS_MAX - pool->frag_users)
>
> Instead of using BIAS we could make the page pool guess that the caller
> will keep asking for the same frame size. So initially take
> (PAGE_SIZE/size) references.

Ok, I will be doing some experimentation with this approach to see if it
helps in the current scenario.
> > > The driver doesn't seem to set skb->truesize accordingly after this
> > > change. So you're lying to the stack about how much memory each
> > > packet consumes. This is a blocker for the change.
> >
> > ACK. I will send out a separate patch with a Fixes tag to fix the skb
> > truesize.
> >
> > > > To mitigate this, bypass the page_pool fragment path and force a
> > > > single RX packet per page allocation when all the following
> > > > conditions are met:
> > > > 1. The system is configured with a 4K PAGE_SIZE.
> > > > 2. A processor-specific quirk is detected via SMBIOS Type 4 data.
> > >
> > > I don't think we want the kernel to be in the business of carrying
> > > matching on platform names and providing optimal config by default.
> > > This sort of logic needs to live in user space or the hypervisor
> > > (which can then pass a single bit to the driver to enable the
> > > behavior)
> >
> > As per our internal discussion, the hypervisor cannot provide the CPU
> > version info (in VMs as well as in bare-metal offerings).
>
> Why? I suppose it's much more effort for you but it's much more effort
> for the community to carry the workaround. So..

As per the hypervisor team, this does not solve the issue for the
bare-metal offering, hence we will go ahead with an alternate solution
as you suggested: "This sort of logic needs to live in user space..,
which can then pass a single bit to the driver to enable the behavior".

> > On handling it from the user side, are you suggesting introducing a
> > new ethtool private flag and having udev rules for the driver to set
> > the private flag and switch to full-page RX buffers? Given the wide
> > number of distros to support, this might be harder to
> > maintain/backport.
> >
> > Also, the DMI parsing design was influenced by other net wireless
> > drivers such as /wireless/ath/ath10k/core.c.
> > If this approach is not acceptable for the MANA driver then we will
> > have to take an alternate route based on the discussion right above.
>
> Plenty of ugly hacks in the kernel, it's no excuse.

Hi Jakub,

As we are still working with the HW team on root-causing the actual
issue, we would like to give the user a tunable option to achieve line
rate by running with full-page RX buffers. I will be sending out a next
version that introduces an ethtool private flag for mana that allows
the user to force one RX buffer per page.

Regards

