From: Willem De Bruijn <[email protected]> Date: Mon, 17 Jun 2024 14:13:07 -0400
> Alexander Lobakin wrote: >> From: Willem De Bruijn <[email protected]> >> Date: Thu, 30 May 2024 09:46:46 -0400 >> >>> Alexander Lobakin wrote: >>>> Currently, idpf uses the following model for the header buffers: >>>> >>>> * buffers are allocated via dma_alloc_coherent(); >>>> * when receiving, napi_alloc_skb() is called and then the header is >>>> copied to the newly allocated linear part. >>>> >>>> This is far from optimal as DMA coherent zone is slow on many systems >>>> and memcpy() neutralizes the idea and benefits of the header split. >>> >>> In the previous revision this assertion was called out, as we have >>> lots of experience with the existing implementation and a previous one >>> based on dynamic allocation one that performed much worse. You would >> >> napi_build_skb() is not a dynamic allocation. In contrary, >> napi_alloc_skb() from the current implementation actually *is* a dynamic >> allocation. It allocates a page frag for every header buffer each time. >> >> Page Pool refills header buffers from its pool of recycled frags. >> Plus, on x86_64, truesize of a header buffer is 1024, meaning it picks >> a new page from the pool every 4th buffer. During the testing of common >> workloads, I had literally zero new page allocations, as the skb core >> recycles frags from skbs back to the pool. >> >> IOW, the current version you're defending actually performs more dynamic >> allocations on hotpath than this one ¯\_(ツ)_/¯ >> >> (I explained all this several times already) >> >>> share performance numbers in the next revision >> >> I can't share numbers in the outside, only percents. >> >> I shared before/after % in the cover letter. Every test yielded more >> Mpps after this change, esp. non-XDP_PASS ones when you don't have >> networking stack overhead. > > This is the main concern: AF_XDP has no existing users, but TCP/IP is > used in production environments. So we cannot risk TCP/IP regressions > in favor of somewhat faster AF_XDP. Secondary is that a functional > implementation of AF_XDP soon with optimizations later is preferable > over the fastest solution later. I have perf numbers before-after for all the common workloads and I see only improvements there. Do you have any to prove that this change introduces regressions? > >>> >>> https://lore.kernel.org/netdev/[email protected]/T/#me85d509365aba9279275e9b181248247e1f01bb0 >>> >>> This may be so integral to this patch series that asking to back it >>> out now sets back the whole effort. That is not my intent. >>> >>> And I appreciate that in principle there are many potential >>> optizations. >>> >>> But this (OOT) driver is already in use and regressions in existing >>> workloads is a serious headache. As is significant code churn wrt >>> other still OOT feature patch series. >>> >>> This series (of series) modifies the driver significantly, beyond the >>> narrow scope of adding XDP and AF_XDP. >> >> Yes, because all this is needed in order for XDP to work properly and >> quick enough to be competitive. OOT XDP implementation is not >> competitive and performs much worse even in comparison to the upstream ice. >> >> (for example, the idea of doing memcpy() before running XDP only to do >> XDP_DROP and quickly drop frames sounds horrible) >> >> Any serious series modification would mean a ton of rework only to >> downgrade the overall functionality, why do that? > > As I said before, it is not my intent to set back the effort by asking > for changes now. > > Only to caution to not expand the patch series even more (it grew from > 3 to 6 series) and to remind that performance of established workloads Where does this "3" come from? When I sent RFC in Dec, it was one huge set of all the changes, then I sent another RFC where it was already 6 series. > remain paramount. Thanks, Olek
