Hi Björn and Magnus,

(This thread is a follow-up to a private dialogue. The intention is to
let the community know that AF_XDP can be enhanced further to make it
compatible with a wider range of NIC vendors.)

There are two NIC variations which don't fit well with the current
AF_XDP proposal.

The first variation is represented by some NXP and Cavium NICs. AF_XDP
expects the NIC to put incoming frames into slots defined by the UMEM
area. Here the slot size is set in XDP_UMEM_REG
xdp_umem.reg.frame_size, and the slots available to the NIC are
communicated to the kernel via the UMEM fill queue. While Intel NICs
support only one slot size, NXP and Cavium support multiple slot sizes
to optimize memory usage (e.g. { 128, 512, 2048 }, please refer to [1]
for rationale). On frame reception the NIC pulls a slot from the
best-fit pool based on the frame size.

The second variation is represented by e.g. Chelsio T5/T6 and Netcope
NICs. As described above, AF_XDP expects the NIC to put incoming frames
at predefined addresses. This is not the case here, as the NIC is in
charge of placing frames in memory based on its own algorithm. For
example, Chelsio T5/T6 expects to get whole pages from the driver and
puts incoming frames on the page in a 'tape recorder' fashion. Assuming
'base' is the page address and frame_len[N] is an array of frame
lengths, the frame placement in memory will look like this:

  base + 0
  base + frame_len[0]
  base + frame_len[0] + frame_len[1]
  base + frame_len[0] + frame_len[1] + frame_len[2]
  ...

To better support these two NIC variations I suggest abandoning the
'frame size' structuring in UMEM and sticking with 'pages' instead. The
NIC kernel driver is then responsible for splitting the provided pages
into the slots expected by the underlying HW (or not splitting at all
in the case of Chelsio/Netcope).

On XDP_UMEM_REG the application needs to specify page_size. The
application can then pass empty pages to the kernel driver using the
UMEM 'fill' queue by specifying the page offset within the UMEM area.
The xdp_desc format needs to be changed as well: the frame location
will be defined by an offset from the beginning of the UMEM area
instead of a frame index. As payload headroom can vary with AF_XDP,
we'll need to specify it in xdp_desc as well.
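
To illustrate what offset-based descriptors would carry for a 'tape
recorder' NIC, here is a minimal sketch. The helper tape_offsets() and
its parameters are hypothetical and not part of any kernel or driver
API. It also assumes frames are packed back to back with no alignment
padding, which real hardware may not do.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): given the UMEM-relative offset
 * of a page and the lengths of the frames the NIC wrote back to back,
 * compute the UMEM-relative offset of each frame.
 *
 * Returns the number of frames that fit inside the page; it stops at the
 * first frame that would cross the page end.
 */
static size_t tape_offsets(uint64_t page_off, uint32_t page_size,
                           const uint16_t *frame_len, size_t n,
                           uint64_t *out)
{
    uint32_t used = 0;  /* bytes of the page consumed so far */
    size_t i;

    for (i = 0; i < n; i++) {
        if (frame_len[i] > page_size - used)
            break;
        out[i] = page_off + used;  /* base + sum of previous lengths */
        used += frame_len[i];
    }
    return i;
}
```

With 4 KiB pages, every offset produced for the same page shares the
same value of (offset >> 12), which is what lets the application map a
descriptor back to its page.
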
Besides the headroom field, it could be worth changing the payload
length in xdp_desc to u16, as 64k+ frames aren't very common in
networking. The resulting xdp_desc would look like this:

  struct xdp_desc {
          __u64 offset;
          __u16 headroom;
          __u16 len;
          __u8 flags;
          __u8 padding[3];
  };

In the current proposal you have a notion of 'frame ownership': 'owned
by kernel' or 'owned by application'. Ownership is transferred by
enqueueing the frame index in the UMEM 'fill' queue (from application
to kernel) or in the UMEM 'tx completion' queue (from kernel to
application). If you decide to adopt the 'page' approach, this notion
needs to be changed a bit. This is because, in the case of packet
forwarding, one and the same page can be used for RX (parts of it
enqueued in HW 'free lists') and TX (forwarding of previously RXed
packets).

I propose to define 'ownership' as the right to manipulate the
partitioning of the page into frames. Whenever the application passes a
page to the kernel via the UMEM 'fill' queue, the ownership is
transferred to the kernel. The application can't allocate packets on
this page until the kernel is done with it, but it can change the
payload of RXed packets before forwarding them. The kernel can pass
ownership back by means of an 'end-of-page' flag in xdp_desc.flags.

The pages are definitely supposed to be recycled sooner or later. Even
if this is not part of the kernel API and the housekeeping
implementation resides completely in the application, I would still
like to propose a possible (hopefully cost-efficient) solution.
Recycling could be achieved by keeping a refcount on pages and
recycling a page only when it is owned by the application and its
refcount reaches 0.

Whenever the application transfers page ownership to the kernel, the
refcount shall be initialized to 0. With each incoming RX xdp_desc the
corresponding page needs to be identified (xdp_desc.offset >>
PAGE_SHIFT) and its refcount incremented. When the packet gets freed,
the refcount shall be decremented.
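
The bookkeeping described so far could be sketched in application code
roughly as follows. This is only an illustration under stated
assumptions: all names (page_state, page_give_to_kernel(), page_on_rx(),
page_put()) are hypothetical, PAGE_SHIFT is assumed to be 12, and the
UMEM is assumed to hold NUM_PAGES pages with no bounds checking.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */
#define NUM_PAGES  64  /* assumption: small UMEM, for illustration only */

enum page_owner { OWNER_APP, OWNER_KERNEL };

struct page_state {
    enum page_owner owner;
    uint32_t refcount;  /* packets on this page still in use */
};

static struct page_state pages[NUM_PAGES];

/* Map a UMEM-relative offset to the state of the page containing it. */
static struct page_state *page_of(uint64_t umem_offset)
{
    return &pages[umem_offset >> PAGE_SHIFT];
}

/* Application enqueues a page on the 'fill' queue: kernel owns it now. */
static void page_give_to_kernel(uint64_t page_offset)
{
    struct page_state *p = page_of(page_offset);

    p->owner = OWNER_KERNEL;
    p->refcount = 0;
}

/* Called for each RX descriptor; end_of_page mirrors the proposed flag. */
static void page_on_rx(uint64_t desc_offset, bool end_of_page)
{
    struct page_state *p = page_of(desc_offset);

    p->refcount++;
    if (end_of_page)
        p->owner = OWNER_APP;  /* kernel is done partitioning the page */
}

/* Called when a packet buffer is released; true means: recycle the page. */
static bool page_put(uint64_t desc_offset)
{
    struct page_state *p = page_of(desc_offset);

    p->refcount--;
    return p->owner == OWNER_APP && p->refcount == 0;
}
```

The same page_put() would be called wherever a packet's buffer is
finally released, so a page is recycled only after the kernel has
returned ownership and the last packet on it is gone.
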
If a packet is forwarded in a TX xdp_desc, the refcount gets
decremented only on TX completion (again, tx_completion.desc >>
PAGE_SHIFT). For packets originating from the application itself, the
payload buffers need to be allocated from an empty page owned by the
application, and the refcount needs to be incremented as well.

[1] https://en.wikipedia.org/wiki/Internet_Mix

With best regards,
Mykyta