On Wed, 18 Mar 2026 10:47:26 +0100 Sriram Yagnaraman <[email protected]> wrote:
> Enable jumbo frame reception with default mbuf data room size by > chaining multiple mbufs when packet exceeds single mbuf tailroom. > > Scatter Rx is only enabled when RTE_ETH_RX_OFFLOAD_SCATTER is > requested. Packets are dropped if they exceed single mbuf size > and scatter is not enabled, or if mbuf allocation fails during > chaining. Error counter rx_dropped_pkts tracks all drops. > > This allows receiving 9KB jumbo frames using standard 2KB mbufs, > chaining ~5 segments per jumbo packet. > > Signed-off-by: Sriram Yagnaraman <[email protected]> > --- In general looks good, but AI review found some things. Summary: Thanks for the v5, Sriram. A few comments below. 1) pkt_len is uint16_t but tp_snaplen is __u32 The scatter helper also takes uint16_t data_len, so anything above 64KB silently truncates. For jumbo frames this is fine in practice, but the type mismatch is a latent bug. Consider uint32_t for both, or at least a comment explaining why uint16_t is sufficient. 2) Scatter alloc failure should bump rx_nombuf, not just rx_dropped_pkts When rte_pktmbuf_alloc() fails for a chained segment inside eth_af_packet_rx_scatter(), only rx_dropped_pkts is incremented. The head mbuf alloc failure correctly bumps rx_nombuf. The scatter case is also a no-mbuf condition and should be reported as such, otherwise operators can't distinguish memory pressure from oversized-packet drops in the stats. 3) Reclaiming headroom in chained segments The prepend/zero-data_len/zero-pkt_len sequence works but is roundabout. Setting m->data_off = 0 directly would be cleaner and avoid the intermediate state where data_len is non-zero then immediately overwritten. 4) bufs[num_rx] fix looks correct The change from bufs[i] to bufs[num_rx] fixes a pre-existing gap in the output array when packets are dropped via continue. Good catch. Overall the structure looks solid — the scatter helper extraction is clean, the goto-based frame release avoids the earlier use-after-free risk, and the queue setup validation is correct. Full detail: ## Review: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames ### Error: `eth_af_packet_rx_scatter()` leaks chained mbufs on allocation failure When `rte_pktmbuf_alloc()` fails inside the `while` loop, the function returns `-1`. The caller then calls `rte_pktmbuf_free(mbuf)` on the head mbuf. However, before the failure, previous iterations of the loop have already linked allocated segments via `m->next`. The question is whether `rte_pktmbuf_free()` walks the chain — it does, so segments linked before the failure *will* be freed. But there's a subtlety: `mbuf->nb_segs` is incremented for each successfully chained segment, and `rte_pktmbuf_free()` uses `nb_segs` to walk the chain. Since `nb_segs` and the `->next` linkage are kept in sync, this is actually correct. On closer inspection: **no leak here.** Disregard — the caller's `rte_pktmbuf_free(mbuf)` correctly frees the entire chain. (Keeping this analysis transparent so you can verify the reasoning.) ### Error: `pkt_len` declared as `uint16_t` but `tp_snaplen` is `uint32_t` — silent truncation `ppd->tp_snaplen` is `__u32` in the kernel's `tpacket2_hdr`. The patch declares `pkt_len` as `uint16_t`: ```c uint16_t pkt_len; ... pkt_len = ppd->tp_snaplen; ``` For jumbo frames up to 9KB this works, but `tp_snaplen` can be up to 65535 (or larger with cooked sockets). If `tp_snaplen > 65535` this silently truncates, and even values in the 32768–65535 range could cause issues if `pkt_len` is used in signed comparisons. More practically, the scatter helper also takes `uint16_t data_len`, capping the maximum receivable packet at 64KB. Since this patch specifically targets jumbo frames (9KB), the practical risk is low, but the type should match the source. Consider using `uint32_t` for `pkt_len` and the `data_len` parameter of `eth_af_packet_rx_scatter()`, or at minimum add a bounds check / comment explaining why `uint16_t` is sufficient. ### Error: `rte_pktmbuf_pkt_len(mbuf)` not set in the single-mbuf fast path before VLAN reinsertion In the patched code, the single-mbuf path sets `data_len` but defers setting `pkt_len` until after the VLAN/timestamp block: ```c if (pkt_len <= rte_pktmbuf_tailroom(mbuf)) { memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, pkt_len); rte_pktmbuf_data_len(mbuf) = pkt_len; } else if (...) { ... } rte_pktmbuf_pkt_len(mbuf) = pkt_len; /* check for vlan info */ if (ppd->tp_status & TP_STATUS_VLAN_VALID) { ... if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf)) ``` `rte_vlan_insert()` reads `mbuf->pkt_len` to determine packet length. At that point `pkt_len` has been assigned to `rte_pktmbuf_pkt_len(mbuf)` — wait, looking again, `pkt_len` assignment is on the line *before* the VLAN block. Let me re-read the patch flow: ``` } else { ... goto release_frame; } rte_pktmbuf_pkt_len(mbuf) = pkt_len; // <-- this line /* check for vlan info */ ``` Yes, `pkt_len` is set before the VLAN check. This is correct. Disregard. ### Warning: `rx_nombuf` not incremented on scatter allocation failure In the original code, `rx_nombuf` is incremented when `rte_pktmbuf_alloc()` fails for the head mbuf. In the scatter path, when `rte_pktmbuf_alloc()` fails for a *chained* segment, only `rx_dropped_pkts` is incremented. This is arguably a no-mbuf condition and should also increment `rx_nombuf` so that `rte_eth_stats.rx_nombuf` accurately reflects memory pressure. The current code makes it impossible to distinguish "dropped because too large and no scatter" from "dropped because mbuf pool exhausted during scatter." Suggested fix: increment `pkt_q->rx_nombuf` instead of (or in addition to) `rx_dropped_pkts` when scatter allocation fails. ### Warning: `rte_pktmbuf_prepend()` return value unchecked In `eth_af_packet_rx_scatter()`: ```c rte_pktmbuf_prepend(m, rte_pktmbuf_headroom(m)); m->pkt_len = 0; m->data_len = 0; ``` `rte_pktmbuf_prepend()` can return `NULL` if the headroom requested exceeds available headroom (shouldn't happen for a freshly allocated mbuf, but it's defensive). More importantly, after `rte_pktmbuf_prepend()`, `data_len` is increased by the headroom amount, but then immediately overwritten to `0`. The `pkt_len` is similarly set to `0`. This sequence works but is fragile — a cleaner approach would be to directly manipulate `m->data_off = 0` to reclaim the headroom, which is what this is effectively doing. Consider: ```c m->data_off = 0; ``` This is simpler, less error-prone, and avoids the intermediate state where `data_len` contains a non-zero value that is immediately discarded. ### Warning: `scatter_enabled` read from `rxmode.offloads` in `rx_queue_setup` Per AGENTS.md guidance, reading from `dev->data->dev_conf.rxmode` after configure can become stale. In this case, `rx_queue_setup` is called between `dev_configure` and `dev_start`, so the value should be consistent. However, the canonical pattern in DPDK is to check `dev->data->scattered_rx` or the offloads via `dev->data->dev_conf.rxmode.offloads` (which is acceptable in queue setup since it runs in the configure-to-start window). This is borderline acceptable but worth noting — if `mtu_set` is called after queue setup, the scatter state won't be re-evaluated. The AF_PACKET PMD's `mtu_set` doesn't re-select Rx functions or re-evaluate scatter, which is a pre-existing gap not introduced by this patch. ### Info: `bufs[i]` changed to `bufs[num_rx]` — correct fix for existing bug The original code used `bufs[i]` which would leave gaps in the output array when packets are dropped (via `continue`). The change to `bufs[num_rx]` is correct and actually fixes a pre-existing bug in the drop path. Good catch in this patch. ### Info: Consider adding `rx_nombuf` tracking for the scatter failure case As noted above, distinguishing "no mbuf" from "oversized drop" in statistics would be valuable for debugging. Minor suggestion for a follow-up.

