On Wed, 18 Mar 2026 10:47:26 +0100
Sriram Yagnaraman <[email protected]> wrote:

> Enable jumbo frame reception with default mbuf data room size by
> chaining multiple mbufs when packet exceeds single mbuf tailroom.
> 
> Scatter Rx is only enabled when RTE_ETH_RX_OFFLOAD_SCATTER is
> requested. Packets are dropped if they exceed single mbuf size
> and scatter is not enabled, or if mbuf allocation fails during
> chaining. Error counter rx_dropped_pkts tracks all drops.
> 
> This allows receiving 9KB jumbo frames using standard 2KB mbufs,
> chaining ~5 segments per jumbo packet.
> 
> Signed-off-by: Sriram Yagnaraman <[email protected]>
> ---

In general this looks good, but an AI-assisted review flagged a few things.

Summary:

Thanks for the v5, Sriram. A few comments below.

1) pkt_len is uint16_t but tp_snaplen is __u32

The scatter helper also takes uint16_t data_len, so anything above 64KB 
silently truncates. For jumbo frames this is fine in practice, but the type 
mismatch is a latent bug. Consider uint32_t for both, or at least a comment 
explaining why uint16_t is sufficient.

2) Scatter alloc failure should bump rx_nombuf, not just rx_dropped_pkts

When rte_pktmbuf_alloc() fails for a chained segment inside 
eth_af_packet_rx_scatter(), only rx_dropped_pkts is incremented. The head mbuf 
alloc failure correctly bumps rx_nombuf. The scatter case is also a no-mbuf 
condition and should be reported as such, otherwise operators can't distinguish 
memory pressure from oversized-packet drops in the stats.

3) Reclaiming headroom in chained segments

The prepend/zero-data_len/zero-pkt_len sequence works but is roundabout. 
Setting m->data_off = 0 directly would be cleaner and avoid the intermediate 
state where data_len is non-zero then immediately overwritten.

4) bufs[num_rx] fix looks correct

The change from bufs[i] to bufs[num_rx] fixes a pre-existing gap in the output 
array when packets are dropped via continue. Good catch.

Overall the structure looks solid — the scatter helper extraction is clean, the 
goto-based frame release avoids the earlier use-after-free risk, and the queue 
setup validation is correct.


Full detail:

## Review: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo 
frames

### Note: `eth_af_packet_rx_scatter()` does not leak chained mbufs on allocation 
failure

When `rte_pktmbuf_alloc()` fails inside the `while` loop, the function returns 
`-1` and the caller calls `rte_pktmbuf_free(mbuf)` on the head mbuf. At first 
glance the segments already linked via `m->next` look like a leak, but 
`rte_pktmbuf_free()` walks the whole chain, and the patch keeps `nb_segs` and 
the `->next` linkage in sync for every successfully chained segment, so the 
segments allocated before the failure *are* freed. No action needed; the 
analysis is kept here so you can verify the reasoning.

### Error: `pkt_len` declared as `uint16_t` but `tp_snaplen` is `uint32_t` — 
silent truncation

`ppd->tp_snaplen` is `__u32` in the kernel's `tpacket2_hdr`. The patch declares 
`pkt_len` as `uint16_t`:

```c
uint16_t pkt_len;
...
pkt_len = ppd->tp_snaplen;
```

For jumbo frames up to 9KB this works, but `tp_snaplen` is a full 32-bit field: 
if the kernel ever reports a value above 65535, the assignment silently 
truncates it modulo 65536 and the packet is received with the wrong length. 
More practically, the scatter helper also takes `uint16_t data_len`, capping 
the maximum receivable packet at 64KB.

Since this patch specifically targets jumbo frames (9KB), the practical risk is 
low, but the type should match the source. Consider using `uint32_t` for 
`pkt_len` and the `data_len` parameter of `eth_af_packet_rx_scatter()`, or at 
minimum add a bounds check / comment explaining why `uint16_t` is sufficient.

### Note: `rte_pktmbuf_pkt_len(mbuf)` is set before VLAN reinsertion — correct

In the patched code, the single-mbuf path sets `data_len` inside the size 
check and assigns `pkt_len` immediately after it, before the VLAN/timestamp 
block:

```c
if (pkt_len <= rte_pktmbuf_tailroom(mbuf)) {
    memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, pkt_len);
    rte_pktmbuf_data_len(mbuf) = pkt_len;
} else if (...) { ... }

rte_pktmbuf_pkt_len(mbuf) = pkt_len;

/* check for vlan info */
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
    ...
    if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
```

`rte_vlan_insert()` reads `mbuf->pkt_len` to determine the packet length, and 
by the time it runs, `pkt_len` has already been assigned to 
`rte_pktmbuf_pkt_len(mbuf)`. The ordering is correct; no action needed.

### Warning: `rx_nombuf` not incremented on scatter allocation failure

In the original code, `rx_nombuf` is incremented when `rte_pktmbuf_alloc()` 
fails for the head mbuf. In the scatter path, when `rte_pktmbuf_alloc()` fails 
for a *chained* segment, only `rx_dropped_pkts` is incremented. This is 
arguably a no-mbuf condition and should also increment `rx_nombuf` so that 
`rte_eth_stats.rx_nombuf` accurately reflects memory pressure. The current code 
makes it impossible to distinguish "dropped because too large and no scatter" 
from "dropped because mbuf pool exhausted during scatter."

Suggested fix: increment `pkt_q->rx_nombuf` instead of (or in addition to) 
`rx_dropped_pkts` when scatter allocation fails.

### Warning: `rte_pktmbuf_prepend()` return value unchecked

In `eth_af_packet_rx_scatter()`:

```c
rte_pktmbuf_prepend(m, rte_pktmbuf_headroom(m));
m->pkt_len = 0;
m->data_len = 0;
```

`rte_pktmbuf_prepend()` returns `NULL` when the requested length exceeds the 
available headroom. That cannot happen here, since the argument is 
`rte_pktmbuf_headroom(m)` itself, but the unchecked return value will trip 
static analyzers. More importantly, the sequence is roundabout: 
`rte_pktmbuf_prepend()` grows `data_len` by the headroom amount, only for the 
next line to overwrite it with `0`, and `pkt_len` is likewise zeroed. Setting 
`m->data_off = 0` directly reclaims the headroom in one step, which is what 
this sequence is effectively doing. Consider:

```c
m->data_off = 0;
```

This is simpler, less error-prone, and avoids the intermediate state where 
`data_len` contains a non-zero value that is immediately discarded.

### Warning: `scatter_enabled` read from `rxmode.offloads` in `rx_queue_setup`

Per AGENTS.md guidance, reading `dev->data->dev_conf.rxmode` after configure 
risks stale values. Here the read happens in `rx_queue_setup`, which runs in 
the configure-to-start window, so the value is consistent and this is 
acceptable; the canonical DPDK alternative is to check 
`dev->data->scattered_rx`. One gap worth noting: if `mtu_set` is called after 
queue setup, the scatter state is not re-evaluated, since the AF_PACKET PMD's 
`mtu_set` doesn't re-select Rx functions. That is a pre-existing gap, not 
introduced by this patch.

### Info: `bufs[i]` changed to `bufs[num_rx]` — correct fix for existing bug

The original code used `bufs[i]` which would leave gaps in the output array 
when packets are dropped (via `continue`). The change to `bufs[num_rx]` is 
correct and actually fixes a pre-existing bug in the drop path. Good catch in 
this patch.

### Info: Consider adding `rx_nombuf` tracking for the scatter failure case

As noted above, distinguishing "no mbuf" from "oversized drop" in statistics 
would be valuable for debugging. Minor suggestion for a follow-up.
