On Thu, 19 Feb 2026 17:08:51 -0800
[email protected] wrote:
> From: Long Li <[email protected]>
>
> This series fixes multi-process support for DPDK drivers used on
> Azure VMs with Accelerated Networking (AN). When AN is toggled, the
> VF device is hot-removed and hot-added, which can crash secondary
> processes due to stale fast-path pointers and race conditions.
>
> Patches 1-3 fix the netvsc PMD:
> - Prevent secondary from calling unsupported promiscuous ops
> - Fix rwlock misuse and race conditions on VF add/remove events
> - Add multi-process VF device removal support via IPC
>
> Patches 4-5 fix resource leaks:
> - MANA PD resource leak on device close
> - netvsc devargs memory leak on hotplug
>
> Patches 6-8 fix a common bug across MANA, MLX5, and MLX4 drivers
> where the secondary process START_RXTX/STOP_RXTX IPC handlers
> update dev->rx_pkt_burst/tx_pkt_burst but do not update the
> process-local rte_eth_fp_ops[] array. Since rte_eth_rx_burst()
> uses rte_eth_fp_ops (not dev->rx_pkt_burst), the secondary retains
> stale queue data pointers after VF hot-add, causing a segfault.
>
> Long Li (8):
> net/netvsc: secondary ignore promiscuous enable/disable
> net/netvsc: fix race conditions on VF add/remove events
> net/netvsc: add multi-process VF device removal support
> net/mana: fix PD resource leak on device close
> net/netvsc: fix devargs memory leak on hotplug
> net/mana: fix fast-path ops setup in secondary process
> net/mlx5: fix fast-path ops setup in secondary process
> net/mlx4: fix fast-path ops setup in secondary process
>
> drivers/net/mana/mana.c | 14 ++
> drivers/net/mana/mp.c | 6 +
> drivers/net/mlx4/mlx4_mp.c | 4 +
> drivers/net/mlx5/linux/mlx5_mp_os.c | 4 +
> drivers/net/netvsc/hn_ethdev.c | 288 +++++++++++++++++++++++++++-
> drivers/net/netvsc/hn_nvs.h | 5 +
> drivers/net/netvsc/hn_rxtx.c | 40 ++--
> drivers/net/netvsc/hn_var.h | 1 +
> drivers/net/netvsc/hn_vf.c | 122 ++++++------
> 9 files changed, 397 insertions(+), 87 deletions(-)
>
I didn't see anything wrong, but AI review spotted some issues.
Address these and resubmit.
Prompt: Using AGENTS.md, do a deep dive patch review of this patchset to DPDK
VF support in Azure
Here's a summary of the key findings across the 8-patch series:
Errors (must fix):
Patch 2 — __hn_vf_add and __hn_vf_remove use double-underscore prefixes
reserved by the C standard. Rename to _unlocked() variants.
Patch 2 — Resource leak in __hn_vf_add: if configure/start fails after
hn_vf_attach() succeeds, the VF is left in a half-attached state (callback
registered, ownership held, vf_attached = true) with no cleanup path.
Patch 3 — netvsc_secondary_handle_device_remove uses hv->vf_ctx.vf_port to
index rte_eth_devices[] without validating the port is valid in the secondary's
address space.
Warnings (should fix):
Patch 2 — __hn_vf_remove leaves vf_vsc_switched = true if
hn_nvs_set_datapath(SYNTHETIC) fails, meaning Rx/Tx fast paths keep trying to
use a dead VF.
Patch 3 — netvsc_mp_req_VF naming violates DPDK lowercase convention;
relaxed memory ordering on secondary_cnt could miss a recently-attached
secondary on ARM; netvsc_uninit_once can tear down shared infrastructure
prematurely if a probe fails with other ports active.
Patch 4 — ret from ibv_dealloc_pd shadows the earlier return value, though
in practice execution falls through.
Patches 6-8 — Direct writes to rte_eth_fp_ops[] from driver code bypass the
ethdev abstraction layer. Correct for the immediate fix but worth a follow-up
ethdev helper API.
Patches 1, 5 look clean. The mlx5/mlx4 maintainers should ack patches 7-8.
More detail:
# Deep Dive Patch Review: Azure VF Hotplug Series (8 patches)
**Submitter:** Long Li <[email protected]>
**Delegate:** Stephen Hemminger
**Series:** VF hotplug improvements for netvsc, mana, mlx5, mlx4
---
## Patch 2/8: net/netvsc: fix race conditions on VF add/remove events
### Error: `__hn_vf_add` — double-underscore prefix is reserved by C standard
The new public API function `__hn_vf_add()` (declared in `hn_var.h` and defined
in `hn_vf.c`) uses a name beginning with double underscore. C99 §7.1.3 reserves
all identifiers beginning with `__` for the implementation. While this
convention is common in Linux kernel code, DPDK is a user-space library and
this creates undefined behavior per the C standard. More pragmatically, it
could collide with compiler/libc built-ins.
**Suggested fix:** Rename to `hn_vf_add_unlocked()` or `hn_vf_add_locked()`
(matching the pattern used elsewhere in DPDK for locked/unlocked variants), and
similarly rename `__hn_vf_remove()` to `hn_vf_remove_unlocked()`.
### Error: Potential resource leak in `__hn_vf_add` on configure/start failure
paths
In `__hn_vf_add()`, after `hn_vf_attach()` succeeds (which sets `vf_attached =
true`, `vf_port`, and registers the callback), the function proceeds to
configure and start the VF. If `hn_vf_configure()` (line ~259) or
`hn_setup_vf_queues()` (line ~265) or `rte_eth_dev_start()` (line ~275) fails,
the function jumps to `exit:` and returns the error. However, the VF remains in
a partially set up state:
- `vf_attached` is still `true`
- The removal callback is still registered
- Port ownership is still held
- `vf_vsc_switched` remains `false` (data path never switched to VF)
This leaves the netvsc driver in an inconsistent state where it thinks it has a
VF attached but the VF is not operational. Subsequent VF add attempts will get
`-EEXIST` from `hn_vf_attach()`, but the code treats `-EEXIST` as success (`if
(ret && ret != -EEXIST)`), so it will then try to configure the already-failed
VF again.
**Suggested fix:** Add a cleanup path that reverses `hn_vf_attach()` on
configure/start failure — unregister the callback, unset ownership, and reset
`vf_attached`/`vf_port`.
### Warning: `hn_vf_close()` upgraded from read-lock to write-lock without
explanation
The change from `rte_rwlock_read_lock` to `rte_rwlock_write_lock` in
`hn_vf_close()` is correct (the function modifies `vf_attached`), but it's a
behavioral change that could surface latency issues if other threads hold read
locks during close. This is the right fix, but worth noting in the commit
message since it changes locking semantics.
### Warning: `hn_vf_remove` — `vf_vsc_switched` not cleared on
`hn_nvs_set_datapath` failure
In the new `__hn_vf_remove()`, if `hn_nvs_set_datapath(hv,
NVS_DATAPATH_SYNTHETIC)` fails, `vf_vsc_switched` is left `true` (the `else`
clause only clears it on success). This is arguably better than the old code
which unconditionally cleared it, but it means the driver now thinks the VF
data path is still active even though the host may have already removed it.
Subsequent operations that check `vf_vsc_switched` (the Rx/Tx fast paths) will
continue trying to use a potentially dead VF port. Consider whether logging is
sufficient here or whether the state should be forced to `false` regardless,
since the VF is being removed anyway.
### Warning: Removal callback registered in `hn_vf_attach` before configure
The patch moves `rte_eth_dev_callback_register(RTE_ETH_EVENT_INTR_RMV)` into
`hn_vf_attach()`, which is called before `hn_vf_configure()`. This means the
removal callback can fire before the VF is fully configured. If a removal event
arrives in this window, `hn_remove_delayed` will operate on a partially
configured VF. The old code registered the callback in `hn_vf_configure()`
after `rte_eth_dev_configure()` succeeded, which was a tighter window. Consider
whether this race is meaningful in practice.
---
## Patch 3/8: net/netvsc: add multi-process VF device removal support
### Error: `netvsc_init_once` uses `ret` uninitialized on primary success path
In `netvsc_init_once()`, the `ret` variable is not initialized. In the
`RTE_PROC_PRIMARY` case, if `rte_memzone_reserve()` and
`netvsc_mp_init_primary()` both succeed, execution falls through the `break`
and reaches `return ret;` at the end of the function — but `ret` was never
assigned in the success path. This is an uninitialized variable use.
```c
static int netvsc_init_once(void)
{
int ret; // ← uninitialized
...
case RTE_PROC_PRIMARY:
netvsc_shared_mz = rte_memzone_reserve(...);
if (!netvsc_shared_mz) {
return -rte_errno; // error path: returns directly
}
...
ret = netvsc_mp_init_primary();
if (ret) {
rte_memzone_free(netvsc_shared_mz);
break; // error path: ret is set
}
...
netvsc_local_data.init_done = true;
break; // SUCCESS path: ret still holds
// netvsc_mp_init_primary() return (0)
```
On closer inspection, `ret` is set by `netvsc_mp_init_primary()` and if that
returns 0 (success), `ret` is 0 at the `break`. So in practice this works, but
the code flow is fragile — if the success path is refactored to skip the
`netvsc_mp_init_primary()` call, `ret` would be garbage. Initialize `ret = 0`
for clarity and safety.
### Error: `netvsc_secondary_handle_device_remove` accesses VF port without
validation
```c
static int netvsc_secondary_handle_device_remove(struct hn_data *hv)
{
uint16_t port_id = hv->vf_ctx.vf_port;
struct rte_eth_dev *dev;
dev = &rte_eth_devices[port_id];
return rte_eth_dev_release_port(dev);
}
```
The comment says "VF is already locked by primary" but this runs in the
secondary process — the primary's write lock on `hv->vf_lock` doesn't protect
secondary process memory. If `vf_port` has already been reset or the port is
invalid in the secondary's address space, this will access out-of-bounds memory
or release the wrong port. There's no `rte_eth_dev_is_valid_port()` check on
`port_id` here (unlike the caller `netvsc_mp_secondary_handle` which validates
`param->port_id` — but that's the *netvsc* port, not the VF port).
**Suggested fix:** Add `rte_eth_dev_is_valid_port(port_id)` check before
accessing `rte_eth_devices[port_id]`.
### Warning: `netvsc_mp_req_VF` — function name uses non-standard casing
The function `netvsc_mp_req_VF` mixes snake_case with an uppercase suffix. DPDK
convention is all-lowercase with underscores for C functions. Should be
`netvsc_mp_req_vf` or `netvsc_mp_req_vf_remove` (since the function currently
only handles remove).
### Warning: `netvsc_uninit_once` checks counts using bitwise OR
```c
if (netvsc_local_data.primary_cnt |
netvsc_local_data.secondary_cnt)
return;
```
Using bitwise OR (`|`) instead of logical OR (`||`) works correctly for
non-zero checks on unsigned integers, but is unconventional and may confuse
readers or trigger compiler warnings at higher warning levels. Should use `||`.
### Warning: Memory ordering inconsistency for `secondary_cnt`
The shared `secondary_cnt` is accessed with `rte_memory_order_relaxed`
throughout, but it's used as a synchronization signal in `netvsc_mp_req_VF()`
to decide whether to send IPC messages. Relaxed ordering means the primary
might read a stale zero and skip the IPC even though a secondary has already
incremented the counter. In practice on x86 this is unlikely to matter, but on
ARM it could miss a recently-attached secondary. Consider
`rte_memory_order_acquire`/`release` for the load in `netvsc_mp_req_VF` and the
stores in probe/remove.
### Warning: `eth_hn_probe` error path calls `netvsc_uninit_once` without
matching count increment
If `rte_dev_event_monitor_start()` or `eth_dev_vmbus_allocate()` fails, the
error path jumps to `init_once_failed:` which calls `netvsc_uninit_once()`.
However, at that point `netvsc_local_data.primary_cnt` hasn't been incremented
yet (the increment happens after `rte_eth_dev_probing_finish()`). So
`netvsc_uninit_once()` will see `primary_cnt == 0` and `secondary_cnt == 0`,
and will tear down the shared infrastructure — even if other netvsc ports are
still active. This is only a problem if there are multiple netvsc devices and
one fails to probe while others are already running.
---
## Patch 4/8: net/mana: fix PD resource leak on device close
### Warning: Return value from `ibv_dealloc_pd` shadows error from
`mana_dev_stop`
The `ret` variable is reused for `ibv_dealloc_pd()` results, overwriting the
return value from the preceding `mana_dev_stop()` call chain. If
`ibv_dealloc_pd(ib_parent_pd)` fails but `ibv_dealloc_pd(ib_pd)` succeeds,
`ret` will be 0 and the close will report success despite a deallocation
failure. Conversely, if dealloc fails, `ret` becomes non-zero and the function
may return that error instead of continuing to `ibv_close_device()`.
Looking at the code more carefully: the comment says "The close proceeds
regardless because ibv_close_device() will force kernel cleanup." But the code
as written *does* fall through regardless because each `ibv_dealloc_pd` result
is only logged — the code doesn't return early. However, the final `ret` fed to
`ibv_close_device`'s error handling may be wrong. Consider using a separate
variable for the dealloc return values.
---
## Patch 5/8: net/netvsc: fix devargs memory leak on hotplug
No correctness issues found. The fix is straightforward and correctly adds
`rte_devargs_reset()` before `free()` in both cleanup paths.
---
## Patch 6/8: net/mana: fix fast-path ops setup in secondary process
### Warning: Direct write to `rte_eth_fp_ops` may break abstraction
The patches for mana, mlx5, and mlx4 (patches 6-8) all directly write to
`rte_eth_fp_ops[]` fields from driver code. The `rte_eth_fp_ops` structure is
part of ethdev's internal fast-path mechanism, and drivers typically don't
manipulate it directly — `rte_eth_dev_start()` and `eth_dev_fp_ops_setup()`
handle this. While this works today, it couples the driver to ethdev internals.
Consider whether ethdev should provide a helper API for secondary process
fast-path setup instead, even if that's a separate follow-up.
### Warning: No bounds check on `param->port_id` before indexing
`rte_eth_fp_ops`
In both the START_RXTX and STOP_RXTX handlers, `param->port_id` is used to
index `rte_eth_fp_ops[]` without validation that it's within
`RTE_MAX_ETHPORTS`. While the port was already validated earlier via
`rte_eth_dev_is_valid_port()` in the mana handler, this validation should also
hold true at the point of use. This is a minor robustness concern.
---
## Patches 7/8 and 8/8: net/mlx5 and net/mlx4 fast-path ops in secondary
Same pattern and same observations as Patch 6/8 — direct `rte_eth_fp_ops`
writes. The fix is correct for the stated problem (secondary process SIGSEGV on
VF hot-add). Same warning about ethdev abstraction applies.
---
## Series-Level Observations
### The series mixes netvsc driver changes with mana/mlx5/mlx4 driver changes
Patches 1-3 and 5 are netvsc-specific. Patch 4 is mana-specific (PD leak).
Patches 6-8 fix the same secondary-process fast-path issue in three different
drivers. These are related by the Azure VF hotplug scenario, but the mana PD
leak (patch 4) is an independent bug. Consider whether the mlx5/mlx4 changes
(patches 7-8) need acks from their respective maintainers (Nvidia).
### Missing release notes
This series includes multiple bug fixes with `Cc: [email protected]`. While bug
fixes don't strictly require release notes, the multi-process VF removal
infrastructure in patch 3 is a significant new feature that warrants a release
note entry.