Re: [PATCH 0/8] fix multi-process VF hotplug

Stephen Hemminger Fri, 20 Feb 2026 09:38:45 -0800

On Thu, 19 Feb 2026 17:08:51 -0800
[email protected] wrote:

> From: Long Li <[email protected]>
> 
> This series fixes multi-process support for DPDK drivers used on
> Azure VMs with Accelerated Networking (AN). When AN is toggled, the
> VF device is hot-removed and hot-added, which can crash secondary
> processes due to stale fast-path pointers and race conditions.
> 
> Patches 1-3 fix the netvsc PMD:
> - Prevent secondary from calling unsupported promiscuous ops
> - Fix rwlock misuse and race conditions on VF add/remove events
> - Add multi-process VF device removal support via IPC
> 
> Patches 4-5 fix resource leaks:
> - MANA PD resource leak on device close
> - netvsc devargs memory leak on hotplug
> 
> Patches 6-8 fix a common bug across MANA, MLX5, and MLX4 drivers
> where the secondary process START_RXTX/STOP_RXTX IPC handlers
> update dev->rx_pkt_burst/tx_pkt_burst but do not update the
> process-local rte_eth_fp_ops[] array. Since rte_eth_rx_burst()
> uses rte_eth_fp_ops (not dev->rx_pkt_burst), the secondary retains
> stale queue data pointers after VF hot-add, causing a segfault.
> 
> Long Li (8):
>   net/netvsc: secondary ignore promiscuous enable/disable
>   net/netvsc: fix race conditions on VF add/remove events
>   net/netvsc: add multi-process VF device removal support
>   net/mana: fix PD resource leak on device close
>   net/netvsc: fix devargs memory leak on hotplug
>   net/mana: fix fast-path ops setup in secondary process
>   net/mlx5: fix fast-path ops setup in secondary process
>   net/mlx4: fix fast-path ops setup in secondary process
> 
>  drivers/net/mana/mana.c             |  14 ++
>  drivers/net/mana/mp.c               |   6 +
>  drivers/net/mlx4/mlx4_mp.c          |   4 +
>  drivers/net/mlx5/linux/mlx5_mp_os.c |   4 +
>  drivers/net/netvsc/hn_ethdev.c      | 288 +++++++++++++++++++++++++++-
>  drivers/net/netvsc/hn_nvs.h         |   5 +
>  drivers/net/netvsc/hn_rxtx.c        |  40 ++--
>  drivers/net/netvsc/hn_var.h         |   1 +
>  drivers/net/netvsc/hn_vf.c          | 122 ++++++------
>  9 files changed, 397 insertions(+), 87 deletions(-)
>


I didn't see anything wrong, but AI review spotted some issues.
Address these and resubmit.

Prompt: Using AGENTS.md, do a deep dive patch review of this patchset to DPDK 
VF support in Azure


Here's a summary of the key findings across the 8-patch series:

Errors (must fix):

    Patch 2 — __hn_vf_add and __hn_vf_remove use double-underscore prefixes 
reserved by the C standard. Rename to _unlocked() variants.
    Patch 2 — Resource leak in __hn_vf_add: if configure/start fails after 
hn_vf_attach() succeeds, the VF is left in a half-attached state (callback 
registered, ownership held, vf_attached = true) with no cleanup path.
    Patch 3 — netvsc_secondary_handle_device_remove uses hv->vf_ctx.vf_port to 
index rte_eth_devices[] without validating the port is valid in the secondary's 
address space.

Warnings (should fix):

    Patch 2 — __hn_vf_remove leaves vf_vsc_switched = true if 
hn_nvs_set_datapath(SYNTHETIC) fails, meaning Rx/Tx fast paths keep trying to 
use a dead VF.
    Patch 3 — netvsc_mp_req_VF naming violates DPDK lowercase convention; 
relaxed memory ordering on secondary_cnt could miss a recently-attached 
secondary on ARM; netvsc_uninit_once can tear down shared infrastructure 
prematurely if a probe fails with other ports active.
    Patch 4 — ret from ibv_dealloc_pd shadows the earlier return value, though 
in practice execution falls through.
    Patches 6-8 — Direct writes to rte_eth_fp_ops[] from driver code bypass the 
ethdev abstraction layer. Correct for the immediate fix but worth a follow-up 
ethdev helper API.

Patches 1, 5 look clean. The mlx5/mlx4 maintainers should ack patches 7-8.

More detail:

# Deep Dive Patch Review: Azure VF Hotplug Series (8 patches)

**Submitter:** Long Li <[email protected]>
**Delegate:** Stephen Hemminger
**Series:** VF hotplug improvements for netvsc, mana, mlx5, mlx4

---

## Patch 2/8: net/netvsc: fix race conditions on VF add/remove events

### Error: `__hn_vf_add` — double-underscore prefix is reserved by C standard

The new public API function `__hn_vf_add()` (declared in `hn_var.h` and defined 
in `hn_vf.c`) uses a name beginning with double underscore. C99 §7.1.3 reserves 
all identifiers beginning with `__` for the implementation. While this 
convention is common in Linux kernel code, DPDK is a user-space library and 
this creates undefined behavior per the C standard. More pragmatically, it 
could collide with compiler/libc built-ins.

**Suggested fix:** Rename to `hn_vf_add_unlocked()` or `hn_vf_add_locked()` 
(matching the pattern used elsewhere in DPDK for locked/unlocked variants), and 
similarly rename `__hn_vf_remove()` to `hn_vf_remove_unlocked()`.

### Error: Potential resource leak in `__hn_vf_add` on configure/start failure 
paths

In `__hn_vf_add()`, after `hn_vf_attach()` succeeds (which sets `vf_attached = 
true`, `vf_port`, and registers the callback), the function proceeds to 
configure and start the VF. If `hn_vf_configure()` (line ~259) or 
`hn_setup_vf_queues()` (line ~265) or `rte_eth_dev_start()` (line ~275) fails, 
the function jumps to `exit:` and returns the error. However, the VF remains in 
a partially set up state:

- `vf_attached` is still `true`
- The removal callback is still registered
- Port ownership is still held
- `vf_vsc_switched` remains `false` (data path never switched to VF)

This leaves the netvsc driver in an inconsistent state where it thinks it has a 
VF attached but the VF is not operational. Subsequent VF add attempts will get 
`-EEXIST` from `hn_vf_attach()`, but the code treats `-EEXIST` as success (`if 
(ret && ret != -EEXIST)`), so it will then try to configure the already-failed 
VF again.

**Suggested fix:** Add a cleanup path that reverses `hn_vf_attach()` on 
configure/start failure — unregister the callback, unset ownership, and reset 
`vf_attached`/`vf_port`.

### Warning: `hn_vf_close()` upgraded from read-lock to write-lock without 
explanation

The change from `rte_rwlock_read_lock` to `rte_rwlock_write_lock` in 
`hn_vf_close()` is correct (the function modifies `vf_attached`), but it's a 
behavioral change that could surface latency issues if other threads hold read 
locks during close. This is the right fix, but worth noting in the commit 
message since it changes locking semantics.

### Warning: `hn_vf_remove` — `vf_vsc_switched` not cleared on 
`hn_nvs_set_datapath` failure

In the new `__hn_vf_remove()`, if `hn_nvs_set_datapath(hv, 
NVS_DATAPATH_SYNTHETIC)` fails, `vf_vsc_switched` is left `true` (the `else` 
clause only clears it on success). This is arguably better than the old code 
which unconditionally cleared it, but it means the driver now thinks the VF 
data path is still active even though the host may have already removed it. 
Subsequent operations that check `vf_vsc_switched` (the Rx/Tx fast paths) will 
continue trying to use a potentially dead VF port. Consider whether logging is 
sufficient here or whether the state should be forced to `false` regardless, 
since the VF is being removed anyway.

### Warning: Removal callback registered in `hn_vf_attach` before configure

The patch moves `rte_eth_dev_callback_register(RTE_ETH_EVENT_INTR_RMV)` into 
`hn_vf_attach()`, which is called before `hn_vf_configure()`. This means the 
removal callback can fire before the VF is fully configured. If a removal event 
arrives in this window, `hn_remove_delayed` will operate on a partially 
configured VF. The old code registered the callback in `hn_vf_configure()` 
after `rte_eth_dev_configure()` succeeded, which was a tighter window. Consider 
whether this race is meaningful in practice.

---

## Patch 3/8: net/netvsc: add multi-process VF device removal support

### Error: `netvsc_init_once` uses `ret` uninitialized on primary success path

In `netvsc_init_once()`, the `ret` variable is not initialized. In the 
`RTE_PROC_PRIMARY` case, if `rte_memzone_reserve()` and 
`netvsc_mp_init_primary()` both succeed, execution falls through the `break` 
and reaches `return ret;` at the end of the function — but `ret` was never 
assigned in the success path. This is an uninitialized variable use.

```c
static int netvsc_init_once(void)
{
    int ret;                    // ← uninitialized
    ...
    case RTE_PROC_PRIMARY:
        netvsc_shared_mz = rte_memzone_reserve(...);
        if (!netvsc_shared_mz) {
            return -rte_errno;  // error path: returns directly
        }
        ...
        ret = netvsc_mp_init_primary();
        if (ret) {
            rte_memzone_free(netvsc_shared_mz);
            break;              // error path: ret is set
        }
        ...
        netvsc_local_data.init_done = true;
        break;                  // SUCCESS path: ret still holds
                                // netvsc_mp_init_primary() return (0)
```

On closer inspection, `ret` is set by `netvsc_mp_init_primary()` and if that 
returns 0 (success), `ret` is 0 at the `break`. So in practice this works, but 
the code flow is fragile — if the success path is refactored to skip the 
`netvsc_mp_init_primary()` call, `ret` would be garbage. Initialize `ret = 0` 
for clarity and safety.

### Error: `netvsc_secondary_handle_device_remove` accesses VF port without 
validation

```c
static int netvsc_secondary_handle_device_remove(struct hn_data *hv)
{
    uint16_t port_id = hv->vf_ctx.vf_port;
    struct rte_eth_dev *dev;

    dev = &rte_eth_devices[port_id];
    return rte_eth_dev_release_port(dev);
}
```

The comment says "VF is already locked by primary" but this runs in the 
secondary process — the primary's write lock on `hv->vf_lock` doesn't protect 
secondary process memory. If `vf_port` has already been reset or the port is 
invalid in the secondary's address space, this will access out-of-bounds memory 
or release the wrong port. There's no `rte_eth_dev_is_valid_port()` check on 
`port_id` here (unlike the caller `netvsc_mp_secondary_handle` which validates 
`param->port_id` — but that's the *netvsc* port, not the VF port).

**Suggested fix:** Add `rte_eth_dev_is_valid_port(port_id)` check before 
accessing `rte_eth_devices[port_id]`.

### Warning: `netvsc_mp_req_VF` — function name uses non-standard casing

The function `netvsc_mp_req_VF` mixes snake_case with an uppercase suffix. DPDK 
convention is all-lowercase with underscores for C functions. Should be 
`netvsc_mp_req_vf` or `netvsc_mp_req_vf_remove` (since the function currently 
only handles remove).

### Warning: `netvsc_uninit_once` checks counts using bitwise OR

```c
if (netvsc_local_data.primary_cnt |
    netvsc_local_data.secondary_cnt)
    return;
```

Using bitwise OR (`|`) instead of logical OR (`||`) works correctly for 
non-zero checks on unsigned integers, but is unconventional and may confuse 
readers or trigger compiler warnings at higher warning levels. Should use `||`.

### Warning: Memory ordering inconsistency for `secondary_cnt`

The shared `secondary_cnt` is accessed with `rte_memory_order_relaxed` 
throughout, but it's used as a synchronization signal in `netvsc_mp_req_VF()` 
to decide whether to send IPC messages. Relaxed ordering means the primary 
might read a stale zero and skip the IPC even though a secondary has already 
incremented the counter. In practice on x86 this is unlikely to matter, but on 
ARM it could miss a recently-attached secondary. Consider 
`rte_memory_order_acquire`/`release` for the load in `netvsc_mp_req_VF` and the 
stores in probe/remove.

### Warning: `eth_hn_probe` error path calls `netvsc_uninit_once` without 
matching count increment

If `rte_dev_event_monitor_start()` or `eth_dev_vmbus_allocate()` fails, the 
error path jumps to `init_once_failed:` which calls `netvsc_uninit_once()`. 
However, at that point `netvsc_local_data.primary_cnt` hasn't been incremented 
yet (the increment happens after `rte_eth_dev_probing_finish()`). So 
`netvsc_uninit_once()` will see `primary_cnt == 0` and `secondary_cnt == 0`, 
and will tear down the shared infrastructure — even if other netvsc ports are 
still active. This is only a problem if there are multiple netvsc devices and 
one fails to probe while others are already running.

---

## Patch 4/8: net/mana: fix PD resource leak on device close

### Warning: Return value from `ibv_dealloc_pd` shadows error from 
`mana_dev_stop`

The `ret` variable is reused for `ibv_dealloc_pd()` results, overwriting the 
return value from the preceding `mana_dev_stop()` call chain. If 
`ibv_dealloc_pd(ib_parent_pd)` fails but `ibv_dealloc_pd(ib_pd)` succeeds, 
`ret` will be 0 and the close will report success despite a deallocation 
failure. Conversely, if dealloc fails, `ret` becomes non-zero and the function 
may return that error instead of continuing to `ibv_close_device()`.

Looking at the code more carefully: the comment says "The close proceeds 
regardless because ibv_close_device() will force kernel cleanup." But the code 
as written *does* fall through regardless because each `ibv_dealloc_pd` result 
is only logged — the code doesn't return early. However, the final `ret` fed to 
`ibv_close_device`'s error handling may be wrong. Consider using a separate 
variable for the dealloc return values.

---

## Patch 5/8: net/netvsc: fix devargs memory leak on hotplug

No correctness issues found. The fix is straightforward and correctly adds 
`rte_devargs_reset()` before `free()` in both cleanup paths.

---

## Patch 6/8: net/mana: fix fast-path ops setup in secondary process

### Warning: Direct write to `rte_eth_fp_ops` may break abstraction

The patches for mana, mlx5, and mlx4 (patches 6-8) all directly write to 
`rte_eth_fp_ops[]` fields from driver code. The `rte_eth_fp_ops` structure is 
part of ethdev's internal fast-path mechanism, and drivers typically don't 
manipulate it directly — `rte_eth_dev_start()` and `eth_dev_fp_ops_setup()` 
handle this. While this works today, it couples the driver to ethdev internals. 
Consider whether ethdev should provide a helper API for secondary process 
fast-path setup instead, even if that's a separate follow-up.

### Warning: No bounds check on `param->port_id` before indexing 
`rte_eth_fp_ops`

In both the START_RXTX and STOP_RXTX handlers, `param->port_id` is used to 
index `rte_eth_fp_ops[]` without validation that it's within 
`RTE_MAX_ETHPORTS`. While the port was already validated earlier via 
`rte_eth_dev_is_valid_port()` in the mana handler, this validation should also 
hold true at the point of use. This is a minor robustness concern.

---

## Patches 7/8 and 8/8: net/mlx5 and net/mlx4 fast-path ops in secondary

Same pattern and same observations as Patch 6/8 — direct `rte_eth_fp_ops` 
writes. The fix is correct for the stated problem (secondary process SIGSEGV on 
VF hot-add). Same warning about ethdev abstraction applies.

---

## Series-Level Observations

### The series mixes netvsc driver changes with mana/mlx5/mlx4 driver changes

Patches 1-3 and 5 are netvsc-specific. Patch 4 is mana-specific (PD leak). 
Patches 6-8 fix the same secondary-process fast-path issue in three different 
drivers. These are related by the Azure VF hotplug scenario, but the mana PD 
leak (patch 4) is an independent bug. Consider whether the mlx5/mlx4 changes 
(patches 7-8) need acks from their respective maintainers (Nvidia).

### Missing release notes

This series includes multiple bug fixes with `Cc: [email protected]`. While bug 
fixes don't strictly require release notes, the multi-process VF removal 
infrastructure in patch 3 is a significant new feature that warrants a release 
note entry.

Re: [PATCH 0/8] fix multi-process VF hotplug

Reply via email to