Hi,

On 13/11/2025 9:37 PM, Maayan Kashani wrote:
When mlx5_dev_start() fails partway through initialization, the error
cleanup code unconditionally calls cleanup functions for all steps,
including those that were never successfully initialized. This causes
state corruption leading to incorrect behavior on subsequent start
attempts.

The issue manifests as:
1. First start attempt fails with -ENOMEM (expected)
2. Second start attempt returns -EINVAL instead of -ENOMEM
3. With flow isolated mode, second attempt incorrectly succeeds,
    leading to segfault in rte_eth_rx_burst()

Root cause: The single error label cleanup path calls functions like
mlx5_traffic_disable() and mlx5_flow_stop_default() even when their
corresponding initialization functions (mlx5_traffic_enable() and
mlx5_flow_start_default()) were never called due to earlier failure.

For example, when mlx5_rxq_start() fails:
- mlx5_traffic_enable() at line 1403 never executes
- mlx5_flow_start_default() at line 1420 never executes
- But cleanup unconditionally calls:
   * mlx5_traffic_disable() - destroys control flows list
   * mlx5_flow_stop_default() - corrupts flow metadata state

This corrupts the device state, causing subsequent start attempts to
fail with different errors or, in isolated mode, to incorrectly succeed
with an improperly initialized device.

Fix by replacing the single error label with cascading error labels
(Linux kernel style). Each label cleans up only its corresponding step,
then falls through to clean up earlier steps.
This ensures only successfully initialized steps are cleaned up,
maintaining device state consistency across failed start attempts.
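
For illustration, the cascading-label pattern described above can be
sketched as follows. This is a minimal sketch, not the actual patch: the
demo_* names, struct demo_dev, and the fail_at and undo_calls fields are
hypothetical stand-ins, and the real mlx5_dev_start() has many more steps.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state: one flag per start-up step. */
struct demo_dev {
	bool rxq_started;
	bool traffic_enabled;
	bool default_flows_started;
	int fail_at;    /* step to fail at (1..3); 0 means no failure */
	int undo_calls; /* counts cleanup calls, for illustration only */
};

/* Each hypothetical step fails with -ENOMEM when selected by fail_at. */
static int demo_rxq_start(struct demo_dev *d)
{
	if (d->fail_at == 1)
		return -ENOMEM;
	d->rxq_started = true;
	return 0;
}

static int demo_traffic_enable(struct demo_dev *d)
{
	if (d->fail_at == 2)
		return -ENOMEM;
	d->traffic_enabled = true;
	return 0;
}

static int demo_flow_start_default(struct demo_dev *d)
{
	if (d->fail_at == 3)
		return -ENOMEM;
	d->default_flows_started = true;
	return 0;
}

static void demo_traffic_disable(struct demo_dev *d)
{
	d->traffic_enabled = false;
	d->undo_calls++;
}

static void demo_rxq_stop(struct demo_dev *d)
{
	d->rxq_started = false;
	d->undo_calls++;
}

int demo_dev_start(struct demo_dev *d)
{
	int ret;

	ret = demo_rxq_start(d);
	if (ret)
		goto error;       /* nothing initialized yet, nothing to undo */
	ret = demo_traffic_enable(d);
	if (ret)
		goto err_rxq;     /* undo only the Rx queue step */
	ret = demo_flow_start_default(d);
	if (ret)
		goto err_traffic; /* undo traffic, then fall through to Rx queues */
	return 0;

err_traffic:
	demo_traffic_disable(d);
err_rxq:
	demo_rxq_stop(d);
error:
	return ret;
}
```

In this sketch, a failure in the first step reaches no cleanup call at
all, so a second start attempt sees the same state and reports the same
error as the first.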

Bugzilla ID: 1419
Fixes: 8db7e3b69822 ("net/mlx5: change operations for non-cached flows")
Cc: [email protected]

Signed-off-by: Maayan Kashani <[email protected]>
Acked-by: Dariusz Sosnowski <[email protected]>
---

Patch applied to next-net-mlx,

Kindest regards
Raslan Darawsheh
