On Mon,  8 Jun 2026 05:08:24 -0700
Wei Hu <[email protected]> wrote:

> From: Wei Hu <[email protected]>
> 
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
> 
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to drain using per-queue atomic flags,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
> 
> Each queue has an atomic burst_state variable where bit 0 is the
> in-burst flag and bits 1+ encode device state. The data path uses
> a single compare-and-swap (0 to 1) to enter a burst, which fails
> immediately if the reset path has set any state bits. The reset
> path sets state bits via atomic fetch-or and polls bit 0 to wait
> for in-flight bursts to drain. This single-variable design avoids
> the need for sequential consistency ordering.
> 
> A per-device mutex serializes the reset path with ethdev
> operations. The mutex uses PTHREAD_PROCESS_SHARED for multi-process
> support and is held across blocking IB verbs calls. A trylock
> helper encapsulates the lock acquisition and device state check
> for all ethdev operation wrappers. Operations that cannot wait
> (configure, queue setup) return -EBUSY during reset, while
> dev_stop and dev_close join the reset thread before acquiring
> the lock to ensure proper sequencing. A CAS-based helper prevents
> double-join of the reset thread.
> 
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
> 
> The driver emits RTE_ETH_EVENT_ERR_RECOVERING before entering the
> reset path so that upper layers (e.g. netvsc) can switch their
> data path before queues are stopped. The event is emitted outside
> the reset lock to avoid deadlock if the callback calls dev_stop or
> dev_close. On completion, the driver emits RECOVERY_SUCCESS or
> RECOVERY_FAILED. If the enter phase fails internally,
> RECOVERY_FAILED is sent immediately so the application receives a
> terminal event. A PCI device removal event callback distinguishes
> hot-remove from service reset.
> 
> Documentation for the device reset feature is added in the MANA
> NIC guide and the 26.07 release notes.
> 
> Signed-off-by: Wei Hu <[email protected]>
> ---

This is is a rather complex set of state transitions so admit to relying
on AI as backup for tracking this. It still sees some errors here.

Worth asking the question, "what does mlx5 do?" and "should DPDK
EAL be doing this at the PCI layer instead?"

---

Much better - this addresses the RCU, the macros, the thread-safety-analysis
suppressions, and the callback-under-lock deadlock. The single-variable
burst_state CAS is a clean way to do the drain and the acquire/release
reasoning checks out.

One structural thing remains. The enter phase still runs the heavy teardown
on the EAL interrupt thread under reset_ops_lock: dev_stop, then
mana_mp_req_on_rxtx(RESET_ENTER) which is a blocking rte_mp_request_sync with
a multi-second timeout, then dev_close with its ibv calls. You already moved
the exit phase to a control thread because intr_callback_unregister cannot run
on the interrupt thread; the same argument applies to blocking IPC and verbs
teardown. A slow or absent secondary will stall the interrupt thread for the
MP timeout, and this is the blocking-under-a-sleeping-mutex pattern. Please
have the interrupt handler just set state and drain, then hand the rest of the
teardown to the control thread. That also removes the last lock hand-off
between functions/threads, so each function can own its lock.

Smaller points:
- RECOVERY_SUCCESS/FAILED are emitted from the reset thread. If the callback
  calls dev_stop/dev_close, mana_join_reset_thread() joins the current thread
  (EDEADLK, leaked handle). INTR_RMV is fine since it runs on the dev-event
  thread.
- The burst_state comment says bits 1+ encode device state, but only
  RESET_ENTER<<1 is ever stored - it is effectively a single "blocked" flag.

Reply via email to