mana: add device reset support

Stephen Hemminger Sun, 17 May 2026 20:49:17 -0700

On Fri,  8 May 2026 23:01:13 -0700
Wei Hu <[email protected]> wrote:

> From: Wei Hu <[email protected]>
> 
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
> 
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to reach a quiescent state using RCU,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
> 
> A per-device spinlock serializes the reset path with ethdev
> operations. Operations that cannot wait (configure, queue setup)
> return -EBUSY during reset, while dev_stop and dev_close join the
> reset thread and use a blocking lock to ensure proper sequencing.
> 
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
> 
> The driver uses ethdev recovery events to notify upper layers
> (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING
> on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal
> event callback distinguishes hot-remove from service reset.
> 
> Signed-off-by: Wei Hu <[email protected]>
> ---

Lots of issues found during AI review..

Subject: Re: [PATCH v2 1/1] net/mana: add device reset support

To: Wei Hu <[email protected]>
Cc: [email protected], [email protected], [email protected]
In-Reply-To: <[email protected]>
References: <[email protected]>
 <[email protected]>

On Fri,  8 May 2026 23:01:13 -0700, Wei Hu wrote:
> Add support for handling hardware reset events in the MANA driver.

Several issues need attention before this can be merged.

Errors
------

1. Memory leak: priv->dev_state_qsv is never freed on close.

   In mana_probe_port (non-reset path) the qsbr buffer is allocated:

       priv->dev_state_qsv = rte_zmalloc_socket("mana_rcu", sz, ...);

   It is only freed in the probe-failure goto label.  mana_dev_close
   has no corresponding rte_free, so every successful close leaks
   the allocation.

2. Joinable reset thread is leaked when reset completes on its own.

   Both mana_reset_thread (skip path) and mana_reset_exit_delay
   (normal completion) do:

       priv->reset_thread_active = false;
       rte_spinlock_unlock(&priv->reset_ops_lock);
       return 0;

   The pthread is still joinable at this point.  If mana_dev_close
   runs afterward it sees active==false and skips rte_thread_join,
   leaving the TCB/stack unreaped.  The flag must only be cleared by
   whoever actually joins the thread; otherwise the reset thread
   should be detached (and then dev_close cannot wait for it, which
   is its own problem).  Either way the current pattern is wrong.

3. Spinlock held across blocking operations.

   priv->reset_ops_lock is rte_spinlock_t and is held across:

     - mana_reset_enter:      mana_dev_stop, mana_dev_close,
                              ibv_close_device,
                              mana_mp_req_on_rxtx (5s IPC timeout)
     - mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
                              mana_mp_req_on_rxtx, mana_dev_start

   IPC with a 5-second timeout and device probe under a spinlock is
   not acceptable.  Use a sleeping mutex (pthread_mutex_t initialized
   with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or
   split the lock so the long operations run outside it.

4. pthread_mutex_init / pthread_cond_init without PTHREAD_PROCESS_SHARED.

   priv is allocated with rte_zmalloc_socket, so reset_cond_mutex and
   reset_cond live in hugepage memory mapped into every secondary.
   The patch initializes them with NULL attributes:

       pthread_mutex_init(&priv->reset_cond_mutex, NULL);
       pthread_cond_init(&priv->reset_cond, NULL);

   Even if only the primary uses them today, placement in shared
   memory requires the process-shared attribute.  Otherwise behavior
   is undefined and the choice will silently break the day a
   secondary starts touching these fields.

Warnings
--------

5. Secondary data path has no synchronization with the doorbell unmap.

   MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to munmap
   proc_priv->db_page and set it to NULL.  The secondary's rx_burst /
   tx_burst protect themselves only with an atomic state check:

       rte_rcu_qsbr_thread_online(dstate_qsv, tid);
       if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; }

   But the qsbr only has primary threads registered (registration
   happens in mana_dev_configure, which never runs in secondary), so
   thread_online/offline in secondary do not block the primary's
   qsbr_check.  The MP handler in secondary therefore unmaps db_page
   while a peer secondary lcore can still be inside rx/tx_burst with
   a stale pointer.  Result: SIGSEGV on the next doorbell write.

   The secondary needs its own quiescence mechanism (a per-process
   qsbr or a reader-side rwlock around the data path that the MP
   handler acquires before unmap).

6. Return value of rte_dev_event_callback_unregister is discarded.

   In mana_intr_uninstall:

       if (dev->device)
               rte_dev_event_callback_unregister(dev->device->name,
                                                 mana_pci_remove_event_cb,
                                                 priv);

   The API returns -EAGAIN if the callback is currently executing.
   When that happens the unregister silently fails, the callback
   stays on the list, and any later free of priv produces a
   use-after-free the next time the EAL device-event thread
   dispatches.

7. PCI-remove callback depends on a monitor nobody starts.

   rte_dev_event_callback_register only wires the callback into the
   dispatch table; events are only delivered when something calls
   rte_dev_event_monitor_start().  The patch never calls it.  Whether
   this works in practice depends on the application (testpmd,
   netvsc) calling it first.  At minimum document the requirement;
   better, have the driver start the monitor on first probe and stop
   it on last remove.

8. No documentation or release-note update.

   This is a substantial new feature.  doc/guides/rel_notes/ and
   doc/guides/nics/mana.rst should both be updated to describe the
   reset lifecycle, the RTE_ETH_EVENT_ERR_RECOVERING /
   RECOVERY_SUCCESS / RECOVERY_FAILED behavior, and the dependency
   on rte_dev_event_monitor_start.

9. priv->reset_thread_active is bool, accessed without atomic ops.

   It is read in mana_dev_stop_lock and mana_dev_close_lock before
   the spinlock is acquired (intentionally, to avoid deadlock with
   the reset thread), and written from at least three threads (intr
   handler, reset thread, dev_close).  It should be RTE_ATOMIC(bool)
   with explicit ordering.  This compounds issue #2.

10. mana_intr_handler races with mana_pci_remove_event_cb on dev_state.

    intr_handler does:

        if (state == MANA_DEV_ACTIVE) {
                rte_spinlock_lock(&priv->reset_ops_lock);
                mana_reset_enter(priv);   /* stores RESET_ENTER */

    pci_remove_event_cb stores RESET_FAILED before taking the lock.
    If intr_handler reads ACTIVE, pci_remove fires (sets
    RESET_FAILED) while intr_handler is waiting on the lock,
    intr_handler then acquires it and overwrites RESET_FAILED with
    RESET_ENTER, and the driver attempts a reset on a device that
    has already been physically removed.  The state-set in
    pci_remove_event_cb must occur under the lock, or the
    intr_handler must re-check state after taking the lock.

Info
----

11. (void *)0 is used in mp.c (proc_priv->db_page = (void *)0; and
    if (proc_priv->db_page != 0)).  Use NULL.

12. The hand-off of reset_ops_lock from mana_reset_enter to
    mana_reset_thread to mana_reset_exit_delay (released in the
    latter) is hard to follow.  A short comment block at
    mana_reset_enter listing which functions take the lock and which
    release it would help future maintainers.

13. The widened TOCTOU window in mana_tx_burst (db_page is now read
    at the top of the function rather than just before the doorbell)
    is not itself a bug, but it makes #5 worse.  Once #5 is fixed
    via a proper reader-side primitive, consider keeping the db_page
    read inside the critical section.

Re: [PATCH v2 1/1] net/mana: add device reset support

Reply via email to