On Fri, 8 May 2026 23:01:13 -0700 Wei Hu <[email protected]> wrote:
> From: Wei Hu <[email protected]> > > Add support for handling hardware reset events in the MANA driver. > When the MANA kernel driver receives a hardware service event, it > initiates a device reset and notifies userspace via > IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing > an automatic teardown and recovery sequence. > > The reset flow has two phases. In the enter phase, running on the > EAL interrupt thread, the driver transitions the device state, > waits for data path threads to reach a quiescent state using RCU, > stops queues, tears down IB resources, and frees per-queue MR > caches. A control thread is then spawned to handle the exit phase: > it waits for the hardware to recover, unregisters the interrupt > handler, re-probes the PCI device, reinitializes MR caches, and > restarts queues. > > A per-device spinlock serializes the reset path with ethdev > operations. Operations that cannot wait (configure, queue setup) > return -EBUSY during reset, while dev_stop and dev_close join the > reset thread and use a blocking lock to ensure proper sequencing. > > Multi-process support is included: secondary processes unmap and > remap doorbell pages via IPC during the reset enter and exit > phases. Data path functions in both primary and secondary > processes check the device state atomically and return early when > the device is not active. > > The driver uses ethdev recovery events to notify upper layers > (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING > on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or > RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal > event callback distinguishes hot-remove from service reset. > > Signed-off-by: Wei Hu <[email protected]> > --- Lots of issues found during AI review.. Subject: Re: [PATCH v2 1/1] net/mana: add device reset support To: Wei Hu <[email protected]> Cc: [email protected], [email protected], [email protected] In-Reply-To: <[email protected]> References: <[email protected]> <[email protected]> On Fri, 8 May 2026 23:01:13 -0700, Wei Hu wrote: > Add support for handling hardware reset events in the MANA driver. Several issues need attention before this can be merged. Errors ------ 1. Memory leak: priv->dev_state_qsv is never freed on close. In mana_probe_port (non-reset path) the qsbr buffer is allocated: priv->dev_state_qsv = rte_zmalloc_socket("mana_rcu", sz, ...); It is only freed in the probe-failure goto label. mana_dev_close has no corresponding rte_free, so every successful close leaks the allocation. 2. Joinable reset thread is leaked when reset completes on its own. Both mana_reset_thread (skip path) and mana_reset_exit_delay (normal completion) do: priv->reset_thread_active = false; rte_spinlock_unlock(&priv->reset_ops_lock); return 0; The pthread is still joinable at this point. If mana_dev_close runs afterward it sees active==false and skips rte_thread_join, leaving the TCB/stack unreaped. The flag must only be cleared by whoever actually joins the thread; otherwise the reset thread should be detached (and then dev_close cannot wait for it, which is its own problem). Either way the current pattern is wrong. 3. Spinlock held across blocking operations. priv->reset_ops_lock is rte_spinlock_t and is held across: - mana_reset_enter: mana_dev_stop, mana_dev_close, ibv_close_device, mana_mp_req_on_rxtx (5s IPC timeout) - mana_reset_exit_delay: ibv_close_device, mana_pci_probe, mana_mp_req_on_rxtx, mana_dev_start IPC with a 5-second timeout and device probe under a spinlock is not acceptable. Use a sleeping mutex (pthread_mutex_t initialized with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or split the lock so the long operations run outside it. 4. pthread_mutex_init / pthread_cond_init without PTHREAD_PROCESS_SHARED. priv is allocated with rte_zmalloc_socket, so reset_cond_mutex and reset_cond live in hugepage memory mapped into every secondary. The patch initializes them with NULL attributes: pthread_mutex_init(&priv->reset_cond_mutex, NULL); pthread_cond_init(&priv->reset_cond, NULL); Even if only the primary uses them today, placement in shared memory requires the process-shared attribute. Otherwise behavior is undefined and the choice will silently break the day a secondary starts touching these fields. Warnings -------- 5. Secondary data path has no synchronization with the doorbell unmap. MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to munmap proc_priv->db_page and set it to NULL. The secondary's rx_burst / tx_burst protect themselves only with an atomic state check: rte_rcu_qsbr_thread_online(dstate_qsv, tid); if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; } But the qsbr only has primary threads registered (registration happens in mana_dev_configure, which never runs in secondary), so thread_online/offline in secondary do not block the primary's qsbr_check. The MP handler in secondary therefore unmaps db_page while a peer secondary lcore can still be inside rx/tx_burst with a stale pointer. Result: SIGSEGV on the next doorbell write. The secondary needs its own quiescence mechanism (a per-process qsbr or a reader-side rwlock around the data path that the MP handler acquires before unmap). 6. Return value of rte_dev_event_callback_unregister is discarded. In mana_intr_uninstall: if (dev->device) rte_dev_event_callback_unregister(dev->device->name, mana_pci_remove_event_cb, priv); The API returns -EAGAIN if the callback is currently executing. When that happens the unregister silently fails, the callback stays on the list, and any later free of priv produces a use-after-free the next time the EAL device-event thread dispatches. 7. PCI-remove callback depends on a monitor nobody starts. rte_dev_event_callback_register only wires the callback into the dispatch table; events are only delivered when something calls rte_dev_event_monitor_start(). The patch never calls it. Whether this works in practice depends on the application (testpmd, netvsc) calling it first. At minimum document the requirement; better, have the driver start the monitor on first probe and stop it on last remove. 8. No documentation or release-note update. This is a substantial new feature. doc/guides/rel_notes/ and doc/guides/nics/mana.rst should both be updated to describe the reset lifecycle, the RTE_ETH_EVENT_ERR_RECOVERING / RECOVERY_SUCCESS / RECOVERY_FAILED behavior, and the dependency on rte_dev_event_monitor_start. 9. priv->reset_thread_active is bool, accessed without atomic ops. It is read in mana_dev_stop_lock and mana_dev_close_lock before the spinlock is acquired (intentionally, to avoid deadlock with the reset thread), and written from at least three threads (intr handler, reset thread, dev_close). It should be RTE_ATOMIC(bool) with explicit ordering. This compounds issue #2. 10. mana_intr_handler races with mana_pci_remove_event_cb on dev_state. intr_handler does: if (state == MANA_DEV_ACTIVE) { rte_spinlock_lock(&priv->reset_ops_lock); mana_reset_enter(priv); /* stores RESET_ENTER */ pci_remove_event_cb stores RESET_FAILED before taking the lock. If intr_handler reads ACTIVE, pci_remove fires (sets RESET_FAILED) while intr_handler is waiting on the lock, intr_handler then acquires it and overwrites RESET_FAILED with RESET_ENTER, and the driver attempts a reset on a device that has already been physically removed. The state-set in pci_remove_event_cb must occur under the lock, or the intr_handler must re-check state after taking the lock. Info ---- 11. (void *)0 is used in mp.c (proc_priv->db_page = (void *)0; and if (proc_priv->db_page != 0)). Use NULL. 12. The hand-off of reset_ops_lock from mana_reset_enter to mana_reset_thread to mana_reset_exit_delay (released in the latter) is hard to follow. A short comment block at mana_reset_enter listing which functions take the lock and which release it would help future maintainers. 13. The widened TOCTOU window in mana_tx_burst (db_page is now read at the top of the function rather than just before the doorbell) is not itself a bug, but it makes #5 worse. Once #5 is fixed via a proper reader-side primitive, consider keeping the db_page read inside the critical section.

