>
> On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > When the MANA hardware undergoes a service reset, the ETH
> > > > auxiliary device (mana.eth) used by DPDK persists across the
> > > > reset cycle; it is not removed and re-added like RC/UD/GSI QPs.
> > > > This means userspace RDMA consumers such as DPDK have no way of
> > > > knowing that firmware handles for their PD, CQ, WQ, QP and MR
> > > > resources have become stale.
> > >
> > > NAK to any of this.
> > >
> > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > and recreated later.
> >
> > Yeah, that is our general model for any serious RAS event where the
> > driver's view of resources becomes out of sync with the HW.
> >
> > You have to tear down the ib_device by removing the aux and then bring
> > back a new one.
> >
> > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > tell userspace to close and re-open their uverbs FD.
> >
> > We don't have a model where a uverbs FD in userspace can continue to
> > work after the device has a catastrophic RAS event.
> >
> > There may be room to have a model where the ib device doesn't fully
> > unplug/replug so it retains its name and things, but that is core code
> > not driver stuff.
>
> Good luck with that model. It is going to break RDMA-CM hotplug support.
>

I think we can preserve RDMA-CM behavior without requiring ib_device
unregister/re-register. On device reset, the driver can dispatch
IB_EVENT_DEVICE_FATAL (or a new reset event) through
ib_dispatch_event(). RDMA-CM already handles device events; we would
add a handler that iterates over all rdma_cm_ids on the device and
sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, the same as
cma_process_remove() does today. The difference is that the cma_device
stays alive, so applications can reconnect on the same device after
recovery instead of waiting for a new one to appear.

The motivation for keeping the ib_device alive is that some RDMA
consumers, namely DPDK and NCCL, don't use RDMA-CM at all. They use raw
verbs and manage QP state themselves. For these users, a persistent
ib_device with IB_EVENT_PORT_ERR / IB_EVENT_PORT_ACTIVE notifications
enables reliable in-place recovery without reopening the device.

This matters especially for PCI DPC recovery, which is becoming
critical for large-scale GPU/storage deployments. See this talk for
context on the value of surviving DPC events:
https://www.youtube.com/watch?v=TpNNeMGEsdU&t=1619s

Today a DPC event on one NIC kills all RDMA connections and can crash
entire training jobs. If the ib_device persists and the driver
recreates firmware resources after recovery, raw verbs users can resume
without full teardown, and RDMA-CM users get the same
disconnect/reconnect behavior they have today.

Thanks,
Long

