> On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> > > On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > > > When the MANA hardware undergoes a service reset, the ETH
> > > > > > auxiliary device (mana.eth) used by DPDK persists across the
> > > > > > reset cycle; it is not removed and re-added like RC/UD/GSI
> > > > > > QPs. This means userspace RDMA consumers such as DPDK have no
> > > > > > way of knowing that firmware handles for their PD, CQ, WQ, QP
> > > > > > and MR resources have become stale.
> > > > >
> > > > > NAK to any of this.
> > > > >
> > > > > In case of hardware reset, mana_ib AUX device needs to be
> > > > > destroyed and recreated later.
> > > >
> > > > Yeah, that is our general model for any serious RAS event where
> > > > the driver's view of resources becomes out of sync with the HW.
> > > >
> > > > You have to tear down the ib_device by removing the aux and then
> > > > bring back a new one.
> > > >
> > > > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event
> > > > is to tell userspace to close and re-open their uverbs FD.
> > > >
> > > > We don't have a model where a uverbs FD in userspace can continue
> > > > to work after the device has a catastrophic RAS event.
> > > >
> > > > There may be room to have a model where the ib device doesn't
> > > > fully unplug/replug so it retains its name and things, but that is
> > > > core code not driver stuff.
> > >
> > > Good luck with that model. It is going to break RDMA-CM hotplug
> > > support.
> >
> > I think we can preserve RDMA-CM behavior without requiring ib_device
> > unregister/re-register.
> >
> > On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
> > new reset event) through ib_dispatch_event(). RDMA-CM already handles
> > device events; we would add a handler that iterates all rdma_cm_ids
> > on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
> > as cma_process_remove() does today. The difference: cma_device stays
> > alive, so applications can reconnect on the same device after recovery
> > instead of waiting for a new one to appear.
> >
> > The motivation for keeping ib_device alive is that some RDMA consumers
> > (DPDK and NCCL) don't use RDMA-CM at all. They use raw verbs and
> > manage QP state themselves.
>
> RDMA-CM provides an "external QP" model where the QP is managed by the
> rdma-cm user.
>
> As Jason noted, you should propose the core changes together with the
> corresponding librdmacm updates. The final result must ensure that legacy
> applications continue to function correctly with the new kernel.
>
> Thanks

Will send RFC patches.

Thank you,
Long
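
For reference while the RFC is pending, the flow proposed in the quoted
text (a fatal device event fans out a removal notification to every
rdma_cm_id, while the cma_device itself stays registered) can be modeled
roughly as below. This is only a toy userspace sketch, not kernel code and
not the eventual patch: every toy_* name is hypothetical and merely stands
in for IB_EVENT_DEVICE_FATAL, RDMA_CM_EVENT_DEVICE_REMOVAL and the per-device
id list mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model. All toy_* names are hypothetical illustrations. */

enum toy_dev_event {
    TOY_EVENT_PORT_ACTIVE,   /* an unrelated event, ignored by the handler */
    TOY_EVENT_DEVICE_FATAL   /* stands in for IB_EVENT_DEVICE_FATAL */
};

enum toy_cm_event {
    TOY_CM_EVENT_NONE,
    TOY_CM_EVENT_DEVICE_REMOVAL  /* stands in for RDMA_CM_EVENT_DEVICE_REMOVAL */
};

#define TOY_MAX_IDS 8

struct toy_cm_id {
    enum toy_cm_event last_event;  /* last event delivered to the id's owner */
};

struct toy_cma_device {
    bool registered;                    /* assumed to stay true across a reset */
    size_t nr_ids;                      /* number of ids currently attached */
    struct toy_cm_id *ids[TOY_MAX_IDS];
};

/* Attach an id to the device; returns 0 on success, -1 if the table is full. */
static int toy_attach_id(struct toy_cma_device *dev, struct toy_cm_id *id)
{
    assert(dev != NULL && id != NULL);
    if (dev->nr_ids == TOY_MAX_IDS)
        return -1;
    id->last_event = TOY_CM_EVENT_NONE;
    dev->ids[dev->nr_ids++] = id;
    return 0;
}

/*
 * Device event handler: on a fatal event, deliver a removal event to every
 * attached id and detach it, but leave the device itself registered so new
 * ids can attach after recovery. Returns the number of ids notified.
 */
static size_t toy_handle_device_event(struct toy_cma_device *dev,
                                      enum toy_dev_event ev)
{
    size_t notified = 0;

    assert(dev != NULL);
    if (ev != TOY_EVENT_DEVICE_FATAL)
        return 0;

    for (size_t i = 0; i < dev->nr_ids; i++) {
        dev->ids[i]->last_event = TOY_CM_EVENT_DEVICE_REMOVAL;
        dev->ids[i] = NULL;
        notified++;
    }
    dev->nr_ids = 0;
    return notified;
}
```

In this model an application that sees the removal event on its id would
tear the id down and attach a fresh one to the same device once recovery
completes, rather than waiting for a new device to appear.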

