On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> Today a DPC event on one NIC kills all RDMA connections and can
> crash entire training jobs.

All rdma connections on that nic, right?

> If the ib_device persists and the driver
> recreates firmware resources after recovery, raw verbs users can
> resume without full teardown, and RDMA-CM users get the same
> disconnect/reconnect behavior they have today.

No, I don't think this is feasible. There is too much state, the
kernel cannot just recreate things and transparently keep going
without userspace handshaking this.

IMHO It is just the wrong model. We have always gone for the model
that userspace has to be involved in the RAS and it has to recreate
its operations on a fresh new verbs FD.

I think anything else is going to be so complicated and fragile. I
can't see any sensible way an already open verbs FD can survive a
device reset.

Jason

