Hello, Phil, PeterM, On Thu, Jan 18, 2024 at 04:53:42PM +0100, Philippe Mathieu-Daudé wrote: > I concur. Devices reset is hard, but bus reset is even harder. > Having a quick look, the issues tracked by Alex & Peter might > come from the PCI bridges using the legacy DeviceReset.
The challenges we're facing with VFIO on vIOMMU are actually listed in patch 4's large comment I added, here: https://lore.kernel.org/qemu-devel/20240117091559.144730-5-pet...@redhat.com/ + /* + * vIOMMU reset may require proper ordering with other devices. There + * are two complexities so that normal DeviceState.reset() may not + * work properly for vIOMMUs: + * + * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs + * (reference: resettable_cold_reset_fn()) + * + * Currently, vIOMMU devices are created as normal '-device' + * cmdlines. It means in many ways it has the same attributes with + * most of the rest devices, even if the rest devices should + * logically be under control of the vIOMMU unit. + * + * One side effect of it is vIOMMU devices will be currently put + * randomly under qdev tree hierarchy, which is the source of + * device reset ordering in current QEMU (depth-first traversal). + * It means vIOMMU now can be reset before some devices. For fully + * emulated devices that's not a problem, because the traversal + * holds BQL for the whole process. However it is a problem if DMA + * can happen without BQL, like VFIO, vDPA or remote device process. + * + * TODO: one ideal solution can be that we make vIOMMU the parent + * of the whole pci host bridge. Hence vIOMMU can be reset after + * all the devices are reset and quiesced. + * + * (2) Some devices register its own reset functions + * + * Even if above issue solved, if devices register its own reset + * functions for some reason via QEMU reset hooks, vIOMMU can still + * be reset before the device. One example is vfio_reset_handler() + * where FLR is not supported on the device. + * + * TODO: merge relevant reset functions into the device tree reset + * framework. + * + * Neither of the above TODO may be trivial. To make it work for now, + * leverage reset stages and reset vIOMMU always at latter stage of the + * default. It means it needs to be reset after at least: + * + * - resettable_cold_reset_fn(): machine qdev tree reset + * - vfio_reset_handler(): VFIO reset for !FLR + */ What you're asking makes sense to me, because that's also what I would prefer to consider (and that's why I mentioned my series "slightly ugly" in the cover letter), even if I don't yet know much on the other reset challenges QEMU is already facing. The issue is just like what I mentioned in the cover letter: that complete solution can be non-trivial and can take a lot of time, and I probably wouldn't have time at least recently to persue such a solution. To fix the issue cleanly, I assume we need to start from making sure all VFIO paths will only use the Resettable interface to reset. The second part will involve moving vIOMMU device in the QOM tree to be the parent of whatever it manages. I didn't follow recent changes in this area, but currently vIOMMU is probably an anonymous object dangling somewhere and more or less orphaned, while "-device" is the interface for now to create the vIOMMU which might be too generic. I'm not sure whether we'll need new interface already to create the vIOMMU, e.g., it may ideally need to be the parent of the root pci bus that it manages, one or multiple. And we need to make sure the vIOMMU being present when the root pci buses are created, I think. IIUC that can be before parsing "-device"s. For the 2nd issue, I assume the ->hold() phase for VT-d to reset address spaces might be good enough, as long as the reset is done depth-first, then VFIO's ->hold()s will be called before that point. I consider after all devices' ->hold() kicked off there should have no DMA anymore, so quit() shouldn't hopefully matter. However even if hold() works it is still challenging for the hierachy change. Considering that this issue so far shouldn't cause any functional breakage, maybe one option is we gradually fix all above, and before that we declare it a known issue. Any other suggestions would also be greatly welcomed. Thanks, -- Peter Xu