Re: [PATCH 2/4] reset: Allow multiple stages of system resets

Peter Xu Fri, 19 Jan 2024 03:11:40 -0800

Hello, Phil, PeterM,

On Thu, Jan 18, 2024 at 04:53:42PM +0100, Philippe Mathieu-Daudé wrote:
> I concur. Devices reset is hard, but bus reset is even harder.
> Having a quick look, the issues tracked by Alex & Peter might
> come from the PCI bridges using the legacy DeviceReset.


The challenges we're facing with VFIO on vIOMMU are actually listed in
patch 4's large comment I added, here:

https://lore.kernel.org/qemu-devel/20240117091559.144730-5-pet...@redhat.com/

+    /*
+     * vIOMMU reset may require proper ordering with other devices.  There
+     * are two complexities so that normal DeviceState.reset() may not
+     * work properly for vIOMMUs:
+     *
+     * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs
+     *     (reference: resettable_cold_reset_fn())
+     *
+     *     Currently, vIOMMU devices are created as normal '-device'
+     *     cmdlines.  It means in many ways it has the same attributes with
+     *     most of the rest devices, even if the rest devices should
+     *     logically be under control of the vIOMMU unit.
+     *
+     *     One side effect of it is vIOMMU devices will be currently put
+     *     randomly under qdev tree hierarchy, which is the source of
+     *     device reset ordering in current QEMU (depth-first traversal).
+     *     It means vIOMMU now can be reset before some devices.  For fully
+     *     emulated devices that's not a problem, because the traversal
+     *     holds BQL for the whole process.  However it is a problem if DMA
+     *     can happen without BQL, like VFIO, vDPA or remote device process.
+     *
+     *     TODO: one ideal solution can be that we make vIOMMU the parent
+     *     of the whole pci host bridge.  Hence vIOMMU can be reset after
+     *     all the devices are reset and quiesced.
+     *
+     * (2) Some devices register its own reset functions
+     *
+     *     Even if above issue solved, if devices register its own reset
+     *     functions for some reason via QEMU reset hooks, vIOMMU can still
+     *     be reset before the device. One example is vfio_reset_handler()
+     *     where FLR is not supported on the device.
+     *
+     *     TODO: merge relevant reset functions into the device tree reset
+     *     framework.
+     *
+     * Neither of the above TODO may be trivial.  To make it work for now,
+     * leverage reset stages and reset vIOMMU always at latter stage of the
+     * default.  It means it needs to be reset after at least:
+     *
+     *   - resettable_cold_reset_fn(): machine qdev tree reset
+     *   - vfio_reset_handler():       VFIO reset for !FLR
+     */

What you're asking makes sense to me, because that's also what I would
prefer to consider (and that's why I mentioned my series "slightly ugly" in
the cover letter), even if I don't yet know much on the other reset
challenges QEMU is already facing.

The issue is just like what I mentioned in the cover letter: that complete
solution can be non-trivial and can take a lot of time, and I probably
wouldn't have time at least recently to persue such a solution.

To fix the issue cleanly, I assume we need to start from making sure all
VFIO paths will only use the Resettable interface to reset.

The second part will involve moving vIOMMU device in the QOM tree to be the
parent of whatever it manages.  I didn't follow recent changes in this
area, but currently vIOMMU is probably an anonymous object dangling
somewhere and more or less orphaned, while "-device" is the interface for
now to create the vIOMMU which might be too generic.  I'm not sure whether
we'll need new interface already to create the vIOMMU, e.g., it may ideally
need to be the parent of the root pci bus that it manages, one or multiple.
And we need to make sure the vIOMMU being present when the root pci buses
are created, I think.  IIUC that can be before parsing "-device"s.

For the 2nd issue, I assume the ->hold() phase for VT-d to reset address
spaces might be good enough, as long as the reset is done depth-first, then
VFIO's ->hold()s will be called before that point.  I consider after all
devices' ->hold() kicked off there should have no DMA anymore, so quit()
shouldn't hopefully matter.

However even if hold() works it is still challenging for the hierachy
change.

Considering that this issue so far shouldn't cause any functional breakage,
maybe one option is we gradually fix all above, and before that we declare
it a known issue.

Any other suggestions would also be greatly welcomed.

Thanks,

-- 
Peter Xu

Re: [PATCH 2/4] reset: Allow multiple stages of system resets

Reply via email to