Kevin Wolf <kw...@redhat.com> writes:

> Am 08.04.2021 um 11:21 hat Markus Armbruster geschrieben:
>> Kevin Wolf <kw...@redhat.com> writes:
>>
>> > Am 22.03.2021 um 16:40 hat Stefan Reiter geschrieben:
>> >> The QMP dispatcher coroutine holds the qmp_queue_lock over a yield
>> >> point, where it expects to be rescheduled from the main context. If a
>> >> CHR_EVENT_CLOSED event is received just then, it can race and block the
>> >> main thread on the mutex in monitor_qmp_cleanup_queue_and_resume.
>> >>
>> >> monitor_resume does not need to be called from main context, so we can
>> >> call it immediately after popping a request from the queue, which allows
>> >> us to drop the qmp_queue_lock mutex before yielding.
>> >>
>> >> Suggested-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
>> >> Signed-off-by: Stefan Reiter <s.rei...@proxmox.com>
>> >> ---
>> >> v2:
>> >> * different approach: move everything that needs the qmp_queue_lock
>> >>   mutex before the yield point, instead of moving the event handling
>> >>   to a different context
>> >
>> > The interesting new case here seems to be that new requests could be
>> > queued and the dispatcher coroutine could be kicked before yielding.
>> > This is safe because &qmp_dispatcher_co_busy is accessed with atomics
>> > on both sides.
>> >
>> > The important part is just that the first (conditional) yield stays
>> > first, so that the aio_co_wake() in handle_qmp_command() won't reenter
>> > the coroutine while it is expecting to be reentered from somewhere else.
>> > This is still the case after the patch.
>> >
>> > Reviewed-by: Kevin Wolf <kw...@redhat.com>
>>
>> Thanks for saving me from an ugly review headache.
>>
>> Should this go into 6.0?
>
> This is something that the responsible maintainer needs to decide.
Yes, and that's me.  I'm soliciting opinions.

> If it helps you with the decision, and if I understand correctly, it is
> a regression from 5.1, but was already broken in 5.2.

It helps.  Even more helpful would be a risk assessment: what's the
risk of applying this patch now vs. delaying it?

If I understand Stefan correctly, Proxmox observed VM hangs.  How
frequent are these hangs?  Did they result in data corruption?  How
confident do we feel about the fix?