On Thu, May 22, 2025 at 10:03 AM Akihiko Odaki <akihiko.od...@daynix.com> wrote: > > On 2025/05/22 15:45, Alex Bennée wrote: > > Akihiko Odaki <akihiko.od...@daynix.com> writes: > > > >> On 2025/05/22 1:42, Alex Bennée wrote: > >>> From: Manos Pitsidianakis <manos.pitsidiana...@linaro.org> > >>> This commit fixes an indefinite hang when using VIRTIO GPU blob > >>> objects > >>> under TCG in certain conditions. > >>> The VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB VIRTIO command creates a > >>> MemoryRegion and attaches it to an offset on a PCI BAR of the > >>> VirtIOGPUdevice. The VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB command unmaps > >>> it. > >>> Because virglrenderer commands are not thread-safe they are only > >>> called on the main context and QEMU performs the cleanup in three steps > >>> to prevent a use-after-free scenario where the guest can access the > >>> region after it’s unmapped: > >>> 1. From the main context, the region’s field finish_unmapping is > >>> false > >>> by default, so it sets a variable cmd_suspended, increases the > >>> renderer_blocked variable, deletes the blob subregion, and unparents > >>> the blob subregion causing its reference count to decrement. > >>> 2. From an RCU context, the MemoryView gets freed, the FlatView gets > >>> recalculated, the free callback of the blob region > >>> virtio_gpu_virgl_hostmem_region_free is called which sets the > >>> region’s field finish_unmapping to true, allowing the main thread > >>> context to finish replying to the command > >>> 3. From the main context, the command is processed again, but this > >>> time > >>> finish_unmapping is true, so virgl_renderer_resource_unmap can be > >>> called and a response is sent to the guest. > >>> It happens so that under TCG, if the guest has no timers configured > >>> (and > >>> thus no interrupt will cause the CPU to exit), the RCU thread does not > >>> have enough time to grab the locks and recalculate the FlatView. > >>> That’s not a big problem in practice since most guests will assume a > >>> response will happen later in time and go on to do different things, > >>> potentially triggering interrupts and allowing the RCU context to run. > >>> If the guest waits for the unmap command to complete though, it blocks > >>> indefinitely. Attaching to the QEMU monitor and force quitting the guest > >>> allows the cleanup to continue. > >>> There's no reason why the FlatView recalculation can't occur right > >>> away > >>> when we delete the blob subregion, however. It does not, because when we > >>> create the subregion we set the object as its own parent: > >>> memory_region_init_ram_ptr(mr, OBJECT(mr), "blob", size, data); > >>> The extra reference is what prevents freeing the memory region > >>> object in > >>> the memory transaction of deleting the subregion. > >>> This commit changes the owner object to the device, which removes > >>> the > >>> extra owner reference in the memory region and causes the MR to be > >>> freed right away in the main context. > >>> Acked-by: Michael S. Tsirkin <m...@redhat.com> > >>> Signed-off-by: Manos Pitsidianakis <manos.pitsidiana...@linaro.org> > >>> Reviewed-by: Alex Bennée <alex.ben...@linaro.org> > >>> Tested-by: Alex Bennée <alex.ben...@linaro.org> > >>> Message-Id: <20250410122643.1747913-3-manos.pitsidiana...@linaro.org> > >>> Cc: qemu-sta...@nongnu.org > >>> --- > >>> hw/display/virtio-gpu-virgl.c | 2 +- > >>> 1 file changed, 1 insertion(+), 1 deletion(-) > >>> diff --git a/hw/display/virtio-gpu-virgl.c > >>> b/hw/display/virtio-gpu-virgl.c > >>> index 71a7500de9..8fbe4e70cc 100644 > >>> --- a/hw/display/virtio-gpu-virgl.c > >>> +++ b/hw/display/virtio-gpu-virgl.c > >>> @@ -112,7 +112,7 @@ virtio_gpu_virgl_map_resource_blob(VirtIOGPU *g, > >>> vmr->g = g; > >>> mr = g_new0(MemoryRegion, 1); > >>> - memory_region_init_ram_ptr(mr, OBJECT(mr), "blob", size, > >>> data); > >>> + memory_region_init_ram_ptr(mr, OBJECT(g), "blob", size, data); > >>> memory_region_add_subregion(&b->hostmem, offset, mr); > >>> memory_region_set_enabled(mr, true); > >>> > >> > >> I suggest dropping this patch for now due to the reason I pointed out > >> for the first version of this series. > > > > This fixes an actual bug - without it we get a hang. > > > > I understand that but it also introduces a regression; "[PATCH v3 14/20] > ui/gtk-gl-area: Remove extra draw call in refresh" is also a similar case. > > Ideally such a bug should be fixed without regression, but I understand > it is sometimes difficult to do that and postponing the bug resolution > until figuring out the correct way does not make sense. > > In such a case, a bug should be fixed minimizing the regression and the > documentation of the regression should be left in the code. > > In particular, this patch can cause use-after-free whether TCG is used > or not. Instead, I suggest to avoid freeing memory regions at all on > TCG. It will surely leak memory, but won't result in use-after-free at > least and the other accelerators will be unaffected. > > Regards, > Akihiko Odaki
We tested this fix with ASAN and didn't see anything. Do you have a test case in mind that can reproduce this use-after-free? It'd help make a certain decision on whether to drop this patch or not. I'm not doubting that this can cause a use-after-free by the way, it's just that it is hypothetical only. If it causes a use-after-free for sure we should definitely drop it. > Instead, I suggest to avoid freeing memory regions at all on > TCG. It will surely leak memory, but won't result in use-after-free at > least and the other accelerators will be unaffected. Leaking memory for blob objects is also not ideal, since they are frequently allocated. It's memory-safe but the leak can accumulate over time. -- Manos Pitsidianakis Emulation and Virtualization Engineer at Linaro Ltd