On 07.03.23 14:44, Hanna Czenczek wrote:
> On 07.03.23 13:22, Fiona Ebner wrote:
>> Hi,
>>
>> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight counter
>> for TRIM BH") introduced an issue in combination with draining.
>>
>> From a debug session on a customer's machine I gathered the following
>> information:
>>
>> * The QEMU process hangs in aio_poll() called during draining and
>>   doesn't make progress.
>> * The in_flight counter for the BlockDriverState is 0 and for the
>>   BlockBackend it is 1.
>> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>>   queued_requests.
>> * The drive is attached via AHCI.
>>
>> I suspect that something like the following happened:
>>
>> 1. ide_issue_trim() is called and increments the in_flight counter.
>> 2. ide_issue_trim_cb() calls blk_aio_pdiscard().
>> 3. Somebody else starts draining.
>> 4. ide_issue_trim_cb() is called as the completion callback for
>>    blk_aio_pdiscard().
>> 5. ide_issue_trim_cb() issues yet another blk_aio_pdiscard() request.
>> 6. The request is added to the wait queue via blk_wait_while_drained(),
>>    because draining has been started.
>> 7. Nobody ever decrements the in_flight counter, and draining can't
>>    finish.
>
> Sounds about right.
>
>> The issue occurs very rarely and is difficult to reproduce, but with
>> the help of GDB, I'm able to do it rather reliably:
>>
>> 1. Use GDB to break on blk_aio_pdiscard().
>> 2. Run mkfs.ext4 on a huge disk in the guest.
>> 3. Issue a drive-backup QMP command after landing on the breakpoint.
>> 4. Continue a few times in GDB.
>> 5. After that, I can observe the same situation as described above.
>>
>> I'd be happy about suggestions for how to fix it. Unfortunately, I
>> don't see a clear-cut way at the moment. The only idea I have right
>> now is to change the code to issue all discard requests at the same
>> time, but I fear there might be pitfalls with that?
>
> The point of 7e5cdb345f was that we need any in-flight count to
> accompany a set s->bus->dma->aiocb.  While blk_aio_pdiscard() is
> happening, we don't necessarily need another count.
> But we do need it while there is no blk_aio_pdiscard().
>
> ide_issue_trim_cb() returns in two cases (and, recursively through its
> callers, leaves s->bus->dma->aiocb set):
>
> 1. After calling blk_aio_pdiscard(), which will keep an in-flight count,
> 2. After calling replay_bh_schedule_event() (i.e. qemu_bh_schedule()),
>    which does not keep an in-flight count.
>
> Perhaps we just need to move the blk_inc_in_flight() above the
> replay_bh_schedule_event() call?
While writing the commit message for this, I noticed it isn’t quite right: ide_cancel_dma_sync() drains s->blk only once, so once the in-flight counter goes to 0, s->blk is considered drained and ide_cancel_dma_sync() will go on to assert that s->bus->dma->aiocb is now NULL. However, if we do have a blk_aio_pdiscard() in flight, the drain will wait only for that one to complete, not for the whole trim operation to complete, i.e. the next discard or ide_trim_bh_cb() will be scheduled, but neither will necessarily be run before blk_drain() returns.
I’ve attached a reproducer that issues two trim requests.  Before 7e5cdb345f, it makes qemu crash because the assertion fails (one or two of the blk_aio_pdiscard()s are drained, but the trim isn’t settled yet).  After 7e5cdb345f, qemu hangs because of what you describe (the second blk_aio_pdiscard() is enqueued, so the drain can’t make progress, resulting in a deadlock).  With my proposed fix, qemu crashes again.
(The reproducer is run like this:

$ qemu-system-x86_64 -drive if=ide,file=/tmp/test.bin,format=raw )

What comes to my mind is either what you’ve proposed initially (issuing all discards simultaneously), or to still use my proposed fix, but also have ide_cancel_dma_sync() run blk_drain() in a loop until s->bus->dma->aiocb becomes NULL.  (Kind of like my original patch (https://lists.nongnu.org/archive/html/qemu-block/2022-01/msg00024.html), only that we can still use blk_drain() instead of aio_poll(), because we increment the in-flight counter while the completion BH is scheduled.)
Hanna
test.bin
Description: Binary data
format binary
use16
org 0x7c00

DMA_BUF = 0x7e00

	cld
	xor ax, ax
	mov ds, ax
	mov es, ax

	; clear DMA buffer
	mov di, DMA_BUF
	xor ax, ax
	mov cx, 256
	repnz stosw

	; two TRIM requests (both are the same: one sector, starting at sector index 1)
	mov di, DMA_BUF
	mov dword [di+ 0], 0x00000001
	mov dword [di+ 4], 0x00010000
	mov dword [di+ 8], 0x00000001
	mov dword [di+12], 0x00010000

	; find IDE PCI device
	mov ax, 0xb102
	mov dx, 0x8086
	mov cx, 0x7010
	xor si, si
	int 0x1a
	; bx has PCI address
	push bx

	; enable BM+MEM+IO
	mov di, 0x04   ; command/status
	mov ax, 0xb10a ; read config dword
	int 0x1a
	pop bx
	push bx
	or cl, 0x7     ; BM+MEM+IO
	mov di, 0x04
	mov ax, 0xb10d ; write config dword
	int 0x1a
	pop bx
	push bx

	; read BAR4 (DMA I/O space)
	mov di, 0x20   ; bar4
	mov ax, 0xb10a
	int 0x1a
	and cx, 0xfffc ; DMA I/O base
	push cx
	mov dx, cx

	; set up DMA
	xor al, al     ; status: 0
	out dx, al
	mov al, 0x04   ; clear pending interrupts
	add dx, 2
	out dx, al
	mov eax, prdt
	add dx, 2
	out dx, eax

	; send TRIM command
	mov dx, 0x1f0
	mov si, dsm_trim_cmd
	mov cx, 8
out_loop:
	lodsb
	out dx, al
	inc dx
	loop out_loop

	; start DMA transfer
	pop dx
	mov al, 0x01
	out dx, al

	; immediately reset device, cancelling ongoing TRIM
	mov dx, 0x3f6
	mov al, 0x04
	out dx, al

	cli
	hlt

dsm_trim_cmd:
	db 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0xc0, 0x06

pciaddr:
	dw ?

align(8)
prdt:
	dd DMA_BUF
	dd 512 or 0x80000000

times 510-($-$$) db 0
dw 0xaa55

times 512 db 0