On 07.03.23 14:44, Hanna Czenczek wrote:
On 07.03.23 13:22, Fiona Ebner wrote:
Hi,
I am suspecting that commit 7e5cdb345f ("ide: Increment BB in-flight
counter for TRIM BH") introduced an issue in combination with draining.

 From a debug session on a customer's machine I gathered the following
information:
* The QEMU process hangs in aio_poll called during draining and doesn't
progress.
* The in_flight counter for the BlockDriverState is 0 and for the
BlockBackend it is 1.
* There is a blk_aio_pdiscard_entry request in the BlockBackend's
queued_requests.
* The drive is attached via ahci.

I suspect that something like the following happened:

1. ide_issue_trim is called, and increments the in_flight counter.
2. ide_issue_trim_cb calls blk_aio_pdiscard.
3. somebody else starts draining.
4. ide_issue_trim_cb is called as the completion callback for
blk_aio_pdiscard.
5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
6. The request is added to the wait queue via blk_wait_while_drained,
because draining has been started (see the sketch after this list).
7. Nobody ever decrements the in_flight counter and draining can't finish.
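
For context, step 6 happens in blk_wait_while_drained() in
block/block-backend.c, which parks the request's coroutine while the
BlockBackend is quiesced.  Roughly like this (paraphrased sketch, not
the exact source):

static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
    assert(blk->in_flight > 0);

    if (blk->quiesce_counter && !blk->disable_request_queuing) {
        /* The request gives up its own in-flight reference while it
         * waits, but the reference taken in ide_issue_trim() is still
         * held, so the drain can never complete. */
        blk_dec_in_flight(blk);
        qemu_co_queue_wait(&blk->queued_requests, NULL);
        blk_inc_in_flight(blk);
    }
}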

Sounds about right.

The issue occurs very rarely and is difficult to reproduce, but with the
help of GDB, I'm able to do it rather reliably:
1. Use GDB to break on blk_aio_pdiscard.
2. Run mkfs.ext4 on a huge disk in the guest.
3. Issue a drive-backup QMP command after landing on the breakpoint.
4. Continue a few times in GDB.
5. After that I can observe the same situation as described above.
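
For example (the disk setup and the QMP device name here are
assumptions and depend on the actual configuration):

$ gdb --args qemu-system-x86_64 -drive if=ide,file=disk.img,format=raw ...
(gdb) break blk_aio_pdiscard
(gdb) run
  (in the guest: mkfs.ext4 /dev/sda; once the breakpoint hits, on the
   QMP monitor: {"execute": "drive-backup", "arguments":
   {"device": "ide0-hd0", "target": "/tmp/backup.img", "sync": "full"}})
(gdb) continue
(gdb) continue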

I'd be happy about suggestions for how to fix it. Unfortunately, I don't
see a clear-cut way at the moment. The only idea I have right now is to
change the code to issue all discard requests at the same time, but I
fear there might be pitfalls with that?

The point of 7e5cdb345f was that we need some in-flight count on s->blk whenever s->bus->dma->aiocb is set.  While blk_aio_pdiscard() is happening, we don’t necessarily need another count.  But we do need one while there is no blk_aio_pdiscard() in flight.

ide_issue_trim_cb() returns in two cases (and, recursively through its callers, leaves s->bus->dma->aiocb set):
1. After calling blk_aio_pdiscard(), which will keep an in-flight count,
2. After calling replay_bh_schedule_event() (i.e. qemu_bh_schedule()), which does not keep an in-flight count.
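
Paraphrased from hw/ide/core.c (a sketch of the shape only, not the
exact source), the two return paths look like this:

static void ide_issue_trim_cb(void *opaque, int ret)
{
    TrimAIOCB *iocb = opaque;
    IDEState *s = iocb->s;

    if (iocb->i >= 0 && ret >= 0) {
        while (/* TRIM range entries left to process */) {
            /* Case 1: the discard holds its own in-flight count on
             * s->blk until its completion callback runs. */
            iocb->aiocb = blk_aio_pdiscard(s->blk, sector << 9, count << 9,
                                           ide_issue_trim_cb, iocb);
            return;
        }
    }

    iocb->aiocb = NULL;
    if (iocb->bh) {
        /* Case 2: only the BH is pending now; scheduling it does not
         * take an in-flight count by itself. */
        replay_bh_schedule_event(iocb->bh);
    }
}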

Perhaps we just need to move the blk_inc_in_flight() above the replay_bh_schedule_event() call?
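
That is (untested sketch of the idea, not an actual patch; the
long-lived increment in ide_issue_trim() would be dropped):

    iocb->aiocb = NULL;
    if (iocb->bh) {
        /* Keep s->blk busy until ide_trim_bh_cb() has run; the
         * matching blk_dec_in_flight() stays in ide_trim_bh_cb(). */
        blk_inc_in_flight(s->blk);
        replay_bh_schedule_event(iocb->bh);
    }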

While writing the commit message for this, I noticed this fix isn’t quite right: ide_cancel_dma_sync() drains s->blk only once, so once the in-flight counter goes to 0, s->blk is considered drained and ide_cancel_dma_sync() will go on to assert that s->bus->dma->aiocb is now NULL.  However, if we do have a blk_aio_pdiscard() in flight, the drain will wait only for that one request to complete, not for the whole trim operation; i.e., the next discard or ide_trim_bh_cb() will be scheduled, but neither will necessarily have run before blk_drain() returns.
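
For reference, ide_cancel_dma_sync() is roughly this (paraphrased):

void ide_cancel_dma_sync(IDEState *s)
{
    if (s->bus->dma->aiocb) {
        blk_drain(s->blk);
        /* Fails when the drain only waited for one blk_aio_pdiscard()
         * and the trim as a whole has not settled yet. */
        assert(s->bus->dma->aiocb == NULL);
    }
}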

I’ve attached a reproducer that issues two trim requests.  Before 7e5cdb345f, it makes qemu crash because the assertion fails (one or two of the blk_aio_pdiscard()s is drained, but the trim isn’t settled yet).  After 7e5cdb345f, qemu hangs because of what you describe (the second blk_aio_pdiscard() is enqueued, so the drain can’t make progress, resulting in a deadlock).  With my proposed fix, qemu crashes again.

(Reproducer is run like this:
$ qemu-system-x86_64 -drive if=ide,file=/tmp/test.bin,format=raw
)

What comes to my mind is either what you’ve proposed initially (issuing all discards simultaneously), or to still use my proposed fix, but also have ide_cancel_dma_sync() run blk_drain() in a loop until s->bus->dma->aiocb becomes NULL.  (Kind of like my original patch (https://lists.nongnu.org/archive/html/qemu-block/2022-01/msg00024.html), only that we can still use blk_drain() instead of aio_poll() because we increment the in-flight counter while the completion BH is scheduled.)
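
The latter would be something like this (untested sketch):

    /* Each blk_drain() settles at least the currently in-flight
     * discard; repeat until ide_trim_bh_cb() has run and cleared
     * s->bus->dma->aiocb. */
    while (s->bus->dma->aiocb) {
        blk_drain(s->blk);
    }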

Hanna

Attachment: test.bin (reproducer; assembly source below)

format binary
use16

org 0x7c00

DMA_BUF = 0x7e00

cld

xor     ax, ax
mov     ds, ax
mov     es, ax

; clear DMA buffer
mov     di, DMA_BUF
xor     ax, ax
mov     cx, 256
rep     stosw

; two TRIM requests (both are the same: one sector, starting at sector index 1)
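; each range entry is a little-endian qword: LBA in bits 0-47,
; sector count in bits 48-63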
mov     di, DMA_BUF
mov     dword [di+ 0], 0x00000001
mov     dword [di+ 4], 0x00010000
mov     dword [di+ 8], 0x00000001
mov     dword [di+12], 0x00010000

; find IDE PCI device
mov     ax, 0xb102
mov     dx, 0x8086
mov     cx, 0x7010
xor     si, si
int     0x1a

; bx has PCI address
push    bx

; enable BM+MEM+IO

mov     di, 0x04 ; command/status
mov     ax, 0xb10a ; read config dword
int     0x1a

pop     bx
push    bx

or      cl, 0x7 ; BM+MEM+IO
mov     di, 0x04
mov     ax, 0xb10d ; write config dword
int     0x1a

pop     bx
push    bx

; read BAR4 (DMA I/O space)

mov     di, 0x20 ; bar4
mov     ax, 0xb10a
int     0x1a

and     cx, 0xfffc ; DMA I/O base
push    cx

mov     dx, cx

; set up DMA

xor     al, al ; command register (base+0): stop any ongoing transfer
out     dx, al

mov     al, 0x04 ; clear pending interrupts
add     dx, 2
out     dx, al

mov     eax, prdt
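; PRDT physical address goes to the bus master register at base+4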
add     dx, 2
out     dx, eax

; send TRIM command

mov     dx, 0x1f0
mov     si, dsm_trim_cmd
mov     cx, 8
out_loop:
lodsb
out     dx, al
inc     dx
loop    out_loop

; start DMA transfer

pop     dx
mov     al, 0x01
out     dx, al

; immediately reset device, cancelling ongoing TRIM

mov     dx, 0x3f6
mov     al, 0x04
out     dx, al

cli
hlt


dsm_trim_cmd:
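; bytes for ports 0x1f0-0x1f7, in order: data (dummy), features=0x01
; (TRIM), sector count=1 (one 512-byte block of ranges),
; LBA low/mid/high=0, device=0xc0, command=0x06 (DATA SET MANAGEMENT)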
db 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0xc0, 0x06

pciaddr:
dw ?

align(8)
prdt:
dd DMA_BUF
dd 512 or 0x80000000

times 510-($-$$) db 0
dw 0xaa55

times 512 db 0
