> I don't think the assert you are talking about in the subject is added
> by 9972354856. That assertion was added by 86698a12f and has been
> present since QEMU 2.6. I don't see the relation immediately to
> AioContext patches.
You are right, of course. Sorry for misleading you about this. What I
meant to write was that git bisect pinpoints commit 9972354856 as the
likely culprit ("likely" because of the makeshift testing methodology
used).

> Is this only during boot/shutdown? If not, it looks like there might be
> some other errors occurring that aggravate the device state and cause a
> reset by the guest.

In fact this has never happened to me upon boot or shutdown. I believe
the operating system installed on the storage volume I am testing this
with has some kind of disk-intensive activity scheduled to run about
twenty minutes after booting. That is why I have to wait that long after
booting the VM to determine whether the issue appears.

> Anyway, what should happen is something like this:
>
> - Guest issues a reset request (ide_exec_cmd -> cmd_device_reset)
> - The device should now be "busy" and cannot accept any more requests
>   (see the conditional early in ide_exec_cmd)
> - cmd_device_reset drains any existing requests.
> - We assert that there are no handles to BH routines that have yet to
>   return.
>
> Normally I'd say this is enough, because:
>
> Although blk_drain does not prohibit future DMA transfers, it is being
> called after an explicit reset request from the guest, and so the
> device should be unable to service any further requests. After existing
> DMA commands are drained we should be unable to add any further
> requests.
>
> It generally shouldn't be possible to see new requests show up here,
> unless:
>
> (A) We are not guarding ide_exec_cmd properly and a new command is
>     sneaking in while we are trying to reset the device, or
> (B) blk_drain is not in fact doing what we expect it to (draining all
>     pending DMA from an outstanding IDE command we are servicing).

ide_cancel_dma_sync() is also invoked from bmdma_cmd_writeb(), and that
is in fact the code path taken when the assertion fails.

> Since you mentioned that you need to enable TRIM support in order to
> see the behavior, perhaps this is a function of a TRIM command being
> improperly implemented and causing the guest to panic, and we are
> indeed not draining TRIM requests properly.

I am not sure what the relation of TRIM to BMDMA is, but I still cannot
reproduce the issue without TRIM being enabled.

> That's my best wild guess, anyway. If you can't reproduce this
> elsewhere, can you run some debug version of this to see under which
> codepath we are invoking reset, and what the running command that we
> are failing to terminate is?

I recompiled QEMU with --enable-debug --extra-cflags="-ggdb -O0" and
attached the output of "bt full". If this is not enough, please let me
know.
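For reference, a small, self-contained toy model of the invariant that
the failing assertion checks, and of hypothesis (B) above (a completion
callback submitting a new request before the drain returns). This is
not QEMU source; the struct, field, and function names are made up for
illustration:

    /*
     * Toy model (NOT QEMU source) of the check in ide_cancel_dma_sync():
     * after draining, no DMA request may remain in flight.
     */
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    struct dma {
        void *aiocb;              /* non-NULL while a request is in flight */
        int   resubmits_on_drain; /* stand-in for hypothesis (B) above */
    };

    /* Models blk_drain(): completes the current request; if the completion
     * callback submits another one, the drain returns with work pending. */
    static void drain(struct dma *d)
    {
        if (d->aiocb && d->resubmits_on_drain) {
            d->aiocb = "next request";  /* a new request sneaks in */
        } else {
            d->aiocb = NULL;
        }
    }

    /* Models ide_cancel_dma_sync() as discussed in the thread; in the
     * failing backtrace it is reached from bmdma_cmd_writeb(). */
    static void cancel_dma_sync(struct dma *d)
    {
        if (d->aiocb) {
            drain(d);
        }
        assert(d->aiocb == NULL);       /* hw/ide/core.c:685 in the report */
    }

    int main(void)
    {
        struct dma ok  = { "request", 0 };
        struct dma bad = { "request", 1 };

        cancel_dma_sync(&ok);           /* passes: drain leaves nothing behind */
        printf("clean drain: assertion holds\n");
        cancel_dma_sync(&bad);          /* aborts, like the reported failure */
        return 0;
    }

The model only shows why the assertion can fire at all; whether QEMU's
blk_drain actually behaves like the "bad" case for TRIM requests is
exactly the open question in the thread.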
** Attachment added: "Output of "bt full" when the assertion fails"
   https://bugs.launchpad.net/qemu/+bug/1681439/+attachment/4860013/+files/bt-full.log

--
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1681439

Title:
  qemu-system-x86_64: hw/ide/core.c:685: ide_cancel_dma_sync: Assertion
  `s->bus->dma->aiocb == NULL' failed.

Status in QEMU:
  New

Bug description:
  Since upgrading to QEMU 2.8.0, my Windows 7 64-bit virtual machines
  started crashing due to the assertion quoted in the summary failing.
  The assertion in question was added by commit 9972354856 ("block: add
  BDS field to count in-flight requests"). My tests show that setting
  discard=unmap is needed to reproduce the issue.

  Speaking of reproduction, it is a bit flaky: I have been unable to
  come up with specific instructions that would allow the issue to be
  triggered outside of my environment, but I do have a semi-reliable
  way of testing that appears to depend on a specific initial state of
  the data on the underlying storage volume, on actions taken within
  the VM, and on waiting for about 20 minutes.

  Here is the shortest QEMU command line that I managed to reproduce
  the bug with:

  qemu-system-x86_64 \
    -machine pc-i440fx-2.7,accel=kvm \
    -m 3072 \
    -drive file=/dev/lvm/qemu,format=raw,if=ide,discard=unmap \
    -netdev tap,id=hostnet0,ifname=tap0,script=no,downscript=no,vhost=on \
    -device virtio-net-pci,netdev=hostnet0 \
    -vnc :0

  The underlying storage (/dev/lvm/qemu) is a thin LVM snapshot.

  QEMU was compiled using:

  ./configure --python=/usr/bin/python2.7 --target-list=x86_64-softmmu
  make -j3

  My virtualization environment is not really a critical one and
  reproduction is not that much of a hassle, so if you need me to
  gather further diagnostic information or test patches, I will be
  happy to help.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1681439/+subscriptions