On Thu, Apr 2, 2015 at 5:43 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 02/04/2015 18:26, Stefan Hajnoczi wrote:
>> John Snow has reported that qemu-io can hang when the host is under
>> heavy load. He made the following observations in gdb:
>>
>> 1. The program is sitting in aio_poll() (called by bdrv_prwv_co())
>>    waiting for request completion.
>>
>> 2. The thread pool has a ThreadPoolElement with ->state == THREAD_DONE.
>>
>> The ThreadPoolElement should have been reaped by
>> thread_pool_completion_bh() and its callback invoked. For some reason
>> this didn't happen and the program is blocked in poll(2) waiting.
>>
>> This suggests a race condition in thread-pool.c or qemu_bh_schedule()
>> (used to complete ThreadPoolElement from a QEMU event loop).
>>
>> I don't have a good theory why this happens yet. Just wanted to share
>> in case someone else hits this problem.
>
> Laszlo hit something very similar fairly easily with virtio-scsi (but
> not virtio-blk!) on aarch64 hosts. Any attempt to debug it (ranging
> from compilation with -O0 to tracing) made it disappear. A reliable
> reproducer with qemu-io would be a dream...

My initial speculation was that this check in qemu_bh_schedule():

  if (bh->scheduled)
      return;

is causing us to skip BH invocations.

When I look at the code, the lack of explicit barriers or atomic
operations for bh->scheduled itself is a little suspicious.

But now I'm focussing more on thread-pool.c since that has its own
threading constraints.

Stefan
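
For illustration only, here is a minimal sketch of what an atomic
test-and-set on a "scheduled" flag could look like, so that a schedule
request cannot be lost between the check and the store. This is not
QEMU's code: DemoBH, demo_bh_schedule() and demo_bh_poll() are
hypothetical names, and the sketch assumes C11 <stdatomic.h> rather than
QEMU's own atomic helpers. It says nothing about whether this check is
the actual cause of the hang.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bottom half; NOT QEMU's QEMUBH. */
typedef struct DemoBH {
    atomic_bool scheduled;  /* set by producers, cleared by the event loop */
    int runs;               /* how many times the "callback" has run */
} DemoBH;

/*
 * Producer side. The exchange reads the old value and stores true in one
 * atomic step, so exactly one caller observes the false -> true transition.
 * Returns true if this call newly scheduled the BH, false if it was
 * already pending.
 */
static bool demo_bh_schedule(DemoBH *bh)
{
    return !atomic_exchange(&bh->scheduled, true);
}

/*
 * Event loop side. The flag is cleared atomically *before* the callback
 * body runs, so a schedule request arriving during the callback is kept
 * for the next poll instead of being dropped.
 * Returns true if the callback ran.
 */
static bool demo_bh_poll(DemoBH *bh)
{
    if (!atomic_exchange(&bh->scheduled, false)) {
        return false;       /* nothing pending */
    }
    bh->runs++;             /* stands in for invoking the BH callback */
    return true;
}
```

With this shape, a second demo_bh_schedule() call before the poll is a
harmless no-op, and one made after the poll has cleared the flag sets it
again, so the next poll picks it up.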