On Tue, 19 Sept 2023 at 06:26, Kevin Wolf <kw...@redhat.com> wrote:
>
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> > Hi Kevin,
> > I believe that my own commit "block-coroutine-wrapper: use
> > qemu_get_current_aio_context()" breaks this test. The failure is
> > non-deterministic (happens about 1 out of 4 runs).
> >
> > It seems the job hangs and the test times out in vm.run_job('job1',
> > wait=5.0).
> >
> > I haven't debugged it yet but wanted to share this information to save
> > some time. Tomorrow I'll investigate further.
>
> Yes, it's relatively easily reproducible if I run the test in a loop,
> and I can't seem to reproduce it without the last patch. Should I
> unstage the full series again, or do you think that the last patch is
> really optional this time?

Please drop the last patch. I'm not aware of dependencies on the last
patch.

> However, I'm unsure how the stack traces I'm seeing are related to your
> patch. Maybe it just made an existing bug more likely to be triggered?

I'll share my thoughts once I've looked at the crash today.

Regarding AioContext lock removal: I'll work on that and see what still
depends on the lock.

Stefan

> What I'm seeing is that the reader lock is held by an iothread that is
> waiting for its AioContext lock to make progress:
>
> Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
> #0  0x00007f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
> #1  0x00007f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
> #2  0x000055b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
> #3  0x000055b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
> #4  0x000055b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at ../util/async.c:728
> #5  0x000055b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at ../util/async.c:565
> #6  0x000055b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
> #7  0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at ../util/async.c:216
> #8  0x000055b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at ../util/aio-posix.c:722
> #9  0x000055b7b1518604 in iothread_run (opaque=0x55b7b2904460) at ../iothread.c:63
> #10 0x000055b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at ../util/qemu-thread-posix.c:541
> #11 0x00007f81250ae15d in start_thread () at /lib64/libc.so.6
> #12 0x00007f812512fc00 in clone3 () at /lib64/libc.so.6
>
> On the other hand, the main thread wants to acquire the writer lock,
> but it holds the AioContext lock of the iothread (it takes it in
> job_prepare_locked()):
>
> Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
> #0  0x00007f8125122356 in ppoll () at /lib64/libc.so.6
> #1  0x000055b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) at ../util/qemu-timer.c:339
> #2  0x000055b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x000055b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at ../util/aio-posix.c:670
> #4  0x000055b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
> #5  0x000055b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at ../block/block-backend.c:916
> #6  0x000055b7b1554779 in blk_delete (blk=0x55b7b4425800) at ../block/block-backend.c:497
> #7  0x000055b7b1554133 in blk_unref (blk=0x55b7b4425800) at ../block/block-backend.c:557
> #8  0x000055b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at ../block/mirror.c:696
> #9  0x000055b7b1577015 in mirror_prepare (job=0x55b7b4419000) at ../block/mirror.c:807
> #10 0x000055b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at ../job.c:988
> #11 0x000055b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
> #12 0x000055b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at ../job.c:1011
> #13 0x000055b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) at ../job.c:1068
> #14 0x000055b7b1539372 in job_completed_locked (job=0x55b7b4419000) at ../job.c:1082
> #15 0x000055b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
> #16 0x000055b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
> #17 0x000055b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at ../util/async.c:216
> #18 0x000055b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at ../util/aio-posix.c:423
> #19 0x000055b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, callback=0x0, user_data=0x0) at ../util/async.c:358
> #20 0x00007f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:3454
> #21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
> #22 0x000055b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
> #23 0x000055b7b1729572 in os_host_main_loop_wait (timeout=27462700) at ../util/main-loop.c:313
> #24 0x000055b7b1729452 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #25 0x000055b7b119a1eb in qemu_main_loop () at ../softmmu/runstate.c:772
> #26 0x000055b7b14c102d in qemu_default_main () at ../softmmu/main.c:37
> #27 0x000055b7b14c1068 in main (argc=44, argv=0x7ffc90b05d58) at ../softmmu/main.c:48
>
> At first I thought we just need to look into the AioContext locking in
> job completion and drop it in the right places.
>
> But in fact, first of all, blk_remove_bs() needs to make up its mind if
> it wants the caller to hold the AioContext or not and document that.
> Because it calls both bdrv_drained_begin() (which requires holding the
> AioContext lock) and bdrv_graph_wrlock(NULL) (which forbids it).
>
> If we could fully get rid of the AioContext lock (as we originally
> stated as a goal), that would automatically solve this kind of
> deadlocks.
>
> Kevin
>
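[Editor's note: the circular wait in the two backtraces above can be reduced to a
standalone pthreads sketch. This is purely illustrative, not QEMU code, and all
names are invented: a rwlock stands in for the block graph lock, a plain mutex
for the iothread's AioContext lock. Built with "cc -pthread deadlock.c", it
hangs by design, mirroring the hang seen in the test.]

/* deadlock.c - minimal sketch of the lock-order inversion described above.
 * Illustrative names only; this is not QEMU code. "graph" plays the role of
 * the block graph reader/writer lock, "aio_ctx" the role of the iothread's
 * AioContext lock. This program deadlocks by design.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t graph = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t aio_ctx = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the iothread: already a graph reader, then needs its
 * AioContext lock to make progress. */
static void *iothread_like(void *arg)
{
    pthread_rwlock_rdlock(&graph);   /* reader lock held */
    sleep(1);                        /* force the unlucky interleaving */
    pthread_mutex_lock(&aio_ctx);    /* blocks: the other thread holds it */
    pthread_mutex_unlock(&aio_ctx);
    pthread_rwlock_unlock(&graph);
    return NULL;
}

/* Stands in for the main thread: holds the AioContext lock around job
 * completion, then wants the writer lock, which must wait for all readers. */
static void *main_loop_like(void *arg)
{
    pthread_mutex_lock(&aio_ctx);    /* held across job completion */
    sleep(1);
    pthread_rwlock_wrlock(&graph);   /* blocks: the reader never drains */
    pthread_rwlock_unlock(&graph);
    pthread_mutex_unlock(&aio_ctx);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, iothread_like, NULL);
    pthread_create(&b, NULL, main_loop_like, NULL);
    pthread_join(a, NULL);           /* never returns: both threads are stuck */
    pthread_join(b, NULL);
    printf("not reached\n");
    return 0;
}

[Breaking either edge of the cycle resolves it, which is why fully removing the
AioContext lock, as discussed above, would make this class of deadlock
impossible.]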