Note that the RCU thread is expected to sit most of the time doing nothing, so I don't think this matters.
Zhengui's theory that notify_me doesn't work properly on ARM is more
promising, but he couldn't provide a clear explanation of why he thought
notify_me is involved. In particular, I would have expected notify_me to
be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:

    glib_pollfds_fill
      g_main_context_prepare
        aio_ctx_prepare
          atomic_or(&ctx->notify_me, 1)
    qemu_poll_ns
    glib_pollfds_poll
      g_main_context_check
        aio_ctx_check
          atomic_and(&ctx->notify_me, ~1)
      g_main_context_dispatch
        aio_ctx_dispatch
          /* do something for event */
            qemu_poll_ns

but all backtraces show thread 1 in os_host_main_loop_wait:

Thread 1 (Thread 0x40000b573370 (LWP 27214)):
#0  0x000040000a489020 in ppoll () from /lib64/libc.so.6
#1  0x0000aaaaaadaefc0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at qemu_timer.c:391
#3  0x0000aaaaaadae014 in os_host_main_loop_wait (timeout=<optimized out>) at main_loop.c:272
#4  0x0000aaaaaadae190 in main_loop_wait (nonblocking=<optimized out>) at main_loop.c:534
#5  0x0000aaaaaad97be0 in convert_do_copy (s=0xffffdc32eb48) at qemu-img.c:1923
#6  0x0000aaaaaada2d70 in img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2414
#7  0x0000aaaaaad99ac4 in main (argc=7, argv=<optimized out>) at qemu-img.c:5305

Can you put your util/async.o object file somewhere for me to look at?
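
To make the notify_me window concrete, here is a minimal model of the sequence above. It is a sketch under stated assumptions, not the QEMU code: SketchCtx and the sketch_* functions are hypothetical names, the "kicked" flag stands in for the event notifier, and it assumes the notifier is only kicked while notify_me is nonzero. It uses C11 atomics with default ordering rather than QEMU's own atomic macros and barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of AioContext. */
typedef struct {
    atomic_uint notify_me;  /* bit 0 set while the loop may block in poll */
    atomic_bool kicked;     /* stands in for the event notifier */
} SketchCtx;

static void sketch_init(SketchCtx *ctx)
{
    atomic_init(&ctx->notify_me, 0u);
    atomic_init(&ctx->kicked, false);
}

/* Mirrors atomic_or(&ctx->notify_me, 1) in aio_ctx_prepare. */
static void sketch_prepare(SketchCtx *ctx)
{
    atomic_fetch_or(&ctx->notify_me, 1u);
}

/* Mirrors atomic_and(&ctx->notify_me, ~1) in aio_ctx_check. */
static void sketch_check(SketchCtx *ctx)
{
    atomic_fetch_and(&ctx->notify_me, ~1u);
}

/*
 * Assumed notifier side: kick the poller only if it announced that it
 * may block.  Returns true if a kick was delivered.
 */
static bool sketch_notify(SketchCtx *ctx)
{
    if (atomic_load(&ctx->notify_me) != 0) {
        atomic_store(&ctx->kicked, true);
        return true;
    }
    return false;
}
```

In this model a notification that arrives between sketch_prepare and sketch_check produces a kick, while one that arrives after sketch_check does not. That is why a qemu_poll_ns call reached from aio_ctx_dispatch would be the suspicious case, and a poll in os_host_main_loop_wait is not.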
Anyway:

On 11/09/19 04:15, Rafael David Tinoco wrote:
> I've caught the following stack trace after an HUNG in qemu-img convert:
>
> (gdb) bt
> #0 syscall ()
> #1 0x0000aaaaaabd41cc in qemu_futex_wait
> #2 qemu_event_wait (ev=ev@entry=0xaaaaaac86ce8 <rcu_call_ready_event>)
> #3 0x0000aaaaaabed05c in call_rcu_thread
> #4 0x0000aaaaaabd34c8 in qemu_thread_start
> #5 0x0000ffffbf25c880 in start_thread
> #6 0x0000ffffbf1b6b9c in thread_start ()
>
> (gdb) print rcu_call_ready_event
> $4 = {value = 4294967295, initialized = true}
>
> value INT_MAX (4294967295) seems WRONG for qemu_futex_wait():

This is UINT_MAX, not INT_MAX. qemu_futex_wait() doesn't care about the
signedness of the value, which is why it is declared as void *. (That
said, changing "ev" to "&ev->value" would be nicer indeed).

> - EV_BUSY, being -1, and passed as an argument qemu_futex_wait(void *,
> unsigned), is a two's complement, making argument into a INT_MAX when
> that's not what is expected (unless I missed something).
>
> *** If that is the case, unsure if you, Paolo, prefer declaring
> *(QemuEvent)->value as an integer or changing EV_BUSY to "2" would okay
> here ***

You could change it to 3, but it has to have all the bits in EV_FREE
(see atomic_or(&ev->value, EV_FREE) in qemu_event_reset). You could
also change it to -1u, but I don't see a particular need to do so.

> BUG: description:
> https://bugs.launchpad.net/qemu/+bug/1805256/comments/15
>
> ========
> ISSUE #2
> ========
>
> I found this when debugging lockups while in futex() in a specific ARM64
> server - https://bugs.launchpad.net/qemu/+bug/1805256 - which I'm still
> investigating.
>
> After fixing the issue above, I'm still getting stuck into:

If you changed it to 2, it's wrong.

> - Should qemu_event_set() check return code from
> qemu_futex_wake()->qemu_futex()->syscall() in order to know if ANY
> waiter was ever woken up ? Maybe even loop until at least 1 is awaken ?

Why would it need to do so?

Paolo
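
P.S. The constraint on EV_BUSY can be checked with a few lines of C. This is a minimal sketch, not the QEMU source: EV_BUSY being -1 comes from the discussion above, while EV_SET = 0 and EV_FREE = 1 are assumed values, and ev_after_reset and busy_survives_reset are hypothetical helpers that model the atomic_or in qemu_event_reset without the atomicity.

```c
#include <assert.h>
#include <limits.h>

/* EV_BUSY is -1 per the thread; EV_SET and EV_FREE values are assumed. */
#define EV_SET   0
#define EV_FREE  1
#define EV_BUSY  (-1)

/* Models atomic_or(&ev->value, EV_FREE) from qemu_event_reset (non-atomic). */
static unsigned ev_after_reset(unsigned value)
{
    return value | EV_FREE;
}

/*
 * A candidate BUSY value is only safe if a reset leaves it unchanged,
 * i.e. if it already contains every bit of EV_FREE.
 */
static int busy_survives_reset(unsigned busy)
{
    return ev_after_reset(busy) == busy;
}
```

Under these assumptions, (unsigned)EV_BUSY equals UINT_MAX, which is the 4294967295 that gdb printed. busy_survives_reset holds for -1u and for 3, but not for 2, because a reset would silently turn 2 into 3.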