Alright, I'm still investigating this but wanted to share some findings... I haven't got a kernel dump yet after the task is frozen, I have analyzed only the userland part of it (although I have checked if code was running inside kernel with perf cycles:u/cycles:k at some point).
The big picture is this: Whenever qemu-img hangs, we have 3 hung tasks basically with these stacks: ---- TRHREAD #1 __GI_ppoll (../sysdeps/unix/sysv/linux/ppoll.c:39) ppoll (/usr/include/aarch64-linux-gnu/bits/poll2.h:77) qemu_poll_ns (./util/qemu-timer.c:322) os_host_main_loop_wait (./util/main-loop.c:233) main_loop_wait (./util/main-loop.c:497) convert_do_copy (./qemu-img.c:1981) img_convert (./qemu-img.c:2457) main (./qemu-img.c:4976) got stack traces: ./33293/stack ./33293/stack [<0>] __switch_to+0xc0/0x218 [<0>] __switch_to+0xc0/0x218 [<0>] ptrace_stop+0x148/0x2b0 [<0>] do_sys_poll+0x508/0x5c0 [<0>] get_signal+0x5a4/0x730 [<0>] __arm64_sys_ppoll+0xc0/0x118 [<0>] do_notify_resume+0x158/0x358 [<0>] el0_svc_common+0xa0/0x168 [<0>] work_pending+0x8/0x10 [<0>] el0_svc_handler+0x38/0x78 [<0>] el0_svc+0x8/0xc root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33293 -- sleep 10 [ perf record: Woken up 6 times to write data ] [ perf record: Captured and wrote 1.871 MB perf.data (48730 samples) ] root@d06-1:~$ perf report --stdio # Overhead Command Shared Object Symbol # ........ ........ .................. ...................... # 37.82% qemu-img libc-2.29.so [.] 0x00000000000df710 21.81% qemu-img [unknown] [k] 0xffff000010099504 14.23% qemu-img [unknown] [k] 0xffff000010085dc0 9.13% qemu-img [unknown] [k] 0xffff00001008fff8 6.47% qemu-img libc-2.29.so [.] 0x00000000000df708 5.69% qemu-img qemu-img [.] qemu_event_reset 2.57% qemu-img libc-2.29.so [.] 0x00000000000df678 0.63% qemu-img libc-2.29.so [.] 0x00000000000df700 0.49% qemu-img libc-2.29.so [.] __sigtimedwait 0.42% qemu-img libpthread-2.29.so [.] __libc_sigwait ---- TRHREAD #3 __GI___sigtimedwait (../sysdeps/unix/sysv/linux/sigtimedwait.c:29) __sigwait (linux/sigwait.c:28) qemu_thread_start (./util/qemu-thread-posix.c:498) start_thread (pthread_create.c:486) thread_start (linux/aarch64/clone.S:78) ./33303/stack ./33303/stack [<0>] __switch_to+0xc0/0x218 [<0>] __switch_to+0xc0/0x218 [<0>] ptrace_stop+0x148/0x2b0 [<0>] do_sigtimedwait.isra.9+0x194/0x288 [<0>] get_signal+0x5a4/0x730 [<0>] __arm64_sys_rt_sigtimedwait+0xac/0x110 [<0>] do_notify_resume+0x158/0x358 [<0>] el0_svc_common+0xa0/0x168 [<0>] work_pending+0x8/0x10 [<0>] el0_svc_handler+0x38/0x78 [<0>] el0_svc+0x8/0xc root@d06-1:~$ perf record -F 9999 -e cycles:u -p 33303 -- sleep 10 [ perf record: Woken up 6 times to write data ] [ perf record: Captured and wrote 1.905 MB perf.data (49647 samples) ] root@d06-1:~$ perf report --stdio # Overhead Command Shared Object Symbol # ........ ........ .................. ...................... # 45.37% qemu-img libc-2.29.so [.] 0x00000000000df710 23.52% qemu-img [unknown] [k] 0xffff000010099504 9.08% qemu-img [unknown] [k] 0xffff00001008fff8 8.89% qemu-img [unknown] [k] 0xffff000010085dc0 5.56% qemu-img libc-2.29.so [.] 0x00000000000df708 3.66% qemu-img libc-2.29.so [.] 0x00000000000df678 1.01% qemu-img libc-2.29.so [.] __sigtimedwait 0.80% qemu-img libc-2.29.so [.] 0x00000000000df700 0.64% qemu-img qemu-img [.] qemu_event_reset 0.55% qemu-img libc-2.29.so [.] 0x00000000000df718 0.52% qemu-img libpthread-2.29.so [.] __libc_sigwait ---- TRHREAD #2 syscall (linux/aarch64/syscall.S:38) qemu_futex_wait (./util/qemu-thread-posix.c:438) qemu_event_wait (./util/qemu-thread-posix.c:442) call_rcu_thread (./util/rcu.c:261) qemu_thread_start (./util/qemu-thread-posix.c:498) start_thread (pthread_create.c:486) thread_start (linux/aarch64/clone.S:78) ./33302/stack ./33302/stack [<0>] __switch_to+0xc0/0x218 [<0>] __switch_to+0xc0/0x218 [<0>] ptrace_stop+0x148/0x2b0 [<0>] ptrace_stop+0x148/0x2b0 [<0>] get_signal+0x5a4/0x730 [<0>] get_signal+0x5a4/0x730 [<0>] do_notify_resume+0x1c4/0x358 [<0>] do_notify_resume+0x1c4/0x358 [<0>] work_pending+0x8/0x10 [<0>] work_pending+0x8/0x10 <stack does not change at all> root@d06-1:~$ perf report --stdio # Overhead Command Shared Object Symbol # ........ ........ .................. ...................... # 50.30% qemu-img libc-2.29.so [.] 0x00000000000df710 26.44% qemu-img [unknown] [k] 0xffff000010099504 5.88% qemu-img libc-2.29.so [.] 0x00000000000df708 5.26% qemu-img [unknown] [k] 0xffff000010085dc0 5.25% qemu-img [unknown] [k] 0xffff00001008fff8 4.25% qemu-img libc-2.29.so [.] 0x00000000000df678 0.93% qemu-img libc-2.29.so [.] __sigtimedwait 0.51% qemu-img libc-2.29.so [.] 0x00000000000df700 0.35% qemu-img libpthread-2.29.so [.] __libc_sigwait Their stack show those tasks are pretty much "stuck" in same userland program logic, while one of them is stuck at the same program counter address. Profiling those tasks give no much information without more debugging data and less optimizations. Although all the 0x000000dfXXX addresses seem broken as we get where libc was mapped (mid heap) and we have: (gdb) print __libc_sigwait $25 = {int (const sigset_t *, int *)} 0xffffbf128080 <__GI___sigwait> ---- Anyway, continuing.... I investigated the qemu_event_{set,reset,xxx} logic. In non Linux OSes it uses pthread primitives, but, for Linux, it uses a futex() implementation with a struct QemuEvent (rcu_call_ready_event) being the one holding values (busy, set, free, etc). I got 2 hung situations: (gdb) print (struct QemuEvent) *(0xaaaacd35fce8) $16 = { value = 4294967295, initialized = true } value = 4294967295 -> THIS IS A 32-bit 0xFFFF (casting vs overflow issue ?) AND a situation where value was either 0 or 1 (like expected). In this last situation I changed things by hand to make program to continue its execution: void qemu_event_wait(QemuEvent *ev) { unsigned value; assert(ev->initialized); value = atomic_read(&ev->value); smp_mb_acquire(); if (value != EV_SET) { if (value == EV_FREE) { if (atomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) { return; } } qemu_futex_wait(ev, EV_BUSY); } } 438 in ./util/qemu-thread-posix.c 0x0000aaaaaabd4174 <+44>: mov w1, #0xffffffff // #-1 0x0000aaaaaabd4178 <+48>: ldaxr w0, [x19] 0x0000aaaaaabd417c <+52>: cmp w0, #0x1 0x0000aaaaaabd4180 <+56>: b.ne 0xaaaaaabd418c <qemu_event_wait+68> // b.any => 0x0000aaaaaabd4184 <+60>: stlxr w2, w1, [x19] 0x0000aaaaaabd4188 <+64>: cbnz w2, 0xaaaaaabd4178 <qemu_event_wait+48> 0x0000aaaaaabd418c <+68>: cbz w0, 0xaaaaaabd41cc <qemu_event_wait+132> 0x0000aaaaaabd4190 <+72>: mov w6, #0x0 // #0 0x0000aaaaaabd4194 <+76>: mov x5, #0x0 // #0 0x0000aaaaaabd4198 <+80>: mov x4, #0x0 // #0 0x0000aaaaaabd419c <+84>: mov w3, #0xffffffff // #-1 0x0000aaaaaabd41a0 <+88>: mov w2, #0x0 // #0 0x0000aaaaaabd41a4 <+92>: mov x1, x19 0x0000aaaaaabd41a8 <+96>: mov x0, #0x62 // #98 0x0000aaaaaabd41ac <+100>: bl 0xaaaaaaaff380 <syscall@plt> I unblocked it by hand, setting the program counter register to outside that logic: (gdb) print qemu_event_wait+132 $15 = (void (*)(QemuEvent *)) 0xaaaaaabd41cc <qemu_event_wait+132> (gdb) print rcu_call_ready_event $16 = {value = 1, initialized = true} (gdb) set rcu_call_ready_event->value=0 (gdb) set $pc=0xaaaaaabd41cc And it got stuck again with program counter in other STLXR instruction: (gdb) thread 2 [Switching to thread 2 (Thread 0xffffbec61d90 (LWP 33302))] #0 0x0000aaaaaabd4110 in qemu_event_reset (ev=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414 414 ./util/qemu-thread-posix.c: No such file or directory. (gdb) bt #0 0x0000aaaaaabd4110 in qemu_event_reset (ev=0xaaaaaac87ce8 <rcu_call_ready_event>) at ./util/qemu-thread-posix.c:414 #1 0x0000aaaaaabedff8 in call_rcu_thread (opaque=opaque@entry=0x0) at ./util/rcu.c:255 #2 0x0000aaaaaabd34c8 in qemu_thread_start (args=<optimized out>) at ./util/qemu-thread-posix.c:498 #3 0x0000ffffbf26a880 in start_thread (arg=0xfffffffff5bf) at pthread_create.c:486 #4 0x0000ffffbf1c4b9c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78 (gdb) print rcu_call_ready_event $20 = {value = 1, initialized = true} (gdb) disassemble qemu_event_reset Dump of assembler code for function qemu_event_reset: 0x0000aaaaaabd40f0 <+0>: ldrb w1, [x0, #4] 0x0000aaaaaabd40f4 <+4>: cbz w1, 0xaaaaaabd411c <qemu_event_reset+44> 0x0000aaaaaabd40f8 <+8>: ldr w1, [x0] 0x0000aaaaaabd40fc <+12>: dmb ishld 0x0000aaaaaabd4100 <+16>: cbz w1, 0xaaaaaabd4108 <qemu_event_reset+24> 0x0000aaaaaabd4104 <+20>: ret 0x0000aaaaaabd4108 <+24>: ldaxr w1, [x0] 0x0000aaaaaabd410c <+28>: orr w1, w1, #0x1 => 0x0000aaaaaabd4110 <+32>: stlxr w2, w1, [x0] 0x0000aaaaaabd4114 <+36>: cbnz w2, 0xaaaaaabd4108 <qemu_event_reset+24> 0x0000aaaaaabd4118 <+40>: ret And it does not matter if I continue, CPU keeps stuck in that program counter (again in a STLXR instruction) ---- So, initially I was afraid that the lack barriers (or not so strong ones being used) could have caused a race condition that would make one thread to depend on the other thread logic. Unfortunately it looks that instruction STLXR might not be behaving appropriately for this CPU/architecture as program counter seem to be stuck in the same instruction (which is super weird, by not throwing a general exception for some microcode issue, for example). But this was just an initial overview, I still have to revisit this in order interpret results better (and recompile qemu with debugging data, and possible with other GCC version). Any comments are appreciated. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1805256 Title: qemu-img hangs on high core count ARM system Status in QEMU: Confirmed Status in qemu package in Ubuntu: In Progress Bug description: On the HiSilicon D06 system - a 96 core NUMA arm64 box - qemu-img frequently hangs (~50% of the time) with this command: qemu-img convert -f qcow2 -O qcow2 /tmp/cloudimg /tmp/cloudimg2 Where "cloudimg" is a standard qcow2 Ubuntu cloud image. This qcow2->qcow2 conversion happens to be something uvtool does every time it fetches images. Once hung, attaching gdb gives the following backtrace: (gdb) bt #0 0x0000ffffae4f8154 in __GI_ppoll (fds=0xaaaae8a67dc0, nfds=187650274213760, timeout=<optimized out>, timeout@entry=0x0, sigmask=0xffffc123b950) at ../sysdeps/unix/sysv/linux/ppoll.c:39 #1 0x0000aaaabbefaf00 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/aarch64-linux-gnu/bits/poll2.h:77 #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322 #3 0x0000aaaabbefbf80 in os_host_main_loop_wait (timeout=-1) at util/main-loop.c:233 #4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497 #5 0x0000aaaabbe2aa30 in convert_do_copy (s=0xffffc123bb58) at qemu-img.c:1980 #6 img_convert (argc=<optimized out>, argv=<optimized out>) at qemu-img.c:2456 #7 0x0000aaaabbe2333c in main (argc=7, argv=<optimized out>) at qemu-img.c:4975 Reproduced w/ latest QEMU git (@ 53744e0a182) To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1805256/+subscriptions