On 04/11/2025 09.08, Akihiko Odaki wrote:
On 2025/11/04 16:41, Thomas Huth wrote:
On 04/11/2025 02.45, Akihiko Odaki wrote:
On 2025/11/03 22:59, Thomas Huth wrote:
On 28/10/2025 18.34, Paolo Bonzini wrote:
From: Akihiko Odaki <[email protected]>

Borrow the concept of force quiescent state from Linux to ensure readers
remain fast during normal operation and to avoid stalls.
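The "force quiescent state" idea named above can be sketched as a toy model: readers stay cheap on the common path, and a writer whose grace period drags on asks readers to report their next quiescent state immediately instead of polling on a slow timer. This is an invented, simplified illustration only — it is NOT QEMU's util/rcu.c or Linux's implementation, and all names here (ToyRcu, synchronize, fqs_poll) are hypothetical.

```python
import threading

class ToyRcu:
    def __init__(self):
        self.cond = threading.Condition()
        self.readers = {}             # thread id -> read-side nesting depth
        self.force_quiescent = False  # writer asks readers to report ASAP

    def read_lock(self):
        tid = threading.get_ident()
        with self.cond:
            self.readers[tid] = self.readers.get(tid, 0) + 1

    def read_unlock(self):
        tid = threading.get_ident()
        with self.cond:
            self.readers[tid] -= 1
            if self.readers[tid] == 0:
                del self.readers[tid]
                # The outermost unlock is a quiescent state: if the
                # writer requested a forced quiescent state, wake it
                # right away instead of letting it poll on a timer.
                if self.force_quiescent:
                    self.cond.notify_all()

    def synchronize(self, fqs_poll=0.01):
        """Block until every reader that was active on entry has left."""
        with self.cond:
            while self.readers:
                # Grace period is dragging on: request a forced
                # quiescent state so the next outermost read_unlock
                # notifies us directly; otherwise just keep polling.
                self.force_quiescent = True
                self.cond.wait(timeout=fqs_poll)
            self.force_quiescent = False
```

The point of the split is that readers only touch a per-thread counter while no grace period is stalled; the extra notification work happens solely when the writer has explicitly requested it.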

  Hi Akihiko,

looks like this commit has introduced a regression with the "replay" functional test on the alpha target.
When I run something like:

  pyvenv/bin/meson test --no-rebuild -t 1 --setup thorough \
   --num-processes 1 --repeat 10 func-alpha-replay

in the build folder, approx. half of the test runs are failing for me now.

I bisected the issue to this patch here - when I rebuild qemu-system-alpha with the commit right before this change, the above test runs work fine, so I'm pretty sure that the problem has been introduced by this commit.

Could you please have a look?

I cannot reproduce it with commit 55d98e3edeeb ("rcu: Unify force quiescent state").

Can you provide meson-logs/testlog-thorough.txt so that I can look into the failure you are facing? If you have anything else that could be useful for debugging, please share it with me too.

There's not much in that testlog-thorough.txt that could be helpful here,
it's basically just the information that the test has been killed due to
the timeout:

==================================== 1/1 ===================================
test:         qemu:func-thorough+func-alpha-thorough+thorough / func-alpha-replay
start time:   07:25:26
duration:     90.01s
result:       killed by signal 15 SIGTERM
command:      RUST_BACKTRACE=1 QEMU_TEST_QEMU_IMG=/tmp/qemu-rcu/qemu-img QEMU_TEST_GDB=/usr/bin/gdb MALLOC_PERTURB_=255 MESON_TEST_ITERATION=1 PYTHONPATH=/home/thuth/devel/qemu/python:/home/thuth/devel/qemu/tests/functional G_TEST_SLOW=1 SPEED=thorough QEMU_TEST_QEMU_BINARY=/tmp/qemu-rcu/qemu-system-alpha ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 LD_LIBRARY_PATH=/tmp/qemu-rcu/tests/tcg/plugins:/tmp/qemu-rcu/contrib/plugins UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 QEMU_BUILD_ROOT=/tmp/qemu-rcu /tmp/qemu-rcu/pyvenv/bin/python3 /home/thuth/devel/qemu/tests/functional/alpha/test_replay.py
=============================================================================

Summary of Failures:

1/1 qemu:func-thorough+func-alpha-thorough+thorough / func-alpha-replay TIMEOUT        90.01s   killed by signal 15 SIGTERM


There is also not that much helpful information in
tests/functional/alpha/test_replay.AlphaReplay.test_clipper, apart from
the console.log file. For a good run, the console log looks like this:

2025-11-04 08:16:46,148: PCI: 00:01:0 class 0101 id 1095:0646
2025-11-04 08:16:46,149: PCI:   region 0: 0000c000
2025-11-04 08:16:46,149: PCI:   region 1: 0000c008
2025-11-04 08:16:46,149: PCI:   region 2: 0000c010
2025-11-04 08:16:46,149: PCI:   region 3: 0000c018
2025-11-04 08:16:46,149: PCI:   region 4: 0000c020
2025-11-04 08:16:46,149: PCI: 00:07:0 class 0601 id 8086:0484
2025-11-04 08:16:48,149: [    0.000000] Initializing cgroup subsys cpu
2025-11-04 08:16:48,149: [    0.000000] Linux version 2.6.26-2-alpha-generic (Debian 2.6.26-29) ([email protected]) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 Sun Mar 4 21:08:03 UTC 2012
2025-11-04 08:16:48,150: [    0.000000] Booting GENERIC on Tsunami variation Clipper using machine vector Clipper from SRM
2025-11-04 08:16:48,150: [    0.000000] Major Options: MAGIC_SYSRQ
2025-11-04 08:16:48,150: [    0.000000] Command line: printk.time=0 console=ttyS0
2025-11-04 08:16:48,150: [    0.000000] memcluster 0, usage 1, start        0, end       15
2025-11-04 08:16:48,150: [    0.000000] memcluster 1, usage 0, start 15, end    16384
2025-11-04 08:16:48,150: [    0.000000] freeing pages 15:2048
2025-11-04 08:16:48,150: [    0.000000] freeing pages 2987:16384
2025-11-04 08:16:48,151: [    0.000000] reserving pages 2987:2988
2025-11-04 08:16:48,151: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16272
2025-11-04 08:16:48,151: [    0.000000] Kernel command line: printk.time=0 console=ttyS0
2025-11-04 08:16:57,358: PCI: 00:01:0 class 0101 id 1095:0646
2025-11-04 08:16:57,358: PCI:   region 0: 0000c000
2025-11-04 08:16:57,358: PCI:   region 1: 0000c008
2025-11-04 08:16:57,359: PCI:   region 2: 0000c010
2025-11-04 08:16:57,359: PCI:   region 3: 0000c018
2025-11-04 08:16:57,359: PCI:   region 4: 0000c020
2025-11-04 08:16:57,360: PCI: 00:07:0 class 0601 id 8086:0484
2025-11-04 08:17:08,468: [    0.000000] Initializing cgroup subsys cpu
2025-11-04 08:17:08,470: [    0.000000] Linux version 2.6.26-2-alpha-generic (Debian 2.6.26-29) ([email protected]) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 Sun Mar 4 21:08:03 UTC 2012
2025-11-04 08:17:08,471: [    0.000000] Booting GENERIC on Tsunami variation Clipper using machine vector Clipper from SRM
2025-11-04 08:17:08,471: [    0.000000] Major Options: MAGIC_SYSRQ
2025-11-04 08:17:08,472: [    0.000000] Command line: printk.time=0 console=ttyS0
2025-11-04 08:17:08,472: [    0.000000] memcluster 0, usage 1, start        0, end       15
2025-11-04 08:17:08,473: [    0.000000] memcluster 1, usage 0, start 15, end    16384
2025-11-04 08:17:08,473: [    0.000000] freeing pages 15:2048
2025-11-04 08:17:08,474: [    0.000000] freeing pages 2987:16384
2025-11-04 08:17:08,474: [    0.000000] reserving pages 2987:2988
2025-11-04 08:17:08,475: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16272
2025-11-04 08:17:08,476: [    0.000000] Kernel command line: printk.time=0 console=ttyS0

I.e. the replay worked as expected. When it fails, console.log only contains:

2025-11-04 08:25:26,601: PCI: 00:01:0 class 0101 id 1095:0646
2025-11-04 08:25:26,601: PCI:   region 0: 0000c000
2025-11-04 08:25:26,601: PCI:   region 1: 0000c008
2025-11-04 08:25:26,601: PCI:   region 2: 0000c010
2025-11-04 08:25:26,601: PCI:   region 3: 0000c018
2025-11-04 08:25:26,601: PCI:   region 4: 0000c020
2025-11-04 08:25:26,602: PCI: 00:07:0 class 0601 id 8086:0484
2025-11-04 08:25:28,601: [    0.000000] Initializing cgroup subsys cpu
2025-11-04 08:25:28,602: [    0.000000] Linux version 2.6.26-2-alpha-generic (Debian 2.6.26-29) ([email protected]) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 Sun Mar 4 21:08:03 UTC 2012
2025-11-04 08:25:28,602: [    0.000000] Booting GENERIC on Tsunami variation Clipper using machine vector Clipper from SRM
2025-11-04 08:25:28,602: [    0.000000] Major Options: MAGIC_SYSRQ
2025-11-04 08:25:28,602: [    0.000000] Command line: printk.time=0 console=ttyS0
2025-11-04 08:25:28,602: [    0.000000] memcluster 0, usage 1, start        0, end       15
2025-11-04 08:25:28,602: [    0.000000] memcluster 1, usage 0, start 15, end    16384
2025-11-04 08:25:28,602: [    0.000000] freeing pages 15:2048
2025-11-04 08:25:28,603: [    0.000000] freeing pages 2987:16384
2025-11-04 08:25:28,603: [    0.000000] reserving pages 2987:2988
2025-11-04 08:25:28,603: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16272
2025-11-04 08:25:28,603: [    0.000000] Kernel command line: printk.time=0 console=ttyS0

I.e. the replay did not work.

Could this RCU stuff somehow influence the replay mechanism in QEMU?

I don't know (yet).

Can you attach gdb and show a backtrace for each thread? That often reveals a deadlock among threads.

Sure, it looks like this:

(gdb) thread apply all bt

Thread 4 (Thread 0x7f31bd1ff6c0 (LWP 89223) "qemu-system-alp"):
#0  0x00007f31c47150cd in syscall () at /lib64/libc.so.6
#1  0x00005593dd2b578d in qemu_futex_wait (f=0x5593ddad9e50 
<rcu_call_ready_event>, val=4294967295) at 
/home/thuth/devel/qemu/include/qemu/futex.h:47
#2  0x00005593dd2b59a1 in qemu_event_wait (ev=0x5593ddad9e50 
<rcu_call_ready_event>) at ../../home/thuth/devel/qemu/util/event.c:162
#3  0x00005593dd2c12e3 in call_rcu_thread (opaque=0x0) at 
../../home/thuth/devel/qemu/util/rcu.c:304
#4  0x00005593dd2b3311 in qemu_thread_start (args=0x5594051010c0) at 
../../home/thuth/devel/qemu/util/qemu-thread-posix.c:393
#5  0x00007f31c4693f54 in start_thread () at /lib64/libc.so.6
#6  0x00007f31c471732c in __clone3 () at /lib64/libc.so.6

Thread 3 (Thread 0x7f31bc8fd6c0 (LWP 89224) "qemu-system-alp"):
#0  0x00007f31c469c462 in __syscall_cancel_arch () at /lib64/libc.so.6
#1  0x00007f31c469075c in __internal_syscall_cancel () at /lib64/libc.so.6
#2  0x00007f31c46907a4 in __syscall_cancel () at /lib64/libc.so.6
#3  0x00007f31c470a7c6 in ppoll () at /lib64/libc.so.6
#4  0x00007f31c6916890 in g_main_context_iterate_unlocked.isra () at 
/lib64/libglib-2.0.so.0
#5  0x00007f31c6916a4f in g_main_loop_run () at /lib64/libglib-2.0.so.0
#6  0x00005593dd0d1ab0 in iothread_run (opaque=0x559405a567a0) at 
../../home/thuth/devel/qemu/iothread.c:70
#7  0x00005593dd2b3311 in qemu_thread_start (args=0x559405a571a0) at 
../../home/thuth/devel/qemu/util/qemu-thread-posix.c:393
#8  0x00007f31c4693f54 in start_thread () at /lib64/libc.so.6
#9  0x00007f31c471732c in __clone3 () at /lib64/libc.so.6

Thread 2 (Thread 0x7f3137fff6c0 (LWP 89225) "qemu-system-alp"):
#0  0x00007f31c469c462 in __syscall_cancel_arch () at /lib64/libc.so.6
#1  0x00007f31c469075c in __internal_syscall_cancel () at /lib64/libc.so.6
#2  0x00007f31c46907a4 in __syscall_cancel () at /lib64/libc.so.6
#3  0x00007f31c470b2be in write () at /lib64/libc.so.6
#4  0x00005593dd2af441 in event_notifier_set (e=0x559405a56a54) at 
../../home/thuth/devel/qemu/util/event_notifier-posix.c:117
#5  0x00005593dd2cdcde in aio_notify (ctx=0x559405a56980) at 
../../home/thuth/devel/qemu/util/async.c:506
#6  0x00005593dd2cdd2a in aio_timerlist_notify (opaque=0x559405a56980, 
type=QEMU_CLOCK_VIRTUAL) at ../../home/thuth/devel/qemu/util/async.c:524
#7  0x00005593dd2d4014 in timerlist_notify (timer_list=0x559405a56d50) at 
../../home/thuth/devel/qemu/util/qemu-timer.c:293
#8  0x00005593dd2d3aaf in qemu_clock_notify (type=QEMU_CLOCK_VIRTUAL) at 
../../home/thuth/devel/qemu/util/qemu-timer.c:147
#9  0x00005593dcf92c2c in icount_notify_aio_contexts () at 
../../home/thuth/devel/qemu/accel/tcg/tcg-accel-ops-icount.c:72
#10 0x00005593dcf92c92 in icount_handle_deadline () at 
../../home/thuth/devel/qemu/accel/tcg/tcg-accel-ops-icount.c:88
#11 0x00005593dcf93c45 in rr_cpu_thread_fn (arg=0x559405ce47d0) at 
../../home/thuth/devel/qemu/accel/tcg/tcg-accel-ops-rr.c:251
#12 0x00005593dd2b3311 in qemu_thread_start (args=0x559405ce4720) at 
../../home/thuth/devel/qemu/util/qemu-thread-posix.c:393
#13 0x00007f31c4693f54 in start_thread () at /lib64/libc.so.6
#14 0x00007f31c471732c in __clone3 () at /lib64/libc.so.6

Thread 1 (Thread 0x7f31bd82bb00 (LWP 89222) "qemu-system-alp"):
#0  0x00007f31c469c462 in __syscall_cancel_arch () at /lib64/libc.so.6
#1  0x00007f31c469075c in __internal_syscall_cancel () at /lib64/libc.so.6
#2  0x00007f31c46907a4 in __syscall_cancel () at /lib64/libc.so.6
#3  0x00007f31c470a7c6 in ppoll () at /lib64/libc.so.6
#4  0x00005593dd2d418e in qemu_poll_ns (fds=0x5594064aad30, nfds=5, 
timeout=993592000) at ../../home/thuth/devel/qemu/util/qemu-timer.c:342
#5  0x00005593dd2cf1bd in os_host_main_loop_wait (timeout=993592000) at 
../../home/thuth/devel/qemu/util/main-loop.c:308
#6  0x00005593dd2cf30f in main_loop_wait (nonblocking=0) at 
../../home/thuth/devel/qemu/util/main-loop.c:592
#7  0x00005593dce7edf6 in qemu_main_loop () at 
../../home/thuth/devel/qemu/system/runstate.c:903
#8  0x00005593dd1e212e in qemu_default_main (opaque=0x0) at 
../../home/thuth/devel/qemu/system/main.c:50
#9  0x00005593dd1e21e8 in main (argc=25, argv=0x7ffdad60bf18) at 
../../home/thuth/devel/qemu/system/main.c:93


 HTH,
  Thomas

