On 1/6/23 10:16, Stefan Berger wrote:


On 1/6/23 07:10, Peter Maydell wrote:
I'm seeing an intermittent hang on the s390 CI runner in the
bios-tables-test test. It looks like we've deadlocked because:

  * the TPM device is waiting for data on its socket that never arrives,
    and it's holding the iothread lock
  * QEMU is therefore not making forward progress;
    in particular it is unable to handle qtest queries/responses
  * the test binary thread 1 is waiting to get a response to its
    qtest command, which is not going to arrive
  * test binary thread 3 (tpm_emu_ctrl_thread) has hit an
    assertion and is trying to kill QEMU via qtest_kill_qemu()
  * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
    forever, because QEMU won't respond to the SIGTERM while it's
    blocked waiting for the TPM device to release the iothread lock
  * because the ctrl-thread is waiting for QEMU to exit, it's never
    going to send the data that would unblock the TPM device emulation
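
The "SIGTERM and wait" step described in the list above blocks forever if the
child cannot react to the signal. A minimal sketch of one way to bound that
wait follows: send SIGTERM, poll with waitpid(WNOHANG) for a limited time, then
fall back to SIGKILL. The names kill_child_with_timeout() and
demo_unresponsive_child() are hypothetical and not part of libqtest; the 10 ms
poll interval and the timeout are arbitrary assumptions, and a child that
ignores SIGTERM only stands in for a QEMU that is blocked on the iothread lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a libqtest function): ask the child to exit with
 * SIGTERM, poll for up to timeout_ms, then force it down with SIGKILL.
 * Returns 0 if the child exited within the timeout, 1 if SIGKILL was needed,
 * -1 on a waitpid() error. The wait status is stored in *wstatus.
 */
static int kill_child_with_timeout(pid_t pid, int timeout_ms, int *wstatus)
{
    const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    int waited_ms = 0;
    pid_t ret;

    assert(pid > 0);
    kill(pid, SIGTERM);

    while (waited_ms < timeout_ms) {
        ret = waitpid(pid, wstatus, WNOHANG);   /* never blocks */
        if (ret == pid) {
            return 0;                           /* exited and reaped */
        }
        if (ret < 0 && errno != EINTR) {
            return -1;
        }
        nanosleep(&step, NULL);
        waited_ms += 10;
    }

    /* The child did not react in time; SIGKILL cannot be blocked or ignored. */
    kill(pid, SIGKILL);
    do {
        ret = waitpid(pid, wstatus, 0);
    } while (ret < 0 && errno == EINTR);
    return ret == pid ? 1 : -1;
}

/*
 * Demo: the child ignores SIGTERM, so only the SIGKILL fallback ends it.
 * Returns 1 if the fallback was used and the child died from SIGKILL.
 */
static int demo_unresponsive_child(void)
{
    int status = 0;
    /* Ignore SIGTERM before fork() so the child inherits it without a race. */
    void (*old)(int) = signal(SIGTERM, SIG_IGN);
    pid_t pid = fork();

    if (pid == 0) {
        for (;;) {
            pause();
        }
    }
    signal(SIGTERM, old);                       /* restore in the parent */
    if (pid < 0) {
        return -1;
    }
    int forced = kill_child_with_timeout(pid, 200, &status);
    return forced == 1 && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

With a fallback of this kind the thread that hit the assertion would return
from the kill step after the timeout instead of waiting indefinitely, at the
cost of not giving the child a chance to clean up.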

[...]


Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
#0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
#1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
waitpid.c:38
#2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
../tests/qtest/libqtest.c:206
#3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
../tests/qtest/libqtest.c:229
#4  0x000003ff8f0c288e in g_hook_list_invoke () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#5  <signal handler called>
#6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
#8  0x000003ff8f0feda8 in g_assertion_message () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#9  0x000003ff8f0fedfe in g_assertion_message_expr () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
../tests/qtest/tpm-emu.c:189

This here seems to be the root cause. An unknown control channel command was 
received from the TPM emulator backend by the control channel thread and we end 
up in g_assert_not_reached().

https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189



         ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
         if (ret <= 0) {
             break;
         }

         cmd = be32_to_cpu(cmd);
         switch (cmd) {
  [...]
         default:
             g_debug("unimplemented %u", cmd);
             g_assert_not_reached();                <------------------
         }

I will run this test case in an endless loop on an x86_64 host and see what we 
get there ...

I could not recreate the issue running the test on a ppc64 and an x86_64 host. There were
>100k test runs on ppc64 and >40k on x86_64. Also, simulating the reception
of an unsupported command did not lead to a hang like the one shown here.

Further, it's not clear to me how to check the status of a process before calling 
wait() in a portable way. kill -0 on a <defunct> process, which has exited due
to SIGTERM, still returns 0, so this check doesn't work to determine whether the
process has exited.
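
One option that may be worth considering here is waitpid() with WNOHANG, which
is specified by POSIX and reports whether a child has changed state without
blocking. The sketch below is only an illustration: child_has_exited() and
demo_zombie_check() are hypothetical names, and the demo uses
waitid(..., WNOWAIT) purely to hold the child in the <defunct> state long
enough to compare the two checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: non-blocking check of a child process.
 * Returns 1 if the child has terminated (it is reaped and *wstatus is set),
 * 0 if it is still running, -1 on error (for example, not our child).
 */
static int child_has_exited(pid_t pid, int *wstatus)
{
    pid_t ret;

    assert(pid > 0);
    do {
        ret = waitpid(pid, wstatus, WNOHANG);
    } while (ret < 0 && errno == EINTR);

    if (ret == 0) {
        return 0;
    }
    return ret == pid ? 1 : -1;
}

/*
 * Demo: compare kill(pid, 0) with the WNOHANG check on a zombie child.
 * Returns 1 if every observation matches the behaviour described above.
 */
static int demo_zombie_check(void)
{
    siginfo_t info;
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        for (;;) {
            pause();                    /* child idles until signalled */
        }
    }

    int running = child_has_exited(pid, &status);   /* child still alive */
    kill(pid, SIGTERM);

    /* Block until the child has terminated but leave it unreaped (zombie). */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0) {
        return -1;
    }

    int kill0 = kill(pid, 0);           /* still succeeds on the zombie */
    int exited = child_has_exited(pid, &status);    /* detects and reaps it */

    return running == 0 && kill0 == 0 && exited == 1 &&
           WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;
}
```

Note that the WNOHANG check reaps the child when it reports termination, so a
later blocking wait on the same pid would fail with ECHILD and the status has
to be kept by the caller.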

   Stefan



#11 0x000003ff8f0ffb7c in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ff8eb07e66 in start_thread (arg=0x3ff8dafe900) at
pthread_create.c:477
#13 0x000003ff8e9fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
