icount has become much slower after tcg_cpu_exec has stopped using the BQL. There is also a latent bug that is masked by the slowness.

The slowness happens because every occurrence of a QEMU_CLOCK_VIRTUAL timer now has to wake up the I/O thread and wait for it. The rendezvous is mediated by the BQL QemuMutex:

- handle_icount_deadline wakes up the I/O thread with BQL taken

- the I/O thread wakes up and waits on the BQL

- the VCPU thread releases the BQL a little later

- the I/O thread raises an interrupt, which calls qemu_cpu_kick

- the VCPU thread notices the interrupt, takes the BQL to process it and waits on it

All this back and forth is extremely expensive, causing a 6- to 8-fold slowdown when icount is turned on.

One may think that the issue is that the VCPU thread is too dependent on the BQL, but then the latent bug comes in. I first tried removing the BQL completely from the x86 cpu_exec. Every guest then hung, and the only way to fix it (and make everything slow again) was to add a dummy BQL lock/unlock pair to qemu_tcg_wait_io_event. This is because in -icount mode you really have to process the events before the CPU restarts executing the next instruction.

Therefore, this series moves the processing of QEMU_CLOCK_VIRTUAL timers straight into the vCPU thread when running in icount mode. This is limited to the main TimerListGroup; QEMU_CLOCK_VIRTUAL timers in AioContexts still run outside the vCPU thread.

With this change, icount mode is pretty much running as fast as in 2.8. I tested the patches on top of Alex's series with both x86 and aarch64 guests, but they should be pretty much independent.

The good thing is that the infrastructure to do this is basically already there, in the form of QEMUTimerListNotifyCB. It only needs to be generalized a bit (patches 2 and 3) and bugfixed (patches 1 and 4---the latter is necessary to avoid the "I/O thread spun for 1000 iterations" message and the consequent slowing down of the vCPU thread).
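
To illustrate why the timers must run before the next instruction executes, here is a minimal toy model of the idea. It is not QEMU code: ToyTimer, toy_next_deadline, toy_run_timers and toy_vcpu_run are hypothetical names, and virtual time is modeled simply as an instruction counter. The "vCPU" never executes past the earliest deadline, and it runs the expired timers itself instead of handing them off to another thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in QEMU. */
typedef struct ToyTimer {
    int64_t expire;    /* virtual time (instruction count) at which it fires */
    int fired;         /* nonzero once the timer has run */
    int64_t fired_at;  /* virtual time observed when the timer ran */
} ToyTimer;

/* Earliest pending deadline, or INT64_MAX if no timer is pending. */
static int64_t toy_next_deadline(const ToyTimer *t, size_t n)
{
    int64_t best = INT64_MAX;
    for (size_t i = 0; i < n; i++) {
        if (!t[i].fired && t[i].expire < best) {
            best = t[i].expire;
        }
    }
    return best;
}

/* Run every expired timer inline, in the calling (vCPU) thread. */
static int toy_run_timers(ToyTimer *t, size_t n, int64_t now)
{
    int ran = 0;
    for (size_t i = 0; i < n; i++) {
        if (!t[i].fired && t[i].expire <= now) {
            t[i].fired = 1;
            t[i].fired_at = now;
            ran++;
        }
    }
    return ran;
}

/* "Execute" budget instructions starting at virtual time now.
 * Execution stops at each deadline and the timers are processed
 * before any further instruction runs. Returns the final time. */
static int64_t toy_vcpu_run(ToyTimer *t, size_t n, int64_t now, int64_t budget)
{
    assert(budget >= 0);
    int64_t end = now + budget;

    while (now < end) {
        int64_t deadline = toy_next_deadline(t, n);
        int64_t stop = deadline < end ? deadline : end;
        if (stop > now) {
            now = stop;             /* run instructions up to the deadline */
        }
        toy_run_timers(t, n, now);  /* no cross-thread rendezvous needed */
    }
    return now;
}
```

With timers due at 100, 250 and 1000 and a budget of 500 instructions, each of the first two timers observes exactly its own deadline as the current virtual time, and the third stays pending. In the real series the deadline check and the notification hook are of course tied to the icount machinery and QEMUTimerListNotifyCB, which this sketch does not attempt to reproduce.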

The bad things are:

- I am not sure of what was different before the patch that removed the BQL from tcg_cpu_exec (and I don't really have time to profile it right now---I should not be fixing this in fact...).

- the solution sounds a bit ugly and it probably is---though the patch itself is pretty small, adding only about 30 lines of new code.

Paolo

Paolo Bonzini (5):
  qemu-timer: fix off-by-one
  qemu-timer: do not include sysemu/cpus.h from util/qemu-timer.h
  cpus: define QEMUTimerListNotifyCB for QEMU system emulation
  main-loop: remove now unnecessary optimization
  icount: process QEMU_CLOCK_VIRTUAL timers in vCPU thread

 cpu-exec.c                   |  1 +
 cpus.c                       | 29 +++++++++++++++++++++++++++--
 hw/core/ptimer.c             |  1 +
 include/qemu/timer.h         | 29 ++++++++++++++++++++++++++---
 include/sysemu/cpus.h        |  3 +++
 kvm-all.c                    |  1 +
 monitor.c                    |  1 +
 replay/replay.c              |  1 +
 stubs/cpu-get-icount.c       |  6 ++++++
 tests/test-aio-multithread.c |  2 +-
 tests/test-aio.c             |  2 +-
 translate-all.c              |  1 +
 util/async.c                 |  2 +-
 util/main-loop.c             |  3 ++-
 util/qemu-timer.c            | 17 ++++++++++-------
 vl.c                         |  5 +----
 16 files changed, 84 insertions(+), 20 deletions(-)