Thanks for your suggestions. Can you also verify that my thinking is correct: the idle thread on each CPU should ALWAYS advance, which can be observed as a steadily increasing value of *_total_cpu_time*, unless all the other threads on that CPU use up the time (the case of 100% CPU utilization)? Also, any active timer should stay active only until it expires. In other words, if I see an unchanging *_total_cpu_time* and the same active timers every time I connect with gdb, or every time I continue and break again, it means the system is in an abnormal state. That could be explained by QEMU not raising timer interrupts, by a bug in OSv where it fails to set up the next timer event or sets it too far in the future, or by some other condition of that sort. Right?
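To make the check I have in mind concrete, here is a rough Python sketch of how I would compare two gdb stops. It is only an illustration: `CpuSnapshot`, its fields and `looks_stuck` are hypothetical names I made up, not anything that exists in OSv or in its gdb scripts, and the values would be copied by hand from gdb at each stop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSnapshot:
    """Values noted by hand for one CPU at one gdb stop (hypothetical layout)."""
    idle_total_cpu_time: int   # the idle thread's _total_cpu_time
    active_timers: frozenset   # expiration times of the timers still armed
    now: int                   # clock reading at the stop, same units as the timers


def looks_stuck(before: CpuSnapshot, after: CpuSnapshot) -> bool:
    """Return True if two stops on the same CPU match the abnormal pattern."""
    # The idle thread made no progress between the two stops. On its own this
    # is also what a CPU at 100% utilization looks like, so it is not enough.
    idle_frozen = after.idle_total_cpu_time == before.idle_total_cpu_time
    # Exactly the same non-empty set of timers is armed at both stops.
    same_timers = bool(after.active_timers) and after.active_timers == before.active_timers
    # At least one of those timers should already have fired.
    overdue = any(t <= after.now for t in after.active_timers)
    return idle_frozen and same_timers and overdue


if __name__ == "__main__":
    first = CpuSnapshot(1000, frozenset({5000}), 4000)
    second = CpuSnapshot(1000, frozenset({5000}), 9000)
    print(looks_stuck(first, second))
```

The `overdue` test is there to separate a timer that was missed from one that was merely set far in the future; a fully busy CPU would show a frozen idle thread but its timers would still expire and be replaced.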
On Sunday, March 14, 2021 at 8:16:16 AM UTC-4 Nadav Har'El wrote:

> On Thu, Mar 11, 2021 at 9:12 PM Waldek Kozaczuk <[email protected]> wrote:
>
>> Hi,
>>
>> In the last couple of days, I have been troubleshooting scenarios
>> where some unit tests hang when running on QEMU in TCG mode. I have
>> opened an issue with extensive details and some findings -
>> https://github.com/cloudius-systems/osv/issues/1127. My running theory
>> is that we may have a bug in the arm-clock implementation or a bug in the
>> QEMU aarch64 TCG code that simulates the virtual clock counters and
>> affects how we use them. It also relates to the scheduler logic and how
>> it interacts with the clock, so I wonder if some of you might have any
>> interesting insights.
>>
>> Please note that this problem does not seem to happen on KVM when running
>> the same tests on real ARM hardware. The code I am testing is actually
>> patched with some of the latest improvements like the "implement
>> reschedule_from_interrupt() in assembly" and some other ones fixing FPU
>> state save/restore, which have not been applied to master.
>>
>> Any help or suggestions would be greatly appreciated.
>> Waldek
>>
>
> Unfortunately I can't really help here, I have zero experience with ARM. I
> can only provide guesses:
>
> Your description makes it sound - although I really have no idea if that
> is possible or likely - that we ask the timer hardware for a timer
> interrupt but never receive it, which would make this CPU just hang forever
> (if nobody on this CPU is waiting for any mutex or something like that,
> nobody will ever bother to send a wakeup IPI to this CPU, so it will never
> wake up and reach the scheduler).
> I don't know where our ARM code that sets up the high-precision timer
> interrupt is, or how it works.
>
> I suggest you try to add *tracepoints* in the timer code so after a hang
> you can see a list of the last timer operations - sets, unsets and wakeups:
> Did we set up a timer event but never got woken up? Did we forget to set a
> timer? Did we cancel a timer and forget to reinstate it?
>
> I wouldn't be surprised if this is a QEMU bug in the emulation of the
> timer hardware. I've seen in the past that QEMU emulation bugs can linger
> for years (e.g., see https://github.com/cloudius-systems/osv/issues/855).
> Maybe it's because hardly anyone uses QEMU without hardware support (like
> KVM), so these bugs can go unnoticed? Although certainly an OSv bug is even
> more likely.
>
> Nadav.
