Re: [osv-dev] aarch64: some unit tests occasionally hang when running on QEMU in emulated mode

Waldek Kozaczuk Sun, 14 Mar 2021 08:07:43 -0700

Thanks for your suggestions.

Can you also verify that my thinking is correct? The idle thread on each CPU 
should ALWAYS advance, which shows up as a constantly increasing value 
of *_total_cpu_time*, unless the other threads on that CPU use all of the 
time (the case of 100% CPU utilization). Also, any active timer should stay 
active only until it expires. In other words, if I see an unchanging 
*_total_cpu_time* and the same active timers every time I connect with gdb 
or continue, then something is wrong and the CPU is in an abnormal 
state. That could be explained by QEMU not raising timer interrupts, by 
a bug in OSv where it fails to set up the next timer event or sets 
it too far in the future, or by some other condition of that sort. Right?
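
To make the check concrete, here is a minimal sketch in Python. It is not 
OSv code: CpuSample, diagnose and the field names are hypothetical, and I 
assume that the idle thread's *_total_cpu_time*, the list of active timers 
and a clock reading can be copied out of two gdb sessions taken some time 
apart.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CpuSample:
    """Hypothetical per-CPU sample, e.g. copied by hand from a gdb session."""
    now_ns: int                                 # clock reading at sample time
    idle_total_cpu_time_ns: int                 # idle thread's _total_cpu_time
    active_timers: FrozenSet[Tuple[int, int]]   # (timer id, expiry in ns)


def diagnose(prev: CpuSample, curr: CpuSample) -> str:
    """Classify a CPU by comparing two samples taken some time apart."""
    if curr.idle_total_cpu_time_ns > prev.idle_total_cpu_time_ns:
        return "ok"        # the idle thread ran, so the CPU is scheduling
    if curr.active_timers != prev.active_timers:
        return "busy"      # idle is starved, but timers are still serviced
    overdue = [t for t in curr.active_timers if t[1] <= curr.now_ns]
    if overdue:
        return "overdue"   # expired timer still armed: lost or never re-armed
    return "waiting"       # nothing expired yet: timer may be set too far ahead
```

In this model "overdue" matches the case of an interrupt that was never 
raised or a timer that was never re-armed, while a CPU that stays in 
"waiting" across many samples would point at a timer set too far in the 
future.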




On Sunday, March 14, 2021 at 8:16:16 AM UTC-4 Nadav Har'El wrote:

> On Thu, Mar 11, 2021 at 9:12 PM Waldek Kozaczuk <[email protected]> 
> wrote:
>
>> Hi,
>>
>> In the last couple of days, I have been troubleshooting scenarios 
>> in which some unit tests hang when running on QEMU in TCG mode. I have 
>> opened an issue with extensive details and some findings: 
>> https://github.com/cloudius-systems/osv/issues/1127. My running theory 
>> is that we may have a bug in the arm-clock implementation or a bug in the 
>> QEMU aarch64 TCG code that simulates the virtual clock counters and affects 
>> how we use them. It also relates to the scheduler logic and how it interacts 
>> with the clock, so I wonder if some of you might have any interesting insights.
>>
>> Please note that this problem does not seem to happen on KVM when running 
>> the same tests on real ARM hardware. The code I am testing is actually 
>> patched with some of the latest improvements like the "implement 
>> reschedule_from_interrupt() in assembly" and some other ones fixing FPU 
>> state save/restore which have not been applied to the master.
>>
>> Any help or suggestions would be greatly appreciated.
>> Waldek
>>
>
> Unfortunately I can't really help here, as I have zero experience with ARM. I 
> can only provide guesses:
>
> Your description makes it sound - although I really have no idea if that 
> is possible or likely - that we ask the timer hardware for a timer 
> interrupt but never receive it, which would make this CPU just hang forever 
> (if nobody on this CPU is waiting for any mutex or something like that, 
> nobody will ever bother to send a wakeup IPI to this CPU, so it will never 
> wake up and reach the scheduler).
> I don't know where our ARM code that sets up the high-precision timer 
> interrupt is, or how it works.
>
> I suggest you try adding *tracepoints* in the timer code so that after a hang 
> you can see a list of the last timer operations (sets, unsets and wakeups): 
> Did we set up a timer event but never get woken up? Did we forget to set a 
> timer? Did we cancel a timer and forget to reinstate it?
>
> I wouldn't be surprised if this is a QEMU bug in the emulation of the 
> timer hardware. I've seen in the past that QEMU emulation bugs can linger 
> for years (e.g., see https://github.com/cloudius-systems/osv/issues/855). 
> Maybe it's because hardly anyone uses QEMU without hardware support (like 
> KVM), so these bugs can go unnoticed? Although an OSv bug is certainly even 
> more likely.
>
> Nadav.
>

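The tracepoint suggestion quoted above can be pictured with a small model. 
This is only a sketch in Python, not OSv's tracepoint facility: TimerTrace, 
record and pending are made-up names, and the three operations ("set", 
"cancel", "fire") are an assumed simplification of what the timer code 
would log.

```python
from collections import deque


class TimerTrace:
    """Bounded log of the most recent timer operations (hypothetical model)."""

    OPS = ("set", "cancel", "fire")

    def __init__(self, capacity=1024):
        # A deque with maxlen drops the oldest entries, like a ring buffer.
        self.events = deque(maxlen=capacity)

    def record(self, op, timer_id, when_ns):
        if op not in self.OPS:
            raise ValueError(f"unknown timer operation: {op}")
        self.events.append((op, timer_id, when_ns))

    def pending(self):
        """Return {timer id: expiry} for timers set but never fired or canceled."""
        armed = {}
        for op, timer_id, when_ns in self.events:
            if op == "set":
                armed[timer_id] = when_ns
            else:
                armed.pop(timer_id, None)
        return armed
```

After a hang, a non-empty pending() with expiry times already in the past 
would suggest a timer that was set but never delivered, while an empty 
result would suggest that the next timer was never set at all.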
