On Monday, February 8, 2021 at 9:24:25 AM UTC-5 Nadav Har'El wrote:

> On Mon, Feb 8, 2021 at 7:55 AM Waldemar Kozaczuk <[email protected]> 
> wrote:
>
>> On RISC architecture like aarch64 the thread register tpidr_el0 that
>> holds an address of thread local storage memory for current thread,
>> is never accessed directly in one instruction like on x86_64. Instead
>> compiler generates machone code that copies value of the tpidr_el0
>> to another generic register when reading and does the opposite
>> when writing.
>>
>> As the issue #1121 explains, the reschedule_from_interrupt() method
>> uses thread local variable 'current_cpu' to detect any pending thread to
>> de destroyed for current cpu after switching from old thread
>> to a new one. It must use TLS variable instead of any stack variables
>> because those would be invalid at this point as they would be on the
>> stack of the old thread switched from.
>>
>> This all works fine on x86_64, where every read or write of
>> thread register fs is performed in one instruction like so:
>>
>> mov    %fs:0xfffffffffffff948,%rax
>>
>> On aarch64, the value of the thread register tpidr_el0 is copied
>> once to another register x21 before access of first TLS variable in this
>> method and never refreshed after that. When reschedule_from_interrupt()
>> calls switch_to() to switch to new thread, the update of the tpidr_el0
>> register is done in assembly and compiler has no idea that we are
>> changing TLS register. So after return to reschedule_from_interrupt()
>> before it calls cpu::current(), the register x21 still holds old
>> value of tpidr_el0 register.
>>
>> In most cases this does not cause any harm, as typically the old
>> and new thread holds the same current_cpu address. But it leads to very
>> ugly crashes when sched::pin() is used which calls
>> reschedule_from_interrupt() after changing its current_cpu to a new one
>> it is pinning the current thread to.
>>
>> So in order to fix this problem, this patch delegates the code to
>> destroy current cpu terminating thread to a new class method
>> destroy_current_cpu_thread() to enforce fresh reading of the TLS
>> register.
>>
>> Fixes #1121
>>
>> Signed-off-by: Waldemar Kozaczuk <[email protected]>
>> ---
>>  core/sched.cc        | 19 ++++++++++++++-----
>>  include/osv/sched.hh |  1 +
>>  2 files changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/core/sched.cc b/core/sched.cc
>> index 06f849d1..45427cac 100644
>> --- a/core/sched.cc
>> +++ b/core/sched.cc
>> @@ -229,6 +229,13 @@ void cpu::schedule()
>>      }
>>  }
>>
>> +void __attribute__ ((noinline)) cpu::destroy_current_cpu_thread() {
>> +    if (cpu::current()->terminating_thread) {
>> +        cpu::current()->terminating_thread->destroy();
>> +        cpu::current()->terminating_thread = nullptr;
>> +    }
>> +}
>>
>
> Nice catch, but it's really sad we need to call another function, which 
> slows down the context switch (even if a bit) also on x86 where this isn't 
> needed at all.
> I think there must be a better solution:
>
> You are saying that the problem arises because switch_to() set one 
> register, but didn't set its copy to the same value.
> So, can't we change switch_to() - on aarch64 only - to also make this 
> additional copy?
>

The problem is that we do not know which generic register compiler is going 
to use. And it decides it in reschedule_from_interrupt() before this line:

thread* p = thread::current(); 

which uses TLS access. So switch_to() (which is not even inlined function) 
would have no way of knowing which register to copy to.

Another alternative is to make this code ARCH dependant with #ifdef so we 
can keep the x86_64 code and change aarch64 only.

Ideally, if there was a compiler directive to enforce re-read of the 
tpidr_el0 register.


> +
>>  void cpu::reschedule_from_interrupt(bool called_from_yield,
>>                                      thread_runtime::duration 
>> preempt_after)
>>  {
>> @@ -338,15 +345,17 @@ void cpu::reschedule_from_interrupt(bool 
>> called_from_yield,
>>      }
>>      n->switch_to();
>>
>> -    // Note: after the call to n->switch_to(), we should no longer use 
>> any of
>> +    // Note 1: after the call to n->switch_to(), we should no longer use 
>> any of
>>      // the local variables, nor "this" object, because we just switched 
>> to n's
>>      // stack and the values we can access now are those that existed in 
>> the
>>      // reschedule call which scheduled n out, and will now be returning.
>>      // So to get the current cpu, we must use cpu::current(), not "this".
>> -    if (cpu::current()->terminating_thread) {
>> -        cpu::current()->terminating_thread->destroy();
>> -        cpu::current()->terminating_thread = nullptr;
>> -    }
>> +    // Note 2: in addition, in order to destroy any pending thread to be 
>> destroyed
>> +    // for current CPU, we delegate to a class static method 
>> destroy_current_cpu_thread
>> +    // to enforce that TLS register is refreshed before we reference 
>> current cpu
>> +    // using new thread TLS variable we switched to. This is necessary 
>> for architectures
>> +    // like aarch64 where tpidr_el0 register is copied to other one 
>> before use.
>> +    destroy_current_cpu_thread();
>>  }
>>
>>  void cpu::timer_fired()
>> diff --git a/include/osv/sched.hh b/include/osv/sched.hh
>> index 0fb44f77..78fed622 100644
>> --- a/include/osv/sched.hh
>> +++ b/include/osv/sched.hh
>> @@ -886,6 +886,7 @@ struct cpu : private timer_base::client {
>>       */
>>      void reschedule_from_interrupt(bool called_from_yield = false,
>>                                  thread_runtime::duration preempt_after = 
>> thyst);
>> +    static void destroy_current_cpu_thread();
>>      void enqueue(thread& t);
>>      void init_idle_thread();
>>      virtual void timer_fired() override;
>> -- 
>> 2.29.2
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/osv-dev/20210208055529.1339473-1-jwkozaczuk%40gmail.com
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/2199a438-f9b7-48f2-96ec-1549c445c581n%40googlegroups.com.

Reply via email to