On Wednesday, March 3, 2021 at 4:57:18 PM UTC-5 Waldek Kozaczuk wrote:

> As issue #1124 and the thread 
> https://groups.google.com/g/osv-dev/c/ystZElFNdBI 
> on the mailing list explain, some tests in the aarch64 debug build 
> fail consistently with page faults, suggesting that the values of 
> parameters passed from function to function somehow get corrupted. 
> After painful research, I pinpointed the crashes to the 
> reschedule_from_interrupt() method, which does not work correctly in 
> voluntary (non-preemptive) context switch scenarios where the 
> thread::yield() method is called by an app. It turns out that in those 
> situations the so-called callee-saved registers used by an app function 
> calling thread::yield() are not saved during the context switch, and 
> therefore not restored when that thread is later resumed. The following 
> steps illustrate such a scenario: 
>
> 1. Function F pushes one of the callee-saved registers - x23 (just an 
> example) - on the T1 stack because it uses it for something, as it must 
> per the calling convention. 
> 2. Function F stores some value in x23. 
> 3. Function F calls thread::yield() directly or indirectly. 
> 4. Eventually, reschedule_from_interrupt() is called and it calls 
> switch_to() to switch the stack pointer to the new thread T2's stack. 
> In the debug build, neither reschedule_from_interrupt() nor switch_to() 
> stores x23, as they do not use this register (unlike in the release 
> build). 
> 5. At some point later, reschedule_from_interrupt() is called again (not 
> necessarily the very next time) and calls switch_to(), which switches 
> back to T1. 
> 6. T1 resumes and eventually returns control to function F, right after 
> its call to yield(). 
> 7. The code in F after the yield() call reads the value of x23 ... and 
> boom. The x23 register quite likely contains garbage: per the calling 
> convention, yield() or its callees should have saved and restored it, 
> but they did not - or rather, different routines on different threads 
> running on the same cpu in between clobbered it. 
>
> One way to solve this problem would be to add all callee-saved registers 
> (x19-x28 and d8-d15; see "6.1.1 General-purpose Registers" and 
> "6.1.2 SIMD and Floating-Point Registers" in 
> https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst 
> about the v8-v15 registers) to the clobber list of the inline assembly 
> in thread::switch_to() in arch/aarch64/arch-switch.hh. But this does not 
> always work: some versions of gcc (see 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342), depending on the 
> optimization level (-O0, -O2, etc.), may or may not inline switch_to() 
> into reschedule_from_interrupt() and generate machine code that does not 
> do what we intend. On top of that, the code in 
> reschedule_from_interrupt() that destroys any terminated thread on the 
> current cpu uses the thread-local storage (TLS) pointer register - 
> tpidr_el0 - which is not re-read after switch_to() and causes 
> interesting crashes, as issue #1121 describes. 
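
The tpidr_el0 problem can be modeled the same way; this toy C++ sketch of
mine (illustrative names, not OSv code) shows how a TLS base read before
the switch goes stale:

```cpp
// Toy model of issue #1121. The compiler is free to keep the TLS base it
// read from tpidr_el0 in a callee-saved register across a call; if that
// call is switch_to(), the cached base still points at the *previous*
// thread's TCB.
struct tcb { char owner; };

static tcb tcb_a{'A'}, tcb_b{'B'};
static tcb* tls_reg = &tcb_a;     // models the tpidr_el0 register

static char read_with_cached_tls() {
    tcb* cached = tls_reg;        // TLS base read before the switch
    tls_reg = &tcb_b;             // models switch_to() picking thread B
    return cached->owner;         // stale: still thread A's TCB ('A')
}

static char read_with_reload() {
    tls_reg = &tcb_b;             // switch first...
    return tls_reg->owner;        // ...then re-read tpidr_el0 ('B')
}
```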
>
> Given all of the above, and the fact that reschedule_from_interrupt() 
> calls switch_to() to switch threads - which is very tricky to do right 
> using pure C++ and some inline assembly - this patch implements a 
> high-level version of reschedule_from_interrupt() in assembly for 
> aarch64 (the x64 version stays as is). 
>
> This patch modifies include/osv/sched.hh and core/sched.cc with a number 
> of `#ifdef AARCH64_PORT_STUB` blocks to keep reschedule_from_interrupt() 
> as is for x64 and split it into 2 new methods - schedule_next_thread() 
> and destroy_current_cpu_terminating_thread() - for aarch64. The two new 
> functions correspond to the parts of reschedule_from_interrupt() up to 
> the switch_to() call and after it. Please note that 
> schedule_next_thread() returns a simple 2-field struct with pointers to 
> the old and new thread state (pc, sp, tcb, etc.) that gets used by 
> reschedule_from_interrupt(). 
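
For what it's worth, this works because of the AAPCS64 return convention:
a 16-byte composite returned by value travels back in registers x0 and x1,
which is exactly where the assembly picks up old_thread_state (x0) and
new_thread_state (x1). A minimal sketch (mirroring the struct the patch
adds to include/osv/sched.hh; thread_state is left opaque here):

```cpp
// Two-pointer result of schedule_next_thread(). Being a 16-byte trivially
// copyable composite, AAPCS64 returns it in the x0/x1 register pair, so
// the assembly caller needs no memory indirection to receive it.
struct thread_state;

struct thread_switch_data {
    thread_state* old_thread_state = nullptr;  // arrives in x0
    thread_state* new_thread_state = nullptr;  // arrives in x1
};

static_assert(sizeof(thread_switch_data) == 2 * sizeof(void*),
              "two pointers: fits the x0/x1 return-register pair");
```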
>
> On top of this, this patch adds arch/aarch64/sched.S, which implements 
> reschedule_from_interrupt() as a C function in assembly. In essence, 
> reschedule_from_interrupt() calls the new schedule_next_thread(), saves 
> all callee-saved registers on the old thread's stack, executes a context 
> switch as the old switch_to() did, restores the callee-saved registers 
> from the new thread's stack, and finally calls 
> destroy_current_cpu_terminating_thread(). 
>
> This patch has been tested with the -O0, -O1 and -O2 optimization levels 
> (release and debug) on real ARM hardware and in emulated TCG mode. 
>
> Please note that the new code makes the context switch a little bit more 
> expensive (~3%) per these numbers: 
>
> BEFORE: 
> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so' 
> OSv v0.55.0-204-g837a9b5d 
> getauxval() stubbed 
> eth0: 192.168.122.15 
> Booted up in 58.19 ms 
> Cmdline: /tests/misc-ctxsw.so 
> 664 colocated 
> 665 apart 
> 665 nopin 
>
> AFTER: 
> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so' 
> OSv v0.55.0-205-g7e7c49f7 
> getauxval() stubbed 
> eth0: 192.168.122.15 
> Booted up in 58.20 ms 
> Cmdline: /tests/misc-ctxsw.so 
> 683 colocated 
> 690 apart 
> 690 nopin 
>
> Fixes #1121 
> Fixes #1124 
>
> Signed-off-by: Waldemar Kozaczuk <[email protected]> 
> --- 
> Makefile | 1 + 
> arch/aarch64/arch-switch.hh | 38 +++-------------- 
> arch/aarch64/arch-thread-state.hh | 1 + 
> arch/aarch64/sched.S | 70 +++++++++++++++++++++++++++++++ 
> core/sched.cc | 63 ++++++++++++++++++++++++++-- 
> include/osv/sched.hh | 34 ++++++++++++++- 
> 6 files changed, 170 insertions(+), 37 deletions(-) 
> create mode 100644 arch/aarch64/sched.S 
>
> diff --git a/Makefile b/Makefile 
> index bbc57252..e1866c3b 100644 
> --- a/Makefile 
> +++ b/Makefile 
> @@ -878,6 +878,7 @@ objects += arch/$(arch)/memset.o 
> objects += arch/$(arch)/memcpy.o 
> objects += arch/$(arch)/memmove.o 
> objects += arch/$(arch)/tlsdesc.o 
> +objects += arch/$(arch)/sched.o 
> objects += $(libfdt) 
> endif 
>
> diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh 
> index dff7467c..1a78d132 100644 
> --- a/arch/aarch64/arch-switch.hh 
> +++ b/arch/aarch64/arch-switch.hh 
> @@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t); 
>
> namespace sched { 
>
> -void thread::switch_to() 
> -{ 
> - thread* old = current(); 
> - asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory"); 
> - 
> - asm volatile("\n" 
> - "str x29, %0 \n" 
> - "mov x2, sp \n" 
> - "adr x1, 1f \n" /* address of label */ 
> - "stp x2, x1, %1 \n" 
> - 
> - "ldp x29, x0, %2 \n" 
> - "ldp x2, x1, %3 \n" 
> - 
> - "mov sp, x2 \n" 
> - "blr x1 \n" 
> - 
> - "1: \n" /* label */ 
> - : 
> - : "Q"(old->_state.fp), "Ump"(old->_state.sp), 
> - "Ump"(this->_state.fp), "Ump"(this->_state.sp) 
> - : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", 
> - "x9", "x10", "x11", "x12", "x13", "x14", "x15", 
> - "x16", "x17", "x18", "x30", "memory"); 
> -} 
> - 
> void thread::switch_to_first() 
> { 
> asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory"); 
> @@ -59,14 +33,13 @@ void thread::switch_to_first() 
>
> asm volatile("\n" 
> "ldp x29, x0, %0 \n" 
> - "ldp x2, x1, %1 \n" 
> - "mov sp, x2 \n" 
> - "blr x1 \n" 
> + "ldp x22, x21, %1 \n" 
> + "mov sp, x22 \n" 
> + "blr x21 \n" 
> : 
> : "Ump"(this->_state.fp), "Ump"(this->_state.sp) 
> - : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", 
> - "x9", "x10", "x11", "x12", "x13", "x14", "x15", 
> - "x16", "x17", "x18", "x30", "memory"); 
> + : "x0", "x19", "x20", "x21", "x22", "x23", "x24", 
> + "x25", "x26", "x27", "x28", "x29", "x30", "memory"); 
> } 
>
> void thread::init_stack() 
> @@ -150,6 +123,7 @@ void thread::setup_tcb() 
> void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb)); 
> _tcb = (thread_control_block *)p; 
> _tcb[0].tls_base = &_tcb[1]; 
> + _state.tcb = p; 
> // 
> // First goes kernel TLS data 
> auto kernel_tls = _tcb[0].tls_base; 
> diff --git a/arch/aarch64/arch-thread-state.hh 
> b/arch/aarch64/arch-thread-state.hh 
> index af424472..6f1b680d 100644 
> --- a/arch/aarch64/arch-thread-state.hh 
> +++ b/arch/aarch64/arch-thread-state.hh 
> @@ -14,6 +14,7 @@ struct thread_state { 
>
> void* sp; 
> void* pc; 
> + void* tcb; 
> }; 
>
> #endif /* ARCH_THREAD_STATE_HH_ */ 
> diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S 
> new file mode 100644 
> index 00000000..1e69f8fd 
> --- /dev/null 
> +++ b/arch/aarch64/sched.S 
> @@ -0,0 +1,70 @@ 
> +#void cpu::reschedule_from_interrupt(bool called_from_yield, 
> +# thread_runtime::duration preempt_after) 
> +.global reschedule_from_interrupt 
> +.type reschedule_from_interrupt, @function 
> +reschedule_from_interrupt: 
> + #.cfi_startproc 
> + stp x29, x30, [sp, #-176]! 
> + #.cfi_adjust_cfa_offset 176 
> + mov x29, sp 
> + #.cfi_def_cfa_register x29 
> + 
> + #Call cpu_schedule_next_thread() to determine next thread to switch to, if any 
> + bl cpu_schedule_next_thread 
> + 
> + #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1, 
> + #where x0 holds old_thread_state and x1 holds new_thread_state 
> + #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit 
> + cmp x0, #0 
> + b.eq 2f 
> + 
> + #Store all callee-save registers on the old thread stack 
> + stp x19, x20, [sp, #16] 
> + stp x21, x22, [sp, #32] 
> + stp x23, x24, [sp, #48] 
> + stp x25, x26, [sp, #64] 
> + stp x27, x28, [sp, #80] 
> + 
> + stp d8, d9, [sp, #96] 
> + stp d10, d11, [sp, #112] 
> + stp d12, d13, [sp, #128] 
> + stp d14, d15, [sp, #144] 
> + 
> + #Switch to new thread 
> + ldr x2, [x1, #32] //Fetch _tcb of the new thread 
> + msr tpidr_el0, x2 //Set thread pointer 
> + isb 
> + 
> + str x29, [x0, #0] //Save frame pointer of the old thread 
> + mov x3, sp //Fetch old thread stack pointer 
> + adr x4, 1f //Fetch old thread instruction pointer 
> + stp x3, x4, [x0, #16] //Save old thread sp and pc 
> + 
> + ldp x29, x0, [x1, #0] //Set frame pointer and this (x0) of the new thread 
> + //Please note that the pc may point to thread_main_c(thread*) which is 
> + //why we have to set x0 (1st argument) to new thread object 
> + ldp x3, x4, [x1, #16] //Fetch new thread sp and pc 
> + mov sp, x3 //Set new thread stack pointer 
> + blr x4 //Jump to the new thread pc 
> + 
> +1: 
> + #Restore all callee-save registers from the new thread stack 
> + ldp x19, x20, [sp, #16] 
> + ldp x21, x22, [sp, #32] 
> + ldp x23, x24, [sp, #48] 
> + ldp x25, x26, [sp, #64] 
> + ldp x27, x28, [sp, #80] 
> + 
> + ldp d8, d9, [sp, #96] 
> + ldp d10, d11, [sp, #112] 
> + ldp d12, d13, [sp, #128] 
> + ldp d14, d15, [sp, #144]

I think that one more register needs to be saved and restored during the 
context switch - FPCR (the Floating-Point Control Register). Per the AAPCS64 
document linked above:

"The FPSR is a status register that holds the cumulative exception bits of 
the floating-point unit. It contains the fields IDC, IXC, UFC, OFC, DZC, 
IOC and QC. These fields are not preserved across a public interface and 
may have any value on entry to a subroutine.

The FPCR is used to control the behavior of the floating-point unit. It is 
a global register with the following properties.

   - *The exception-control bits (8-12), rounding mode bits (22-23) and 
   flush-to-zero bits (24) may be modified by calls to specific support 
   functions that affect the global state of the application.*
   - All other bits are reserved and must not be modified. It is not 
   defined whether the bits read as zero or one, or whether they are preserved 
   across a public interface."

Saving and restoring it makes tst-fpu.cc pass now on aarch64 in scenarios 
where yield() is used.

I do not think we need to save/restore FPSR, the status register.
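
As a portable illustration of why FPCR is per-thread state worth preserving
(this is standard C++ <cfenv>, not the actual fix, which would be an
mrs/msr fpcr pair in sched.S): the rounding mode - FPCR bits 22-23 on
aarch64 - is exactly the kind of state tst-fpu.cc exercises:

```cpp
#include <cfenv>

// Select a non-default rounding mode and read it back. On aarch64 both
// calls compile down to FPCR accesses; a context switch that fails to
// preserve FPCR could make the second call disagree with the first after
// a yield() resumes the thread.
int set_and_check_rounding() {
    std::fesetround(FE_UPWARD);   // writes the FPCR rounding-mode bits
    return std::fegetround();     // should still report FE_UPWARD
}
```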


> + 
> + #Call destroy_current_cpu_terminating_thread() 
> + bl destroy_current_cpu_terminating_thread 
> + 
> +2: 
> + ldp x29, x30, [sp], #176 
> + #.cfi_def_cfa sp, 0 
> + ret 
> + #.cfi_endproc 
> diff --git a/core/sched.cc b/core/sched.cc 
> index 06f849d1..74b4ab95 100644 
> --- a/core/sched.cc 
> +++ b/core/sched.cc 
> @@ -225,13 +225,34 @@ void thread::cputime_estimator_get( 
> void cpu::schedule() 
> { 
> WITH_LOCK(irq_lock) { 
> +#ifdef AARCH64_PORT_STUB 
> + reschedule_from_interrupt(sched::cpu::current(), false, thyst); 
> +#else 
> current()->reschedule_from_interrupt(); 
> - } 
> -} 
> - 
> +#endif 
> + } 
> +} 
> + 
> +// In aarch64 port, the reschedule_from_interrupt() needs to be implemented 
> +// in assembly (please see arch/aarch64/sched.S) to give us better control 
> +// over which registers are used and which ones are saved and restored during 
> +// the context switch. In essence, most of the original reschedule_from_interrupt() 
> +// code up to switch_to() is shared with the aarch64-specific cpu::schedule_next_thread() 
> +// that is called from reschedule_from_interrupt() in arch/aarch64/sched.S assembly. 
> +// The logic executed after switch_to() that makes up destroy_current_cpu_terminating_thread() 
> +// is called from arch/aarch64/sched.S as well. 
> +// In the end, we define reschedule_from_interrupt() in C++ for x86_64, and schedule_next_thread() 
> +// and destroy_current_cpu_terminating_thread() in C++ for aarch64. 
> +#ifdef AARCH64_PORT_STUB 
> +inline thread_switch_data cpu::schedule_next_thread(bool called_from_yield, 
> + thread_runtime::duration preempt_after) 
> +{ 
> + thread_switch_data switch_data; 
> +#else 
> void cpu::reschedule_from_interrupt(bool called_from_yield, 
> thread_runtime::duration preempt_after) 
> { 
> +#endif 
> trace_sched_sched(); 
> assert(sched::exception_depth <= 1); 
> need_reschedule = false; 
> @@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield, 
> // lowest runtime, and update the timer until the next thread's turn. 
> if (runqueue.empty()) { 
> preemption_timer.cancel(); 
> +#ifdef AARCH64_PORT_STUB 
> + return switch_data; 
> +#else 
> return; 
> +#endif 
> } else if (!called_from_yield) { 
> auto &t = *runqueue.begin(); 
> if (p->_runtime.get_local() < t._runtime.get_local()) { 
> @@ -268,7 +293,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield, 
> if (delta > 0) { 
> preemption_timer.set(now + delta); 
> } 
> +#ifdef AARCH64_PORT_STUB 
> + return switch_data; 
> +#else 
> return; 
> +#endif 
> } 
> } 
> // If we're here, p no longer has the lowest runtime. Before queuing 
> @@ -336,19 +365,39 @@ void cpu::reschedule_from_interrupt(bool called_from_yield, 
> if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) { 
> mmu::flush_tlb_local(); 
> } 
> +#ifdef AARCH64_PORT_STUB 
> + switch_data.old_thread_state = &(p->_state); 
> + switch_data.new_thread_state = &(n->_state); 
> + return switch_data; 
> +} 
> +#else 
> n->switch_to(); 
> +#endif 
>
> // Note: after the call to n->switch_to(), we should no longer use any of 
> // the local variables, nor "this" object, because we just switched to n's 
> // stack and the values we can access now are those that existed in the 
> // reschedule call which scheduled n out, and will now be returning. 
> // So to get the current cpu, we must use cpu::current(), not "this". 
> +#ifdef AARCH64_PORT_STUB 
> +extern "C" void destroy_current_cpu_terminating_thread() 
> +{ 
> +#endif 
> if (cpu::current()->terminating_thread) { 
> cpu::current()->terminating_thread->destroy(); 
> cpu::current()->terminating_thread = nullptr; 
> } 
> } 
>
> +#ifdef AARCH64_PORT_STUB 
> +extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu, 
> + bool called_from_yield, 
> + thread_runtime::duration preempt_after) 
> +{ 
> + return cpu->schedule_next_thread(called_from_yield, preempt_after); 
> +} 
> +#endif 
> + 
> void cpu::timer_fired() 
> { 
> // nothing to do, preemption will happen if needed 
> @@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu) 
> // Note that wakeme is on the same CPU, and irq is disabled, 
> // so it will not actually run until we stop running. 
> wakeme->wake_with([&] { do_wakeme = true; }); 
> +#ifdef AARCH64_PORT_STUB 
> + reschedule_from_interrupt(source_cpu, false, thyst); 
> +#else 
> source_cpu->reschedule_from_interrupt(); 
> +#endif 
> } 
> // wakeme will be implicitly join()ed here. 
> } 
> @@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration preempt_after) 
> } 
> trace_sched_yield_switch(); 
>
> +#ifdef AARCH64_PORT_STUB 
> + reschedule_from_interrupt(cpu::current(), true, preempt_after); 
> +#else 
> cpu::current()->reschedule_from_interrupt(true, preempt_after); 
> +#endif 
> } 
>
> void thread::set_priority(float priority) 
> diff --git a/include/osv/sched.hh b/include/osv/sched.hh 
> index 0fb44f77..1dacd623 100644 
> --- a/include/osv/sched.hh 
> +++ b/include/osv/sched.hh 
> @@ -31,6 +31,9 @@ typedef float runtime_t; 
>
> extern "C" { 
> void smp_main(); 
> +#ifdef AARCH64_PORT_STUB 
> +void destroy_current_cpu_terminating_thread(); 
> +#endif 
> }; 
> void smp_launch(); 
>
> @@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst = 
> std::chrono::milliseconds(5); 
> constexpr thread_runtime::duration context_switch_penalty = 
> std::chrono::microseconds(10); 
>
> - 
> +#ifdef AARCH64_PORT_STUB 
> +struct thread_switch_data { 
> + thread_state* old_thread_state = nullptr; 
> + thread_state* new_thread_state = nullptr; 
> +}; 
> +#endif 
> /** 
> * OSv thread 
> */ 
> @@ -600,7 +608,9 @@ private: 
> unsigned allowed_initial_states_mask = 1 << unsigned(status::waiting)); 
> static void sleep_impl(timer &tmr); 
> void main(); 
> +#ifndef AARCH64_PORT_STUB 
> void switch_to(); 
> +#endif 
> void switch_to_first(); 
> void prepare_wait(); 
> void wait(); 
> @@ -702,7 +712,12 @@ private: 
> std::vector<char*> _tls; 
> bool _app; 
> std::shared_ptr<osv::application_runtime> _app_runtime; 
> +public: //TODO: For whatever reason we need to make this public even though destroy_current_cpu_terminating_thread is defined as a friend below; the compiler still complains otherwise. 
> void destroy(); 
> +private: 
> +#ifdef AARCH64_PORT_STUB 
> + friend void ::destroy_current_cpu_terminating_thread(); 
> +#endif 
> friend class thread_ref_guard; 
> friend void thread_main_c(thread* t); 
> friend class wait_guard; 
> @@ -884,8 +899,13 @@ struct cpu : private timer_base::client { 
> * @param preempt_after fire the preemption timer after this time period 
> * (relevant only when "called_from_yield" is TRUE) 
> */ 
> +#ifdef AARCH64_PORT_STUB 
> + thread_switch_data schedule_next_thread(bool called_from_yield = false, 
> + thread_runtime::duration preempt_after = thyst); 
> +#else 
> void reschedule_from_interrupt(bool called_from_yield = false, 
> - thread_runtime::duration preempt_after = thyst); 
> + thread_runtime::duration preempt_after = thyst); 
> +#endif 
> void enqueue(thread& t); 
> void init_idle_thread(); 
> virtual void timer_fired() override; 
> @@ -895,6 +915,12 @@ struct cpu : private timer_base::client { 
> int renormalize_count; 
> }; 
>
> +#ifdef AARCH64_PORT_STUB 
> +extern "C" void reschedule_from_interrupt(cpu* cpu, 
> + bool called_from_yield, 
> + thread_runtime::duration preempt_after); 
> +#endif 
> + 
> class cpu::notifier { 
> public: 
> explicit notifier(std::function<void ()> cpu_up); 
> @@ -996,7 +1022,11 @@ inline bool preemptable() 
> inline void preempt() 
> { 
> if (preemptable()) { 
> +#ifdef AARCH64_PORT_STUB 
> + reschedule_from_interrupt(sched::cpu::current(), false, thyst); 
> +#else 
> sched::cpu::current()->reschedule_from_interrupt(); 
> +#endif 
> } else { 
> // preempt_enable() will pick this up eventually 
> need_reschedule = true; 
> -- 
> 2.27.0 
>
>
