On Mon, Mar 15, 2021 at 7:02 AM Nadav Har'El <[email protected]> wrote:
> Thanks, very impressive work!
>
> The code looks very reasonable to me, but I can't really read the arm
> assembly to verify every small detail. I'm sure you are testing it much
> better than I can, and it's really great to see how the ARM port is coming
> along.
>
> I committed this patch as-is, but I have a couple of ideas for future
> consideration:
>
> 1. I see you used the "AARCH64_PORT_STUB" ifdef. I think the original
> intention of this macro, as its name suggests, was to mark code that was
> done as a quick hack or stub for aarch64 - and should be revisited in the
> future. If this isn't quite the case here, and it's not a stub but really
> fully-working code, maybe you should have used the regular #ifdef
> __aarch64__ (I think), and stop using AARCH64_PORT_STUB in new code?

Very good idea. I will try to prepare a patch that fixes it.

> 2. You split off the "schedule_next_thread" function just for aarch64,
> with an #if putting the same code in either schedule_next_thread or
> reschedule_from_interrupt. I wonder if we couldn't have the same
> schedule_next_thread also for x86, so we don't need a special case for
> aarch64. Hopefully (?) the extra function won't cause any performance
> deterioration for x86 if it's inlined and optimized. Moreover, from this
> discussion it's not entirely clear to me why this whole C function worked
> correctly on x86 - maybe it was just luck, and in the future we'll need to
> do this for x86 as well.

I think it is worth creating an issue.

> --
> Nadav Har'El
> [email protected]
>
> On Mon, Mar 15, 2021 at 7:18 AM Waldemar Kozaczuk <[email protected]>
> wrote:
>
>> As issue #1124 and the thread
>> https://groups.google.com/g/osv-dev/c/ystZElFNdBI
>> on the mailing list explain, some tests in the aarch64 debug build
>> fail in a consistent manner with page faults, suggesting that the
>> values of parameters passed from function to function somehow get
>> corrupted.
>> After painful research, I pinpointed the crashes to the
>> reschedule_from_interrupt() method, which does not work correctly in
>> voluntary (non-preemptive) context switch scenarios where the
>> thread::yield() method is called by an app. It turns out that in those
>> situations the so-called callee-saved registers used by an app function
>> calling thread::yield() are not saved during the context switch, and are
>> then not restored when such a thread gets resumed later. The following
>> steps illustrate such a scenario:
>>
>> 1. Function F pushes one of the callee-saved registers - x23 (just an
>> example) - onto the T1 stack, because it uses it for something and must
>> do so per the calling convention.
>> 2. Function F stores some value in x23.
>> 3. Function F calls thread::yield() directly or indirectly.
>> 4. Eventually, reschedule_from_interrupt() is called and it calls
>> switch_to() to switch the stack pointer to the new thread T2 stack.
>> In the debug build, neither reschedule_from_interrupt() nor switch_to()
>> stores x23, as they do not use this register (unlike the release
>> version).
>> 5. At some point later, reschedule_from_interrupt() is called again (not
>> necessarily the very next time) and calls switch_to(), which switches
>> back to T1.
>> 6. T1 resumes and eventually returns control to function F right after
>> its call to yield().
>> 7. The code in F after the call to yield() reads the value of x23 ...
>> and boom. The x23 quite likely contains garbage: per the calling
>> convention F did not need to restore it itself, because yield() or its
>> other callees should have saved and restored it. But they did not, did
>> they? Rather, different routines on different threads running on the
>> same cpu in the meantime ruined it.
>>
>> One way to solve this problem would be to add all callee-saved registers
>> (x19-x28 and d8-d15 per
>> https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst -
>> see "6.1.1 General-purpose Registers" and "6.1.2 SIMD and Floating-Point
>> Registers" about the v8-v15 registers)
>> to the list of clobbered registers of the inline assembly in
>> thread::switch_to() in arch/aarch64/arch-switch.hh. But this does not
>> always work, as some versions of gcc (see
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342), depending on the
>> optimization level (-O0, -O2, etc), may or may not inline switch_to()
>> into reschedule_from_interrupt() and generate machine code that does not
>> do what we intend. On top of that, the code in
>> reschedule_from_interrupt() that destroys any terminated thread on the
>> current cpu uses the thread-local storage (TLS) pointer register -
>> tpidr_el0 - which is not re-read after switch_to() and causes
>> interesting crashes, as issue #1121 describes.
>>
>> Given all of the above, and the fact that reschedule_from_interrupt()
>> calls switch_to() to switch threads - which is very tricky to do right
>> using pure C++ and some inline assembly - this patch implements a
>> high-level version of reschedule_from_interrupt() in assembly for
>> aarch64 (the x64 version stays as is).
>>
>> This patch modifies include/osv/sched.hh and core/sched.cc with a number
>> of `#ifdef AARCH64_PORT_STUB` blocks to keep reschedule_from_interrupt()
>> as is for x64, and splits it into 2 new methods - schedule_next_thread()
>> and destroy_current_cpu_terminating_thread() - for aarch64. The two new
>> functions correspond to the parts of reschedule_from_interrupt() up to
>> the switch_to() call and after it. Please note that
>> schedule_next_thread() returns a simple 2-field struct with pointers to
>> the old and new thread state (pc, sp, tcb, etc.) that gets used by
>> reschedule_from_interrupt().
>>
>> On top of this, this patch adds arch/aarch64/sched.S, which implements
>> reschedule_from_interrupt() as a C function in assembly. In essence,
>> reschedule_from_interrupt calls the new schedule_next_thread(), saves
>> all callee-saved registers on the old thread stack, executes a context
>> switch as the old switch_to() did, restores the callee-saved registers
>> from the new thread stack, and finally calls
>> destroy_current_cpu_terminating_thread().
>>
>> Compared to the original v0 version, this one refines switch_to_first()
>> to use early clobbers. The context switch also saves/restores one extra
>> register, FPCR, to make tst-fpu.cc work correctly.
>>
>> This patch has been tested with the -O0, -O1, -O2 optimization levels
>> (release, debug 1 and 2) on real ARM hardware and in emulated TCG mode.
>>
>> Please note that the new code makes the context switch a little bit more
>> expensive (~3%) per these numbers:
>>
>> BEFORE:
>> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
>> OSv v0.55.0-204-g837a9b5d
>> getauxval() stubbed
>> eth0: 192.168.122.15
>> Booted up in 58.19 ms
>> Cmdline: /tests/misc-ctxsw.so
>> 664 colocated
>> 665 apart
>> 665 nopin
>>
>> AFTER:
>> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
>> OSv v0.55.0-205-g7e7c49f7
>> getauxval() stubbed
>> eth0: 192.168.122.15
>> Booted up in 58.20 ms
>> Cmdline: /tests/misc-ctxsw.so
>> 683 colocated
>> 690 apart
>> 690 nopin
>>
>> Fixes #1121
>> Fixes #1124
>>
>> Signed-off-by: Waldemar Kozaczuk <[email protected]>
>> ---
>>  Makefile                          |  1 +
>>  arch/aarch64/arch-switch.hh       | 43 ++++--------------
>>  arch/aarch64/arch-thread-state.hh |  1 +
>>  arch/aarch64/sched.S              | 75 +++++++++++++++++++++++++++++++
>>  core/sched.cc                     | 63 ++++++++++++++++++++++++--
>>  include/osv/sched.hh              | 34 +++++++++++++-
>>  6 files changed, 178 insertions(+), 39 deletions(-)
>>  create mode 100644 arch/aarch64/sched.S
>>
>> diff --git a/Makefile b/Makefile
>> index e6867cac..d8f888c4 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -879,6 +879,7 @@ objects += arch/$(arch)/memset.o
>>  objects += arch/$(arch)/memcpy.o
>>  objects += arch/$(arch)/memmove.o
>>  objects += arch/$(arch)/tlsdesc.o
>> +objects += arch/$(arch)/sched.o
>>  objects += $(libfdt)
>>  endif
>>
>> diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
>> index dff7467c..bbf832db 100644
>> --- a/arch/aarch64/arch-switch.hh
>> +++ b/arch/aarch64/arch-switch.hh
>> @@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t);
>>
>>  namespace sched {
>>
>> -void thread::switch_to()
>> -{
>> -    thread* old = current();
>> -    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
>> -
>> -    asm volatile("\n"
>> -                 "str x29, %0  \n"
>> -                 "mov x2, sp  \n"
>> -                 "adr x1, 1f  \n" /* address of label */
>> -                 "stp x2, x1, %1  \n"
>> -
>> -                 "ldp x29, x0, %2  \n"
>> -                 "ldp x2, x1, %3  \n"
>> -
>> -                 "mov sp, x2  \n"
>> -                 "blr x1  \n"
>> -
>> -                 "1:  \n" /* label */
>> -                 :
>> -                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
>> -                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
>> -                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
>> -                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
>> -                   "x16", "x17", "x18", "x30", "memory");
>> -}
>> -
>>  void thread::switch_to_first()
>>  {
>>      asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
>> @@ -58,15 +32,15 @@ void thread::switch_to_first()
>>      remote_thread_local_var(percpu_base) = _detached_state->_cpu->percpu_base;
>>
>>      asm volatile("\n"
>> -                 "ldp x29, x0, %0  \n"
>> -                 "ldp x2, x1, %1  \n"
>> -                 "mov sp, x2  \n"
>> -                 "blr x1  \n"
>> -                 :
>> +                 "ldp x29, x0, %2  \n"
>> +                 "ldp x22, x21, %3  \n"
>> +                 "mov sp, x22  \n"
>> +                 "blr x21  \n"
>> +                 : // Designate the operands below as earlyclobbers
>> +                   "=&Ump"(this->_state.fp), "=&Ump"(this->_state.sp)
>>                   : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
>> -                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
>> -                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
>> -                   "x16", "x17", "x18", "x30", "memory");
>> +                 : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
>> +                   "x25", "x26", "x27", "x28", "x30", "memory");
>>  }
>>
>>  void thread::init_stack()
>> @@ -150,6 +124,7 @@ void thread::setup_tcb()
>>      void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
>>      _tcb = (thread_control_block *)p;
>>      _tcb[0].tls_base = &_tcb[1];
>> +    _state.tcb = p;
>>      //
>>      // First goes kernel TLS data
>>      auto kernel_tls = _tcb[0].tls_base;
>> diff --git a/arch/aarch64/arch-thread-state.hh b/arch/aarch64/arch-thread-state.hh
>> index af424472..6f1b680d 100644
>> --- a/arch/aarch64/arch-thread-state.hh
>> +++ b/arch/aarch64/arch-thread-state.hh
>> @@ -14,6 +14,7 @@ struct thread_state {
>>
>>      void* sp;
>>      void* pc;
>> +    void* tcb;
>>  };
>>
>>  #endif /* ARCH_THREAD_STATE_HH_ */
>> diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
>> new file mode 100644
>> index 00000000..98603393
>> --- /dev/null
>> +++ b/arch/aarch64/sched.S
>> @@ -0,0 +1,75 @@
>> +#void cpu::reschedule_from_interrupt(bool called_from_yield,
>> +#                                    thread_runtime::duration preempt_after)
>> +.global reschedule_from_interrupt
>> +.type reschedule_from_interrupt, @function
>> +reschedule_from_interrupt:
>> +        stp x29, x30, [sp, #-192]!
>> +        mov x29, sp
>> +
>> +        #Call cpu_schedule_next_thread() to determine next thread to switch to if any
>> +        bl cpu_schedule_next_thread
>> +
>> +        #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1,
>> +        #where x0 holds old_thread_state and x1 holds new_thread_state
>> +        #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit
>> +        cmp x0, #0
>> +        b.eq 2f
>> +
>> +        #Store all regular callee-save registers on the old thread stack
>> +        stp x19, x20, [sp, #16]
>> +        stp x21, x22, [sp, #32]
>> +        stp x23, x24, [sp, #48]
>> +        stp x25, x26, [sp, #64]
>> +        stp x27, x28, [sp, #80]
>> +
>> +        #Store all SIMD/FP callee-save registers on the old thread stack
>> +        stp d8, d9, [sp, #96]
>> +        stp d10, d11, [sp, #112]
>> +        stp d12, d13, [sp, #128]
>> +        stp d14, d15, [sp, #144]
>> +
>> +        #Store FP Control Register with flags that control rounding, etc
>> +        mrs x2, fpcr
>> +        str x2, [sp, #160]
>> +
>> +        #Switch to new thread
>> +        ldr x2, [x1, #32]     //Fetch _tcb of the new thread
>> +        msr tpidr_el0, x2     //Set thread pointer
>> +        isb
>> +
>> +        str x29, [x0, #0]     //Save frame pointer of the old thread
>> +        mov x3, sp            //Fetch old thread stack pointer
>> +        adr x4, 1f            //Fetch old thread instruction point
>> +        stp x3, x4, [x0, #16] //Save old thread sp and pc
>> +
>> +        ldp x29, x0, [x1, #0] //Set frame pointer of the new thread and this (x0) of the new thread
>> +                              //Please note that the pc may point to thread_main_c(thread*) which is
>> +                              //why we have to set x0 (1st argument) to new thread object
>> +        ldp x3, x4, [x1, #16] //Fetch new thread sp and pc
>> +        mov sp, x3            //Set new thread stack pointer
>> +        blr x4                //Jump to the new thread pc
>> +
>> +1:
>> +        #Restore all regular callee-save registers from the new thread stack
>> +        ldp x19, x20, [sp, #16]
>> +        ldp x21, x22, [sp, #32]
>> +        ldp x23, x24, [sp, #48]
>> +        ldp x25, x26, [sp, #64]
>> +        ldp x27, x28, [sp, #80]
>> +
>> +        #Restore all SIMD/FP callee-save registers from the new thread stack
>> +        ldp d8, d9, [sp, #96]
>> +        ldp d10, d11, [sp, #112]
>> +        ldp d12, d13, [sp, #128]
>> +        ldp d14, d15, [sp, #144]
>> +
>> +        #Restore FP Control Register with flags that control rounding, etc
>> +        ldr x2, [sp, #160]
>> +        msr fpcr, x2
>> +
>> +        #Call destroy_current_cpu_terminating_thread()
>> +        bl destroy_current_cpu_terminating_thread
>> +
>> +2:
>> +        ldp x29, x30, [sp], #192
>> +        ret
>> diff --git a/core/sched.cc b/core/sched.cc
>> index 06f849d1..74b4ab95 100644
>> --- a/core/sched.cc
>> +++ b/core/sched.cc
>> @@ -225,13 +225,34 @@ void thread::cputime_estimator_get(
>>  void cpu::schedule()
>>  {
>>      WITH_LOCK(irq_lock) {
>> +#ifdef AARCH64_PORT_STUB
>> +        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
>> +#else
>>          current()->reschedule_from_interrupt();
>> -    }
>> -}
>> -
>> +#endif
>> +    }
>> +}
>> +
>> +// In the aarch64 port, reschedule_from_interrupt() needs to be implemented
>> +// in assembly (please see arch/aarch64/sched.S) to give us better control
>> +// over which registers are used and which ones are saved and restored during
>> +// the context switch. In essence, most of the original reschedule_from_interrupt()
>> +// code up to switch_to() is shared with the aarch64-specific cpu::schedule_next_thread()
>> +// that is called from reschedule_from_interrupt() in the arch/aarch64/sched.S assembly.
>> +// The logic executed after switch_to() that makes up destroy_current_cpu_terminating_thread()
>> +// is called from arch/aarch64/sched.S as well.
>> +// In the end, we define reschedule_from_interrupt() in C++ for x86_64, and schedule_next_thread()
>> +// and destroy_current_cpu_terminating_thread() in C++ for aarch64.
>> +#ifdef AARCH64_PORT_STUB
>> +inline thread_switch_data cpu::schedule_next_thread(bool called_from_yield,
>> +                                                    thread_runtime::duration preempt_after)
>> +{
>> +    thread_switch_data switch_data;
>> +#else
>>  void cpu::reschedule_from_interrupt(bool called_from_yield,
>>                                      thread_runtime::duration preempt_after)
>>  {
>> +#endif
>>      trace_sched_sched();
>>      assert(sched::exception_depth <= 1);
>>      need_reschedule = false;
>> @@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
>>      // lowest runtime, and update the timer until the next thread's turn.
>>      if (runqueue.empty()) {
>>          preemption_timer.cancel();
>> +#ifdef AARCH64_PORT_STUB
>> +        return switch_data;
>> +#else
>>          return;
>> +#endif
>>      } else if (!called_from_yield) {
>>          auto &t = *runqueue.begin();
>>          if (p->_runtime.get_local() < t._runtime.get_local()) {
>> @@ -268,7 +293,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
>>              if (delta > 0) {
>>                  preemption_timer.set(now + delta);
>>              }
>> +#ifdef AARCH64_PORT_STUB
>> +            return switch_data;
>> +#else
>>              return;
>> +#endif
>>          }
>>      }
>>      // If we're here, p no longer has the lowest runtime. Before queuing
>> @@ -336,19 +365,39 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
>>      if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
>>          mmu::flush_tlb_local();
>>      }
>> +#ifdef AARCH64_PORT_STUB
>> +    switch_data.old_thread_state = &(p->_state);
>> +    switch_data.new_thread_state = &(n->_state);
>> +    return switch_data;
>> +}
>> +#else
>>      n->switch_to();
>> +#endif
>>
>>      // Note: after the call to n->switch_to(), we should no longer use any of
>>      // the local variables, nor "this" object, because we just switched to n's
>>      // stack and the values we can access now are those that existed in the
>>      // reschedule call which scheduled n out, and will now be returning.
>>      // So to get the current cpu, we must use cpu::current(), not "this".
>> +#ifdef AARCH64_PORT_STUB
>> +extern "C" void destroy_current_cpu_terminating_thread()
>> +{
>> +#endif
>>      if (cpu::current()->terminating_thread) {
>>          cpu::current()->terminating_thread->destroy();
>>          cpu::current()->terminating_thread = nullptr;
>>      }
>>  }
>>
>> +#ifdef AARCH64_PORT_STUB
>> +extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu,
>> +                                                       bool called_from_yield,
>> +                                                       thread_runtime::duration preempt_after)
>> +{
>> +    return cpu->schedule_next_thread(called_from_yield, preempt_after);
>> +}
>> +#endif
>> +
>>  void cpu::timer_fired()
>>  {
>>      // nothing to do, preemption will happen if needed
>> @@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu)
>>          // Note that wakeme is on the same CPU, and irq is disabled,
>>          // so it will not actually run until we stop running.
>>          wakeme->wake_with([&] { do_wakeme = true; });
>> +#ifdef AARCH64_PORT_STUB
>> +        reschedule_from_interrupt(source_cpu, false, thyst);
>> +#else
>>          source_cpu->reschedule_from_interrupt();
>> +#endif
>>      }
>>      // wakeme will be implicitly join()ed here.
>>  }
>> @@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration preempt_after)
>>      }
>>      trace_sched_yield_switch();
>>
>> +#ifdef AARCH64_PORT_STUB
>> +    reschedule_from_interrupt(cpu::current(), true, preempt_after);
>> +#else
>>      cpu::current()->reschedule_from_interrupt(true, preempt_after);
>> +#endif
>>  }
>>
>>  void thread::set_priority(float priority)
>> diff --git a/include/osv/sched.hh b/include/osv/sched.hh
>> index 0fb44f77..1b2847be 100644
>> --- a/include/osv/sched.hh
>> +++ b/include/osv/sched.hh
>> @@ -31,6 +31,9 @@ typedef float runtime_t;
>>
>>  extern "C" {
>>      void smp_main();
>> +#ifdef AARCH64_PORT_STUB
>> +    void destroy_current_cpu_terminating_thread();
>> +#endif
>>  };
>>  void smp_launch();
>>
>> @@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst = std::chrono::milliseconds(5);
>>  constexpr thread_runtime::duration context_switch_penalty = std::chrono::microseconds(10);
>>
>> -
>> +#ifdef AARCH64_PORT_STUB
>> +struct thread_switch_data {
>> +    thread_state* old_thread_state = nullptr;
>> +    thread_state* new_thread_state = nullptr;
>> +};
>> +#endif
>>  /**
>>   * OSv thread
>>   */
>> @@ -600,7 +608,9 @@ private:
>>                      unsigned allowed_initial_states_mask = 1 << unsigned(status::waiting));
>>      static void sleep_impl(timer &tmr);
>>      void main();
>> +#ifndef AARCH64_PORT_STUB
>>      void switch_to();
>> +#endif
>>      void switch_to_first();
>>      void prepare_wait();
>>      void wait();
>> @@ -702,7 +712,12 @@ private:
>>      std::vector<char*> _tls;
>>      bool _app;
>>      std::shared_ptr<osv::application_runtime> _app_runtime;
>> +public:
>>      void destroy();
>> +private:
>> +#ifdef AARCH64_PORT_STUB
>> +    friend void ::destroy_current_cpu_terminating_thread();
>> +#endif
>>      friend class thread_ref_guard;
>>      friend void thread_main_c(thread* t);
>>      friend class wait_guard;
>> @@ -884,8 +899,13 @@ struct cpu : private timer_base::client {
>>       * @param preempt_after fire the preemption timer after this time period
>>       *                      (relevant only when "called_from_yield" is TRUE)
>>       */
>> +#ifdef AARCH64_PORT_STUB
>> +    thread_switch_data schedule_next_thread(bool called_from_yield = false,
>> +                                            thread_runtime::duration preempt_after = thyst);
>> +#else
>>      void reschedule_from_interrupt(bool called_from_yield = false,
>> -                                   thread_runtime::duration preempt_after = thyst);
>> +                                   thread_runtime::duration preempt_after = thyst);
>> +#endif
>>      void enqueue(thread& t);
>>      void init_idle_thread();
>>      virtual void timer_fired() override;
>> @@ -895,6 +915,12 @@ struct cpu : private timer_base::client {
>>      int renormalize_count;
>>  };
>>
>> +#ifdef AARCH64_PORT_STUB
>> +extern "C" void reschedule_from_interrupt(cpu* cpu,
>> +                                          bool called_from_yield,
>> +                                          thread_runtime::duration preempt_after);
>> +#endif
>> +
>>  class cpu::notifier {
>>  public:
>>      explicit notifier(std::function<void ()> cpu_up);
>> @@ -996,7 +1022,11 @@ inline bool preemptable()
>>  inline void preempt()
>>  {
>>      if (preemptable()) {
>> +#ifdef AARCH64_PORT_STUB
>> +        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
>> +#else
>>          sched::cpu::current()->reschedule_from_interrupt();
>> +#endif
>>      } else {
>>          // preempt_enable() will pick this up eventually
>>          need_reschedule = true;
>> --
>> 2.27.0
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/osv-dev/20210315051824.88794-1-jwkozaczuk%40gmail.com.
>>
>
--
You received this message because you are subscribed to the Google Groups
"OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/CAL9cFfMeemeJdWdxj1EuX-qLBSp27Uto7sXusj3Re6%2BgkHTsNQ%40mail.gmail.com.
