On Wed, Mar 3, 2021 at 4:57 PM Waldemar Kozaczuk <[email protected]>
wrote:

> As issue #1124 and the mailing list thread
> https://groups.google.com/g/osv-dev/c/ystZElFNdBI
> explain, some tests in the aarch64 debug build fail in a consistent
> manner with page faults, suggesting that the values of parameters
> passed from function to function somehow become corrupt. After painful
> research, I pinpointed the crashes to the reschedule_from_interrupt()
> method, which does not work correctly in voluntary (non-preemptive)
> context switch scenarios, i.e. when the thread::yield() method is
> called by an app. It turns out that in those situations the so-called
> callee-saved registers used by an app function calling thread::yield()
> are not saved during the context switch and consequently not restored
> when that thread is later resumed. The following steps illustrate such
> a scenario:
>
> 1. Function F pushes one of the callee-saved registers - x23 (just an
>    example) - onto the T1 stack because it uses it for something and
>    must do so per the calling convention.
> 2. Function F stores some value in x23.
> 3. Function F calls thread::yield() directly or indirectly.
> 4. Eventually, reschedule_from_interrupt() is called and it calls
>    switch_to() to switch the stack pointer to the new thread T2 stack.
>    In the debug build, neither reschedule_from_interrupt() nor
>    switch_to() stores x23, as they do not use this register (unlike the
>    release build).
> 5. At some point later, reschedule_from_interrupt() is called again (not
>    necessarily the very next time) and calls switch_to(), which switches
>    back to T1.
> 6. T1 resumes and eventually returns control to function F right after
>    its call to yield().
> 7. The code in F after the call to yield() reads the value of x23 ...
>    and boom. The x23 register quite likely contains garbage, because F
>    never restored it after calling yield() - per the calling convention,
>    yield() or its other callees should have saved and restored it. But
>    they did not, did they? Or rather, different routines on different
>    threads running on the same cpu in the meantime clobbered it.
>
> One way to solve this problem would be to add all callee-saved registers
> (x19-x28 and d8-d15 per
> https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst
> - see "6.1.1 General-purpose Registers" and "6.1.2 SIMD and
> Floating-Point Registers" about the v8-v15 registers) to the clobber
> list of the inline assembly in thread::switch_to() in
> arch/aarch64/arch-switch.hh. But this does not always work, as some
> versions of gcc (see
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342) may or may not
> inline switch_to() into reschedule_from_interrupt(), depending on the
> optimization level (-O0, -O2, etc), and generate machine code that does
> not match our intent. On top of that, the code in
> reschedule_from_interrupt() that destroys any terminated thread on the
> current cpu uses the thread-local storage (TLS) pointer register -
> tpidr_el0 - which is not re-read after switch_to() and causes the
> interesting crashes that issue #1121 describes.
>
> Given all of the above, and the fact that reschedule_from_interrupt()
> calls switch_to() to switch threads - which is very tricky to do right
> using pure C++ and some inline assembly - this patch implements a
> high-level version of reschedule_from_interrupt() in assembly for
> aarch64 (the x64 version stays as is).
>
> This patch modifies include/osv/sched.hh and core/sched.cc with a number
> of `#ifdef AARCH64_PORT_STUB` blocks to keep reschedule_from_interrupt()
> as is for x64 and, for aarch64, split it into two new methods -
> schedule_next_thread() and destroy_current_cpu_terminating_thread().
> The two new functions correspond to the parts of the original
> reschedule_from_interrupt() up to and after the switch_to() call,
> respectively. Please note that schedule_next_thread() returns a simple
> two-field struct with pointers to the old and new thread state (pc, sp,
> tcb, etc.) that is used by reschedule_from_interrupt().
>
> On top of that, this patch adds arch/aarch64/sched.S, which implements
> reschedule_from_interrupt() as a C function in assembly. In essence,
> reschedule_from_interrupt() calls the new schedule_next_thread(), saves
> all callee-saved registers on the old thread's stack, performs the
> context switch as the old switch_to() did, restores the callee-saved
> registers from the new thread's stack, and finally calls
> destroy_current_cpu_terminating_thread().
>
> This patch has been tested with the -O0, -O1, and -O2 optimization
> levels (release, debug 1 and 2) on real ARM hardware and in emulated
> (QEMU TCG) mode.
>
> Please note that the new code makes the context switch a little bit
> more expensive (~3%) per these numbers:
>
> BEFORE:
> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
> OSv v0.55.0-204-g837a9b5d
> getauxval() stubbed
> eth0: 192.168.122.15
> Booted up in 58.19 ms
> Cmdline: /tests/misc-ctxsw.so
>        664 colocated
>        665 apart
>        665 nopin
>
> AFTER:
> taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
> OSv v0.55.0-205-g7e7c49f7
> getauxval() stubbed
> eth0: 192.168.122.15
> Booted up in 58.20 ms
> Cmdline: /tests/misc-ctxsw.so
>        683 colocated
>        690 apart
>        690 nopin
>

BTW, the same test app compiled as a PIE (g++ tests/misc-ctxsw.cc -lpthread -o
misc-ctxsw) and run natively on the Linux host yields the following on the
same hardware:

taskset -c 2-2 ./misc-ctxsw
      4075 colocated
      5319 apart
      1813 nopin
>
>
> Fixes #1121
> Fixes #1124
>
> Signed-off-by: Waldemar Kozaczuk <[email protected]>
> ---
>  Makefile                          |  1 +
>  arch/aarch64/arch-switch.hh       | 38 +++--------------
>  arch/aarch64/arch-thread-state.hh |  1 +
>  arch/aarch64/sched.S              | 70 +++++++++++++++++++++++++++++++
>  core/sched.cc                     | 63 ++++++++++++++++++++++++++--
>  include/osv/sched.hh              | 34 ++++++++++++++-
>  6 files changed, 170 insertions(+), 37 deletions(-)
>  create mode 100644 arch/aarch64/sched.S
>
> diff --git a/Makefile b/Makefile
> index bbc57252..e1866c3b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -878,6 +878,7 @@ objects += arch/$(arch)/memset.o
>  objects += arch/$(arch)/memcpy.o
>  objects += arch/$(arch)/memmove.o
>  objects += arch/$(arch)/tlsdesc.o
> +objects += arch/$(arch)/sched.o
>  objects += $(libfdt)
>  endif
>
> diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
> index dff7467c..1a78d132 100644
> --- a/arch/aarch64/arch-switch.hh
> +++ b/arch/aarch64/arch-switch.hh
> @@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t);
>
>  namespace sched {
>
> -void thread::switch_to()
> -{
> -    thread* old = current();
> -    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
> -
> -    asm volatile("\n"
> -                 "str x29,     %0  \n"
> -                 "mov x2, sp       \n"
> -                 "adr x1, 1f       \n" /* address of label */
> -                 "stp x2, x1,  %1  \n"
> -
> -                 "ldp x29, x0, %2  \n"
> -                 "ldp x2, x1,  %3  \n"
> -
> -                 "mov sp, x2       \n"
> -                 "blr x1           \n"
> -
> -                 "1:               \n" /* label */
> -                 :
> -                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
> -                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
> -                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
> -                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
> -                   "x16", "x17", "x18", "x30", "memory");
> -}
> -
>  void thread::switch_to_first()
>  {
>      asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
> @@ -59,14 +33,13 @@ void thread::switch_to_first()
>
>      asm volatile("\n"
>                   "ldp x29, x0, %0  \n"
> -                 "ldp x2, x1, %1   \n"
> -                 "mov sp, x2       \n"
> -                 "blr x1           \n"
> +                 "ldp x22, x21, %1 \n"
> +                 "mov sp, x22      \n"
> +                 "blr x21          \n"
>                   :
>                   : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
> -                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
> -                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
> -                   "x16", "x17", "x18", "x30", "memory");
> +                 : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
> +                  "x25", "x26", "x27", "x28", "x29", "x30", "memory");
>  }
>
>  void thread::init_stack()
> @@ -150,6 +123,7 @@ void thread::setup_tcb()
>      void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
>      _tcb = (thread_control_block *)p;
>      _tcb[0].tls_base = &_tcb[1];
> +    _state.tcb = p;
>      //
>      // First goes kernel TLS data
>      auto kernel_tls = _tcb[0].tls_base;
> diff --git a/arch/aarch64/arch-thread-state.hh
> b/arch/aarch64/arch-thread-state.hh
> index af424472..6f1b680d 100644
> --- a/arch/aarch64/arch-thread-state.hh
> +++ b/arch/aarch64/arch-thread-state.hh
> @@ -14,6 +14,7 @@ struct thread_state {
>
>      void* sp;
>      void* pc;
> +    void* tcb;
>  };
>
>  #endif /* ARCH_THREAD_STATE_HH_ */
> diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
> new file mode 100644
> index 00000000..1e69f8fd
> --- /dev/null
> +++ b/arch/aarch64/sched.S
> @@ -0,0 +1,70 @@
> +#void cpu::reschedule_from_interrupt(bool called_from_yield,
> +#                                    thread_runtime::duration
> preempt_after)
> +.global reschedule_from_interrupt
> +.type reschedule_from_interrupt, @function
> +reschedule_from_interrupt:
> +        #.cfi_startproc
> +        stp     x29, x30, [sp, #-176]!
> +        #.cfi_adjust_cfa_offset 176
> +        mov    x29, sp
> +        #.cfi_def_cfa_register x29
> +
> +        #Call cpu_schedule_next_thread() to determine next thread to
> switch to if any
> +        bl cpu_schedule_next_thread
> +
> +       #The cpu_schedule_next_thread returns thread_switch_data in x0 and
> x1,
> +        #where x0 holds old_thread_state and x1 holds new_thread_state
> +        #If cpu_schedule_next_thread() returns thread_switch_data with
> null pointers, exit
> +        cmp x0, #0
> +        b.eq 2f
> +
> +        #Store all callee-save registers on the old thread stack
> +        stp    x19, x20, [sp, #16]
> +        stp    x21, x22, [sp, #32]
> +        stp    x23, x24, [sp, #48]
> +        stp    x25, x26, [sp, #64]
> +        stp    x27, x28, [sp, #80]
> +
> +        stp    d8, d9, [sp, #96]
> +        stp    d10, d11, [sp, #112]
> +        stp    d12, d13, [sp, #128]
> +        stp    d14, d15, [sp, #144]
> +
> +        #Switch to new thread
> +        ldr    x2, [x1, #32]     //Fetch _tcb of the new thread
> +        msr    tpidr_el0, x2     //Set thread pointer
> +        isb
> +
> +        str     x29, [x0, #0]     //Save frame pointer of the old thread
> +        mov     x3, sp            //Fetch old thread stack pointer
> +        adr     x4, 1f            //Fetch old thread instruction point
> +        stp     x3, x4, [x0, #16] //Save old thread sp and pc
> +
> +        ldp     x29, x0, [x1, #0] //Set frame pointer of the new thread
> and this (x0) of the new thread
> +                                  //Please note that the pc may point to
> thread_main_c(thread*) which is
> +                                  //why we have to set x0 (1st argument)
> to new thread object
> +        ldp     x3, x4, [x1, #16] //Fetch new thread sp and pc
> +        mov     sp, x3            //Set new thread stack pointer
> +        blr     x4                //Jump to the new thread pc
> +
> +1:
> +        #Restore all callee-save registers from the new thread stack
> +        ldp     x19, x20, [sp, #16]
> +        ldp     x21, x22, [sp, #32]
> +        ldp     x23, x24, [sp, #48]
> +        ldp     x25, x26, [sp, #64]
> +        ldp     x27, x28, [sp, #80]
> +
> +        ldp    d8, d9, [sp, #96]
> +        ldp    d10, d11, [sp, #112]
> +        ldp    d12, d13, [sp, #128]
> +        ldp    d14, d15, [sp, #144]
> +
> +        #Call destroy_current_cpu_terminating_thread()
> +        bl destroy_current_cpu_terminating_thread
> +
> +2:
> +        ldp     x29, x30, [sp], #176
> +        #.cfi_def_cfa sp, 0
> +        ret
> +        #.cfi_endproc
> diff --git a/core/sched.cc b/core/sched.cc
> index 06f849d1..74b4ab95 100644
> --- a/core/sched.cc
> +++ b/core/sched.cc
> @@ -225,13 +225,34 @@ void thread::cputime_estimator_get(
>  void cpu::schedule()
>  {
>      WITH_LOCK(irq_lock) {
> +#ifdef AARCH64_PORT_STUB
> +        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
> +#else
>          current()->reschedule_from_interrupt();
> -    }
> -}
> -
> +#endif
> +    }
> +}
> +
> +// In aarch64 port, the reschedule_from_interrupt() needs to be
> implemented
> +// in assembly (please see arch/aarch64/sched.S) to give us better control
> +// which registers are used and which ones are saved and restored during
> +// the context switch. In essence, most of the original
> reschedule_from_interrupt()
> +// code up to switch_to() is shared with aarch64-specific
> cpu::schedule_next_thread()
> +// that is called from reschedule_from_interrupt() in arch/arch64/sched.S
> assembly.
> +// The logic executed after switch_to() that makes up
> destroy_current_cpu_terminating_thread()
> +// is called from arch/arch64/sched.S as well.
> +// At the end, we define reschedule_from_interrupt() in C++ for x86_64
> and schedule_next_thread()
> +// and destroy_current_cpu_terminating_thread() in C++ for aarch64.
> +#ifdef AARCH64_PORT_STUB
> +inline thread_switch_data cpu::schedule_next_thread(bool
> called_from_yield,
> +
> thread_runtime::duration preempt_after)
> +{
> +    thread_switch_data switch_data;
> +#else
>  void cpu::reschedule_from_interrupt(bool called_from_yield,
>                                      thread_runtime::duration
> preempt_after)
>  {
> +#endif
>      trace_sched_sched();
>      assert(sched::exception_depth <= 1);
>      need_reschedule = false;
> @@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool
> called_from_yield,
>          // lowest runtime, and update the timer until the next thread's
> turn.
>          if (runqueue.empty()) {
>              preemption_timer.cancel();
> +#ifdef AARCH64_PORT_STUB
> +            return switch_data;
> +#else
>              return;
> +#endif
>          } else if (!called_from_yield) {
>              auto &t = *runqueue.begin();
>              if (p->_runtime.get_local() < t._runtime.get_local()) {
> @@ -268,7 +293,11 @@ void cpu::reschedule_from_interrupt(bool
> called_from_yield,
>                  if (delta > 0) {
>                      preemption_timer.set(now + delta);
>                  }
> +#ifdef AARCH64_PORT_STUB
> +                return switch_data;
> +#else
>                  return;
> +#endif
>              }
>          }
>          // If we're here, p no longer has the lowest runtime. Before
> queuing
> @@ -336,19 +365,39 @@ void cpu::reschedule_from_interrupt(bool
> called_from_yield,
>      if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
>          mmu::flush_tlb_local();
>      }
> +#ifdef AARCH64_PORT_STUB
> +    switch_data.old_thread_state = &(p->_state);
> +    switch_data.new_thread_state = &(n->_state);
> +    return switch_data;
> +}
> +#else
>      n->switch_to();
> +#endif
>
>      // Note: after the call to n->switch_to(), we should no longer use
> any of
>      // the local variables, nor "this" object, because we just switched
> to n's
>      // stack and the values we can access now are those that existed in
> the
>      // reschedule call which scheduled n out, and will now be returning.
>      // So to get the current cpu, we must use cpu::current(), not "this".
> +#ifdef AARCH64_PORT_STUB
> +extern "C" void destroy_current_cpu_terminating_thread()
> +{
> +#endif
>      if (cpu::current()->terminating_thread) {
>          cpu::current()->terminating_thread->destroy();
>          cpu::current()->terminating_thread = nullptr;
>      }
>  }
>
> +#ifdef AARCH64_PORT_STUB
> +extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu,
> +                                                       bool
> called_from_yield,
> +
>  thread_runtime::duration preempt_after)
> +{
> +    return cpu->schedule_next_thread(called_from_yield, preempt_after);
> +}
> +#endif
> +
>  void cpu::timer_fired()
>  {
>      // nothing to do, preemption will happen if needed
> @@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu)
>          // Note that wakeme is on the same CPU, and irq is disabled,
>          // so it will not actually run until we stop running.
>          wakeme->wake_with([&] { do_wakeme = true; });
> +#ifdef AARCH64_PORT_STUB
> +        reschedule_from_interrupt(source_cpu, false, thyst);
> +#else
>          source_cpu->reschedule_from_interrupt();
> +#endif
>      }
>      // wakeme will be implicitly join()ed here.
>  }
> @@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration
> preempt_after)
>      }
>      trace_sched_yield_switch();
>
> +#ifdef AARCH64_PORT_STUB
> +    reschedule_from_interrupt(cpu::current(), true, preempt_after);
> +#else
>      cpu::current()->reschedule_from_interrupt(true, preempt_after);
> +#endif
>  }
>
>  void thread::set_priority(float priority)
> diff --git a/include/osv/sched.hh b/include/osv/sched.hh
> index 0fb44f77..1dacd623 100644
> --- a/include/osv/sched.hh
> +++ b/include/osv/sched.hh
> @@ -31,6 +31,9 @@ typedef float runtime_t;
>
>  extern "C" {
>  void smp_main();
> +#ifdef AARCH64_PORT_STUB
> +void destroy_current_cpu_terminating_thread();
> +#endif
>  };
>  void smp_launch();
>
> @@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst =
> std::chrono::milliseconds(5);
>  constexpr thread_runtime::duration context_switch_penalty =
>                                             std::chrono::microseconds(10);
>
> -
> +#ifdef AARCH64_PORT_STUB
> +struct thread_switch_data {
> +   thread_state* old_thread_state = nullptr;
> +   thread_state* new_thread_state = nullptr;
> +};
> +#endif
>  /**
>   * OSv thread
>   */
> @@ -600,7 +608,9 @@ private:
>              unsigned allowed_initial_states_mask = 1 <<
> unsigned(status::waiting));
>      static void sleep_impl(timer &tmr);
>      void main();
> +#ifndef AARCH64_PORT_STUB
>      void switch_to();
> +#endif
>      void switch_to_first();
>      void prepare_wait();
>      void wait();
> @@ -702,7 +712,12 @@ private:
>      std::vector<char*> _tls;
>      bool _app;
>      std::shared_ptr<osv::application_runtime> _app_runtime;
> +public: //TODO: For whatever reason we need to make this public even
> though destroy_current_cpu_terminating_thread is defined as friend below,
> so compiler still complains.
>      void destroy();
> +private:
> +#ifdef AARCH64_PORT_STUB
> +    friend void ::destroy_current_cpu_terminating_thread();
> +#endif
>      friend class thread_ref_guard;
>      friend void thread_main_c(thread* t);
>      friend class wait_guard;
> @@ -884,8 +899,13 @@ struct cpu : private timer_base::client {
>       * @param preempt_after fire the preemption timer after this time
> period
>       *                      (relevant only when "called_from_yield" is
> TRUE)
>       */
> +#ifdef AARCH64_PORT_STUB
> +    thread_switch_data schedule_next_thread(bool called_from_yield =
> false,
> +                                            thread_runtime::duration
> preempt_after = thyst);
> +#else
>      void reschedule_from_interrupt(bool called_from_yield = false,
> -                                thread_runtime::duration preempt_after =
> thyst);
> +                                   thread_runtime::duration preempt_after
> = thyst);
> +#endif
>      void enqueue(thread& t);
>      void init_idle_thread();
>      virtual void timer_fired() override;
> @@ -895,6 +915,12 @@ struct cpu : private timer_base::client {
>      int renormalize_count;
>  };
>
> +#ifdef AARCH64_PORT_STUB
> +extern "C" void reschedule_from_interrupt(cpu* cpu,
> +                                          bool called_from_yield,
> +                                          thread_runtime::duration
> preempt_after);
> +#endif
> +
>  class cpu::notifier {
>  public:
>      explicit notifier(std::function<void ()> cpu_up);
> @@ -996,7 +1022,11 @@ inline bool preemptable()
>  inline void preempt()
>  {
>      if (preemptable()) {
> +#ifdef AARCH64_PORT_STUB
> +        reschedule_from_interrupt(sched::cpu::current(), false, thyst);
> +#else
>          sched::cpu::current()->reschedule_from_interrupt();
> +#endif
>      } else {
>          // preempt_enable() will pick this up eventually
>          need_reschedule = true;
> --
> 2.27.0
>
>
