From: Waldemar Kozaczuk <jwkozac...@gmail.com>
Committer: WALDEMAR KOZACZUK <jwkozac...@gmail.com>
Branch: master
Implement arch_prctl syscall to support TLS in statically linked executables

Even the simplest executables need thread local storage (TLS); a good example is errno, which is a thread-local variable. The OSv kernel itself uses many thread-local variables, and when running dynamically linked executables it shares the TLS memory block with the application. In this case OSv fully controls the TLS setup, stores the pointer to the TCB (Thread Control Block) as part of the thread state, and updates the FS register on every context switch. Statically linked executables, on the other hand, set up their own TLS and register it with the kernel by executing the arch_prctl syscall on x86_64. To support this in OSv, we need to implement the arch_prctl syscall and modify some key places in the kernel code - the syscall handler, the exception handlers and the VDSO - to switch from the application TLS to the kernel one and back.

The newly implemented arch_prctl syscall, on ARCH_SET_FS, stores the application TCB address in the new app_tcb field added to the thread_control_block structure. At the same time, we modify the following places to support switching between the application and kernel TCB when necessary (app_tcb != 0):

The exception handler assembly in entry.S is modified to detect whether, on entry, the FS register points to the kernel TCB and, if not, to switch to the kernel one; on exit from the exception, it switches back to the application TCB. To make this possible we "duplicate" the current thread's kernel TCB address and store it in the new _current_thread_kernel_tcb field of the arch_cpu structure, which is updated on every context switch and can be accessed in assembly as gs:16. The first 8-byte field, self, of the thread control block holds the address of the TCB itself, so we can simply compare fs:0 with gs:16 to know whether the FS register points to the kernel TCB or not. Please note this scheme is simpler and faster than the one in the original version of this patch, which relied on an extra counter and also required interrupts to be disabled. It also works in nested scenarios - for example, a page fault interrupted during a sleep.

Similarly, we change the syscall handler and the VDSO code, where we use a simple RAII utility - tls_switch - to detect whether the current thread has a non-zero application TCB and, if so, to switch to the kernel one before the code in scope runs and back to the application one after. This scheme is a little different from the exception handler one because both the syscall and VDSO functions are only executed on application threads, which may have the FS register pointing to the application TCB (for example when running statically linked executables), and we do not need to deal with nesting. In addition, the VDSO code is converted to C++ to allow using the RAII utility described above.

In essence, this PR makes it possible to launch simple statically linked executables like "Hello World" on OSv:

gcc -static -o hello-static apps/native-example/hello.c
./scripts/run.py -e /hello-static

OSv v0.57.0-74-g2a835078
Booted up in 142.76 ms
Cmdline: /hello-static
WARNING: Statically linked executables are only supported to limited extent!
syscall(): unimplemented system call 218
syscall(): unimplemented system call 273
syscall(): unimplemented system call 334
syscall(): unimplemented system call 302
Hello from C code
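For context, the sketch below shows roughly what the startup code of a statically linked libc does to register its TLS block with the kernel. It is a hypothetical, stripped-down illustration (the app_tcb_t layout and the register_static_tls helper are invented for this example, not taken from glibc or from this patch); only the ARCH_SET_FS value matches the constant handled by the new arch/x64/prctl.cc:

#include <unistd.h>        // syscall()
#include <sys/syscall.h>   // SYS_arch_prctl
#include <cstdint>

// Same value as the ARCH_SET_FS constant handled in arch/x64/prctl.cc
constexpr int ARCH_SET_FS = 0x1002;

// Hypothetical, stripped-down TCB: on x86_64 the word at %fs:0 must hold
// the address of the TCB itself; a real libc TCB carries more fields and
// lays out the static TLS block relative to this address.
struct app_tcb_t {
    app_tcb_t* self;
};

// Roughly what the libc startup of a statically linked executable boils
// down to: build a TCB and hand its address to the kernel, which on OSv
// now lands in thread_control_block::app_tcb via arch_prctl().
static void register_static_tls(app_tcb_t* tcb)
{
    tcb->self = tcb;
    syscall(SYS_arch_prctl, ARCH_SET_FS, reinterpret_cast<uintptr_t>(tcb));
    // From here on, %fs-relative accesses (errno and other thread-local
    // variables) resolve against the application TCB, not the kernel one.
}

After such a call, every entry into kernel code (a syscall, an exception or a VDSO call) has to temporarily restore the kernel TCB, which is what the changes in this patch do.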
Please note that the code changes touch some critical places of the kernel functionality - context switching, syscall handling, exception handling, and the VDSO implementation - and thus may slightly affect their performance.

As far as context switching goes, this patch adds only a single memory write, which does not seem to affect it in any measurable way based on what misc-ctxsw.cc indicates. On the other hand, the syscall handling cost goes up by 2-5 ns (3-5% of the total cost of ~100 ns, based on what misc-syscall-perf.cc measures) when executing statically linked executables, because we need to switch the fsbase from the app TCB to the kernel one and back. The good news is that syscall handling does not seem to be affected in any significant way when running dynamically linked executables. The VDSO function calls are affected the most, by 7-10 ns (from 23 to 30 ns), even though the VDSO code uses the exact same tls_switch RAII utility and seems to get inlined in a similar way as in the syscall handler. Finally, I did not measure the impact of the changes on exception handling (interrupts, page faults, etc.), but I think it should be similar to syscall handling. Interrupts are in general relatively expensive in a virtualized environment (guest/host), as this email by Avi Kivity explains - https://groups.google.com/g/osv-dev/c/w_fuxsYla-M/m/WxpRZTXQ-twJ. On top of this, the FPU saving/restoring takes ~60 ns, which is far more expensive than switching the fsbase.
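For reference, the numbers above come from the misc-* tests in the tree; the loop below is not that code, just a minimal sketch of the kind of measurement involved (the avg_ns helper and the iteration count are made up for illustration):

#include <cstdio>
#include <ctime>
#include <sys/time.h>
#include <unistd.h>
#include <sys/syscall.h>

// Average the per-call cost over many iterations; at ~25-100 ns per call
// the loop has to be long enough for CLOCK_MONOTONIC to resolve it well.
template <typename Func>
static double avg_ns(Func&& f, long iterations = 10000000)
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iterations; i++) {
        f();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}

int main()
{
    timeval tv;
    // In a statically linked glibc binary gettimeofday() is typically
    // served by __vdso_gettimeofday, i.e. the tls_switch path in vdso.cc
    printf("gettimeofday (VDSO): %.1f ns\n",
           avg_ns([&] { gettimeofday(&tv, nullptr); }));
    // A raw syscall goes through syscall_wrapper() and its tls_switch
    printf("getpid (syscall):    %.1f ns\n",
           avg_ns([] { syscall(SYS_getpid); }));
    return 0;
}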
Fixes #1137

Signed-off-by: Waldemar Kozaczuk <jwkozac...@gmail.com>
---
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -1012,6 +1012,7 @@ objects += arch/x64/ioapic.o
 objects += arch/x64/apic.o
 objects += arch/x64/apic-clock.o
 objects += arch/x64/entry-xen.o
+objects += arch/x64/prctl.o
 objects += arch/x64/vmlinux.o
 objects += arch/x64/vmlinux-boot64.o
 objects += arch/x64/pvh-boot.o
@@ -2160,10 +2161,10 @@ $(out)/libenviron.so: $(environ_sources)
 	$(makedir)
 	$(call quiet, $(CC) $(CFLAGS) -shared -o $(out)/libenviron.so $(environ_sources), CC libenviron.so)

-$(out)/libvdso.so: libc/vdso/vdso.c
+$(out)/libvdso.so: libc/vdso/vdso.cc
 	$(makedir)
-	$(call quiet, $(CC) $(CFLAGS) -c -fPIC -o $(out)/libvdso.o libc/vdso/vdso.c, CC libvdso.o)
-	$(call quiet, $(LD) -shared -fPIC -z now -o $(out)/libvdso.so $(out)/libvdso.o -T libc/vdso/vdso.lds --version-script=libc/vdso/$(arch)/vdso.version, LINK libvdso.so)
+	$(call quiet, $(CXX) $(CXXFLAGS) -fno-exceptions -c -fPIC -o $(out)/libvdso.o libc/vdso/vdso.cc, CXX libvdso.o)
+	$(call quiet, $(LD) -shared -z now -o $(out)/libvdso.so $(out)/libvdso.o -T libc/vdso/vdso.lds --version-script=libc/vdso/$(arch)/vdso.version, LINK libvdso.so)

 bootfs_manifest ?= bootfs.manifest.skel

diff --git a/arch/x64/arch-cpu.hh b/arch/x64/arch-cpu.hh
--- a/arch/x64/arch-cpu.hh
+++ b/arch/x64/arch-cpu.hh
@@ -55,6 +55,12 @@ struct arch_cpu {
     // in order to make it possible to access it in assembly code through
     // a known offset at %gs:0.
     syscall_stack_descriptor _current_syscall_stack_descriptor;
+    // This field holds the kernel TCB address of the current thread
+    // and is updated on every context switch (see arch-switch.hh).
+    // This makes it possible to fetch the kernel TCB address when we need
+    // to switch to it from the app TCB, which is different when running
+    // statically linked executables.
+    u64 _current_thread_kernel_tcb;
     void init_on_cpu();
     void set_ist_entry(unsigned ist, char* base, size_t size);
     char* get_ist_entry(unsigned ist);

diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -11,6 +11,7 @@
 #include "msr.hh"
 #include <osv/barrier.hh>
 #include <string.h>
+#include "tls-switch.hh"

 //
 // The last 16 bytes of the syscall stack are reserved for -
@@ -50,6 +51,8 @@ extern "C" {
 void thread_main(void);
 void thread_main_c(sched::thread* t);
+
+bool fsgsbase_avail = false;
 }

 namespace sched {
@@ -70,6 +73,7 @@ void (*resolve_set_fsbase(void))(u64 v)
     // can't use processor::features, because it is not initialized
     // early enough.
     if (processor::features().fsgsbase) {
+        fsgsbase_avail = true;
         return set_fsbase_fsgsbase;
     } else {
         return set_fsbase_msr;
@@ -96,6 +100,9 @@ void thread::switch_to()
     // so that the syscall handler can reference the current thread syscall stack top using the GS register
     c->arch._current_syscall_stack_descriptor.caller_stack_pointer = _state._syscall_stack_descriptor.caller_stack_pointer;
     c->arch._current_syscall_stack_descriptor.stack_top = _state._syscall_stack_descriptor.stack_top;
+    // set this cpu's current thread kernel TCB address to the TCB address
+    // of the new thread we are switching to
+    c->arch._current_thread_kernel_tcb = reinterpret_cast<u64>(_tcb);
     auto fpucw = processor::fnstcw();
     auto mxcsr = processor::stmxcsr();
     asm volatile
@@ -130,6 +137,7 @@ void thread::switch_to_first()
     remote_thread_local_var(percpu_base) = _detached_state->_cpu->percpu_base;
     _detached_state->_cpu->arch.set_interrupt_stack(&_arch);
     _detached_state->_cpu->arch.set_exception_stack(&_arch);
+    _detached_state->_cpu->arch._current_thread_kernel_tcb = reinterpret_cast<u64>(_tcb);
     asm volatile
         ("mov %c[rsp](%0), %%rsp \n\t"
          "mov %c[rbp](%0), %%rbp \n\t"
@@ -272,6 +280,8 @@ void thread::setup_tcb()
     _tcb = static_cast<thread_control_block*>(p + total_tls_size);
     _tcb->self = _tcb;
     _tcb->tls_base = p + user_tls_size;
+
+    _tcb->app_tcb = 0;
 }

 void thread::setup_large_syscall_stack()
@@ -365,11 +375,15 @@ void thread_main_c(thread* t)

 extern "C" void setup_large_syscall_stack()
 {
+    // Switch TLS register from the app to the kernel TCB and back if necessary
+    arch::tls_switch tls_switch;
     sched::thread::current()->setup_large_syscall_stack();
 }

 extern "C" void free_tiny_syscall_stack()
 {
+    // Switch TLS register from the app to the kernel TCB and back if necessary
+    arch::tls_switch tls_switch;
     sched::thread::current()->free_tiny_syscall_stack();
 }

diff --git a/arch/x64/arch-tls.hh b/arch/x64/arch-tls.hh
--- a/arch/x64/arch-tls.hh
+++ b/arch/x64/arch-tls.hh
@@ -13,6 +13,7 @@
 struct thread_control_block {
     thread_control_block* self;
     void* tls_base;
+    unsigned long app_tcb;
 };

 #endif /* ARCH_TLS_HH */

diff --git a/arch/x64/entry.S b/arch/x64/entry.S
--- a/arch/x64/entry.S
+++ b/arch/x64/entry.S
@@ -34,12 +34,57 @@
     pushq_cfi %r13
     pushq_cfi %r14
     pushq_cfi %r15
+
+    mov $0, %r14      # Use a callee-saved register to track whether we need to switch FS base back to the app TCB
+    mov %gs:16, %r12  # Fetch this thread's kernel TCB address
+    cmpq %fs:0, %r12  # Check if the kernel TCB equals the current TCB
+    je 2f
+
+    mov $1, %r14      # Remember that we need to switch back to the app TCB
+
+    # Switch fsbase (FS register) from app TCB to kernel TCB
+    mov (%r12), %rax  # Copy kernel TCB to rax
+    mov fsgsbase_avail, %r13
+    cmpq $0x0, %r13   # Should we use wrfsbase or wrmsr instruction?
+    jne 1f
+
+    # Switch fsbase using the wrmsr instruction
+    mov %rax, %rdx
+    mov $0xc0000100, %ecx
+    shr $0x20, %rdx
+    wrmsr
+    jmp 2f
+
+1:  # Switch fsbase using the wrfsbase instruction
+    wrfsbase %rax
+
+2:  # Call handler
     mov %rsp, %rdi
     subq $8, %rsp # 16-byte alignment
     .cfi_adjust_cfa_offset 8
     call \handler
     addq $8, %rsp # 16-byte alignment
     .cfi_adjust_cfa_offset -8
+
+    cmpq $1, %r14     # Check if we need to switch to app_tcb
+    jne 4f
+
+    # Switch fsbase (FS register) from kernel TCB to app TCB
+    mov 16(%r12), %rax # Copy app TCB to rax
+    cmpq $0x0, %r13   # Should we use wrfsbase or wrmsr instruction?
+    jne 3f
+
+    # Switch fsbase using the wrmsr instruction
+    mov %rax, %rdx
+    mov $0xc0000100, %ecx
+    shr $0x20, %rdx
+    wrmsr
+    jmp 4f
+
+3:  # Switch fsbase using the wrfsbase instruction
+    wrfsbase %rax
+
+4:  # Restore registers
     popq_cfi %r15
     popq_cfi %r14
     popq_cfi %r13

diff --git a/arch/x64/prctl.cc b/arch/x64/prctl.cc
--- a/arch/x64/prctl.cc
+++ b/arch/x64/prctl.cc
@@ -0,0 +1,34 @@
+/*
+ * Copyright (C) 2014 Cloudius Systems, Ltd.
+ * Copyright (C) 2023 Waldemar Kozaczuk
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#include "arch.hh"
+#include "libc/libc.hh"
+
+#include <assert.h>
+#include <stdio.h>
+
+#include <osv/sched.hh>
+
+enum {
+    ARCH_SET_GS = 0x1001,
+    ARCH_SET_FS = 0x1002,
+    ARCH_GET_FS = 0x1003,
+    ARCH_GET_GS = 0x1004,
+};
+
+long arch_prctl(int code, unsigned long addr)
+{
+    switch (code) {
+    case ARCH_SET_FS:
+        sched::thread::current()->set_app_tcb(addr);
+        return 0;
+    case ARCH_GET_FS:
+        return sched::thread::current()->get_app_tcb();
+    }
+    return libc_error(EINVAL);
+}

diff --git a/arch/x64/tls-switch.hh b/arch/x64/tls-switch.hh
--- a/arch/x64/tls-switch.hh
+++ b/arch/x64/tls-switch.hh
@@ -0,0 +1,54 @@
+/*
+ * Copyright (C) 2023 Waldemar Kozaczuk
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#ifndef TLS_SWITCH_HH
+#define TLS_SWITCH_HH
+
+#include "arch.hh"
+#include "arch-tls.hh"
+#include <osv/barrier.hh>
+
+extern "C" bool fsgsbase_avail;
+
+namespace arch {
+
+inline void set_fsbase(u64 v)
+{
+    barrier();
+    if (fsgsbase_avail) {
+        processor::wrfsbase(v);
+    } else {
+        processor::wrmsr(msr::IA32_FS_BASE, v);
+    }
+    barrier();
+}
+
+// Simple RAII utility class that implements the logic to switch
+// fsbase to the kernel address and back to the app one
+class tls_switch {
+    thread_control_block *_kernel_tcb;
+public:
+    tls_switch() {
+        asm volatile ("movq %%gs:16, %0\n\t" : "=r"(_kernel_tcb));
+
+        // Switch to kernel tcb if app tcb present
+        if (_kernel_tcb->app_tcb) {
+            set_fsbase(reinterpret_cast<u64>(_kernel_tcb->self));
+        }
+    }
+
+    ~tls_switch() {
+        // Switch to app tcb if app tcb present
+        if (_kernel_tcb->app_tcb) {
+            set_fsbase(reinterpret_cast<u64>(_kernel_tcb->app_tcb));
+        }
+    }
+};
+
+}
+
+#endif

diff --git a/core/elf.cc b/core/elf.cc
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -536,7 +536,7 @@ void object::process_headers()
         }
     }
     if (!is_core() && is_statically_linked_executable()) {
-        abort("Statically linked executables are not supported yet!\n");
+        std::cout << "WARNING: Statically linked executables are only supported to limited extent!\n";
     }
     if (_is_dynamically_linked_executable && _tls_segment) {
         auto app_tls_size = get_aligned_tls_size();

diff --git a/include/osv/sched.hh b/include/osv/sched.hh
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -805,6 +805,8 @@ private:
     std::shared_ptr<osv::application_runtime> _app_runtime;
 public:
     void destroy();
+    unsigned long get_app_tcb() { return _tcb->app_tcb; }
+    void set_app_tcb(unsigned long tcb) { _tcb->app_tcb = tcb; }
 private:
 #ifdef __aarch64__
     friend void ::destroy_current_cpu_terminating_thread();

diff --git a/libc/vdso/vdso.cc b/libc/vdso/vdso.cc
--- a/libc/vdso/vdso.cc
+++ b/libc/vdso/vdso.cc
@@ -1,23 +1,26 @@
-//#include "libc.h"
 #include <time.h>
 #include <sys/time.h>

 #ifdef __x86_64__
-__attribute__((__visibility__("default")))
+#include "tls-switch.hh"
+extern "C" __attribute__((__visibility__("default")))
 time_t __vdso_time(time_t *tloc)
 {
+    arch::tls_switch _tls_switch;
     return time(tloc);
 }

-__attribute__((__visibility__("default")))
+extern "C" __attribute__((__visibility__("default")))
 int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
+    arch::tls_switch _tls_switch;
     return gettimeofday(tv, tz);
 }

-__attribute__((__visibility__("default")))
+extern "C" __attribute__((__visibility__("default")))
 int __vdso_clock_gettime(clockid_t clk_id, struct timespec *tp)
 {
+    arch::tls_switch _tls_switch;
     return clock_gettime(clk_id, tp);
 }
 #endif

diff --git a/libc/vdso/vdso.lds b/libc/vdso/vdso.lds
--- a/libc/vdso/vdso.lds
+++ b/libc/vdso/vdso.lds
@@ -14,6 +14,31 @@ SECTIONS
 .eh_frame_hdr : { *(.eh_frame_hdr) } : eh_frame_hdr : text
 .eh_frame : { *(.eh_frame) } : text
 .text : { *(.text*) } : text
+.rela.dyn :
+  {
+    *(.rela.init)
+    *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
+    *(.rela.fini)
+    *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
+    *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
+    *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
+    *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
+    *(.rela.ctors)
+    *(.rela.dtors)
+    *(.rela.got)
+    *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
+    *(.rela.ldata .rela.ldata.* .rela.gnu.linkonce.l.*)
+    *(.rela.lbss .rela.lbss.* .rela.gnu.linkonce.lb.*)
+    *(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
+    *(.rela.ifunc)
+  } : text
+.rela.plt :
+  {
+    *(.rela.plt)
+    PROVIDE_HIDDEN (__rela_iplt_start = .);
+    *(.rela.iplt)
+    PROVIDE_HIDDEN (__rela_iplt_end = .);
+  } : text
 }

 /* Enforce single PT_LOAD segment by specifying all

diff --git a/linux.cc b/linux.cc
--- a/linux.cc
+++ b/linux.cc
@@ -49,6 +49,7 @@
 #include <sys/resource.h>
 #include <termios.h>
 #include <poll.h>
+#include "tls-switch.hh"

 #include <unordered_map>

@@ -504,6 +505,8 @@ static int tgkill(int tgid, int tid, int sig)
 #define __NR_sys_getdents64 __NR_getdents64
 extern "C" ssize_t sys_getdents64(int fd, void *dirp, size_t count);

+extern long arch_prctl(int code, unsigned long addr);
+
 #define __NR_sys_brk __NR_brk
 void *get_program_break();
 static long sys_brk(void *addr)
@@ -654,6 +657,9 @@ OSV_LIBC_API long syscall(long number, ...)
         SYSCALL2(timerfd_gettime, int, struct itimerspec*);
         SYSCALL2(chmod, const char *, mode_t);
         SYSCALL2(fchmod, int, mode_t);
+#ifdef __x86_64__
+        SYSCALL2(arch_prctl, int, unsigned long);
+#endif
     }

     debug_always("syscall(): unimplemented system call %d\n", number);
@@ -679,6 +685,11 @@ extern "C" long syscall_wrapper(long number, long p1, long p2, long p3, long p4,
 extern "C" long syscall_wrapper(long p1, long p2, long p3, long p4, long p5, long p6, long number)
 #endif
 {
+#ifdef __x86_64__
+    // Switch TLS register if necessary
+    arch::tls_switch tls_switch;
+#endif
+
     int errno_backup = errno;
     // syscall and function return value are in rax
     auto ret = syscall(number, p1, p2, p3, p4, p5, p6);
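As a closing note, a small statically linked test along these lines would exercise both new paths - errno for the arch_prctl-registered TLS and clock_gettime for the VDSO. This is a hypothetical example, not part of the patch, and whether it runs as-is on OSv depends on which additional syscalls the static libc startup happens to need:

#include <cstdio>
#include <cerrno>
#include <ctime>
#include <unistd.h>

int main()
{
    // errno is thread-local, so this only works if %fs points at the TCB
    // that the static libc registered through arch_prctl(ARCH_SET_FS, ...)
    errno = 0;
    if (read(-1, nullptr, 0) < 0 && errno == EBADF) {
        printf("TLS (errno) works: got EBADF as expected\n");
    }

    // clock_gettime() in a statically linked glibc binary typically goes
    // through __vdso_clock_gettime, i.e. the tls_switch path in vdso.cc
    timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
        printf("VDSO clock_gettime works: %ld.%09ld\n",
               (long)ts.tv_sec, ts.tv_nsec);
    }
    return 0;
}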