On Monday, April 24, 2023 at 2:08:26 PM UTC-4 Waldek Kozaczuk wrote:
> On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <[email protected]> wrote:
>> On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <[email protected]> wrote:
>>
>> Hi,
>>
>> Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent, statically linked executable.
>
> Nice!
>
>> 1. Support the arch_prctl syscall that sets the app TLS - this was by far the most complicated element. It required changing OSv to store a new per-cpu data pointer in the GS register, and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880).
>
> If this has noticeable overhead, perhaps it makes sense to make it optional?

I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the colocated context switch time (as measured by misc-ctxsw) go up from 313 to 362 ns, caused by adding this line:

    processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));

which indirectly puts the cost of changing the GS or FS base with the WRMSR instruction at ~50 ns (yikes). I would think the FSGSBASE instructions should be faster.
BTW I have measured the cost of WRMSR and wrgsbase indirectly by modifying thread::switch_to() like this and running misc-ctxsw (I assume the cost of wrfsbase would be identical):

+uint32_t IA32_GS_BASE = 0xc0000101;
+
 void thread::switch_to()
 {
     thread* old = current();
@@ -81,6 +82,8 @@ void thread::switch_to()
     // barriers
     barrier();
     set_fsbase(reinterpret_cast<u64>(_tcb));
+    //asm volatile("wrgsbase %0" : : "r"(reinterpret_cast<u64>(_tcb)));
+    //processor::wrmsr(IA32_GS_BASE, reinterpret_cast<u64>(_tcb));
     barrier();
     auto c = _detached_state->_cpu;
     old->_state.exception_stack = c->arch.get_exception_stack();

With the wrgsbase line uncommented, the cost of a colocated context switch barely budged: on average I could see a difference of at most 1-2 ns, if any; sometimes the times were identical. So wrgsbase seems to be pretty cheap, though we should still avoid calling it in the interrupt/page fault/syscall handlers. With the wrmsr line uncommented, on the other hand, the context switch cost went up by ~50 ns, so this instruction is very expensive. That is also why we especially need to avoid it when wrgsbase is not available.

Here is a subset of the changes I had to make to the context-switching code and the interrupt/syscall handler:

1. Add 2 new fields to the thread control block:

    unsigned long app_tcb;       // holds the address the app passed to arch_prctl
    long kernel_tcb_counter;     // if 0, we have to do an app/kernel/app FS base switch

2. Set up new per-cpu data intended to hold a pointer to the TCB:

--- a/arch/x64/arch-cpu.hh
+++ b/arch/x64/arch-cpu.hh
+struct tcb_data {
+    u64 kernel_tcb;
+    u64 tmp[2];
+};
+
 struct arch_cpu {
     arch_cpu();
     processor::aligned_task_state_segment atss;
@@ -46,6 +52,7 @@ struct arch_cpu {
     u32 apic_id;
     u32 acpi_id;
     u64 gdt[nr_gdt];
+    tcb_data _tcb_data;
     void init_on_cpu();
     void set_ist_entry(unsigned ist, char* base, size_t size);
     char* get_ist_entry(unsigned ist);
@@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu()
     processor::init_fpu();
     processor::init_syscall();
+
+    processor::wrmsr(msr::IA32_GS_BASE, reinterpret_cast<u64>(&_tcb_data.kernel_tcb));
 }

3. Change the kernel FS pointer on each context switch:

--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -81,11 +81,13 @@ void thread::switch_to()
 ...
     c->arch.set_exception_stack(_state.exception_stack);
+    c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); // This should be very fast
     auto fpucw = processor::fnstcw();
 ...
@@ -258,6 +260,7 @@ void thread::setup_tcb()
     else {
         _tcb->syscall_stack_top = 0;
     }
+    _tcb->kernel_tcb_counter = 1; // By default disable the FS base switch
 }

4. Handle the FS switch if necessary on entry/exit of the syscall/exception/page fault handler. This is just the code change around syscall entry, but we have to do the opposite on exit, and something similar for the page fault/interrupt handler (possibly the signal handler as well):

@@ -174,6 +214,26 @@ syscall_entry:
     .cfi_register rip, rcx # rcx took previous rip value
     .cfi_register rflags, r11 # r11 took previous rflags value
     # There is no ring transition and rflags are left unchanged.
+    #
+    # app->kernel tcb switch
+    movq %rax, %gs:8   # save register rax so we can restore it later
+    movq %gs:0, %rax   # copy address of kernel tcb to the temp register rax
+    # 1. Check if kernel_tcb_counter is 0 and jump over to 3 if not (no need to do an fsbase switch)
+    cmpq $0, 40(%rax)
+    jne on_kernel_tcb
+
+    # 2. If zero, set the FS base to the kernel tcb
+    movq %rbx, %gs:16  # save register rbx so we can restore it later
+    movq (%rax), %rbx  # set kernel tcb
+    wrfsbase %rbx      # TODO: in reality we need to check if wrfsbase is available and use wrmsr if not
+    movq %gs:16, %rbx
+
+on_kernel_tcb:
+    # 3. Increment counter (for the nested case)
+    incq 40(%rax)
+    # 4. Restore %rax
+    movq %gs:8, %rax
+    #
     # Unfortunately the mov instruction cannot be used to dereference an address
     # on the syscall stack pointed to by an address in the TCB (%fs:16) - double memory dereference.

I did measure that the context-switching code is not affected in any way. But I am sure the syscall/page fault/interrupt handlers are affected - hopefully by a tiny bit in all cases except when an application thread (of a static ELF) gets interrupted, triggers a page fault, or makes a syscall. In other words, I hope that kernel threads and normal (non-static-ELF) threads would not be affected. We could also add the necessary #ifdef static_elf. Any ideas on how to measure how much slower the interrupt/syscall/page fault handler would become?

>> 2. Fixing a potential bug in handling TCGETS in the console driver.
>
> I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history lesson in drivers/line-discipline.cc :-)

I think it may have to do with some size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after an ioctl syscall that ended up calling the code to handle TCGETS.
This change seems to fix it:

--- a/drivers/console.cc
+++ b/drivers/console.cc
@@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg)
 {
     switch (request) {
     case TCGETS:
-        *static_cast<termios*>(arg) = tio;
+        //*static_cast<termios*>(arg) = tio;
+        {
+            termios *in = static_cast<termios*>(arg);
+            in->c_iflag = tio.c_iflag;
+            in->c_oflag = tio.c_oflag;
+            in->c_cflag = tio.c_cflag;
+            in->c_lflag = tio.c_lflag;
+            in->c_line = tio.c_line;
+        }
         return 0;

I think I have missed the c_cc field. Here is the relevant code in glibc - https://github.com/bminor/glibc/blob/ae612c45efb5e34713859a5facf92368307efb6e/sysdeps/unix/sysv/linux/tcgetattr.c

>> 3. Implement sys_prlimit
>> 4. Enable the readlink, geteuid and getegid
>
> I think we already had those - or did you mean the system call?

Yes, just add the SYSCALLx() macros to linux.cc.

>> This was enough to run a single-threaded app, but we will need to implement the clone syscall to support multi-threaded apps.
>
> Very nice. You can probably start by implementing the "simple" case of clone() used by a simple multi-threaded application and leave the other cases with UNIMPLEMENTED (or "ignore" various parameters and leave them to be perfected later, with WARN_ONCE).
