On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <[email protected]> wrote:
> On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <[email protected]> > wrote: > >> Hi, >> >> Over the recent week, I have been working to get OSv to run a simple >> "hello world" app (aka native-example) built as a position-dependent >> statically linked executable. >> > > Nice! > > >> >> 1. Support the arch_prctl syscall that sets the app TLS >> - this was by far the most complicated element that required >> changing OSv to store new per-pcpu data pointer in GS register and >> enhancing both syscall handler and interrupt/page fault handler to >> detect >> and switch if needed the FS base to the kernel TLS on entry and back >> to the >> app one on exit (see >> >> https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880 >> ) >> >> > If this has noticeable overhead, perhaps it makes sense to make it > optional? > I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the context switch time (the colocated one measured by misc-ctxsw) go up from 313 to 362 ns caused by adding this line: processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb)); which may indirectly measure the cost of the code to change GS or FS base using the MSR instruction at ~ 50 ns (yikes). I would think the FSGSBASE instruction should be faster. Here is a subset of the changes I had to make to the context-switching code and the interrupt/syscall handler: 1. Add 2 new fields to the thread control block: unsigned long app_tcb; //holds address of the address the app passed to arch_prctl long kernel_tcb_counter; //if 0 means we have to do an app/kernel/app FS base switch 2. Setup new per-cpu data intended to hold a pointer to the tcb: --- a/arch/x64/arch-cpu.hh +++ b/arch/x64/arch-cpu.hh +struct tcb_data { + u64 kernel_tcb; + u64 tmp[2]; +}; + struct arch_cpu { arch_cpu(); processor::aligned_task_state_segment atss; @@ -46,6 +52,7 @@ struct arch_cpu { u32 apic_id; u32 acpi_id; u64 gdt[nr_gdt]; + tcb_data _tcb_data; void init_on_cpu(); void set_ist_entry(unsigned ist, char* base, size_t size); char* get_ist_entry(unsigned ist); @@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu() processor::init_fpu(); processor::init_syscall(); + + processor::wrmsr(msr::IA32_GS_BASE, reinterpret_cast<u64>(&_tcb_data.kernel_tcb)); } 3. Change kernel fs pointer on each context switch. --- a/arch/x64/arch-switch.hh +++ b/arch/x64/arch-switch.hh @@ -81,11 +81,13 @@ void thread::switch_to() ... c->arch.set_exception_stack(_state.exception_stack); + c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); //This should be very fast auto fpucw = processor::fnstcw(); ... @@ -258,6 +260,7 @@ void thread::setup_tcb() else { _tcb->syscall_stack_top = 0; } + _tcb->kernel_tcb_counter = 1; //By default disable fs base switch } 4. Handle fs switch if necessary on entry/exit of syscall/exception/page fault handler: This is just a code change around syscall entry but we have to do the opposite for exit and similar for page fault/interrupt handler (possibly signal handler as well) @@ -174,6 +214,26 @@ syscall_entry: .cfi_register rip, rcx # rcx took previous rip value .cfi_register rflags, r11 # r11 took previous rflags value # There is no ring transition and rflags are left unchanged. + # + # app->kernel tcb switch + movq %rax, %gs:8 # save register rax so we can restore it later + movq %gs:0, %rax # copy address of kernel tcb to the temp register rax + #1. Check if kernel_tcb_counter 0 and jump over to 3 if not (no need to do fsbase switch) + cmpq $0, 40(%rax) + jne on_kernel_tcb + + #2. If zero set fs MSR to kernel tcb + movq %rbx, %gs:16 # save register rbx so we can restore it later + movq (%rax), %rbx # set kernel tcb + wrfsbase %rbx //TODO: In reality we need to check if wrfsbase is available and use wrmsr if not + movq %gs:16, %rbx + +on_kernel_tcb: + #3. Increment counter (for nested case) + incq 40(%rax) + #4. Restore %rax + movq %gs:8, %rax + # # Unfortunately the mov instruction cannot be used to dereference an address # on syscall stack pointed by address in TCB (%fs:16) - double memory dereference. I did measure that the context switch code is not affected in any way. But I am sure the syscall/page fault/interrupt handler is affected but hopefully by a tiny bit for all the cases except when the application thread (of the static elf) gets interrupted, triggers page fault, or makes a syscall call. In other words, I hope that kernel threads and normal (non-static-elf) threads would not be affected. We could also add the necessary #ifdef static_elf. Any ideas on how to measure how much slower the interrupt/syscall/page fault handler would become? >> 1. Fixing a potential bug in handling TCGETS in the console driver. >> >> I'm curious what this bug was - I am personally fond of this area of this > code, as you can see from the history > lesson in drivers/line-discipline.cc :-) > I think it may have to do with some size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after ioctl syscall call that ended up calling the code to handle TCGETS. This change seems to fix it: --- a/drivers/console.cc +++ b/drivers/console.cc @@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg) { switch (request) { case TCGETS: - *static_cast<termios*>(arg) = tio; + //*static_cast<termios*>(arg) = tio; + { + termios *in = static_cast<termios*>(arg); + in->c_iflag = tio.c_iflag; + in->c_oflag = tio.c_oflag; + in->c_cflag = tio.c_cflag; + in->c_lflag = tio.c_lflag; + in->c_line = tio.c_line; + } return 0; I think I have missed the c_cc field. Here is the relevant code in glibc - https://github.com/bminor/glibc/blob/ae612c45efb5e34713859a5facf92368307efb6e/sysdeps/unix/sysv/linux/tcgetattr.c > > >> >> 1. Implement sys_prlimit >> 2. Enable the readlink, geteuid and getegid >> >> I think we already had those - or did you mean the system call? > Yes, just add SYSCALLx() macros to linux.cc > > >> This was enough to run a single-threaded app but we will need to >> implement the clone syscall to support multi-threaded apps. >> > > Very nice. You can probably start by implementing the "simple" case of > clone() used by a simple multi-threaded application and > leave the other cases with UNIMPLEMENTED (or "ignore" various parameters > and leave them to be perfected later, with WARN_ONCE) > > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/CAL9cFfPnk0b6AAqv7UDP1sRrk%3DvAzs7iqbp---dkOU8PKS36Ag%40mail.gmail.com.
