On Monday, April 24, 2023 at 2:08:26 PM UTC-4 Waldek Kozaczuk wrote:
> On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <[email protected]> wrote:
>> On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <[email protected]> wrote:
>>
>> Hi,
>>
>> Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent, statically linked executable.
>
> Nice!
>
>> 1. Support the arch_prctl syscall that sets the app TLS - this was by far the most complicated element. It required changing OSv to store a new per-cpu data pointer in the GS register, and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880).
>
> If this has noticeable overhead, perhaps it makes sense to make it optional?

I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the colocated context switch time (as measured by misc-ctxsw) go up from 313 to 362 ns, caused by adding this line:

    processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));

which indirectly puts the cost of changing the GS or FS base with the WRMSR instruction at ~50 ns (yikes). I would think the FSGSBASE instructions should be faster.
BTW I have measured the cost of WRMSR and wrgsbase indirectly by modifying thread::switch_to() like this and running misc-ctxsw (I assume the cost of wrfsbase would be identical):

+uint32_t IA32_GS_BASE = 0xc0000101;
+
 void thread::switch_to()
 {
     thread* old = current();
@@ -81,6 +82,8 @@ void thread::switch_to()
     // barriers
     barrier();
     set_fsbase(reinterpret_cast<u64>(_tcb));
+    //asm volatile("wrgsbase %0" : : "r"(reinterpret_cast<u64>(_tcb)));
+    //processor::wrmsr(IA32_GS_BASE, reinterpret_cast<u64>(_tcb));
     barrier();
     auto c = _detached_state->_cpu;
     old->_state.exception_stack = c->arch.get_exception_stack();

With the wrgsbase line uncommented, the cost of a colocated context switch barely budged: on average I could see a difference of at most 1-2 ns, if any; sometimes the times were identical. So wrgsbase seems to be pretty cheap, though we should still avoid calling it in the interrupt/page fault/syscall handlers. With the wrmsr line uncommented, on the other hand, the context switch cost went up by ~50 ns, so this instruction is very expensive. That is also why we especially need to avoid it when wrgsbase is not available.

Here is a subset of the changes I had to make to the context-switching code and the interrupt/syscall handler:

1. Add 2 new fields to the thread control block:

    unsigned long app_tcb;       // holds the address the app passed to arch_prctl
    long kernel_tcb_counter;     // if 0, we have to do an app/kernel/app FS base switch

2. Set up new per-cpu data intended to hold a pointer to the TCB:

--- a/arch/x64/arch-cpu.hh
+++ b/arch/x64/arch-cpu.hh
+struct tcb_data {
+    u64 kernel_tcb;
+    u64 tmp[2];
+};
+
 struct arch_cpu {
     arch_cpu();
     processor::aligned_task_state_segment atss;
@@ -46,6 +52,7 @@ struct arch_cpu {
     u32 apic_id;
     u32 acpi_id;
     u64 gdt[nr_gdt];
+    tcb_data _tcb_data;
     void init_on_cpu();
     void set_ist_entry(unsigned ist, char* base, size_t size);
     char* get_ist_entry(unsigned ist);
@@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu()
     processor::init_fpu();
     processor::init_syscall();
+
+    processor::wrmsr(msr::IA32_GS_BASE, reinterpret_cast<u64>(&_tcb_data.kernel_tcb));
 }

3. Change the kernel FS pointer on each context switch:

--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -81,11 +81,13 @@ void thread::switch_to()
 ...
     c->arch.set_exception_stack(_state.exception_stack);
+    c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); // This should be very fast
     auto fpucw = processor::fnstcw();
 ...
@@ -258,6 +260,7 @@ void thread::setup_tcb()
     else {
         _tcb->syscall_stack_top = 0;
     }
+    _tcb->kernel_tcb_counter = 1; // By default disable the FS base switch
 }

4. Handle the FS switch if necessary on entry/exit of the syscall/exception/page fault handler. This is just the code change around syscall entry, but we have to do the opposite on exit, and something similar for the page fault/interrupt handler (possibly the signal handler as well):

@@ -174,6 +214,26 @@ syscall_entry:
     .cfi_register rip, rcx # rcx took previous rip value
     .cfi_register rflags, r11 # r11 took previous rflags value
     # There is no ring transition and rflags are left unchanged.
+    #
+    # app->kernel tcb switch
+    movq %rax, %gs:8   # save register rax so we can restore it later
+    movq %gs:0, %rax   # copy address of kernel tcb to the temp register rax
+    # 1. Check if kernel_tcb_counter is 0 and jump over to 3 if not (no need to do an fsbase switch)
+    cmpq $0, 40(%rax)
+    jne on_kernel_tcb
+
+    # 2. If zero, set the FS base to the kernel tcb
+    movq %rbx, %gs:16  # save register rbx so we can restore it later
+    movq (%rax), %rbx  # set kernel tcb
+    wrfsbase %rbx      # TODO: in reality we need to check if wrfsbase is available and use wrmsr if not
+    movq %gs:16, %rbx
+
+on_kernel_tcb:
+    # 3. Increment counter (for the nested case)
+    incq 40(%rax)
+    # 4. Restore %rax
+    movq %gs:8, %rax
+    #
     # Unfortunately the mov instruction cannot be used to dereference an address
     # on the syscall stack pointed to by an address in the TCB (%fs:16) - double memory dereference.

I did measure that the context-switching code is not affected in any way. But I am sure the syscall/page fault/interrupt handlers are affected - hopefully by a tiny bit in all cases except when an application thread (of a static ELF) gets interrupted, triggers a page fault, or makes a syscall. In other words, I hope that kernel threads and normal (non-static-ELF) threads would not be affected. We could also add the necessary #ifdef static_elf. Any ideas on how to measure how much slower the interrupt/syscall/page fault handler would become?

>> 2. Fixing a potential bug in handling TCGETS in the console driver.
>
> I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history lesson in drivers/line-discipline.cc :-)

I think it may have to do with some size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after an ioctl syscall that ended up calling the code to handle TCGETS.
This change seems to fix it:

--- a/drivers/console.cc
+++ b/drivers/console.cc
@@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg)
 {
     switch (request) {
     case TCGETS:
-        *static_cast<termios*>(arg) = tio;
+        //*static_cast<termios*>(arg) = tio;
+        {
+            termios *in = static_cast<termios*>(arg);
+            in->c_iflag = tio.c_iflag;
+            in->c_oflag = tio.c_oflag;
+            in->c_cflag = tio.c_cflag;
+            in->c_lflag = tio.c_lflag;
+            in->c_line = tio.c_line;
+        }
         return 0;

I think I have missed the c_cc field. Here is the relevant code in glibc - https://github.com/bminor/glibc/blob/ae612c45efb5e34713859a5facf92368307efb6e/sysdeps/unix/sysv/linux/tcgetattr.c

>> 3. Implement sys_prlimit
>> 4. Enable the readlink, geteuid and getegid
>
> I think we already had those - or did you mean the system call?

Yes, just add the SYSCALLx() macros to linux.cc.

>> This was enough to run a single-threaded app, but we will need to implement the clone syscall to support multi-threaded apps.
>
> Very nice. You can probably start by implementing the "simple" case of clone() used by a simple multi-threaded application and leave the other cases with UNIMPLEMENTED (or "ignore" various parameters and leave them to be perfected later, with WARN_ONCE).
