Obviously, the most important concern is whether this patch is logically 
correct. 

But another issue is performance. Does it make disabling interrupts and 
preemption too costly? It seems that both are used when acquiring/handling 
locks. By disassembling the kernel I saw at least 100 references to 
sched::preempt_counter, but I am not sure how frequently this code is called 
and whether it will really have a significant impact on performance.

If so, I think there are many ways to optimize the code in 
ensure_next_stack_page():

   1. I noticed that many (maybe most) accesses to thread-local variables use 
   the initial-exec TLS model instead of local-exec. We could force all code in 
   the kernel to use local-exec; I already have a patch that does it.
   2. We could pack both preempt_counter and irq_counter into a single 64-bit 
   union field so that the generated code makes a single access instead of 
   two (see the first sketch after this list).
   3. We could avoid branching (jne) and somehow force the code to use 
   cmovz/cmove (a conditional move) instead. I have tried it, but to no avail 
   (the second sketch after this list shows one hand-written possibility).
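
To make ideas 1 and 2 a bit more concrete, here is a minimal, untested sketch 
of what I mean (the names lazy_stack_counters and counters are made up just 
for illustration, and all existing users of sched::preempt_counter and 
mmu::irq_counter would of course have to be converted):

```
#include <stdint.h>

// Sketch only: pack both counters into a single thread-local 64-bit word so
// that ensure_next_stack_page() needs one TLS load instead of two, and ask
// for the local-exec TLS model so the compiler emits a direct %fs-relative
// access instead of fetching the offset via the GOT as with initial-exec.
union lazy_stack_counters {
    struct {
        unsigned preempt;   // plays the role of today's sched::preempt_counter
        unsigned irq;       // plays the role of today's mmu::irq_counter
    } split;
    uint64_t both;          // lets us test "are both zero?" with a single load
};

extern __thread lazy_stack_counters counters
    __attribute__((tls_model("local-exec")));

inline void ensure_next_stack_page()
{
    // One 64-bit load and test instead of two loads and an or.
    if (counters.both) {
        return;
    }
    char i;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
}
```

preempt_disable() would then do ++counters.split.preempt and the irq paths 
++counters.split.irq, so the fast paths stay a single increment each.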
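
As for idea 3, if the compiler cannot be convinced to emit cmov by itself, we 
could hand-write the selection so that the byte load always happens and only 
its address is chosen conditionally; reading at %rsp itself is always safe 
because the current stack page is obviously mapped. Again, just an untested 
sketch of the direction (it assumes the counters this patch declares), not 
something I have measured:

```
inline void ensure_next_stack_page_branchless()
{
    // Probe rsp-4096 when both counters are zero, otherwise probe rsp itself
    // (always mapped). The load is unconditional; only the address is picked
    // with cmove, so there is no jne.
    unsigned merged = sched::preempt_counter | mmu::irq_counter;
    char probe;
    asm volatile("mov   %%rsp, %%rdx\n\t"
                 "lea   -4096(%%rsp), %%rcx\n\t"
                 "test  %1, %1\n\t"
                 "cmove %%rcx, %%rdx\n\t"   // rdx = rsp-4096 only if merged == 0
                 "movb  (%%rdx), %0"
                 : "=r"(probe)
                 : "r"(merged)
                 : "rcx", "rdx", "cc", "memory");
}
```

Whether trading the (usually well-predicted) jne for an always-executed extra 
load is actually a win is exactly the kind of thing that would need measuring.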

Any other ideas?
Waldek

On Sunday, January 26, 2020 at 12:33:30 PM UTC-5, Waldek Kozaczuk wrote:
>
> As issue #143 explains, currently application stacks are eagerly allocated 
> (aka pre-populated), which may lead to substantial memory waste. We want 
> those stacks to be lazily allocated instead, so that the physical memory 
> behind the stack gets allocated as needed by the standard fault-handling 
> mechanism. 
>
> The page-fault handling logic requires that both interrupts and preemption 
> are enabled. Therefore this patch implements a simple strategy: read a 
> single byte on the stack one page (4K) deeper before disabling interrupts 
> or preemption. 
>
> It turns out we cannot read that single byte "blindly", because in some 
> scenarios interrupts might already be disabled when we are about to disable 
> preemption, and vice versa. Similarly (and more speculatively), disabling 
> preemption (and possibly interrupts) may nest. Therefore we need a 
> "conditional" read of that byte on the stack. More specifically, we should 
> ONLY read the byte IF both interrupts and preemption are enabled. 
>
> This patch achieves this by adding an extra thread-local counter to track 
> interrupt disabling/enabling, which operates in a similar way to the 
> preemption counter: every time interrupts are disabled we increment the 
> counter, and we decrement it every time interrupts are re-enabled. 
> Then, every time before we disable interrupts or preemption, we check if 
> both counters are 0 and only then read a byte to potentially trigger a 
> fault. 
>
> Please note that performance-wise this patch is not as "cheap" as was 
> possibly hoped. As the disassembled snippets illustrate, it costs an extra 
> 7-8 instructions every time we disable preemption or interrupts. It would 
> be nice to properly measure the added cost of this patch: 
>
> A. Disabling preemption in __tls_get_addr before the patch: 
> ``` 
>  4035cae8:   e8 83 f6 ff ff          callq  4035c170 <elf::get_program()> 
>  4035caed:   4c 8b 2d 6c 40 51 00    mov    0x51406c(%rip),%r13        # 40870b60 <sched::preempt_counter+0x40870b60> 
>  4035caf4:   4c 8b 33                mov    (%rbx),%r14 
>  4035caf7:   64 41 83 45 00 01       addl   $0x1,%fs:0x0(%r13) 
>  4035cafd:   e8 6e f6 ff ff          callq  4035c170 <elf::get_program()> 
> ``` 
>
> B. Disabling preemption in __tls_get_addr after the patch: 
> ``` 
>  40360c78:   e8 33 f6 ff ff          callq  403602b0 <elf::get_program()> 
>  40360c7d:   4c 8b 2d dc 4e 51 00    mov    0x514edc(%rip),%r13        # 40875b60 <sched::preempt_counter+0x40875b5c> 
>  40360c84:   48 8b 15 05 4f 51 00    mov    0x514f05(%rip),%rdx        # 40875b90 <mmu::irq_counter+0x40875b90> 
>  40360c8b:   4c 8b 33                mov    (%rbx),%r14 
>  40360c8e:   64 41 8b 45 00          mov    %fs:0x0(%r13),%eax 
>  40360c93:   89 c6                   mov    %eax,%esi 
>  40360c95:   64 0b 32                or     %fs:(%rdx),%esi 
>  40360c98:   75 07                   jne    40360ca1 <__tls_get_addr+0x81> 
>  40360c9a:   8a 94 24 00 f0 ff ff    mov    -0x1000(%rsp),%dl 
>  40360ca1:   83 c0 01                add    $0x1,%eax 
>  40360ca4:   64 41 89 45 00          mov    %eax,%fs:0x0(%r13) 
>  40360ca9:   e8 02 f6 ff ff          callq  403602b0 <elf::get_program()> 
> ``` 
>
> As an example of the memory savings, tomcat running ~300 threads ended up 
> using 365MB instead of the 620MB it used before the patch. 
>
> Fixes #143 
>
> Signed-off-by: Matthew Pabst <[email protected]> 
> Signed-off-by: Waldemar Kozaczuk <[email protected]> 
> --- 
>  arch/x64/arch-switch.hh | 10 ++++++++++ 
>  arch/x64/arch.hh        | 29 +++++++++++++++++++++++++++++ 
>  arch/x64/exceptions.cc  |  1 + 
>  core/mmu.cc             |  2 ++ 
>  include/osv/irqlock.hh  |  6 ++++++ 
>  include/osv/mmu-defs.hh |  1 + 
>  include/osv/sched.hh    |  2 ++ 
>  libc/mman.cc            |  7 +------ 
>  libc/pthread.cc         |  3 ++- 
>  9 files changed, 54 insertions(+), 7 deletions(-) 
>
> diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh 
> index 6803498f..eefa505e 100644 
> --- a/arch/x64/arch-switch.hh 
> +++ b/arch/x64/arch-switch.hh 
> @@ -148,6 +148,13 @@ void thread::init_stack() 
>      _state.rip = reinterpret_cast<void*>(thread_main); 
>      _state.rsp = stacktop; 
>      _state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack); 
> + 
> +    if (stack.lazy) { 
> +        // If the thread stack is lazily allocated and given the thread initially starts 
> +        // running with preemption disabled, we need to pre-fault the first stack page. 
> +        volatile char r = *((char *) (stacktop - 1)); 
> +        (void) r; // trick the compiler into thinking that r is used 
> +    } 
>  } 
>   
>  void thread::setup_tcb() 
> @@ -311,6 +318,9 @@ void thread::free_tcb() 
>   
>  void thread_main_c(thread* t) 
>  { 
> +    if (t->get_stack_info().lazy) { 
> +        mmu::irq_counter = mmu::irq_counter_lazy_stack_init_value; 
> +    } 
>      arch::irq_enable(); 
>  #ifdef CONF_preempt 
>      preempt_enable(); 
> diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh 
> index 17df5f5c..d246df0c 100644 
> --- a/arch/x64/arch.hh 
> +++ b/arch/x64/arch.hh 
> @@ -10,18 +10,40 @@ 
>   
>  #include "processor.hh" 
>  #include "msr.hh" 
> +#include <osv/barrier.hh> 
>   
>  // namespace arch - architecture independent interface for architecture 
>  //                  dependent operations (e.g. irq_disable vs. cli) 
>   
> +namespace mmu { 
> +constexpr unsigned irq_counter_default_init_value = 11; 
> +constexpr unsigned irq_counter_lazy_stack_init_value = 1; 
> +extern unsigned __thread irq_counter; 
> +} 
> + 
> +namespace sched { 
> +extern unsigned __thread preempt_counter; 
> +} 
> + 
>  namespace arch { 
>   
>  #define CACHELINE_ALIGNED __attribute__((aligned(64))) 
>  #define INSTR_SIZE_MIN 1 
>  #define ELF_IMAGE_START OSV_KERNEL_BASE 
>   
> +inline void ensure_next_stack_page() { 
> +    if (sched::preempt_counter || mmu::irq_counter) { 
> +        return; 
> +    } 
> + 
> +    char i; 
> +    asm volatile("movb -4096(%%rsp), %0" : "=r"(i)); 
> +} 
> + 
>  inline void irq_disable() 
>  { 
> +    ensure_next_stack_page(); 
> +    ++mmu::irq_counter; 
>      processor::cli(); 
>  } 
>   
> @@ -36,11 +58,15 @@ inline void irq_disable_notrace() 
>  inline void irq_enable() 
>  { 
>      processor::sti(); 
> +    barrier(); 
> +    --mmu::irq_counter; 
>  } 
>   
>  inline void wait_for_interrupt() 
>  { 
>      processor::sti_hlt(); 
> +    barrier(); 
> +    --mmu::irq_counter; 
>  } 
>   
>  inline void halt_no_interrupts() 
> @@ -78,12 +104,15 @@ private: 
>   
>  inline void irq_flag_notrace::save() 
>  { 
> +    ensure_next_stack_page(); 
> +    ++mmu::irq_counter; 
>      asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : "=r"(_rflags)); 
>  } 
>   
>  inline void irq_flag_notrace::restore() 
>  { 
>      asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : "r"(_rflags)); 
> +    --mmu::irq_counter; 
>  } 
>   
>  inline bool irq_flag_notrace::enabled() const 
> diff --git a/arch/x64/exceptions.cc b/arch/x64/exceptions.cc 
> index 7c9eaf51..889ca48c 100644 
> --- a/arch/x64/exceptions.cc 
> +++ b/arch/x64/exceptions.cc 
> @@ -302,6 +302,7 @@ extern "C" void simd_exception(exception_frame *ef) 
>   
>  extern "C" void nmi(exception_frame* ef) 
>  { 
> +    //TODO: Possibly needs to be handled 
>      while (true) { 
>          processor::cli_hlt(); 
>      } 
> diff --git a/core/mmu.cc b/core/mmu.cc 
> index ff3fab47..7c6a4a97 100644 
> --- a/core/mmu.cc 
> +++ b/core/mmu.cc 
> @@ -42,6 +42,8 @@ extern const char text_start[], text_end[]; 
>   
>  namespace mmu { 
>   
> +unsigned __thread irq_counter = irq_counter_default_init_value; 
> + 
>  namespace bi = boost::intrusive; 
>   
>  class vma_compare { 
> diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh 
> index 2d5372fc..01609b7f 100644 
> --- a/include/osv/irqlock.hh 
> +++ b/include/osv/irqlock.hh 
> @@ -10,6 +10,10 @@ 
>   
>  #include "arch.hh" 
>   
> +namespace mmu { 
> +extern unsigned __thread irq_counter; 
> +} 
> + 
>  class irq_lock_type { 
>  public: 
>      static void lock() { arch::irq_disable(); } 
> @@ -34,6 +38,8 @@ inline void irq_save_lock_type::lock() 
>  inline void irq_save_lock_type::unlock() 
>  { 
>      _flags.restore(); 
> +    barrier(); 
> +    --mmu::irq_counter; 
>  } 
>   
>  extern irq_lock_type irq_lock; 
> diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh 
> index 18edf441..694815f8 100644 
> --- a/include/osv/mmu-defs.hh 
> +++ b/include/osv/mmu-defs.hh 
> @@ -84,6 +84,7 @@ enum { 
>      mmap_small       = 1ul << 5, 
>      mmap_jvm_balloon = 1ul << 6, 
>      mmap_file        = 1ul << 7, 
> +    mmap_stack       = 1ul << 8, 
>  }; 
>   
>  enum { 
> diff --git a/include/osv/sched.hh b/include/osv/sched.hh 
> index 0fb44f77..fd5e003c 100644 
> --- a/include/osv/sched.hh 
> +++ b/include/osv/sched.hh 
> @@ -335,6 +335,7 @@ public: 
>          void* begin; 
>          size_t size; 
>          void (*deleter)(stack_info si);  // null: don't delete 
> +        bool lazy = false; 
>          static void default_deleter(stack_info si); 
>      }; 
>      struct attr { 
> @@ -1005,6 +1006,7 @@ inline void preempt() 
>   
>  inline void preempt_disable() 
>  { 
> +    arch::ensure_next_stack_page(); 
>      ++preempt_counter; 
>      barrier(); 
>  } 
> diff --git a/libc/mman.cc b/libc/mman.cc 
> index d0803ac4..de7ee4dd 100644 
> --- a/libc/mman.cc 
> +++ b/libc/mman.cc 
> @@ -38,12 +38,7 @@ unsigned libc_flags_to_mmap(int flags) 
>          mmap_flags |= mmu::mmap_populate; 
>      } 
>      if (flags & MAP_STACK) { 
> -        // OSv currently requires that stacks be pinned (see issue #143). So 
> -        // if an application wants to mmap() a stack for pthread_attr_setstack 
> -        // and did us the courtesy of telling this to ue (via MAP_STACK), 
> -        // let's return the courtesy by returning pre-faulted memory. 
> -        // FIXME: If issue #143 is fixed, this workaround should be removed. 
> -        mmap_flags |= mmu::mmap_populate; 
> +        mmap_flags |= mmu::mmap_stack; 
>      } 
>      if (flags & MAP_SHARED) { 
>          mmap_flags |= mmu::mmap_shared; 
> diff --git a/libc/pthread.cc b/libc/pthread.cc 
> index 8c976bf6..c7dbd017 100644 
> --- a/libc/pthread.cc 
> +++ b/libc/pthread.cc 
> @@ -140,10 +140,11 @@ namespace pthread_private { 
>              return {attr.stack_begin, attr.stack_size}; 
>          } 
>          size_t size = attr.stack_size; 
> -        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw); 
> +        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_stack, mmu::perm_rw); 
>          mmu::mprotect(addr, attr.guard_size, 0); 
>          sched::thread::stack_info si{addr, size}; 
>          si.deleter = free_stack; 
> +        si.lazy = true; 
>          return si; 
>      } 
>   
> -- 
> 2.20.1 
>
>
