Nadav, what do you think about this patch? Also, do you think that the deadlock issue related to vma_list_mutex that I discovered and had to work around (see my previous email in this chain) is the only one we need to worry about, or are there other potential hidden problems of that kind that I have failed to notice?
Shall I put this lazy-stack feature behind an #ifdef for now, and make it permanent only once we feel confident about it?

Waldek

On Sunday, February 2, 2020 at 11:15:30 AM UTC-5, Waldek Kozaczuk wrote:
>
> As issue #143 explains, application stacks are currently eagerly
> allocated (aka pre-populated), which may lead to substantial memory waste.
> We want those stacks to be lazily allocated instead, so that physical
> memory behind the stack gets allocated as needed by the standard fault
> handling mechanism.
>
> The page fault handling logic requires that both interrupts and preemption
> are enabled. Therefore this patch implements a simple strategy: read a
> single byte on the stack one page (4K) deeper before disabling interrupts
> or preemption.
>
> It turns out we cannot read that single byte "blindly", because in some
> scenarios interrupts might already be disabled when we are about to disable
> preemption, and vice versa. Similarly, and more speculatively, disabling
> preemption (and possibly interrupts) may nest. Therefore we need a
> "conditional" read of that byte on the stack; more specifically, we should
> ONLY read the byte IF both interrupts and preemption are enabled.
>
> This patch achieves that by adding an extra thread-local counter to track
> interrupt disabling/enabling, which operates in a similar way to the
> preemption counter. Every time interrupts are disabled we increment the
> counter, and we decrement it every time interrupts are enabled again. Then,
> every time before we disable interrupts or preemption, we check whether
> both counters are 0 and only then read a byte to potentially trigger a
> fault.
>
> Please note that performance-wise this patch is slightly more expensive
> than originally hoped. As the disassembled snippets illustrate, it costs an
> extra 3-4 instructions every time we disable preemption or interrupts.
>
> A. Disabling preemption in __tls_get_addr before the patch:
> ```
> 4035cae8: e8 83 f6 ff ff       callq  4035c170 <elf::get_program()>
> 4035caed: 4c 8b 2d 6c 40 51 00 mov    0x51406c(%rip),%r13  # 40870b60 <sched::preempt_counter+0x40870b60>
> 4035caf4: 4c 8b 33             mov    (%rbx),%r14
> 4035caf7: 64 41 83 45 00 01    addl   $0x1,%fs:0x0(%r13)
> 4035cafd: e8 6e f6 ff ff       callq  4035c170 <elf::get_program()>
> ```
>
> B. Disabling preemption in __tls_get_addr after the patch:
> ```
> 4035fdd8: e8 63 f6 ff ff       callq  4035f440 <elf::get_program()>
> 4035fddd: 4c 8b 2d a4 4d 51 00 mov    0x514da4(%rip),%r13  # 40874b88 <arch::irq_preempt_counters+0x40874b88>
> 4035fde4: 4c 8b 33             mov    (%rbx),%r14
> 4035fde7: 64 49 83 7d 00 00    cmpq   $0x0,%fs:0x0(%r13)
> 4035fded: 75 07                jne    4035fdf6 <__tls_get_addr+0x76>
> 4035fdef: 8a 84 24 00 f0 ff ff mov    -0x1000(%rsp),%al
> 4035fdf6: 64 41 83 45 00 01    addl   $0x1,%fs:0x0(%r13)
> 4035fdfc: e8 3f f6 ff ff       callq  4035f440 <elf::get_program()>
> ```
>
> As an example of the memory savings, Tomcat using ~300 threads ended up
> using 365MB instead of 620MB before the patch.
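The conditional pre-fault mechanism described above boils down to roughly the following self-contained sketch. The names here are illustrative, not the exact OSv identifiers; the real code is in arch/x64/arch.hh and include/osv/counters.hh in the patch below, and, like the patch, it relies on GCC-style union type-punning and inline asm:

```
#include <cstdint>

// Per-thread counters deliberately co-located in one 64-bit word so that a
// single load tells us whether it is currently safe to pre-fault the stack.
union lazy_counters {
    struct {
        uint32_t preempt;       // bumped by preempt_disable()/preempt_enable()
        uint16_t irq;           // bumped by irq_disable()/irq_enable()
        uint16_t disable_flags; // set while pre-faulting must be suppressed
    } data;
    uint64_t any_nonzero;       // != 0 means: do NOT touch the next page
};

inline thread_local lazy_counters counters = {{0, 0, 0}};

// Conditionally read one byte a page (4K) below the current stack pointer.
// If any counter or flag is non-zero, a page fault could not be handled
// safely (interrupts or preemption disabled), so the probe is skipped.
inline void ensure_next_stack_page()
{
    if (counters.any_nonzero) {
        return;
    }
    char probe;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(probe));
    (void)probe;
}
```

Each irq_disable()/preempt_disable() path then calls ensure_next_stack_page() just before incrementing its counter, so the probe only ever runs in contexts where a fault on the lazily allocated stack can actually be handled.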
> > Fixes #143 > > Signed-off-by: Matthew Pabst <[email protected]> > Signed-off-by: Waldemar Kozaczuk <[email protected]> > --- > arch/aarch64/arch.hh | 5 ++++ > arch/x64/arch-switch.hh | 12 ++++++++++ > arch/x64/arch.hh | 34 ++++++++++++++++++++++++++ > arch/x64/mmu.cc | 5 ++++ > core/mmu.cc | 24 +++++++++++++++++++ > core/sched.cc | 6 ++++- > include/osv/counters.hh | 53 +++++++++++++++++++++++++++++++++++++++++ > include/osv/irqlock.hh | 4 ++++ > include/osv/mmu-defs.hh | 3 +-- > include/osv/sched.hh | 10 ++++---- > libc/mman.cc | 9 ++++--- > libc/pthread.cc | 10 +++++++- > 12 files changed, 162 insertions(+), 13 deletions(-) > create mode 100644 include/osv/counters.hh > > diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh > index 855f1987..473e84e9 100644 > --- a/arch/aarch64/arch.hh > +++ b/arch/aarch64/arch.hh > @@ -20,6 +20,11 @@ namespace arch { > #define INSTR_SIZE_MIN 4 > #define ELF_IMAGE_START (OSV_KERNEL_BASE + 0x10000) > > +inline void ensure_next_stack_page() > +{ > + //TODO: Implement lazy stack check for AARCH64 > +} > + > inline void irq_disable() > { > processor::irq_disable(); > diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh > index 6803498f..3e52f7f8 100644 > --- a/arch/x64/arch-switch.hh > +++ b/arch/x64/arch-switch.hh > @@ -148,6 +148,13 @@ void thread::init_stack() > _state.rip = reinterpret_cast<void*>(thread_main); > _state.rsp = stacktop; > _state.exception_stack = _arch.exception_stack + > sizeof(_arch.exception_stack); > + > + if (stack.lazy) { > + // If the thread stack is setup to get lazily allocated and given > the thread initially starts > + // running with preemption disabled, we need to pre-fault the > first stack page. > + volatile char r = *((char *) (stacktop - 1)); > + (void) r; // trick the compiler into thinking that r is used > + } > } > > void thread::setup_tcb() > @@ -311,6 +318,11 @@ void thread::free_tcb() > > void thread_main_c(thread* t) > { > + // Disable lazy stack pre-fault logic for this thread > + // if stack is not lazy (most likely kernel thread) > + if (t->get_stack_info().lazy) { > + > > arch::irq_preempt_counters._data.set_lazy_flags_bit(arch::lazy_stack_disable, > false); > + } > arch::irq_enable(); > #ifdef CONF_preempt > preempt_enable(); > diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh > index 17df5f5c..74b728dc 100644 > --- a/arch/x64/arch.hh > +++ b/arch/x64/arch.hh > @@ -10,6 +10,8 @@ > > #include "processor.hh" > #include "msr.hh" > +#include <osv/barrier.hh> > +#include <osv/counters.hh> > > // namespace arch - architecture independent interface for architecture > // dependent operations (e.g. irq_disable vs. 
cli) > @@ -20,8 +22,33 @@ namespace arch { > #define INSTR_SIZE_MIN 1 > #define ELF_IMAGE_START OSV_KERNEL_BASE > > +inline void ensure_next_stack_page() { > + // Because both irq and preempt counters and lazy stack > + // flags are co-located within same 8-bytes long union > + // we can check single field for non-zero value to determine > + // if we should trigger pre-fault on lazy stack > + if (irq_preempt_counters.lazy_stack_no_pre_fault) { > + return; > + } > + > + char i; > + asm volatile("movb -4096(%%rsp), %0" : "=r"(i)); > +} > + > +inline void ensure_next_two_stack_pages() { > + if (irq_preempt_counters.lazy_stack_no_pre_fault) { > + return; > + } > + > + char i; > + asm volatile("movb -4096(%%rsp), %0" : "=r"(i)); > + asm volatile("movb -8192(%%rsp), %0" : "=r"(i)); > +} > + > inline void irq_disable() > { > + ensure_next_stack_page(); > + ++irq_preempt_counters._data.irq; > processor::cli(); > } > > @@ -36,11 +63,15 @@ inline void irq_disable_notrace() > inline void irq_enable() > { > processor::sti(); > + --irq_preempt_counters._data.irq; > + barrier(); > } > > inline void wait_for_interrupt() > { > processor::sti_hlt(); > + --irq_preempt_counters._data.irq; > + barrier(); > } > > inline void halt_no_interrupts() > @@ -78,12 +109,15 @@ private: > > inline void irq_flag_notrace::save() > { > + ensure_next_stack_page(); > + ++irq_preempt_counters._data.irq; > asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : > "=r"(_rflags)); > } > > inline void irq_flag_notrace::restore() > { > asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : > "r"(_rflags)); > + --irq_preempt_counters._data.irq; > } > > inline bool irq_flag_notrace::enabled() const > diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc > index 24da5caa..ae619a9f 100644 > --- a/arch/x64/mmu.cc > +++ b/arch/x64/mmu.cc > @@ -37,6 +37,11 @@ void page_fault(exception_frame *ef) > assert(sched::preemptable()); > assert(ef->rflags & processor::rflags_if); > > + // Even though we are on exception stack that is not lazy > + // for pre-caution we are disabling the pre-fault logic > + // for the time the fault is being handled > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_fault_disable); > + > // And since we may sleep, make sure interrupts are enabled. 
> DROP_LOCK(irq_lock) { // irq_lock is acquired by HW > mmu::vm_fault(addr, ef); > diff --git a/core/mmu.cc b/core/mmu.cc > index ff3fab47..b6e7cc36 100644 > --- a/core/mmu.cc > +++ b/core/mmu.cc > @@ -42,6 +42,18 @@ extern const char text_start[], text_end[]; > > namespace mmu { > > +static void prevent_stack_page_fault() { > + // We need to ensure that lazy stack is populated > + // deeply enough (2 pages) for all cases where > + // the vma_list_mutex is taken for write so that we can disable > + // the pre-fault check (arch::ensure_next_stack_page) which, > + // if page fault triggered, would make page-fault handling logic > + // attempt to take same vma_list_mutex fo read and end up with > deadlock > +#ifndef AARCH64_PORT_STUB > + arch::ensure_next_two_stack_pages(); > +#endif > +} > + > namespace bi = boost::intrusive; > > class vma_compare { > @@ -1183,6 +1195,8 @@ static void nohugepage(void* addr, size_t length) > > error advise(void* addr, size_t size, int advice) > { > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > WITH_LOCK(vma_list_mutex.for_write()) { > if (!ismapped(addr, size)) { > return make_error(ENOMEM); > @@ -1215,6 +1229,8 @@ void* map_anon(const void* addr, size_t size, > unsigned flags, unsigned perm) > size = align_up(size, mmu::page_size); > auto start = reinterpret_cast<uintptr_t>(addr); > auto* vma = new mmu::anon_vma(addr_range(start, start + size), perm, > flags); > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > SCOPE_LOCK(vma_list_mutex.for_write()); > auto v = (void*) allocate(vma, start, size, search); > if (flags & mmap_populate) { > @@ -1241,6 +1257,8 @@ void* map_file(const void* addr, size_t size, > unsigned flags, unsigned perm, > auto start = reinterpret_cast<uintptr_t>(addr); > auto *vma = f->mmap(addr_range(start, start + size), flags | > mmap_file, perm, offset).release(); > void *v; > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > WITH_LOCK(vma_list_mutex.for_write()) { > v = (void*) allocate(vma, start, size, search); > if (flags & mmap_populate) { > @@ -1607,6 +1625,8 @@ ulong map_jvm(unsigned char* jvm_addr, size_t size, > size_t align, balloon_ptr b) > > auto* vma = new mmu::jvm_balloon_vma(jvm_addr, start, start + size, > b, v->perm(), v->flags()); > > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > WITH_LOCK(vma_list_mutex.for_write()) { > // This means that the mapping that we had before was a balloon > mapping > // that was laying around and wasn't updated to an anon mapping. 
> If we > @@ -1868,6 +1888,8 @@ void free_initial_memory_range(uintptr_t addr, > size_t size) > > error mprotect(const void *addr, size_t len, unsigned perm) > { > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > SCOPE_LOCK(vma_list_mutex.for_write()); > > if (!ismapped(addr, len)) { > @@ -1879,6 +1901,8 @@ error mprotect(const void *addr, size_t len, > unsigned perm) > > error munmap(const void *addr, size_t length) > { > + prevent_stack_page_fault(); > + arch::lazy_stack_flags_guard > lazy_stack_guard(arch::lazy_stack_mmu_disable); > SCOPE_LOCK(vma_list_mutex.for_write()); > > length = align_up(length, mmu::page_size); > diff --git a/core/sched.cc b/core/sched.cc > index 06f849d1..4510a85e 100644 > --- a/core/sched.cc > +++ b/core/sched.cc > @@ -42,6 +42,11 @@ extern char _percpu_start[], _percpu_end[]; > using namespace osv; > using namespace osv::clock::literals; > > +namespace arch { > +// By default disable lazy stack pre-fault logic > +counters __thread irq_preempt_counters = {{1, 1, 1}}; > +} > + > namespace sched { > > TRACEPOINT(trace_sched_idle, ""); > @@ -69,7 +74,6 @@ std::vector<cpu*> cpus > __attribute__((init_priority((int)init_prio::cpus))); > thread __thread * s_current; > cpu __thread * current_cpu; > > -unsigned __thread preempt_counter = 1; > bool __thread need_reschedule = false; > > elf::tls_data tls; > diff --git a/include/osv/counters.hh b/include/osv/counters.hh > new file mode 100644 > index 00000000..b862f32c > --- /dev/null > +++ b/include/osv/counters.hh > @@ -0,0 +1,53 @@ > +/* > + * Copyright (C) 2020 Waldemar Kozaczuk > + * > + * This work is open source software, licensed under the terms of the > + * BSD license as described in the LICENSE file in the top-level > directory. > + */ > + > +#ifndef OSV_COUNTERS_HH > +#define OSV_COUNTERS_HH > + > +namespace arch { > + > +// Both preempt and irq counters are co-located next to each other > +// along with the the flags that drive lazy stack related logic > +// union by design. 
It is so that compiler can optimize > ensure_next_stack_page() > +// to test against single 64-bit lazy_stack_no_pre_fault field > +union counters { > + struct data { > + unsigned preempt; > + u16 irq; > + u16 lazy_stack_flags; > + > + inline void set_lazy_flags_bit(unsigned nr, bool v) { > + lazy_stack_flags &= ~(u64(1) << nr); > + lazy_stack_flags |= u64(v) << nr; > + } > + } _data; > + uint64_t lazy_stack_no_pre_fault; > +}; > + > +extern counters __thread irq_preempt_counters; > + > +class lazy_stack_flags_guard { > + unsigned _flag_nr; > +public: > + lazy_stack_flags_guard(unsigned flag_nr) { > + _flag_nr = flag_nr; > + irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, true); > + } > + ~lazy_stack_flags_guard() { > + irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, false); > + } > +}; > + > +enum { > + lazy_stack_disable = 0, > + lazy_stack_mmu_disable = 1, > + lazy_stack_fault_disable = 2, > +}; > + > +} > + > +#endif //OSV_COUNTERS_HH > diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh > index 2d5372fc..1e0ab6fa 100644 > --- a/include/osv/irqlock.hh > +++ b/include/osv/irqlock.hh > @@ -34,6 +34,10 @@ inline void irq_save_lock_type::lock() > inline void irq_save_lock_type::unlock() > { > _flags.restore(); > +#ifndef AARCH64_PORT_STUB > + --arch::irq_preempt_counters._data.irq; > + barrier(); > +#endif > } > > extern irq_lock_type irq_lock; > diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh > index 18edf441..a731ed6e 100644 > --- a/include/osv/mmu-defs.hh > +++ b/include/osv/mmu-defs.hh > @@ -15,8 +15,6 @@ > > struct exception_frame; > > -struct exception_frame; > - > namespace mmu { > > constexpr uintptr_t page_size = 4096; > @@ -84,6 +82,7 @@ enum { > mmap_small = 1ul << 5, > mmap_jvm_balloon = 1ul << 6, > mmap_file = 1ul << 7, > + mmap_stack = 1ul << 8, > }; > > enum { > diff --git a/include/osv/sched.hh b/include/osv/sched.hh > index 0fb44f77..96453a8b 100644 > --- a/include/osv/sched.hh > +++ b/include/osv/sched.hh > @@ -25,6 +25,7 @@ > #include <osv/rcu.hh> > #include <osv/clock.hh> > #include <osv/timer-set.hh> > +#include "osv/counters.hh" > #include <string.h> > > typedef float runtime_t; > @@ -335,6 +336,7 @@ public: > void* begin; > size_t size; > void (*deleter)(stack_info si); // null: don't delete > + bool lazy = false; > static void default_deleter(stack_info si); > }; > struct attr { > @@ -978,14 +980,13 @@ inline void release(mutex_t* mtx) > } > } > > -extern unsigned __thread preempt_counter; > extern bool __thread need_reschedule; > > #ifdef __OSV_CORE__ > inline unsigned int get_preempt_counter() > { > barrier(); > - return preempt_counter; > + return arch::irq_preempt_counters._data.preempt; > } > > inline bool preemptable() > @@ -1005,14 +1006,15 @@ inline void preempt() > > inline void preempt_disable() > { > - ++preempt_counter; > + arch::ensure_next_stack_page(); > + ++arch::irq_preempt_counters._data.preempt; > barrier(); > } > > inline void preempt_enable() > { > barrier(); > - --preempt_counter; > + --arch::irq_preempt_counters._data.preempt; > if (preemptable() && need_reschedule && arch::irq_enabled()) { > cpu::schedule(); > } > diff --git a/libc/mman.cc b/libc/mman.cc > index d0803ac4..d00befc3 100644 > --- a/libc/mman.cc > +++ b/libc/mman.cc > @@ -38,12 +38,11 @@ unsigned libc_flags_to_mmap(int flags) > mmap_flags |= mmu::mmap_populate; > } > if (flags & MAP_STACK) { > - // OSv currently requires that stacks be pinned (see issue #143). 
> So > - // if an application wants to mmap() a stack for > pthread_attr_setstack > - // and did us the courtesy of telling this to ue (via MAP_STACK), > - // let's return the courtesy by returning pre-faulted memory. > - // FIXME: If issue #143 is fixed, this workaround should be > removed. > +#ifdef AARCH64_PORT_STUB > mmap_flags |= mmu::mmap_populate; > +#else > + mmap_flags |= mmu::mmap_stack; > +#endif > } > if (flags & MAP_SHARED) { > mmap_flags |= mmu::mmap_shared; > diff --git a/libc/pthread.cc b/libc/pthread.cc > index 8c976bf6..2279ac0c 100644 > --- a/libc/pthread.cc > +++ b/libc/pthread.cc > @@ -140,10 +140,18 @@ namespace pthread_private { > return {attr.stack_begin, attr.stack_size}; > } > size_t size = attr.stack_size; > - void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, > mmu::perm_rw); > +#ifdef AARCH64_PORT_STUB > + unsigned stack_flags = mmu::mmap_populate; > +#else > + unsigned stack_flags = mmu::mmap_stack; > +#endif > + void *addr = mmu::map_anon(nullptr, size, stack_flags, > mmu::perm_rw); > mmu::mprotect(addr, attr.guard_size, 0); > sched::thread::stack_info si{addr, size}; > si.deleter = free_stack; > +#ifndef AARCH64_PORT_STUB > + si.lazy = true; > +#endif > return si; > } > > -- > 2.20.1 > > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/6b547aaf-852b-46e3-b154-649d0266e3a1%40googlegroups.com.
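Regarding the vma_list_mutex deadlock mentioned at the top: the pattern the patch applies in core/mmu.cc is, in essence, the following. This is a simplified, self-contained sketch with hypothetical names and std::shared_mutex standing in for OSv's rwlock; the real code is prevent_stack_page_fault() plus arch::lazy_stack_flags_guard in the hunks above:

```
#include <shared_mutex>

// Per-thread flag; while non-zero, ensure_next_stack_page() (see the sketch
// earlier / arch/x64/arch.hh) skips its stack probe entirely.
inline thread_local unsigned lazy_stack_disabled = 0;

// Touch the next two stack pages while page faults can still be handled,
// so nothing inside the write-locked section below can fault on the stack.
inline void prefault_two_stack_pages()
{
    char probe;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(probe));
    asm volatile("movb -8192(%%rsp), %0" : "=r"(probe));
    (void)probe;
}

// RAII guard: suppress the pre-fault probe for the duration of a scope.
struct lazy_stack_guard {
    lazy_stack_guard()  { ++lazy_stack_disabled; }
    ~lazy_stack_guard() { --lazy_stack_disabled; }
};

std::shared_mutex vma_list_lock;   // stands in for OSv's vma_list_mutex

void modify_vma_list()             // e.g. the munmap()/mprotect() path
{
    prefault_two_stack_pages();    // fault now, while it is still safe
    lazy_stack_guard guard;        // no stack probes past this point
    std::unique_lock<std::shared_mutex> lock(vma_list_lock);
    // ... mutate the mapping list; a stack fault here would have needed
    // vma_list_lock for read and deadlocked against this write lock ...
}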
