Bump again :-)

On Sunday, March 1, 2020 at 10:30:07 AM UTC-5, Waldek Kozaczuk wrote:

Bump!

On Monday, February 10, 2020 at 12:14:52 AM UTC-5, Waldek Kozaczuk wrote:

Nadav,

What do you think about this patch? Also, do you think that the deadlock
issue related to vma_list_mutex that I discovered and had to overcome (see
my previous email in this chain) is the only one we need to worry about, or
are there other potential hidden problems like it that I have failed to notice?

Shall I make this lazy-stack feature conditional on an #ifdef for now, and
only make it permanent once we feel confident about it?

Waldek

On Sunday, February 2, 2020 at 11:15:30 AM UTC-5, Waldek Kozaczuk wrote:

As issue #143 explains, application stacks are currently eagerly allocated
(aka pre-populated), which may lead to substantial memory waste. We want
those stacks to be lazily allocated instead, so that physical memory behind
the stack gets allocated as needed by the standard fault-handling mechanism.

The page-fault handling logic requires that both interrupts and preemption
are enabled. Therefore this patch implements a simple strategy: read a
single byte on the stack one page (4K) deeper before disabling interrupts
or preemption.

It turns out we cannot read that single byte "blindly", because in some
scenarios interrupts might already be disabled when we are about to disable
preemption, and vice versa. Similarly, and more speculatively, disabling
preemption (and possibly interrupts) may nest. Therefore we need a
"conditional" read of that byte on the stack: more specifically, we should
ONLY read the byte IF both interrupts and preemption are enabled.

This patch achieves this by adding an extra thread-local counter that tracks
interrupt disabling/enabling and operates in a similar way to the preemption
counter. Every time interrupts are disabled we increment the counter, and we
decrement it every time interrupts are re-enabled. Then, every time before we
disable interrupts or preemption, we check whether both counters are 0 and
only then try to read a byte to potentially trigger a fault.
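To make the ordering described above concrete, here is a minimal, x86-64-only
sketch of the conditional pre-fault check. It is not the exact patch code (the
real implementation lives in arch/x64/arch.hh and include/osv/counters.hh
below, and keeps the preemption counter, the interrupt counter and the
lazy-stack flags in one thread-local union); the single combined_counter
variable and the *_sketch helpers here are illustrative stand-ins:

```
// A simplified sketch of the conditional pre-fault described above;
// combined_counter stands in for the patch's thread-local
// arch::irq_preempt_counters union, and the *_sketch functions stand in
// for the real preempt_disable()/irq_disable() paths.
#include <cstdint>

namespace sketch {

// Non-zero whenever preemption or interrupts are already disabled (or the
// lazy-stack logic is switched off for this thread); zero means it is safe
// to take and handle a page fault right now.
thread_local uint64_t combined_counter = 1;   // starts "disabled", as in the patch

inline void ensure_next_stack_page()
{
    if (combined_counter) {   // interrupts or preemption already disabled
        return;               // -> must not risk a page fault here
    }
    // Touch one byte a page (4K) below the stack pointer so that, if that
    // page of a lazily allocated stack is not populated yet, the fault is
    // taken NOW, while it can still be handled.
    char probe;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(probe));
}

inline void preempt_disable_sketch()
{
    ensure_next_stack_page();          // pre-fault first ...
    ++combined_counter;                // ... then record "preemption disabled"
    asm volatile("" ::: "memory");     // compiler barrier, like OSv's barrier()
}

inline void preempt_enable_sketch()
{
    asm volatile("" ::: "memory");
    --combined_counter;
}

} // namespace sketch
```

The real patch folds the two counters and the flag bits into one 64-bit union
precisely so this check stays a single compare-and-branch, as the disassembly
below shows.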
Please note that performance-wise this patch is slightly more expensive than
originally hoped. As the disassembled snippets below illustrate, it costs 3-4
extra instructions every time we disable preemption or interrupts.

A. Disabling preemption in __tls_get_addr before the patch:
```
4035cae8:   e8 83 f6 ff ff          callq  4035c170 <elf::get_program()>
4035caed:   4c 8b 2d 6c 40 51 00    mov    0x51406c(%rip),%r13        # 40870b60 <sched::preempt_counter+0x40870b60>
4035caf4:   4c 8b 33                mov    (%rbx),%r14
4035caf7:   64 41 83 45 00 01       addl   $0x1,%fs:0x0(%r13)
4035cafd:   e8 6e f6 ff ff          callq  4035c170 <elf::get_program()>
```

B. Disabling preemption in __tls_get_addr after the patch:
```
4035fdd8:   e8 63 f6 ff ff          callq  4035f440 <elf::get_program()>
4035fddd:   4c 8b 2d a4 4d 51 00    mov    0x514da4(%rip),%r13        # 40874b88 <arch::irq_preempt_counters+0x40874b88>
4035fde4:   4c 8b 33                mov    (%rbx),%r14
4035fde7:   64 49 83 7d 00 00       cmpq   $0x0,%fs:0x0(%r13)
4035fded:   75 07                   jne    4035fdf6 <__tls_get_addr+0x76>
4035fdef:   8a 84 24 00 f0 ff ff    mov    -0x1000(%rsp),%al
4035fdf6:   64 41 83 45 00 01       addl   $0x1,%fs:0x0(%r13)
4035fdfc:   e8 3f f6 ff ff          callq  4035f440 <elf::get_program()>
```

As an example of the memory savings, tomcat using ~300 threads ended up using
365MB instead of the 620MB it used before the patch.

Fixes #143

Signed-off-by: Matthew Pabst <[email protected]>
Signed-off-by: Waldemar Kozaczuk <[email protected]>
---
 arch/aarch64/arch.hh    |  5 ++++
 arch/x64/arch-switch.hh | 12 ++++++++++
 arch/x64/arch.hh        | 34 ++++++++++++++++++++++++++
 arch/x64/mmu.cc         |  5 ++++
 core/mmu.cc             | 24 +++++++++++++++++++
 core/sched.cc           |  6 ++++-
 include/osv/counters.hh | 53 +++++++++++++++++++++++++++++++++++++++++
 include/osv/irqlock.hh  |  4 ++++
 include/osv/mmu-defs.hh |  3 +--
 include/osv/sched.hh    | 10 ++++----
 libc/mman.cc            |  9 ++++---
 libc/pthread.cc         | 10 +++++++-
 12 files changed, 162 insertions(+), 13 deletions(-)
 create mode 100644 include/osv/counters.hh

diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
index 855f1987..473e84e9 100644
--- a/arch/aarch64/arch.hh
+++ b/arch/aarch64/arch.hh
@@ -20,6 +20,11 @@ namespace arch {
 #define INSTR_SIZE_MIN 4
 #define ELF_IMAGE_START (OSV_KERNEL_BASE + 0x10000)
 
+inline void ensure_next_stack_page()
+{
+    //TODO: Implement lazy stack check for AARCH64
+}
+
 inline void irq_disable()
 {
     processor::irq_disable();
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 6803498f..3e52f7f8 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -148,6 +148,13 @@ void thread::init_stack()
     _state.rip = reinterpret_cast<void*>(thread_main);
     _state.rsp = stacktop;
     _state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack);
+
+    if (stack.lazy) {
+        // If the thread stack is set up to be lazily allocated, then, given that the thread
+        // initially starts running with preemption disabled, we need to pre-fault the first stack page.
+        volatile char r = *((char *) (stacktop - 1));
+        (void) r; // trick the compiler into thinking that r is used
+    }
 }
 
 void thread::setup_tcb()
@@ -311,6 +318,11 @@ void thread::free_tcb()
 
 void thread_main_c(thread* t)
 {
+    // The lazy stack pre-fault logic is disabled by default; enable it for this
+    // thread only if its stack is lazy (otherwise it is most likely a kernel thread)
+    if (t->get_stack_info().lazy) {
+        arch::irq_preempt_counters._data.set_lazy_flags_bit(arch::lazy_stack_disable, false);
+    }
     arch::irq_enable();
 #ifdef CONF_preempt
     preempt_enable();
diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
index 17df5f5c..74b728dc 100644
--- a/arch/x64/arch.hh
+++ b/arch/x64/arch.hh
@@ -10,6 +10,8 @@
 
 #include "processor.hh"
 #include "msr.hh"
+#include <osv/barrier.hh>
+#include <osv/counters.hh>
 
 // namespace arch - architecture independent interface for architecture
 // dependent operations (e.g. irq_disable vs. cli)
@@ -20,8 +22,33 @@ namespace arch {
 #define INSTR_SIZE_MIN 1
 #define ELF_IMAGE_START OSV_KERNEL_BASE
 
+inline void ensure_next_stack_page() {
+    // Because both the irq and preempt counters and the lazy stack
+    // flags are co-located within the same 8-byte-long union,
+    // we can check a single field for a non-zero value to determine
+    // whether we should trigger a pre-fault on the lazy stack
+    if (irq_preempt_counters.lazy_stack_no_pre_fault) {
+        return;
+    }
+
+    char i;
+    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
+inline void ensure_next_two_stack_pages() {
+    if (irq_preempt_counters.lazy_stack_no_pre_fault) {
+        return;
+    }
+
+    char i;
+    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+    asm volatile("movb -8192(%%rsp), %0" : "=r"(i));
+}
+
 inline void irq_disable()
 {
+    ensure_next_stack_page();
+    ++irq_preempt_counters._data.irq;
     processor::cli();
 }
 
@@ -36,11 +63,15 @@ inline void irq_disable_notrace()
 inline void irq_enable()
 {
     processor::sti();
+    --irq_preempt_counters._data.irq;
+    barrier();
 }
 
 inline void wait_for_interrupt()
 {
     processor::sti_hlt();
+    --irq_preempt_counters._data.irq;
+    barrier();
 }
 
 inline void halt_no_interrupts()
@@ -78,12 +109,15 @@ private:
 
 inline void irq_flag_notrace::save()
 {
+    ensure_next_stack_page();
+    ++irq_preempt_counters._data.irq;
     asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : "=r"(_rflags));
 }
 
 inline void irq_flag_notrace::restore()
 {
     asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : "r"(_rflags));
+    --irq_preempt_counters._data.irq;
 }
 
 inline bool irq_flag_notrace::enabled() const
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 24da5caa..ae619a9f 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -37,6 +37,11 @@ void page_fault(exception_frame *ef)
     assert(sched::preemptable());
     assert(ef->rflags & processor::rflags_if);
 
+    // Even though we are on the exception stack, which is not lazy,
+    // as a precaution we disable the pre-fault logic
+    // for the time the fault is being handled
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_fault_disable);
+
     // And since we may sleep, make sure interrupts are enabled.
     DROP_LOCK(irq_lock) { // irq_lock is acquired by HW
         mmu::vm_fault(addr, ef);
diff --git a/core/mmu.cc b/core/mmu.cc
index ff3fab47..b6e7cc36 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -42,6 +42,18 @@ extern const char text_start[], text_end[];
 
 namespace mmu {
 
+static void prevent_stack_page_fault() {
+    // We need to ensure that the lazy stack is populated
+    // deeply enough (2 pages) for all cases where
+    // the vma_list_mutex is taken for write, so that we can disable
+    // the pre-fault check (arch::ensure_next_stack_page) which,
+    // if a page fault were triggered, would make the page-fault handling logic
+    // attempt to take the same vma_list_mutex for read and end up in a deadlock
+#ifndef AARCH64_PORT_STUB
+    arch::ensure_next_two_stack_pages();
+#endif
+}
+
 namespace bi = boost::intrusive;
 
 class vma_compare {
@@ -1183,6 +1195,8 @@ static void nohugepage(void* addr, size_t length)
 
 error advise(void* addr, size_t size, int advice)
 {
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     WITH_LOCK(vma_list_mutex.for_write()) {
         if (!ismapped(addr, size)) {
             return make_error(ENOMEM);
@@ -1215,6 +1229,8 @@ void* map_anon(const void* addr, size_t size, unsigned flags, unsigned perm)
     size = align_up(size, mmu::page_size);
     auto start = reinterpret_cast<uintptr_t>(addr);
     auto* vma = new mmu::anon_vma(addr_range(start, start + size), perm, flags);
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     SCOPE_LOCK(vma_list_mutex.for_write());
     auto v = (void*) allocate(vma, start, size, search);
     if (flags & mmap_populate) {
@@ -1241,6 +1257,8 @@ void* map_file(const void* addr, size_t size, unsigned flags, unsigned perm,
     auto start = reinterpret_cast<uintptr_t>(addr);
     auto *vma = f->mmap(addr_range(start, start + size), flags | mmap_file, perm, offset).release();
     void *v;
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     WITH_LOCK(vma_list_mutex.for_write()) {
         v = (void*) allocate(vma, start, size, search);
         if (flags & mmap_populate) {
@@ -1607,6 +1625,8 @@ ulong map_jvm(unsigned char* jvm_addr, size_t size, size_t align, balloon_ptr b)
 
     auto* vma = new mmu::jvm_balloon_vma(jvm_addr, start, start + size, b, v->perm(), v->flags());
 
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     WITH_LOCK(vma_list_mutex.for_write()) {
         // This means that the mapping that we had before was a balloon mapping
         // that was laying around and wasn't updated to an anon mapping. If we
@@ -1868,6 +1888,8 @@ void free_initial_memory_range(uintptr_t addr, size_t size)
 
 error mprotect(const void *addr, size_t len, unsigned perm)
 {
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     SCOPE_LOCK(vma_list_mutex.for_write());
 
     if (!ismapped(addr, len)) {
@@ -1879,6 +1901,8 @@ error mprotect(const void *addr, size_t len, unsigned perm)
 
 error munmap(const void *addr, size_t length)
 {
+    prevent_stack_page_fault();
+    arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
     SCOPE_LOCK(vma_list_mutex.for_write());
 
     length = align_up(length, mmu::page_size);
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..4510a85e 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -42,6 +42,11 @@ extern char _percpu_start[], _percpu_end[];
 using namespace osv;
 using namespace osv::clock::literals;
 
+namespace arch {
+// By default disable lazy stack pre-fault logic
+counters __thread irq_preempt_counters = {{1, 1, 1}};
+}
+
 namespace sched {
 
 TRACEPOINT(trace_sched_idle, "");
@@ -69,7 +74,6 @@ std::vector<cpu*> cpus __attribute__((init_priority((int)init_prio::cpus)));
 thread __thread * s_current;
 cpu __thread * current_cpu;
 
-unsigned __thread preempt_counter = 1;
 bool __thread need_reschedule = false;
 
 elf::tls_data tls;
diff --git a/include/osv/counters.hh b/include/osv/counters.hh
new file mode 100644
index 00000000..b862f32c
--- /dev/null
+++ b/include/osv/counters.hh
@@ -0,0 +1,53 @@
+/*
+ * Copyright (C) 2020 Waldemar Kozaczuk
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#ifndef OSV_COUNTERS_HH
+#define OSV_COUNTERS_HH
+
+namespace arch {
+
+// Both the preempt and irq counters are co-located next to each other,
+// along with the flags that drive the lazy-stack related logic,
+// in one union by design, so that the compiler can optimize ensure_next_stack_page()
+// to test against a single 64-bit lazy_stack_no_pre_fault field
+union counters {
+    struct data {
+        unsigned preempt;
+        u16 irq;
+        u16 lazy_stack_flags;
+
+        inline void set_lazy_flags_bit(unsigned nr, bool v) {
+            lazy_stack_flags &= ~(u64(1) << nr);
+            lazy_stack_flags |= u64(v) << nr;
+        }
+    } _data;
+    uint64_t lazy_stack_no_pre_fault;
+};
+
+extern counters __thread irq_preempt_counters;
+
+class lazy_stack_flags_guard {
+    unsigned _flag_nr;
+public:
+    lazy_stack_flags_guard(unsigned flag_nr) {
+        _flag_nr = flag_nr;
+        irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, true);
+    }
+    ~lazy_stack_flags_guard() {
+        irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, false);
+    }
+};
+
+enum {
+    lazy_stack_disable = 0,
+    lazy_stack_mmu_disable = 1,
+    lazy_stack_fault_disable = 2,
+};
+
+}
+
+#endif //OSV_COUNTERS_HH
diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh
index 2d5372fc..1e0ab6fa 100644
--- a/include/osv/irqlock.hh
+++ b/include/osv/irqlock.hh
@@ -34,6 +34,10 @@ inline void irq_save_lock_type::lock()
 inline void irq_save_lock_type::unlock()
 {
     _flags.restore();
+#ifndef AARCH64_PORT_STUB
+    --arch::irq_preempt_counters._data.irq;
+    barrier();
+#endif
 }
 
 extern irq_lock_type irq_lock;
diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..a731ed6e 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -15,8 +15,6 @@
 
 struct exception_frame;
 
-struct exception_frame;
-
 namespace mmu {
 
 constexpr uintptr_t page_size = 4096;
@@ -84,6 +82,7 @@ enum {
     mmap_small = 1ul << 5,
     mmap_jvm_balloon = 1ul << 6,
     mmap_file = 1ul << 7,
+    mmap_stack = 1ul << 8,
 };
 
 enum {
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..96453a8b 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -25,6 +25,7 @@
 #include <osv/rcu.hh>
 #include <osv/clock.hh>
 #include <osv/timer-set.hh>
+#include "osv/counters.hh"
 #include <string.h>
 
 typedef float runtime_t;
@@ -335,6 +336,7 @@ public:
         void* begin;
         size_t size;
         void (*deleter)(stack_info si); // null: don't delete
+        bool lazy = false;
         static void default_deleter(stack_info si);
     };
     struct attr {
@@ -978,14 +980,13 @@ inline void release(mutex_t* mtx)
     }
 }
 
-extern unsigned __thread preempt_counter;
 extern bool __thread need_reschedule;
 
 #ifdef __OSV_CORE__
 inline unsigned int get_preempt_counter()
 {
     barrier();
-    return preempt_counter;
+    return arch::irq_preempt_counters._data.preempt;
 }
 
 inline bool preemptable()
@@ -1005,14 +1006,15 @@ inline void preempt()
 
 inline void preempt_disable()
 {
-    ++preempt_counter;
+    arch::ensure_next_stack_page();
+    ++arch::irq_preempt_counters._data.preempt;
     barrier();
 }
 
 inline void preempt_enable()
 {
     barrier();
-    --preempt_counter;
+    --arch::irq_preempt_counters._data.preempt;
     if (preemptable() && need_reschedule && arch::irq_enabled()) {
         cpu::schedule();
     }
diff --git a/libc/mman.cc b/libc/mman.cc
index d0803ac4..d00befc3 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -38,12 +38,11 @@ unsigned libc_flags_to_mmap(int flags)
         mmap_flags |= mmu::mmap_populate;
     }
     if (flags & MAP_STACK) {
-        // OSv currently requires that stacks be pinned (see issue #143). So
-        // if an application wants to mmap() a stack for pthread_attr_setstack
-        // and did us the courtesy of telling this to ue (via MAP_STACK),
-        // let's return the courtesy by returning pre-faulted memory.
-        // FIXME: If issue #143 is fixed, this workaround should be removed.
+#ifdef AARCH64_PORT_STUB
         mmap_flags |= mmu::mmap_populate;
+#else
+        mmap_flags |= mmu::mmap_stack;
+#endif
     }
     if (flags & MAP_SHARED) {
         mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index 8c976bf6..2279ac0c 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -140,10 +140,18 @@ namespace pthread_private {
         return {attr.stack_begin, attr.stack_size};
     }
     size_t size = attr.stack_size;
-    void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
+#ifdef AARCH64_PORT_STUB
+    unsigned stack_flags = mmu::mmap_populate;
+#else
+    unsigned stack_flags = mmu::mmap_stack;
+#endif
+    void *addr = mmu::map_anon(nullptr, size, stack_flags, mmu::perm_rw);
     mmu::mprotect(addr, attr.guard_size, 0);
     sched::thread::stack_info si{addr, size};
     si.deleter = free_stack;
+#ifndef AARCH64_PORT_STUB
+    si.lazy = true;
+#endif
     return si;
 }

-- 
2.20.1
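One design point worth spelling out: the patch relies on union counters (in
include/osv/counters.hh above) overlaying the 4-byte preempt counter, the
2-byte irq counter and the 2-byte lazy_stack_flags exactly onto the single
8-byte lazy_stack_no_pre_fault field, so that one 64-bit compare (the cmpq
seen in the disassembly earlier) covers all three. Below is a small
self-contained sketch of that layout assumption; the u16/u64 aliases, the
static_asserts and the main() driver are illustrative additions, not part of
the patch:

```
// Minimal, self-contained sketch of the layout assumption behind
// arch::counters (include/osv/counters.hh in the patch above).
#include <cstdint>
#include <cstdio>

using u16 = uint16_t;
using u64 = uint64_t;

union counters {
    struct data {
        unsigned preempt;          // 4 bytes
        u16 irq;                   // 2 bytes
        u16 lazy_stack_flags;      // 2 bytes
    } _data;
    u64 lazy_stack_no_pre_fault;   // overlays all three fields above
};

// The single 64-bit test is only valid if the three fields exactly fill
// the 8-byte union (no padding).
static_assert(sizeof(counters) == 8, "counters must fit in one 64-bit word");
static_assert(sizeof(counters::data) == 8, "no padding expected in counters::data");

int main()
{
    counters c = {{1, 1, 1}};      // the patch's default: everything "disabled"
    // Reading lazy_stack_no_pre_fault after writing _data is union
    // type-punning: undefined by the C++ standard but defined by GCC/Clang,
    // which is exactly what the patch itself relies on.
    std::printf("combined=%llu\n", (unsigned long long)c.lazy_stack_no_pre_fault);

    c._data.preempt = 0;
    c._data.irq = 0;
    c._data.lazy_stack_flags = 0;
    // Now the combined field reads zero, which is the only state in which
    // ensure_next_stack_page() would touch the next stack page.
    std::printf("combined=%llu\n", (unsigned long long)c.lazy_stack_no_pre_fault);
    return 0;
}
```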
