As issue #1117 describes, OSv very often crashes early in boot when running on QEMU with KVM acceleration enabled, with a backtrace looking like this:
```
OSv v0.55.0-164-gd58635f3
getauxval() stubbed
page fault outside application, addr: 0xffff80000a000000
[registers]
PC: 0x00000000402ea2c0 <mmio_getl(void volatile*)+0>
X00: 0xffff80000a000000 X01: 0x008000000a000713 X02: 0xffff80000a001000 X03: 0x008000000a000713
X04: 0xffff80000a000000 X05: 0xffff80000a000fff X06: 0xffff80000a000000 X07: 0x0000000000000000
X08: 0x0000000000000001 X09: 0xffff800040ed7008 X10: 0x0000000000000713 X11: 0xfffffffffffff8a3
X12: 0x0080000000000003 X13: 0x0000200000100a40 X14: 0x0000000000000703 X15: 0x0000000000000000
X16: 0x0000000000000000 X17: 0x0000000000000000 X18: 0x0000000000000000 X19: 0xffffa00040ed9140
X20: 0xffffa000407e71a0 X21: 0x0000000000000030 X22: 0x0000000000000200 X23: 0x0000000040554390
X24: 0xffffa000407e7260 X25: 0xffffa00040c68cc0 X26: 0x00000000401fd670 X27: 0x000000000a000000
X28: 0x0000000040735010 X29: 0x0000200000100aa0 X30: 0x00000000401fd394
SP:  0x0000200000100aa0
ESR: 0x0000000096000007
PSTATE: 0x0000000080000345
Aborted
[backtrace]
0x00000000401d9564 <mmu::vm_fault(unsigned long, exception_frame*)+724>
0x000000004020aa54 <page_fault+192>
0x000000004020a824 <???+1075882020>
0x00000000401fd504 <virtio::register_mmio_devices(hw::device_manager*)+180>
0x00000000402083a0 <arch_init_drivers()+256>
0x00000000400daf48 <do_main_thread(void*)+104>
0x0000000040347240 <???+1077178944>
0x00000000402e7ad0 <thread_main_c+32>
0x00000000402e5468 <sched::cpu::reschedule_from_interrupt(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+788>
```
The crash happens in the virtio::register_mmio_devices() method right after the mmio::mmap() call, when the code tries to access the memory-mapped configuration area of the device. While researching this issue, I discovered that calling mmu::flush_tlb_all() right after mmio_mmap() makes the issue go away.
But it was not clear to me why, until Punit Agrawal discovered that flush_tlb_all() issues a data synchronization barrier instruction (DSB) followed by an instruction synchronization barrier instruction (ISB). It turns out that on the aarch64 architecture, which has a weak memory model, it is necessary to force completion of memory writes to page table entries and to flush the CPU instruction pipeline before a subsequent page table walk. Otherwise such a page table walk can trigger a translation fault like the one above. The relevant fragment of the "ARM v8 Architecture Reference Manual" (D5-2557) states:

"A translation table walk is considered to be a separate observer. An explicit write to the translation tables might be observed by that separate observer for either of the following:
- A translation table walk caused by a different explicit write generated by the same instruction.
- A translation table walk caused by an explicit access generated by any instruction appearing in program order after the instruction doing the explicit write to the translation table.

The explicit write to the translation tables is guaranteed to be observable, to the extent required by the shareability attributes, only after the execution of a DSB instruction. This DSB instruction and the instruction that performed the explicit write to the translation tables must have been executed by the same PE. Any writes to the translation tables are not observed by the translation table walks of an explicit memory access generated by a load or store that occurs in program order before the instruction that performs the write to the translation tables."
The comments in the related Linux kernel patches explain the need for the ISB instruction:
- https://github.com/torvalds/linux/commit/d0b7a302d58abe24ed0f32a0672dd4c356bb73db
- https://github.com/torvalds/linux/commit/24fe1b0efad4fcdd32ce46cffeab297f22581707
- https://github.com/torvalds/linux/commit/7f0b1bf04511348995d6fce38c87c98a3b5cb781

As far as the solution goes, we could have added inline assembly issuing DSB and ISB to each PTE write/modification method found in include/osv/mmu-defs.hh:
- void hw_ptep_impl::write(pt_element<N> pte)
- void hw_ptep_rcu_impl::write(pt_element<N> pte)
- pt_element<N> hw_ptep::exchange(pt_element<N> newval)
- pt_element<N> hw_ptep::compare_exchange(pt_element<N> newval)

But it turns out there is a central method in the mmu logic that gets called for every page manipulation case:

template<typename PageOp>
void map_range(uintptr_t vma_start, uintptr_t vstart, size_t size, PageOp& page_mapper, size_t slop = page_size)

So this patch modifies map_range() to call an arch-specific synchronize_page_table_modifications() function that issues the DSB, ISB sequence on aarch64 and does nothing on x64. Special credit goes to Punit Agrawal, who discovered the real culprit of the issue this patch fixes.
Fixes #1117

Signed-off-by: Waldemar Kozaczuk <[email protected]>
---
 arch/aarch64/arch-mmu.hh | 22 ++++++++++++++++++++++
 arch/x64/arch-mmu.hh     |  6 ++++++
 core/mmu.cc              |  7 +++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/aarch64/arch-mmu.hh b/arch/aarch64/arch-mmu.hh
index c407832c..bd9efeb1 100644
--- a/arch/aarch64/arch-mmu.hh
+++ b/arch/aarch64/arch-mmu.hh
@@ -174,5 +174,27 @@ pt_element<N> make_pte(phys addr, bool leaf, unsigned perm = perm_rwx,
     return pte;
 }
 
+// On aarch64 architecture which comes with weak memory model,
+// it is necessary to force completion of writes to page table entries
+// and flush cpu pipeline to make sure that subsequent accesses to the
+// mapped memory areas do not cause page faults during page table walk later.
+// The fragment from the "ARM v8 Architecture Reference Manual" (D5-2557) states
+// this:
+// "A translation table walk is considered to be a separate observer.
+// An explicit write to the translation tables might be observed by that separate observer for either of the following:
+// - A translation table walk caused by a different explicit write generated by the same instruction.
+// - A translation table walk caused by an explicit access generated by any instruction appearing
+// in program order after the instruction doing the explicit write to the translation table.
+// The explicit write to the translation tables is guaranteed to be observable, to the extent
+// required by the shareability attributes, only after the execution of a DSB instruction.
+// This DSB instruction and the instruction that performed the explicit write to the translation
+// tables must have been executed by the same PE.
+// Any writes to the translation tables are not observed by the translation table walks of
+// an explicit memory access generated by a load or store that occurs in program order
+// before the instruction that performs the write to the translation tables."
+inline void synchronize_page_table_modifications() {
+    asm volatile("dsb ishst; isb;");
+}
+
 }
 #endif /* ARCH_MMU_HH */
diff --git a/arch/x64/arch-mmu.hh b/arch/x64/arch-mmu.hh
index 2e6ea13e..b974c15d 100644
--- a/arch/x64/arch-mmu.hh
+++ b/arch/x64/arch-mmu.hh
@@ -146,5 +146,11 @@ pt_element<N> make_pte(phys addr, bool leaf, unsigned perm = perm_rwx,
     return pte;
 }
 
+// On Intel x86_64 architecture which comes with strong memory model
+// it is not necessary to do anything extra after writes to page
+// tables entries to make them visible to page table walker.
+inline void synchronize_page_table_modifications() {
+}
+
 }
 #endif /* ARCH_MMU_HH_ */
diff --git a/core/mmu.cc b/core/mmu.cc
index 37a1c60b..cd2cd550 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -351,6 +351,13 @@ template<typename PageOp>
 {
     map_level<PageOp, 4> pt_mapper(vma_start, vstart, size, page_mapper, slop);
     pt_mapper(hw_ptep<4>::force(mmu::get_root_pt(vstart)));
+    // On some architectures with weak memory model it is necessary
+    // to force writes to page table entries complete and instruction pipeline
+    // flushed so that new mappings are properly visible when relevant newly mapped
+    // virtual memory areas are accessed right after this point.
+    // So let us call arch-specific function to execute the logic above if
+    // applicable for given architecture.
+    synchronize_page_table_modifications();
 }
 
 template<typename PageOp, int ParentLevel> class map_level {
-- 
2.29.2
