As issue #1117 describes, OSv crashes very often early on when running
on QEMU with KVM acceleration enabled, with a backtrace looking like this:

```
OSv v0.55.0-164-gd58635f3
getauxval() stubbed
page fault outside application, addr: 0xffff80000a000000
[registers]
PC: 0x00000000402ea2c0 <mmio_getl(void volatile*)+0>
X00: 0xffff80000a000000 X01: 0x008000000a000713 X02: 0xffff80000a001000
X03: 0x008000000a000713 X04: 0xffff80000a000000 X05: 0xffff80000a000fff
X06: 0xffff80000a000000 X07: 0x0000000000000000 X08: 0x0000000000000001
X09: 0xffff800040ed7008 X10: 0x0000000000000713 X11: 0xfffffffffffff8a3
X12: 0x0080000000000003 X13: 0x0000200000100a40 X14: 0x0000000000000703
X15: 0x0000000000000000 X16: 0x0000000000000000 X17: 0x0000000000000000
X18: 0x0000000000000000 X19: 0xffffa00040ed9140 X20: 0xffffa000407e71a0
X21: 0x0000000000000030 X22: 0x0000000000000200 X23: 0x0000000040554390
X24: 0xffffa000407e7260 X25: 0xffffa00040c68cc0 X26: 0x00000000401fd670
X27: 0x000000000a000000 X28: 0x0000000040735010 X29: 0x0000200000100aa0
X30: 0x00000000401fd394 SP:  0x0000200000100aa0 ESR: 0x0000000096000007
PSTATE: 0x0000000080000345
Aborted

[backtrace]
0x00000000401d9564 <mmu::vm_fault(unsigned long, exception_frame*)+724>
0x000000004020aa54 <page_fault+192>
0x000000004020a824 <???+1075882020>
0x00000000401fd504 <virtio::register_mmio_devices(hw::device_manager*)+180>
0x00000000402083a0 <arch_init_drivers()+256>
0x00000000400daf48 <do_main_thread(void*)+104>
0x0000000040347240 <???+1077178944>
0x00000000402e7ad0 <thread_main_c+32>
0x00000000402e5468 <sched::cpu::reschedule_from_interrupt(bool, 
std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+788>
```

The crash would happen in the virtio::register_mmio_devices() method right
after the mmio_mmap() call, when the code tries to access the memory-mapped
configuration area of the device. While researching this issue, I discovered
that calling mmu::flush_tlb_all() right after mmio_mmap() made the issue
go away. But it was not clear to me why until Punit Agrawal discovered that
flush_tlb_all() issues the memory synchronization barrier instruction DSB
followed by the instruction synchronization barrier instruction ISB.

It turns out that on the aarch64 architecture, which has a weak memory
model, it is necessary to force completion of memory writes to page table
entries and to flush the CPU instruction pipeline before a subsequent page
table walk. Otherwise, such a page table walk could trigger a translation
fault like the one above. See the related fragment from the "ARM v8
Architecture Reference Manual" (D5-2557), which states:

"A translation table walk is considered to be a separate observer.

An explicit write to the translation tables might be observed by that
separate observer for either of the following:
- A translation table walk caused by a different explicit write generated
  by the same instruction.
- A translation table walk caused by an explicit access generated by any
  instruction appearing in program order after the instruction doing the
  explicit write to the translation table.

The explicit write to the translation tables is guaranteed to be
observable, to the extent required by the shareability attributes, only
after the execution of a DSB instruction. This DSB instruction and the
instruction that performed the explicit write to the translation tables
must have been executed by the same PE. Any writes to the translation
tables are not observed by the translation table walks of an explicit
memory access generated by a load or store that occurs in program order
before the instruction that performs the write to the translation tables."

The comments from related patches to the Linux kernel explain the need to
use the ISB instruction:
- https://github.com/torvalds/linux/commit/d0b7a302d58abe24ed0f32a0672dd4c356bb73db
- https://github.com/torvalds/linux/commit/24fe1b0efad4fcdd32ce46cffeab297f22581707
- https://github.com/torvalds/linux/commit/7f0b1bf04511348995d6fce38c87c98a3b5cb781

As far as the solution goes, we could have added inline assembly issuing
DSB and ISB to each PTE write/modification method found in
include/osv/mmu-defs.hh:

- void hw_ptep_impl::write(pt_element<N> pte)
- void hw_ptep_rcu_impl::write(pt_element<N> pte)
- pt_element<N> hw_ptep::exchange(pt_element<N> newval)
- pt_element<N> hw_ptep::compare_exchange(pt_element<N> newval)

But it turns out there is a central method in the mmu-related logic
that gets called for every page manipulation case:

```
template<typename PageOp>
void map_range(uintptr_t vma_start, uintptr_t vstart, size_t size, PageOp& page_mapper, size_t slop = page_size)
```

So this patch modifies map_range() to call an arch-specific
synchronize_page_table_modifications() function that issues the DSB, ISB
sequence on aarch64 and does nothing on x64.

Special credit goes to Punit Agrawal, who discovered the real culprit
of the issue this patch fixes.

Fixes #1117

Signed-off-by: Waldemar Kozaczuk <[email protected]>
---
 arch/aarch64/arch-mmu.hh | 22 ++++++++++++++++++++++
 arch/x64/arch-mmu.hh     |  6 ++++++
 core/mmu.cc              |  7 +++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/aarch64/arch-mmu.hh b/arch/aarch64/arch-mmu.hh
index c407832c..bd9efeb1 100644
--- a/arch/aarch64/arch-mmu.hh
+++ b/arch/aarch64/arch-mmu.hh
@@ -174,5 +174,27 @@ pt_element<N> make_pte(phys addr, bool leaf, unsigned perm = perm_rwx,
     return pte;
 }
 
+// On aarch64 architecture which comes with weak memory model,
+// it is necessary to force completion of writes to page table entries
+// and flush cpu pipeline to make sure that subsequent accesses to the
+// mapped memory areas do not cause page faults during page table walk later.
+// The fragment from the "ARM v8 Architecture Reference Manual" (D5-2557) states
+// this:
+// "A translation table walk is considered to be a separate observer.
+// An explicit write to the translation tables might be observed by that separate observer for either of the following:
+// - A translation table walk caused by a different explicit write generated by the same instruction.
+// - A translation table walk caused by an explicit access generated by any instruction appearing
+//   in program order after the instruction doing the explicit write to the translation table.
+// The explicit write to the translation tables is guaranteed to be observable, to the extent
+// required by the shareability attributes, only after the execution of a DSB instruction.
+// This DSB instruction and the instruction that performed the explicit write to the translation
+// tables must have been executed by the same PE.
+// Any writes to the translation tables are not observed by the translation table walks of
+// an explicit memory access generated by a load or store that occurs in program order
+// before the instruction that performs the write to the translation tables."
+inline void synchronize_page_table_modifications() {
+    asm volatile("dsb ishst; isb;");
+}
+
 }
 #endif /* ARCH_MMU_HH */
diff --git a/arch/x64/arch-mmu.hh b/arch/x64/arch-mmu.hh
index 2e6ea13e..b974c15d 100644
--- a/arch/x64/arch-mmu.hh
+++ b/arch/x64/arch-mmu.hh
@@ -146,5 +146,11 @@ pt_element<N> make_pte(phys addr, bool leaf, unsigned perm = perm_rwx,
     return pte;
 }
 
+// On Intel x86_64 architecture which comes with strong memory model
+// it is not necessary to do anything extra after writes to page
+// tables entries to make them visible to page table walker.
+inline void synchronize_page_table_modifications() {
+}
+
 }
 #endif /* ARCH_MMU_HH_ */
diff --git a/core/mmu.cc b/core/mmu.cc
index 37a1c60b..cd2cd550 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -351,6 +351,13 @@ template<typename PageOp>
 {
     map_level<PageOp, 4> pt_mapper(vma_start, vstart, size, page_mapper, slop);
     pt_mapper(hw_ptep<4>::force(mmu::get_root_pt(vstart)));
+    // On some architectures with weak memory model it is necessary
+    // to force writes to page table entries complete and instruction pipeline
+    // flushed so that new mappings are properly visible when relevant newly mapped
+    // virtual memory areas are accessed right after this point.
+    // So let us call arch-specific function to execute the logic above if
+    // applicable for given architecture.
+    synchronize_page_table_modifications();
 }
 
 template<typename PageOp, int ParentLevel> class map_level {
-- 
2.29.2
