Again, I appreciate you continuing to make suggestions. I was also on vacation yesterday and am catching up today. I will respond to your various points in more detail as soon as possible.
On the question of nested virtualisation, the answer is yes: my dev environment is Ubuntu 20.04 under VMware Fusion, with the Intel VT-x/EPT and Code Profiling CPU options enabled in the VM settings.

On Thursday, April 15, 2021 at 11:06:37 PM UTC+1 jwkoz...@gmail.com wrote:

> Another related possibility is that we have a bug somewhere where we read
> a configuration value - some memory address, size parameter, etc. And we
> read it incorrectly - instead of say 4 bytes we read 8 bytes and we get a
> garbage value which is then used to set up things incorrectly. And we end
> up reading from or writing to some wrong places.
>
> On Thursday, April 15, 2021 at 5:04:47 PM UTC-4 Waldek Kozaczuk wrote:
>
>> Sorry I was away on a vacation so responding somewhat late.
>>
>> Yes, I did see your April 6th reply, which seems to prove that the issue
>> is unlikely to be related to the GCC version or other build specifics.
>>
>> It is still most likely a bug in OSv that is triggered by some
>> runtime-specific element that we do not know. My hunch is that it is
>> memory-related, possibly setup, given we see two types of crashes - "bsd
>> .." page fault and missing symbol page error. The second one is very weird
>> as the symbols from the kernel should always be found, so most likely
>> something is reading from the wrong part of memory.
>>
>> Having said that I do not know where to go from there especially given I
>> cannot reproduce it.
>>
>> Today I tried to reproduce on my Ubuntu 20.10 machine to no avail. I
>> looked at your email from Feb 26 where you wrote:
>>
>> "On the crash issue, I built an OSv image with the standard native
>> example: *./scripts/build image=native-example*
>> Then ran continuous launches under firecracker using a bash loop: *while
>> true; do ./scripts/firecracker.py; done*
>> After 1249 iterations, I got the same Assert at startup."
>>
>> which seems to prove that this error is not app-specific and I used
>> exactly the same commands to build and run the example:
>>
>> With the original command:
>>
>> while true; do ./scripts/firecracker.py >> /tmp/running; done
>>
>> I could run it 71700 times (over 71K) with no errors.
>>
>>
>> With verbose enabled:
>>
>> while true; do ./scripts/firecracker.py -V >> /tmp/running2; done
>>
>> I could run it 46691 times (over 46K) with no errors.
>>
>>
>> Let me share some of my host machine-specific (5-6 year old MacBook Pro)
>> info:
>>
>> *more /proc/cpuinfo*
>>
>> *processor : 0*
>> *vendor_id : GenuineIntel*
>> *cpu family : 6*
>> *model : 70*
>> *model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz*
>> *stepping : 1*
>> *microcode : 0x1c*
>> *cpu MHz : 876.525*
>> *cache size : 6144 KB*
>> *physical id : 0*
>> *siblings : 8*
>> *core id : 0*
>> *cpu cores : 4*
>> *apicid : 0*
>> *initial apicid : 0*
>> *fpu : yes*
>> *fpu_exception : yes*
>> *cpuid level : 13*
>> *wp : yes*
>> *flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
>> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
>> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
>> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2
>> x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
>> abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow
>> vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2
>> erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d*
>> *vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb
>> flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
>> shadow_vmcs*
>> *bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
>> itlb_multihit srbds*
>> *bogomips : 4589.68*
>> *clflush size : 64*
>> *cache_alignment : 64*
>> *address sizes : 39 bits physical, 48 bits virtual*
>> *power management:*
>>
>> uname -a
>> Linux wkoMacBookPro 5.8.0-49-generic #55-Ubuntu SMP Wed Mar 24 14:45:45
>> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>
>> So my machine comes with Intel CPUs.
>>
>> Is yours Intel as well or AMD? I know that Firecracker had to add some
>> special support for AMD.
>>
>> Finally, given it might be a memory-setup related issue, I wonder what
>> physical memory ranges you see.
>>
>> This patch:
>> git diff core/
>> diff --git a/core/mempool.cc b/core/mempool.cc
>> index bd8e2fcf..52f94774 100644
>> --- a/core/mempool.cc
>> +++ b/core/mempool.cc
>> @@ -1732,6 +1732,8 @@ void free_initial_memory_range(void* addr, size_t size)
>>      if (!size) {
>>          return;
>>      }
>> +    debug_early_u64("mempool: add range at: ", (u64)addr);
>> +    debug_early_u64("mempool: add range of size: ", size);
>>      auto a = reinterpret_cast<uintptr_t>(addr);
>>      auto delta = align_up(a, page_size) - a;
>>      if (delta > size) {
>>
>> produces this extra info with verbose output:
>>
>> 2021-04-15T13:40:50.493125: Start
>> 2021-04-15T13:40:50.493200: API socket-less: True
>> 2021-04-15T13:40:50.493204: Firecracker ready
>> 2021-04-15T13:40:50.493293: Configured VM
>> 2021-04-15T13:40:50.493298: Added disk
>> 2021-04-15T13:40:50.493301: Created OSv VM with cmdline: --verbose --nopci --rootfs=zfs /hello
>> {
>>    "machine-config": {
>>       "vcpu_count": 1,
>>       "mem_size_mib": 128,
>>       "ht_enabled": false
>>    },
>>    "drives": [
>>       {
>>          "drive_id": "rootfs",
>>          "path_on_host": "/home/wkozaczuk/projects/osv-master/scripts/../build/last/usr.raw",
>>          "is_root_device": false,
>>          "is_read_only": false
>>       }
>>    ],
>>    "boot-source": {
>>       "kernel_image_path": "/home/wkozaczuk/projects/osv-master/scripts/../build/last/kernel.elf",
>>       "boot_args": "--verbose --nopci --rootfs=zfs /hello"
>>    }
>> }
>> OSv v0.55.0-240-g9d1f5111
>> *mempool: add range at: ffff800000957434*
>> *mempool: add range of size: 00000000076a8bcc*
>> *mempool: add range at: ffff800000000001*
>> *mempool: add range of size: 000000000009fbff*
>> *mempool: add range at: ffff800000100000*
>> *mempool: add range of size: 0000000000100000*
>> 1 CPUs detected
>> Firmware vendor: Unknown
>> bsd: initializing - done
>> VFS: mounting ramfs at /
>> VFS: mounting devfs at /dev
>> net: initializing - done
>> Detected virtio-mmio device: (2,0)
>> virtio-blk: Add blk device instances 0 as vblk0, devsize=268435456
>> random: intel drng, rdrand registered as a source.
>> random: <Software, Yarrow> initialized
>> VFS: unmounting /dev
>> VFS: mounting zfs at /zfs
>> zfs: mounting osv/zfs from device /dev/vblk0.1
>> VFS: mounting devfs at /dev
>> VFS: mounting procfs at /proc
>> VFS: mounting sysfs at /sys
>> BSD shrinker: event handler list found: 0xffffa000009ad080
>> BSD shrinker found: 1
>> BSD shrinker: unlocked, running
>> Booted up in 47.39 ms
>> Cmdline: /hello
>> Hello from C code
>> VFS: unmounting /dev
>> VFS: unmounting /proc
>> VFS: unmounting /
>> Powering off.
>> 2021-04-15T13:40:50.495752: Waiting for firecracker process to terminate
>> 2021-04-15T13:40:50.594645: End
>>
>> Also, in all 46691 runs with verbose I would always see the same memory
>> ranges.
>>
>> I wonder what the ranges are in your case. I know we used to have some
>> edge-case bugs in memory setup around parsing e820 ranges, so maybe we
>> have a bug where we read e820 incorrectly and then create a wrong
>> mapping or read less memory than provided.
>>
>> Are the ranges the same in your case, or do they possibly fluctuate and,
>> in case of the crash, come out somewhat different?
>>
>> Otherwise, I am out of ideas.
>>
>> Waldek
>>
>> PS. Do you possibly run your tests in a nested virtualization setup or
>> possibly on EC2 i3 metal instances?
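
One way to answer the "do the ranges fluctuate" question without reading tens of thousands of boots by eye is to post-process the saved verbose output. The following is only a minimal Python sketch under stated assumptions: it assumes the `debug_early_u64` patch quoted above is applied, that each boot's output begins with a line ending in `: Start` (as in the sample output above), and that all runs were appended to one file such as `/tmp/running2`. The function names `parse_ranges`, `split_runs` and `find_deviations` are hypothetical helpers, not part of OSv or `firecracker.py`.

```python
import re

# The two message strings come from the debug_early_u64 patch quoted above.
AT_RE = re.compile(r"mempool: add range at:\s*([0-9a-fA-F]+)")
SIZE_RE = re.compile(r"mempool: add range of size:\s*([0-9a-fA-F]+)")


def parse_ranges(log_text):
    """Return the (address, size) pairs reported in one boot log."""
    ranges = []
    pending_addr = None
    for line in log_text.splitlines():
        at = AT_RE.search(line)
        if at:
            pending_addr = int(at.group(1), 16)
            continue
        size = SIZE_RE.search(line)
        if size and pending_addr is not None:
            ranges.append((pending_addr, int(size.group(1), 16)))
            pending_addr = None
    return ranges


def split_runs(text):
    """Split concatenated output into one chunk per boot.

    Assumption: every boot starts with a line ending in ': Start'.
    """
    runs, current = [], []
    for line in text.splitlines():
        if line.rstrip().endswith(": Start") and current:
            runs.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        runs.append("\n".join(current))
    return runs


def find_deviations(text):
    """Return (run_index, ranges) for each run that differs from the first run."""
    runs = [parse_ranges(chunk) for chunk in split_runs(text)]
    if not runs:
        return []
    baseline = runs[0]
    return [(i, r) for i, r in enumerate(runs) if r != baseline]
```

Calling `find_deviations(open("/tmp/running2").read())` would then list only the boots whose ranges differ from the first boot. An empty list means the ranges never changed. A crashing boot that shows up in the list would support the e820 parsing theory.
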
>>
>>
>> On Sunday, April 11, 2021 at 12:43:17 PM UTC-4 d787...@gmail.com wrote:
>>
>>> The gcc 10.2.0 build already included the missing symbol change to
>>> disable setting up the missing symbols page.
>>>
>>> Did you see my April 6th reply?
>>>
>>>
>>> On Sunday, April 11, 2021 at 4:45:59 AM UTC+1 jwkoz...@gmail.com wrote:
>>>
>>>> Interesting. What happens if you apply that other change to disable
>>>> setting up that missing symbols page and compile with gcc 10.2.0?
>>>>
>>>> On Wednesday, April 7, 2021 at 8:30:10 AM UTC-4 d787...@gmail.com
>>>> wrote:
>>>>
>>>>> Re: Your earlier point that it could be due to my gcc version
>>>>> revealing a gcc/OSv bug, I thought I would try upgrading to the latest
>>>>> gcc/g++ tools. So I'm now running:
>>>>>
>>>>> gcc (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
>>>>>
>>>>> I rebuilt my OSv image as follows:
>>>>>
>>>>> ./scripts/build clean
>>>>> ./scripts/build -j4 image=golang-example
>>>>>
>>>>> Then re-ran my test. At iterations 591, 688 and 9326, I saw the
>>>>> missing symbol failure, but so far, no sign of the Assert failure.
>>>>>
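
As a footnote to the misread hypothesis quoted near the top of this thread (reading 8 bytes where 4 were intended), here is a small, self-contained Python illustration of how such a read pulls in the neighbouring field and yields a garbage value. It is not taken from OSv code: the field layout and both values are made up for the example, and `read_u32` / `read_u64` are hypothetical helper names.

```python
import struct

# Two adjacent little-endian 32-bit fields, as they might sit in a
# boot-time configuration table. Both values are invented for illustration.
mem_size_mib = 128
neighbour_field = 0xDEADBEEF
raw = struct.pack("<II", mem_size_mib, neighbour_field)


def read_u32(buf, offset=0):
    """Read the field at its intended width of 4 bytes."""
    return struct.unpack_from("<I", buf, offset)[0]


def read_u64(buf, offset=0):
    """Read 8 bytes from the same offset, swallowing the next field too."""
    return struct.unpack_from("<Q", buf, offset)[0]


correct = read_u32(raw)
misread = read_u64(raw)

print(f"intended 4-byte read: {correct:#x}")
print(f"mistaken 8-byte read: {misread:#x}")
```

On a little-endian layout the low 32 bits of the misread value still hold the real field, so such a bug can go unnoticed whenever the neighbouring bytes happen to be zero and only show up when they are not. That would fit a failure seen once in hundreds or thousands of boots.
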