Nadav, I have sent a patch "mempool: resize l1_pool_stats once" that should address this.
David, Could you also please test this patch to see if it fixes the crashes? Waldek On Friday, April 30, 2021 at 12:27:47 PM UTC-4 Nadav Har'El wrote: > On Fri, Apr 30, 2021 at 4:28 AM Waldek Kozaczuk <[email protected]> > wrote: > >> And I was also able to replicate it on QEMU running OSv in PVH mode on >> microvm machine (similar to how firecracker runs it) again in nested >> virtualization. >> >> ./scripts/run.py -k -p qemu_microvm --nics=0 -c 1 >> >> >> I have used the same recipe - 3 terminals running similar commands to >> start C native example on ROFS image. After I connected with gdb I saw this: >> >> >> (gdb) osv info threads >> >> id address name cpu status vruntime >> total_cpu_time location >> >> 1 (0xffff800000014040) reclaimer cpu0 waiting 6.51405e-24 >> 1229 condvar::wait(lockfree::mutex*, sched::timer*) at >> core/condvar.cc:43 >> >> 2 (0xffff80000004b040) kvm_wall_clock_ cpu0 waiting 4.76921e-24 >> 8226 std::_Function_handler<void(), >> kvmclock::kvmclock()::<lambda()> >::_M_invoke(const std::_Any_data &) at >> /usr/include/c++/10/bits/std_function.h:291 >> >> 3 (0xffff800000068040) page-access-sca cpu0 queued 1.92449e-23 >> 33196 pagecache::access_scanner::run() at core/pagecache.cc:695 >> >> 4 (0xffff800000080040) page_pool_l2 cpu0 queued 1.40362e-23 >> 24193 memory::page_pool::l2::fill_thread() at core/mempool.cc:1435 >> >> 5 (0xffff80000008d040) itimer-real cpu? unstarted 0 >> 0 ?? at arch/x64/entry.S:112 >> >> 6 (0xffff800000094040) itimer-virt cpu? unstarted 0 >> 0 ?? 
at arch/x64/entry.S:112 >> >> 7 (0xffff800000150040) balancer0 cpu0 queued 4.63251e-21 >> 8154897 memory::page_pool::l1::alloc_page_local() at >> core/mempool.cc:1403 >> >> 8 (0xffff800000168040) rcu0 cpu0 waiting 2.04856e-23 >> 25420 osv::rcu::cpu_quiescent_state_thread::do_work() at >> core/rcu.cc:178 >> >> 9 (0xffff800000180040) idle0 cpu0 queued inf >> 12481 std::_Function_handler<void(), >> sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const >> std::_Any_data &) at /usr/include/c++/10/bits/std_function.h:291 >> >> 10 (0xffff800000197040) >init cpu0 waiting 7.75482e-24 >> 13338 std::_Function_handler<void(), >> sched::thread::reaper::reaper()::<lambda()> >::_M_invoke(const >> std::_Any_data &) at /usr/include/c++/10/bits/std_function.h:291 >> >> 11 (0xffff8000001b3040) thread taskq cpu0 waiting 1.16969e-23 >> 10158 synch_port::_msleep(void*, mtx*, int, char const*, int) at >> bsd/porting/synch.cc:101 >> >> 12 (0xffff8000001bf040) callout cpu0 waiting 1.15418e-23 >> 9891 _callout_thread() at bsd/porting/callout.cc:118 >> >> 13 (0xffff8000001d9040) page_pool_l1_0 cpu0 running -2.4551e-21 >> 709254 sched::thread::switch_to() at arch/x64/arch-switch.hh:108 >> >> Number of threads: 13 >> >> (gdb) bt >> >> #0 abort (fmt=fmt@entry=0x4063f860 "Assertion failed: %s (%s: %s: >> %d)\n") at runtime.cc:106 >> >> #1 0x0000000040237f29 in __assert_fail (expr=<optimized out>, >> file=<optimized out>, line=<optimized out>, func=<optimized out>) at >> runtime.cc:140 >> >> #2 0x0000000040390970 in page_fault (ef=0xffff80000094a048) at >> arch/x64/mmu.cc:37 >> >> #3 <signal handler called> >> >> #4 0x00000000403e6266 in memory::page_pool::l1::pop >> (this=0xffff8000001d7040) at core/mempool.cc:1191 >> >> #5 memory::page_pool::l1::alloc_page_local () at core/mempool.cc:1411 >> >> #6 0x00000000403e8cfa in memory::page_pool::l1::alloc_page () at >> core/mempool.cc:1174 >> >> #7 memory::untracked_alloc_page () at core/mempool.cc:1671 >> >> #8 memory::pool::add_page 
(this=0x4090c968 <memory::malloc_pools+72>) at >> core/mempool.cc:254 >> >> #9 0x00000000403e8e18 in memory::pool::alloc (this=0x4090c968 >> <memory::malloc_pools+72>) at core/mempool.cc:221 >> >> #10 0x00000000403e8f9c in std_malloc (size=8, alignment=8) at >> core/mempool.cc:1782 >> >> #11 0x00000000403e9457 in malloc (size=8) at core/mempool.cc:2001 >> >> #12 0x00000000404b0439 in operator new(unsigned long) () >> >> #13 0x0000000040354018 in >> __gnu_cxx::new_allocator<elf::object*>::allocate (__n=1, this=<optimized >> out>) at /usr/include/c++/10/ext/new_allocator.h:103 >> >> #14 std::allocator_traits<std::allocator<elf::object*> >::allocate >> (__a=..., __n=1) at /usr/include/c++/10/bits/alloc_traits.h:460 >> >> #15 std::_Vector_base<elf::object*, std::allocator<elf::object*> >> >::_M_allocate (__n=1, this=0xffff8000001dd2e0) >> >> at /usr/include/c++/10/bits/stl_vector.h:346 >> >> #16 std::vector<elf::object*, std::allocator<elf::object*> >::reserve >> (__n=1, this=0xffff8000001dd2e0) at /usr/include/c++/10/bits/vector.tcc:78 >> >> #17 elf::program::modules_get (this=0xffffa0000009adf0) at >> core/elf.cc:1810 >> >> #18 0x00000000403548d0 in >> elf::program::with_modules<elf::program::lookup_addr(void >> const*)::<lambda(const elf::program::modules_list&)> > (f=..., >> >> this=0xffffa0000009adf0) at include/osv/elf.hh:702 >> >> #19 elf::program::lookup_addr (this=0xffffa0000009adf0, >> >> addr=addr@entry=0x40237f28 <__assert_fail(char const*, char const*, >> unsigned int, char const*)+24>) at core/elf.cc:1621 >> >> #20 0x00000000404273ec in osv::lookup_name_demangled (addr=0x40237f28 >> <__assert_fail(char const*, char const*, unsigned int, char const*)+24>, >> >> buf=0xffff8000001dd7a0 "", len=1024) at core/demangle.cc:47 >> >> #21 0x0000000040237a1a in print_backtrace () at runtime.cc:86 >> >> #22 0x0000000040237edc in abort (fmt=fmt@entry=0x4063f860 "Assertion >> failed: %s (%s: %s: %d)\n") at runtime.cc:122 >> >> #23 0x0000000040237f29 in __assert_fail 
(expr=<optimized out>, >> file=<optimized out>, line=<optimized out>, func=<optimized out>) at >> runtime.cc:140 >> >> #24 0x0000000040390970 in page_fault (ef=0xffff8000001de068) at >> arch/x64/mmu.cc:37 >> >> #25 <signal handler called> >> >> #26 0x00000000403e860e in memory::page_pool::l1::push >> (page=0xffff800000a27000, this=0xffff8000001d7040) at core/mempool.cc:1198 >> >> #27 memory::page_pool::l1::refill () at core/mempool.cc:1379 >> >> #28 0x00000000403e88f5 in memory::page_pool::l1::fill_thread () at >> core/mempool.cc:1356 >> >> #29 0x00000000403eb579 in >> memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}::operator()() const >> (__closure=<optimized out>) at core/mempool.cc:1164 >> >> #30 std::__invoke_impl<void, >> memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&>(std::__invoke_other, >> memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&) (__f=...) at >> /usr/include/c++/10/bits/invoke.h:60 >> >> #31 std::__invoke_r<void, >> memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&>(memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&) >> >> (__fn=...) >> >> at /usr/include/c++/10/bits/invoke.h:153 >> >> #32 std::_Function_handler<void (), >> memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}>::_M_invoke(std::_Any_data >> >> const&) (__functor=...) 
>> >> at /usr/include/c++/10/bits/std_function.h:291 >> >> --Type <RET> for more, q to quit, c to continue without paging--c >> >> #33 0x00000000403fbc0a in sched::thread::main (this=0xffff8000001d9040) >> at core/sched.cc:1267 >> >> #34 sched::thread_main_c (t=0xffff8000001d9040) at >> arch/x64/arch-switch.hh:325 >> >> #35 0x0000000040390733 in thread_main () at arch/x64/entry.S:113 >> >> (gdb) frame 26 >> >> #26 0x00000000403e860e in memory::page_pool::l1::push >> (page=0xffff800000a27000, this=0xffff8000001d7040) at core/mempool.cc:1198 >> >> 1198 l1_pool_stats[cpu_id]._nr = nr; >> >> (gdb) p cpu_id >> >> $1 = 0 >> >> (gdb) p l1_pool_stats >> >> $2 = std::vector of length 0, capacity 0 >> >> (gdb) p l1_pool_stats[0] >> >> Cannot access memory at address 0x0 >> >> (gdb) p smp_allocator >> >> $3 = true >> The page fault happens at frame 26 - core/mempool.cc:1198. It seems that >> the l1_pool_stats global variable has address of 0. I do not know how it is >> possible and I do not see any obvious bug, yet (I have not had time to >> study it yet). In any case the l1_pool_stats is some vector that keeps >> statistics about L1 memory pool. >> > > I have a guess. > > I think that the vector is valid - it was simply never resize()ed to the > right size, so it has zero size. This is a vector, not a hash_map, so doing > l1_pool_stats[cpu_id]._nr = nr does *NOT* create the cpu_id'th element if > not already existing - it must have existed previously. > > I see the vector is resized in > > static sched::cpu::notifier _notifier([] () { > ... > l1_pool_stats.resize(sched::cpus.size()); > } > > One guess I have on a problem here is that it seems you call this > l1_pool_stats.resize(sched::cpus.size()) again and again for each of the > CPUs. Maybe it even happens in parallel on two CPUs and cause a big mess if > it happens (and explains why this problem happens rarely and not every > time)? 
Can you try doing it only once - perhaps in the same place you > decide to do smp_allocator = true, or using some other mutual exclusion > trick? > > If this doesn't help, please also take a look at the initialization order. > The l1_pool_stats initialization doesn't have a > __attribute__((init_priority(...))), and I don't know how it is ordered > compared to other initializations and to that notifier thing. Maybe you can > add printouts to debug if there's a problem there. > > >> So my hunch was to comment out 2 lines where we access this variable and >> see what happens: >> >> diff --git a/core/mempool.cc b/core/mempool.cc >> index bd8e2fcf..8f843e99 100644 >> --- a/core/mempool.cc >> +++ b/core/mempool.cc >> @@ -1188,14 +1188,14 @@ struct l1 { >> void* pop() >> { >> assert(nr); >> - l1_pool_stats[cpu_id]._nr = nr - 1; >> + //l1_pool_stats[cpu_id]._nr = nr - 1; >> return _pages[--nr]; >> } >> void push(void* page) >> { >> assert(nr < 512); >> _pages[nr++] = page; >> - l1_pool_stats[cpu_id]._nr = nr; >> + //l1_pool_stats[cpu_id]._nr = nr; >> >> } >> void* top() { return _pages[nr - 1]; } >> >> And miraculously the crash seems to go away. I ran the test with over 10K >> iterations in each terminal on QEMU and over 20K iterations on Firecracker. >> And I could not reproduce it after this patch. >> >> I wonder if you will see the same behavior. I wonder what the root cause >> of this crash is. And who knows if that is a symptom of something weirder >> and we will see other crashes even with these 2 lines commented out. >> >> Waldek >> >> PS. On a side note, the nested virtualization makes OSv boot time 7-9 >> times slower (~40ms vs 5-7ms). QEMU microvm is actually faster than >> firecracker as far as boot time goes - on average it takes ~ 22ms to boot. >> In both cases, there is no networking device which would add some time. 
>> On Thursday, April 29, 2021 at 7:04:02 PM UTC-4 Waldek Kozaczuk wrote: >> >>> I have installed VMWare Fusion Pro on my other Mac (OSX - 10.15.7, >>> 2.3GHz, 8-core, i9) with Fedora 33 server in it, enabled the 2 settings to >>> allow KVM in nested virtualization and I was able to replicate the crashes. >>> >>> So there is most likely something about nested virtualization that >>> possibly triggers some bug in OSv. >>> >>> Now that I can re-produce it, I will try to figure out what is wrong. >>> >>> Waldek >>> >>> On Wednesday, April 28, 2021 at 3:34:18 AM UTC-4 [email protected] >>> wrote: >>> >>>> That's disappointing, I'd hoped this would be reproducible on your >>>> setup. As you say, perhaps this is an indication that it is somehow >>>> connected with the fact that our environment is using nested >>>> virtualisation. >>>> >>>> The VMWare KB article you referenced isn't relevant to my setup, as it >>>> relates to MacOS 11.x and I'm running MacOS 10.15.7 (as the article >>>> indicates, there were some significant MacOS changes in 11.x which >>>> impacted >>>> VMWare's virtualisation capabilities, so I'm avoiding the upgrade for the >>>> moment). >>>> >>>> Our target environment will be using nested virtualisation (Firecracker >>>> launched OSv VMs running on an Alpine Linux VM under a VMWare ESXi host) >>>> and as indicated previously, we see this same failure in our test >>>> environment which replicates the target environment. >>>> >>>> I have not tried to reproduce this on a non-nested virtualisation >>>> setup. I'll look into this. 
>>>> >>>> VMWare Fusion Pro is available as a fully-functional 30-day free trial >>>> from https://www.vmware.com/products/fusion/fusion-evaluation.html >>>> If you do go down the route of trying it under Fusion, to enable nested >>>> virtualisation you will need to modify the following settings for the >>>> Ubuntu/Fedora VM: >>>> >>>> Settings->Processors & Memory->Advanced: >>>> Enable hypervisor applications in this virtual machine >>>> Enable code profiling application in this virtual machine (can be >>>> enabled optionally for Ubuntu, but seems to be required for Fedora) >>>> >>>> David >>>> >>>> On Wednesday, April 28, 2021 at 12:34:43 AM UTC+1 [email protected] >>>> wrote: >>>> >>>>> On Thursday, April 22, 2021 at 11:42:54 AM UTC-4 [email protected] >>>>> wrote: >>>>> >>>>>> Further to my previous reply, I've now been able to reproduce the >>>>>> original bsd: initializing Assert failure on my Fedora 33 system. >>>>>> It did require some specific actions, which seem to invoke the >>>>>> failure relatively quickly. These more closely align the firecracker >>>>>> instance launch method with our test system, where we launch a few >>>>>> different OSv VMs under firecracker (on Alpine) in quick succession. On >>>>>> our >>>>>> test system we see a much higher failure rate, as high as 1 in 20 >>>>>> launches. >>>>>> >>>>>> Using latest OSv source (pulled today), with no modifications (i.e. >>>>>> none of the new instrumenting changes you've provided in relation to >>>>>> this >>>>>> issue applied): >>>>>> >>>>>> build -j4 fs=rofs image=native-example >>>>>> firecracker.py -V (just to prove that it runs and to obtain the >>>>>> converted raw ROFS image file needed by firecracker). 
>>>>>> >>>>>> I then duplicated the kernel.elf and usr.raw files to: >>>>>> kernel1.elf >>>>>> kernel2.elf >>>>>> kernel3.elf >>>>>> usr1.raw >>>>>> usr2.raw >>>>>> usr3.raw >>>>>> >>>>>> This probably isn't necessary, but for the avoidance of doubt, I used >>>>>> this approach rather than sharing the same files across multiple >>>>>> firecracker instances. >>>>>> >>>>>> I then extracted the firecracker json config generated by the >>>>>> firecracker.py -V command into 3 separate json config files. In each >>>>>> file, >>>>>> I modified the paths for the kernel.elf and usr.raw to reference one of >>>>>> the >>>>>> numbered versions of the same file. Here's an example of the resulting >>>>>> content: >>>>>> >>>>>> { >>>>>> "machine-config": { >>>>>> "vcpu_count": 1, >>>>>> "mem_size_mib": 128, >>>>>> "ht_enabled": false >>>>>> }, >>>>>> "drives": [ >>>>>> { >>>>>> "drive_id": "rootfs", >>>>>> "path_on_host": "/home/test/usr1.raw", >>>>>> "is_root_device": false, >>>>>> "is_read_only": false >>>>>> } >>>>>> ], >>>>>> "boot-source": { >>>>>> "kernel_image_path": "/home/test/kernel1.elf", >>>>>> "boot_args": "--verbose --nopci --rootfs=rofs /hello" >>>>>> } >>>>>> } >>>>>> >>>>>> Finally, in three different terminal windows, I ran a script which >>>>>> performs the usual while forever loop launching the firecracker >>>>>> instance, >>>>>> using a different numbered json config. in each terminal window and >>>>>> logging >>>>>> to a different log file in each terminal window. 
Here's an example of >>>>>> the >>>>>> script content: >>>>>> >>>>>> *Script 1* >>>>>> #!/bin/bash >>>>>> rm test1.log; while true; do firecracker --no-api --config-file >>>>>> firecracker1.json >> test1.log ; done >>>>>> >>>>>> *Script 2* >>>>>> #!/bin/bash >>>>>> rm test2.log; while true; do firecracker --no-api --config-file >>>>>> firecracker2.json >> test2.log ; done >>>>>> >>>>>> *Script 3* >>>>>> #!/bin/bash >>>>>> rm test3.log; while true; do firecracker --no-api --config-file >>>>>> firecracker3.json >> test3.log ; done >>>>>> >>>>>> >>>>>> On my system, running these three scripts concurrently invokes the >>>>>> Assert condition within 30 seconds of the start of the 3rd script. >>>>>> Interestingly, when one fails, all 3 seem to fail, though I can't be >>>>>> sure this is happening at exactly the same time. >>>>>> >>>>>> Hopefully this provides a mechanism by which you can reproduce the >>>>>> issue. >>>>>> >>>>> >>>>> I have run the exact same test - 3 terminals each firing the command >>>>> as you have it above. And I could NOT reproduce it after 50K runs or so >>>>> in >>>>> each terminal. >>>>> >>>>> Now the difference is that my setup does NOT involve nested >>>>> virtualization. Both of my MBPros have Ubuntu and Fedora installed >>>>> natively >>>>> (dual or triple boot mode). >>>>> >>>>> I have never run firecracker nor OSv on firecracker in such a setup. >>>>> Until this point, I was not even aware that such a possibility exists on >>>>> Mac (this page - https://kb.vmware.com/s/article/80785 - seems to >>>>> point there are some limitations). Given all that it could be that this >>>>> is >>>>> a bug in OSv or some other layer involved in this nested virtualization >>>>> setup. 
>>>>> I know that Firecracker is used in production on the so-called >>>>> 'bare-metal' EC2 instances (i3, etc) that in reality involve nested >>>>> virtualization (Nitro hypervisor) but I am not sure how well firecracker >>>>> is >>>>> supported in nested virtualization setup in general besides that. >>>>> >>>>> Do your experiments target nested virtualization by design or is it >>>>> just a matter of convenience? >>>>> >>>>> Have you tried to reproduce this problem in not nested virtualization >>>>> setup? >>>>> >>>>> Waldek >>>>> >>>>> PS. I do not have VMware Fusion handy on any of my MBPros machines but >>>>> I guess I could install it and see if I can reproduce your issue. Is >>>>> VMware >>>>> Fusion free? >>>>> >>>>> >>>>> >>>>>> On Wednesday, April 21, 2021 at 3:15:08 PM UTC+1 David Smith wrote: >>>>>> >>>>>>> Some info. for my host (Intel CPU based) Mac system (on which I'm >>>>>>> running MacOS 10.15.7): >>>>>>> >>>>>>> Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST >>>>>>> 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64 >>>>>>> >>>>>>> machdep.cpu.max_basic: 27 >>>>>>> machdep.cpu.max_ext: 2147483656 >>>>>>> machdep.cpu.vendor: GenuineIntel >>>>>>> machdep.cpu.brand_string: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz >>>>>>> machdep.cpu.family: 6 >>>>>>> machdep.cpu.model: 126 >>>>>>> machdep.cpu.extmodel: 7 >>>>>>> machdep.cpu.extfamily: 0 >>>>>>> machdep.cpu.stepping: 5 >>>>>>> machdep.cpu.feature_bits: 9221959987971750911 >>>>>>> machdep.cpu.leaf7_feature_bits: 4072613871 >>>>>>> 1077960526 >>>>>>> machdep.cpu.leaf7_feature_bits_edx: 3154117648 >>>>>>> machdep.cpu.extfeature_bits: 1241984796928 >>>>>>> machdep.cpu.signature: 460517 >>>>>>> machdep.cpu.brand: 0 >>>>>>> machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP >>>>>>> MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM >>>>>>> PBE >>>>>>> SSE3 PCLMULQDQ DTES64 MON DSCPL VMX 
EST TM2 SSSE3 FMA CX16 TPR PDCM >>>>>>> SSE4.1 >>>>>>> SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR >>>>>>> AVX1.0 >>>>>>> RDRAND F16C >>>>>>> machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 >>>>>>> FDPEO SMEP BMI2 ERMS INVPCID FPU_CSDS AVX512F AVX512DQ RDSEED ADX SMAP >>>>>>> AVX512IFMA CLFSOPT IPT AVX512CD SHA AVX512BW AVX512VL AVX512VBMI UMIP >>>>>>> PKU >>>>>>> GFNI VAES VPCLMULQDQ AVX512VNNI AVX512BITALG AVX512VPOPCNTDQ RDPID >>>>>>> SGXLC >>>>>>> FSREPMOV MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD >>>>>>> machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT >>>>>>> PREFETCHW RDTSCP TSCI >>>>>>> machdep.cpu.logical_per_package: 16 >>>>>>> machdep.cpu.cores_per_package: 8 >>>>>>> machdep.cpu.microcode_version: 160 >>>>>>> machdep.cpu.processor_flag: 7 >>>>>>> machdep.cpu.mwait.linesize_min: 64 >>>>>>> machdep.cpu.mwait.linesize_max: 64 >>>>>>> machdep.cpu.mwait.extensions: 3 >>>>>>> machdep.cpu.mwait.sub_Cstates: 286396448 >>>>>>> machdep.cpu.thermal.sensor: 1 >>>>>>> machdep.cpu.thermal.dynamic_acceleration: 1 >>>>>>> machdep.cpu.thermal.invariant_APIC_timer: 1 >>>>>>> machdep.cpu.thermal.thresholds: 2 >>>>>>> machdep.cpu.thermal.ACNT_MCNT: 1 >>>>>>> machdep.cpu.thermal.core_power_limits: 1 >>>>>>> machdep.cpu.thermal.fine_grain_clock_mod: 1 >>>>>>> machdep.cpu.thermal.package_thermal_intr: 1 >>>>>>> machdep.cpu.thermal.hardware_feedback: 0 >>>>>>> machdep.cpu.thermal.energy_policy: 1 >>>>>>> machdep.cpu.xsave.extended_state: 743 2688 2696 0 >>>>>>> machdep.cpu.xsave.extended_state1: 15 2432 8448 0 >>>>>>> machdep.cpu.arch_perf.version: 5 >>>>>>> machdep.cpu.arch_perf.number: 8 >>>>>>> machdep.cpu.arch_perf.width: 48 >>>>>>> machdep.cpu.arch_perf.events_number: 8 >>>>>>> machdep.cpu.arch_perf.events: 0 >>>>>>> machdep.cpu.arch_perf.fixed_number: 4 >>>>>>> machdep.cpu.arch_perf.fixed_width: 48 >>>>>>> machdep.cpu.cache.linesize: 64 >>>>>>> machdep.cpu.cache.L2_associativity: 8 >>>>>>> machdep.cpu.cache.size: 256 
>>>>>>> machdep.cpu.address_bits.physical: 39 >>>>>>> machdep.cpu.address_bits.virtual: 48 >>>>>>> machdep.cpu.core_count: 4 >>>>>>> machdep.cpu.thread_count: 8 >>>>>>> machdep.cpu.tsc_ccc.numerator: 104 >>>>>>> machdep.cpu.tsc_ccc.denominator: 2 >>>>>>> >>>>>>> I would add that this is my development system. We have a test >>>>>>> system which utilises VMWare ESXi and runs the OSv VM under firecracker >>>>>>> on >>>>>>> an Alpine Linux VMWare VM (so nested virtualisation again) and we also >>>>>>> see >>>>>>> the "bsd initializing" crash on startup on this configuration >>>>>>> occasionally. >>>>>>> >>>>>>> I added the patch to dump the physical memory region information as >>>>>>> you requested. Built again with build fs=rofs image=native-example >>>>>>> then tested with while true; do ./scripts/firecracker.py >> >>>>>>> /tmp/running; done (my firecracker.py script is slightly modified >>>>>>> to add in the '--power-off-on-abort' cmd line argument, which I think >>>>>>> is >>>>>>> not important, but I mention it just in case). >>>>>>> On the 996'th iteration, I hit the missing symbol failure. Below is >>>>>>> the output for the last successful invocation followed by the failed >>>>>>> invocation. The physical memory information dumped is consistent across >>>>>>> all >>>>>>> 996 invocations. 
>>>>>>> >>>>>>> OSv v0.55.0-237-g7cd33926 >>>>>>> mempool: add range at: ffff800000956434 >>>>>>> mempool: add range of size: 00000000076a9bcc >>>>>>> mempool: add range at: ffff800000000001 >>>>>>> mempool: add range of size: 000000000009fbff >>>>>>> mempool: add range at: ffff800000100000 >>>>>>> mempool: add range of size: 0000000000100000 >>>>>>> 1 CPUs detected >>>>>>> Firmware vendor: Unknown >>>>>>> >>>>>>> *percpu_l1 = 0x175040 >>>>>>> bsd: initializing - done >>>>>>> VFS: mounting ramfs at / >>>>>>> VFS: mounting devfs at /dev >>>>>>> net: initializing - done >>>>>>> Detected virtio-mmio device: (2,0) >>>>>>> virtio-blk: Add blk device instances 0 as vblk0, devsize=6474752 >>>>>>> random: intel drng, rdrand registered as a source. >>>>>>> random: <Software, Yarrow> initialized >>>>>>> VFS: unmounting /dev >>>>>>> VFS: mounting rofs at /rofs >>>>>>> VFS: mounting devfs at /dev >>>>>>> VFS: mounting procfs at /proc >>>>>>> VFS: mounting sysfs at /sys >>>>>>> VFS: mounting ramfs at /tmp >>>>>>> Booted up in 121.29 ms >>>>>>> Cmdline: /hello >>>>>>> Hello from C code >>>>>>> VFS: unmounting /dev >>>>>>> VFS: unmounting /proc >>>>>>> VFS: unmounting / >>>>>>> ROFS: spent 1.78 ms reading from disk >>>>>>> ROFS: read 64 512-byte blocks from disk >>>>>>> ROFS: allocated 61 512-byte blocks of cache memory >>>>>>> ROFS: hit ratio is 93.88% >>>>>>> Powering off. 
>>>>>>> 2021-04-21T13:22:17.382768: Start >>>>>>> 2021-04-21T13:22:17.382851: API socket-less: True >>>>>>> 2021-04-21T13:22:17.382855: Firecracker ready >>>>>>> 2021-04-21T13:22:17.382940: Configured VM >>>>>>> 2021-04-21T13:22:17.382945: Added disk >>>>>>> 2021-04-21T13:22:17.382948: Created OSv VM with cmdline: >>>>>>> --power-off-on-abort --verbose --nopci --rootfs=rofs /hello >>>>>>> { >>>>>>> "machine-config": { >>>>>>> "vcpu_count": 1, >>>>>>> "mem_size_mib": 128, >>>>>>> "ht_enabled": false >>>>>>> }, >>>>>>> "drives": [ >>>>>>> { >>>>>>> "drive_id": "rootfs", >>>>>>> "path_on_host": >>>>>>> "/home/david-griffin/projects/firecracker/osv/scripts/../build/last/usr.raw", >>>>>>> "is_root_device": false, >>>>>>> "is_read_only": false >>>>>>> } >>>>>>> ], >>>>>>> "boot-source": { >>>>>>> "kernel_image_path": >>>>>>> "/home/david-griffin/projects/firecracker/osv/scripts/../build/last/kernel.elf", >>>>>>> "boot_args": "--power-off-on-abort --verbose --nopci >>>>>>> --rootfs=rofs /hello" >>>>>>> } >>>>>>> } >>>>>>> 2021-04-21T13:22:17.385972: Waiting for firecracker process to >>>>>>> terminate >>>>>>> 2021-04-21T13:22:17.586150: End >>>>>>> OSv v0.55.0-237-g7cd33926 >>>>>>> mempool: add range at: ffff800000956434 >>>>>>> mempool: add range of size: 00000000076a9bcc >>>>>>> mempool: add range at: ffff800000000001 >>>>>>> mempool: add range of size: 000000000009fbff >>>>>>> mempool: add range at: ffff800000100000 >>>>>>> mempool: add range of size: 0000000000100000 >>>>>>> 1 CPUs detected >>>>>>> Firmware vendor: Unknown >>>>>>> >>>>>>> *percpu_l1 = 0x175040 >>>>>>> addr: 0x0000000000000008 mmu::page_size: 4096 elf:mspa: >>>>>>> 0x0000000000000000 >>>>>>> trying to execute or access missing symbol >>>>>>> [backtrace] >>>>>>> addr: 0x0000000000000008 mmu::page_size: 4096 elf:mspa: >>>>>>> 0x0000000000000000 >>>>>>> >>>>>>> Prior to adding the physical memory info. 
patch, I setup a Fedora 33 >>>>>>> system (again under my VMWare Fusion) environment, installed the same >>>>>>> OSv >>>>>>> sources, built the same image with the same commands and ran the same >>>>>>> test. >>>>>>> I was unable to reproduce either startup failure over two attempts each >>>>>>> running approx. 50K iterations. >>>>>>> Fedora 33 is running a later kernel (5.11.11) compared to both my >>>>>>> Ubuntu 20.04 (5.8.0) and Alpine (5.10.13). Could there be some subtle >>>>>>> kernel interaction? >>>>>>> >>>>>>> On Tuesday, April 20, 2021 at 5:08:23 PM UTC+1 David Smith wrote: >>>>>>> >>>>>>>> Again, I appreciate you continuing to make suggestions. I was also >>>>>>>> on vacation yesterday and catching up today. I will respond to your >>>>>>>> various >>>>>>>> points in more detail as soon as possible. >>>>>>>> >>>>>>>> On the question of nested virtualisation, the answer is yes, my dev >>>>>>>> environment is Ubuntu 20.04 under VMWare Fusion, with Intel VT-x/EPT >>>>>>>> and >>>>>>>> Code Profiling CPU options enabled in the VM settings. >>>>>>>> >>>>>>>> On Thursday, April 15, 2021 at 11:06:37 PM UTC+1 [email protected] >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Another related possibility is that we have a bug somewhere where >>>>>>>>> we read a configuration value - some memory address, size parameter, >>>>>>>>> etc. >>>>>>>>> And we read it incorrectly - instead of say 4 bytes we read 8 bytes >>>>>>>>> and we >>>>>>>>> get garbage value which is then used to set up things incorrectly. >>>>>>>>> And we >>>>>>>>> end up reading from or writing to some wrong places. >>>>>>>>> >>>>>>>>> On Thursday, April 15, 2021 at 5:04:47 PM UTC-4 Waldek Kozaczuk >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Sorry I was away on a vacation so responding somewhat late. >>>>>>>>>> >>>>>>>>>> Yes, I did see your April 6th reply, which seems to prove that >>>>>>>>>> the issue is unlikely related to the GCC version nor other build >>>>>>>>>> specifics. 
>>>>>>>>>> >>>>>>>>>> It is still most likely a bug in OSv that is triggered by some >>>>>>>>>> runtime-specific element that we do not know. My hunch is that it is >>>>>>>>>> memory-related, possibly setup, given we see two types of crashes - >>>>>>>>>> "bsd >>>>>>>>>> .." page fault and missing symbol page error. The second one is very >>>>>>>>>> weird >>>>>>>>>> as the symbols from the kernel should always be found, so most >>>>>>>>>> likely >>>>>>>>>> something is reading from the wrong part of memory. >>>>>>>>>> >>>>>>>>>> Having said that I do not know where to go from there especially >>>>>>>>>> given I cannot reproduce it. >>>>>>>>>> >>>>>>>>>> Today I tried to re-produce on my Ubuntu 20.10 machine to no >>>>>>>>>> avail. I looked at your email from Feb 26 where you wrote: >>>>>>>>>> >>>>>>>>>> "On the crash issue, I built an OSv image with the standard >>>>>>>>>> native example: *./scripts/build image=native-example* >>>>>>>>>> Then ran continuous launches under firecracker using a bash loop: >>>>>>>>>> *while >>>>>>>>>> true; do ./scripts/firecracker.py; done* >>>>>>>>>> After 1249 iterations, I got the same Assert at startup." >>>>>>>>>> >>>>>>>>>> which seems to prove that this error is not app-specific and I >>>>>>>>>> used exactly some commands to build and run the example: >>>>>>>>>> >>>>>>>>>> With the original command: >>>>>>>>>> >>>>>>>>>> while true; do ./scripts/firecracker.py >> /tmp/running; done >>>>>>>>>> >>>>>>>>>> I could run it 71700 (over 71K) with no errors. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> With verbose enabled: >>>>>>>>>> >>>>>>>>>> while true; do ./scripts/firecracker.py -V >> /tmp/running2; done >>>>>>>>>> >>>>>>>>>> I could run it 46691(over 46K) with no errors. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Let me share some of my host machine-specific (5-6 year old >>>>>>>>>> MacBook Pro) info: >>>>>>>>>> >>>>>>>>>> *more /proc/cpuinfo * >>>>>>>>>> >>>>>>>>>> *processor : 0* >>>>>>>>>> >>>>>>>>>> *vendor_id : GenuineIntel* >>>>>>>>>> >>>>>>>>>> *cpu family : 6* >>>>>>>>>> >>>>>>>>>> *model : 70* >>>>>>>>>> >>>>>>>>>> *model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz* >>>>>>>>>> >>>>>>>>>> *stepping : 1* >>>>>>>>>> >>>>>>>>>> *microcode : 0x1c* >>>>>>>>>> >>>>>>>>>> *cpu MHz : 876.525* >>>>>>>>>> >>>>>>>>>> *cache size : 6144 KB* >>>>>>>>>> >>>>>>>>>> *physical id : 0* >>>>>>>>>> >>>>>>>>>> *siblings : 8* >>>>>>>>>> >>>>>>>>>> *core id : 0* >>>>>>>>>> >>>>>>>>>> *cpu cores : 4* >>>>>>>>>> >>>>>>>>>> *apicid : 0* >>>>>>>>>> >>>>>>>>>> *initial apicid : 0* >>>>>>>>>> >>>>>>>>>> *fpu : yes* >>>>>>>>>> >>>>>>>>>> *fpu_exception : yes* >>>>>>>>>> >>>>>>>>>> *cpuid level : 13* >>>>>>>>>> >>>>>>>>>> *wp : yes* >>>>>>>>>> >>>>>>>>>> *flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca >>>>>>>>>> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe >>>>>>>>>> syscall nx >>>>>>>>>> pdpe1* >>>>>>>>>> >>>>>>>>>> *gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl >>>>>>>>>> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor >>>>>>>>>> ds_cpl >>>>>>>>>> vmx smx est tm2 s* >>>>>>>>>> >>>>>>>>>> *sse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe >>>>>>>>>> popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm >>>>>>>>>> cpuid_fault >>>>>>>>>> epb invpcid_singl* >>>>>>>>>> >>>>>>>>>> *e pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid >>>>>>>>>> ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt >>>>>>>>>> dtherm >>>>>>>>>> ida arat pln * >>>>>>>>>> >>>>>>>>>> *pts md_clear flush_l1d* >>>>>>>>>> >>>>>>>>>> *vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad >>>>>>>>>> ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid >>>>>>>>>> 
>>>>>>>>>> unrestricted_guest ple shadow_vmcs
>>>>>>>>>>
>>>>>>>>>> bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
>>>>>>>>>> bogomips        : 4589.68
>>>>>>>>>> clflush size    : 64
>>>>>>>>>> cache_alignment : 64
>>>>>>>>>> address sizes   : 39 bits physical, 48 bits virtual
>>>>>>>>>> power management:
>>>>>>>>>>
>>>>>>>>>> uname -a
>>>>>>>>>> Linux wkoMacBookPro 5.8.0-49-generic #55-Ubuntu SMP Wed Mar 24 14:45:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>>>
>>>>>>>>>> So my machine comes with an Intel CPU.
>>>>>>>>>>
>>>>>>>>>> Is yours Intel as well, or AMD? I know that Firecracker had to add some special support for AMD.
>>>>>>>>>>
>>>>>>>>>> Finally, given it might be a memory-setup-related issue, I wonder what physical memory ranges you see.
>>>>>>>>>>
>>>>>>>>>> This patch:
>>>>>>>>>>
>>>>>>>>>> git diff core/
>>>>>>>>>> diff --git a/core/mempool.cc b/core/mempool.cc
>>>>>>>>>> index bd8e2fcf..52f94774 100644
>>>>>>>>>> --- a/core/mempool.cc
>>>>>>>>>> +++ b/core/mempool.cc
>>>>>>>>>> @@ -1732,6 +1732,8 @@ void free_initial_memory_range(void* addr, size_t size)
>>>>>>>>>>      if (!size) {
>>>>>>>>>>          return;
>>>>>>>>>>      }
>>>>>>>>>> +    debug_early_u64("mempool: add range at: ", (u64)addr);
>>>>>>>>>> +    debug_early_u64("mempool: add range of size: ", size);
>>>>>>>>>>      auto a = reinterpret_cast<uintptr_t>(addr);
>>>>>>>>>>      auto delta = align_up(a, page_size) - a;
>>>>>>>>>>      if (delta > size) {
>>>>>>>>>>
>>>>>>>>>> produces this extra info with verbose output:
>>>>>>>>>>
>>>>>>>>>> 2021-04-15T13:40:50.493125: Start
>>>>>>>>>> 2021-04-15T13:40:50.493200: API socket-less: True
>>>>>>>>>> 2021-04-15T13:40:50.493204: Firecracker ready
>>>>>>>>>> 2021-04-15T13:40:50.493293: Configured VM
>>>>>>>>>> 2021-04-15T13:40:50.493298: Added disk
>>>>>>>>>> 2021-04-15T13:40:50.493301: Created OSv VM with cmdline: --verbose --nopci --rootfs=zfs /hello
>>>>>>>>>> {
>>>>>>>>>>   "machine-config": {
>>>>>>>>>>     "vcpu_count": 1,
>>>>>>>>>>     "mem_size_mib": 128,
>>>>>>>>>>     "ht_enabled": false
>>>>>>>>>>   },
>>>>>>>>>>   "drives": [
>>>>>>>>>>     {
>>>>>>>>>>       "drive_id": "rootfs",
>>>>>>>>>>       "path_on_host": "/home/wkozaczuk/projects/osv-master/scripts/../build/last/usr.raw",
>>>>>>>>>>       "is_root_device": false,
>>>>>>>>>>       "is_read_only": false
>>>>>>>>>>     }
>>>>>>>>>>   ],
>>>>>>>>>>   "boot-source": {
>>>>>>>>>>     "kernel_image_path": "/home/wkozaczuk/projects/osv-master/scripts/../build/last/kernel.elf",
>>>>>>>>>>     "boot_args": "--verbose --nopci --rootfs=zfs /hello"
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>> OSv v0.55.0-240-g9d1f5111
>>>>>>>>>> mempool: add range at: ffff800000957434
>>>>>>>>>> mempool: add range of size: 00000000076a8bcc
>>>>>>>>>> mempool: add range at: ffff800000000001
>>>>>>>>>> mempool: add range of size: 000000000009fbff
>>>>>>>>>> mempool: add range at: ffff800000100000
>>>>>>>>>> mempool: add range of size: 0000000000100000
>>>>>>>>>> 1 CPUs detected
>>>>>>>>>> Firmware vendor: Unknown
>>>>>>>>>> bsd: initializing - done
>>>>>>>>>> VFS: mounting ramfs at /
>>>>>>>>>> VFS: mounting devfs at /dev
>>>>>>>>>> net: initializing - done
>>>>>>>>>> Detected virtio-mmio device: (2,0)
>>>>>>>>>> virtio-blk: Add blk device instances 0 as vblk0, devsize=268435456
>>>>>>>>>> random: intel drng, rdrand registered as a source.
>>>>>>>>>> random: <Software, Yarrow> initialized
>>>>>>>>>> VFS: unmounting /dev
>>>>>>>>>> VFS: mounting zfs at /zfs
>>>>>>>>>> zfs: mounting osv/zfs from device /dev/vblk0.1
>>>>>>>>>> VFS: mounting devfs at /dev
>>>>>>>>>> VFS: mounting procfs at /proc
>>>>>>>>>> VFS: mounting sysfs at /sys
>>>>>>>>>> BSD shrinker: event handler list found: 0xffffa000009ad080
>>>>>>>>>> BSD shrinker found: 1
>>>>>>>>>> BSD shrinker: unlocked, running
>>>>>>>>>> Booted up in 47.39 ms
>>>>>>>>>> Cmdline: /hello
>>>>>>>>>> Hello from C code
>>>>>>>>>> VFS: unmounting /dev
>>>>>>>>>> VFS: unmounting /proc
>>>>>>>>>> VFS: unmounting /
>>>>>>>>>> Powering off.
>>>>>>>>>> 2021-04-15T13:40:50.495752: Waiting for firecracker process to terminate
>>>>>>>>>> 2021-04-15T13:40:50.594645: End
>>>>>>>>>>
>>>>>>>>>> Also, in all 46691 runs with verbose output I would always see the same memory ranges.
>>>>>>>>>>
>>>>>>>>>> I wonder what the ranges are in your case. I know we used to have some edge-case bugs in the memory setup around parsing e820 ranges.
>>>>>>>>>>
>>>>>>>>>> Maybe we have a bug where we read the e820 map incorrectly and then create a wrong mapping, or read less memory than provided.
>>>>>>>>>>
>>>>>>>>>> Are the ranges the same in your case, or do they fluctuate and, in the case of the crash, look somewhat different?
>>>>>>>>>>
>>>>>>>>>> Otherwise, I am out of any more ideas.
>>>>>>>>>>
>>>>>>>>>> Waldek
>>>>>>>>>>
>>>>>>>>>> PS. Do you possibly run your tests in a nested virtualization setup, or possibly on EC2 i3.metal instances?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sunday, April 11, 2021 at 12:43:17 PM UTC-4 [email protected] wrote:
>>>>>>>>>>
>>>>>>>>>>> The gcc 10.2.0 build already included the missing-symbol change to disable setting up the missing symbols page.
>>>>>>>>>>>
>>>>>>>>>>> Did you see my April 6th reply?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sunday, April 11, 2021 at 4:45:59 AM UTC+1 [email protected] wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Interesting. What happens if you apply that other change to disable setting up that missing symbols page and compile with gcc 10.2.0?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, April 7, 2021 at 8:30:10 AM UTC-4 [email protected] wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Re: your earlier point that it could be due to my gcc version revealing a gcc/OSv bug, I thought I would try upgrading to the latest gcc/g++ tools. So I'm now running:
>>>>>>>>>>>>>
>>>>>>>>>>>>> gcc (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> I rebuilt my OSv image as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./scripts/build clean
>>>>>>>>>>>>> ./scripts/build -j4 image=golang-example
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then re-ran my test. At iterations 591, 688 and 9326, I saw the missing-symbol failure but, so far, no sign of the assert failure.
>>>>>>>>>>>>>
>> --
>> You received this message because you are subscribed to the Google Groups "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/f820b2d4-a641-4c39-9c9b-2e8b8a0a0d05n%40googlegroups.com.
--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/ae40d15f-3e16-4818-9d87-8715dcaed238n%40googlegroups.com.
