Some info. for my host (Intel CPU based) Mac system (on which I'm running 
MacOS 10.15.7):

Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; 
root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64

machdep.cpu.max_basic: 27
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
machdep.cpu.family: 6
machdep.cpu.model: 126
machdep.cpu.extmodel: 7
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 5
machdep.cpu.feature_bits: 9221959987971750911
machdep.cpu.leaf7_feature_bits: 4072613871 1077960526
machdep.cpu.leaf7_feature_bits_edx: 3154117648
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.signature: 460517
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE 
MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 
PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 
SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 
RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 FDPEO 
SMEP BMI2 ERMS INVPCID FPU_CSDS AVX512F AVX512DQ RDSEED ADX SMAP AVX512IFMA 
CLFSOPT IPT AVX512CD SHA AVX512BW AVX512VL AVX512VBMI UMIP PKU GFNI VAES 
VPCLMULQDQ AVX512VNNI AVX512BITALG AVX512VPOPCNTDQ RDPID SGXLC FSREPMOV 
MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW 
RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 160
machdep.cpu.processor_flag: 7
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 286396448
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 743 2688 2696 0
machdep.cpu.xsave.extended_state1: 15 2432 8448 0
machdep.cpu.arch_perf.version: 5
machdep.cpu.arch_perf.number: 8
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 8
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 4
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.size: 256
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.core_count: 4
machdep.cpu.thread_count: 8
machdep.cpu.tsc_ccc.numerator: 104
machdep.cpu.tsc_ccc.denominator: 2

I would add that this is my development system. We have a test system which 
utilises VMWare ESXi and runs the OSv VM under firecracker on an Alpine 
Linux VMWare VM (so nested virtualisation again) and we also see the "bsd 
initializing" crash on startup on this configuration occasionally.

I added the patch to dump the physical memory region information as you 
requested. Built again with build fs=rofs image=native-example
then tested with  while true; do ./scripts/firecracker.py >> /tmp/running; 
done (my firecracker.py script is slightly modified to add in the 
'--power-off-on-abort' cmd line argument, which I think is not important, 
but I mention it just in case).
On the 996'th iteration, I hit the missing symbol failure. Below is the 
output for the last successful invocation followed by the failed 
invocation. The physical memory information dumped is consistent across all 
996 invocations.

OSv v0.55.0-237-g7cd33926
mempool: add range at:      ffff800000956434
mempool: add range of size: 00000000076a9bcc
mempool: add range at:      ffff800000000001
mempool: add range of size: 000000000009fbff
mempool: add range at:      ffff800000100000
mempool: add range of size: 0000000000100000
1 CPUs detected
Firmware vendor: Unknown

*percpu_l1 = 0x175040
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
Detected virtio-mmio device: (2,0)
virtio-blk: Add blk device instances 0 as vblk0, devsize=6474752
random: intel drng, rdrand registered as a source.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
VFS: mounting rofs at /rofs
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
VFS: mounting ramfs at /tmp
Booted up in 121.29 ms
Cmdline: /hello 
Hello from C code
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
ROFS: spent 1.78 ms reading from disk
ROFS: read 64 512-byte blocks from disk
ROFS: allocated 61 512-byte blocks of cache memory
ROFS: hit ratio is 93.88%
Powering off.
2021-04-21T13:22:17.382768: Start
2021-04-21T13:22:17.382851: API socket-less: True
2021-04-21T13:22:17.382855: Firecracker ready
2021-04-21T13:22:17.382940: Configured VM
2021-04-21T13:22:17.382945: Added disk
2021-04-21T13:22:17.382948: Created OSv VM with cmdline: 
--power-off-on-abort --verbose --nopci --rootfs=rofs /hello
{
   "machine-config": {
      "vcpu_count": 1,
      "mem_size_mib": 128,
      "ht_enabled": false
   },
   "drives": [
      {
         "drive_id": "rootfs",
         "path_on_host": 
"/home/david-griffin/projects/firecracker/osv/scripts/../build/last/usr.raw",
         "is_root_device": false,
         "is_read_only": false
      }
   ],
   "boot-source": {
      "kernel_image_path": 
"/home/david-griffin/projects/firecracker/osv/scripts/../build/last/kernel.elf",
      "boot_args": "--power-off-on-abort --verbose --nopci --rootfs=rofs 
/hello"
   }
}
2021-04-21T13:22:17.385972: Waiting for firecracker process to terminate
2021-04-21T13:22:17.586150: End
OSv v0.55.0-237-g7cd33926
mempool: add range at:      ffff800000956434
mempool: add range of size: 00000000076a9bcc
mempool: add range at:      ffff800000000001
mempool: add range of size: 000000000009fbff
mempool: add range at:      ffff800000100000
mempool: add range of size: 0000000000100000
1 CPUs detected
Firmware vendor: Unknown

*percpu_l1 = 0x175040
addr: 0x0000000000000008    mmu::page_size: 4096     elf:mspa: 
0x0000000000000000
trying to execute or access missing symbol
[backtrace]
addr: 0x0000000000000008    mmu::page_size: 4096     elf:mspa: 
0x0000000000000000

Prior to adding the physical memory info. patch, I setup a Fedora 33 system 
(again under my VMWare Fusion) environment, installed the same OSv sources, 
built the same image with the same commands and ran the same test. I was 
unable to reproduce either startup failure over two attempts each running 
approx. 50K iterations.
Fedora 33 is running a later kernel (5.11.11) compared to both my Ubuntu 
20.04 (5.8.0) and Alpine (5.10.13). Could there be some subtle kernel 
interaction? 

On Tuesday, April 20, 2021 at 5:08:23 PM UTC+1 David Smith wrote:

> Again, I appreciate you continuing to make suggestions. I was also on 
> vacation yesterday and catching up today. I will respond to your various 
> points in more detail as soon as possible.
>
> On the question of nested virtualisation, the answer is yes, my dev 
> environment is Ubuntu 20.04 under VMWare Fusion, with Intel VT-x/EPT and 
> Code Profiling CPU options enabled in the VM settings.
>
> On Thursday, April 15, 2021 at 11:06:37 PM UTC+1 jwkoz...@gmail.com wrote:
>
>> Another related possibility is that we have a bug somewhere where we read 
>> a configuration value - some memory address, size parameter, etc. And we 
>> read it incorrectly - instead of say 4 bytes we read 8 bytes and we get 
>> garbage value which is then used to set up things incorrectly. And we end 
>> up reading from or writing to some wrong places.
>>
>> On Thursday, April 15, 2021 at 5:04:47 PM UTC-4 Waldek Kozaczuk wrote:
>>
>>> Sorry I was away on a vacation so responding somewhat late.
>>>
>>> Yes, I did see your April 6th reply, which seems to prove that the issue 
>>> is unlikely related to the GCC version nor other build specifics. 
>>>
>>> It is still most likely a bug in OSv that is triggered by some 
>>> runtime-specific element that we do not know. My hunch is that it is 
>>> memory-related, possibly setup, given we see two types of crashes - "bsd 
>>> .." page fault and missing symbol page error. The second one is very weird 
>>> as the symbols from the kernel should always be found, so most likely 
>>> something is reading from the wrong part of memory.
>>>
>>> Having said that I do not know where to go from there especially given I 
>>> cannot reproduce it.
>>>
>>> Today I tried to re-produce on my Ubuntu 20.10 machine to no avail. I 
>>> looked at your email from Feb 26 where you wrote:
>>>
>>> "On the crash issue, I built an OSv image with the standard native 
>>> example: *./scripts/build image=native-example*
>>> Then ran continuous launches under firecracker using a bash loop: *while 
>>> true; do ./scripts/firecracker.py; done*
>>> After 1249 iterations, I got the same Assert at startup."
>>>
>>> which seems to prove that this error is not app-specific and I used 
>>> exactly some commands to build and run the example:
>>>
>>> With the original command:
>>>
>>> while true; do ./scripts/firecracker.py >> /tmp/running; done
>>>
>>> I could run it 71700 (over 71K) with no errors. 
>>>
>>>
>>> With verbose enabled:
>>>
>>> while true; do ./scripts/firecracker.py -V >> /tmp/running2; done
>>>
>>> I could run it 46691(over 46K) with no errors.
>>>
>>>
>>> Let me share some of my host machine-specific (5-6 year old MacBook Pro) 
>>> info:
>>>
>>> *more /proc/cpuinfo *
>>>
>>> *processor : 0*
>>>
>>> *vendor_id : GenuineIntel*
>>>
>>> *cpu family : 6*
>>>
>>> *model : 70*
>>>
>>> *model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz*
>>>
>>> *stepping : 1*
>>>
>>> *microcode : 0x1c*
>>>
>>> *cpu MHz : 876.525*
>>>
>>> *cache size : 6144 KB*
>>>
>>> *physical id : 0*
>>>
>>> *siblings : 8*
>>>
>>> *core id : 0*
>>>
>>> *cpu cores : 4*
>>>
>>> *apicid : 0*
>>>
>>> *initial apicid : 0*
>>>
>>> *fpu : yes*
>>>
>>> *fpu_exception : yes*
>>>
>>> *cpuid level : 13*
>>>
>>> *wp : yes*
>>>
>>> *flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
>>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1*
>>>
>>> *gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
>>> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx 
>>> est tm2 s*
>>>
>>> *sse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt 
>>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb 
>>> invpcid_singl*
>>>
>>> *e pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad 
>>> fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida 
>>> arat pln *
>>>
>>> *pts md_clear flush_l1d*
>>>
>>> *vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb 
>>> flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple 
>>> shadow_vm*
>>>
>>> *cs*
>>>
>>> *bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds 
>>> swapgs itlb_multihit srbds*
>>>
>>> *bogomips : 4589.68*
>>>
>>> *clflush size : 64*
>>>
>>> *cache_alignment : 64*
>>>
>>> *address sizes : 39 bits physical, 48 bits virtual*
>>>
>>> *power management:*
>>>
>>> uname -a
>>> Linux wkoMacBookPro 5.8.0-49-generic #55-Ubuntu SMP Wed Mar 24 14:45:45 
>>> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> So my machine comes with Intel CPUs.
>>>
>>> Is yours Intel as well or AMD? I know that Firecracker had to add some 
>>> special support for AMD.
>>>
>>> Finally given it might be memory-setup related issue I wonder what the 
>>> physical memory ranges you see.
>>>
>>> This patch:
>>> git diff core/
>>> diff --git a/core/mempool.cc b/core/mempool.cc
>>> index bd8e2fcf..52f94774 100644
>>> --- a/core/mempool.cc
>>> +++ b/core/mempool.cc
>>> @@ -1732,6 +1732,8 @@ void free_initial_memory_range(void* addr, size_t 
>>> size)
>>>      if (!size) {
>>>          return;
>>>      }
>>> +    debug_early_u64("mempool: add range at:      ", (u64)addr);
>>> +    debug_early_u64("mempool: add range of size: ", size);
>>>      auto a = reinterpret_cast<uintptr_t>(addr);
>>>      auto delta = align_up(a, page_size) - a;
>>>      if (delta > size) {
>>>
>>> produces this extra info with verbose output:
>>>
>>> 2021-04-15T13:40:50.493125: Start
>>> 2021-04-15T13:40:50.493200: API socket-less: True
>>> 2021-04-15T13:40:50.493204: Firecracker ready
>>> 2021-04-15T13:40:50.493293: Configured VM
>>> 2021-04-15T13:40:50.493298: Added disk
>>> 2021-04-15T13:40:50.493301: Created OSv VM with cmdline: --verbose 
>>> --nopci --rootfs=zfs /hello
>>> {
>>>    "machine-config": {
>>>       "vcpu_count": 1,
>>>       "mem_size_mib": 128,
>>>       "ht_enabled": false
>>>    },
>>>    "drives": [
>>>       {
>>>          "drive_id": "rootfs",
>>>          "path_on_host": 
>>> "/home/wkozaczuk/projects/osv-master/scripts/../build/last/usr.raw",
>>>          "is_root_device": false,
>>>          "is_read_only": false
>>>       }
>>>    ],
>>>    "boot-source": {
>>>       "kernel_image_path": 
>>> "/home/wkozaczuk/projects/osv-master/scripts/../build/last/kernel.elf",
>>>       "boot_args": "--verbose --nopci --rootfs=zfs /hello"
>>>    }
>>> }
>>> OSv v0.55.0-240-g9d1f5111
>>> *mempool: add range at:      ffff800000957434*
>>> *mempool: add range of size: 00000000076a8bcc*
>>> *mempool: add range at:      ffff800000000001*
>>> *mempool: add range of size: 000000000009fbff*
>>> *mempool: add range at:      ffff800000100000*
>>> *mempool: add range of size: 0000000000100000*
>>> 1 CPUs detected
>>> Firmware vendor: Unknown
>>> bsd: initializing - done
>>> VFS: mounting ramfs at /
>>> VFS: mounting devfs at /dev
>>> net: initializing - done
>>> Detected virtio-mmio device: (2,0)
>>> virtio-blk: Add blk device instances 0 as vblk0, devsize=268435456
>>> random: intel drng, rdrand registered as a source.
>>> random: <Software, Yarrow> initialized
>>> VFS: unmounting /dev
>>> VFS: mounting zfs at /zfs
>>> zfs: mounting osv/zfs from device /dev/vblk0.1
>>> VFS: mounting devfs at /dev
>>> VFS: mounting procfs at /proc
>>> VFS: mounting sysfs at /sys
>>> BSD shrinker: event handler list found: 0xffffa000009ad080
>>>         BSD shrinker found: 1
>>> BSD shrinker: unlocked, running
>>> Booted up in 47.39 ms
>>> Cmdline: /hello 
>>> Hello from C code
>>> VFS: unmounting /dev
>>> VFS: unmounting /proc
>>> VFS: unmounting /
>>> Powering off.
>>> 2021-04-15T13:40:50.495752: Waiting for firecracker process to terminate
>>> 2021-04-15T13:40:50.594645: End
>>>
>>> Also in all 46691 runs with verbose I would always see the same memory 
>>> ranges.
>>>
>>> I wonder what ranges are in your case. I know we used to have some 
>>> edge-case bugs in memory setup around parsing e820 ranges so.
>>>
>>> Maybe we have a bug where we read incorrectly e820 and then create a 
>>> wrong mapping or read less memory than provided.
>>>
>>> Are the ranges the same in your case or possibly fluctuate and in case 
>>> of the crash are somewhat different?
>>>
>>> Otherwise, I am out of any more ideas.
>>>
>>> Waldek
>>>
>>> PS. Do you possibly run your tests in a nested virtualization setup or 
>>> possibly on EC2 i3 metal instances?
>>>
>>>
>>> On Sunday, April 11, 2021 at 12:43:17 PM UTC-4 d787...@gmail.com wrote:
>>>
>>>> The gcc 10.2.0 build already included the missing symbol change to 
>>>> disable setting up the missing symbols page.
>>>>
>>>> Did you see my April 6th reply?
>>>>
>>>>
>>>> On Sunday, April 11, 2021 at 4:45:59 AM UTC+1 jwkoz...@gmail.com wrote:
>>>>
>>>>> Interesting. What happens if you apply that other change to disable 
>>>>> setting up that missing symbols page and compile with gcc 10.2.0?
>>>>>
>>>>> On Wednesday, April 7, 2021 at 8:30:10 AM UTC-4 d787...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> Re: Your earlier point that it could be due to my gcc version 
>>>>>> revealing a gcc/OSv bug, I thought I would try upgrading to the latest 
>>>>>> gcc/g++ tools. So I'm now running:
>>>>>>
>>>>>> gcc (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0
>>>>>>
>>>>>> I rebuilt my OSv image as follows:
>>>>>>
>>>>>> ./scripts/build clean
>>>>>> ./scripts/build -j4 image=golang-example
>>>>>>
>>>>>> Then re-ran my test. At iterations 591, 688 and 9326, I saw the 
>>>>>> missing symbol failure, but so far, no sign of the Assert failure.
>>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/93b86c1d-ea29-408d-9536-b64de34f3321n%40googlegroups.com.

Reply via email to