Is there a chance I could get hold of the test app you are running on
OSv? On the other hand, given that it happens so early in the boot process
(or so it seems), it probably does not matter.
Also, which version of firecracker are you using? I am using v0.23.0 and
the latest OSv master (you seem to be using a version of OSv that is ~30
commits behind - 168). That probably does not matter either.
I think the backtrace does not work because the error happens so early in
the boot process.
Can you try modifying OSv with this patch to at least dump the registers:
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 1af268c0..5fd0ee2e 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -16,6 +16,8 @@
 #include <osv/elf.hh>
 #include "exceptions.hh"
 
+void dump_registers(exception_frame* ef);
+
 void page_fault(exception_frame *ef)
 {
     sched::fpu_lock fpu;
@@ -34,6 +36,9 @@ void page_fault(exception_frame *ef)
     }
     // The following code may sleep. So let's verify the fault did not happen
     // when preemption was disabled, or interrupts were disabled.
+    if (!sched::preemptable()) {
+        dump_registers(ef);
+    }
     assert(sched::preemptable());
     assert(ef->rflags & processor::rflags_if);
and send us the output? It would be best if you can use the RIP value to
find which function was called last.
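For mapping the RIP value to a function, addr2line against the unstripped kernel ELF usually does the trick. A small helper along these lines can build the command from the dump output (the "RIP: 0x..." format and the build/release/loader.elf path are assumptions; adjust them to whatever dump_registers() actually prints and to your build directory):

```python
import re

def addr2line_command(dump_text, kernel_elf="build/release/loader.elf"):
    # Pull the RIP value out of a register dump and build the addr2line
    # invocation that maps it back to a function name. The "RIP: 0x..."
    # pattern is an assumed dump format, not what OSv necessarily prints.
    m = re.search(r"RIP:?\s*(0x[0-9a-fA-F]+)", dump_text)
    if not m:
        return None
    return "addr2line -f -C -e {} {}".format(kernel_elf, m.group(1))

# Example with a made-up register dump line:
print(addr2line_command("RAX: 0x0  RIP: 0x40321ab  RFLAGS: 0x2"))
```

Running the printed command (-f for the function name, -C to demangle) should give the last function entered before the fault.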
BTW, I ran the simple golang-pie-example (./scripts/build -j4
image=golang-pie-example fs=rofs) over 1,500 times and could not see the
error:
OSv v0.55.0-203-gbd2fa23f
1 CPUs detected
Firmware vendor: Unknown
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
Detected virtio-mmio device: (2,0)
virtio-blk: Add blk device instances 0 as vblk0, devsize=8432640
random: intel drng, rdrand registered as a source.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
VFS: mounting rofs at /rofs
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
VFS: mounting ramfs at /tmp
Booted up in 11.17 ms
Cmdline: /hello.so
random: blocking on read.
random: device unblocked.
Hello, 世界
Go version: go1.12.6
program exited with status 0
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
ROFS: spent 4.84 ms reading from disk
ROFS: read 3760 512-byte blocks from disk
ROFS: allocated 3804 512-byte blocks of cache memory
ROFS: hit ratio is 86.20%
Powering off.
BTW, are you using ROFS or the default ZFS image?
BTW, I used this modified firecracker.py script to run your example:
diff --git a/scripts/firecracker.py b/scripts/firecracker.py
index 7137b21c..5fa2a599 100755
--- a/scripts/firecracker.py
+++ b/scripts/firecracker.py
@@ -329,7 +329,16 @@ def main(options):
     if not options.api:
         if options.verbose:
             print(client.firecracker_config_json())
-        firecracker, config_file_path = start_firecracker_with_no_api(firecracker_path, client.firecracker_config_json())
+        N = 1
+        while True:
+            print_time("-----------------------------------")
+            print_time("Running the %d-th time" % N)
+            firecracker, config_file_path = start_firecracker_with_no_api(firecracker_path, client.firecracker_config_json())
+            try:
+                firecracker.wait()
+            except KeyboardInterrupt:
+                os.kill(firecracker.pid, signal.SIGINT)
+            N = N + 1
     else:
         client.start_instance()
         print_time("Booted OSv VM")
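If you would rather not patch the script, a standalone wrapper that re-launches OSv and stops as soon as the assert shows up in the console output works too. This is just a sketch - the ./scripts/firecracker.py path is an assumption, and the failure markers are taken from the failing boot log quoted earlier in the thread:

```python
import itertools
import subprocess

# Strings that the failing boot prints to the console.
FAILURE_MARKERS = ("Assertion failed", "[backtrace]")

def boot_failed(console_output):
    # True if the captured OSv console output shows the boot-time assert.
    return any(marker in console_output for marker in FAILURE_MARKERS)

def run_until_assert(cmd=("./scripts/firecracker.py",), max_runs=None):
    # Re-launch the VM until a run trips the assert; returns the run number,
    # or None if max_runs is exhausted first. The cmd path is an assumption -
    # point it at your own launcher.
    for n in itertools.count(1):
        if max_runs is not None and n > max_runs:
            return None
        result = subprocess.run(cmd, capture_output=True, text=True)
        if boot_failed(result.stdout + result.stderr):
            return n

# The detection logic alone can be sanity-checked on canned console output:
print(boot_failed("bsd: initializingAssertion failed: sched::preemptable()"))
print(boot_failed("bsd: initializing - done"))
```

That way a 1-in-1000 reproduction leaves you with the run number and the failing output instead of scrolling past thousands of good boots.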
On Tuesday, March 9, 2021 at 3:54:35 AM UTC-5 [email protected] wrote:
> To add weight to the timing-issue theory, we are seeing the crash much
> more often (est. 1 in 20) when:
>
> - The -verbose OSv startup argument is supplied
>
> or
>
> - We TCP connect to the OSv instance whilst it is starting up
>
> Is there somewhere I can raise a bug report for this?
>
Sure. You can always create a new issue
at https://github.com/cloudius-systems/osv/issues.
On Friday, February 26, 2021 at 12:36:43 PM UTC David Smith wrote:
>
>> The power-off-on-abort flag worked, thanks for that tip.
>>
>> On the crash issue, I built an OSv image with the standard native
>> example: *./scripts/build image=native-example*
>> Then ran continuous launches under firecracker using a bash loop: *while
>> true; do ./scripts/firecracker.py; done*
>> After 1249 iterations, I got the same Assert at startup.
>>
>> Then I ran continuous launches under qemu: *while true; do
>> ./scripts/run.py; done*
>> So far, I'm over 6,500 iterations without failure, so I think this issue
>> is occurring under firecracker only, but that may be due to startup timing.
>>
>> I tried the patch supplied for a similar issue ("[PATCH] pthread_create()
>> - prefault top of stack") in case it helped, but the Assert still happens
>> as described in my original post.
>>
>> On Wednesday, February 17, 2021 at 3:31:35 PM UTC David Smith wrote:
>>
>>> I ran the same test using firecracker. After 1865 successful executions
>>> (OSv VM startup, my app dumping usage info to stdout, then quitting, OSv VM
>>> shutdown/poweroff) the 1866th failed with the same assert. I'll keep
>>> trying to find a more easily reproducible test (I think I'll try the sample
>>> go app. next, which cuts out anything I'm developing).
>>> With a 1-in-1866 failure, I don't think that's a clear-cut "it only
>>> happens under firecracker", so I'll also repeat both the firecracker and
>>> Qemu tests to get a more certain picture.
>>>
>>> On Wednesday, February 17, 2021 at 12:10:27 PM UTC David Smith wrote:
>>>
>>>> Thanks for the power-off tip, I'll give that a try.
>>>> I ran 2,500 cycles of starting the VM under Qemu yesterday and saw no
>>>> occurrences of this issue.
>>>> I'll repeat the same test using firecracker and see if I can reproduce.
>>>> Last time I saw the issue, it happened once in 12 VM launches.
>>>>
>>>> On Tuesday, February 16, 2021 at 10:26:29 PM UTC [email protected]
>>>> wrote:
>>>>
>>>>> The #1123 issue has been reported in the aarch64 port and (as far as
>>>>> I have seen) only happens in SMP mode (#vCPUs >= 2). In your case, you
>>>>> seem to be getting it with 1 CPU. But who knows if it is related.
>>>>>
>>>>> As far as automatic power off, have you tried to prepend your command
>>>>> line with '--power-off-on-abort'?
>>>>>
>>>>> Waldek
>>>>> On Tuesday, February 16, 2021 at 3:59:22 AM UTC-5 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Thanks for the quick reply. I haven't seen this failure when running
>>>>>> under Qemu, but since I'm almost always running under Firecracker and it
>>>>>> happens so infrequently under Firecracker, that's no surprise. I will
>>>>>> try
>>>>>> to reproduce under Qemu and hopefully collect more information.
>>>>>>
>>>>>> On a related point, when this occurs (and I assume the behaviour
>>>>>> would be the same with any kernel assert), the OS seems to be left in a
>>>>>> hung state. What I'd like to ensure with my build is that if a fatal
>>>>>> error
>>>>>> occurs, the VM is powered off. Is there an option I can use to achieve
>>>>>> this, or would it require code modification (perhaps injecting a call to
>>>>>> poweroff() at an appropriate location)?
>>>>>>
>>>>>> On Monday, February 15, 2021 at 12:16:55 PM UTC Nadav Har'El wrote:
>>>>>>
>>>>>>> On Mon, Feb 15, 2021 at 1:51 PM David Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am running a Golang application on OSv under firecracker.
>>>>>>>> Occasionally (it's proving difficult to reproduce reliably) I see an
>>>>>>>> assertion failure early in the OSv startup sequence. The console
>>>>>>>> output for
>>>>>>>> my release build is as follows:
>>>>>>>>
>>>>>>>> OSv v0.55.0-168-g31a04239
>>>>>>>> 1 CPUs detected
>>>>>>>> Firmware vendor: Unknown
>>>>>>>> bsd: initializingAssertion failed: sched::preemptable()
>>>>>>>> (arch/x64/mmu.cc: page_fault: 37)
>>>>>>>>
>>>>>>>> [backtrace]
>>>>>>>>
>>>>>>>
>>>>>>> I wonder why the backtrace is empty. If you run with qemu, not
>>>>>>> firecracker, you can perhaps get a better backtrace with gdb - see
>>>>>>> instructions in
>>>>>>> https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#debugging-osv-with-gdb.
>>>>>>>
>>>>>>> Does it fail in a similar way in qemu too, or just in firecracker?
>>>>>>>
>>>>>>>
>>>>>>>> It appears to be hitting the assert well before my go application
>>>>>>>> is launched, so my impression is that this isn't being caused by
>>>>>>>> anything
>>>>>>>> my application is doing,
>>>>>>>>
>>>>>>>
>>>>>>> That's my impression too - bsd_init() finishes (and prints " - done")
>>>>>>> before starting to run the application.
>>>>>>>
>>>>>>>
>>>>>>>> though I haven't been able to prove that the same behaviour occurs
>>>>>>>> with a base image (no apps included).
>>>>>>>>
>>>>>>>
>>>>>>> This assert appears to happen in bsd_init() (although it's hard to
>>>>>>> know without a backtrace), which runs with IRQs disabled, so it should
>>>>>>> not have had page faults; the fact that we got one might indicate a
>>>>>>> bug. Hopefully the debugger or backtrace can explain why/where this
>>>>>>> page fault is happening.
>>>>>>>
>>>>>>>
>>>>>>>> As you can see the backtrace is empty (perhaps that's the norm for a
>>>>>>>> release build?), so there are no additional clues. If I can find a way
>>>>>>>> to
>>>>>>>> reproduce with a debug build and/or with a base image, I will update
>>>>>>>> this
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> In most cases, my OSv image starts up without issue and in those
>>>>>>>> cases, the console output around the same area looks as follows:
>>>>>>>>
>>>>>>>> OSv v0.55.0-168-g31a04239
>>>>>>>> 1 CPUs detected
>>>>>>>> Firmware vendor: Unknown
>>>>>>>> bsd: initializing - done
>>>>>>>> VFS: mounting ramfs at /
>>>>>>>> VFS: mounting devfs at /dev
>>>>>>>> net: initializing - done
>>>>>>>>
>>>>>>>> I can see there are numerous previous discussions around the page
>>>>>>>> fault when preemption is disabled assertion, but these seem to be tied
>>>>>>>> to
>>>>>>>> application behaviour.
>>>>>>>> Is an assert during "bsd: Initializing" a known issue for which
>>>>>>>> there is already a patch?
>>>>>>>>
>>>>>>>
>>>>>>> Unfortunately, I don't remember seeing this specific case. We did
>>>>>>> recently have https://github.com/cloudius-systems/osv/issues/1123,
>>>>>>> where bsd_init() hangs, though; I wonder if it's related.
>>>>>>>
>>>>>>>
>>>>>>>> Otherwise, if there is additional info. that I can capture to help
>>>>>>>> isolate the cause, let me know.
>>>>>>>>
>>>>>>>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/0e96bbc2-029a-45a2-8a2a-d253d8f2e693n%40googlegroups.com.