Hi Mike,

On 9/26/19 3:17 PM, Mike Larkin wrote:
On Thu, Sep 26, 2019 at 11:54:03AM +0200, Mark Patruck wrote:
When running OpenBSD 6.6-beta on bare metal with hw.smt=0...

...and SMT disabled in BIOS

hw.machine=amd64
hw.model=AMD EPYC 7402P 24-Core Processor
hw.ncpu=24
hw.ncpufound=24
hw.smt=0
hw.ncpuonline=24


...and SMT enabled in BIOS

hw.machine=amd64
hw.model=AMD EPYC 7402P 24-Core Processor
hw.ncpu=48
hw.ncpufound=48
hw.smt=0
hw.ncpuonline=24
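
For anyone comparing the two sysctl dumps above, here is a minimal Python sketch (not part of the original report) that parses the name=value lines and reports how many of the CPUs found are not online. The helper names parse_sysctl and offline_cpus are made up for illustration. The values are copied verbatim from the two dumps.

```python
def parse_sysctl(text):
    """Turn 'name=value' lines into a dict, keeping values as strings."""
    out = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            out[name.strip()] = value.strip()
    return out


def offline_cpus(values):
    """CPUs the kernel found minus CPUs it currently has online."""
    return int(values["hw.ncpufound"]) - int(values["hw.ncpuonline"])


# Copied from the report: SMT disabled in BIOS.
SMT_OFF_IN_BIOS = """\
hw.machine=amd64
hw.model=AMD EPYC 7402P 24-Core Processor
hw.ncpu=24
hw.ncpufound=24
hw.smt=0
hw.ncpuonline=24
"""

# Copied from the report: SMT enabled in BIOS, hw.smt=0.
SMT_ON_IN_BIOS = """\
hw.machine=amd64
hw.model=AMD EPYC 7402P 24-Core Processor
hw.ncpu=48
hw.ncpufound=48
hw.smt=0
hw.ncpuonline=24
"""

for label, text in (("BIOS SMT off", SMT_OFF_IN_BIOS),
                    ("BIOS SMT on", SMT_ON_IN_BIOS)):
    v = parse_sysctl(text)
    print(label, "found:", v["hw.ncpufound"],
          "online:", v["hw.ncpuonline"],
          "offline:", offline_cpus(v))
```

With SMT enabled in BIOS and hw.smt=0, half of the 48 CPUs found stay offline, which leaves the same 24 online CPUs as in the first dump.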

Regarding ESXi, I wonder if there are any special settings needed
(cores/socket, socket/cores). According to the VMware docs, every core
assigned to the VM should be presented as a "real core" without SMT.

Back to the OpenBSD machine running on bare metal.

Though it looks better, the core numbering seems weird (this is with SMT
off). As you can see, some core IDs are skipped somehow...


Those are the APIC IDs that come from cpuid. I've seen them be non-consecutive
in the past on big machines like this. I don't think this means anything.

One way to debug this further is to instrument a kernel to pinpoint
where and why init is being sent a SIGSEGV (maybe by printing this
information in the trap function). I'm wondering if it's the same place
every time or if init is segfaulting at random places.

Thinking out loud, does bsd.sp exhibit the same issues?

Sorry, I forgot to mention that earlier. Booting /bsd.sp works on bare
metal (it boots to the login prompt) as well as under ESXi. I've set up
three OpenBSD 6.6 VMs for testing, and all of them work without issues
as long as, it seems, only CPU0 gets work.

Thanks,

        -Mark

-ml

# dmesg | grep smt
cpu0: smt 0, core 0, package 0
cpu1: smt 0, core 1, package 0
cpu2: smt 0, core 2, package 0
cpu3: smt 0, core 4, package 0
cpu4: smt 0, core 5, package 0
cpu5: smt 0, core 6, package 0
cpu6: smt 0, core 8, package 0
cpu7: smt 0, core 9, package 0
cpu8: smt 0, core 10, package 0
cpu9: smt 0, core 12, package 0
cpu10: smt 0, core 13, package 0
cpu11: smt 0, core 14, package 0
cpu12: smt 0, core 16, package 0
cpu13: smt 0, core 17, package 0
cpu14: smt 0, core 18, package 0
cpu15: smt 0, core 20, package 0
cpu16: smt 0, core 21, package 0
cpu17: smt 0, core 22, package 0
cpu18: smt 0, core 24, package 0
cpu19: smt 0, core 25, package 0
cpu20: smt 0, core 26, package 0
cpu21: smt 0, core 28, package 0
cpu22: smt 0, core 29, package 0
cpu23: smt 0, core 30, package 0
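
To make the numbering pattern in the list above easier to see, here is a minimal Python sketch (not part of the original report) that parses the "cpuN: smt X, core Y, package Z" lines and lists the core IDs that never appear. The helper names core_ids and missing_ids are made up for illustration. The input is copied verbatim from the dmesg output.

```python
import re

# Copied from the "dmesg | grep smt" output in the report (SMT off).
DMESG = """\
cpu0: smt 0, core 0, package 0
cpu1: smt 0, core 1, package 0
cpu2: smt 0, core 2, package 0
cpu3: smt 0, core 4, package 0
cpu4: smt 0, core 5, package 0
cpu5: smt 0, core 6, package 0
cpu6: smt 0, core 8, package 0
cpu7: smt 0, core 9, package 0
cpu8: smt 0, core 10, package 0
cpu9: smt 0, core 12, package 0
cpu10: smt 0, core 13, package 0
cpu11: smt 0, core 14, package 0
cpu12: smt 0, core 16, package 0
cpu13: smt 0, core 17, package 0
cpu14: smt 0, core 18, package 0
cpu15: smt 0, core 20, package 0
cpu16: smt 0, core 21, package 0
cpu17: smt 0, core 22, package 0
cpu18: smt 0, core 24, package 0
cpu19: smt 0, core 25, package 0
cpu20: smt 0, core 26, package 0
cpu21: smt 0, core 28, package 0
cpu22: smt 0, core 29, package 0
cpu23: smt 0, core 30, package 0
"""

LINE_RE = re.compile(r"cpu(\d+): smt (\d+), core (\d+), package (\d+)")


def core_ids(dmesg_text):
    """Return the core IDs in the order the kernel reported them."""
    return [int(m.group(3)) for m in LINE_RE.finditer(dmesg_text)]


def missing_ids(ids):
    """Return the IDs absent from the range 0..max(ids)."""
    present = set(ids)
    return [i for i in range(max(ids) + 1) if i not in present]


ids = core_ids(DMESG)
gaps = missing_ids(ids)
print(len(ids), "cores reported")
print("skipped core IDs:", gaps)
print("every gap has ID % 4 == 3:", all(g % 4 == 3 for g in gaps))
```

The gaps are regular: every fourth ID is missing below the highest reported one. That says nothing by itself about whether the numbering is related to the init crash.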


See "dmesg_smt_off", "dmesg_smt_on" for more details (long)

In the end, I didn't get to the login prompt, as init died (Segmentation
fault). This looks similar to OpenBSD 6.6-beta running as a VM, as does
the following panic I caught during one of the reboots (SMT off in
BIOS, hw.smt=0)

....
root on sd4a (96d2f4e21ea90b51.a) swap on sd4b dump on sd4b
panic: init died (signal 11, exit 0)
Stopped at      db_enter+0x10:  popq    %rbp
     TID    PID     UID    PRFLAGS     PFLAGS   CPU  COMMAND
  332879   7766       0          0          0     0  init
* 86645      1       0      0x802     0x2000    7K  init
db_enter() at db_enter+0x10
panic() at panic+0x128
exit1(ffff8000fffff8b0,8b,1) at exit1+0x5c4
trapsignal(ffff8000fffff8b0,b,6,1,0) at trapsignal+0x13a
pageflttrap() at pageflttrap+0x287
usertrap(ffff8000228d03e0) at usertrap+0x1b0
recall_trap(6,0,1a3c866037d5,4,7f7fffff6740,0) at recall_trap+0x8
end of kernel
end trace frame: 0x7f7fffff40d0, count: 8
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports. Insufficient info makes it difficult to find and fix bugs.
ddb{2}> trace
db_enter() at db_enter+0x10
panic() at panic+0x128
exit1(ffff8000fffff8b0,8b,1) at exit1+0x5c4
trapsignal(ffff8000fffff8b0,b,6,1,0) at trapsignal+0x13a
pageflttrap() at pageflttrap+0x287
usertrap(ffff8000228d03e0) at usertrap+0x1b0
recall_trap(6,0,1a3c8660375d5,4,7f7fffff6740,0) at recall_trap+0x8
end of kernel
end trace frame: 0x7f7fffff40d0, count: -7
ddb{2}> ps

[ cut big parts ]
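
The backtrace appears twice above: once printed at the panic and once from the explicit "trace" command. As a minimal Python sketch (not part of the original report), the two copies can be compared frame by frame. The helper name differing_frames is made up for illustration. The sketch also decodes the second argument of trapsignal, assuming ddb prints arguments in hexadecimal.

```python
# First copy: printed automatically at the panic.
FIRST = """\
db_enter() at db_enter+0x10
panic() at panic+0x128
exit1(ffff8000fffff8b0,8b,1) at exit1+0x5c4
trapsignal(ffff8000fffff8b0,b,6,1,0) at trapsignal+0x13a
pageflttrap() at pageflttrap+0x287
usertrap(ffff8000228d03e0) at usertrap+0x1b0
recall_trap(6,0,1a3c866037d5,4,7f7fffff6740,0) at recall_trap+0x8
"""

# Second copy: output of the "trace" command at the ddb prompt.
SECOND = """\
db_enter() at db_enter+0x10
panic() at panic+0x128
exit1(ffff8000fffff8b0,8b,1) at exit1+0x5c4
trapsignal(ffff8000fffff8b0,b,6,1,0) at trapsignal+0x13a
pageflttrap() at pageflttrap+0x287
usertrap(ffff8000228d03e0) at usertrap+0x1b0
recall_trap(6,0,1a3c8660375d5,4,7f7fffff6740,0) at recall_trap+0x8
"""


def differing_frames(a, b):
    """Return (index, frame_a, frame_b) for every frame that differs."""
    pairs = zip(a.splitlines(), b.splitlines())
    return [(i, x, y) for i, (x, y) in enumerate(pairs) if x != y]


diffs = differing_frames(FIRST, SECOND)
for index, x, y in diffs:
    print("frame", index, "differs:")
    print("  ", x)
    print("  ", y)

# Assumption: ddb prints call arguments in hex, so "b" is 0xb.
trapsignal_frame = FIRST.splitlines()[3]
signum = int(trapsignal_frame.split("(")[1].split(",")[1], 16)
print("signal number passed to trapsignal:", signum)
```

Only the recall_trap frame differs, in its third argument. The sketch cannot tell which copy is correct. The decoded signal number, 11, matches the "signal 11" in the panic message.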

--
Mark Patruck ( mark at wrapped.cx )
GPG key 0xF2865E51 / 187F F6D3 EE04 1DCE 1C74  F644 0D3C F66F F286 5E51

https://www.wrapped.cx
