On Thu, Feb 04, 2021 at 04:05:42PM -0700, Alan Somers wrote:
> On Thu, Feb 4, 2021 at 3:58 PM Konstantin Belousov <[email protected]>
> wrote:
>
> > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <[email protected]> wrote:
> > > >
> > > > After upgrading a machine to FreeBSD, 12.2, it hit the following panic on
> > > > its first reboot. I suspect that a few other servers have hit this too,
> > > > but since it happens before swap is mounted there are no core dumps, and
> > > > they usually reboot immediately. The code in question hasn't changed since
> > > > 2018. The panic happened in cmci_monitor at line 930. Does anybody have
> > > > any suggestions for how I could debug further? I can't readily reproduce
> > > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > >
> > >
> > > I can't actually help :( but I can add a +1 with similar hardware or
> > > equivalent specs. It's not frequent, but it's often enough to be
> > > annoying.
> > > -M
> > >
> > > >         if (!(ctl & MC_CTL2_CMCI_EN))
> > > >                 /* This bank does not support CMCI. */
> > > >                 return;
> > > >
> > > >         cc = &cmc_state[PCPU_GET(cpuid)][i]; // <- panic here
> > > >
> > > >         /* Determine maximum threshold. */
> > > >
> > > >
> > > > Fatal trap 12: page fault while in kernel mode
> > > > cpuid = 26; apic id = 34
> > > > fault virtual address = 0xd0
> > > > fault code = supervisor read data, page not present
> > > > instruction pointer = 0x20:0xffffffff8125a009
> > > > stack pointer = 0x28:0xfffffe0000b65f20
> > > > frame pointer = 0x28:0xfffffe0000b65f50
> > > > code segment = base 0x0, limit 0xfffff, type 0x1b
> > > >              = DPL 0, pres 1, long 1, def32 0, gran 1
> > > > processor eflags = resume, IOPL = 0
> > > > current process = 11 (idle: cpu26)
> > > > trap number = 12
> > > > panic: page fault
> > > > cpuid = 26
> > > > time = 1
> > > > KDB: stack backtrace:
> > > > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000b65be0
> > > > vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
> > > > panic() at panic+0x43/frame 0xfffffe0000b65c90
> > > > trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
> > > > trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
> > > > trap() at trap+0x286/frame 0xfffffe0000b65e50
> > > > calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
> > > > --- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp = 0xfffffe0000b65f50 ---
> > > > _mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
> > > > init_secondary_tail() at init_secondary_tail+0xfd/frame 0xfffffe0000b65f80
> > > > init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
> > > > KDB: enter: panic
> > > > [ thread pid 11 tid 100029 ]
> > > > Stopped at kdb_enter+0x37: movq $0,0x12bc1f6(%rip)
> >
> > Try this.
> >
> > I think that there is no other dependencies in the startup order, but
> > cannot know it for sure.
> >
> > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > Author: Konstantin Belousov <[email protected]>
> > Date:   Fri Feb 5 00:56:09 2021 +0200
> >
> >     x86: init mca before APs are started
> >
> > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > index 03100e77d455..e2bf2673cf69 100644
> > --- a/sys/x86/x86/mca.c
> > +++ b/sys/x86/x86/mca.c
> > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> >
> >         mca_init();
> >  }
> > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> >
> >  /* Called when a machine check exception fires. */
> >  void
> >
>
> I can test this patch on development servers, but so far I've only seen the
> crash on production servers. Do you have any suggestions for how to force
> the crash, or how to test this patch besides simply making sure that my dev
> servers can boot?

The race, as I see it, is that we call mca_init() on the BSP too late,
so the malloc() that provides the storage for the cmc_state array may
run only after one of the APs has already been IPIed for startup. The
patch ensures that the mca_init_bsp() SYSINIT has finished before we go
on to start the APs.

I do not think there is any reliable way to trigger the panic while
keeping the patch usable, other than to observe enough successful boots.
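
To make the ordering argument concrete, here is a small userland sketch,
not kernel code. It assumes that hooks within one subsystem run in
ascending order value, that the hook which starts the APs sits at the
third position, and that SECOND < THIRD < ANY. The names start_aps and
mca_runs_before_aps() and the numeric values are hypothetical and exist
only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative values only. They are assumed to preserve the relative
 * order of the SI_ORDER_* constants (SECOND < THIRD < ANY); they are
 * not copied from the kernel headers.
 */
enum {
	ORDER_SECOND = 1,
	ORDER_THIRD = 2,
	ORDER_ANY = 0x0fffffff
};

/* One startup hook inside a single subsystem (think SI_SUB_CPU). */
struct hook {
	const char *name;
	unsigned int order;
};

/* qsort() comparator: a lower order value runs earlier. */
static int
hook_cmp(const void *a, const void *b)
{
	const struct hook *ha = a, *hb = b;

	return ((ha->order > hb->order) - (ha->order < hb->order));
}

/*
 * Returns 1 if the BSP-side MCA init would run before the hook that
 * starts the APs, 0 otherwise.  "start_aps" is a hypothetical stand-in,
 * and placing it at ORDER_THIRD is an assumption of this sketch.
 */
int
mca_runs_before_aps(unsigned int mca_order)
{
	struct hook hooks[] = {
		{ "mca_init_bsp", mca_order },
		{ "start_aps", ORDER_THIRD },
	};

	qsort(hooks, sizeof(hooks) / sizeof(hooks[0]), sizeof(hooks[0]),
	    hook_cmp);
	return (strcmp(hooks[0].name, "mca_init_bsp") == 0);
}
```

Under these assumptions, mca_runs_before_aps(ORDER_ANY) returns 0, which
is the racy case where an AP can reach cmc_state before it is allocated,
and mca_runs_before_aps(ORDER_SECOND) returns 1, which is the ordering
the patch is meant to give.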
