[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

John Baldwin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|New                         |Closed

--- Comment #23 from John Baldwin ---
Sorry I wasn't able to track this down earlier. I knew when looking at the other bug (228768) that I had tried to investigate this before but couldn't find this bug.

*** This bug has been marked as a duplicate of bug 228768 ***

--
You are receiving this mail because:
You are the assignee for the bug.
___
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #22 from m...@rubenvos.com ---
Hi,

Just to follow up: 12.0 RC-1 boots perfectly on our Opteron CPUs. I upgraded from 11.0 to 12.0 RC-1 and haven't tried 11.2.

Ruben

---<>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-RC1 r340470 GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
VT(vga): resolution 640x480
CPU: Quad-Core AMD Opteron(tm) Processor 2356 (2300.14-MHz K8-class CPU)
  Origin="AuthenticAMD" Id=0x100f23 Family=0x10 Model=0x2 Stepping=3
  Features=0x178bfbff
  Features2=0x802009
  AMD Features=0xee500800
  AMD Features2=0x7ff
  SVM: NP,NAsids=64
  TSC: P-state invariant
real memory = 25769803776 (24576 MB)
avail memory = 25003937792 (23845 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 2 package(s) x 4 core(s)
random: unblocking device.
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 24-47 on motherboard
Launching APs: 4 5 2 6 3 7 1
Timecounter "TSC-low" frequency 1150069463 Hz quality 800
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
[ath_hal] loaded
module_register_init: MOD_LOAD (vesa, 0x810f8690, 0) error 19
nexus0
vtvga0: on motherboard
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #20 from John Baldwin ---
Grrr, not sure why my patch didn't prevent it from recursing. You could try '|| cold' instead of '|| n == 1' perhaps. You could also try changing the 'DELAY(1)' in _mtx_lock_indefinite_check() in sys/kern/kern_mutex.c to be something like 'if (cold) cpu_spinwait(); else DELAY(1);' instead of the 'n == 1' hack.

Oh, I see why 'n == 1' didn't help. The early_delay callback that is used when that 'n == 1' check fails is i8254_delay (set in amd64/amd64/machdep.c).
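The 'if (cold) cpu_spinwait(); else DELAY(1);' idea from comment #20 can be sketched in user space. This is only a toy model under stated assumptions, not the kernel code: toy_cold, toy_cpu_spinwait(), toy_delay() and toy_lock_backoff() are hypothetical stand-ins for cold, cpu_spinwait(), DELAY() and the back-off step in _mtx_lock_indefinite_check(), and the two counters exist only so the chosen path can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's `cold` flag (true during early boot). */
static bool toy_cold = true;

/* Counters so a caller can see which back-off path was taken. */
static int spinwait_calls;
static int delay_calls;

/* Stand-in for cpu_spinwait(): a pause hint that takes no locks. */
static void toy_cpu_spinwait(void)
{
    spinwait_calls++;
}

/* Stand-in for DELAY(): in early boot the real one may end up in a
 * timer routine that itself takes a spin lock. */
static void toy_delay(int usec)
{
    assert(usec > 0);
    delay_calls++;
}

/* The guard from the comment: stay away from the delay routine while cold. */
static void toy_lock_backoff(void)
{
    if (toy_cold)
        toy_cpu_spinwait();
    else
        toy_delay(1);
}
```

With toy_cold set, a call to toy_lock_backoff() only bumps spinwait_calls; once toy_cold is cleared, it goes through toy_delay() instead. Whether this guard is the right fix for the real kernel is exactly what the thread leaves open.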
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #19 from Pablo Ruiz ---
Hi,

Did you guys have a chance to take another look at this? May I help with any tests?

Regards
Pablo
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #18 from Pablo Ruiz ---
I've tried the following patch:

diff --git a/sys/x86/x86/delay.c b/sys/x86/x86/delay.c
index 8cbe6012a96..00dfff48c19 100644
--- a/sys/x86/x86/delay.c
+++ b/sys/x86/x86/delay.c
@@ -70,7 +70,7 @@ delay_tc(int n)
 		func = get_tsc;
 		mask = ~0u;
 	} else {
-		if (tc->tc_quality <= 0)
+		if (tc->tc_quality <= 0 || n == 1)
 			return (0);
 		func = tc->tc_get_timecount;
 		mask = tc->tc_counter_mask;
diff --git a/sys/x86/x86/mp_x86.c b/sys/x86/x86/mp_x86.c
index 04975dd8a2e..c2a90e9a7d4 100644
--- a/sys/x86/x86/mp_x86.c
+++ b/sys/x86/x86/mp_x86.c
@@ -952,7 +952,7 @@ init_secondary_tail(void)
 #endif

 	mtx_unlock_spin(_boot_mtx);
-while(1);
+//while(1);

 	/* Wait until all the AP's are up. */
 	while (atomic_load_acq_int(_started) == 0)
.

But I got to the 'db' prompt too:

cpu0 BSP:
     ID: 0x VER: 0x80050010 LDR: 0x DFR: 0x
  lint0: 0x00010700 lint1: 0x0400 TPR: 0x SVR: 0x01ff
  timer: 0x000100ef therm: 0x0001 err: 0x00f0 pmc: 0x00010400
   AMD ext features: 0x00010003
   AMD elvt0: 0x0001
SMP: AP CPU #1 Launched!
cpu1 AP:
FFa tFa FFa FFa Fa kFernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x441
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0x80ffb704
stack pointer		= 0x28:0xfe001b97ba90
frame pointer		= 0x28:0xfe001b97baa0
code segment		= base 0x0, limit 0xf, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 11 (idle: cpu2)
[ thread pid 11 tid 15 ]
Stopped at spinlock_exit+0x14: movq 0x440(%rbx),%rax
db> show thread
Thread 15 at 0xf8000332c000:
 proc (pid 11): 0xf80003328000
 name: idle: cpu2
 stack: 0xfe001b9a6000-0xfe001b9a9fff
 flags: 0x40024 pflags: 0x20
 state: CAN RUN
 priority: 255
 container lock: sched lock 0 (0x827ee880)
db> bt
Tracing pid 11 tid 15 td 0xf8000332c000
spinlock_exit() at spinlock_exit+0x14/frame 0xfe001b97baa0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bb10
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bb50
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bbc0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bc00
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bc70
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bcb0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bd20
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bd60
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bdd0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97be10
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97be80
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bec0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bf30
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97bf70
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97bfe0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97c020
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97c090
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #17 from Pablo Ruiz ---
Hi John,

The CPU model is 'AMD Opteron(tm) Processor SE (Dual Core)'.

I'll try the patch you suggested and report back.
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

Mark Linimon changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |patch
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #16 from John Baldwin ---
(In reply to Pablo Ruiz from comment #14)
Thanks, this gives me what I was looking for. We have infinite recursion because the spin lock code is calling DELAY() which is trying to grab a spin lock. One question is why isn't DELAY using the tsc? Hmm, it seems 'tsc_is_invariant' isn't set. Are these older AMD CPUs? You can try this as a hack-workaround to verify it fixes the issue, but I need to think a bit more about what the right fix might be:

Index: x86/x86/delay.c
===
--- delay.c	(revision 329004)
+++ delay.c	(working copy)
@@ -72,7 +72,7 @@
 		func = get_tsc;
 		mask = ~0u;
 	} else {
-		if (tc->tc_quality <= 0)
+		if (tc->tc_quality <= 0 || n == 1)
 			return (0);
 		func = tc->tc_get_timecount;
 		mask = tc->tc_counter_mask;
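The recursion described in comment #16 (a contended spin lock backs off through a delay routine, and the delay routine itself needs a spin lock) can be illustrated with a small user-space model. Everything here is hypothetical: toy_lock_spin(), toy_delay() and toy_simulate() merely mimic the roles of _mtx_lock_spin_cookie() and i8254_delay() seen in the backtraces, and TOY_STACK_FRAMES is an arbitrary stand-in for the finite kernel stack.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are not the kernel's functions or data structures. */
enum { TOY_STACK_FRAMES = 64 };   /* stand-in for a finite kernel stack */

static int  depth;        /* current nesting of toy_lock_spin() */
static int  max_depth;    /* deepest nesting seen in this run */
static bool overflowed;   /* set when the toy "stack" is exhausted */
static bool timer_lock_held_elsewhere = true;  /* another CPU owns the lock */

static void toy_lock_spin(bool delay_takes_lock);

/* Delay routine: the i8254-style variant must take the timer lock first. */
static void toy_delay(bool delay_takes_lock)
{
    if (delay_takes_lock)
        toy_lock_spin(delay_takes_lock);
    /* A TSC-style delay would just read a counter here, with no lock. */
}

/* Contended spin lock acquire: backs off by calling the delay routine. */
static void toy_lock_spin(bool delay_takes_lock)
{
    int tries = 0;

    depth++;
    if (depth > max_depth)
        max_depth = depth;
    if (depth >= TOY_STACK_FRAMES) {
        overflowed = true;      /* the model's analogue of the stack overflow */
    } else {
        while (timer_lock_held_elsewhere && !overflowed && tries++ < 3)
            toy_delay(delay_takes_lock);
    }
    depth--;
}

/* Run one scenario and report how deep the nesting went. */
static int toy_simulate(bool delay_takes_lock)
{
    depth = 0;
    max_depth = 0;
    overflowed = false;
    toy_lock_spin(delay_takes_lock);
    assert(depth == 0);         /* every frame has unwound again */
    return max_depth;
}
```

In this model, toy_simulate(true) nests until the frame budget is used up, while toy_simulate(false) never goes deeper than one frame. That is the intuition behind asking why DELAY() was not using the TSC on these CPUs.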
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #15 from Pablo Ruiz ---
And finally, if I move the while(1) to just after the mtx_unlock_spin call, like this:

diff --git a/sys/x86/x86/mp_x86.c b/sys/x86/x86/mp_x86.c
index 1c257d87e58..04975dd8a2e 100644
--- a/sys/x86/x86/mp_x86.c
+++ b/sys/x86/x86/mp_x86.c
@@ -950,9 +950,9 @@ init_secondary_tail(void)
 	load_es(_udatasel);
 	load_fs(_ufssel);
 #endif
-while(1);

 	mtx_unlock_spin(_boot_mtx);
+while(1);

 	/* Wait until all the AP's are up. */
 	while (atomic_load_acq_int(_started) == 0)

We are back to the original behaviour: no 'db' prompt, and a somewhat garbled crash:

[...]
cpu0 BSP:
     ID: 0x VER: 0x80050010 LDR: 0x DFR: 0x
  lint0: 0x00010700 lint1: 0x0400 TPR: 0x SVR: 0x01ff
  timer: 0x000100ef therm: 0x0001 err: 0x00f0 pmc: 0x00010400
   AMD ext features: 0x00010003
   AMD elvt0: 0x0001
SMP: AP CPU #1 Launched!
cpu1 AP:
Fa Fa FFa Fa Fa Fa Fata
.

I hope this helps.
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #14 from Pablo Ruiz ---
I've tried moving the while loop a bit further down in mp_x86.c, and I got a db prompt too:

diff --git a/sys/x86/x86/mp_x86.c b/sys/x86/x86/mp_x86.c
index 3cca61ca72e..1c257d87e58 100644
--- a/sys/x86/x86/mp_x86.c
+++ b/sys/x86/x86/mp_x86.c
@@ -925,7 +925,6 @@ init_secondary_tail(void)
 	CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", cpuid);

 	printf("SMP: AP CPU #%d Launched!\n", cpuid);
-while(1);

 	/* Determine if we are a logical CPU. */
 	if (cpu_info[PCPU_GET(apic_id)].cpu_hyperthread)
@@ -951,6 +950,7 @@ while(1);
 	load_es(_udatasel);
 	load_fs(_ufssel);
 #endif
+while(1);

 	mtx_unlock_spin(_boot_mtx);

This is the relevant boot output:

[...]
cpu0 BSP:
     ID: 0x VER: 0x80050010 LDR: 0x DFR: 0x
  lint0: 0x00010700 lint1: 0x0400 TPR: 0x SVR: 0x01ff
  timer: 0x000100ef therm: 0x0001 err: 0x00f0 pmc: 0x00010400
   AMD ext features: 0x00010003
   AMD elvt0: 0x0001
SMP: AP CPU #1 Launched!
cpu1 AP:
Fa kFernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x441
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0x80ffb704
stack pointer		= 0x28:0xfe001b979d00
frame pointer		= 0x28:0xfe001b979d10
code segment		= base 0x0, limit 0xf, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 11 (idle: cpu2)
[ thread pid 11 tid 15 ]
Stopped at spinlock_exit+0x14: movq 0x440(%rbx),%rax
db> bt
Tracing pid 11 tid 15 td 0xf8000332c000
spinlock_exit() at spinlock_exit+0x14/frame 0xfe001b979d10
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b979d80
i8254_delay() at i8254_delay+0x143/frame 0xfe001b979dc0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b979e30
i8254_delay() at i8254_delay+0x143/frame 0xfe001b979e70
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b979ee0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b979f20
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b979f90
i8254_delay() at i8254_delay+0x143/frame 0xfe001b979fd0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97a040
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97a080
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97a0f0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97a130
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xa5/frame 0xfe001b97a1a0
i8254_delay() at i8254_delay+0x143/frame 0xfe001b97a1e0
[... repeats quite a lot ]
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x14a/frame 0xfe001b97cb90
i8254_delay() at i8254_delay+0x3a/frame 0xfe001b97cbd0
ns8250_putc() at ns8250_putc+0x2a/frame 0xfe001b97cc00
uart_cnputc() at uart_cnputc+0x47/frame 0xfe001b97cc20
cnputc() at cnputc+0x7d/frame 0xfe001b97cc50
cnputs() at cnputs+0x68/frame 0xfe001b97cc70
putchar() at putchar+0x14d/frame 0xfe001b97ccf0
kvprintf() at kvprintf+0x103d/frame 0xfe001b97cde0
vprintf() at vprintf+0x87/frame 0xfe001b97ceb0
printf() at printf+0x43/frame 0xfe001b97cf10
dblfault_handler() at dblfault_handler+0x26/frame 0xfe001b97cf30
Xdblfault() at Xdblfault+0xac/frame 0xfe001b97cf30
--- trap 0x17, rip = 0x81159ea5, rsp = 0xfe001b978000, rbp = 0xfe001b978030 ---
i8254_delay() at i8254_delay+0x35/frame 0xfe001b978030
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x14a/frame 0xfe001b9780a0
i8254_delay() at i8254_delay+0x3a/frame 0xfe001b9780e0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x14a/frame 0xfe001b978150
[... more similar calls repeating omitted ...]
spinlock_exit() at spinlock_exit+0x14/frame 0xfe001b9785d0
[... more _mtx_lock_spin_cookie calls ...]
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450 --- Comment #13 from Pablo Ruiz--- Hi again, Adding the following patch: diff --git a/sys/x86/x86/mp_x86.c b/sys/x86/x86/mp_x86.c index 7cc02d663bf..3cca61ca72e 100644 --- a/sys/x86/x86/mp_x86.c +++ b/sys/x86/x86/mp_x86.c @@ -925,6 +925,7 @@ init_secondary_tail(void) CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", cpuid); printf("SMP: AP CPU #%d Launched!\n", cpuid); +while(1); /* Determine if we are a logical CPU. */ if (cpu_info[PCPU_GET(apic_id)].cpu_hyperthread) I get into db while crashing: [...] cpu0 BSP: ID: 0x VER: 0x80050010 LDR: 0x DFR: 0x lint0: 0x00010700 lint1: 0x0400 TPR: 0x SVR: 0x01ff timer: 0x000100ef therm: 0x0001 err: 0x00f0 pmc: 0x00010400 AMD ext features: 0x00010003 AMD elvt0: 0x0001 SMP: AP CPU #1 Launched! kkkerneel trap 12 with interrupts disabled Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 02 fault virtual address = 0x0 fault code = supervisor write data, page not present instruction pointer = 0x20:0x80bb739d stack pointer = 0x28:0xfe001b9835b0 frame pointer = 0x28:0xfe001b983620 code segment= base 0x0, limit 0xf, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags= resume, IOPL = 0 current process = 11 (idle: cpu2) [ thread pid 11 tid 15 ] Stopped at putchar+0x15d: movb$0,(%rax) db> bt Tracing pid 11 tid 15 td 0xf8000332c000 putchar() at putchar+0x15d/frame 0xfe001b983620 db> show all procs pid ppid pgrp uid state wmesg wchancmd 11 0 0 0 RL (threaded) [idle] 13 CanRun [idle: cpu0] 14 CanRun [idle: cpu1] 15 CanRun [idle: cpu2] 16 CanRun [idle: cpu3] 1 0 0 0 ?L [kernel] 10 0 0 0 RL [audit] 0 0 0 0 RLs CPU 0 [swapper] db> show all pcpu Current CPU: 2 cpuid= 0 dynamic pcpu = 0x682000 curthread= 0x82883640: pid 0 "swapper" curpcb = 0x82c0ecc0 fpcurthread = none idlethread = 0xf8000332d000: tid 13 "idle: cpu0" curpmap = 0x828af188 tssp = 0x828ad510 commontssp = 0x828ad510 rsp0 = 0x82c0ecc0 gs32p= 0x828ad708 ldt = 0x828ad748 tss = 0x828ad738 cpuid= 1 
dynamic pcpu = 0xfe00993f1000 curthread= 0xf8000332c580: pid 11 "idle: cpu1" curpcb = 0 fpcurthread = none idlethread = 0xf8000332c580: tid 14 "idle: cpu1" curpmap = 0x828af188 tssp = 0x828ad578 commontssp = 0x828ad578 rsp0 = 0x0 gs32p= 0x828ad770 ldt = 0x828ad7b0 tss = 0x828ad7a0 cpuid= 2 dynamic pcpu = 0xfe00993f9000 curthread= 0xf8000332c000: pid 11 "idle: cpu2" curpcb = 0 fpcurthread = none idlethread = 0xf8000332c000: tid 15 "idle: cpu2" curpmap = 0x828af188 tssp = 0x828ad5e0 commontssp = 0x828ad5e0 rsp0 = 0x0 gs32p= 0x828ad7d8 ldt = 0x828ad818 tss = 0x828ad808 cpuid= 3 dynamic pcpu = 0xfe0099401000 curthread= 0xf8000332b580: pid 11 "idle: cpu3" curpcb = 0 fpcurthread = none idlethread = 0xf8000332b580: tid 16 "idle: cpu3" curpmap = 0x828af188 tssp = 0x828ad648 commontssp = 0x828ad648 rsp0 = 0x0 gs32p= 0x828ad840 ldt = 0x828ad880 tss = 0x828ad870 db> show all trace Tracing command idle pid 11 tid 13 td 0xf8000332d000 fork_trampoline() at fork_trampoline Tracing command idle pid 11 tid 14 td 0xf8000332c580 fork_trampoline() at fork_trampoline Tracing command idle pid 11 tid 15 td 0xf8000332c000 putchar() at putchar+0x15d/frame 0xfe001b983620 Tracing command idle pid 11 tid 16 td 0xf8000332b580 fork_trampoline() at fork_trampoline Tracing command kernel pid 1 tid 12 td 0xf8000332d580 fork_trampoline() at fork_trampoline Tracing command audit pid 10 tid 11 td 0xf8000332e000 fork_trampoline() at fork_trampoline Tracing command kernel pid 0 tid 10 td 0x82883640 KDB: reentering KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe001b982cc0 kdb_reenter() at kdb_reenter+0x2f/frame 0xfe001b982cd0 trap() at trap+0x4d/frame
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #11 from John Baldwin ---
Hmm, I don't know why the previous simple lock didn't help. One other possible thing to try is placing 'while (1);' infinite loop in the init_secondary_tail() function in sys/x86/x86/mp_x86.c and moving it around in the function to narrow down when the APs are triggering the double fault (which is a stack overflow). If you put the while(1) before the smp_cpus++; the failure mode you should see if the AP doesn't fault is a 'panic AP #x failed to start'. After the smp_cpus++ line you should at least no longer get the double fault panic if you haven't hit the double fault yet.

Another thought is that it might be there is a missing MFC in 11 related to one or more kthreads starting too early. You could perhaps build a kernel with:

options KTR_COMPILE=KTR_PROC
options KTR_MASK=KTR_PROC
options KTR_VERBOSE

And see what messages are logged before the crash (to see if the APs are starting to run other kthreads besides the idle thread).
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #10 from Pablo Ruiz ---
Hi,

Is there anything I can help with this weekend? :)
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #9 from Pablo Ruiz ---
Hi John,

I've tried your patch, but even after trying to boot quite a few times, I got no stack trace. All I got was:

[...]
MADT: Found IO APIC ID 17, Interrupt 56 at 0xfeafc000
ioapic2: Changing APIC ID to 17
ioapic2: WARNING: intbase 56 != expected base 55
ioapic2: ver 0x11 maxredir 0x06
MADT: Found IO APIC ID 14, Interrupt 24 at 0xfeaff000
ioapic3: WARNING: intbase 24 != expected base 63
ioapic3: ver 0x11 maxredir 0x17
MADT: Interrupt override: source 0, irq 2
ioapic0: Routing IRQ 0 -> intpin 2
MADT: Interrupt override: source 9, irq 9
ioapic0: intpin 9 trigger: level
ioapic3 irqs 24-47 on motherboard
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 48-54 on motherboard
ioapic2 irqs 56-62 on motherboard
cpu0 BSP:
     ID: 0x VER: 0x80050010 LDR: 0x DFR: 0x
  lint0: 0x00010700 lint1: 0x0400 TPR: 0x SVR: 0x01ff
  timer: 0x000100ef therm: 0x0001 err: 0x00f0 pmc: 0x00010400
   AMD ext features: 0x00010003
   AMD elvt0: 0x0001
SMP: AP CPU #1 Launched!
cpu1 AP:

Sometimes it gets to the point of printing 'Fa', or it just stops there without showing anything else. :(
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #8 from John Baldwin ---
So it looks like the panic is a double fault. Please try this hack patch to see if it cleans up the printfs:

Index: amd64/amd64/trap.c
===
--- amd64/amd64/trap.c	(revision 328557)
+++ amd64/amd64/trap.c	(working copy)
@@ -830,6 +830,11 @@
 void
 dblfault_handler(struct trapframe *frame)
 {
+
+	static int dblflt_lock = 0;
+
+	while (!atomic_cmpset_int(&dblflt_lock, 0, 1))
+		cpu_spinwait();
 #ifdef KDTRACE_HOOKS
 	if (dtrace_doubletrap_func != NULL)
 		(*dtrace_doubletrap_func)();

It won't fix the panic, but hopefully only one CPU will print out the messages so we can debug this further.
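The gate in the patch above lets exactly one CPU past a compare-and-set and parks every later one. A minimal user-space sketch of the same idea, using C11 atomics rather than the kernel's atomic_cmpset_int(), might look like this. The names toy_gate, toy_try_enter() and toy_enter() are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate, analogous to the static dblflt_lock in the patch. */
static atomic_int toy_gate;   /* 0 = free, 1 = taken; never released */

/* One compare-and-set attempt: true only for the caller that flips 0 -> 1. */
bool toy_try_enter(void)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&toy_gate, &expected, 1);
}

/* Blocking form, as in the patch: the first caller gets through,
 * every later caller spins forever. */
void toy_enter(void)
{
    while (!toy_try_enter())
        ;   /* the kernel patch calls cpu_spinwait() here */
}
```

Because the gate is never released, only the first caller's output can appear, which is the point of the hack: it does not fix the fault, it only keeps several CPUs from interleaving their panic messages.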
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #7 from Pablo Ruiz ---
Hi guys,

Do you need anything else from my side? May I help with some additional diagnosis?

Best Regards
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #6 from Pablo Ruiz ---
Created attachment 190117
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=190117&action=edit
verbose output of a kernel w/o EARLY_AP_STARTUP

For reference, attached is the verbose output of a kernel w/o EARLY_AP_STARTUP from this same system.
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450 --- Comment #5 from Pablo Ruiz--- Sure, Here's a "boot -v" for a MAXCPU=4 kernel: - OK boot kernel.test -v /boot/kernel.test/kernel text=0x16f3108 data=0xb66730+0x34fc98 syms=[0x8+0x1898e8+0x8+0x18a675] Booting... KDB: debugger backends: ddb KDB: current backend: ddb Table 'FACP' at 0xdbfb0290 Table 'APIC' at 0xdbfb0390 APIC: Found table at 0xdbfb0390 APIC: Using the MADT enumerator. MADT: Found CPU APIC ID 0 ACPI ID 1: enabled SMP: Added CPU 0 (AP) MADT: Found CPU APIC ID 1 ACPI ID 2: enabled SMP: Added CPU 1 (AP) MADT: Found CPU APIC ID 2 ACPI ID 3: enabled SMP: Added CPU 2 (AP) MADT: Found CPU APIC ID 3 ACPI ID 4: enabled SMP: Added CPU 3 (AP) MADT: Found CPU APIC ID 132 ACPI ID 5: disabled MADT: Found CPU APIC ID 133 ACPI ID 6: disabled MADT: Found CPU APIC ID 134 ACPI ID 7: disabled MADT: Found CPU APIC ID 135 ACPI ID 8: disabled Copyright (c) 1992-2017 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. 
FreeBSD 11.1-RELEASE-p6 #2 r313908+cac12d01a89(RELENG_2_4-EVI): Fri Jan 26 23:35:27 UTC 2018 root@builder:/usr/src/kernel/tmp/obj/usr/src/FreeBSD-src/sys/test amd64 FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0) Table 'FACP' at 0xdbfb0290 Table 'APIC' at 0xdbfb0390 Table 'SPCR' at 0xdbfb0440 Table 'SLIT' at 0xdbfb0490 Table 'SPMI' at 0xdbfb04c0 Table 'OEMB' at 0xdbfbe040 Table 'SRAT' at 0xdbfb7540 SRAT: Found table at 0xdbfb7540 SRAT: Found CPU APIC ID 0 domain 0: enabled SRAT: Found CPU APIC ID 1 domain 0: enabled SRAT: Found memory domain 0 addr 0x0 len 0xa: enabled SRAT: Found memory domain 0 addr 0x10 len 0xbff0: enabled SRAT: Found CPU APIC ID 2 domain 1: enabled SRAT: Found CPU APIC ID 3 domain 1: enabled SRAT: Found memory domain 1 addr 0xc000 len 0x1c00: enabled SRAT: Found memory domain 1 addr 0x1 len 0xa400: enabled Table 'FACP' at 0xdbfb0290 Table 'APIC' at 0xdbfb0390 Table 'SPCR' at 0xdbfb0440 Table 'SLIT' at 0xdbfb0490 SLIT: Found table at 0xdbfb0490 SLIT.Localities: 2 0: 10 10 1: 10 10 PPIM 0: PA=0xa, VA=0x82e1, size=0x1, mode=0 VT(vga): resolution 640x480 Preloaded elf kernel "/boot/kernel.test/kernel" at 0x82cbe000. Calibrating TSC clock ... 
TSC clock: 2992117590 Hz CPU: Dual-Core AMD Opteron(tm) Processor SE (2992.12-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x40f13 Family=0xf Model=0x41 Stepping=3 Features=0x178bfbff Features2=0x2001 AMD Features=0xea500800 AMD Features2=0x1f SVM: Features=0x0 Revision=1, ASIDs=64 L1 2MB data TLB: 8 entries, fully associative L1 2MB instruction TLB: 8 entries, fully associative L1 4KB data TLB: 32 entries, fully associative L1 4KB instruction TLB: 32 entries, fully associative L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative L2 2MB unified TLB: 0 entries, disabled/not present L2 4KB data TLB: 512 entries, 4-way associative L2 4KB instruction TLB: 512 entries, 4-way associative L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative real memory = 6442450944 (6144 MB) Physical memory chunk(s): 0x0001 - 0x00097fff, 557056 bytes (136 pages) 0x0010 - 0x001f, 1048576 bytes (256 pages) 0x02d0d000 - 0xd1422fff, 3463536640 bytes (845590 pages) 0xdbfae000 - 0xdbfa, 8192 bytes (2 pages) 0x0001 - 0x0001a3fe7fff, 2751365120 bytes (671720 pages) avail memory = 6190829568 (5904 MB) Event timer "LAPIC" quality 100 LAPIC: ipi_wait() us multiplier 72 (r 4115237 tsc 2992117590) ACPI APIC Table: Package ID shift: 1 L2 cache ID shift: 0 L1 cache ID shift: 0 Core ID shift: 0 INTR: Adding local APIC 1 as a target INTR: Adding local APIC 2 as a target INTR: Adding local APIC 3 as a target FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs FreeBSD/SMP: 2 package(s) x 2 core(s) Package HW ID = 0 (0) Core HW ID = 0 (0) CPU0 (BSP): APIC ID: 0 (0) Core HW ID = 1 (0x1) CPU1 (AP): APIC ID: 1 (0x1) Package HW ID = 1 (0x1) Core HW ID = 2 (0x2) CPU2 (AP): APIC ID: 2 (0x2) Core HW ID = 3 (0x3) CPU3 (AP): APIC ID: 3 (0x3) APIC: CPU 0 has ACPI ID 1 APIC: CPU 1 has ACPI ID 2 APIC: CPU 2 has ACPI ID 3 APIC: CPU 3 has ACPI ID 4 SRAT: CPU 0 has memory domain 0 SRAT: CPU 1 has memory 
domain 0 SRAT: CPU 2 has memory domain 1 SRAT: CPU 3 has memory domain 1 x86bios: IVT 0x00-0x0004ff at
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450

--- Comment #4 from John Baldwin ---
Can you get boot -v output?
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450 --- Comment #2 from Pablo Ruiz--- I've built a kernel with MAXCPU=2, and funnily enough it boots ok.. it only launches CPU#1 (along with #0), but it works ok.. This system has two physical CPUs, each with two cores. Dunno if this may be related.. Here is the output: --- OK boot kernel.test /boot/kernel.test/kernel text=0x16f1f28 data=0xb66738+0x34cc58 syms=[0x8+0x1898e8+0x8+0x18a675] Booting... KDB: debugger backends: ddb KDB: current backend: ddb Copyright (c) 1992-2017 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 11.1-RELEASE-p6 #0 r313908+b1692e611a8(TEST): Fri Jan 26 22:52:47 UTC 2018 root@pfsense-builder:/usr/src/kernel/tmp/obj/usr/src/kernel/tmp/FreeBSD-src/sys/test amd64 FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0) VT(vga): resolution 640x480 CPU: Dual-Core AMD Opteron(tm) Processor SE (2992.11-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x40f13 Family=0xf Model=0x41 Stepping=3 Features=0x178bfbff Features2=0x2001 AMD Features=0xea500800 AMD Features2=0x1f SVM: NAsids=64 real memory = 6442450944 (6144 MB) avail memory = 6190903296 (5904 MB) Event timer "LAPIC" quality 100 ACPI APIC Table: FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs FreeBSD/SMP: 2 package(s) x 2 core(s) FreeBSD/SMP Online: 1 package(s) x 2 core(s) ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 64/32 (20170303/tbfadt-748) ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe1Block: 128/64 (20170303/tbfadt-748) ioapic1: Changing APIC ID to 16 ioapic2: Changing APIC ID to 17 ioapic3 irqs 24-47 on motherboard ioapic0 irqs 0-23 on motherboard ioapic1 irqs 48-54 on motherboard ioapic2 irqs 56-62 on motherboard SMP: AP CPU #1 Launched! 
random: entropy device external interface wlan: mac acl policy registered kbd1 at kbdmux0 netmap: loaded module module_register_init: MOD_LOAD (vesa, 0x81155cc0, 0) error 19 nexus0 vtvga0: on motherboard cryptosoft0: on motherboard padlock0: No ACE support. acpi0: on motherboard acpi0: Power Button (fixed) unknown: I/O range not supported cpu0: on acpi0 cpu1: on acpi0 attimer0: port 0x40-0x43 irq 0 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 atrtc0: port 0x70-0x71 irq 8 on acpi0 Event timer "RTC" frequency 32768 Hz quality 0 hpet0: iomem 0xfed0-0xfed00fff on acpi0 Timecounter "HPET" frequency 2500 Hz quality 950 hpet1: iomem 0xfed1-0xfed10fff on acpi0 Timecounter "ACPI-safe" frequency 3579545 Hz quality 850 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x2008-0x200b on acpi0 pcib0: on acpi0 pcib0: _OSC returned error 0x10 pci0: on pcib0 pci0: at device 0.0 (no driver attached) isab0: at device 1.0 on pci0 isa0: on isab0 ohci0: mem 0xfe3ff000-0xfe3f irq 20 at device 2.0 on pci0 usbus0 on ohci0 usbus0: 12Mbps Full Speed USB v1.0 ehci0: mem 0xfe3fec00-0xfe3fecff irq 21 at device 2.1 on pci0 usbus1: EHCI version 1.0 usbus1 on ehci0 usbus1: 480Mbps High Speed USB v2.0 atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x9100-0x910f at device 6.0 on pci0 ata0: at channel 0 on atapci0 ata1: at channel 1 on atapci0 pcib1: at device 9.0 on pci0 pci1: on pcib1 vgapci0: port 0xa800-0xa8ff mem 0xfc00-0xfcff,0xfdbff000-0xfdbf irq 16 at device 3.0 on pci1 vgapci0: Boot video device nfe0: port 0xdc00-0xdc07 mem 0xfe3fd000-0xfe3fdfff irq 22 at device 10.0 on pci0 miibus0: on nfe0 e1000phy0: PHY 1 on miibus0 e1000phy0: none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseSX, 1000baseSX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow nfe0: Using defaults for TSO: 65518/35/2048 nfe0: Ethernet address: 00:14:4f:e5:3a:3c pcib2: at device 11.0 on pci0 pci2: on pcib2 
pcib3: at device 12.0 on pci0 pci3: on pcib3 pcib4: at device 13.0 on pci0 pci4: on pcib4 em0: port 0xbc00-0xbc1f mem 0xfdde-0xfddf,0xfddc-0xfddd irq 17 at device 0.0 on pci4 em0: Using an MSI interrupt em0: Ethernet address: 00:15:17:c3:df:7c em0: netmap queues/slots: TX 1/4096, RX 1/4096 em1: port 0xb800-0xb81f mem 0xfdd8-0xfdd9,0xfdd6-0xfdd7 irq 16 at device 0.1 on pci4 em1: Using an MSI interrupt em1: Ethernet address: 00:15:17:c3:df:7d em1: netmap queues/slots: TX 1/4096, RX 1/4096 pcib5: at device 14.0 on pci0 pci5: on pcib5 ix0:
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450 --- Comment #1 from John Baldwin --- Can you build a kernel with MAXCPU set to 2? That will reduce the garbage by having only one AP start up, so the panic stack trace is cleaner.
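For readers following along, here is a minimal sketch of a kernel configuration that caps the kernel at two CPUs, as requested in the comment above. The file name and the ident TEST are hypothetical, and the sketch assumes a stock amd64 source tree where the GENERIC configuration can be included; it is not the configuration the reporter actually used.

```
# Hypothetical file: /usr/src/sys/amd64/conf/TEST
# Start from the stock configuration (assumes GENERIC is present).
include GENERIC
# Name shown in the kernel version string; TEST is only an example.
ident   TEST
# Limit the kernel to two CPUs, so only one AP is started alongside the BSP.
options MAXCPU=2
```

A configuration like this would typically be built from /usr/src with `make buildkernel KERNCONF=TEST` and installed under a separate kernel name, so that the existing kernel stays bootable.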
[Bug 225450] 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225450 Bug ID: 225450 Summary: 11.1-* panics on AMD Opteron 2k due to EARLY_AP_STARTUP Product: Base System Version: 11.1-STABLE Hardware: amd64 OS: Any Status: New Severity: Affects Many People Priority: --- Component: kern Assignee: freebsd-bugs@FreeBSD.org Reporter: pablo.r...@gmail.com The EARLY_AP_STARTUP kernel option causes kernel panics on AMD Opteron 2xxxSE CPUs. This affects Sun X4100, X4200, etc. Commenting out EARLY_AP_STARTUP when building the kernel allows the system to boot. Safe mode boot works too because it disables SMP. An svn bisect was performed on 11-stable, and the addition of EARLY_AP_STARTUP at r318763 seems to be the culprit. This has been reproduced with 11.1-RELEASE, 11.1-RELEASE-p1, 11.1-RELEASE-p2, 11.1-RELEASE-p4 and 11.1-RELEASE-p6. Meanwhile, FreeBSD 10 boots and works fine. Example boot output with failing kernel (includes garbage): /boot/kernel.old/kernel text=0x14972f8 data=0x1384c0+0x4c15e8 syms=[0x8+0x15e8b0+0x8+0x178422] /boot/entropy size=0x1000 Booting... Copyright (c) 1992-2017 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 11.1-RELEASE #0 r321309: Fri Jul 21 02:08:28 UTC 2017 r...@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0) VT(vga): resolution 640x480 CPU: Dual-Core AMD Opteron(tm) Processor 2218 (2593.16-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x40f12 Family=0xf Model=0x41 Stepping=2 Features=0x178bfbff Features2=0x2001 AMD Features=0xea500800 AMD Features2=0x1f SVM: NAsids=64 real memory = 4563402752 (4352 MB) avail memory = 4104478720 (3914 MB) Event timer "LAPIC" quality 100 ACPI APIC Table: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs FreeBSD/SMP: 2 package(s) x 2 core(s) random: unblocking device.
ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 64/32 (20170303/tbfadt-748) ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe1Block: 128/64 (20170303/tbfadt-748) ioapic1: Changing APIC ID to 16 ioapic2: Changing APIC ID to 17 ioapic3 irqs 24-47 on motherboard ioapic0 irqs 0-23 on motherboard ioapic1 irqs 48-54 on motherboard ioapic2 irqs 56-62 on motherboard SMP: AP CPU SMP: AP CPU k F aF aFat F aF kkernel trap 12 with interrupts disabþÿÿÿÿ�ÿÿÿÿÿÿÿÿkernel trap 12 with interrupts disabled Fatal trap 60276736: UNÿÿkernel trap 12 with interrupts disabled iatal trap -2130508367: UNKNOWN whilpanierc: staeck ovlerflowt detrected; backt1race mawy bei corrup ted cpuid t= 1 KrDB: stack bupackttrace: #0 0xs80aada97 at lkdb_backtrace+0x67 x80a6bb76 aFat vpanic+0x1l86 #r2 0xfpff1f80a6b9e3 at panipc+0x43g #3e 0xa80a9b072u at __tstackw_chkh_lfail+0x12 i#4 0xff kff80eab3f2b eat vprintf+0 x10b 000 atcp dmapbase+0x397c000 Upt =ime: 12s Au;toma tic rebpoot inc 15 secondsi - press a key on 0the con sole fto abort -u-> Press a key on the iconsroule to reboot , -a-> or drswitech off the =system xnow. ke¡ÿÿÿÿÿÿÿÿkernel trap 12 with interrupts disabled Fatal trap 12: page fault while in kernel mode cð c ---
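The workaround described in the report (building without EARLY_AP_STARTUP) could be sketched as the kernel configuration below. The file name and the ident NOEARLYAP are hypothetical, and the sketch assumes that GENERIC on this branch enables the option, as the report implies. Using `nooptions` on top of an included GENERIC is assumed here to be equivalent to commenting out the option line in a copy of GENERIC, which is what the reporter describes.

```
# Hypothetical file: /usr/src/sys/amd64/conf/NOEARLYAP
# Start from the stock configuration (assumes GENERIC is present).
include   GENERIC
# Name shown in the kernel version string; NOEARLYAP is only an example.
ident     NOEARLYAP
# Drop the option that the report identifies as triggering the panic.
nooptions EARLY_AP_STARTUP
```

This only avoids the panic on the affected hardware; it does not address the underlying problem tracked in the bug.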