> Date: Wed, 6 Jan 2021 19:14:08 +0000
> From: Miod Vallat <[email protected]>
>
> I have been confused by a very strange issue on sparc64 over the last
> few days, and I can't figure out its cause.
>
> What happens is that cold boots work, but warm boots (i.e. rebooting)
> almost always fail like this:
>
> Rebooting with command: boot
> Boot device: disk File and args:
> OpenBSD IEEE 1275 Bootblock 2.1
> ..>> OpenBSD BOOT 1.20
> Trying bsd...
> NOTE: random seed is being reused.
> Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
> 9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c
> symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
> [ using 1562024 bytes of bsd ELF symbol table ]
> Fast Data Access MMU Miss
> ok
>
> However, sometimes, a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
> install or upgrade or even do nothing, then reboot to bsd). But most of
> the time it fails in the same way. And once it has failed, there does
> not seem to be a way to boot a kernel without having to poweroff
> (reset-all will not help).
>
> At first I thought this was a subtle relinking problem, but it isn't.
> From the prom, I have been able to get this trace:
> mtx_enter+0x58
> msgbuf_putchar+0x2c
> initmsgbuf+0x80
> pmap_bootstrap+0x140
> bootstrap+0x18c
>
> This is a on an Ultra 1, thus single-processor machine. The code for
> mtx_enter() is:
>
> void
> mtx_enter(struct mutex *mtx)
> {
>         struct cpu_info *ci = curcpu();
>
>         /* Avoid deadlocks after panic or in DDB */
>         if (panicstr || db_active)
>                 return;
>
>         WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
>             LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
>
> #ifdef DIAGNOSTIC
>         if (__predict_false(mtx->mtx_owner == ci))
>                 panic("mtx %p: locking against myself", mtx);
> #endif
>
>         if (mtx->mtx_wantipl != IPL_NONE)
>                 mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);
>
>         mtx->mtx_owner = ci;
>
> #ifdef DIAGNOSTIC
>         ci->ci_mutex_level++;
> #endif
>         WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> }
>
> and the "Fast Data Access MMU Miss" occurs on the
>         ci->ci_mutex_level++;
> line.
>
> It turns out that, being a single-processor kernel, ci == CPUINFO_VA ==
> 0xe0018000 (KERNEND + 64KB + 32KB).
>
> And the prom tells me that:
>
> ok e0018000 map?
> VA:e0018000
> G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0
> Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0
> Invalid
>
> Is this a PROM mapping which got invalidated by mistake, or a mapping
> which ought to have been set up by the boot blocks but is no longer set
> up correctly? I see no obvious change to blame about this in the last
> few releases.
>
> Any ideas on where to look or what to try to get to understand that
> problem better?

The per-CPU struct is mapped using a 64K locked TLB entry. That TLB
entry is installed by sun4u_bootstrap_cpu(), which gets called *after*
initmsgbuf() is called. So this issue was introduced when locking was
added to msgbuf_putchar().

Now the real question is why this doesn't crash on a cold boot as well.
I suspect that is because on a cold boot the buffer is clean and the
msgbuf_putchar() call in initmsgbuf() is skipped.

It may be possible to do some reordering in pmap_bootstrap(), but
frankly I think the locking added to msgbuf_putchar() was a mistake.
Or maybe the locking code should be bypassed when cold.
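
To make the "bypass when cold" idea concrete, here is a rough userland
model of it. This is only a sketch and not the kernel code: every
model_* name, the buffer layout, and the use of a NULL pointer to stand
in for the not-yet-mapped CPUINFO_VA are assumptions made for
illustration. Only the `cold` flag and the ci_mutex_level increment
mirror what is discussed above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct cpu_info { int ci_mutex_level; };
struct mutex { struct cpu_info *mtx_owner; };

#define MODEL_MSGBUF_SIZE 64

int cold = 1;                      /* nonzero during early boot */
struct cpu_info *model_curcpu;     /* NULL models the unmapped per-CPU VA */
struct mutex model_msgbuf_mtx;
char model_msgbuf[MODEL_MSGBUF_SIZE];
size_t model_msgbuf_pos;

/* Models the part of mtx_enter() that touches the per-CPU struct. */
void
model_mtx_enter(struct mutex *mtx)
{
	struct cpu_info *ci = model_curcpu;

	mtx->mtx_owner = ci;
	ci->ci_mutex_level++;      /* faults if the per-CPU struct is unmapped */
}

void
model_mtx_leave(struct mutex *mtx)
{
	struct cpu_info *ci = model_curcpu;

	mtx->mtx_owner = NULL;
	ci->ci_mutex_level--;
}

/* Models sun4u_bootstrap_cpu() making the per-CPU struct reachable. */
void
model_bootstrap_cpu(struct cpu_info *ci)
{
	model_curcpu = ci;
}

/* Sketch of the proposal: skip the mutex entirely while cold. */
void
model_msgbuf_putchar(char c)
{
	int locked = !cold;

	if (locked)
		model_mtx_enter(&model_msgbuf_mtx);

	model_msgbuf[model_msgbuf_pos] = c;
	model_msgbuf_pos = (model_msgbuf_pos + 1) % MODEL_MSGBUF_SIZE;

	if (locked)
		model_mtx_leave(&model_msgbuf_mtx);
}
```

In this model, calling model_msgbuf_putchar() before
model_bootstrap_cpu() is harmless as long as `cold` is still set, which
is the ordering that initmsgbuf() runs into in pmap_bootstrap(). Whether
skipping the lock that early is acceptable in the real kernel is a
separate question.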
