On Wed, Jan 06, 2021 at 07:14:08PM +0000, Miod Vallat wrote:
> I have been confused by a very strange issue on sparc64 over the last
> few days, and I can't figure out its cause.
>
> What happens is that cold boots work, but warm boots (i.e. rebooting)
> almost always fail like this:
>
> Rebooting with command: boot
> Boot device: disk File and args:
> OpenBSD IEEE 1275 Bootblock 2.1
> ..>> OpenBSD BOOT 1.20
> Trying bsd...
> NOTE: random seed is being reused.
> Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
> 9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c
> symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
> [ using 1562024 bytes of bsd ELF symbol table ]
> Fast Data Access MMU Miss
> ok
>
> However, sometimes, a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
> install or upgrade or even do nothing, then reboot to bsd). But most of
> the time it fails in the same way. And once it has failed, there does
> not seem to be a way to boot a kernel without having to poweroff
> (reset-all will not help).
>
> At first I thought this was a subtle relinking problem, but it isn't.
> From the prom, I have been able to get this trace:
> mtx_enter+0x58
> msgbuf_putchar+0x2c
> initmsgbuf+0x80
> pmap_bootstrap+0x140
> bootstrap+0x18c
>
> This is on an Ultra 1, thus a single-processor machine. The code for
> mtx_enter() is:
>
> void
> mtx_enter(struct mutex *mtx)
> {
> 	struct cpu_info *ci = curcpu();
>
> 	/* Avoid deadlocks after panic or in DDB */
> 	if (panicstr || db_active)
> 		return;
>
> 	WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
> 	    LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
>
> #ifdef DIAGNOSTIC
> 	if (__predict_false(mtx->mtx_owner == ci))
> 		panic("mtx %p: locking against myself", mtx);
> #endif
>
> 	if (mtx->mtx_wantipl != IPL_NONE)
> 		mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);
>
> 	mtx->mtx_owner = ci;
>
> #ifdef DIAGNOSTIC
> 	ci->ci_mutex_level++;
> #endif
> 	WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> }
>
> and the "Fast Data Access MMU Miss" occurs on the
> ci->ci_mutex_level++;
> line.
>
> It turns out that, being a single-processor kernel, ci == CPUINFO_VA ==
> 0xe0018000 (KERNEND + 64KB + 32KB).
>
> And the prom tells me that:
>
> ok e0018000 map?
> VA:e0018000
> G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0
> Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0
> Invalid
>
> Is this a PROM mapping which got invalidated by mistake, or a mapping
> which ought to have been set up by the boot blocks but is no longer set
> up correctly? I see no obvious change to blame for this in the last
> few releases.
>
> Any ideas on where to look or what to try to get to understand that
> problem better?
I saw something like this on a V120:

Booting /pci@1f,0/pci@1/scsi@8/disk@0,0:a/bsd.1105
9901216@0x1000000+2912@0x19714a0+191348@0x1c00000+4002956@0x1c2eb74
symbols @ 0xfee82400 479089+165+641136+442948 start=0x1000000
[ using 1564376 bytes of bsd ELF symbol table ]
Fast Data Access MMU Miss
ok

instead of

9910440@0x1000000+1880@0x19738a8+188572@0x1c00000+4005732@0x1c2e09c
symbols @ 0xfee80400 481463+165+641016+442891 start=0x1000000
[ using 1566568 bytes of bsd ELF symbol table ]
console is /pci@1f,0/pci@1,1/isa@7/serial@0,3f8
...

This happened with bsd; it did not occur with bsd.rd.

Booting a known good kernel was not enough to clear this state, and
neither was reset-all at the ok prompt. I had to power off at the ok
prompt and power on again at the lom prompt.

I think the window in which this started occurring is something like:

bad:
OpenBSD 6.8-current (GENERIC) #510: Thu Oct 29 19:58:32 MDT 2020
good:
OpenBSD 6.8-current (GENERIC) #508: Thu Oct 29 06:05:29 MDT 2020