On Wed, Jan 06, 2021 at 07:14:08PM +0000, Miod Vallat wrote:
> I have been confused by a very strange issue on sparc64 over the last
> few days, and I can't figure out its cause.
> 
> What happens is that cold boots work, but warm boots (i.e. rebooting)
> almost always fail like this:
> 
> Rebooting with command: boot                                          
> Boot device: disk  File and args: 
> OpenBSD IEEE 1275 Bootblock 2.1
> ..>> OpenBSD BOOT 1.20
> Trying bsd...
> NOTE: random seed is being reused.
> Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
> 9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c 
> symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
> [ using 1562024 bytes of bsd ELF symbol table ]
> Fast Data Access MMU Miss
> ok 
> 
> However, sometimes, a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
> install or upgrade or even do nothing, then reboot to bsd). But most of
> the time it fails in the same way. And once it has failed, there does
> not seem to be a way to boot a kernel without having to poweroff
> (reset-all will not help).
> 
> At first I thought this was a subtle relinking problem, but it isn't.
> From the prom, I have been able to get this trace:
>       mtx_enter+0x58
>       msgbuf_putchar+0x2c
>       initmsgbuf+0x80
>       pmap_bootstrap+0x140
>       bootstrap+0x18c
> 
> This is on an Ultra 1, thus a single-processor machine. The code for
> mtx_enter() is:
> 
>       void
>       mtx_enter(struct mutex *mtx)
>       {
>               struct cpu_info *ci = curcpu();
> 
>               /* Avoid deadlocks after panic or in DDB */
>               if (panicstr || db_active)
>                       return;
> 
>               WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
>                   LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
> 
>       #ifdef DIAGNOSTIC
>               if (__predict_false(mtx->mtx_owner == ci))
>                       panic("mtx %p: locking against myself", mtx);
>       #endif
> 
>               if (mtx->mtx_wantipl != IPL_NONE)
>                       mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);
> 
>               mtx->mtx_owner = ci;
> 
>       #ifdef DIAGNOSTIC
>               ci->ci_mutex_level++;
>       #endif
>               WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
>       }
> 
> and the "Fast Data Access MMU Miss" occurs on the
>               ci->ci_mutex_level++;
> line.
> 
> It turns out that, being a single-processor kernel, ci == CPUINFO_VA ==
> 0xe0018000 (KERNEND + 64KB + 32KB).
> 
> And the prom tells me that:
> 
>       ok e0018000 map?
>       VA:e0018000 
>       G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0 
>       Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0 
>       Invalid
> 
> Is this a PROM mapping which got invalidated by mistake, or a mapping
> which ought to have been set up by the boot blocks but is no longer set
> up correctly? I see no obvious change to blame about this in the last
> few releases.
> 
> Any ideas on where to look or what to try to get to understand that
> problem better?

I saw something like this on a V120.

Booting /pci@1f,0/pci@1/scsi@8/disk@0,0:a/bsd.1105
9901216@0x1000000+2912@0x19714a0+191348@0x1c00000+4002956@0x1c2eb74
symbols @ 0xfee82400 479089+165+641136+442948 start=0x1000000
[ using 1564376 bytes of bsd ELF symbol table ]
Fast Data Access MMU Miss
ok

instead of
9910440@0x1000000+1880@0x19738a8+188572@0x1c00000+4005732@0x1c2e09c
symbols @ 0xfee80400 481463+165+641016+442891 start=0x1000000
[ using 1566568 bytes of bsd ELF symbol table ]
console is /pci@1f,0/pci@1,1/isa@7/serial@0,3f8
...

(This happened with bsd; it did not occur with bsd.rd.)

Booting a known good kernel was not enough to clear this state, and
neither was reset-all at the ok prompt.  I had to run power-off at the
ok prompt and then poweron at the lom prompt.

I think the window in which this started to occur is something like:

bad
OpenBSD 6.8-current (GENERIC) #510: Thu Oct 29 19:58:32 MDT 2020

good
OpenBSD 6.8-current (GENERIC) #508: Thu Oct 29 06:05:29 MDT 2020
