I have been confused by a very strange issue on sparc64 over the last
few days, and I can't figure out its cause.
What happens is that cold boots work, but warm boots (i.e. rebooting)
almost always fail like this:
Rebooting with command: boot
Boot device: disk File and args:
OpenBSD IEEE 1275 Bootblock 2.1
..>> OpenBSD BOOT 1.20
Trying bsd...
NOTE: random seed is being reused.
Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c
symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
[ using 1562024 bytes of bsd ELF symbol table ]
Fast Data Access MMU Miss
ok
Occasionally a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
install or upgrade or even do nothing, then reboot to bsd), but most of
the time it fails in the same way. And once it has failed, there does
not seem to be any way to boot a kernel short of powering off
(reset-all does not help).
At first I thought this was a subtle relinking problem, but it isn't.
From the prom, I have been able to get this trace:
mtx_enter+0x58
msgbuf_putchar+0x2c
initmsgbuf+0x80
pmap_bootstrap+0x140
bootstrap+0x18c
This is on an Ultra 1, thus a single-processor machine. The code for
mtx_enter() is:
void
mtx_enter(struct mutex *mtx)
{
	struct cpu_info *ci = curcpu();

	/* Avoid deadlocks after panic or in DDB */
	if (panicstr || db_active)
		return;

	WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
	    LOP_EXCLUSIVE | LOP_NEWORDER, NULL);

#ifdef DIAGNOSTIC
	if (__predict_false(mtx->mtx_owner == ci))
		panic("mtx %p: locking against myself", mtx);
#endif

	if (mtx->mtx_wantipl != IPL_NONE)
		mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);

	mtx->mtx_owner = ci;

#ifdef DIAGNOSTIC
	ci->ci_mutex_level++;
#endif
	WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
}
and the "Fast Data Access MMU Miss" occurs on the
ci->ci_mutex_level++;
line.
It turns out that, being a single-processor kernel, ci == CPUINFO_VA ==
0xe0018000 (KERNEND + 64KB + 32KB).
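As a sanity check on that address, here is a minimal C sketch of the
arithmetic. The value used for KERNEND (0xe0000000) is an assumption
inferred from the figures above, not copied from the sparc64 headers,
and the names KERNEND_ASSUMED, cpuinfo_va() and print_cpuinfo_va() are
made up for this illustration. The 8KB page size is likewise assumed,
based on the PA[40:13] field the prom prints.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: inferred from 0xe0018000 - 64KB - 32KB, not from the headers. */
#define KERNEND_ASSUMED	UINT64_C(0xe0000000)
#define KB(n)		((uint64_t)(n) * 1024)

/* Hypothetical helper: the address the mail calls CPUINFO_VA. */
static uint64_t
cpuinfo_va(void)
{
	return KERNEND_ASSUMED + KB(64) + KB(32);
}

/* Print the address after checking that it sits on an (assumed) 8KB page. */
static void
print_cpuinfo_va(void)
{
	uint64_t va = cpuinfo_va();

	assert((va & (KB(8) - 1)) == 0);
	printf("CPUINFO_VA = 0x%llx\n", (unsigned long long)va);
}
```

Under these assumptions cpuinfo_va() returns 0xe0018000, which is the
address that ci->ci_mutex_level++ ends up touching.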
And the prom tells me that:
ok e0018000 map?
VA:e0018000
G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0
Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0
Invalid
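For anyone who wants to compare this against a raw TTE, here is a rough
C sketch that splits a 64-bit TTE data word into the fields the prom
prints. The bit positions are my recollection of the UltraSPARC I/II
TTE data layout and should be treated as an assumption to verify
against the CPU manual; struct tte_fields, bits() and tte_decode() are
hypothetical names, not kernel or PROM interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields shown by "map?". */
struct tte_fields {
	unsigned v, size, nfo, ie, soft2, diag;
	uint64_t pa;		/* physical address, PA[40:13] shifted back up */
	unsigned soft1, l, cp, cv, e, p, w, g;
};

/* Extract bits hi..lo (inclusive) of x. */
static uint64_t
bits(uint64_t x, unsigned hi, unsigned lo)
{
	unsigned width;

	assert(hi >= lo && hi < 64);
	width = hi - lo + 1;
	if (width == 64)
		return x;
	return (x >> lo) & ((UINT64_C(1) << width) - 1);
}

/* Assumed layout: V=63, Size=62:61, NFO=60, IE=59, Soft2=58:50,
 * Diag=49:41, PA=40:13, Soft1=12:7, L=6, CP=5, CV=4, E=3, P=2, W=1, G=0. */
static struct tte_fields
tte_decode(uint64_t data)
{
	struct tte_fields f;

	f.v = (unsigned)bits(data, 63, 63);
	f.size = (unsigned)bits(data, 62, 61);
	f.nfo = (unsigned)bits(data, 60, 60);
	f.ie = (unsigned)bits(data, 59, 59);
	f.soft2 = (unsigned)bits(data, 58, 50);
	f.diag = (unsigned)bits(data, 49, 41);
	f.pa = bits(data, 40, 13) << 13;
	f.soft1 = (unsigned)bits(data, 12, 7);
	f.l = (unsigned)bits(data, 6, 6);
	f.cp = (unsigned)bits(data, 5, 5);
	f.cv = (unsigned)bits(data, 4, 4);
	f.e = (unsigned)bits(data, 3, 3);
	f.p = (unsigned)bits(data, 2, 2);
	f.w = (unsigned)bits(data, 1, 1);
	f.g = (unsigned)bits(data, 0, 0);
	return f;
}
```

With this layout an all-zero data word decodes to V:0 and PA:0, which
is consistent with the "Invalid" entry shown above.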
Is this a PROM mapping which got invalidated by mistake, or a mapping
which ought to have been set up by the boot blocks but is no longer set
up correctly? I see no obvious change in the last few releases to blame
for this.
Any ideas on where to look, or what to try, to understand this problem
better?