A sparc oddity (hair-pulling bug)

Miod Vallat Wed, 06 Jan 2021 11:35:52 -0800

I have been confused by a very strange issue on sparc64 over the last
few days, and I can't figure out its cause.


What happens is that cold boots work, but warm boots (i.e. rebooting)
almost always fail like this:

Rebooting with command: boot                                          
Boot device: disk  File and args: 
OpenBSD IEEE 1275 Bootblock 2.1
..>> OpenBSD BOOT 1.20
Trying bsd...
NOTE: random seed is being reused.
Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c 
symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
[ using 1562024 bytes of bsd ELF symbol table ]
Fast Data Access MMU Miss
ok 

However, sometimes, a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
install or upgrade or even do nothing, then reboot to bsd). But most of
the time it fails in the same way. And once it has failed, there does
not seem to be a way to boot a kernel without having to poweroff
(reset-all will not help).

At first I thought this was a subtle relinking problem, but it isn't.
From the prom, I have been able to get this trace:
        mtx_enter+0x58
        msgbuf_putchar+0x2c
        initmsgbuf+0x80
        pmap_bootstrap+0x140
        bootstrap+0x18c

This is on an Ultra 1, thus a single-processor machine. The code for
mtx_enter() is:

        void
        mtx_enter(struct mutex *mtx)
        {
                struct cpu_info *ci = curcpu();

                /* Avoid deadlocks after panic or in DDB */
                if (panicstr || db_active)
                        return;

                WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
                    LOP_EXCLUSIVE | LOP_NEWORDER, NULL);

        #ifdef DIAGNOSTIC
                if (__predict_false(mtx->mtx_owner == ci))
                        panic("mtx %p: locking against myself", mtx);
        #endif

                if (mtx->mtx_wantipl != IPL_NONE)
                        mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);

                mtx->mtx_owner = ci;

        #ifdef DIAGNOSTIC
                ci->ci_mutex_level++;
        #endif
                WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
        }

and the "Fast Data Access MMU Miss" occurs on the
                ci->ci_mutex_level++;
line.

It turns out that, this being a single-processor kernel, ci ==
CPUINFO_VA == 0xe0018000 (KERNEND + 64KB + 32KB).
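
The address arithmetic above can be checked with a small sketch. The
value of KERNEND is not quoted in this message; it is only inferred here
by working backwards from 0xe0018000, and the helper names are
hypothetical, not taken from the sparc64 headers.

```c
#include <stdint.h>

#define KB			1024ULL
/* Value quoted in the message for the single-processor case. */
#define CPUINFO_VA_QUOTED	0xe0018000ULL

/*
 * Assumption: KERNEND is inferred from the quoted figures
 * (CPUINFO_VA == KERNEND + 64KB + 32KB), not read from a header.
 */
uint64_t
implied_kernend(void)
{
	return CPUINFO_VA_QUOTED - (64 * KB + 32 * KB);
}

/* Offset of a virtual address within an 8KB page. */
uint64_t
page_offset_8k(uint64_t va)
{
	return va & (8 * KB - 1);
}
```

Under these assumptions implied_kernend() yields 0xe0000000, and
page_offset_8k(CPUINFO_VA_QUOTED) is 0, so the faulting address is the
first byte of an 8KB page.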

And the prom tells me that:

        ok e0018000 map?
        VA:e0018000 
        G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0 
        Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0 
        Invalid
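
For reading such output, here is a minimal sketch of decoding a 64-bit
TTE data word into some of the fields the prom prints. The bit positions
are an assumption: they follow the order of the fields in the output
above (G in bit 0 up to V in bit 63, with PA in bits 40:13) and should
be checked against the processor manual. The struct and function names
are hypothetical.

```c
#include <stdint.h>

/* A subset of the fields printed by "map?"; names follow that output. */
struct tte_fields {
	unsigned v;	/* valid, assumed bit 63 */
	unsigned size;	/* page size code, assumed bits 62:61 */
	uint64_t pa;	/* physical address, from PA[40:13] */
	unsigned l;	/* lock, assumed bit 6 */
	unsigned cp;	/* cacheable physical, assumed bit 5 */
	unsigned cv;	/* cacheable virtual, assumed bit 4 */
	unsigned p;	/* privileged, assumed bit 2 */
	unsigned w;	/* writable, assumed bit 1 */
	unsigned g;	/* global, assumed bit 0 */
};

struct tte_fields
tte_decode(uint64_t data)
{
	struct tte_fields f;

	f.v = (unsigned)((data >> 63) & 1);
	f.size = (unsigned)((data >> 61) & 3);
	/* Keep bits 40:13 in place so the result is a page-aligned PA. */
	f.pa = data & 0x000001ffffffe000ULL;
	f.l = (unsigned)((data >> 6) & 1);
	f.cp = (unsigned)((data >> 5) & 1);
	f.cv = (unsigned)((data >> 4) & 1);
	f.p = (unsigned)((data >> 2) & 1);
	f.w = (unsigned)((data >> 1) & 1);
	f.g = (unsigned)(data & 1);
	return f;
}
```

With this layout, an all-zero data word decodes to V == 0, which is
consistent with the prom reporting the entry as Invalid.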

Is this a PROM mapping which got invalidated by mistake, or a mapping
which ought to have been set up by the boot blocks but is no longer set
up correctly? I see no obvious change to blame for this in the last
few releases.

Any ideas on where to look, or what to try, to understand this problem
better?
