Re: A sparc oddity (hair-pulling bug)

Mark Kettenis Wed, 06 Jan 2021 14:05:00 -0800

> Date: Wed, 6 Jan 2021 19:14:08 +0000
> From: Miod Vallat <[email protected]>
> 
> I have been confused by a very strange issue on sparc64 over the last
> few days, and I can't figure out its cause.
> 
> What happens is that cold boots work, but warm boots (i.e. rebooting)
> almost always fail like this:
> 
> Rebooting with command: boot                                          
> Boot device: disk  File and args: 
> OpenBSD IEEE 1275 Bootblock 2.1
> ..>> OpenBSD BOOT 1.20
> Trying bsd...
> NOTE: random seed is being reused.
> Booting /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a/bsd
> 9956944@0x1000000+4528@0x197ee50+191788@0x1c00000+4002516@0x1c2ed2c 
> symbols @ 0xffde0400 476705+165+641112+443003 start=0x1000000
> [ using 1562024 bytes of bsd ELF symbol table ]
> Fast Data Access MMU Miss
> ok 
> 
> However, sometimes, a bsd.rd -> bsd transition works (i.e. boot bsd.rd,
> install or upgrade or even do nothing, then reboot to bsd). But most of
> the time it fails in the same way. And once it has failed, there does
> not seem to be a way to boot a kernel without having to poweroff
> (reset-all will not help).
> 
> At first I thought this was a subtle relinking problem, but it isn't.
> From the prom, I have been able to get this trace:
>       mtx_enter+0x58
>       msgbuf_putchar+0x2c
>       initmsgbuf+0x80
>       pmap_bootstrap+0x140
>       bootstrap+0x18c
> 
> This is on an Ultra 1, thus a single-processor machine. The code for
> mtx_enter() is:
> 
>       void
>       mtx_enter(struct mutex *mtx)
>       {
>               struct cpu_info *ci = curcpu();
> 
>               /* Avoid deadlocks after panic or in DDB */
>               if (panicstr || db_active)
>                       return;
> 
>               WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
>                   LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
> 
>       #ifdef DIAGNOSTIC
>               if (__predict_false(mtx->mtx_owner == ci))
>                       panic("mtx %p: locking against myself", mtx);
>       #endif
> 
>               if (mtx->mtx_wantipl != IPL_NONE)
>                       mtx->mtx_oldipl = splraise(mtx->mtx_wantipl);
> 
>               mtx->mtx_owner = ci;
> 
>       #ifdef DIAGNOSTIC
>               ci->ci_mutex_level++;
>       #endif
>               WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
>       }
> 
> and the "Fast Data Access MMU Miss" occurs on the
>               ci->ci_mutex_level++;
> line.
> 
> It turns out that, being a single-processor kernel, ci == CPUINFO_VA ==
> 0xe0018000 (KERNEND + 64KB + 32KB).
> 
> And the prom tells me that:
> 
>       ok e0018000 map?
>       VA:e0018000 
>       G:0 W:0 P:0 E:0 CV:0 CP:0 L:0 Soft1:0 PA[40:13]:0 PA:0 
>       Diag:0 Soft2:0 IE:0 NFO:0 Size:0 V:0 
>       Invalid
> 
> Is this a PROM mapping which got invalidated by mistake, or a mapping
> which ought to have been set up by the boot blocks but is no longer set
> up correctly? I see no obvious change to blame for this in the last
> few releases.
> 
> Any ideas on where to look or what to try to get to understand that
> problem better?


The per-CPU struct is mapped using a 64K locked TLB entry.  That TLB
entry is installed by sun4u_bootstrap_cpu(), which gets called *after*
initmsgbuf() is called.  So this issue was introduced when locking was
added to msgbuf_putchar().

Now the real question is why this doesn't crash on a cold boot as
well.  I suspect that is because on a cold boot the buffer is clean and
the msgbuf_putchar() call in initmsgbuf() is skipped.

It may be possible to do some reordering in pmap_bootstrap(), but
frankly I think the locking added to msgbuf_putchar() was a mistake.
Or maybe the locking code should be bypassed when cold.
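
To make the last option concrete, here is a rough user-space model of
bypassing the lock while cold.  It is only a sketch, not the kernel
code: every *_model name, the lock_acquisitions counter and the
percpu_mapping pointer are made up for illustration, and the assert()
in mtx_enter_model() merely stands in for the MMU miss.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cpu_info and struct mutex. */
struct cpu_info_model { int ci_mutex_level; };
struct mutex_model { struct cpu_info_model *mtx_owner; };

int cold = 1;                           /* nonzero during early boot */
struct cpu_info_model the_cpu;
struct cpu_info_model *percpu_mapping;  /* NULL until "mapped" */

struct mutex_model msgbuf_mtx;
char msgbuf_model[64];
size_t msgbuf_len;
int lock_acquisitions;                  /* bookkeeping for the demo only */

/* Models the step that installs the per-CPU mapping. */
void
bootstrap_cpu_model(void)
{
	percpu_mapping = &the_cpu;
}

void
mtx_enter_model(struct mutex_model *mtx)
{
	/* Touching the per-CPU struct before it is mapped is the fault. */
	assert(percpu_mapping != NULL);
	mtx->mtx_owner = percpu_mapping;
	percpu_mapping->ci_mutex_level++;
	lock_acquisitions++;
}

void
mtx_leave_model(struct mutex_model *mtx)
{
	percpu_mapping->ci_mutex_level--;
	mtx->mtx_owner = NULL;
}

/* Assumed shape of the fix: skip the mutex entirely while cold. */
void
msgbuf_putchar_model(char c)
{
	int locked = !cold;

	if (locked)
		mtx_enter_model(&msgbuf_mtx);
	if (msgbuf_len < sizeof(msgbuf_model))
		msgbuf_model[msgbuf_len++] = c;
	if (locked)
		mtx_leave_model(&msgbuf_mtx);
}
```

In this model, calling msgbuf_putchar_model() before
bootstrap_cpu_model() never touches the per-CPU struct, so the early
call from initmsgbuf() would no longer depend on the TLB entry being
there.  Whether skipping the lock is acceptable for every caller that
runs while cold is a separate question the sketch does not answer.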
