On Mon, Mar 16, 2026 at 01:19:36PM -0600, Theo de Raadt wrote:
> Jeremie Courreges-Anglas <[email protected]> wrote:
> 
> > On Mon, Mar 16, 2026 at 12:18:05PM -0600, Theo de Raadt wrote:
> > > I'm surprised at your proposal.
> > > 
> > > If this condition gets detected, why do you think it is fine to
> > > continue?  A kernel data structure is seriously corrupted.
> > 
> > I'm not saying it's fine, sorry if my mail was too long to read. ;)
> > 
> > 1. I'm not 100% sure the checks that trigger are correct, after all
> >   they're not using volatile reads.  Maaaaybe that's the bug but I
> >   have no idea right now.
> >   
> > 2. Kurt had posted this on ports@ earlier, then on bugs@, so far no
> >   one has a fix and you recently tagged 7.9.  This diff is an attempt
> >   to make kmos' and users life easier before next release.  Obviously
> >   everybody would be happier with a proper fix.  Maybe this admittedly
> >   incomplete fix will spark a discussion?
> 
> Maybe.
> 
> But you cannot delete that ddb enter.  You could replace it with a
> panic.  If you continue to run after that printf, the system will just
> crash in other unknown ways which are more difficult to debug.

We have proof that the system doesn't necessarily crash after that
message is printed.  kmos tested the db_enter removal yesterday
and confirmed that he got the message on the console without the
system crashing.  Using the diff below, I got this today on my LDOM's
console:

Apr  8 11:37:26 ports /bsd: ctx_free: context 1641 still active in dmmu
Apr  8 12:21:12 ports /bsd: ctx_free: context 7896 still active in dmmu
Apr  8 12:24:29 ports /bsd: ctx_free: context 3150 still active in dmmu
Apr  8 13:43:56 ports /bsd: ctx_free: context 4221 still active in dmmu
Apr  8 15:55:50 ports /bsd: ctx_free: context 1264 still active in dmmu
Apr  8 18:55:48 ports /bsd: ctx_free: context 5664 still active in dmmu

The system is running many loops of perl subprocesses in an attempt to
reproduce another bug:

  count=0; while perl t.pl; do count=$((count + 1)); done; echo $count

I have zero reason to believe that this is specific to perl.  E.g. it
may happen when building rust, which AFAIK doesn't use perl.

So I stand by my initial proposal (or the variant below).  I'm not
happy either with our partial understanding of this issue, and if
someone had a better fix, I'd be all for it.  BUT the db_enter() call
in -current and the upcoming 7.9 has so far done more harm than good.


Index: pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
diff -u -p -r1.127 pmap.c
--- pmap.c      14 Dec 2025 12:37:22 -0000      1.127
+++ pmap.c      7 Apr 2026 08:58:11 -0000
@@ -2597,11 +2597,10 @@ ctx_free(struct pmap *pm)
                db_enter();
        }
        for (i = 0; i < TSBENTS; i++) {
-               if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx ||
-                   TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
-                       printf("ctx_free: context %d still active\n", oldctx);
-                       db_enter();
-               }
+               if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx)
+                       printf("ctx_free: context %d still active in dmmu\n", oldctx);
+               if (TSB_TAG_CTX(tsb_immu[i].tag) == oldctx)
+                       printf("ctx_free: context %d still active in immu\n", oldctx);
        }
 #endif
        /* We should verify it has not been stolen and reallocated... */
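
For anyone who wants to poke at the reporting logic outside the kernel,
here is a minimal userland sketch of the per-TSB check in the diff above.
It is not the kernel code: the tag layout (TSB_TAG_CTX_SHIFT, CTX_MASK),
the toy TSBENTS value, struct tsb_entry and the ctx_check() helper are
all assumptions made for illustration.  Only the loop shape and the two
separate dmmu/immu messages mirror the diff.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Assumed tag layout, for illustration only: the context number sits
 * in the bits starting at TSB_TAG_CTX_SHIFT.  Check pte.h for the real
 * definitions before relying on any of this.
 */
#define TSB_TAG_CTX_SHIFT	48
#define CTX_MASK		((1 << 13) - 1)
#define TSB_TAG_CTX(t) \
	((int)(((uint64_t)(t) >> TSB_TAG_CTX_SHIFT) & CTX_MASK))

#define TSBENTS	8	/* toy size, the kernel value is much larger */

/* Hypothetical stand-in for the kernel's TSB entry type. */
struct tsb_entry {
	uint64_t tag;
	uint64_t data;
};

static struct tsb_entry tsb_dmmu[TSBENTS];
static struct tsb_entry tsb_immu[TSBENTS];

/*
 * Scan both TSBs for entries still tagged with oldctx.  Prints one
 * message per stale entry, naming the TSB it was found in, and returns
 * the number of hits so the caller can decide what to do (log, panic,
 * enter the debugger, ...).
 */
static int
ctx_check(int oldctx)
{
	int i, hits = 0;

	for (i = 0; i < TSBENTS; i++) {
		if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx) {
			printf("ctx_check: context %d still active in dmmu\n",
			    oldctx);
			hits++;
		}
		if (TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
			printf("ctx_check: context %d still active in immu\n",
			    oldctx);
			hits++;
		}
	}
	return hits;
}
```

Seeding tsb_dmmu[3].tag with (uint64_t)1641 << TSB_TAG_CTX_SHIFT and then
calling ctx_check(1641) prints a single dmmu line and returns 1.  Note
that a zeroed entry decodes to context 0 in this model, so it only makes
sense to test non-zero contexts.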



-- 
jca
