On Fri, Feb 20, 2026 at 04:27:54AM -0500, Kurt Mosiejczuk wrote:
> For a month or so I've been seeing panics when doing bulk builds
> on sparc64. It is always of the form seen in the subject. I'm attaching
> a representative dmesg of the LDOMs that make up the cluster along
> with traces I've done of the crashes. I often had trouble tracing as
> switching to another cpu would just hang. Thanks to jca for pointing
> out that not all cpus may be in the stopped state and pointing out
> how to avoid those. (Thus better traces as time goes on).
> 
> It does not seem to be a hardware issue since it has happened on
> the LDOMs on multiple T4-1s.

> login: ctx_free: context 5422 still active

Well thanks kmos for sending this report on bugs.  It indeed doesn't
look like a hardware issue to me.  The panic^Wdb_enter() call has been
added by Mark in the latest pmap.c commit.  Quoting the commit
message:

  revision 1.127
  date: 2025/12/14 12:37:22;  author: kettenis;  state: Exp;  lines: +23 -5;  commitid: QtkG6mGBOZVl6MLw;
  Protect the array that keeps track of which MMU contexts are in use with
  a mutex.  Also disable the context stealing code.  It isn't mpsafe and we
  should have more than enough MMU contexts to never need to steal one with
  the current (hard) limites on the number of processes.
  
  This enables some code that checks that a context that is being freed no
  longer has live entries in the TSB.  This code is somewhat expensive so
  we may want to disable it again in the not too distant future.

Since this db_enter() has been plaguing kmos' latest builds to the
point that some ports/packages were corrupted, I'd suggest that we
disable the db_enter() now that we know that this error case can be
hit.  I've managed to hit this code path twice, months/weeks ago, by
building large ports on a T4-2 LDOM.  I have ideas to test but I have
just gotten my hands back on said LDOM and right now I can't even
reproduce it.  But maybe kmos can give it a try, look for the printfs
and confirm that the system recovers when hitting such a condition.

Thoughts?  ok?


Index: pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
diff -u -p -r1.127 pmap.c
--- pmap.c      14 Dec 2025 12:37:22 -0000      1.127
+++ pmap.c      11 Mar 2026 22:39:23 -0000
@@ -2600,7 +2600,6 @@ ctx_free(struct pmap *pm)
                if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx ||
                    TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
                        printf("ctx_free: context %d still active\n", oldctx);
-                       db_enter();
                }
        }
 #endif
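
For readers without the tree at hand, here is a rough userland model of
the check that stays in place once the db_enter() is gone.  It is only a
sketch: the model_* names are made up for illustration, and the tag
layout (context number in bits 48..60) is an assumption, not copied from
the sparc64 headers.  The point is just that a stale entry now produces
one printf per hit and the scan carries on.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout, for illustration only: context in bits 48..60. */
#define MODEL_TAG_CTX_SHIFT	48
#define MODEL_TAG_CTX(t) \
	((int)(((t) >> MODEL_TAG_CTX_SHIFT) & 0x1fff))

/* Hypothetical stand-in for a TSB entry: a tag word and a data word. */
struct model_tsb_entry {
	uint64_t tag;
	uint64_t data;
};

/*
 * Walk both tables and report every slot whose data-side or
 * instruction-side tag still carries oldctx.  Returns the number of
 * slots reported; nothing here stops the caller from continuing.
 */
static int
model_ctx_scan(const struct model_tsb_entry *dmmu,
    const struct model_tsb_entry *immu, size_t n, int oldctx)
{
	int hits = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (MODEL_TAG_CTX(dmmu[i].tag) == oldctx ||
		    MODEL_TAG_CTX(immu[i].tag) == oldctx) {
			printf("ctx_free: context %d still active\n",
			    oldctx);
			hits++;
		}
	}
	return hits;
}
```

The scan is linear in the table size, which matches the commit
message's remark that the check is somewhat expensive.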


-- 
jca
