I'm surprised at your proposal. If this condition gets detected, why do you think it is fine to continue? A kernel data structure is seriously corrupted.

Jeremie Courreges-Anglas <[email protected]> wrote:

> On Fri, Feb 20, 2026 at 04:27:54AM -0500, Kurt Mosiejczuk wrote:
> > For a month or so I've been seeing panics when doing bulk builds
> > on sparc64. It is always of the form seen in the subject. I'm attaching
> > a representative dmesg of the LDOMs that make up the cluster along
> > with traces I've done of the crashes. I often had trouble tracing as
> > switching to another cpu would just hang. Thanks to jca for pointing
> > out that not all cpus may be in the stopped state and pointing out
> > how to avoid those. (Thus better traces as time goes on).
> >
> > It does not seem to be a hardware issue since it has happened on
> > the LDOMs on multiple T4-1s.
>
> > login: ctx_free: context 5422 still active
>
> Well thanks kmos for sending this report on bugs. It indeed doesn't
> look like a hardware issue to me. The panic^Wdb_enter() call has been
> added by Mark in the latest pmap.c commit. Quoting the commit
> message:
>
> revision 1.127
> date: 2025/12/14 12:37:22; author: kettenis; state: Exp; lines: +23 -5;
> commitid: QtkG6mGBOZVl6MLw;
> Protect the array that keeps track of which MMU contexts are in use with
> a mutex. Also disable the context stealing code. It isn't mpsafe and we
> should have more than enough MMU contexts to never need to steal one with
> the current (hard) limites on the number of processes.
>
> This enables some code that checks that a context that is being freed no
> longer has live entries in the TSB. This code is somewhat expensive so
> we may want to disable it again in the not too distant future.
>
> Since this db_enter() has been plaguing kmos' latest builds up to a
> point that some ports/packages were corrupted, I'd suggest that we
> disable the db_enter() now that we know that this error case can be
> hit. I've managed to hit this code path twice, months/weeks ago by
> building large ports on a T4-2 LDOM. I have ideas to test but I have
> just gotten my hands back on said LDOM and right now I can't even
> reproduce. But maybe kmos can give it a try, look for printfs and
> confirm that the system recovers when hitting such a condition.
>
> Thoughts? ok?
>
>
> Index: pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
> diff -u -p -r1.127 pmap.c
> --- pmap.c	14 Dec 2025 12:37:22 -0000	1.127
> +++ pmap.c	11 Mar 2026 22:39:23 -0000
> @@ -2600,7 +2600,6 @@ ctx_free(struct pmap *pm)
>  		if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx ||
>  		    TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
>  			printf("ctx_free: context %d still active\n", oldctx);
> -			db_enter();
>  		}
>  	}
>  #endif
>
>
> --
> jca
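
For anyone following along, the condition being discussed can be modelled in userland roughly as below. This is only a sketch of the kind of scan the quoted hunk performs, not the pmap.c code: the `model_` names, the 8-entry table size, and the tag layout (context number in the bits above bit 48, 13 bits wide) are all assumptions made up for illustration. The point it shows is that a freed context number still appearing in a TSB tag means a stale translation is still reachable under that number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; the real TSB layout differs. */
#define MODEL_TSB_ENTRIES	8
#define MODEL_CTX_SHIFT		48
#define MODEL_CTX_MASK		0x1fffULL

struct model_tsb_entry {
	uint64_t tag;	/* context number plus virtual address bits */
	uint64_t data;	/* translation data, unused in this sketch */
};

/* Extract the (assumed) context field from a tag. */
static inline int
model_tag_ctx(uint64_t tag)
{
	return (int)((tag >> MODEL_CTX_SHIFT) & MODEL_CTX_MASK);
}

/* Build a tag from a context number and some address bits. */
static inline uint64_t
model_make_tag(int ctx, uint64_t va_bits)
{
	uint64_t va_mask = (1ULL << MODEL_CTX_SHIFT) - 1;

	return ((uint64_t)ctx << MODEL_CTX_SHIFT) | (va_bits & va_mask);
}

/*
 * Count the slots where either the data or the instruction table
 * still carries the context that is being freed.  A slot that matches
 * in both tables is counted once, mirroring the single "||" test in
 * the quoted hunk.  An all-zero tag reads as context 0 in this model.
 */
static size_t
model_count_stale(const struct model_tsb_entry *dmmu,
    const struct model_tsb_entry *immu, size_t n, int ctx)
{
	size_t i, stale = 0;

	assert(dmmu != NULL && immu != NULL);
	for (i = 0; i < n; i++) {
		if (model_tag_ctx(dmmu[i].tag) == ctx ||
		    model_tag_ctx(immu[i].tag) == ctx)
			stale++;
	}
	return stale;
}
```

A nonzero count here corresponds to the "still active" message in the report. Once that context number is handed to another process, the leftover entries would be looked up under the new owner's number, which is why the condition matters beyond the message itself.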
