> Date: Thu, 9 Apr 2026 10:30:17 +0200
> From: Claudio Jeker <[email protected]>
>
> On Wed, Apr 08, 2026 at 11:45:23PM +0200, Mark Kettenis wrote:
> > > Date: Wed, 8 Apr 2026 19:24:56 +0200
> > > From: Jeremie Courreges-Anglas <[email protected]>
> > >
> > > On Mon, Mar 16, 2026 at 01:19:36PM -0600, Theo de Raadt wrote:
> > > > Jeremie Courreges-Anglas <[email protected]> wrote:
> > > >
> > > > > On Mon, Mar 16, 2026 at 12:18:05PM -0600, Theo de Raadt wrote:
> > > > > > I'm surprised at your proposal.
> > > > > >
> > > > > > If this condition gets detected, why do you think it is fine to
> > > > > > continue? A kernel data structure is seriously corrupted.
> > > > >
> > > > > I'm not saying it's fine, sorry if my mail was too long to read. ;)
> > > > >
> > > > > 1. I'm not 100% sure the checks that trigger are correct, after all
> > > > > they're not using volatile reads. Maaaaybe that's the bug but I
> > > > > have no idea right now.
> > > > >
> > > > > 2. Kurt had posted this on ports@ earlier, then on bugs@, so far no
> > > > > one has a fix and you recently tagged 7.9. This diff is an attempt
> > > > > > to make kmos' and users' lives easier before next release. Obviously
> > > > > everybody would be happier with a proper fix. Maybe this admittedly
> > > > > incomplete fix will spark a discussion?
> > > >
> > > > Maybe.
> > > >
> > > > But you cannot delete that ddb enter. You could replace it with a
> > > > panic. If you continue to run after that printf, the system will just
> > > > crash in other unknown ways which are more difficult to debug.
> > >
> > > We have proof that the system doesn't necessarily crash after that
> > > message is printed. kmos tested the db_enter removal yesterday
> > > and confirmed that he got the message on the console without the
> > > system crashing. Using the diff below, I got this today on my LDOM's
> > > console:
> > >
> > > Apr 8 11:37:26 ports /bsd: ctx_free: context 1641 still active in dmmu
> > > Apr 8 12:21:12 ports /bsd: ctx_free: context 7896 still active in dmmu
> > > Apr 8 12:24:29 ports /bsd: ctx_free: context 3150 still active in dmmu
> > > Apr 8 13:43:56 ports /bsd: ctx_free: context 4221 still active in dmmu
> > > Apr 8 15:55:50 ports /bsd: ctx_free: context 1264 still active in dmmu
> > > Apr 8 18:55:48 ports /bsd: ctx_free: context 5664 still active in dmmu
> > >
> > > The system is running many loops of perl subprocesses in an attempt to
> > > reproduce another bug:
> > >
> > > count=0; while perl t.pl; do count=$((count + 1)); done; echo $count
> > >
> > > I have zero reason to believe that this is specific to perl. e.g. it
> > > may happen when building rust which AFAIK doesn't use perl.
> > >
> > > So I stand by my initial proposal (or the variant below). I'm not
> > > happy either with our partial understanding of this issue, and if
> > > someone had a better fix, I'd be all for it. BUT the db_enter() call
> > > in -current and next 7.9 has so far done more harm than good.
> >
> > Sorry, but this is really bad. It means stale TSB entries have been
> > left behind and may be re-used when the context is re-used. And that
> > could lead to some serious memory corruption.
>
> While this really indicates that we have yet another bug in the pmap code
> I think it is really hard to hit it. The TSB is horribly small and the
> system needs to cycle through 8k contexts before reuse. I think because of
> this the chance of this entry remaining in the TSB until reuse is very
> low.

This Murphy guy wants to have a word with you ;).

> That bug has probably been around for a very long time and was never
> noticed. Only because of this extra check busy systems hit this now.
>
> > If we want to paper over this issue, we should at least invalidate the
> > stale TSB entry. So something like:
> >
> > 	for (i = 0; i < TSBENTS; i++) {
> > 		tag = READ_ONCE(tsb_dmmu[i].tag);
> > 		if (TSB_TAG_CTX(tag) == oldctx) {
> > 			atomic_cas_ulong(&tsb_dmmu[i].tag, tag,
> > 			    TSB_TAG_INVALID);
> > 			printf("ctx_free: context %d still active in dmmu\n",
> > 			    oldctx);
> > 		}
> > 		tag = READ_ONCE(tsb_immu[i].tag);
> > 		if (TSB_TAG_CTX(tag) == oldctx) {
> > 			atomic_cas_ulong(&tsb_immu[i].tag, tag,
> > 			    TSB_TAG_INVALID);
> > 			printf("ctx_free: context %d still active in immu\n",
> > 			    oldctx);
> > 		}
> > 	}
>
> I agree that we should invalidate the entry. On top of this please extend
> the printf to show both the tag and data field of the entry.

Yeah, that wouldn't be a bad idea. As long as the printing doesn't
turn this into a DoS.

I can probably come up with a proper (and tested) diff this weekend.
But I don't mind if somebody beats me to it.
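
To make the idea concrete, here is a rough userland sketch of "invalidate
the stale entry, print tag and data, but cap the number of messages".
It is not the pmap.c code and has not been run on sparc64: the struct
layout, the tag encoding, the report limit and every sketch_/SKETCH_
name are assumptions made up for illustration, and C11 atomics stand in
for READ_ONCE() and atomic_cas_ulong().

```c
/*
 * Userland sketch only. The tag layout, sizes and all sketch_/SKETCH_
 * names are hypothetical; they are not the sparc64 pmap definitions.
 */
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_TSBENTS		512
#define SKETCH_TAG_INVALID	(UINT64_C(1) << 63)
#define SKETCH_CTX_SHIFT	48
#define SKETCH_CTX_MASK		UINT64_C(0x1fff)	/* 8k contexts */
#define SKETCH_TAG_CTX(t)	(((t) >> SKETCH_CTX_SHIFT) & SKETCH_CTX_MASK)
#define SKETCH_MAX_REPORTS	4	/* arbitrary cap on console output */

struct sketch_tte {
	_Atomic uint64_t tag;
	_Atomic uint64_t data;
};

/*
 * Invalidate every entry of one TSB that still carries ctx.
 * Returns the number of entries invalidated.  At most
 * SKETCH_MAX_REPORTS lines are printed in total across calls that
 * share the same *reports counter.
 */
int
sketch_tsb_scrub(struct sketch_tte *tsb, const char *name,
    unsigned int ctx, int *reports)
{
	int i, stale = 0;

	for (i = 0; i < SKETCH_TSBENTS; i++) {
		/* Read the tag exactly once, then work on the copy. */
		uint64_t tag = atomic_load(&tsb[i].tag);
		uint64_t data;

		if ((tag & SKETCH_TAG_INVALID) || SKETCH_TAG_CTX(tag) != ctx)
			continue;
		data = atomic_load(&tsb[i].data);

		/* Only invalidate if the tag is still the one we saw. */
		if (!atomic_compare_exchange_strong(&tsb[i].tag, &tag,
		    SKETCH_TAG_INVALID))
			continue;
		stale++;

		if (*reports < SKETCH_MAX_REPORTS) {
			(*reports)++;
			printf("ctx_free: context %u still active in %s: "
			    "tag 0x%016" PRIx64 " data 0x%016" PRIx64 "\n",
			    ctx, name, tag, data);
		}
	}
	return stale;
}
```

A caller would run this once for the dmmu table and once for the immu
table, passing the same counter so the limit covers both. Whether a
simple cap like this is enough to keep a busy machine's console usable
is a guess; a time-based limit might be the better choice.
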
> > > Index: pmap.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
> > > diff -u -p -r1.127 pmap.c
> > > --- pmap.c 14 Dec 2025 12:37:22 -0000 1.127
> > > +++ pmap.c 7 Apr 2026 08:58:11 -0000
> > > @@ -2597,11 +2597,10 @@ ctx_free(struct pmap *pm)
> > > db_enter();
> > > }
> > > for (i = 0; i < TSBENTS; i++) {
> > > - if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx ||
> > > - TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
> > > - printf("ctx_free: context %d still active\n", oldctx);
> > > - db_enter();
> > > - }
> > > + if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx)
> > > + printf("ctx_free: context %d still active in dmmu\n",
> > > + oldctx);
> > > + if (TSB_TAG_CTX(tsb_immu[i].tag) == oldctx)
> > > + printf("ctx_free: context %d still active in immu\n",
> > > + oldctx);
> > > }
> > > #endif
> > > /* We should verify it has not been stolen and reallocated... */
> > >
> > >
> > >
> > > --
> > > jca
> > >
> >
>
> --
> :wq Claudio
>