Kurt Mosiejczuk <[email protected]> wrote:

> On Wed, Apr 08, 2026 at 11:45:23PM +0200, Mark Kettenis wrote:
> > > Date: Wed, 8 Apr 2026 19:24:56 +0200
> > > From: Jeremie Courreges-Anglas <[email protected]>
> 
> > > We have proof that the system doesn't necessarily crash after that
> > > message is printed.  kmos tested the db_enter removal yesterday
> > > and confirmed that he got the message on the console without the
> > > system crashing.  Using the diff below, I got this today on my LDOM's
> > > console:
> 
> > > Apr  8 11:37:26 ports /bsd: ctx_free: context 1641 still active in dmmu
> > > Apr  8 12:21:12 ports /bsd: ctx_free: context 7896 still active in dmmu
> > > Apr  8 12:24:29 ports /bsd: ctx_free: context 3150 still active in dmmu
> > > Apr  8 13:43:56 ports /bsd: ctx_free: context 4221 still active in dmmu
> > > Apr  8 15:55:50 ports /bsd: ctx_free: context 1264 still active in dmmu
> > > Apr  8 18:55:48 ports /bsd: ctx_free: context 5664 still active in dmmu
> 
> > Sorry, but this is really bad.  It means stale TSB entries have been
> > left behind and may be re-used when the context is re-used.  And that
> > could lead to some serious memory corruption.
> 
> > If we want to paper over this issue, we should at least invalidate the
> > stale TSB entry.  So something like:
> 
> >     for (i = 0; i < TSBENTS; i++) {
> >             tag = READ_ONCE(&tsb_dmmu[i].tag);
> >             if (TSB_TAG_CTX(tag) == oldctx) {
> >                     atomic_cas_ulong(&tsb_dmmu[i].tag, tag, TSB_TAG_INVALID);
> >                     printf("ctx_free: context %d still active in dmmu\n", oldctx);
> >             }
> >             tag = READ_ONCE(&tsb_immu[i].tag);
> >             if (TSB_TAG_CTX(tag) == oldctx) {
> >                     atomic_cas_ulong(&tsb_immu[i].tag, tag, TSB_TAG_INVALID);
> >                     printf("ctx_free: context %d still active in immu\n", oldctx);
> >             }
> >     }
> 
> I'd definitely prefer something other than the existing db_enter
> "solution".  During my last full build I had to restart LDOMs at least
> 8 times over the course of the 5 day build. Every few restarts I have
> to drop to single user mode for a manual fsck to repair filesystems.
> 
> The current build is being run with jca's proposed patch and within 4
> hours of starting the build, one of the LDOMs already had this on its
> console:
> 
> ctx_free: context 4575 still active in dmmu
> 
> That's before getting to the really memory intensive parts of the
> package builds as the heavy C++ and rust builds have to wait for
> ports-gcc to finish building some 8-10 hours in.

Step back.  This will get technical for a moment.

It is telling us that the context still has pages which reference it,
which, if I understand correctly, will violate memory / cache
behaviour.
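
To make that concrete, here is a minimal userland sketch in C.  It is
a toy model, not the sparc64 pmap code: every name in it (toy_tsb,
toy_ctx_free, toy_lookup, the table size, the tag layout) is
hypothetical and invented for illustration.  It only shows the general
hazard, assuming a lookup table tagged by context number: if a context
is freed while an entry still carries its tag, the next owner of that
context number can hit the previous owner's entry.

```c
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
#define TOY_TSBENTS	8
#define TOY_TAG_INVALID	UINT64_MAX

struct toy_tsb_entry {
	uint64_t tag;	/* owning context number, or TOY_TAG_INVALID */
	uint64_t data;	/* stand-in for the cached translation */
};

static struct toy_tsb_entry toy_tsb[TOY_TSBENTS];

/* Mark every slot as unused. */
static void
toy_tsb_reset(void)
{
	int i;

	for (i = 0; i < TOY_TSBENTS; i++) {
		toy_tsb[i].tag = TOY_TAG_INVALID;
		toy_tsb[i].data = 0;
	}
}

/* Record a translation for ctx in the given slot. */
static void
toy_tsb_insert(int slot, uint64_t ctx, uint64_t data)
{
	toy_tsb[slot].tag = ctx;
	toy_tsb[slot].data = data;
}

/*
 * "Free" a context.  Returns how many entries were still tagged with
 * it.  With invalidate == 0 the stale entries are only counted (the
 * print-and-continue approach); with invalidate != 0 they are wiped.
 */
static int
toy_ctx_free(uint64_t ctx, int invalidate)
{
	int i, stale = 0;

	for (i = 0; i < TOY_TSBENTS; i++) {
		if (toy_tsb[i].tag != ctx)
			continue;
		stale++;
		if (invalidate)
			toy_tsb[i].tag = TOY_TAG_INVALID;
	}
	return stale;
}

/* Look up as the *next* owner of ctx would; 0 means a miss. */
static uint64_t
toy_lookup(uint64_t ctx)
{
	int i;

	for (i = 0; i < TOY_TSBENTS; i++)
		if (toy_tsb[i].tag == ctx)
			return toy_tsb[i].data;
	return 0;
}
```

In this model, after toy_tsb_insert(3, 1641, 0xdead000) a call to
toy_ctx_free(1641, 0) reports one stale entry and leaves it in place,
so a later toy_lookup(1641) still returns the old owner's data.
Invalidating removes that one entry, but says nothing about why it was
still there in the first place.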

I have this great diff I can show that makes sshd do a printf instead of
failing authentication.  As a result, you manage to login.  Great!  It
does not crash.

Do you understand?

This panic indicates a serious low level problem.

You prefer it doesn't fail?

But that bug really ties the room together, doesn't it?

