Kurt Mosiejczuk <[email protected]> wrote:

> On Wed, Apr 08, 2026 at 11:45:23PM +0200, Mark Kettenis wrote:
> > > Date: Wed, 8 Apr 2026 19:24:56 +0200
> > > From: Jeremie Courreges-Anglas <[email protected]>
> > >
> > > We have proof that the system doesn't necessarily crash after that
> > > message is printed. kmos tested the db_enter removal yesterday
> > > and confirmed that he got the message on the console without the
> > > system crashing. Using the diff below, I got this today on my LDOM's
> > > console:
> > >
> > > Apr 8 11:37:26 ports /bsd: ctx_free: context 1641 still active in dmmu
> > > Apr 8 12:21:12 ports /bsd: ctx_free: context 7896 still active in dmmu
> > > Apr 8 12:24:29 ports /bsd: ctx_free: context 3150 still active in dmmu
> > > Apr 8 13:43:56 ports /bsd: ctx_free: context 4221 still active in dmmu
> > > Apr 8 15:55:50 ports /bsd: ctx_free: context 1264 still active in dmmu
> > > Apr 8 18:55:48 ports /bsd: ctx_free: context 5664 still active in dmmu
> >
> > Sorry, but this is really bad. It means stale TSB entries have been
> > left behind and may be re-used when the context is re-used. And that
> > could lead to some serious memory corruption.
> >
> > If we want to paper over this issue, we should at least invalidate the
> > stale TSB entry. So something like:
> >
> > 	for (i = 0; i < TSBENTS; i++) {
> > 		tag = READ_ONCE(&tsb_dmmu[i].tag);
> > 		if (TSB_TAG_CTX(tag) == oldctx) {
> > 			atomic_cas_ulong(&tsb_dmmu[i].tag, tag,
> > 			    TSB_TAG_INVALID);
> > 			printf("ctx_free: context %d still active in dmmu\n",
> > 			    oldctx);
> > 		}
> > 		tag = READ_ONCE(&tsb_immu[i].tag);
> > 		if (TSB_TAG_CTX(tag) == oldctx) {
> > 			atomic_cas_ulong(&tsb_dmmu[i].tag, tag,
> > 			    TSB_TAG_INVALID);
> > 			printf("ctx_free: context %d still active in immu\n",
> > 			    oldctx);
> > 		}
> > 	}
>
> I'd definitely prefer something other than the existing db_enter
> "solution". My last full build I had to restart LDOMs at least 8 times
> over the course of the 5 day build. Every few times requires dropping to
> single user mode for manual fsck to repair filesystems.
>
> The current build is being run with jca's proposed patch and within 4
> hours of starting the build, one of the LDOMs already had this on its
> console:
>
> ctx_free: context 4575 still active in dmmu
>
> That's before getting to the really memory intensive parts of the
> package builds as the heavy C++ and rust builds have to wait for
> ports-gcc to finish building some 8-10 hours in.

Step back. This will get technical for a moment.

It is telling us that the context still has pages which reference it,
which, if I understand correctly, means memory / cache behaviour will be
violated.

I have this great diff I can show that makes sshd do a printf instead
of failing authentication. As a result, you manage to log in. Great!
It does not crash.

Do you understand? This panic indicates a serious low level problem.

You prefer it doesn't fail? But that bug really ties the room together,
doesn't it?
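
The hazard raised in the quoted text (a TSB entry that outlives its context and matches again once the context number is re-used) can be sketched with a toy model. This is an illustration under stated assumptions, not the sparc64 pmap code: the tag layout, the table size, and the names `tsb_insert`, `tsb_lookup` and `tsb_purge_ctx` are all hypothetical, and the table is a plain array rather than the real `tsb_dmmu`/`tsb_immu`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model. NOT the real sparc64 TSB layout. */
#define TSBENTS		8
#define TAG_CTX_SHIFT	48
#define TAG_VPN_MASK	((UINT64_C(1) << TAG_CTX_SHIFT) - 1)
#define TAG_INVALID	UINT64_MAX

struct tsb_entry {
	uint64_t tag;	/* context in the high bits, page number below */
	uint64_t data;	/* stands in for the translation (physical page) */
};

static struct tsb_entry tsb[TSBENTS];

uint64_t
make_tag(unsigned int ctx, uint64_t vpn)
{
	/* Keep real tags distinct from TAG_INVALID in this model. */
	assert(ctx < 0xffff);
	return ((uint64_t)ctx << TAG_CTX_SHIFT) | (vpn & TAG_VPN_MASK);
}

unsigned int
tag_ctx(uint64_t tag)
{
	return (unsigned int)(tag >> TAG_CTX_SHIFT);
}

/* Mark every slot invalid. */
void
tsb_reset(void)
{
	for (size_t i = 0; i < TSBENTS; i++) {
		tsb[i].tag = TAG_INVALID;
		tsb[i].data = 0;
	}
}

/* Direct-mapped insert: the page number selects the slot. */
void
tsb_insert(unsigned int ctx, uint64_t vpn, uint64_t data)
{
	struct tsb_entry *e = &tsb[vpn % TSBENTS];

	e->tag = make_tag(ctx, vpn);
	e->data = data;
}

/* Returns 1 and stores the translation on a hit, 0 on a miss. */
int
tsb_lookup(unsigned int ctx, uint64_t vpn, uint64_t *data)
{
	const struct tsb_entry *e = &tsb[vpn % TSBENTS];

	if (e->tag != make_tag(ctx, vpn))
		return 0;
	*data = e->data;
	return 1;
}

/* Invalidate every entry still tagged with ctx; returns how many. */
int
tsb_purge_ctx(unsigned int ctx)
{
	int n = 0;

	for (size_t i = 0; i < TSBENTS; i++) {
		if (tsb[i].tag != TAG_INVALID && tag_ctx(tsb[i].tag) == ctx) {
			tsb[i].tag = TAG_INVALID;
			n++;
		}
	}
	return n;
}
```

In this model, inserting a mapping for context 1641, "freeing" the context without a purge, and then looking up the same page under the re-used context number still hits and returns the old owner's translation. After `tsb_purge_ctx(1641)` the same lookup misses. The purge only removes the symptom; it says nothing about how the entry came to be left behind in the first place.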
