On 30/04/25 13:00, Dave Hansen wrote:
> On 4/30/25 12:42, Steven Rostedt wrote:
>>> Look at the syscall code for instance:
>>>
>>>> SYM_CODE_START(entry_SYSCALL_64)
>>>>         swapgs
>>>>         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>>>         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>>> and that as long as PER_CPU_VAR()s and the process stack don't have
>>> their mappings munged and flushes deferred that this would be correct.
>> Hmm, so there is still a path for this?
>>
>> At least if it added more ways to debug it, and some other changes to make
>> the locations where vmalloc is dangerous smaller?
>
> Being able to debug it would be a good start. But, more generally, what
> we need is for more people to be able to run the code in the first
> place. Would a _normal_ system (without setups that are trying to do
> NOHZ_FULL) ever be able to defer TLB flush IPIs?
>
> If the answer is no, then, yeah, I'll settle for some debugging options.
>
> But if you shrink the window as small as I'm talking about, it would
> look very different from this series.
>
> For instance, imagine when a CPU goes into the NOHZ mode. Could it just
> unconditionally flush the TLB on the way back into the kernel (in the
> same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
> expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
> point of a NOHZ_FULL task is to minimize the number of kernel entries
> then a little extra overhead there doesn't sound too bad.
>

Right, so my thought per your previous comments was to special-case the
TLB flush, depend on kPTI and do it unconditionally in SWITCH_TO_KERNEL_CR3
just like you've described - but keep the context tracking mechanism for
other deferrable operations.

My gripe with that was having two separate mechanisms:
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking

Shifting everything to SWITCH_TO_KERNEL_CR3 means we lose the
context_tracking infra to dynamically defer operations (atomically reading
and writing to context_tracking.state), which means we unconditionally run
all possible deferrable operations on every kernel entry. This doesn't
scream scalable, even though as you say NOHZ_FULL kernel entry is already a
"you lose" situation.

Yet another option is to duplicate the context tracking state specifically
for IPI deferral and have it driven in/by SWITCH_TO_KERNEL_CR3, which is
also not super savoury.

I suppose I can start poking around running deferred ops in that
SWITCH_TO_KERNEL_CR3 region, and add state/infra on top. Let's see where
this gets me :-)

Again, thanks for the insight and the suggestions, Dave!

> Also, about the new hardware, I suspect there's some mystery customer
> lurking in the shadows asking folks for this functionality. Could you at
> least go _talk_ to the mystery customer(s) and see which hardware they
> care about? They might already even have the magic CPUs they need for
> this, or have them on the roadmap. If they've got Intel CPUs, I'd be
> happy to help figure it out.
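
Going back to the deferral options rather than the hardware question: to
make the trade-off concrete, here is a rough user-space model in C11
atomics. It is not kernel code. The bit layout, struct cpu_model and every
function name are made up for illustration; only the idea of a per-CPU
state word (standing in for context_tracking.state) and the "defer
dynamically" vs. "run everything unconditionally on entry" split come from
the discussion above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: bit 0 = CPU is in the kernel, other bits = deferred work. */
#define CT_IN_KERNEL   0x1u
#define WORK_TLB_FLUSH 0x2u
#define WORK_SYNC_CORE 0x4u
#define WORK_MASK      (WORK_TLB_FLUSH | WORK_SYNC_CORE)

struct cpu_model {
	_Atomic unsigned int state;	/* stand-in for context_tracking.state */
	unsigned int tlb_flushes;	/* how often each operation actually ran */
	unsigned int core_syncs;
	unsigned int ipis;		/* how often the CPU had to be interrupted */
};

static void run_work(struct cpu_model *cpu, unsigned int work)
{
	if (work & WORK_TLB_FLUSH)
		cpu->tlb_flushes++;
	if (work & WORK_SYNC_CORE)
		cpu->core_syncs++;
}

/* Remote side: set the work bit only while the target is still in userspace. */
static bool try_defer_work(struct cpu_model *cpu, unsigned int work)
{
	unsigned int old = atomic_load(&cpu->state);

	assert((work & ~WORK_MASK) == 0);
	do {
		if (old & CT_IN_KERNEL)
			return false;	/* too late to defer, caller must IPI */
	} while (!atomic_compare_exchange_weak(&cpu->state, &old, old | work));
	return true;
}

static void request_work(struct cpu_model *cpu, unsigned int work)
{
	if (!try_defer_work(cpu, work)) {
		cpu->ipis++;
		run_work(cpu, work);	/* model: the IPI handler runs it right away */
	}
}

/* Dynamic deferral: entry runs only what was actually queued. */
static void kernel_entry_dynamic(struct cpu_model *cpu)
{
	unsigned int old = atomic_exchange(&cpu->state, CT_IN_KERNEL);

	run_work(cpu, old & WORK_MASK);
}

/* Unconditional variant: entry runs every deferrable operation, queued or not. */
static void kernel_entry_unconditional(struct cpu_model *cpu)
{
	atomic_store(&cpu->state, CT_IN_KERNEL);
	run_work(cpu, WORK_MASK);
}

static void kernel_exit(struct cpu_model *cpu)
{
	atomic_store(&cpu->state, 0);
}
```

In this model the dynamic variant pays for an atomic exchange on entry but
only runs what a remote CPU queued, while the unconditional variant needs
no shared state on the entry path and instead runs the full set every
time. That is the scalability worry: each new deferrable operation adds
cost to every entry, whether or not anyone asked for it.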