On Wed, 30 Apr 2025 11:07:35 -0700 Dave Hansen <dave.han...@intel.com> wrote:
> On 4/30/25 10:20, Steven Rostedt wrote:
> > On Tue, 29 Apr 2025 09:11:57 -0700
> > Dave Hansen <dave.han...@intel.com> wrote:
> >
> >> I don't think we should do this series.
> >
> > Could you provide more rationale for your decision.
>
> I talked about it a bit in here:
>
> > https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e...@intel.com/

Hmm, that's easily missed. But thanks for linking it.

>
> But, basically, this series puts a new onus on the entry code: it can't
> touch the vmalloc() area ... except the LDT ... and except the PEBS
> buffers. If anyone touches vmalloc()'d memory (or anything else that
> eventually gets deferred), they crash. They _only_ crash on these
> NOHZ_FULL systems.
>
> Putting new restrictions on the entry code is really nasty. Let's say a
> new hardware feature showed up that touched vmalloc()'d memory in the
> entry code. Probably, nobody would notice until they got that new
> hardware and tried to do a NOHZ_FULL workload. It might take years to
> uncover, once that hardware was out in the wild.
>
> I have a substantial number of gray hairs from dealing with corner cases
> in the entry code.
>
> You _could_ make it more debuggable. Could you make this work for all
> tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be
> inefficient, but would provide good debugging coverage.
>
> I also mentioned this earlier, but PTI could be leveraged here to ensure
> that the TLB is flushed properly. You could have the rule that anything
> mapped into the user page table can't have a deferred flush and then do
> deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in
> arch-specific assembly, but it's a million times easier to reason about
> because the window where a deferred-flush allocation might bite you is
> so small.
>
> Look at the syscall code for instance:
>
> > SYM_CODE_START(entry_SYSCALL_64)
> >         swapgs
> >         movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
> >         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>
> You can _trivially_ audit this and know that swapgs doesn't touch memory
> and that as long as PER_CPU_VAR()s and the process stack don't have
> their mappings munged and flushes deferred that this would be correct.

Hmm, so there is still a path for this?

At least if it added more ways to debug it, and some other changes to make
the locations where vmalloc is dangerous smaller?

>
> >> If folks want this functionality, they should get a new CPU that can
> >> flush the TLB without IPIs.
> >
> > That's a pretty heavy handed response. I'm not sure that's always a
> > feasible solution.
> >
> > From my experience in the world, software has always been around to fix the
> > hardware, not the other way around ;-)
>
> Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think.
> You can go buy the Intel hardware off the shelf today.

Sure, but changing CPUs on machines is not always that feasible either.

-- Steve