On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote: > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote: > > That ____cacheline_aligned goes back many years, this is not new, it > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve > > the perf_sample_data struct layout"). > > long time ago... > > > But it really seems entirely and utterly bogus. That cacheline > > alignment makes things *worse*, when the variables are on the local > > stack. The local stack is already going to be dirty and in the cache, > > and aligning those things isn't going to - and I quote from the code > > in that commend in that commit - "minimize the cachelines touched". > > > > Quite the reverse. It's just going to make the stack frame use *more* > > memory, and make any cacheline usage _worse_. > > IIRC there is more history here, but I can't seem to find references > just now. > > What I remember is that since perf_sample_data is fairly large, > unconditionally initializing the whole thing is *slow* (and > -fauto-var-init=zero will hurt here). > > So at some point I removed that full initialization and made sure we > only unconditionally touched the first few variables, which gave a > measurable speedup. > > Then things got messy again and the commit 2565711fb7d7 referenced above > was cleanup, to get back to that initial state. > > Now, you're right that __cacheline_aligned on on-stack (and this is > indeed mostly on-stack) is fairly tedious (there were a few patches > recently to reduce the amount of on-stack instances). > > I'll put it on the todo list, along with that hotplug stuff (which I > tried to fix but ended up with an even bigger mess). I suppose we can > try and not have the alignment for the on-stack instances while > preserving it for the few off-stack ones. > > Also; we're running on the NMI stack, and that's not typically hot.
This seems to be it... (completely untested) --- diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 3f7f89ea5e51..918a296d2ca2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1032,7 +1032,9 @@ struct perf_sample_data { u64 cgroup; u64 data_page_size; u64 code_page_size; -} ____cacheline_aligned; +}; + +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned; /* default value for data source */ #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b0c45d923f0f..f32c623abef6 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, * bpf_perf_event_output */ struct bpf_trace_sample_data { - struct perf_sample_data sds[3]; + perf_sample_data_t sds[3]; }; static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);