On Fri, Mar 5, 2021 at 7:01 AM Peter Zijlstra <pet...@infradead.org> wrote: > > On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote: > > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote: > > > That ____cacheline_aligned goes back many years, this is not new, it > > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve > > > the perf_sample_data struct layout"). > > > > long time ago... > > > > > But it really seems entirely and utterly bogus. That cacheline > > > alignment makes things *worse*, when the variables are on the local > > > stack. The local stack is already going to be dirty and in the cache, > > > and aligning those things isn't going to - and I quote from the code > > > in that commend in that commit - "minimize the cachelines touched". > > > > > > Quite the reverse. It's just going to make the stack frame use *more* > > > memory, and make any cacheline usage _worse_. > > > > IIRC there is more history here, but I can't seem to find references > > just now. > > > > What I remember is that since perf_sample_data is fairly large, > > unconditionally initializing the whole thing is *slow* (and > > -fauto-var-init=zero will hurt here). > > > > So at some point I removed that full initialization and made sure we > > only unconditionally touched the first few variables, which gave a > > measurable speedup. > > > > Then things got messy again and the commit 2565711fb7d7 referenced above > > was cleanup, to get back to that initial state. > > > > Now, you're right that __cacheline_aligned on on-stack (and this is > > indeed mostly on-stack) is fairly tedious (there were a few patches > > recently to reduce the amount of on-stack instances). > > > > I'll put it on the todo list, along with that hotplug stuff (which I > > tried to fix but ended up with an even bigger mess). I suppose we can > > try and not have the alignment for the on-stack instances while > > preserving it for the few off-stack ones. > > > > Also; we're running on the NMI stack, and that's not typically hot. > > This seems to be it... (completely untested) > > --- > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 3f7f89ea5e51..918a296d2ca2 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1032,7 +1032,9 @@ struct perf_sample_data { > u64 cgroup; > u64 data_page_size; > u64 code_page_size; > -} ____cacheline_aligned; > +}; > + > +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned; > > /* default value for data source */ > #define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\ > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index b0c45d923f0f..f32c623abef6 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct > bpf_map *map, > * bpf_perf_event_output > */ > struct bpf_trace_sample_data { > - struct perf_sample_data sds[3]; > + perf_sample_data_t sds[3];
bpf side doesn't care about about cacheline aligned. No need to add new typedef just for that.