My apologies for adding a typo to the linux-kernel address, corrected now.
On Wed, Oct 07, 2020 at 10:58:00PM -0700, Stephane Eranian wrote:
> Hi Peter,
>
> On Tue, Oct 6, 2020 at 6:17 AM Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > Hi all,
> >
> > I've been trying to float this idea for a fair number of years, and I
> > think at least Stephane has been talking to tools people about it, but
> > I'm not sure what, if anything, ever happened with it, so let me post it
> > here :-)
> >
>
> Thanks for bringing this back. This is a pet project of mine and I
> have been looking at it intermittently for the last 4 years now. I
> simply never got a chance to complete it because I was preempted by
> other higher-priority projects. I have developed an internal
> proof-of-concept prototype using one of the 3 approaches I know of. My
> goal was to demonstrate that PMU statistical sampling of loads/stores
> with data addresses would work as well as instrumentation. This is
> slightly different from hit/miss in the analysis, but the process is
> the same.
>
> As you point out, the difficulty is not so much in collecting the
> samples but rather in symbolizing data addresses from the heap.

Right, that's non-trivial, although for static and per-cpu objects it
should be rather straightforward; heap objects are going to be a pain.
You'd basically have to also log the alloc/free of every object along
with the data type used for it, which is not something we have readily
available at the allocator.

> Intel PEBS and IBM Marked Events work well to collect the data. AMD IBS
> works, though you get a lot of irrelevant samples due to the lack of
> hardware filtering. ARM SPE would work too. Overall, all the major
> architectures will provide the sampling support needed.

That's for the data address, or also the eventing IP?

> Some time ago, I had my intern pursue the other 2 approaches for
> symbolization. The one I see as most promising is using the DWARF
> information (no BPF needed).
> The good news is that I believe we do not
> need more information than what is already there. We just need the
> compiler to generate valid DWARF at most optimization levels, which I
> believe is not the case for LLVM-based compilers but may be okay for
> GCC.

Right, I think GCC improved a lot on this front over the past few
years.

Also added Andi and Masami, who have worked on this or related topics.

> Once we have the DWARF logic in place, it is easier to improve
> perf report/annotate to do hit/miss or hot/cold, read/write analysis
> on each data type and the fields within.
>
> Once we have the code for perf, we are planning to contribute it
> upstream.
>
> In the meantime, we need to lean on the compiler teams to ensure no
> data type information is lost at high optimization levels. My
> understanding from talking with some compiler folks is that this is
> not a trivial fix.

As you might have noticed, I sent this to the linux-toolchains list.
While you lean on your compiler folks, try and get them subscribed to
this list. It is meant to discuss toolchain issues as related to Linux.
Both GCC/binutils and LLVM should be represented here.