On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonz...@redhat.com> wrote: > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefa...@gmail.com> ha scritto: >> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesa...@gmail.com> wrote: >> > >> > Remove hot paths from .c file and added it in .h file to keep it inline. >> >> Please include performance results in the commit description so it's >> clear what impact this change has. > > > Hi Stefan, > > I am replying because I take the blame for this :) and as an example of how > he could interact with the maintainer. > > For now we mostly looked at differences between the code that tracetool > generates for the various backends, and the observation that some of them put > code in the .h and some in the .c file. I explained to Tanish the concept of > separating hot vs cold code in theory, showed him some examples in QEMU where > performance measurements were done in the past, and suggested applying this > to various backends (starting with the one with the weirdest choice!). > However we didn't do any measurement yet. > > Some possibilities that come to mind: > > 1) maybe the coroutine benchmarks are enough to show a difference, with some > luck > > 2) a new microbenchmark (or a set of microbenchmarks that try various level > of register pressure around the tracepoint), which would be nice to have > anyway
2 is going above and beyond :). I'm not trying to make life hard. Another option if 1 or 2 are difficult to quantify: I'd be happy with just a "before" and "after" disassembly snippet from objdump showing that the instructions at a call site changed as intended. What I'm looking for is a commit description that explains the purpose of the commit followed by, since this is a performance improvement, some form of evidence that the change achieved its goal. At that point it's easy for me to merge because it has a justification and the nature of the commit will be clear to anyone looking at the git log in the future. > 3) perhaps we could try to check the code size for some object files in > block/ (for example libblock.a.p/*.c.o), as a proxy for how much instruction > cache space is saved when all tracepoints are disabled > > We can start from 3, but also try 1+3 and 2+3 if it fails if you think that > would be more persuasive. Can you explain 3 more? Total code size on disk should stay about the same because most trace events only have one call site. Moving code between the caller and callee doesn't make a big difference in either direction. At runtime there is a benefit from inlining the condition check since the call site no longer needs to execute the callee's function prologue/epilogue when the trace event is runtime-disabled, but that CPU cycle and instruction cache effect won't be visible from the on-disk code size. Thanks, Stefan