Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions

Tanish Desai Mon, 26 May 2025 21:28:53 -0700

> Can you explain 3 more?
Pablo and I have looked into object files for one of default trace point "
trace_vhost_commit"(first tracepoint of hw/virtio) before and after making
this change and we found that by moves the hot-path check for
_simple_trace_vhost_commit into the caller (in the .h file) and leaves the
full stack-frame prologue and tracepoint body in the cold path. By inlining
only the if (trace_events_enabled()) test at the call site, we eliminate
the 11-instruction prologue overhead (stack allocation, canary load/store,
register spills, and argument masking) on every vhost commit(trace call)
when tracing is disabled.
The old disassembled code looks like this:-
0x10 stp x29, x30, [sp, #-64]! *Prologue:* allocates 64-byte frame and
saves old FP (x29) & LR (x30)
0x14 adrp x3, trace_events_enabled_count *Prologue:* computes page-base of
the trace-enable counter
0x18 adrp x2, __stack_chk_guard *Important (maybe prolog don't
know?)(stack-protector):* starts up the stack-canary load
0x1c mov x29, sp *Prologue:* sets new frame pointer
0x20 ldr x3, [x3] *Prologue:* loads the actual trace-enabled count
0x24 stp x19, x20, [sp, #16] *Prologue:* spills callee-saved regs used by
this function (x19, x20)
0x28 and w20, w0, #0xff *Tracepoint setup:* extracts the low-8 bits of arg0
as the “event boolean”
0x2c ldr x2, [x2] *Prologue (cont’d):* completes loading of the
stack-canary value
0x30 and w19, w1, #0xff *Tracepoint setup:* extracts low-8 bits of arg1
0x34 ldr w0, [x3] *Important:* loads the current trace-enabled flag from
memory
0x38 ldr x1, [x2] *Prologue (cont’d):* reads the canary
0x3c str x1, [sp, #56] *Prologue (cont’d):* writes the canary into the new
frame
0x40 mov x1, #0 *Prologue (cont’d):* zeroes out x1 for the upcoming branch
test
0x44 cbnz w0, 0x88 *Important:* if tracing is disabled (w0==0) skip the
heavy path entirely
Saving 11/14 instructions!Also We have not considered epilog
instructional saves
would be definitely more than 11.
Old flow: Every call does ~11 insns of prologue + tracepoint check, even if
tracing is disabled.
New flow: We inline a tiny if (trace_enabled) at the caller; only when it’s
true do you call into a full function with its prologue.(Extra instructions)


On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:

> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonz...@redhat.com> wrote:
> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefa...@gmail.com> ha
> scritto:
> >>
> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesa...@gmail.com>
> wrote:
> >> >
> >> > Remove hot paths from .c file and added it in .h file to keep it
> inline.
> >>
> >> Please include performance results in the commit description so it's
> >> clear what impact this change has.
> >
> >
> > Hi Stefan,
> >
> > I am replying because I take the blame for this :) and as an example of
> how he could interact with the maintainer.
> >
> > For now we mostly looked at differences between the code that tracetool
> generates for the various backends, and the observation that some of them
> put code in the .h and some in the .c file. I explained to Tanish the
> concept of separating hot vs cold code in theory, showed him some examples
> in QEMU where performance measurements were done in the past, and suggested
> applying this to various backends (starting with the one with the weirdest
> choice!). However we didn't do any measurement yet.
> >
> > Some possibilities that come to mind:
> >
> > 1) maybe the coroutine benchmarks are enough to show a difference, with
> some luck
> >
> > 2) a new microbenchmark (or a set of microbenchmarks that try various
> level of register pressure around the tracepoint), which would be nice to
> have anyway
>
> 2 is going above and beyond :). I'm not trying to make life hard.
>
> Another option if 1 or 2 are difficult to quantify: I'd be happy with
> just a "before" and "after" disassembly snippet from objdump showing
> that the instructions at a call site changed as intended.
>
> What I'm looking for is a commit description that explains the purpose
> of the commit followed by, since this is a performance improvement,
> some form of evidence that the change achieved its goal. At that point
> it's easy for me to merge because it has a justification and the
> nature of the commit will be clear to anyone looking at the git log in
> the future.
>
> > 3) perhaps we could try to check the code size for some object files in
> block/ (for example libblock.a.p/*.c.o), as a proxy for how much
> instruction cache space is saved when all tracepoints are disabled
> >
> > We can start from 3, but also try 1+3 and 2+3 if it fails if you think
> that would be more persuasive.
>
> Can you explain 3 more? Total code size on disk should stay about the
> same because most trace events only have one call site. Moving code
> between the caller and callee doesn't make a big difference in either
> direction. At runtime there is a benefit from inlining the condition
> check since the call site no longer needs to execute the callee's
> function prologue/epilogue when the trace event is runtime-disabled,
> but that CPU cycle and instruction cache effect won't be visible from
> the on-disk code size.
>
> Thanks,
> Stefan
>

Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions

Reply via email to