Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-19 Thread Rémi Denis-Courmont
Le keskiviikkona 19. heinäkuuta 2023, 0.32.26 EEST Lynne a écrit :
> > User-space access to the cycle counter has been deemed a security threat
> > due to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.
> > 
> > If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
> > benchmarks on Linux.
> 
> It does support linux perf. I've used it many times.

Yes of course. This MR modifies the Linux perf code, so I can see that it's 
there. My point is that *if* Linux perf is *not* enabled (in ./configure), 
there will be no benchmarks anymore on newer Linux RISC-V systems. But if it 
is enabled it, there will be no benchmarks on existing systems. Hence the need 
for the run-time indirection, or some very major PITA for development and 
supporting new developers (and we've seen already 3 different people sending 
RISC-V patches besides me in the past few months).

> >> This is a development tool first and foremost.
> > 
> > A development tool is not a justification for leaving a security whole in
> > the system. I don't make the rules, and you don't either. OpenSBI and
> > Linux make them.
> 
> You're not required to leave them on always. It's a development tool.
> You don't disable frequency scaling on all the time on other platforms,
> just when doing benchmarks, which is what the tool is meant for.

There is no need to disable frequency scaling if you use the most _suited_ 
counter for microbencmarks, that is to say the cycle counter. And then, 
disabling frequency scaling is not always feasible.

> For ARM, we routinely use an external kernel module to enable
> the performance timers just so we avoid linux perf.

I don't think an external ARM-specific kernel module is an excuse for breaking 
checkasm for _normal_ out-of-the-box kernel setups. Just because checkasm is a 
development tool doesn't mean that it only runs on custom development systems.

> Specifically on arm, the linux perf noise is very high, which is where
> my concerns are.

I fail to see how this MR makes this any worse. You can still disable Linux 
perf at build time as you could before. Quite the contrary in fact: if the 
noise is so high then there are no ways that a single indirect function call 
would make any meaningful difference for the worse.

> >> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> >> works, with or without the patch.
> > 
> > It does not.
> 
> Why not? It works on arm. The same exact code is used on risc-v
> and on any other platform too.

Currently it's disabled by default on RISC-V (really anything but Arm) in 
FFmpeg's configure, and hardly anybody has the necessary OpenSBI and Linux 
updates for Linux perf to work on RISC-V.

We can't simply assume that Linux perf is supported. We're a few years too 
early for that. And by then, Android will have blocked Linux perf, so we will 
have the same problem anyway, only for different reasons.

> >> Either way, I don't agree with this patch, not accepting it.
> > 
> > The only vaguely valid reason you've given is that this should cache the
> > function pointers locally, which version 2 does.
> 
> I disagree with it doing an indirection at all.

I can see that, and you're entitled to your opinion.
But that's neither technical reasoning, nor actionable review feedback.

> It's overhead and a potential source of inaccuracy.

If Linux perf is enabled, it's blatantly obvious that whatever additional 
overhead this adds is negligible. Since you insist, on top of my head, Linux 
perf already now will do quite a few indirect calls:
- from checkasm's PLT to libc's ioctl(),
- from libc's syscall to kernel exception vector,
- from exception vector to ioctl syscall handler,
- from ioctl handler to PMU framework,
- from PMU framework to PMU backend driver.
And that's grossly over-simplified and glossing over all the other calls and 
stuff that happen in-between.

As for the case that Linux perf is disabled at build time, then I can make a 
different version that calls AV_READ_TIME directly without indirection if 
that's what bothers Gramner and/or you. Though you've noted yourself that that 
path is also very noisy as it is so not like adding one indirect call will 
matter here :shrug:

> It's up to you to prove it wouldn't affect major platforms.

I think that I have supplied plenty of technical arguments that there 
shouldn't be a problem with this MR. And in the unlikely event that there 
would be an unexpected problem, I'm sure somebody would revert it; it's not an 
irreversible situation. Not to mention that, in the mean time, Martin did 
indeed verify that the overhead is small and stable (on Arm).

> > Since you're giving zero valid reasons, I'm invoking the TC.
> 
> What, you expect them to materialize an opinion out of thin air?

I think that they already have plenty to materialise an opinion from, between 
the MR, this thread, Martin's benchmarks, and their experience. Otherwise, 
they always have the possibility to ask for 

Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-18 Thread Lynne
Jul 17, 2023, 20:09 by r...@remlab.net:

> Le maanantaina 17. heinäkuuta 2023, 20.48.40 EEST Lynne a écrit :
>
>> >> > But I still argue that that is, either way, completely negligible
>> >> > compared
>> >> > to the *existing* overhead. Each loop is making 4 system calls, and
>> >> > each
>> >> > of those system call requires a direct call (to PLT) and an indirect
>> >> > branch (from GOT). If you have a problem with the two additional
>> >> > function
>> >> > calls, then you can't be using Linux perf in the first place.
>> >>
>> >> You don't want to ever use linux perf in the first place, it's second
>> >> class.>
>> > No it isn't. The interface is more involved than just reading a CSR; and
>> > sure I'd prefer the simple interface that RDCYCLE is all other things
>> > being equal. But other things are not equal. Linux perf is in fact *more*
>> > accurate by virtue of not *wrongly* counting other things. And it does
>> > not threaten the security of the entire system, so it will work inside a
>> > rented VM or an unprivileged process.
>>
>> Threaten?
>>
>
> User-space access to the cycle counter has been deemed a security threat due
> to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.
>
> If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance
> benchmarks on Linux.
>

It does support linux perf. I've used it many times.


>> This is a development tool first and foremost.
>>
>
> A development tool is not a justification for leaving a security whole in the
> system. I don't make the rules, and you don't either. OpenSBI and Linux make
> them.
>

You're not required to leave them on always. It's a development tool.
You don't disable frequency scaling on all the time on other platforms,
just when doing benchmarks, which is what the tool is meant for.

For ARM, we routinely use an external kernel module to enable
the performance timers just so we avoid linux perf.
Specifically on arm, the linux perf noise is very high, which is where
my concerns are.


>> If anyone doesn't want to use rdcycle, they can use linux perf, it still
>> works, with or without the patch.
>>
>
> It does not.
>

Why not? It works on arm. The same exact code is used on risc-v
and on any other platform too.


>> >> I don't think it's worth changing the direct inlining we had before.
>> >> You're
>> >> not interested in whether or not the same exact code is ran between
>> >> platforms,
>> >
>> > Err, I am definitely interested in doing exactly that. I don't want to
>> > have to reconfigure and recompile the entire FFmpeg just to switch
>> > between Linux perf and raw cycle counter. A contrario, I *do* want to
>> > compare performance between vendors once the hardware is available.
>>
>> That's a weak reason to compromise the accuracy of a development tool.
>>
>
> This is not compromising any accuracy of any development tool. Again, in a
> scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate
> for reasons that I already outlined in previous messages. And on systems with
> newer OpenSBI and kernel, RDCYCLE does not work at all.
>
>> >> just that the code that's measuring timing is as efficient and
>> >> low overhead as possible.
>> >
>> > Of course not. Low overhead is irrelevant here. The measurement overhead
>> > is know and is subtracted. What we need is stable/reproducible overhead,
>> > and accurate measurements.
>>
>> Which is what TSC or the equivalent gets you. It's noisy, but that's because
>> it's better and higher accuracy than having to roundtrip through the
>> kernel.
>>
>
> We _could_ use RDTIME. That's not blocked.
>
> And while that should be "low overhead", but measuring time is also obviously
> way _less_ accurate than measuring cycles (RDCYCLE), which is in turn _less_
> accurate than measuring cycles only in user mode and only of the current
> process (Linux perf).
>
>> Either way, I don't agree with this patch, not accepting it.
>>
>
> The only vaguely valid reason you've given is that this should cache the
> function pointers locally, which version 2 does.
>

I disagree with it doing an indirection at all.
It's overhead and a potential source of inaccuracy.
It's up to you to prove it wouldn't affect major platforms.


> Since you're giving zero valid reasons, I'm invoking the TC.
>

What, you expect them to materialize an opinion out of thin air?
You have to call for a vote. And bring up why it hasn't been changed
in 3 years, even though in the duties listed, it's only supposed
to last for a year before elections are called again. Which possibly
involves doing elections again.

You could've asked the actual maintainer, and one who knows the
most about the code, Gramner. But you're focusing too hard
on arguing about irrelevant issues, even bringing up security,
rather than using reasoning or clearly describing the need for the patch.

> All you've done is make it abundantly clear that you don't like Linux perf 
> and 
> don't care about the rationale for this patch 

Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-17 Thread Rémi Denis-Courmont
Le maanantaina 17. heinäkuuta 2023, 20.48.40 EEST Lynne a écrit :
> >> > But I still argue that that is, either way, completely negligible
> >> > compared
> >> > to the *existing* overhead. Each loop is making 4 system calls, and
> >> > each
> >> > of those system call requires a direct call (to PLT) and an indirect
> >> > branch (from GOT). If you have a problem with the two additional
> >> > function
> >> > calls, then you can't be using Linux perf in the first place.
> >> 
> >> You don't want to ever use linux perf in the first place, it's second
> >> class.> 
> > No it isn't. The interface is more involved than just reading a CSR; and
> > sure I'd prefer the simple interface that RDCYCLE is all other things
> > being equal. But other things are not equal. Linux perf is in fact *more*
> > accurate by virtue of not *wrongly* counting other things. And it does
> > not threaten the security of the entire system, so it will work inside a
> > rented VM or an unprivileged process.
> 
> Threaten?

User-space access to the cycle counter has been deemed a security threat due 
to the Cycle Drift attack, and is disabled as of OpenSBI 1.3.

If FFmpeg does not support Linux perf, FFmpeg will get _no_ performance 
benchmarks on Linux.

> This is a development tool first and foremost.

A development tool is not a justification for leaving a security whole in the 
system. I don't make the rules, and you don't either. OpenSBI and Linux make 
them.

> If anyone doesn't want to use rdcycle, they can use linux perf, it still
> works, with or without the patch.

It does not.

> >> I don't think it's worth changing the direct inlining we had before.
> >> You're
> >> not interested in whether or not the same exact code is ran between
> >> platforms,
> > 
> > Err, I am definitely interested in doing exactly that. I don't want to
> > have to reconfigure and recompile the entire FFmpeg just to switch
> > between Linux perf and raw cycle counter. A contrario, I *do* want to
> > compare performance between vendors once the hardware is available.
> 
> That's a weak reason to compromise the accuracy of a development tool.

This is not compromising any accuracy of any development tool. Again, in a 
scenario where both RDCYCLE and Linux perf work, Linux perf is more accurate 
for reasons that I already outlined in previous messages. And on systems with 
newer OpenSBI and kernel, RDCYCLE does not work at all.

> >> just that the code that's measuring timing is as efficient and
> >> low overhead as possible.
> > 
> > Of course not. Low overhead is irrelevant here. The measurement overhead
> > is know and is subtracted. What we need is stable/reproducible overhead,
> > and accurate measurements.
> 
> Which is what TSC or the equivalent gets you. It's noisy, but that's because
> it's better and higher accuracy than having to roundtrip through the
> kernel.

We _could_ use RDTIME. That's not blocked.

And while that should be "low overhead", but measuring time is also obviously 
way _less_ accurate than measuring cycles (RDCYCLE), which is in turn _less_ 
accurate than measuring cycles only in user mode and only of the current 
process (Linux perf).

> Either way, I don't agree with this patch, not accepting it.

The only vaguely valid reason you've given is that this should cache the 
function pointers locally, which version 2 does.

All you've done is make it abundantly clear that you don't like Linux perf and 
don't care about the rationale for this patch because it doesn't suit your 
personal preferences, especially on IRC:

20:49 <@Lynne> "security"
20:49 <@Lynne> it's a fucking timer
20:49 <@Lynne> "insecure"
20:51 <@Lynne> computers are very good at knowing how much time has passed, to 
   the point of the speed of light being an issue
20:52 <@Lynne> getting rid of this because some script kiddie could 
potentially 
   figure out a bit by calling this a million times while the CPU 
   is busy is paranoid
20:58 <@Lynne> it's a cache issue, so you fix the cache lifetime, not nerf a 
   timer

And it's not a cache issue. Cycle Drift is not a cache issue, it's an 
instruction timing issue.

Since you're giving zero valid reasons, I'm invoking the TC.

-- 
レミ・デニ-クールモン
http://www.remlab.net/



___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-17 Thread Lynne
Jul 17, 2023, 07:18 by r...@remlab.net:

> Le sunnuntaina 16. heinäkuuta 2023, 23.32.21 EEST Lynne a écrit :
>
>> Introducing additional overhead in the form of a dereference is a point
>> where instability can creep in. Can you guarantee that a context will
>> always remain in L1D cache,
>>
>
> L1D is not involved here. In version 2, the pointers are cached locally.
>
>> as opposed to just reading the raw CPU timing
>> directly where that's supported.
>>
>
> Of course not. Raw CPU timing is subject to noise from interrupts (and 
> whatever those interrupts trigger). And that's not just theoretical. I've 
> experienced it and it sucks. Raw CPU timing is much noisier than Linux perf.
>
> And because it has also been proven vastly insecure, it's been disabled on 
> Arm 
> for a long time, and is being disabled on RISC-V too now.
>
>> > But I still argue that that is, either way, completely negligible compared
>> > to the *existing* overhead. Each loop is making 4 system calls, and each
>> > of those system call requires a direct call (to PLT) and an indirect
>> > branch (from GOT). If you have a problem with the two additional function
>> > calls, then you can't be using Linux perf in the first place.
>>
>> You don't want to ever use linux perf in the first place, it's second class.
>>
>
> No it isn't. The interface is more involved than just reading a CSR; and sure 
> I'd prefer the simple interface that RDCYCLE is all other things being equal. 
> But other things are not equal. Linux perf is in fact *more* accurate by 
> virtue of not *wrongly* counting other things. And it does not threaten the 
> security of the entire system, so it will work inside a rented VM or an 
> unprivileged process.
>

Threaten? This is a development tool first and foremost.
If anyone doesn't want to use rdcycle, they can use linux perf, it still works,
with or without the patch.


>> I don't think it's worth changing the direct inlining we had before. You're
>> not interested in whether or not the same exact code is ran between
>> platforms,
>>
>
> Err, I am definitely interested in doing exactly that. I don't want to have 
> to 
> reconfigure and recompile the entire FFmpeg just to switch between Linux perf 
> and raw cycle counter. A contrario, I *do* want to compare performance 
> between 
> vendors once the hardware is available.
>

That's a weak reason to compromise the accuracy of a development tool.


>> just that the code that's measuring timing is as efficient and
>> low overhead as possible.
>>
>
> Of course not. Low overhead is irrelevant here. The measurement overhead is 
> know and is subtracted. What we need is stable/reproducible overhead, and 
> accurate measurements.
>

Which is what TSC or the equivalent gets you. It's noisy, but that's because
it's better and higher accuracy than having to roundtrip through the kernel.


> And that's assuming the stuff works at all. You can argue that we should use 
> Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU 
> access will just throw a SIGILL and end the checkasm process with zero 
> measurements. The rest of the industry wants to use system calls for informed 
> reasons. I don't think you, or even the whole FFmpeg project, can win that 
> argument against OS and CPU vendors.
>

Either way, I don't agree with this patch, not accepting it.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-16 Thread Rémi Denis-Courmont
Le sunnuntaina 16. heinäkuuta 2023, 23.32.21 EEST Lynne a écrit :
> Introducing additional overhead in the form of a dereference is a point
> where instability can creep in. Can you guarantee that a context will
> always remain in L1D cache,

L1D is not involved here. In version 2, the pointers are cached locally.

> as opposed to just reading the raw CPU timing
> directly where that's supported.

Of course not. Raw CPU timing is subject to noise from interrupts (and 
whatever those interrupts trigger). And that's not just theoretical. I've 
experienced it and it sucks. Raw CPU timing is much noisier than Linux perf.

And because it has also been proven vastly insecure, it's been disabled on Arm 
for a long time, and is being disabled on RISC-V too now.

> > But I still argue that that is, either way, completely negligible compared
> > to the *existing* overhead. Each loop is making 4 system calls, and each
> > of those system call requires a direct call (to PLT) and an indirect
> > branch (from GOT). If you have a problem with the two additional function
> > calls, then you can't be using Linux perf in the first place.
> 
> You don't want to ever use linux perf in the first place, it's second class.

No it isn't. The interface is more involved than just reading a CSR; and sure 
I'd prefer the simple interface that RDCYCLE is all other things being equal. 
But other things are not equal. Linux perf is in fact *more* accurate by 
virtue of not *wrongly* counting other things. And it does not threaten the 
security of the entire system, so it will work inside a rented VM or an 
unprivileged process.

> I don't think it's worth changing the direct inlining we had before. You're
> not interested in whether or not the same exact code is ran between
> platforms,

Err, I am definitely interested in doing exactly that. I don't want to have to 
reconfigure and recompile the entire FFmpeg just to switch between Linux perf 
and raw cycle counter. A contrario, I *do* want to compare performance between 
vendors once the hardware is available.

> just that the code that's measuring timing is as efficient and
> low overhead as possible.

Of course not. Low overhead is irrelevant here. The measurement overhead is 
know and is subtracted. What we need is stable/reproducible overhead, and 
accurate measurements.

And that's assuming the stuff works at all. You can argue that we should use 
Arm PMU and RISC-V RDCYCLE, and that Linux perf sucks, all you want. PMU 
access will just throw a SIGILL and end the checkasm process with zero 
measurements. The rest of the industry wants to use system calls for informed 
reasons. I don't think you, or even the whole FFmpeg project, can win that 
argument against OS and CPU vendors.

-- 
Rémi Denis-Courmont
http://www.remlab.net/



___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-16 Thread Lynne
Jul 15, 2023, 22:13 by r...@remlab.net:

> Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit :
>
>> Jul 15, 2023, 10:26 by r...@remlab.net:
>> > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
>> >> Jul 14, 2023, 20:29 by r...@remlab.net:
>> >> > This makes all calls to the bench start and stop functions via
>> >> > function pointers. While the primary goal is to support run-time
>> >> > selection of the performance measurement back-end in later commits,
>> >> > this has the side benefit of containing platform dependencies in to
>> >> > checkasm.c and out of checkasm.h.
>> >> > ---
>> >> > 
>> >> >  tests/checkasm/checkasm.c | 33 -
>> >> >  tests/checkasm/checkasm.h | 31 ---
>> >> >  2 files changed, 32 insertions(+), 32 deletions(-)
>> >> 
>> >> Not sure I agree with this commit, the overhead can be detectable,
>> >> and we have a lot of small functions with runtime a few times that
>> >> of a null function call.
>> > 
>> > I don't think the function call is ever null. The pointers are left NULL
>> > only if none of the backend initialise. But then, checkasm will bail out
>> > and exit before we try to benchmark anything anyway.
>> > 
>> > As for the real functions, they always do *something*. None of them "just
>> > return 0".
>>
>> I meant a no-op function call to measure the overhead of function
>> calls themselves, complete with all the ABI stuff.
>>
>
> I
>
>>
>> >> Can you store the function pointers out of the loop to reduce
>> >> the derefs needed?
>> > 
>> > Taking just the two loads is out of the loop should be feasible but it
>> > seems a rather vain. You will still have the overhead of the indirect
>> > function call, the function, and most importantly in the case of Linux
>> > perf and MacOS kperf, the system calls.
>> > 
>> > The only way to avoid the indirect function calls are to use IFUNC (tricky
>> > and not portable), or to make horrible macros to spawn one bench loop for
>> > each backend.
>> > 
>> > In the end, I think we should rather aim for as constant time as possible,
>> > rather than as fast as possible, so that the nop loop can estimate the
>> > benchmarking overhead as well as possible. In this respect, I think it is
>> > actually marginally better *not* to cache the function pointers in local
>> > variables, which could end up spilled on the stack, or not, depending on
>> > local compiler optimisations for any given test case.
>>
>> I disagree, uninlining the timer fetches adds another source of
>> inconsistency.
>>
>
> Err, outlining the timer makes sure that it's always the exact same code 
> that's run, and not differently optimised inlinings, at least if LTO is 
> absent. 
> (And even with LTO, it vastly reduces the compiler's ability to optimise and 
> vary the compilation.) Again, given how the calculations are made at the 
> moment, the stability of the overhead is important, so that we can *compare* 
> measurements. The absolute value of the overhead, not so much.
>

Introducing additional overhead in the form of a dereference is a point
where instability can creep in. Can you guarantee that a context will
always remain in L1D cache, as opposed to just reading the raw CPU timing
directly where that's supported.


> But I still argue that that is, either way, completely negligible compared to 
> the *existing* overhead. Each loop is making 4 system calls, and each of 
> those 
> system call requires a direct call (to PLT) and an indirect branch (from 
> GOT). 
> If you have a problem with the two additional function calls, then you can't 
> be using Linux perf in the first place.
>

You don't want to ever use linux perf in the first place, it's second class.
I don't think it's worth changing the direct inlining we had before. You're not
interested in whether or not the same exact code is ran between platforms,
just that the code that's measuring timing is as efficient and low overhead
as possible.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-15 Thread Rémi Denis-Courmont
Le lauantaina 15. heinäkuuta 2023, 20.43.26 EEST Lynne a écrit :
> Jul 15, 2023, 10:26 by r...@remlab.net:
> > Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
> >> Jul 14, 2023, 20:29 by r...@remlab.net:
> >> > This makes all calls to the bench start and stop functions via
> >> > function pointers. While the primary goal is to support run-time
> >> > selection of the performance measurement back-end in later commits,
> >> > this has the side benefit of containing platform dependencies in to
> >> > checkasm.c and out of checkasm.h.
> >> > ---
> >> > 
> >> >  tests/checkasm/checkasm.c | 33 -
> >> >  tests/checkasm/checkasm.h | 31 ---
> >> >  2 files changed, 32 insertions(+), 32 deletions(-)
> >> 
> >> Not sure I agree with this commit, the overhead can be detectable,
> >> and we have a lot of small functions with runtime a few times that
> >> of a null function call.
> > 
> > I don't think the function call is ever null. The pointers are left NULL
> > only if none of the backend initialise. But then, checkasm will bail out
> > and exit before we try to benchmark anything anyway.
> > 
> > As for the real functions, they always do *something*. None of them "just
> > return 0".
> 
> I meant a no-op function call to measure the overhead of function
> calls themselves, complete with all the ABI stuff.

I

> 
> >> Can you store the function pointers out of the loop to reduce
> >> the derefs needed?
> > 
> > Taking just the two loads is out of the loop should be feasible but it
> > seems a rather vain. You will still have the overhead of the indirect
> > function call, the function, and most importantly in the case of Linux
> > perf and MacOS kperf, the system calls.
> > 
> > The only way to avoid the indirect function calls are to use IFUNC (tricky
> > and not portable), or to make horrible macros to spawn one bench loop for
> > each backend.
> > 
> > In the end, I think we should rather aim for as constant time as possible,
> > rather than as fast as possible, so that the nop loop can estimate the
> > benchmarking overhead as well as possible. In this respect, I think it is
> > actually marginally better *not* to cache the function pointers in local
> > variables, which could end up spilled on the stack, or not, depending on
> > local compiler optimisations for any given test case.
> 
> I disagree, uninlining the timer fetches adds another source of
> inconsistency.

Err, outlining the timer makes sure that it's always the exact same code 
that's run, and not differently optimised inlinings, at least if LTO is absent. 
(And even with LTO, it vastly reduces the compiler's ability to optimise and 
vary the compilation.) Again, given how the calculations are made at the 
moment, the stability of the overhead is important, so that we can *compare* 
measurements. The absolute value of the overhead, not so much.

But I still argue that that is, either way, completely negligible compared to 
the *existing* overhead. Each loop is making 4 system calls, and each of those 
system call requires a direct call (to PLT) and an indirect branch (from GOT). 
If you have a problem with the two additional function calls, then you can't 
be using Linux perf in the first place.

-- 
レミ・デニ-クールモン
http://www.remlab.net/



___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-15 Thread Lynne
Jul 15, 2023, 10:26 by r...@remlab.net:

> Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
>
>> Jul 14, 2023, 20:29 by r...@remlab.net:
>> > This makes all calls to the bench start and stop functions via
>> > function pointers. While the primary goal is to support run-time
>> > selection of the performance measurement back-end in later commits,
>> > this has the side benefit of containing platform dependencies in to
>> > checkasm.c and out of checkasm.h.
>> > ---
>> > 
>> >  tests/checkasm/checkasm.c | 33 -
>> >  tests/checkasm/checkasm.h | 31 ---
>> >  2 files changed, 32 insertions(+), 32 deletions(-)
>>
>> Not sure I agree with this commit, the overhead can be detectable,
>> and we have a lot of small functions with runtime a few times that
>> of a null function call.
>>
>
> I don't think the function call is ever null. The pointers are left NULL only 
> if none of the backend initialise. But then, checkasm will bail out and exit 
> before we try to benchmark anything anyway.
>
> As for the real functions, they always do *something*. None of them "just 
> return 0".
>

I meant a no-op function call to measure the overhead of function
calls themselves, complete with all the ABI stuff.



>> Can you store the function pointers out of the loop to reduce
>> the derefs needed?
>>
>
> Taking just the two loads is out of the loop should be feasible but it seems 
> a 
> rather vain. You will still have the overhead of the indirect function call, 
> the function, and most importantly in the case of Linux perf and MacOS kperf, 
> the system calls.
>
> The only way to avoid the indirect function calls are to use IFUNC (tricky 
> and 
> not portable), or to make horrible macros to spawn one bench loop for each 
> backend.
>
> In the end, I think we should rather aim for as constant time as possible, 
> rather than as fast as possible, so that the nop loop can estimate the 
> benchmarking overhead as well as possible. In this respect, I think it is 
> actually marginally better *not* to cache the function pointers in local 
> variables, which could end up spilled on the stack, or not, depending on 
> local 
> compiler optimisations for any given test case.
>

I disagree, uninlining the timer fetches adds another source of
inconsistency. It may be messy, but I think accuracy here is more
important than cleanliness, especially as it's a development tool.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-15 Thread Rémi Denis-Courmont
Le lauantaina 15. heinäkuuta 2023, 11.05.51 EEST Lynne a écrit :
> Jul 14, 2023, 20:29 by r...@remlab.net:
> > This makes all calls to the bench start and stop functions via
> > function pointers. While the primary goal is to support run-time
> > selection of the performance measurement back-end in later commits,
> > this has the side benefit of containing platform dependencies in to
> > checkasm.c and out of checkasm.h.
> > ---
> > 
> >  tests/checkasm/checkasm.c | 33 -
> >  tests/checkasm/checkasm.h | 31 ---
> >  2 files changed, 32 insertions(+), 32 deletions(-)
> 
> Not sure I agree with this commit, the overhead can be detectable,
> and we have a lot of small functions with runtime a few times that
> of a null function call.

I don't think the function call is ever null. The pointers are left NULL only 
if none of the backend initialise. But then, checkasm will bail out and exit 
before we try to benchmark anything anyway.

As for the real functions, they always do *something*. None of them "just 
return 0".

> Can you store the function pointers out of the loop to reduce
> the derefs needed?

Taking just the two loads is out of the loop should be feasible but it seems a 
rather vain. You will still have the overhead of the indirect function call, 
the function, and most importantly in the case of Linux perf and MacOS kperf, 
the system calls.

The only way to avoid the indirect function calls are to use IFUNC (tricky and 
not portable), or to make horrible macros to spawn one bench loop for each 
backend.

In the end, I think we should rather aim for as constant time as possible, 
rather than as fast as possible, so that the nop loop can estimate the 
benchmarking overhead as well as possible. In this respect, I think it is 
actually marginally better *not* to cache the function pointers in local 
variables, which could end up spilled on the stack, or not, depending on local 
compiler optimisations for any given test case.

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/



___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".


Re: [FFmpeg-devel] [PATCH 4/7] checkasm: use pointers for start/stop functions

2023-07-15 Thread Lynne
Jul 14, 2023, 20:29 by r...@remlab.net:

> This makes all calls to the bench start and stop functions via
> function pointers. While the primary goal is to support run-time
> selection of the performance measurement back-end in later commits,
> this has the side benefit of containing platform dependencies in to
> checkasm.c and out of checkasm.h.
> ---
>  tests/checkasm/checkasm.c | 33 -
>  tests/checkasm/checkasm.h | 31 ---
>  2 files changed, 32 insertions(+), 32 deletions(-)
>

Not sure I agree with this commit, the overhead can be detectable,
and we have a lot of small functions with runtime a few times that
of a null function call.
Can you store the function pointers out of the loop to reduce
the derefs needed?
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".