On Wed, 12 Mar 2025, Niklas Haas wrote:
On Sun, 09 Mar 2025 20:45:23 +0100 Niklas Haas <ffm...@haasn.xyz> wrote:
On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <mar...@martin.st> wrote:
> On Sat, 8 Mar 2025, Niklas Haas wrote:
>
> > What are the thoughts on the float-first approach?
>
> In general, for modern architectures, relying on floats probably is
> reasonable. (On architectures that aren't of quite as widespread interest,
> it might not be so clear cut though.)
>
> However, with the benchmark example you provided a couple of weeks ago, we
> concluded that even on modern x86 hardware, floats were faster than int16
> in only one case: when using Clang, not GCC, and when compiling with
> -mavx2, not without it. In all other cases, int16 was faster than float.
Hi Martin,
I should preface that this particular benchmark was a very specific test of
floating-point *filtering*, which is considerably more punishing than the
conversion pipeline I have implemented here, and I think it's partly the
fault of compilers generating very suboptimal filtering code.
I think it would be better to re-assess using the current prototype on actual
hardware. I threw together a quick NEON test branch (untested, but it should
hopefully work):
https://github.com/haasn/FFmpeg/commits/swscale3-neon
# adjust the benchmark iters count as needed based on the HW perf
make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
If this differs significantly from the ~1.8x speedup I measure on x86, I
will be far more concerned about the new approach.
Sorry, I haven't had time to try this out myself yet...
I gave it a try. So, the result of a naive/blind run on a Cortex-X1 using clang
version 20.0.0 (from the latest Android NDK v29) is:
Overall speedup=1.688x faster, min=0.141x max=45.898x
This shows quite a few more significant speed regressions than x86, though.
In particular, clang/LLVM refuses to vectorize packed reads of 2 or 3
elements, so any operation involving rgb24 or bgr24 suffers badly:
So, if the performance of this relies on compiler autovectorization, what's
the plan wrt GCC? We blanket-disable autovectorization when compiling with
GCC - see fd6dbc53855fbfc9a782095d0ffe11dd3a98905f for when it was disabled
last time. Building and running fate with autovectorization in GCC does
succeed, at least with modern GCC on x86_64, but it's of course possible
that it can still cause issues in trickier configurations.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".