On Sun, 09 Mar 2025 20:45:23 +0100 Niklas Haas <ffm...@haasn.xyz> wrote:
> On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <mar...@martin.st> wrote:
> > On Sat, 8 Mar 2025, Niklas Haas wrote:
> >
> > > What are the thoughts on the float-first approach?
> >
> > In general, for modern architectures, relying on floats probably is
> > reasonable. (On architectures that aren't of quite as widespread
> > interest, it might not be so clear cut though.)
> >
> > However with the benchmark example you provided a couple of weeks ago,
> > we concluded that even on x86 on modern HW, floats were faster than
> > int16 only in one case: When using Clang, not GCC, and when compiling
> > with -mavx2, not without it. In all the other cases, int16 was faster
> > than float.
>
> Hi Martin,
>
> I should preface that this particular benchmark was a very specific test
> for floating point *filtering*, which is considerably more punishing than
> the conversion pipeline I have implemented here, and I think it's partly
> the fault of compilers generating very unoptimal filtering code.
>
> I think it would be better to re-assess using the current prototype on
> actual hardware. I threw up a quick NEON test branch (untested, should
> hopefully work):
> https://github.com/haasn/FFmpeg/commits/swscale3-neon
>
> # adjust the benchmark iters count as needed based on the HW perf
> make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
>
> If this differs significantly from the ~1.8x speedup I measure on x86, I
> will be far more concerned about the new approach.

I gave it a try. The result of a naive/blind run on a Cortex-X1, using
clang version 20.0.0 (from the latest Android NDK v29), is:

Overall speedup=1.688x faster, min=0.141x max=45.898x

This has quite a lot more significant speed regressions compared to x86,
though. In particular, clang/LLVM refuses to vectorize packed reads of 2 or
3 elements, so any sort of operation involving rgb24 or bgr24 suffers
horribly:

Conversion pass for rgb24 -> rgba:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> ++++] SWS_OP_CLEAR   : {_ _ _ 255}
  [ u8 .... -> XXXX] SWS_OP_WRITE   : 4 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)

rgb24 1920x1080 -> rgba 1920x1080, flags=0 dither=1,
SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=2856 us, ref=387 us, speedup=0.136x slower

Another thing LLVM seemingly does not optimize at all is integer shifts;
they also end up as horribly inefficient scalar code:

Conversion pass for yuv444p -> yuv444p16le:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> u16
  [u16 ...X -> +++X] SWS_OP_LSHIFT  : << 8
  [u16 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

yuv444p 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1,
SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=1564 us, ref=590 us, speedup=0.377x slower

On the other hand, float performance does not seem to be an issue here:

Conversion pass for rgba -> yuv444p:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 4 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR  : matrix3+off3
      [[0.256788 0.504129 0.097906 0 16]
       [-0.148223 -0.290993 112/255 0 128]
       [112/255 -0.367788 -0.071427 0 128]
       [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER  : 16x16 matrix
  [f32 ...X -> ...X] SWS_OP_CLAMP   : 0 <= x <= {255 255 255 _}
  [f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
  [ u8 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

rgba 1920x1080 -> yuv444p 1920x1080, flags=0 dither=1,
SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=4074 us, ref=6987 us, speedup=1.715x faster

So in summary, from what I gather, on all platforms I tested so far, the two
most important ASM routines to focus on are:

1. packed reads/writes
2. integer shifts

since compilers seem to have a very hard time generating good code for
these. On the other hand, simple floating point FMAs and planar reads/writes
are handled quite well as is. (I have put minimal sketches of the kernels I
mean at the bottom of this mail.)

> > After doing those benchmarks, my understanding was that you concluded
> > that we probably need to keep int16 based codepaths still, then.
>
> This may have been a misunderstanding. While I think we should keep the
> option of using fixed point precision *open*, the main take-away for me
> was that we will definitely need to transition to custom SIMD, since we
> cannot rely on the compiler to generate good code for us.
>
> > Did something fundamental come up since we did these benchmarks that
> > changed your conclusion?
> >
> > // Martin
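
To make the "packed reads/writes" point concrete, here is a minimal sketch
of the rgb24 -> rgba kernel from above. This is not the actual swscale code;
the function names, the fixed 16-pixel stride and the missing tail handling
are all illustrative. The point is that the packed access clang/LLVM leaves
scalar maps directly onto NEON's vld3/vst4:

#include <stddef.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Scalar reference: packed rgb24 in, packed rgba out, A = 255. The
 * 3-elements-packed load below is what clang/LLVM refuses to vectorize. */
static void rgb24_to_rgba_c(uint8_t *dst, const uint8_t *src, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 255;
    }
}

#ifdef __ARM_NEON
/* Hand-written NEON: vld3q_u8 deinterleaves 16 packed triplets in one
 * instruction, and vst4q_u8 re-interleaves with the constant alpha on
 * the way out. (Assumes pixels % 16 == 0; a real routine needs a tail.) */
static void rgb24_to_rgba_neon(uint8_t *dst, const uint8_t *src,
                               size_t pixels)
{
    for (size_t i = 0; i < pixels; i += 16) {
        uint8x16x3_t rgb = vld3q_u8(src + 3 * i);
        uint8x16x4_t rgba;
        rgba.val[0] = rgb.val[0];
        rgba.val[1] = rgb.val[1];
        rgba.val[2] = rgb.val[2];
        rgba.val[3] = vdupq_n_u8(255);
        vst4q_u8(dst + 4 * i, rgba);
    }
}
#endif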
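
The same goes for the integer shift case: the yuv444p -> yuv444p16le pass is
essentially a widen-and-shift per plane, which NEON expresses as a single
widening shift (ushll) per half vector. Again just an illustrative sketch
under the same assumptions (made-up names, no tail handling):

#include <stddef.h>
#include <stdint.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Scalar reference for the u8 -> u16, << 8 plane expansion. */
static void expand_plane_c(uint16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)src[i] << 8;
}

#ifdef __ARM_NEON
/* Hand-written NEON: vshll_n_u8 widens u8 -> u16 and shifts left by 8
 * in a single instruction. (Assumes n % 16 == 0.) */
static void expand_plane_neon(uint16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);
        vst1q_u16(dst + i,     vshll_n_u8(vget_low_u8(v),  8));
        vst1q_u16(dst + i + 8, vshll_n_u8(vget_high_u8(v), 8));
    }
}
#endif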
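
And for contrast, the kind of code compilers already handle well: the float
SWS_OP_LINEAR (matrix3+off3) step boils down to per-pixel straight-line
multiply-adds, which auto-vectorize into clean FMA chains, matching the good
float numbers above. (Planar layout and names are again made up purely for
illustration; the real op works on the pipeline's internal vectors.)

#include <stddef.h>

static void linear3_off3_c(float *y, float *u, float *v,
                           const float *r, const float *g, const float *b,
                           const float m[3][4], size_t n)
{
    /* Three dot products plus offsets per pixel; compilers vectorize
     * this without any help. */
    for (size_t i = 0; i < n; i++) {
        y[i] = m[0][0] * r[i] + m[0][1] * g[i] + m[0][2] * b[i] + m[0][3];
        u[i] = m[1][0] * r[i] + m[1][1] * g[i] + m[1][2] * b[i] + m[1][3];
        v[i] = m[2][0] * r[i] + m[2][1] * g[i] + m[2][2] * b[i] + m[2][3];
    }
}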

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".