On Sat, 08 Mar 2025 23:53:42 +0100 Niklas Haas <ffm...@haasn.xyz> wrote:
> Hi all,
>
> for the past two months, I have been working on a prototype for a radical
> redesign of the swscale internals, specifically the format handling layer.
> This includes, or will eventually expand to include, all format input/output
> and unscaled special conversion steps.
>
> I am not yet at a point where the new code can replace the scaling kernels,
> but for the time being, we could start using it for the simple unscaled
> cases, in theory, right away.
>
> Rather than repeating my entire design here, I opted to collect my notes
> into a design document on my WIP branch:
>
> https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
>
> I have spent the past week or so ironing out the last kinks and extensively
> benchmarking the new design, at least on x86. It is generally a roughly 1.9x
> improvement over the existing unscaled special converters across the board,
> before even adding any hand-written ASM. (This speedup is *just* using the
> less-than-optimal compiler output from my reference C code!)
>
> In some cases we even measure ~3-4x or even ~6x speedups, especially those
> where swscale does not currently have hand-written SIMD. Overall:
>
> cpu: 16-core AMD Ryzen Threadripper 1950X
> gcc 14.2.1:
>   single thread:
>     Overall speedup=1.887x faster, min=0.250x max=22.578x
>   multi thread:
>     Overall speedup=1.657x faster, min=0.190x max=87.972x
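For readers who have not looked at the design document linked above: the new layer decomposes each conversion into a short list of primitive ops (read, swizzle, convert, clear, write) that an optimizer can simplify before execution. The following is a purely illustrative toy sketch of that idea; all names and structures here are made up for this mail and are not the actual swscale3 API.

```c
#include <stdint.h>
#include <string.h>

typedef enum { OP_READ, OP_SWIZZLE, OP_CLEAR, OP_WRITE, OP_END } OpType;

typedef struct {
    OpType  type;
    uint8_t arg[4]; /* swizzle indices or per-channel clear values */
} Op;

/* Single-pixel toy interpreter over a 4-channel working register.
 * The real code processes whole lines and compiles specialized kernels. */
static void run_ops(const Op *ops, const uint8_t *src, int nsrc,
                    uint8_t *dst, int ndst)
{
    uint8_t px[4] = {0}, tmp[4];
    for (; ops->type != OP_END; ops++) {
        switch (ops->type) {
        case OP_READ:    /* load nsrc packed channels */
            memcpy(px, src, nsrc);
            break;
        case OP_SWIZZLE: /* permute channels by index */
            for (int i = 0; i < 4; i++)
                tmp[i] = px[ops->arg[i]];
            memcpy(px, tmp, 4);
            break;
        case OP_CLEAR:   /* overwrite channels with nonzero constants
                          * (a simplification of the real clear op) */
            for (int i = 0; i < 4; i++)
                if (ops->arg[i])
                    px[i] = ops->arg[i];
            break;
        case OP_WRITE:   /* store ndst packed channels */
            memcpy(dst, px, ndst);
            break;
        default:
            break;
        }
    }
}

/* A bgr24 -> abgr pass expressed as such an op list: read 3 channels,
 * swizzle 0012, clear channel 0 (alpha) to 255, write 4 channels. */
static const Op bgr24_to_abgr[] = {
    { OP_READ,    { 0 } },
    { OP_SWIZZLE, { 0, 0, 1, 2 } },
    { OP_CLEAR,   { 255, 0, 0, 0 } },
    { OP_WRITE,   { 0 } },
    { OP_END,     { 0 } },
};
```

The point of this representation is that the op list is data, so the optimizer can inspect it, fuse or drop ops, and recognize degenerate passes before any pixel is touched.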
I was asked to substantiate these figures with more practical and relevant
examples. Apologies in advance for the wall of text, but I felt the need to
be thorough. The most important information is up-front.

Methodology: sorting the most popular pixel formats (by occurrence count
inside FFmpeg internal code), and excluding subsampled formats, we have:

  347 AV_PIX_FMT_YUV444P
  331 AV_PIX_FMT_GRAY8
  281 AV_PIX_FMT_RGB24
  235 AV_PIX_FMT_BGR24
  232 AV_PIX_FMT_GBRP
  220 AV_PIX_FMT_RGBA
  190 AV_PIX_FMT_YUV444P10
  185 AV_PIX_FMT_BGRA
  184 AV_PIX_FMT_GBRAP
  177 AV_PIX_FMT_YUVJ444P
  172 AV_PIX_FMT_YUVA444P
  162 AV_PIX_FMT_YUV444P12
  150 AV_PIX_FMT_YUV444P16
  150 AV_PIX_FMT_GBRP10
  139 AV_PIX_FMT_GBRP12
  138 AV_PIX_FMT_ARGB
  131 AV_PIX_FMT_GRAY16
  129 AV_PIX_FMT_YUV444P9
  127 AV_PIX_FMT_ABGR
  119 AV_PIX_FMT_YUVA444P10
  115 AV_PIX_FMT_GBRP16
  113 AV_PIX_FMT_GRAY10
  111 AV_PIX_FMT_GBRP9
  109 AV_PIX_FMT_YUV444P14
  (remaining formats are used fewer than 100 times)

Across this reduced set of formats, the overall speedup (on my weaker,
older laptop) was:

CPU: quad core AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx (-MT MCP-)
Overall speedup=1.666x faster, min=0.431x max=5.819x

The biggest speedups were seen for anything involving gbrp:

Conversion pass for bgra -> gbrp16le:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 4 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_SWIZZLE : 1023
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)
bgra 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1933 us, ref=8216 us, speedup=4.249x faster

Conversion pass for gray -> gbrp:
  [ u8 XXXX -> +XXX] SWS_OP_READ    : 1 elem(s) packed >> 0
  [ u8 .XXX -> +++X] SWS_OP_SWIZZLE : 0003
  [ u8 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)
gray 1920x1080 -> gbrp 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999977 U=1.000000 V=1.000000 A=1.000000} time=868 us, ref=3510
us, speedup=4.039x faster

Conversion pass for gbrp -> gbrp16le:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)
gbrp 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1724 us, ref=9489 us, speedup=5.505x faster

An honorable mention goes to reductions in plane count, which the optimizer
identifies as a no-op and turns into a refcopy / memcpy:

yuva444p10le 1920x1080 -> yuv444p10le 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=0 us, ref=2453 us, speedup=6072.812x faster

The worst slowdowns are currently those involving any sort of packed
swizzle for which dedicated MMX functions already exist:

Conversion pass for bgr24 -> abgr:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> X+++] SWS_OP_SWIZZLE : 0012
  [ u8 X... -> ++++] SWS_OP_CLEAR   : {255 _ _ _}
  [ u8 .... -> XXXX] SWS_OP_WRITE   : 4 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)
bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1710 us, ref=826 us, speedup=0.483x slower

I have previously identified these as a particularly weak spot in the
compiler output: no matter what C code I write, the result will always be
roughly 0.5x the speed of the existing hand-written MMX. That said, I also
plan on taking that existing MMX code and simply plugging it into the new
architecture, which should get rid of these last few slow cases.

On the other hand, the generated code outperforms the existing architecture
in cases where the old code fails to provide a dedicated function, e.g.:

Conversion pass for bgr24 -> argb:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> X+++] SWS_OP_SWIZZLE : 0210
  [ u8 X... -> ++++] SWS_OP_CLEAR   : {255 _ _ _}
  [ u8 ....
-> XXXX] SWS_OP_WRITE   : 4 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)
bgr24 1920x1080 -> argb 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1685 us, ref=2646 us, speedup=1.570x faster

This shows that the new general-purpose pipeline is faster than the old
general-purpose pipeline.

And lastly, here is a randomly chosen subset of the overall test:

bgra 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=4243 us, ref=5824 us, speedup=1.372x faster

yuv444p12le 1920x1080 -> yuva444p 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=2985 us, ref=2813 us, speedup=0.942x slower

gbrp10le 1920x1080 -> gbrp 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999998 U=0.999987 V=1.000000 A=1.000000}
  time=4473 us, ref=9638 us, speedup=2.155x faster

yuv444p10le 1920x1080 -> yuva444p10le 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=2040 us, ref=3095 us, speedup=1.517x faster

yuv444p10le 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=3855 us, ref=7277 us, speedup=1.888x faster

gbrp 1920x1080 -> gbrap 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1059 us, ref=1032 us, speedup=0.975x slower

argb 1920x1080 -> gray16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=1.000000 V=1.000000 A=1.000000}
  time=3113 us, ref=3697 us, speedup=1.187x faster

yuv444p12le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=4066 us, ref=7141 us, speedup=1.756x faster

gbrp 1920x1080 -> rgba 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1384 us, ref=3072 us, speedup=2.220x faster

yuvj444p 1920x1080 -> argb 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999980 U=0.999974 V=0.999987 A=1.000000}
  time=4777 us, ref=9294 us, speedup=1.946x faster

yuvj444p 1920x1080
-> gbrp16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
  time=3850 us, ref=7314 us, speedup=1.900x faster

gray10le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999991 U=1.000000 V=1.000000 A=1.000000}
  time=1269 us, ref=1296 us, speedup=1.021x faster

argb 1920x1080 -> bgra 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=1052 us, ref=1047 us, speedup=0.995x slower

yuv444p16le 1920x1080 -> yuv444p14le 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=2926 us, ref=3618 us, speedup=1.237x faster

gbrp12le 1920x1080 -> rgba 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999988 V=1.000000 A=1.000000}
  time=4221 us, ref=11934 us, speedup=2.827x faster

yuvj444p 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
  time=3939 us, ref=7227 us, speedup=1.835x faster

yuv444p14le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=4188 us, ref=7221 us, speedup=1.724x faster

gbrp10le 1920x1080 -> yuv444p12le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999998 U=0.999996 V=1.000000 A=1.000000}
  time=4325 us, ref=10025 us, speedup=2.318x faster

yuv444p14le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1,
  SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
  time=1333 us, ref=2065 us, speedup=1.549x faster

gbrap 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1,
  SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
  time=4346 us, ref=7390 us, speedup=1.700x faster

The only two actual slowdowns here were:

Conversion pass for gbrp -> gbrap:
  [ u8 XXXX -> +++X] SWS_OP_READ  : 3 elem(s) planar >> 0
  [ u8 ...X -> ++++] SWS_OP_CLEAR : {_ _ _ 255}
  [ u8 .... -> XXXX] SWS_OP_WRITE : 4 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

I neglected to add a dedicated kernel for read-clear-write, so this is
going through the general path with three separate function calls.
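To illustrate the difference, here is roughly what the general path versus a fused read-clear-write kernel looks like for this gbrp -> gbrap pass. This is a simplified sketch; the function names and signatures are invented for this mail and are not the actual swscale code.

```c
#include <stdint.h>
#include <string.h>

#define ALPHA_OPAQUE 255

/* General path: each op is a separate function call, with an intermediate
 * AoS line buffer written out and read back in between the ops. */
static void op_read_planar3(uint8_t tmp[][4], const uint8_t *g,
                            const uint8_t *b, const uint8_t *r, int w)
{
    for (int x = 0; x < w; x++) {
        tmp[x][0] = g[x];
        tmp[x][1] = b[x];
        tmp[x][2] = r[x];
    }
}

static void op_clear_alpha(uint8_t tmp[][4], int w)
{
    for (int x = 0; x < w; x++)
        tmp[x][3] = ALPHA_OPAQUE;
}

static void op_write_planar4(uint8_t tmp[][4], uint8_t *g, uint8_t *b,
                             uint8_t *r, uint8_t *a, int w)
{
    for (int x = 0; x < w; x++) {
        g[x] = tmp[x][0];
        b[x] = tmp[x][1];
        r[x] = tmp[x][2];
        a[x] = tmp[x][3];
    }
}

/* Fused read-clear-write kernel: no intermediate buffer at all; the color
 * planes are copied directly and the alpha plane is filled with 255. */
static void gbrp_to_gbrap_fused(const uint8_t *g, const uint8_t *b,
                                const uint8_t *r, uint8_t *dg, uint8_t *db,
                                uint8_t *dr, uint8_t *da, int w)
{
    memcpy(dg, g, w);
    memcpy(db, b, w);
    memcpy(dr, r, w);
    memset(da, ALPHA_OPAQUE, w);
}
```

Both paths produce identical output; the fused variant simply avoids writing and re-reading the intermediate line buffer, which is exactly the load/store overhead responsible for the slowdown in such trivial passes.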
And even so, it is only 2.5% slower than the existing dedicated fast path.
I imagine we will want to add a fast path here eventually, unless the
custom calling convention obviates the need for such fast paths.

Conversion pass for yuva444p -> yuv444p12le:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> u16
  [u16 ...X -> +++X] SWS_OP_LSHIFT  : << 4
  [u16 ...X -> XXXX] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

This is another case where the only operation being performed (an expanding
left shift) is so small that the load/store overhead is enough to cause a
measurable slowdown - 5.8% in this case. As with the previous case, it
would be easy to add a dedicated read-shift-write implementation to make
these cases faster; I just opted not to because the slowdown was not
massive.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".