Hi all,

For the past two months, I have been working on a prototype for a radical redesign of the swscale internals, specifically the format handling layer. This includes, or will eventually expand to include, all format input/output and unscaled special conversion steps.
I am not yet at a point where the new code can replace the scaling kernels, but for the time being, we could in theory start using it for the simple unscaled cases right away.

Rather than repeating the entire design here, I have collected my notes into a design document on my WIP branch:

https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt

I have spent the past week or so ironing out the last kinks and extensively benchmarking the new design, at least on x86. Overall it is roughly a 1.9x improvement over the existing unscaled special converters, before even adding any hand-written ASM. (This speedup is *just* using the less-than-optimal compiler output from my reference C code!) In some cases we even measure ~3-4x or even ~6x speedups, especially in cases where swscale does not currently have hand-written SIMD.

Overall:
  cpu: 16-core AMD Ryzen Threadripper 1950X
  gcc 14.2.1:
    single thread:
      Overall speedup=1.887x faster, min=0.250x max=22.578x
    multi thread:
      Overall speedup=1.657x faster, min=0.190x max=87.972x

(The roughly 0.2x slowdown cases are for rgb8/gbr8 input, which requires LUT support for efficient decoding. I wanted to focus on the core operations first before worrying about adding LUT-based optimizations to the design.)

I am (almost) ready to begin moving forward with this design, merging it into swscale and using it at least for unscaled format conversions, XYZ decoding, colorspace transformations (subsuming the existing, horribly unoptimized, 3DLUT layer), gamma transformations, and so on.

I wanted to post it here to gather some feedback on the approach. Where does it fall on the "madness" scale? Is the new operations and optimizer design comprehensible? Am I trying too hard to reinvent compilers? Are there any platforms where the high number of function calls per frame would be prohibitively expensive? What are the thoughts on the float-first approach?
See also the list of limitations and improvement ideas at the bottom of my design document.

Thanks for your time,
Niklas
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel