On Sun, 09 Mar 2025 22:13:49 +0100 Niklas Haas <ffm...@haasn.xyz> wrote:
> The worst slowdowns are currently those involving any sort of packed swizzle
> for which there exist dedicated MMX functions currently:
>
> Conversion pass for bgr24 -> abgr:
>   [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
>   [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0012
>   [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
>   [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
>     (X = unused, + = exact, 0 = zero)
> bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
>   time=1710 us, ref=826 us, speedup=0.483x slower
>
> I have previously identified these as a particularly weak spot in the compiler
> output, since no matter what C code I write, the result will always be roughly
> 0.5x compared to the existing hand-written MMX. That said, I also plan on taking
> that existing MMX code and simply plugging it into the new architecture, which
> should get rid of these last few slow cases.
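For anyone not used to reading the op lists, the quoted bgr24 -> abgr pass amounts to roughly the per-pixel loop below. This is only an illustrative sketch: `bgr24_to_abgr` is a hypothetical name, not a libswscale function, and it assumes that each digit of the swizzle mask 0012 selects the input component feeding that output slot, with slot 0 then overwritten by the CLEAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar equivalent of the READ/SWIZZLE/CLEAR/WRITE chain. */
static void bgr24_to_abgr(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    assert(src && dst);
    for (size_t i = 0; i < pixels; i++) {
        const uint8_t *in = src + 3 * i;  /* SWS_OP_READ: 3 packed u8 (B, G, R) */
        uint8_t *out = dst + 4 * i;       /* SWS_OP_WRITE: 4 packed u8 */

        out[0] = 255;    /* SWS_OP_CLEAR {255 _ _ _}: opaque alpha */
        out[1] = in[0];  /* SWS_OP_SWIZZLE 0012: B */
        out[2] = in[1];  /*                      G */
        out[3] = in[2];  /*                      R */
    }
}
```

The 3-byte stride on the input side is what makes this awkward for a compiler to vectorize, which is consistent with the slowdown reported above.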

I also wanted to point out that a lot of our conversions are also more
*accurate* than the previous implementations. An illustrative example:

Conversion pass for gray -> gray10le:
  [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) packed >> 0
  [ u8 .XXX -> +XXX] SWS_OP_CONVERT      : u8 -> f32
  [f32 .XXX -> .XXX] SWS_OP_SCALE        : * 341/85
  [f32 .XXX -> .XXX] SWS_OP_DITHER       : 16x16 matrix
  [f32 .XXX -> .XXX] SWS_OP_CLAMP        : 0 <= x <= {1023 _ _ _}
  [f32 .XXX -> +XXX] SWS_OP_CONVERT      : f32 -> u16
  [u16 .XXX -> XXXX] SWS_OP_WRITE        : 1 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)
gray 1920x1080 -> gray10le 1920x1080, flags=0 dither=1, SSIM {Y=0.999974 U=1.000000 V=1.000000 A=1.000000}
  time=1317 us, ref=1300 us, speedup=0.987x slower

The reference implementation handles this as a full range shift:
  gray10 = gray << 2 | gray >> 6.

But this is *not* accurate and will therefore introduce round trip error.

For example, a value of 200 produces 200 << 2 | 200 >> 6 = 803, while the
correct result would be 200 / 255 * 1023 = 802.3529411764706. Our new
implementation accurately handles this conversion in floating point math and
dithers the result down to a roughly 65%/35% mix of 802 and 803.
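
To make the arithmetic concrete, here is a minimal C sketch comparing the shift-based expansion with the exact 1023/255 scale (the same ratio as the 341/85 in the op list). All function names are hypothetical, and the dither step is a simplified integer stand-in that takes a single threshold `t` in [0, 255) rather than the actual 16x16 float matrix.

```c
#include <assert.h>
#include <stdint.h>

/* Reference behaviour as described above: bit replication. */
static uint16_t gray8_to_gray10_shift(uint8_t g)
{
    return (uint16_t)((g << 2) | (g >> 6));
}

/* Exact scale g * 1023 / 255, split into the integer floor and the
 * remainder out of 255, i.e. the fraction dithering has to distribute. */
static uint16_t gray8_to_gray10_floor(uint8_t g, unsigned *rem255)
{
    unsigned num = g * 1023u;
    *rem255 = num % 255u;
    return (uint16_t)(num / 255u);
}

/* Simplified dither (assumption): round up when the remainder exceeds
 * the threshold, so rem255 out of 255 thresholds give the higher value. */
static uint16_t gray8_to_gray10_dither(uint8_t g, unsigned t)
{
    unsigned rem;
    uint16_t lo;

    assert(t < 255);
    lo = gray8_to_gray10_floor(g, &rem);
    return (uint16_t)(lo + (rem > t));
}
```

For an input of 200 the exact scale gives 802 with a remainder of 90/255, so 90 of the 255 possible thresholds round up to 803 and the rest stay at 802, whereas the shift always yields 803.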