On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
---
As you've noticed in later patches; most of this commentery _is_ valuable to keep in the commit message, so I'd keep most of this, including the performance diff, in the commit message (i.e. above the ---).
This patch replaces integer widening with halving addition, and multi-step "emulated" rounding shift with a single asm instruction doing exactly that. This pattern repeats in other functions in this file, I fixed some in the succeeding patch. There's a lot of performance to be gained there. I didn't modify the existing function because it adds a few extra steps solely for the shared w_avg implementation (every cycle matters), but also because I find this linear version easier to digest and understand.
That's probably reasonable - but if the avg codepath in vvc_avg is unused now, we should remove it; that makes the patch clearer to see the change, when we see the removed old codepath together with the new added one in the same patch.
Besides, I noticed that removing smin and smax instructions used for clamping the values for 10 and 12 bit_depth instantiations does not affect the checkasm result, but it breaks FATE.
It would probably be good if we could improve the checkasm to hit those cases too, but that's of course a separate question.
Benchmarks before and after: A78 avg_8_2x2_neon: 21.0 ( 1.55x) avg_8_4x4_neon: 25.8 ( 3.05x) avg_8_8x8_neon: 45.0 ( 5.86x) avg_8_16x16_neon: 178.5 ( 5.49x) avg_8_32x32_neon: 709.2 ( 6.20x) avg_8_64x64_neon: 2686.2 ( 6.12x) avg_8_128x128_neon: 10734.2 ( 5.88x) avg_10_2x2_neon: 19.0 ( 1.75x) avg_10_4x4_neon: 28.2 ( 2.76x) avg_10_8x8_neon: 44.0 ( 5.82x) avg_10_16x16_neon: 179.5 ( 4.81x) avg_10_32x32_neon: 680.8 ( 5.58x) avg_10_64x64_neon: 2536.8 ( 5.40x) avg_10_128x128_neon: 10079.0 ( 5.22x) avg_12_2x2_neon: 20.8 ( 1.59x) avg_12_4x4_neon: 25.2 ( 3.09x) avg_12_8x8_neon: 44.0 ( 5.79x) avg_12_16x16_neon: 182.2 ( 4.80x) avg_12_32x32_neon: 696.2 ( 5.46x) avg_12_64x64_neon: 2548.2 ( 5.38x) avg_12_128x128_neon: 10133.8 ( 5.19x) avg_8_2x2_neon: 16.5 ( 1.98x) avg_8_4x4_neon: 26.2 ( 2.93x) avg_8_8x8_neon: 31.8 ( 8.55x) avg_8_16x16_neon: 82.0 (12.02x) avg_8_32x32_neon: 310.2 (14.12x) avg_8_64x64_neon: 897.8 (18.26x) avg_8_128x128_neon: 3608.5 (17.37x) avg_10_2x2_neon: 19.5 ( 1.69x) avg_10_4x4_neon: 28.0 ( 2.79x) avg_10_8x8_neon: 34.8 ( 7.32x) avg_10_16x16_neon: 119.8 ( 7.35x) avg_10_32x32_neon: 444.2 ( 8.51x) avg_10_64x64_neon: 1711.8 ( 8.00x) avg_10_128x128_neon: 7065.2 ( 7.43x) avg_12_2x2_neon: 19.5 ( 1.71x) avg_12_4x4_neon: 24.2 ( 3.22x) avg_12_8x8_neon: 33.8 ( 7.57x) avg_12_16x16_neon: 120.2 ( 7.33x) avg_12_32x32_neon: 442.5 ( 8.53x) avg_12_64x64_neon: 1706.2 ( 8.02x) avg_12_128x128_neon: 7010.0 ( 7.46x) A72 avg_8_2x2_neon: 30.2 ( 1.48x) avg_8_4x4_neon: 40.0 ( 3.10x) avg_8_8x8_neon: 91.0 ( 4.14x) avg_8_16x16_neon: 340.4 ( 3.92x) avg_8_32x32_neon: 1220.7 ( 4.67x) avg_8_64x64_neon: 5823.4 ( 3.88x) avg_8_128x128_neon: 17430.5 ( 4.73x) avg_10_2x2_neon: 34.0 ( 1.66x) avg_10_4x4_neon: 45.2 ( 2.73x) avg_10_8x8_neon: 97.5 ( 3.87x) avg_10_16x16_neon: 317.7 ( 3.90x) avg_10_32x32_neon: 1376.2 ( 4.21x) avg_10_64x64_neon: 5228.1 ( 3.71x) avg_10_128x128_neon: 16722.2 ( 4.17x) avg_12_2x2_neon: 31.7 ( 1.76x) avg_12_4x4_neon: 36.0 ( 3.44x) avg_12_8x8_neon: 91.7 ( 4.10x) avg_12_16x16_neon: 297.2 ( 4.13x) avg_12_32x32_neon: 1400.5 ( 4.14x) avg_12_64x64_neon: 5379.1 ( 3.51x) avg_12_128x128_neon: 16715.7 ( 4.17x) avg_8_2x2_neon: 33.7 ( 1.72x) avg_8_4x4_neon: 45.5 ( 2.84x) avg_8_8x8_neon: 65.0 ( 5.98x) avg_8_16x16_neon: 171.0 ( 7.81x) avg_8_32x32_neon: 558.2 (10.05x) avg_8_64x64_neon: 2006.5 (10.61x) avg_8_128x128_neon: 9158.7 ( 8.96x) avg_10_2x2_neon: 38.0 ( 1.92x) avg_10_4x4_neon: 53.2 ( 2.69x) avg_10_8x8_neon: 95.2 ( 4.08x) avg_10_16x16_neon: 243.0 ( 5.02x) avg_10_32x32_neon: 891.7 ( 5.64x) avg_10_64x64_neon: 3357.7 ( 5.60x) avg_10_128x128_neon: 12411.7 ( 5.56x) avg_12_2x2_neon: 34.7 ( 1.97x) avg_12_4x4_neon: 53.2 ( 2.68x) avg_12_8x8_neon: 91.7 ( 4.22x) avg_12_16x16_neon: 239.0 ( 5.08x) avg_12_32x32_neon: 895.7 ( 5.62x) avg_12_64x64_neon: 3317.5 ( 5.67x) avg_12_128x128_neon: 12358.5 ( 5.58x) A53 avg_8_2x2_neon: 58.3 ( 1.41x) avg_8_4x4_neon: 101.8 ( 2.21x) avg_8_8x8_neon: 178.6 ( 4.53x) avg_8_16x16_neon: 569.5 ( 5.01x) avg_8_32x32_neon: 1962.5 ( 5.50x) avg_8_64x64_neon: 8327.8 ( 5.18x) avg_8_128x128_neon: 31631.3 ( 5.34x) avg_10_2x2_neon: 54.5 ( 1.56x) avg_10_4x4_neon: 88.8 ( 2.53x) avg_10_8x8_neon: 163.6 ( 4.97x) avg_10_16x16_neon: 550.5 ( 5.16x) avg_10_32x32_neon: 1942.5 ( 5.64x) avg_10_64x64_neon: 8783.5 ( 4.98x) avg_10_128x128_neon: 32617.0 ( 5.25x) avg_12_2x2_neon: 53.3 ( 1.66x) avg_12_4x4_neon: 86.8 ( 2.61x) avg_12_8x8_neon: 156.6 ( 5.12x) avg_12_16x16_neon: 541.3 ( 5.25x) avg_12_32x32_neon: 1955.3 ( 5.59x) avg_12_64x64_neon: 8686.0 ( 5.06x) avg_12_128x128_neon: 32487.5 ( 5.25x) avg_8_2x2_neon: 39.5 ( 1.96x) avg_8_4x4_neon: 65.3 ( 3.41x) avg_8_8x8_neon: 168.8 ( 4.79x) avg_8_16x16_neon: 348.0 ( 8.20x) avg_8_32x32_neon: 1207.5 ( 8.98x) avg_8_64x64_neon: 6032.3 ( 7.17x) avg_8_128x128_neon: 22008.5 ( 7.69x) avg_10_2x2_neon: 55.5 ( 1.52x) avg_10_4x4_neon: 73.8 ( 3.08x) avg_10_8x8_neon: 157.8 ( 5.12x) avg_10_16x16_neon: 445.0 ( 6.43x) avg_10_32x32_neon: 1587.3 ( 6.87x) avg_10_64x64_neon: 7738.0 ( 5.68x) avg_10_128x128_neon: 27813.8 ( 6.14x) avg_12_2x2_neon: 48.3 ( 1.80x) avg_12_4x4_neon: 77.0 ( 2.95x) avg_12_8x8_neon: 161.5 ( 4.98x) avg_12_16x16_neon: 433.5 ( 6.59x) avg_12_32x32_neon: 1622.0 ( 6.75x) avg_12_64x64_neon: 7844.5 ( 5.60x) avg_12_128x128_neon: 26999.5 ( 6.34x) Krzysztof libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++- 1 file changed, 121 insertions(+), 3 deletions(-)
Overall the change looks reasonable to me, thanks, but remove the now unused parts and update the patch to include the valuable comments and benchmarks above the "---" bit.
// Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".