On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

---

As you've noticed in later patches; most of this commentery _is_ valuable to keep in the commit message, so I'd keep most of this, including the performance diff, in the commit message (i.e. above the ---).

This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that. This pattern repeats in other functions in this file, I
fixed some in the succeeding patch. There's a lot of performance to be
gained there.

I didn't modify the existing function because it adds a few extra steps
solely for the shared w_avg implementation (every cycle matters), but also
because I find this linear version easier to digest and understand.

That's probably reasonable - but if the avg codepath in vvc_avg is unused now, we should remove it; that makes the patch clearer to see the change, when we see the removed old codepath together with the new added one in the same patch.

Besides, I noticed that removing smin and smax instructions used for
clamping the values for 10 and 12 bit_depth instantiations does not
affect the checkasm result, but it breaks FATE.

It would probably be good if we could improve the checkasm to hit those cases too, but that's of course a separate question.


Benchmarks before and after:
A78
avg_8_2x2_neon:                                         21.0 ( 1.55x)
avg_8_4x4_neon:                                         25.8 ( 3.05x)
avg_8_8x8_neon:                                         45.0 ( 5.86x)
avg_8_16x16_neon:                                      178.5 ( 5.49x)
avg_8_32x32_neon:                                      709.2 ( 6.20x)
avg_8_64x64_neon:                                     2686.2 ( 6.12x)
avg_8_128x128_neon:                                  10734.2 ( 5.88x)
avg_10_2x2_neon:                                        19.0 ( 1.75x)
avg_10_4x4_neon:                                        28.2 ( 2.76x)
avg_10_8x8_neon:                                        44.0 ( 5.82x)
avg_10_16x16_neon:                                     179.5 ( 4.81x)
avg_10_32x32_neon:                                     680.8 ( 5.58x)
avg_10_64x64_neon:                                    2536.8 ( 5.40x)
avg_10_128x128_neon:                                 10079.0 ( 5.22x)
avg_12_2x2_neon:                                        20.8 ( 1.59x)
avg_12_4x4_neon:                                        25.2 ( 3.09x)
avg_12_8x8_neon:                                        44.0 ( 5.79x)
avg_12_16x16_neon:                                     182.2 ( 4.80x)
avg_12_32x32_neon:                                     696.2 ( 5.46x)
avg_12_64x64_neon:                                    2548.2 ( 5.38x)
avg_12_128x128_neon:                                 10133.8 ( 5.19x)

avg_8_2x2_neon:                                         16.5 ( 1.98x)
avg_8_4x4_neon:                                         26.2 ( 2.93x)
avg_8_8x8_neon:                                         31.8 ( 8.55x)
avg_8_16x16_neon:                                       82.0 (12.02x)
avg_8_32x32_neon:                                      310.2 (14.12x)
avg_8_64x64_neon:                                      897.8 (18.26x)
avg_8_128x128_neon:                                   3608.5 (17.37x)
avg_10_2x2_neon:                                        19.5 ( 1.69x)
avg_10_4x4_neon:                                        28.0 ( 2.79x)
avg_10_8x8_neon:                                        34.8 ( 7.32x)
avg_10_16x16_neon:                                     119.8 ( 7.35x)
avg_10_32x32_neon:                                     444.2 ( 8.51x)
avg_10_64x64_neon:                                    1711.8 ( 8.00x)
avg_10_128x128_neon:                                  7065.2 ( 7.43x)
avg_12_2x2_neon:                                        19.5 ( 1.71x)
avg_12_4x4_neon:                                        24.2 ( 3.22x)
avg_12_8x8_neon:                                        33.8 ( 7.57x)
avg_12_16x16_neon:                                     120.2 ( 7.33x)
avg_12_32x32_neon:                                     442.5 ( 8.53x)
avg_12_64x64_neon:                                    1706.2 ( 8.02x)
avg_12_128x128_neon:                                  7010.0 ( 7.46x)

A72
avg_8_2x2_neon:                                         30.2 ( 1.48x)
avg_8_4x4_neon:                                         40.0 ( 3.10x)
avg_8_8x8_neon:                                         91.0 ( 4.14x)
avg_8_16x16_neon:                                      340.4 ( 3.92x)
avg_8_32x32_neon:                                     1220.7 ( 4.67x)
avg_8_64x64_neon:                                     5823.4 ( 3.88x)
avg_8_128x128_neon:                                  17430.5 ( 4.73x)
avg_10_2x2_neon:                                        34.0 ( 1.66x)
avg_10_4x4_neon:                                        45.2 ( 2.73x)
avg_10_8x8_neon:                                        97.5 ( 3.87x)
avg_10_16x16_neon:                                     317.7 ( 3.90x)
avg_10_32x32_neon:                                    1376.2 ( 4.21x)
avg_10_64x64_neon:                                    5228.1 ( 3.71x)
avg_10_128x128_neon:                                 16722.2 ( 4.17x)
avg_12_2x2_neon:                                        31.7 ( 1.76x)
avg_12_4x4_neon:                                        36.0 ( 3.44x)
avg_12_8x8_neon:                                        91.7 ( 4.10x)
avg_12_16x16_neon:                                     297.2 ( 4.13x)
avg_12_32x32_neon:                                    1400.5 ( 4.14x)
avg_12_64x64_neon:                                    5379.1 ( 3.51x)
avg_12_128x128_neon:                                 16715.7 ( 4.17x)

avg_8_2x2_neon:                                         33.7 ( 1.72x)
avg_8_4x4_neon:                                         45.5 ( 2.84x)
avg_8_8x8_neon:                                         65.0 ( 5.98x)
avg_8_16x16_neon:                                      171.0 ( 7.81x)
avg_8_32x32_neon:                                      558.2 (10.05x)
avg_8_64x64_neon:                                     2006.5 (10.61x)
avg_8_128x128_neon:                                   9158.7 ( 8.96x)
avg_10_2x2_neon:                                        38.0 ( 1.92x)
avg_10_4x4_neon:                                        53.2 ( 2.69x)
avg_10_8x8_neon:                                        95.2 ( 4.08x)
avg_10_16x16_neon:                                     243.0 ( 5.02x)
avg_10_32x32_neon:                                     891.7 ( 5.64x)
avg_10_64x64_neon:                                    3357.7 ( 5.60x)
avg_10_128x128_neon:                                 12411.7 ( 5.56x)
avg_12_2x2_neon:                                        34.7 ( 1.97x)
avg_12_4x4_neon:                                        53.2 ( 2.68x)
avg_12_8x8_neon:                                        91.7 ( 4.22x)
avg_12_16x16_neon:                                     239.0 ( 5.08x)
avg_12_32x32_neon:                                     895.7 ( 5.62x)
avg_12_64x64_neon:                                    3317.5 ( 5.67x)
avg_12_128x128_neon:                                 12358.5 ( 5.58x)


A53
avg_8_2x2_neon:                                         58.3 ( 1.41x)
avg_8_4x4_neon:                                        101.8 ( 2.21x)
avg_8_8x8_neon:                                        178.6 ( 4.53x)
avg_8_16x16_neon:                                      569.5 ( 5.01x)
avg_8_32x32_neon:                                     1962.5 ( 5.50x)
avg_8_64x64_neon:                                     8327.8 ( 5.18x)
avg_8_128x128_neon:                                  31631.3 ( 5.34x)
avg_10_2x2_neon:                                        54.5 ( 1.56x)
avg_10_4x4_neon:                                        88.8 ( 2.53x)
avg_10_8x8_neon:                                       163.6 ( 4.97x)
avg_10_16x16_neon:                                     550.5 ( 5.16x)
avg_10_32x32_neon:                                    1942.5 ( 5.64x)
avg_10_64x64_neon:                                    8783.5 ( 4.98x)
avg_10_128x128_neon:                                 32617.0 ( 5.25x)
avg_12_2x2_neon:                                        53.3 ( 1.66x)
avg_12_4x4_neon:                                        86.8 ( 2.61x)
avg_12_8x8_neon:                                       156.6 ( 5.12x)
avg_12_16x16_neon:                                     541.3 ( 5.25x)
avg_12_32x32_neon:                                    1955.3 ( 5.59x)
avg_12_64x64_neon:                                    8686.0 ( 5.06x)
avg_12_128x128_neon:                                 32487.5 ( 5.25x)

avg_8_2x2_neon:                                         39.5 ( 1.96x)
avg_8_4x4_neon:                                         65.3 ( 3.41x)
avg_8_8x8_neon:                                        168.8 ( 4.79x)
avg_8_16x16_neon:                                      348.0 ( 8.20x)
avg_8_32x32_neon:                                     1207.5 ( 8.98x)
avg_8_64x64_neon:                                     6032.3 ( 7.17x)
avg_8_128x128_neon:                                  22008.5 ( 7.69x)
avg_10_2x2_neon:                                        55.5 ( 1.52x)
avg_10_4x4_neon:                                        73.8 ( 3.08x)
avg_10_8x8_neon:                                       157.8 ( 5.12x)
avg_10_16x16_neon:                                     445.0 ( 6.43x)
avg_10_32x32_neon:                                    1587.3 ( 6.87x)
avg_10_64x64_neon:                                    7738.0 ( 5.68x)
avg_10_128x128_neon:                                 27813.8 ( 6.14x)
avg_12_2x2_neon:                                        48.3 ( 1.80x)
avg_12_4x4_neon:                                        77.0 ( 2.95x)
avg_12_8x8_neon:                                       161.5 ( 4.98x)
avg_12_16x16_neon:                                     433.5 ( 6.59x)
avg_12_32x32_neon:                                    1622.0 ( 6.75x)
avg_12_64x64_neon:                                    7844.5 ( 5.60x)
avg_12_128x128_neon:                                 26999.5 ( 6.34x)

Krzysztof

libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
1 file changed, 121 insertions(+), 3 deletions(-)

Overall the change looks reasonable to me, thanks, but remove the now unused parts and update the patch to include the valuable comments and benchmarks above the "---" bit.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to