Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

Martin Storsjö Sun, 01 Dec 2019 13:02:28 -0800

On Sun, 1 Dec 2019, Clément Bœsch wrote:

On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
[...]

From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
From: Sebastian Pop <[email protected]>
Date: Sun, 17 Nov 2019 14:13:13 -0600
Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4


This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
and bumps the vectorization factor from 2 to 4.
The speedup is of 34% on Graviton A1 instances based on A-72 cpus:

$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf 
bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
after:  t:0.030079 avg:0.030102 max:0.030462 min:0.030051

Tested with `make check` on aarch64-linux.

FWIW, I'm not certain how much this routine actually is tested by that -in particular, there's no checkasm test for it as far as I can see.

+        add                 x17, x3, x18                // srcp + filterPos[0]
+        add                 x18, x3, x0                 // srcp + filterPos[1]
+        add                 x0, x3, x2                  // srcp + filterPos[2]
+        add                 x2, x3, x6                  // srcp + filterPos[3]

+2:      ldr                 d4, [x17, x15]              // srcp[filterPos[0] + 
{0..7}]
+        ldr                 q5, [x16]                   // load 8x16-bit 
filter values, part 1
+        ldr                 d6, [x18, x15]              // srcp[filterPos[1] + 
{0..7}]
+        ldr                 q7, [x16, x12]              // load 8x16-bit at 
filter+filterSize


Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn in is
very confusing here.

The ldr instruction, instead of ld1, allows you to to do a load (or store,similarly, for str instead of st1) with a constant/register offset, like[x17, x15] here, without incrementing the source register inbetween foreach load (which can help with latency between individual loadinstructions, or can avoid extra instructions for incrementing theregister inbetween).

That works for loading the first 1/2/4/8/16 bytes of a vector, but can'tbe used e.g. for loading a lane other than the first (e.g. ld1 {v4.s}[1]).But it does require using the b/h/s/d/q names for the registers instead ofv.

I didn't check the changes here if they're essential for the optimizationthough.


// Martin
_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".

Re: [FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

Reply via email to