On Sun, 1 Dec 2019, Clément Bœsch wrote:

On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
[...]
From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
From: Sebastian Pop <s...@amazon.com>
Date: Sun, 17 Nov 2019 14:13:13 -0600
Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4

This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
and bumps the vectorization factor from 2 to 4.
The speedup is of 34% on Graviton A1 instances based on A-72 cpus:

$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf 
bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
after:  t:0.030079 avg:0.030102 max:0.030462 min:0.030051

Tested with `make check` on aarch64-linux.

FWIW, I'm not certain how much this routine actually is tested by that - in particular, there's no checkasm test for it as far as I can see.

+        add                 x17, x3, x18                // srcp + filterPos[0]
+        add                 x18, x3, x0                 // srcp + filterPos[1]
+        add                 x0, x3, x2                  // srcp + filterPos[2]
+        add                 x2, x3, x6                  // srcp + filterPos[3]

+2:      ldr                 d4, [x17, x15]              // srcp[filterPos[0] + 
{0..7}]
+        ldr                 q5, [x16]                   // load 8x16-bit 
filter values, part 1
+        ldr                 d6, [x18, x15]              // srcp[filterPos[1] + 
{0..7}]
+        ldr                 q7, [x16, x12]              // load 8x16-bit at 
filter+filterSize

Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn in is
very confusing here.

The ldr instruction, instead of ld1, allows you to to do a load (or store, similarly, for str instead of st1) with a constant/register offset, like [x17, x15] here, without incrementing the source register inbetween for each load (which can help with latency between individual load instructions, or can avoid extra instructions for incrementing the register inbetween).

That works for loading the first 1/2/4/8/16 bytes of a vector, but can't be used e.g. for loading a lane other than the first (e.g. ld1 {v4.s}[1]). But it does require using the b/h/s/d/q names for the registers instead of v.

I didn't check the changes here if they're essential for the optimization though.

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to