Re: [FFmpeg-devel] [PATCH] swscale/output: Altivec-optimize float yuv2plane1
On Sun, 16 Dec 2018 00:22:00 +0100
Michael Niedermayer wrote:

> On Sat, Dec 15, 2018 at 06:32:31PM +0200, Lauri Kasanen wrote:
> > Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
> >
> >  libswscale/ppc/swscale_altivec.c | 139 ++-
> >  1 file changed, 137 insertions(+), 2 deletions(-)
>
> breaks build:
> src/libswscale/ppc/swscale_altivec.c: In function ‘yuv2plane1_float_altivec’:
> src/libswscale/ppc/swscale_altivec.c:158:80: error: expected declaration specifiers or ‘...’ before ‘(’ token
>      const vector float vzero = (vector float) {0, 0, 0, 0};

Thanks for testing. I missed the vzero define at the top, and I wonder
why my gcc did not break. Patch v2 coming.

- Lauri
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
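The name collision behind the reported error can be reproduced without
AltiVec hardware. Below is a minimal sketch, assuming only that the
"#define vzero vec_splat_s32(0)" line from the big-endian section of
swscale_altivec.c is in scope. STR, STR_, expanded_decl and
name_was_rewritten are hypothetical helpers for illustration, not part
of FFmpeg.

```c
#include <string.h>

/* Same define as in the big-endian section of swscale_altivec.c. */
#define vzero vec_splat_s32(0)

/* Two-level variadic stringification: the outer macro expands its
 * arguments (including vzero) before the inner one turns them into a
 * string. Variadic form is used because the initializer has commas. */
#define STR_(...) #__VA_ARGS__
#define STR(...)  STR_(__VA_ARGS__)

/* Text the compiler sees for the declaration from the patch once the
 * preprocessor has run (hypothetical helper, illustration only). */
static const char *expanded_decl(void)
{
    return STR(const vector float vzero = (vector float) {0, 0, 0, 0};);
}

/* Returns 1 if the variable name was replaced by the macro body. */
static int name_was_rewritten(void)
{
    const char *s = expanded_decl();
    return strstr(s, "vec_splat_s32(0)") != NULL &&
           strstr(s, "vzero") == NULL;
}
```

Printing expanded_decl() shows the declarator replaced by a function
call expression, which would explain "expected declaration specifiers
or ‘...’ before ‘(’ token". In the patch the define sits inside
"#if HAVE_BIGENDIAN", so a little-endian build presumably never sees
it; that may be why the POWER8 LE build did not fail.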
Re: [FFmpeg-devel] [PATCH] swscale/output: Altivec-optimize float yuv2plane1
On Sat, Dec 15, 2018 at 06:32:31PM +0200, Lauri Kasanen wrote:
> This function wouldn't benefit from VSX instructions, and input
> and output share alignment, so I put it under altivec.
>
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> -f null -vframes 100 -v error -nostats -
>
> 3743 UNITS in planar1, 65495 runs, 41 skips
>
> -cpuflags 0
>
> 23511 UNITS in planar1, 65530 runs, 6 skips
>
> grayf32be
>
> 4647 UNITS in planar1, 65449 runs, 87 skips
>
> -cpuflags 0
>
> 28608 UNITS in planar1, 65530 runs, 6 skips
>
> The native speedup is 6.28133, and the bswapping one 6.15623.
> Fate passes, each format tested with an image to video conversion.
>
> Signed-off-by: Lauri Kasanen
> ---
>
> Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
>
>  libswscale/ppc/swscale_altivec.c | 139 ++-
>  1 file changed, 137 insertions(+), 2 deletions(-)

breaks build:

CC libswscale/ppc/swscale_altivec.o
In file included from src/libswscale/ppc/swscale_altivec.c:103:0:
src/libswscale/ppc/swscale_ppc_template.c: In function ‘yuv2planeX_16_altivec’:
src/libswscale/ppc/swscale_ppc_template.c:52:215: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     yuv2planeX_8(vo1, vo2, l1, src[j], x, perm, vLumFilter);
     ^
In file included from src/libswscale/ppc/swscale_altivec.c:103:0:
src/libswscale/ppc/swscale_ppc_template.c:53:219: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     yuv2planeX_8(vo3, vo4, l1, src[j], x + 8, perm, vLumFilter);
     ^
In file included from src/libswscale/ppc/swscale_altivec.c:103:0:
src/libswscale/ppc/swscale_ppc_template.c: In function ‘hScale_real_altivec’:
src/libswscale/ppc/swscale_ppc_template.c:189:21: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     vector signed short src_vA = // vec_unpackh sign-extends...
     ^
src/libswscale/ppc/swscale_ppc_template.c:196:21: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     vector signed int val_acc = vec_msums(src_vA, filter_v0, val_v);
     ^
src/libswscale/ppc/swscale_altivec.c: In function ‘yuv2plane1_float_altivec’:
src/libswscale/ppc/swscale_altivec.c:158:80: error: expected declaration specifiers or ‘...’ before ‘(’ token
     const vector float vzero = (vector float) {0, 0, 0, 0};
     ^
src/libswscale/ppc/swscale_altivec.c:159:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     vector uint32_t v;
     ^
src/libswscale/ppc/swscale_altivec.c:172:9: error: invalid parameter combination for AltiVec intrinsic
     vd = vec_madd(vd, vmul, vzero);
     ^
src/libswscale/ppc/swscale_altivec.c: In function ‘yuv2plane1_float_bswap_altivec’:
src/libswscale/ppc/swscale_altivec.c:191:80: error: expected declaration specifiers or ‘...’ before ‘(’ token
     const vector float vzero = (vector float) {0, 0, 0, 0};
     ^
src/libswscale/ppc/swscale_altivec.c:192:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     const vector uint32_t vswapbig = (vector uint32_t) {16, 16, 16, 16};
     ^
src/libswscale/ppc/swscale_altivec.c:207:9: error: invalid parameter combination for AltiVec intrinsic
     vd = vec_madd(vd, vmul, vzero);
     ^
make: *** [libswscale/ppc/swscale_altivec.o] Error 1

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates
[FFmpeg-devel] [PATCH] swscale/output: Altivec-optimize float yuv2plane1
This function wouldn't benefit from VSX instructions, and input
and output share alignment, so I put it under altivec.

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
-f null -vframes 100 -v error -nostats -

3743 UNITS in planar1, 65495 runs, 41 skips

-cpuflags 0

23511 UNITS in planar1, 65530 runs, 6 skips

grayf32be

4647 UNITS in planar1, 65449 runs, 87 skips

-cpuflags 0

28608 UNITS in planar1, 65530 runs, 6 skips

The native speedup is 6.28133, and the bswapping one 6.15623.
Fate passes, each format tested with an image to video conversion.

Signed-off-by: Lauri Kasanen
---

Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.

 libswscale/ppc/swscale_altivec.c | 139 ++-
 1 file changed, 137 insertions(+), 2 deletions(-)

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 1d2b2fa..2ef5257 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -31,7 +31,8 @@
 #include "yuv2rgb_altivec.h"
 #include "libavutil/ppc/util_altivec.h"
 
-#if HAVE_ALTIVEC && HAVE_BIGENDIAN
+#if HAVE_ALTIVEC
+#if HAVE_BIGENDIAN
 #define vzero vec_splat_s32(0)
 
 #define GET_LS(a,b,c,s) {\
@@ -102,7 +103,135 @@
 #include "swscale_ppc_template.c"
 #undef FUNC
 
-#endif /* HAVE_ALTIVEC && HAVE_BIGENDIAN */
+#endif /* HAVE_BIGENDIAN */
+
+#define output_pixel(pos, val, bias, signedness) \
+    if (big_endian) { \
+        AV_WB16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+    } else { \
+        AV_WL16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+    }
+
+static void
+yuv2plane1_float_u(const int32_t *src, float *dest, int dstW, int start)
+{
+    static const int big_endian = HAVE_BIGENDIAN;
+    static const int shift = 3;
+    static const float float_mult = 1.0f / 65535.0f;
+    int i, val;
+    uint16_t val_uint;
+
+    for (i = start; i < dstW; ++i){
+        val = src[i] + (1 << (shift - 1));
+        output_pixel(&val_uint, val, 0, uint);
+        dest[i] = float_mult * (float)val_uint;
+    }
+}
+
+static void
+yuv2plane1_float_bswap_u(const int32_t *src, uint32_t *dest, int dstW, int start)
+{
+    static const int big_endian = HAVE_BIGENDIAN;
+    static const int shift = 3;
+    static const float float_mult = 1.0f / 65535.0f;
+    int i, val;
+    uint16_t val_uint;
+
+    for (i = start; i < dstW; ++i){
+        val = src[i] + (1 << (shift - 1));
+        output_pixel(&val_uint, val, 0, uint);
+        dest[i] = av_bswap32(av_float2int(float_mult * (float)val_uint));
+    }
+}
+
+static void yuv2plane1_float_altivec(const int32_t *src, float *dest, int dstW)
+{
+    const int dst_u = -(uintptr_t)dest & 3;
+    const int shift = 3;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << 16) - 1;
+    const float fmult = 1.0f / 65535.0f;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+    const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+    const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+    const vector float vzero = (vector float) {0, 0, 0, 0};
+    vector uint32_t v;
+    vector float vd;
+    int i;
+
+    yuv2plane1_float_u(src, dest, dst_u, 0);
+
+    for (i = dst_u; i < dstW - 3; i += 4) {
+        v = vec_ld(0, (const uint32_t *) &src[i]);
+        v = vec_add(v, vadd);
+        v = vec_sr(v, vshift);
+        v = vec_min(v, vlargest);
+
+        vd = vec_ctf(v, 0);
+        vd = vec_madd(vd, vmul, vzero);
+
+        vec_st(vd, 0, &dest[i]);
+    }
+
+    yuv2plane1_float_u(src, dest, dstW, i);
+}
+
+static void yuv2plane1_float_bswap_altivec(const int32_t *src, uint32_t *dest, int dstW)
+{
+    const int dst_u = -(uintptr_t)dest & 3;
+    const int shift = 3;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << 16) - 1;
+    const float fmult = 1.0f / 65535.0f;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+    const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+    const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+    const vector float vzero = (vector float) {0, 0, 0, 0};
+    const vector uint32_t vswapbig = (vector uint32_t) {16, 16, 16, 16};
+    const vector uint16_t vswapsmall = vec_splat_u16(8);
+    vector uint32_t v;
+    vector float vd;
+    int i;
+
+    yuv2plane1_float_bswap_u(src, dest, dst_u, 0);
+
+    for (i = dst_u; i < dstW - 3; i += 4) {
+        v = vec_ld(0, (const uint32_t *) &src[i]);
+        v = vec_add(v, vadd);
+        v = vec_sr(v, vshift);
+        v = vec_min(v, vlargest);
+
+        vd = vec_ctf(v, 0);
+        vd = vec_madd(vd, vmul, vzero);
+
+        vd = (vector float) vec_rl((vector uint32_t) vd, vswapbig);
+        vd = (vector float)