Re: [FFmpeg-devel] [PATCH v6] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX

2019-02-04 Thread Lauri Kasanen
On Sun, 13 Jan 2019 10:26:20 +0200
Lauri Kasanen wrote:

> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
> -s 1920x1728 -f null -vframes 100 -v error -nostats -
> 
> 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
> Fate passes, each format tested with an image to video conversion.
> 
> Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
> of the 16-bit function. This includes the vec_mulo/mule functions too,
> not just vmuluwm.

Applying.

- Lauri
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH v6] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX

2019-01-27 Thread Lauri Kasanen
On Mon, 14 Jan 2019 16:13:52 +0100
Michael Niedermayer wrote:

> On Sun, Jan 13, 2019 at 10:26:20AM +0200, Lauri Kasanen wrote:
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
> > -s 1920x1728 -f null -vframes 100 -v error -nostats -
> > 
> > 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
> > Fate passes, each format tested with an image to video conversion.
> > 
> > Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
> > of the 16-bit function. This includes the vec_mulo/mule functions too,
> > not just vmuluwm.
...
> > v6: No patch changes, updated bench numbers without skips.
> 
> fate does not get worse from this patch on qemu ppc32be and ppc64le 

Ping

- Lauri


Re: [FFmpeg-devel] [PATCH v6] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX

2019-01-14 Thread Michael Niedermayer
On Sun, Jan 13, 2019 at 10:26:20AM +0200, Lauri Kasanen wrote:
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
> -s 1920x1728 -f null -vframes 100 -v error -nostats -
> 
> 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
> Fate passes, each format tested with an image to video conversion.
> 
> Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
> of the 16-bit function. This includes the vec_mulo/mule functions too,
> not just vmuluwm.
> 
> With TIMER_REPORT skips disabled:
> yuv420p9le
>   12412 UNITS in planarX,  131072 runs,  0 skips
>   73136 UNITS in planarX,  131072 runs,  0 skips
> yuv420p9be
>   12481 UNITS in planarX,  131072 runs,  0 skips
>   73410 UNITS in planarX,  131072 runs,  0 skips
> yuv420p10le
>   12322 UNITS in planarX,  131072 runs,  0 skips
>   72546 UNITS in planarX,  131072 runs,  0 skips
> yuv420p10be
>   12291 UNITS in planarX,  131072 runs,  0 skips
>   72935 UNITS in planarX,  131072 runs,  0 skips
> yuv420p12le
>   12316 UNITS in planarX,  131072 runs,  0 skips
>   72708 UNITS in planarX,  131072 runs,  0 skips
> yuv420p12be
>   12319 UNITS in planarX,  131072 runs,  0 skips
>   72577 UNITS in planarX,  131072 runs,  0 skips
> yuv420p14le
>   12259 UNITS in planarX,  131072 runs,  0 skips
>   72516 UNITS in planarX,  131072 runs,  0 skips
> yuv420p14be
>   12440 UNITS in planarX,  131072 runs,  0 skips
>   72962 UNITS in planarX,  131072 runs,  0 skips
> yuv420p16le
>   10548 UNITS in planarX,  131072 runs,  0 skips
>   73429 UNITS in planarX,  131072 runs,  0 skips
> yuv420p16be
>   10634 UNITS in planarX,  131072 runs,  0 skips
>  150959 UNITS in planarX,  131072 runs,  0 skips
> 
> Signed-off-by: Lauri Kasanen 
> ---
>  libswscale/ppc/swscale_ppc_template.c |   4 +-
>  libswscale/ppc/swscale_vsx.c          | 186 +-
>  2 files changed, 184 insertions(+), 6 deletions(-)
> 
> v6: No patch changes, updated bench numbers without skips.

fate does not get worse from this patch on qemu ppc32be and ppc64le 


[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If you fake or manipulate statistics in a paper in physics you will never
get a job again.
If you fake or manipulate statistics in a paper in medicine you will get
a job for life at the pharma industry.




[FFmpeg-devel] [PATCH v6] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX

2019-01-13 Thread Lauri Kasanen
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
-s 1920x1728 -f null -vframes 100 -v error -nostats -

9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
Fate passes, each format tested with an image to video conversion.

Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
of the 16-bit function. This includes the vec_mulo/mule functions too,
not just vmuluwm.

With TIMER_REPORT skips disabled:
yuv420p9le
  12412 UNITS in planarX,  131072 runs,  0 skips
  73136 UNITS in planarX,  131072 runs,  0 skips
yuv420p9be
  12481 UNITS in planarX,  131072 runs,  0 skips
  73410 UNITS in planarX,  131072 runs,  0 skips
yuv420p10le
  12322 UNITS in planarX,  131072 runs,  0 skips
  72546 UNITS in planarX,  131072 runs,  0 skips
yuv420p10be
  12291 UNITS in planarX,  131072 runs,  0 skips
  72935 UNITS in planarX,  131072 runs,  0 skips
yuv420p12le
  12316 UNITS in planarX,  131072 runs,  0 skips
  72708 UNITS in planarX,  131072 runs,  0 skips
yuv420p12be
  12319 UNITS in planarX,  131072 runs,  0 skips
  72577 UNITS in planarX,  131072 runs,  0 skips
yuv420p14le
  12259 UNITS in planarX,  131072 runs,  0 skips
  72516 UNITS in planarX,  131072 runs,  0 skips
yuv420p14be
  12440 UNITS in planarX,  131072 runs,  0 skips
  72962 UNITS in planarX,  131072 runs,  0 skips
yuv420p16le
  10548 UNITS in planarX,  131072 runs,  0 skips
  73429 UNITS in planarX,  131072 runs,  0 skips
yuv420p16be
  10634 UNITS in planarX,  131072 runs,  0 skips
 150959 UNITS in planarX,  131072 runs,  0 skips
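
For a quick sanity check of the "about 6x" and "about 15x" figures, the
TIMER_REPORT pairs above can be divided directly. This is only a sketch: it
assumes the first line of each pair is the VSX run and the second the
unpatched C run (the mail does not label them), and speedup() and
print_speedups() are hypothetical helpers, not part of FFmpeg.

```c
#include <assert.h>
#include <stdio.h>

/* One TIMER_REPORT pair from the table above. Which line is the patched run
 * is an assumption: the smaller number is taken to be the VSX version. */
struct bench {
    const char *fmt;
    unsigned vsx_units; /* assumed: first line of the pair */
    unsigned c_units;   /* assumed: second line of the pair */
};

static const struct bench results[] = {
    { "yuv420p9le",  12412,  73136 },
    { "yuv420p9be",  12481,  73410 },
    { "yuv420p10le", 12322,  72546 },
    { "yuv420p10be", 12291,  72935 },
    { "yuv420p12le", 12316,  72708 },
    { "yuv420p12be", 12319,  72577 },
    { "yuv420p14le", 12259,  72516 },
    { "yuv420p14be", 12440,  72962 },
    { "yuv420p16le", 10548,  73429 },
    { "yuv420p16be", 10634, 150959 },
};

#define NB_RESULTS (sizeof(results) / sizeof(results[0]))

/* Hypothetical helper: ratio of unpatched to patched timer units. */
static double speedup(unsigned vsx_units, unsigned c_units)
{
    assert(vsx_units > 0);
    return (double)c_units / vsx_units;
}

/* Print one ratio per pixel format. */
static void print_speedups(void)
{
    for (size_t i = 0; i < NB_RESULTS; i++)
        printf("%-12s %5.2fx\n", results[i].fmt,
               speedup(results[i].vsx_units, results[i].c_units));
}
```

Calling print_speedups() from main() prints one ratio per format. Under the
stated assumption, the 9-14 bit pairs come out a little under 6x,
yuv420p16le near 7x, and yuv420p16be near 14x.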

Signed-off-by: Lauri Kasanen 
---
 libswscale/ppc/swscale_ppc_template.c |   4 +-
 libswscale/ppc/swscale_vsx.c  | 186 +-
 2 files changed, 184 insertions(+), 6 deletions(-)

v6: No patch changes, updated bench numbers without skips.
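
For readers who want to try the 9-14 bit arithmetic outside of swscale, here
is a standalone sketch of what the scalar fallback yuv2planeX_nbps_u in the
diff below computes. It assumes output_pixel() shifts the accumulated value
right by `shift`, clips it to `output_bits` and stores it in the requested
byte order, which is what the clip and swap constants in the vector version
suggest. planeX_nbps_ref() and clip_uintp2() are names made up for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* Clip v to [0, 2^bits - 1]. Stand-in for the clipping assumed to happen
 * inside output_pixel(). */
static uint16_t clip_uintp2(int v, int bits)
{
    const int max = (1 << bits) - 1;
    if (v < 0)
        return 0;
    if (v > max)
        return (uint16_t)max;
    return (uint16_t)v;
}

/* Hypothetical reference version of the scalar loop: one vertical filter
 * pass over filter_size source lines, producing dst_w output samples. */
static void planeX_nbps_ref(const int16_t *filter, int filter_size,
                            const int16_t **src, uint16_t *dest, int dst_w,
                            int big_endian, int output_bits)
{
    const int shift = 11 + 16 - output_bits;

    assert(output_bits >= 9 && output_bits <= 14);

    for (int i = 0; i < dst_w; i++) {
        int val = 1 << (shift - 1); /* rounding term */
        uint8_t *out = (uint8_t *)&dest[i];
        uint16_t px;

        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];

        /* Relies on arithmetic right shift for negative sums, as most
         * compilers provide. */
        px = clip_uintp2(val >> shift, output_bits);

        /* Store with explicit byte order, independent of the host. */
        if (big_endian) {
            out[0] = (uint8_t)(px >> 8);
            out[1] = (uint8_t)(px & 0xff);
        } else {
            out[0] = (uint8_t)(px & 0xff);
            out[1] = (uint8_t)(px >> 8);
        }
    }
}
```

With a single filter tap of 4096 and 10 output bits, an input sample of 16384
maps to 512, out-of-range sums saturate at 1023, and negative sums clip to 0.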

diff --git a/libswscale/ppc/swscale_ppc_template.c b/libswscale/ppc/swscale_ppc_template.c
index 00e4b99..11decab 100644
--- a/libswscale/ppc/swscale_ppc_template.c
+++ b/libswscale/ppc/swscale_ppc_template.c
@@ -21,7 +21,7 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-static void FUNC(yuv2planeX_16)(const int16_t *filter, int filterSize,
+static void FUNC(yuv2planeX_8_16)(const int16_t *filter, int filterSize,
                                 const int16_t **src, uint8_t *dest,
                                 const uint8_t *dither, int offset, int x)
 {
@@ -88,7 +88,7 @@ static void FUNC(yuv2planeX)(const int16_t *filter, int filterSize,
     yuv2planeX_u(filter, filterSize, src, dest, dst_u, dither, offset, 0);
 
     for (i = dst_u; i < dstW - 15; i += 16)
-        FUNC(yuv2planeX_16)(filter, filterSize, src, dest + i, dither,
+        FUNC(yuv2planeX_8_16)(filter, filterSize, src, dest + i, dither,
                             offset, i);
 
     yuv2planeX_u(filter, filterSize, src, dest, dstW, dither, offset, i);
diff --git a/libswscale/ppc/swscale_vsx.c b/libswscale/ppc/swscale_vsx.c
index 70da6ae..f6c7f1d 100644
--- a/libswscale/ppc/swscale_vsx.c
+++ b/libswscale/ppc/swscale_vsx.c
@@ -83,6 +83,8 @@
 #include "swscale_ppc_template.c"
 #undef FUNC
 
+#undef vzero
+
 #endif /* !HAVE_BIGENDIAN */
 
 static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
@@ -180,6 +182,76 @@ static void yuv2plane1_nbps_vsx(const int16_t *src, uint16_t *dest, int dstW,
     yuv2plane1_nbps_u(src, dest, dstW, big_endian, output_bits, i);
 }
 
+static void yuv2planeX_nbps_u(const int16_t *filter, int filterSize,
+                              const int16_t **src, uint16_t *dest, int dstW,
+                              int big_endian, int output_bits, int start)
+{
+    int i;
+    int shift = 11 + 16 - output_bits;
+
+    for (i = start; i < dstW; i++) {
+        int val = 1 << (shift - 1);
+        int j;
+
+        for (j = 0; j < filterSize; j++)
+            val += src[j][i] * filter[j];
+
+        output_pixel(&dest[i], val);
+    }
+}
+
+static void yuv2planeX_nbps_vsx(const int16_t *filter, int filterSize,
+                                const int16_t **src, uint16_t *dest, int dstW,
+                                int big_endian, int output_bits)
+{
+    const int dst_u = -(uintptr_t)dest & 7;
+    const int shift = 11 + 16 - output_bits;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << output_bits) - 1;
+    const uint16_t swap = big_endian ? 8 : 0;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift};
+    const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap};
+    const vector uint16_t vlargest = (vector uint16_t) {clip, clip, clip, clip, clip, clip, clip, clip};
+    const vector int16_t vzero = vec_splat_s16(0);
+    const vector uint8_t vperm = (vector uint8_t) {0, 1, 8, 9, 2, 3, 10, 11, 
4,