Re: [FFmpeg-devel] [PATCH] avcodec: Remove libstagefright

2016-01-03 Thread Martin Storsjö

On Sun, 3 Jan 2016, Derek Buitenhuis wrote:


It serves absolutely no purpose other than to confuse potential
Android developers about how to use hardware acceleration properly
on the platform. Both stagefright itself, and MediaCodec, have
avcodec backends already, and this is the correct way to use it.


No, that's unrelated. Yes, people have written avcodec backends for 
stagefright/MediaCodec, but those are unrelated and only of interest for 
stock Android media players to extend their codec support.



MediaCodec as a proper JNI API.


wat? (Yes, using MediaCodec, either via the recent C API, or via JNI, is 
the correct way to do it.)
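For reference, a minimal sketch of that NDK C API flow (available since
Android 5.0 / API level 21; the codec name, sizes and timeouts here are
just examples, and error handling, format-change and end-of-stream
signaling are omitted):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <media/NdkMediaCodec.h>
#include <media/NdkMediaFormat.h>

/* Create and start an H.264 decoder; output goes to buffers (no surface). */
static AMediaCodec *open_h264_decoder(int width, int height)
{
    AMediaCodec *codec = AMediaCodec_createDecoderByType("video/avc");
    AMediaFormat *fmt  = AMediaFormat_new();
    AMediaFormat_setString(fmt, AMEDIAFORMAT_KEY_MIME, "video/avc");
    AMediaFormat_setInt32(fmt, AMEDIAFORMAT_KEY_WIDTH,  width);
    AMediaFormat_setInt32(fmt, AMEDIAFORMAT_KEY_HEIGHT, height);
    AMediaCodec_configure(codec, fmt, NULL /* surface */, NULL /* crypto */, 0);
    AMediaFormat_delete(fmt);
    AMediaCodec_start(codec);
    return codec;
}

/* Feed one Annex-B access unit and drain one output buffer, if ready. */
static void decode_one(AMediaCodec *codec, const uint8_t *au, size_t size,
                       int64_t pts_us)
{
    ssize_t in = AMediaCodec_dequeueInputBuffer(codec, 10000 /* us */);
    if (in >= 0) {
        size_t cap;
        uint8_t *buf = AMediaCodec_getInputBuffer(codec, in, &cap);
        if (size <= cap) {
            memcpy(buf, au, size);
            AMediaCodec_queueInputBuffer(codec, in, 0, size, pts_us, 0);
        }
    }
    AMediaCodecBufferInfo info;
    ssize_t out = AMediaCodec_dequeueOutputBuffer(codec, &info, 10000 /* us */);
    if (out >= 0)
        AMediaCodec_releaseOutputBuffer(codec, out, false /* don't render */);
}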



Furthermore, stagefright support in avcodec needs a series of
magic incantations and version-specific stuff, such that
using it actually has downsides compared to just using the actual
Android frameworks properly, in that it is a lot more work and confusion
to get it even running. It also leads to a lot of misinformation, like
these sorts of comments (in [1]) that are absolutely incorrect.


Spot on, +1.


[1] http://stackoverflow.com/a/29362353/3115956

Signed-off-by: Derek Buitenhuis 
---
I am certain there are many more reasons to remove this as well. I know
its own author despises it, and I know j-b will have similar things to say.


Not the direct author, but co-author/mentor.

// Martin


[FFmpeg-devel] [PATCH 2/2] libopenh264: Support building with the 1.6 release

2016-07-26 Thread Martin Storsjö
This fixes trac issue #5417.

This is cherry-picked from libav commit
d825b1a5306576dcd0553b7d0d24a3a46ad92864.
---
Updated the commit message to mention the ticket number.
---
 libavcodec/libopenh264dec.c |  2 ++
 libavcodec/libopenh264enc.c | 26 --
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/libavcodec/libopenh264dec.c b/libavcodec/libopenh264dec.c
index f642082..6af60af 100644
--- a/libavcodec/libopenh264dec.c
+++ b/libavcodec/libopenh264dec.c
@@ -90,7 +90,9 @@ static av_cold int svc_decode_init(AVCodecContext *avctx)
     (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK, (void *)&callback_function);
     (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK_CONTEXT, (void *)&avctx);
 
+#if !OPENH264_VER_AT_LEAST(1, 6)
 param.eOutputColorFormat = videoFormatI420;
+#endif
 param.eEcActiveIdc   = ERROR_CON_DISABLE;
 param.sVideoProperty.eVideoBsType = VIDEO_BITSTREAM_DEFAULT;
 
diff --git a/libavcodec/libopenh264enc.c b/libavcodec/libopenh264enc.c
index d27fc41..07af31d 100644
--- a/libavcodec/libopenh264enc.c
+++ b/libavcodec/libopenh264enc.c
@@ -33,6 +33,10 @@
 #include "internal.h"
 #include "libopenh264.h"
 
+#if !OPENH264_VER_AT_LEAST(1, 6)
+#define SM_SIZELIMITED_SLICE SM_DYN_SLICE
+#endif
+
 typedef struct SVCContext {
 const AVClass *av_class;
 ISVCEncoder *encoder;
@@ -48,11 +52,20 @@ typedef struct SVCContext {
 #define OFFSET(x) offsetof(SVCContext, x)
 #define VE AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM
 static const AVOption options[] = {
+#if OPENH264_VER_AT_LEAST(1, 6)
+    { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_FIXEDSLCNUM_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" },
+#else
     { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_AUTO_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" },
+#endif
     { "fixed", "a fixed number of slices", 0, AV_OPT_TYPE_CONST, { .i64 = SM_FIXEDSLCNUM_SLICE }, 0, 0, VE, "slice_mode" },
+#if OPENH264_VER_AT_LEAST(1, 6)
+    { "dyn", "Size limited (compatibility name)", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" },
+    { "sizelimited", "Size limited", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" },
+#else
     { "rowmb", "one slice per row of macroblocks", 0, AV_OPT_TYPE_CONST, { .i64 = SM_ROWMB_SLICE }, 0, 0, VE, "slice_mode" },
     { "auto", "automatic number of slices according to number of threads", 0, AV_OPT_TYPE_CONST, { .i64 = SM_AUTO_SLICE }, 0, 0, VE, "slice_mode" },
     { "dyn", "Dynamic slicing", 0, AV_OPT_TYPE_CONST, { .i64 = SM_DYN_SLICE }, 0, 0, VE, "slice_mode" },
+#endif
     { "loopfilter", "enable loop filter", OFFSET(loopfilter), AV_OPT_TYPE_INT, { .i64 = 1 }, 0, 1, VE },
     { "profile", "set profile restrictions", OFFSET(profile), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, VE },
     { "max_nal_size", "set maximum NAL size in bytes", OFFSET(max_nal_size), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE },
@@ -159,15 +172,24 @@ FF_ENABLE_DEPRECATION_WARNINGS
         s->slice_mode = SM_FIXEDSLCNUM_SLICE;
 
     if (s->max_nal_size)
-        s->slice_mode = SM_DYN_SLICE;
+        s->slice_mode = SM_SIZELIMITED_SLICE;
 
+#if OPENH264_VER_AT_LEAST(1, 6)
+    param.sSpatialLayers[0].sSliceArgument.uiSliceMode = s->slice_mode;
+    param.sSpatialLayers[0].sSliceArgument.uiSliceNum  = avctx->slices;
+#else
     param.sSpatialLayers[0].sSliceCfg.uiSliceMode                = s->slice_mode;
     param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceNum  = avctx->slices;
+#endif
 
-    if (s->slice_mode == SM_DYN_SLICE) {
+    if (s->slice_mode == SM_SIZELIMITED_SLICE) {
         if (s->max_nal_size){
             param.uiMaxNalSize = s->max_nal_size;
+#if OPENH264_VER_AT_LEAST(1, 6)
+            param.sSpatialLayers[0].sSliceArgument.uiSliceSizeConstraint = s->max_nal_size;
+#else
             param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceSizeConstraint = s->max_nal_size;
+#endif
         } else {
             av_log(avctx, AV_LOG_ERROR, "Invalid -max_nal_size, "
                    "specify a valid max_nal_size to use -slice_mode dyn\n");
-- 
2.7.4 (Apple Git-66)
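(For readers following the diff: the OPENH264_VER_AT_LEAST() guard comes
from the shared libopenh264.h header added together with the decoder
wrapper. A hedged sketch of how such a guard is typically defined, based
on the OPENH264_MAJOR/OPENH264_MINOR defines that OpenH264's codec_ver.h
exports; the actual header may differ in detail:

#include <wels/codec_ver.h>

/* True if the installed OpenH264 headers are at least version maj.min. */
#define OPENH264_VER_AT_LEAST(maj, min) \
    ((OPENH264_MAJOR  > (maj)) ||       \
     (OPENH264_MAJOR == (maj) && OPENH264_MINOR >= (min)))

With this in place, a single source file compiles against either the
pre-1.6 sSliceCfg layout or the 1.6 sSliceArgument layout, as in the
hunks above.)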



[FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper

2016-07-26 Thread Martin Storsjö
This is cherry-picked from libav, from commits
82b7525173f20702a8cbc26ebedbf4b69b8fecec and
d0b1e6049b06ca146ece4d2f199c5dba1565.

---
Fixed the issues pointed out by Michael, removed the parts of the
commit message as requested by Carl.
---
 Changelog   |   1 +
 configure   |   2 +
 doc/general.texi|   9 +-
 libavcodec/Makefile |   3 +-
 libavcodec/allcodecs.c  |   2 +-
 libavcodec/libopenh264.c|  62 +++
 libavcodec/libopenh264.h|  39 +++
 libavcodec/libopenh264dec.c | 243 
 libavcodec/libopenh264enc.c |  48 ++---
 libavcodec/version.h|   2 +-
 10 files changed, 366 insertions(+), 45 deletions(-)
 create mode 100644 libavcodec/libopenh264.c
 create mode 100644 libavcodec/libopenh264.h
 create mode 100644 libavcodec/libopenh264dec.c

diff --git a/Changelog b/Changelog
index 479f164..7f536db 100644
--- a/Changelog
+++ b/Changelog
@@ -10,6 +10,7 @@ version <next>:
 - curves filter doesn't automatically insert points at x=0 and x=1 anymore
 - 16-bit support in curves filter
 - 16-bit support in selectivecolor filter
+- OpenH264 decoder wrapper
 
 
 version 3.1:
diff --git a/configure b/configure
index 1b41303..9f5b31f 100755
--- a/configure
+++ b/configure
@@ -2771,6 +2771,8 @@ libopencore_amrnb_decoder_deps="libopencore_amrnb"
 libopencore_amrnb_encoder_deps="libopencore_amrnb"
 libopencore_amrnb_encoder_select="audio_frame_queue"
 libopencore_amrwb_decoder_deps="libopencore_amrwb"
+libopenh264_decoder_deps="libopenh264"
+libopenh264_decoder_select="h264_mp4toannexb_bsf"
 libopenh264_encoder_deps="libopenh264"
 libopenjpeg_decoder_deps="libopenjpeg"
 libopenjpeg_encoder_deps="libopenjpeg"
diff --git a/doc/general.texi b/doc/general.texi
index 7823dc1..6b5975c 100644
--- a/doc/general.texi
+++ b/doc/general.texi
@@ -103,12 +103,19 @@ enable it.
 
 @section OpenH264
 
-FFmpeg can make use of the OpenH264 library for H.264 encoding.
+FFmpeg can make use of the OpenH264 library for H.264 encoding and decoding.
 
 Go to @url{http://www.openh264.org/} and follow the instructions for
 installing the library. Then pass @code{--enable-libopenh264} to configure to
 enable it.
 
+For decoding, this library is much more limited than the built-in decoder
+in libavcodec; currently, this library lacks support for decoding B-frames
+and some other main/high profile features. (It currently only supports
+constrained baseline profile and CABAC.) Using it is mostly useful for
+testing and for taking advantage of Cisco's patent portfolio license
+(@url{http://www.openh264.org/BINARY_LICENSE.txt}).
+
 @section x264
 
 FFmpeg can make use of the x264 library for H.264 encoding.
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index a548e02..3def3ad 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -868,7 +868,8 @@ OBJS-$(CONFIG_LIBMP3LAME_ENCODER) += libmp3lame.o mpegaudiodata.o mpegau
 OBJS-$(CONFIG_LIBOPENCORE_AMRNB_DECODER)  += libopencore-amr.o
 OBJS-$(CONFIG_LIBOPENCORE_AMRNB_ENCODER)  += libopencore-amr.o
 OBJS-$(CONFIG_LIBOPENCORE_AMRWB_DECODER)  += libopencore-amr.o
-OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o
+OBJS-$(CONFIG_LIBOPENH264_DECODER)+= libopenh264dec.o libopenh264.o
+OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o libopenh264.o
 OBJS-$(CONFIG_LIBOPENJPEG_DECODER)+= libopenjpegdec.o
 OBJS-$(CONFIG_LIBOPENJPEG_ENCODER)+= libopenjpegenc.o
 OBJS-$(CONFIG_LIBOPUS_DECODER)+= libopusdec.o libopus.o \
diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c
index 951e199..a1ae61f 100644
--- a/libavcodec/allcodecs.c
+++ b/libavcodec/allcodecs.c
@@ -623,7 +623,7 @@ void avcodec_register_all(void)
 
 /* external libraries, that shouldn't be used by default if one of the
  * above is available */
-REGISTER_ENCODER(LIBOPENH264,   libopenh264);
+REGISTER_ENCDEC (LIBOPENH264,   libopenh264);
 REGISTER_DECODER(H264_CUVID,h264_cuvid);
 REGISTER_ENCODER(H264_NVENC,h264_nvenc);
 REGISTER_ENCODER(H264_OMX,  h264_omx);
diff --git a/libavcodec/libopenh264.c b/libavcodec/libopenh264.c
new file mode 100644
index 000..59c61a3
--- /dev/null
+++ b/libavcodec/libopenh264.c
@@ -0,0 +1,62 @@
+/*
+ * OpenH264 shared utils
+ * Copyright (C) 2014 Martin Storsjo
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
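(As a usage note: once built with --enable-libopenh264, the wrapper can
be selected explicitly by name. A minimal sketch against the FFmpeg 3.1+
send/receive API; registration and most error handling elided:

#include <libavcodec/avcodec.h>

/* Decode one packet with the libopenh264 wrapper instead of the
 * built-in decoder. Assumes avcodec_register_all() has been called
 * and that the packet carries a complete H.264 frame. */
static int decode_with_openh264(AVPacket *pkt, AVFrame *frame)
{
    const AVCodec *dec = avcodec_find_decoder_by_name("libopenh264");
    AVCodecContext *ctx;
    int ret;

    if (!dec)
        return AVERROR_DECODER_NOT_FOUND;
    ctx = avcodec_alloc_context3(dec);
    ret = avcodec_open2(ctx, dec, NULL);
    if (ret >= 0) {
        ret = avcodec_send_packet(ctx, pkt);         /* feed one packet */
        if (ret >= 0)
            ret = avcodec_receive_frame(ctx, frame); /* may need more input */
    }
    avcodec_free_context(&ctx);
    return ret;
})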

Re: [FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper

2016-07-27 Thread Martin Storsjö

On Tue, 26 Jul 2016, Michael Niedermayer wrote:


On Tue, Jul 26, 2016 at 09:31:17PM +0300, Martin Storsjö wrote:

This is cherry-picked from libav, from commits
82b7525173f20702a8cbc26ebedbf4b69b8fecec and
d0b1e6049b06ca146ece4d2f199c5dba1565.

---
Fixed the issues pointed out by Michael, removed the parts of the
commit message as requested by Carl.
---
 Changelog   |   1 +
 configure   |   2 +
 doc/general.texi|   9 +-
 libavcodec/Makefile |   3 +-
 libavcodec/allcodecs.c  |   2 +-
 libavcodec/libopenh264.c|  62 +++
 libavcodec/libopenh264.h|  39 +++
 libavcodec/libopenh264dec.c | 243 
 libavcodec/libopenh264enc.c |  48 ++---
 libavcodec/version.h|   2 +-
 10 files changed, 366 insertions(+), 45 deletions(-)
 create mode 100644 libavcodec/libopenh264.c
 create mode 100644 libavcodec/libopenh264.h
 create mode 100644 libavcodec/libopenh264dec.c


LGTM, please push, unless someone else has more comments

thanks


Pushed both.

// Martin


[FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper

2016-07-26 Thread Martin Storsjö
While it is less featureful (and slower) than the built-in H.264
decoder, one could potentially want to use it to take advantage
of the Cisco patent license offer.

This is cherry-picked from libav, from commits
82b7525173f20702a8cbc26ebedbf4b69b8fecec and
d0b1e6049b06ca146ece4d2f199c5dba1565.
---
 Changelog   |   1 +
 configure   |   2 +
 doc/general.texi|   9 +-
 libavcodec/Makefile |   3 +-
 libavcodec/allcodecs.c  |   2 +-
 libavcodec/libopenh264.c|  62 +++
 libavcodec/libopenh264.h|  39 +++
 libavcodec/libopenh264dec.c | 245 
 libavcodec/libopenh264enc.c |  48 ++---
 libavcodec/version.h|   2 +-
 10 files changed, 368 insertions(+), 45 deletions(-)
 create mode 100644 libavcodec/libopenh264.c
 create mode 100644 libavcodec/libopenh264.h
 create mode 100644 libavcodec/libopenh264dec.c

diff --git a/Changelog b/Changelog
index 479f164..7f536db 100644
--- a/Changelog
+++ b/Changelog
@@ -10,6 +10,7 @@ version <next>:
 - curves filter doesn't automatically insert points at x=0 and x=1 anymore
 - 16-bit support in curves filter
 - 16-bit support in selectivecolor filter
+- OpenH264 decoder wrapper
 
 
 version 3.1:
diff --git a/configure b/configure
index 1b41303..9f5b31f 100755
--- a/configure
+++ b/configure
@@ -2771,6 +2771,8 @@ libopencore_amrnb_decoder_deps="libopencore_amrnb"
 libopencore_amrnb_encoder_deps="libopencore_amrnb"
 libopencore_amrnb_encoder_select="audio_frame_queue"
 libopencore_amrwb_decoder_deps="libopencore_amrwb"
+libopenh264_decoder_deps="libopenh264"
+libopenh264_decoder_select="h264_mp4toannexb_bsf"
 libopenh264_encoder_deps="libopenh264"
 libopenjpeg_decoder_deps="libopenjpeg"
 libopenjpeg_encoder_deps="libopenjpeg"
diff --git a/doc/general.texi b/doc/general.texi
index 7823dc1..6b5975c 100644
--- a/doc/general.texi
+++ b/doc/general.texi
@@ -103,12 +103,19 @@ enable it.
 
 @section OpenH264
 
-FFmpeg can make use of the OpenH264 library for H.264 encoding.
+FFmpeg can make use of the OpenH264 library for H.264 encoding and decoding.
 
 Go to @url{http://www.openh264.org/} and follow the instructions for
 installing the library. Then pass @code{--enable-libopenh264} to configure to
 enable it.
 
+For decoding, this library is much more limited than the built-in decoder
+in libavcodec; currently, this library lacks support for decoding B-frames
+and some other main/high profile features. (It currently only supports
+constrained baseline profile and CABAC.) Using it is mostly useful for
+testing and for taking advantage of Cisco's patent portfolio license
+(@url{http://www.openh264.org/BINARY_LICENSE.txt}).
+
 @section x264
 
 FFmpeg can make use of the x264 library for H.264 encoding.
diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index a548e02..3def3ad 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -868,7 +868,8 @@ OBJS-$(CONFIG_LIBMP3LAME_ENCODER) += libmp3lame.o mpegaudiodata.o mpegau
 OBJS-$(CONFIG_LIBOPENCORE_AMRNB_DECODER)  += libopencore-amr.o
 OBJS-$(CONFIG_LIBOPENCORE_AMRNB_ENCODER)  += libopencore-amr.o
 OBJS-$(CONFIG_LIBOPENCORE_AMRWB_DECODER)  += libopencore-amr.o
-OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o
+OBJS-$(CONFIG_LIBOPENH264_DECODER)+= libopenh264dec.o libopenh264.o
+OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o libopenh264.o
 OBJS-$(CONFIG_LIBOPENJPEG_DECODER)+= libopenjpegdec.o
 OBJS-$(CONFIG_LIBOPENJPEG_ENCODER)+= libopenjpegenc.o
 OBJS-$(CONFIG_LIBOPUS_DECODER)+= libopusdec.o libopus.o \
diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c
index 951e199..a1ae61f 100644
--- a/libavcodec/allcodecs.c
+++ b/libavcodec/allcodecs.c
@@ -623,7 +623,7 @@ void avcodec_register_all(void)
 
 /* external libraries, that shouldn't be used by default if one of the
  * above is available */
-REGISTER_ENCODER(LIBOPENH264,   libopenh264);
+REGISTER_ENCDEC (LIBOPENH264,   libopenh264);
 REGISTER_DECODER(H264_CUVID,h264_cuvid);
 REGISTER_ENCODER(H264_NVENC,h264_nvenc);
 REGISTER_ENCODER(H264_OMX,  h264_omx);
diff --git a/libavcodec/libopenh264.c b/libavcodec/libopenh264.c
new file mode 100644
index 000..59c61a3
--- /dev/null
+++ b/libavcodec/libopenh264.c
@@ -0,0 +1,62 @@
+/*
+ * OpenH264 shared utils
+ * Copyright (C) 2014 Martin Storsjo
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.

[FFmpeg-devel] [PATCH 2/2] libopenh264: Support building with the 1.6 release

2016-07-26 Thread Martin Storsjö
This is cherry-picked from libav commit
d825b1a5306576dcd0553b7d0d24a3a46ad92864.
---
 libavcodec/libopenh264dec.c |  2 ++
 libavcodec/libopenh264enc.c | 26 --
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/libavcodec/libopenh264dec.c b/libavcodec/libopenh264dec.c
index 8388e4e..80dff4c 100644
--- a/libavcodec/libopenh264dec.c
+++ b/libavcodec/libopenh264dec.c
@@ -90,7 +90,9 @@ static av_cold int svc_decode_init(AVCodecContext *avctx)
     (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK, (void *)&callback_function);
     (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK_CONTEXT, (void *)&avctx);
 
+#if !OPENH264_VER_AT_LEAST(1, 6)
 param.eOutputColorFormat = videoFormatI420;
+#endif
 param.eEcActiveIdc   = ERROR_CON_DISABLE;
 param.sVideoProperty.eVideoBsType = VIDEO_BITSTREAM_DEFAULT;
 
diff --git a/libavcodec/libopenh264enc.c b/libavcodec/libopenh264enc.c
index d27fc41..07af31d 100644
--- a/libavcodec/libopenh264enc.c
+++ b/libavcodec/libopenh264enc.c
@@ -33,6 +33,10 @@
 #include "internal.h"
 #include "libopenh264.h"
 
+#if !OPENH264_VER_AT_LEAST(1, 6)
+#define SM_SIZELIMITED_SLICE SM_DYN_SLICE
+#endif
+
 typedef struct SVCContext {
 const AVClass *av_class;
 ISVCEncoder *encoder;
@@ -48,11 +52,20 @@ typedef struct SVCContext {
 #define OFFSET(x) offsetof(SVCContext, x)
 #define VE AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM
 static const AVOption options[] = {
+#if OPENH264_VER_AT_LEAST(1, 6)
+    { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_FIXEDSLCNUM_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" },
+#else
     { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_AUTO_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" },
+#endif
     { "fixed", "a fixed number of slices", 0, AV_OPT_TYPE_CONST, { .i64 = SM_FIXEDSLCNUM_SLICE }, 0, 0, VE, "slice_mode" },
+#if OPENH264_VER_AT_LEAST(1, 6)
+    { "dyn", "Size limited (compatibility name)", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" },
+    { "sizelimited", "Size limited", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" },
+#else
     { "rowmb", "one slice per row of macroblocks", 0, AV_OPT_TYPE_CONST, { .i64 = SM_ROWMB_SLICE }, 0, 0, VE, "slice_mode" },
     { "auto", "automatic number of slices according to number of threads", 0, AV_OPT_TYPE_CONST, { .i64 = SM_AUTO_SLICE }, 0, 0, VE, "slice_mode" },
     { "dyn", "Dynamic slicing", 0, AV_OPT_TYPE_CONST, { .i64 = SM_DYN_SLICE }, 0, 0, VE, "slice_mode" },
+#endif
     { "loopfilter", "enable loop filter", OFFSET(loopfilter), AV_OPT_TYPE_INT, { .i64 = 1 }, 0, 1, VE },
     { "profile", "set profile restrictions", OFFSET(profile), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, VE },
     { "max_nal_size", "set maximum NAL size in bytes", OFFSET(max_nal_size), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE },
@@ -159,15 +172,24 @@ FF_ENABLE_DEPRECATION_WARNINGS
         s->slice_mode = SM_FIXEDSLCNUM_SLICE;
 
     if (s->max_nal_size)
-        s->slice_mode = SM_DYN_SLICE;
+        s->slice_mode = SM_SIZELIMITED_SLICE;
 
+#if OPENH264_VER_AT_LEAST(1, 6)
+    param.sSpatialLayers[0].sSliceArgument.uiSliceMode = s->slice_mode;
+    param.sSpatialLayers[0].sSliceArgument.uiSliceNum  = avctx->slices;
+#else
     param.sSpatialLayers[0].sSliceCfg.uiSliceMode                = s->slice_mode;
     param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceNum  = avctx->slices;
+#endif
 
-    if (s->slice_mode == SM_DYN_SLICE) {
+    if (s->slice_mode == SM_SIZELIMITED_SLICE) {
         if (s->max_nal_size){
             param.uiMaxNalSize = s->max_nal_size;
+#if OPENH264_VER_AT_LEAST(1, 6)
+            param.sSpatialLayers[0].sSliceArgument.uiSliceSizeConstraint = s->max_nal_size;
+#else
             param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceSizeConstraint = s->max_nal_size;
+#endif
         } else {
             av_log(avctx, AV_LOG_ERROR, "Invalid -max_nal_size, "
                    "specify a valid max_nal_size to use -slice_mode dyn\n");
-- 
2.7.4 (Apple Git-66)



Re: [FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks

2017-01-24 Thread Martin Storsjö

On Thu, 19 Jan 2017, Michael Niedermayer wrote:


On Wed, Jan 18, 2017 at 11:45:08PM +0200, Martin Storsjö wrote:

This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.
---
 libavcodec/arm/vp9dsp_init_arm.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)


fate passes with this patchset under qemu arm


Pushed, thanks!

// Martin


[FFmpeg-devel] [PATCH 5/8] aarch64: vp9dsp: Restructure the bpp checks

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.
---
 libavcodec/aarch64/vp9dsp_init_aarch64.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
index 0bc200e..7b50540 100644
--- a/libavcodec/aarch64/vp9dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -96,13 +96,10 @@ define_8tap_2d_funcs(16)
 define_8tap_2d_funcs(8)
 define_8tap_2d_funcs(4)
 
-static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 #define init_fpel(idx1, idx2, sz, type, suffix)  \
 dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \
 dsp->mc[idx1][FILTER_8TAP_REGULAR][idx2][0][0] = \
@@ -173,13 +170,10 @@ define_itxfm(idct, idct, 32);
 define_itxfm(iwht, iwht, 4);
 
 
-static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 if (have_neon(cpu_flags)) {
 #define init_itxfm(tx, sz) \
 dsp->itxfm_add[tx][DCT_DCT]   = ff_vp9_idct_idct_##sz##_add_neon;  \
@@ -219,13 +213,10 @@ define_loop_filters(48, 16);
 define_loop_filters(84, 16);
 define_loop_filters(88, 16);
 
-static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 if (have_neon(cpu_flags)) {
 dsp->loop_filter_8[0][1] = ff_vp9_loop_filter_v_4_8_neon;
 dsp->loop_filter_8[0][0] = ff_vp9_loop_filter_h_4_8_neon;
@@ -250,7 +241,10 @@ static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp)
 
 av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp)
 {
-vp9dsp_mc_init_aarch64(dsp, bpp);
-vp9dsp_loopfilter_init_aarch64(dsp, bpp);
-vp9dsp_itxfm_init_aarch64(dsp, bpp);
+if (bpp != 8)
+return;
+
+vp9dsp_mc_init_aarch64(dsp);
+vp9dsp_loopfilter_init_aarch64(dsp);
+vp9dsp_itxfm_init_aarch64(dsp);
 }
-- 
2.7.4



[FFmpeg-devel] [PATCH 8/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers
on aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
 ARM AArch64
vp9_loop_filter_h_4_8_10bpp_neon:  213.2   172.6
vp9_loop_filter_h_8_8_10bpp_neon:  281.2   244.2
vp9_loop_filter_h_16_8_10bpp_neon: 657.0   444.5
vp9_loop_filter_h_16_16_10bpp_neon:   1280.4   877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
vp9_loop_filter_v_4_8_10bpp_neon:  150.0   115.2
vp9_loop_filter_v_8_8_10bpp_neon:  209.0   175.5
vp9_loop_filter_v_16_8_10bpp_neon: 492.7   345.2
vp9_loop_filter_v_16_16_10bpp_neon:    951.0   682.7

This is significantly faster than the ARM version in almost
all cases except for the mix2 functions.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.
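
(The START_TIMER/STOP_TIMER measurement referenced here is the ad-hoc
pattern from libavutil/timer.h; the macros print cycle counts for the
bracketed region. A sketch of how one of these functions might be
wrapped; the chosen function and arguments are just an example:

#include <stddef.h>
#include <stdint.h>
#include "libavutil/timer.h"

void ff_vp9_loop_filter_v_16_16_10_neon(uint8_t *dst, ptrdiff_t stride,
                                        int E, int I, int H);

static void bench_loop_filter(uint8_t *dst, ptrdiff_t stride,
                              int E, int I, int H)
{
    START_TIMER
    ff_vp9_loop_filter_v_16_16_10_neon(dst, stride, E, I, H);
    STOP_TIMER("vp9_loop_filter_v_16_16_10_neon")
})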
---
 libavcodec/aarch64/Makefile|   1 +
 .../aarch64/vp9dsp_init_16bpp_aarch64_template.c   |  62 ++
 libavcodec/aarch64/vp9lpf_16bpp_neon.S | 873 +
 3 files changed, 936 insertions(+)
 create mode 100644 libavcodec/aarch64/vp9lpf_16bpp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 715cc6f..37666b4 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -44,6 +44,7 @@ NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)     += aarch64/vp9itxfm_16bpp_neon.o \
                                        aarch64/vp9itxfm_neon.o      \
+                                       aarch64/vp9lpf_16bpp_neon.o  \
                                        aarch64/vp9lpf_neon.o        \
                                        aarch64/vp9mc_16bpp_neon.o   \
                                        aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
index 0e86b02..d5649f7 100644
--- a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
+++ b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
@@ -203,8 +203,70 @@ static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp)
 }
 }
 
+#define define_loop_filter(dir, wd, size, bpp) \
+void ff_vp9_loop_filter_##dir##_##wd##_##size##_##bpp##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
+
+#define define_loop_filters(wd, size, bpp) \
+define_loop_filter(h, wd, size, bpp);  \
+define_loop_filter(v, wd, size, bpp)
+
+define_loop_filters(4,  8,  BPP);
+define_loop_filters(8,  8,  BPP);
+define_loop_filters(16, 8,  BPP);
+
+define_loop_filters(16, 16, BPP);
+
+define_loop_filters(44, 16, BPP);
+define_loop_filters(48, 16, BPP);
+define_loop_filters(84, 16, BPP);
+define_loop_filters(88, 16, BPP);
+
+static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (have_neon(cpu_flags)) {
+#define init_lpf_func_8(idx1, idx2, dir, wd, bpp) \
+    dsp->loop_filter_8[idx1][idx2] = ff_vp9_loop_filter_##dir##_##wd##_8_##bpp##_neon
+
+#define init_lpf_func_16(idx, dir, bpp) \
+    dsp->loop_filter_16[idx] = ff_vp9_loop_filter_##dir##_16_16_##bpp##_neon
+
+#define init_lpf_func_mix2(idx1, idx2, idx3, dir, wd, bpp) \
+    dsp->loop_filter_mix2[idx1][idx2][idx3] = ff_vp9_loop_filter_##dir##_##wd##_16_##bpp##_neon
+
+#define init_lpf_funcs_8_wd(idx, wd, bpp) \
+init_lpf_func_8(idx, 0, h, wd, bpp);  \
+init_lpf_func_8(idx, 1, v, wd, bpp)
+
+#define init_lpf_funcs_16(bpp)   \
+init_lpf_func_16(0, h, bpp); \
+init_lpf_func_16(1, v, bpp)
+
+#define init_lpf_funcs_mix2_wd(idx1, idx2, wd, bpp) \
+init_lpf_func_mix2(idx1, idx2, 0, h, wd, bpp);  \
+init_lpf_func_mix2(idx1, idx2, 1, v, wd, bpp)
+
+#define init_lpf_funcs_8(bpp)\
+init_lpf_funcs_8_wd(0, 4,  bpp); \
+init_lpf_funcs_8_wd(1, 8,  bpp); \
+init_lpf_funcs_8_wd(2, 16, bpp)
+
+#define init_lpf_funcs_mix2(bpp)   \
+init_lpf_funcs_mix2_wd(0, 0, 44, bpp); \
+init_lpf_funcs_mix2_wd(0, 1, 48, bpp); \
+init_lpf_funcs_mix2_wd(1, 0, 84, bpp); \
+init_lpf_funcs_mix2_wd(1, 1, 88, bpp)
+
+init_lpf_funcs_8(BPP);
+init_lpf_funcs_16(BPP);
+init_lpf_funcs_mix2(BPP);
+}
+}
+
 

[FFmpeg-devel] [PATCH 2/8] arm: Add NEON optimizations for 10 and 12 bit vp9 MC

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.

Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.

For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.

For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.

Examples of relative speedup compared to the C version, from checkasm:
                                    Cortex   A7     A8     A9    A53
vp9_avg4_10bpp_neon:                        2.25   2.44   3.05   2.16
vp9_avg8_10bpp_neon:                        3.66   8.48   3.86   3.50
vp9_avg16_10bpp_neon:                       3.39   8.26   3.37   2.72
vp9_avg32_10bpp_neon:                       4.03  10.20   4.07   3.42
vp9_avg64_10bpp_neon:                       4.15  10.01   4.13   3.70
vp9_avg_8tap_smooth_4h_10bpp_neon:          3.38   6.22   3.41   4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon:         3.89   6.39   4.30   5.32
vp9_avg_8tap_smooth_4v_10bpp_neon:          5.32   9.73   6.34   7.31
vp9_avg_8tap_smooth_8h_10bpp_neon:          4.45   9.40   4.68   6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon:         4.64   8.91   5.44   6.47
vp9_avg_8tap_smooth_8v_10bpp_neon:          6.44  13.42   8.68   8.79
vp9_avg_8tap_smooth_64h_10bpp_neon:         4.66   9.02   4.84   7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon:        4.61   9.14   4.92   7.10
vp9_avg_8tap_smooth_64v_10bpp_neon:         6.90  14.13   9.57  10.41
vp9_put4_10bpp_neon:                        1.33   1.46   2.09   1.33
vp9_put8_10bpp_neon:                        1.57   3.42   1.83   1.84
vp9_put16_10bpp_neon:                       1.55   4.78   2.17   1.89
vp9_put32_10bpp_neon:                       2.06   5.35   2.14   2.30
vp9_put64_10bpp_neon:                       3.00   2.41   1.95   1.66
vp9_put_8tap_smooth_4h_10bpp_neon:          3.19   5.81   3.31   4.63
vp9_put_8tap_smooth_4hv_10bpp_neon:         3.86   6.22   4.32   5.21
vp9_put_8tap_smooth_4v_10bpp_neon:          5.40   9.77   6.08   7.21
vp9_put_8tap_smooth_8h_10bpp_neon:          4.22   8.41   4.46   6.63
vp9_put_8tap_smooth_8hv_10bpp_neon:         4.56   8.51   5.39   6.25
vp9_put_8tap_smooth_8v_10bpp_neon:          6.60  12.43   8.17   8.89
vp9_put_8tap_smooth_64h_10bpp_neon:         4.41   8.59   4.54   7.49
vp9_put_8tap_smooth_64hv_10bpp_neon:        4.43   8.58   5.34   6.63
vp9_put_8tap_smooth_64v_10bpp_neon:         7.26  13.92   9.27  10.92

For the larger 8tap filters, the speedup vs C code is around 4-14x.
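
(To spell out the accumulator point with a hedged scalar reference --
not the actual FFmpeg implementation -- here is one output pixel of an
8-tap filter at high bitdepth; the rounding constant follows the usual
7-bit vp9 filter layout:

#include <stdint.h>
#include "libavutil/common.h"   /* av_clip_uintp2() */

/* With 16-bit pixels, the per-tap products no longer fit the 8 bpp trick
 * of accumulating in 16 bits and saturating once at the end, so the sum
 * must be kept in 32 bits throughout. */
static uint16_t filter8_hbd(const uint16_t *src, const int16_t *filter, int bpp)
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += src[i] * filter[i];
    return av_clip_uintp2((sum + 64) >> 7, bpp);
})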
---
 libavcodec/arm/Makefile |   5 +-
 libavcodec/arm/vp9dsp_init.h|  29 ++
 libavcodec/arm/vp9dsp_init_10bpp_arm.c  |  23 +
 libavcodec/arm/vp9dsp_init_12bpp_arm.c  |  23 +
 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c | 147 ++
 libavcodec/arm/vp9dsp_init_arm.c|   9 +-
 libavcodec/arm/vp9mc_16bpp_neon.S   | 615 
 7 files changed, 849 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/arm/vp9dsp_init.h
 create mode 100644 libavcodec/arm/vp9dsp_init_10bpp_arm.c
 create mode 100644 libavcodec/arm/vp9dsp_init_12bpp_arm.c
 create mode 100644 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
 create mode 100644 libavcodec/arm/vp9mc_16bpp_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 7f18daa..fb35d25 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -44,7 +44,9 @@ OBJS-$(CONFIG_MLP_DECODER) += arm/mlpdsp_init_arm.o
 OBJS-$(CONFIG_RV40_DECODER)+= arm/rv40dsp_init_arm.o
 OBJS-$(CONFIG_VORBIS_DECODER)  += arm/vorbisdsp_init_arm.o
 OBJS-$(CONFIG_VP6_DECODER) += arm/vp6dsp_init_arm.o
-OBJS-$(CONFIG_VP9_DECODER) += arm/vp9dsp_init_arm.o
+OBJS-$(CONFIG_VP9_DECODER) += arm/vp9dsp_init_10bpp_arm.o   \
+  arm/vp9dsp_init_12bpp_arm.o   \
+  arm/vp9dsp_init_arm.o
 
 
 # ARMv5 optimizations
@@ -142,4 +144,5 @@ NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o   \
   arm/vp9lpf_neon.o \
+  arm/vp9mc_16bpp_neon.o\
   arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init.h b/libavcodec/arm/vp9dsp_init.h
new file mode 100644
index 000..0dc1c2d
--- /dev/null
+++ b/libavcodec/arm/vp9dsp_init.h
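
(The two new per-bitdepth init files presumably follow the usual
template-instantiation pattern, along these lines -- the INIT_FUNC
define name here is an assumption for illustration:

/* vp9dsp_init_10bpp_arm.c, sketched: compile the shared template once
 * per bitdepth by setting BPP before including it; the 12 bpp file
 * would be identical with BPP 12. */
#define BPP 10
#define INIT_FUNC ff_vp9dsp_init_10bpp_arm
#include "vp9dsp_init_16bpp_arm_template.c")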

[FFmpeg-devel] [PATCH 3/8] arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This is structured similarly to the 8 bit version. In the 8 bit
version, the coefficients are 16 bits, and intermediates are 32 bits.

Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
content, the intermediates also fit in 32 bits, but for all other
transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
and 12 bit) the intermediates are 64 bit.

For the existing 8 bit case, the 8x8 transform fit all coefficients in
registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
transform also has to be done in slices of 4 pixels (just as 16x16 and
32x32 for 8 bit).

The slice width also shrinks from 4 elements to 2 elements in parallel
for the 16x16 and 32x32 cases.

The 16 bit coefficients from idct_coeffs and similar tables also need
to be lengthened to 32 bit in order to be used in multiplication with
vectors with 32 bit elements. This leads to the fixed coefficient
vectors needing more space, leading to more cases where they have to
be reloaded within the transform (in iadst16).

This technically would need testing in checkasm for subpartitions
in increments of 2, but that slows down normal checkasm runs
excessively.

Examples of relative speedup compared to the C version, from checkasm:
                                         Cortex   A7     A8     A9    A53
vp9_inv_adst_adst_4x4_sub4_add_10_neon:  4.83  11.36   5.22   6.77
vp9_inv_adst_adst_8x8_sub8_add_10_neon:  4.12   7.60   4.06   4.84
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
vp9_inv_dct_dct_4x4_sub1_add_10_neon:1.36   2.57   1.41   1.61
vp9_inv_dct_dct_4x4_sub4_add_10_neon:4.24   8.66   5.06   5.81
vp9_inv_dct_dct_8x8_sub1_add_10_neon:2.63   4.18   1.68   2.87
vp9_inv_dct_dct_8x8_sub4_add_10_neon:4.52   9.47   4.24   5.39
vp9_inv_dct_dct_8x8_sub8_add_10_neon:3.45   7.34   3.45   4.30
vp9_inv_dct_dct_16x16_sub1_add_10_neon:  3.56   6.21   2.47   4.32
vp9_inv_dct_dct_16x16_sub2_add_10_neon:  5.68  12.73   5.28   7.07
vp9_inv_dct_dct_16x16_sub8_add_10_neon:  4.42   9.28   4.24   5.45
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3.41   7.29   3.35   4.19
vp9_inv_dct_dct_32x32_sub1_add_10_neon:  4.52   8.35   3.83   6.40
vp9_inv_dct_dct_32x32_sub2_add_10_neon:  5.86  13.19   6.14   7.04
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 4.29   8.11   4.59   5.06
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 3.31   5.70   3.56   3.84
vp9_inv_wht_wht_4x4_sub4_add_10_neon:1.89   2.80   1.82   1.97

The speedup compared to the C functions is around 1.3 to 7x for the
full transforms, even higher for the smaller subpartitions.
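
(A hedged scalar sketch of the core multiply this describes --
constants assumed from the usual vp9 fixed-point layout with 14
fractional bits, not copied from the asm:

#include <stdint.h>

/* With 32-bit data and 32-bit coefficients, the product needs a 64-bit
 * intermediate before the rounding shift; this is what forces the wider
 * intermediates for everything above 4x4 at 10 bit. */
static int32_t mul_round_shift(int32_t a, int32_t coeff)
{
    return (int32_t)(((int64_t)a * coeff + (1 << 13)) >> 14);
})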
---
 libavcodec/arm/Makefile |3 +-
 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c |   47 +
 libavcodec/arm/vp9itxfm_16bpp_neon.S| 1515 +++
 3 files changed, 1564 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/vp9itxfm_16bpp_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index fb35d25..856c154 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -142,7 +142,8 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o
-NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o   \
+NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_16bpp_neon.o \
+  arm/vp9itxfm_neon.o   \
   arm/vp9lpf_neon.o \
   arm/vp9mc_16bpp_neon.o\
   arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
index 05efd29..95f2bbc 100644
--- a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
+++ b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
@@ -141,7 +141,54 @@ static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp)
 }
 }
 
+#define define_itxfm2(type_a, type_b, sz, bpp)                                  \
+void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_##bpp##_neon(uint8_t *_dst, \
+                                                                 ptrdiff_t stride, \
+                                                                 int16_t *_block, int eob)
+#define define_itxfm(type_a, type_b, sz, bpp) define_itxfm2(type_a, type_b, sz, bpp)
+
+#define define_itxfm_funcs(sz, bpp)  \
+define_itxfm(idct,  idct,  sz, bpp); \
+define_itxfm(iadst, idct,  sz, bpp); \
+define_itxfm(idct,  iadst, sz, bpp); \
+define_itxfm(iadst, iadst, sz, bpp)
+
+define_itxfm_funcs(4,  BPP);
+define_itxfm_funcs(8,  BPP);
+define_itxfm_funcs(16, BPP);

[FFmpeg-devel] [PATCH 6/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This has mostly got the same differences to the 8 bit version as
in the arm version. For the horizontal filters, we do 16 pixels
in parallel as well. For the 8 pixel wide vertical filters, we can
accumulate 4 rows before storing, just as in the 8 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                          ARM       AArch64
vp9_avg4_10bpp_neon:                      35.7      30.7
vp9_avg8_10bpp_neon:                      93.5      84.7
vp9_avg16_10bpp_neon:                    324.4     296.6
vp9_avg32_10bpp_neon:                   1236.5    1148.2
vp9_avg64_10bpp_neon:                   4639.6    4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon:       130.0     128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon:      440.0     440.5
vp9_avg_8tap_smooth_4v_10bpp_neon:       114.0     105.5
vp9_avg_8tap_smooth_8h_10bpp_neon:       327.0     314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon:      918.7     865.4
vp9_avg_8tap_smooth_8v_10bpp_neon:       330.0     300.2
vp9_avg_8tap_smooth_16h_10bpp_neon:     1187.5    1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon:    2663.1    2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon:     1107.4    1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon:    17754.6   17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon:   33285.2   33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon:    16066.9   16048.6
vp9_put4_10bpp_neon:                      25.5      21.7
vp9_put8_10bpp_neon:                      56.0      52.0
vp9_put16_10bpp_neon/armv8:              183.0     163.1
vp9_put32_10bpp_neon/armv8:              678.6     563.1
vp9_put64_10bpp_neon/armv8:             2679.9    2195.8
vp9_put_8tap_smooth_4h_10bpp_neon:       120.0     118.0
vp9_put_8tap_smooth_4hv_10bpp_neon:      435.2     435.0
vp9_put_8tap_smooth_4v_10bpp_neon:       107.0      98.2
vp9_put_8tap_smooth_8h_10bpp_neon:       303.0     290.0
vp9_put_8tap_smooth_8hv_10bpp_neon:      893.7     828.7
vp9_put_8tap_smooth_8v_10bpp_neon:       305.5     263.5
vp9_put_8tap_smooth_16h_10bpp_neon:     1089.1    1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon:    2578.8    2452.4
vp9_put_8tap_smooth_16v_10bpp_neon:     1009.5     933.5
vp9_put_8tap_smooth_64h_10bpp_neon:    16223.4   15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon:   32153.0   31016.2
vp9_put_8tap_smooth_64v_10bpp_neon:    14516.5   13748.1

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is around 4-9x.
---
 libavcodec/aarch64/Makefile|   5 +-
 libavcodec/aarch64/vp9dsp_init.h   |  29 +
 libavcodec/aarch64/vp9dsp_init_10bpp_aarch64.c |  23 +
 libavcodec/aarch64/vp9dsp_init_12bpp_aarch64.c |  23 +
 .../aarch64/vp9dsp_init_16bpp_aarch64_template.c   | 163 ++
 libavcodec/aarch64/vp9dsp_init_aarch64.c   |   9 +-
 libavcodec/aarch64/vp9mc_16bpp_neon.S  | 631 +
 7 files changed, 881 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/vp9dsp_init.h
 create mode 100644 libavcodec/aarch64/vp9dsp_init_10bpp_aarch64.c
 create mode 100644 libavcodec/aarch64/vp9dsp_init_12bpp_aarch64.c
 create mode 100644 libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
 create mode 100644 libavcodec/aarch64/vp9mc_16bpp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 5593863..0766e90 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -15,7 +15,9 @@ OBJS-$(CONFIG_DCA_DECODER)  += aarch64/synth_filter_init.o
 OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o
 OBJS-$(CONFIG_VC1DSP)   += aarch64/vc1dsp_init_aarch64.o
 OBJS-$(CONFIG_VORBIS_DECODER)   += aarch64/vorbisdsp_init.o
-OBJS-$(CONFIG_VP9_DECODER)  += aarch64/vp9dsp_init_aarch64.o
+OBJS-$(CONFIG_VP9_DECODER)  += aarch64/vp9dsp_init_10bpp_aarch64.o \
+                               aarch64/vp9dsp_init_12bpp_aarch64.o \
+                               aarch64/vp9dsp_init_aarch64.o
 
 # ARMv8 optimizations
 
@@ -42,4 +44,5 @@ NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)     += aarch64/vp9itxfm_neon.o    \
                                        aarch64/vp9lpf_neon.o      \
+                                       aarch64/vp9mc_16bpp_neon.o \
                                        aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init.h b/libavcodec/aarch64/vp9dsp_init.h
new file mode 100644
index 000..9df1752
--- /dev/null
+++ b/libavcodec/aarch64/vp9dsp_init.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (c) 2017 Google Inc.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or

[FFmpeg-devel] [PATCH 7/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

Compared to the arm version, on aarch64 we can keep the full 8x8
transform in registers, and for 16x16 and 32x32, we can process
it in slices of 4 pixels instead of 2.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM  AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon:       111.0    109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon:       914.0    733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   5184.0   3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon:  65.0 65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon: 100.0 96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon:         111.0    119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon:         618.0    494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       295.1    284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:  2303.2   1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon:  2984.8   2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon: 3890.0   2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon:  1044.4   1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon: 1.7   9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon:18531.3  12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon:24470.7  16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon:  83.0 79.7

The larger transforms are significantly faster than the corresponding
ARM versions.

The speedup vs C code is smaller than in 32 bit mode, probably
because the 64 bit intermediates in the C code can be expressed
more efficiently in aarch64.
---
 libavcodec/aarch64/Makefile|3 +-
 .../aarch64/vp9dsp_init_16bpp_aarch64_template.c   |   47 +
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S   | 1517 
 3 files changed, 1566 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/vp9itxfm_16bpp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 0766e90..715cc6f 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -42,7 +42,8 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)    += aarch64/mpegaudiodsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
-NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o       \
+NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
+                                   aarch64/vp9itxfm_neon.o       \
                                    aarch64/vp9lpf_neon.o         \
                                    aarch64/vp9mc_16bpp_neon.o    \
                                    aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
index 4719ea3..0e86b02 100644
--- a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
+++ b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c
@@ -157,7 +157,54 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp)
 }
 }
 
+#define define_itxfm2(type_a, type_b, sz, bpp)                                  \
+void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_##bpp##_neon(uint8_t *_dst, \
+                                                                 ptrdiff_t stride, \
+                                                                 int16_t *_block, int eob)
+#define define_itxfm(type_a, type_b, sz, bpp) define_itxfm2(type_a, type_b, sz, bpp)
+
+#define define_itxfm_funcs(sz, bpp)  \
+define_itxfm(idct,  idct,  sz, bpp); \
+define_itxfm(iadst, idct,  sz, bpp); \
+define_itxfm(idct,  iadst, sz, bpp); \
+define_itxfm(iadst, iadst, sz, bpp)
+
+define_itxfm_funcs(4,  BPP);
+define_itxfm_funcs(8,  BPP);
+define_itxfm_funcs(16, BPP);
+define_itxfm(idct, idct, 32, BPP);
+define_itxfm(iwht, iwht, 4,  BPP);
+
+
+static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (have_neon(cpu_flags)) {
+#define init_itxfm2(tx, sz, bpp)                                               \
+    dsp->itxfm_add[tx][DCT_DCT]   = ff_vp9_idct_idct_##sz##_add_##bpp##_neon;  \
+    dsp->itxfm_add[tx][DCT_ADST]  = ff_vp9_iadst_idct_##sz##_add_##bpp##_neon; \
+    dsp->itxfm_add[tx][ADST_DCT]  = ff_vp9_idct_iadst_##sz##_add_##bpp##_neon; \
+    dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_##bpp##_neon
+#define init_itxfm(tx, sz, bpp) init_itxfm2(tx, sz, bpp)
+
+#define init_idct2(tx, nm, bpp) \
+dsp->itxfm_add[tx][DCT_DCT]   = \
+dsp->itxfm_add[tx][ADST_DCT]  = \
+dsp->itxfm_add[tx][DCT_ADST]  = \
+dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_##bpp##_neon
+#define init_idct(tx, nm, bpp) init_idct2(tx, nm, bpp)
+
+init_itxfm(TX_4X4,   4x4,   BPP);
+init_itxfm(TX_8X8,   8x8,   BPP);
+

[FFmpeg-devel] [PATCH 4/8] arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This is pretty much similar to the 8 bpp version, but in some senses
simpler. All input pixels are 16 bits, and all intermediates also fit
in 16 bits, so there's no lengthening/narrowing in the filter at all.

For the full 16 pixel wide filter, we can only process 4 pixels at a time
(using an implementation very much similar to the one for 8 bpp),
but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with
a different implementation of the core filter.

Examples of relative speedup compared to the C version, from checkasm:
                                          Cortex   A7     A8     A9    A53
vp9_loop_filter_h_4_8_10bpp_neon:  1.83   2.16   1.40   2.09
vp9_loop_filter_h_8_8_10bpp_neon:  1.39   1.67   1.24   1.70
vp9_loop_filter_h_16_8_10bpp_neon: 1.56   1.47   1.10   1.81
vp9_loop_filter_h_16_16_10bpp_neon:        1.94   1.69   1.33   2.24
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   2.01   2.27   1.67   2.39
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   1.84   2.06   1.45   2.19
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   1.89   2.20   1.47   2.29
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   1.69   2.12   1.47   2.08
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   3.16   3.98   2.50   4.05
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   2.84   3.64   2.25   3.77
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   2.65   3.45   2.16   3.54
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   2.55   3.30   2.16   3.55
vp9_loop_filter_v_4_8_10bpp_neon:  2.85   3.97   2.24   3.68
vp9_loop_filter_v_8_8_10bpp_neon:  2.27   3.19   1.96   3.08
vp9_loop_filter_v_16_8_10bpp_neon: 3.42   2.74   2.26   4.40
vp9_loop_filter_v_16_16_10bpp_neon:        2.86   2.44   1.93   3.88

The speedup vs C code measured in checkasm is around 1.1-4x.
These numbers are quite inconclusive though, since the checkasm test
runs multiple filterings on top of each other, so later rounds might
end up with different codepaths (different decisions on which filter
to apply, based on input pixel differences).

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-4x.
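
(As a sanity check on the "all intermediates fit in 16 bits" claim, a
hedged scalar sketch of the inner filter delta -- the real clamping
sequence differs in detail: with 12-bit pixels in [0, 4095],
3*(q0 - p0) stays within +/-12285 and the pre-clamped (p1 - q1) term
within roughly +/-2047, so the sum never leaves int16_t range and the
NEON code never has to widen:

#include <stdint.h>

static int16_t filter_delta_12bpp(int16_t p1, int16_t p0,
                                  int16_t q0, int16_t q1)
{
    int16_t hev_term = p1 - q1;   /* clamped earlier in the real filter */
    return 3 * (q0 - p0) + hev_term;
})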
---
 libavcodec/arm/Makefile |1 +
 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c |   62 ++
 libavcodec/arm/vp9lpf_16bpp_neon.S  | 1044 +++
 3 files changed, 1107 insertions(+)
 create mode 100644 libavcodec/arm/vp9lpf_16bpp_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 856c154..1eeac54 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -144,6 +144,7 @@ NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_16bpp_neon.o \
   arm/vp9itxfm_neon.o   \
+  arm/vp9lpf_16bpp_neon.o   \
   arm/vp9lpf_neon.o \
   arm/vp9mc_16bpp_neon.o\
   arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
index 95f2bbc..3620535 100644
--- a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
+++ b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
@@ -187,8 +187,70 @@ static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp)
 }
 }
 
+#define define_loop_filter(dir, wd, size, bpp) \
+void ff_vp9_loop_filter_##dir##_##wd##_##size##_##bpp##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
+
+#define define_loop_filters(wd, size, bpp) \
+define_loop_filter(h, wd, size, bpp);  \
+define_loop_filter(v, wd, size, bpp)
+
+define_loop_filters(4,  8,  BPP);
+define_loop_filters(8,  8,  BPP);
+define_loop_filters(16, 8,  BPP);
+
+define_loop_filters(16, 16, BPP);
+
+define_loop_filters(44, 16, BPP);
+define_loop_filters(48, 16, BPP);
+define_loop_filters(84, 16, BPP);
+define_loop_filters(88, 16, BPP);
+
+static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (have_neon(cpu_flags)) {
+#define init_lpf_func_8(idx1, idx2, dir, wd, bpp) \
+    dsp->loop_filter_8[idx1][idx2] = ff_vp9_loop_filter_##dir##_##wd##_8_##bpp##_neon
+
+#define init_lpf_func_16(idx, dir, bpp) \
+    dsp->loop_filter_16[idx] = ff_vp9_loop_filter_##dir##_16_16_##bpp##_neon
+
+#define init_lpf_func_mix2(idx1, idx2, idx3, dir, wd, bpp) \
+    dsp->loop_filter_mix2[idx1][idx2][idx3] = ff_vp9_loop_filter_##dir##_##wd##_16_##bpp##_neon
+
+#define init_lpf_funcs_8_wd(idx, wd, bpp) \
+init_lpf_func_8(idx, 0, h, wd, bpp);  \
+init_lpf_func_8(idx, 1, v, wd, bpp)
+
+#define init_lpf_funcs_16(bpp)   \
+init_lpf_func_16(0, h, bpp); \
+init_lpf_func_16(1, v, bpp)
+
+#define 

[FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.
---
 libavcodec/arm/vp9dsp_init_arm.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
index 05e50d7..0b76eb1 100644
--- a/libavcodec/arm/vp9dsp_init_arm.c
+++ b/libavcodec/arm/vp9dsp_init_arm.c
@@ -94,13 +94,10 @@ define_8tap_2d_funcs(8)
 define_8tap_2d_funcs(4)
 
 
-static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 if (have_neon(cpu_flags)) {
 #define init_fpel(idx1, idx2, sz, type)  \
 dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \
@@ -160,13 +157,10 @@ define_itxfm(idct, idct, 32);
 define_itxfm(iwht, iwht, 4);
 
 
-static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 if (have_neon(cpu_flags)) {
 #define init_itxfm(tx, sz) \
 dsp->itxfm_add[tx][DCT_DCT]   = ff_vp9_idct_idct_##sz##_add_neon;  \
@@ -218,13 +212,10 @@ lf_mix_fns(4, 8)
 lf_mix_fns(8, 4)
 lf_mix_fns(8, 8)
 
-static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp)
 {
 int cpu_flags = av_get_cpu_flags();
 
-if (bpp != 8)
-return;
-
 if (have_neon(cpu_flags)) {
 dsp->loop_filter_8[0][1] = ff_vp9_loop_filter_v_4_8_neon;
 dsp->loop_filter_8[0][0] = ff_vp9_loop_filter_h_4_8_neon;
@@ -249,7 +240,10 @@ static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp, int bpp)
 
 av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp)
 {
-vp9dsp_mc_init_arm(dsp, bpp);
-vp9dsp_loopfilter_init_arm(dsp, bpp);
-vp9dsp_itxfm_init_arm(dsp, bpp);
+if (bpp != 8)
+return;
+
+vp9dsp_mc_init_arm(dsp);
+vp9dsp_loopfilter_init_arm(dsp);
+vp9dsp_itxfm_init_arm(dsp);
 }
-- 
2.7.4



Re: [FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters

2016-11-14 Thread Martin Storsjö

On Mon, 14 Nov 2016, Ronald S. Bultje wrote:


Hi,

On Mon, Nov 14, 2016 at 5:32 AM, Martin Storsjö <mar...@martin.st> wrote:


Make them aligned, to allow efficient access to them from simd.

This is an adapted cherry-pick from libav commit
a4cfcddcb0f76e837d5abc06840c2b26c0e8aefc.
---
 libavcodec/vp9dsp.c  | 56 +++
 libavcodec/vp9dsp.h  |  3 +++
 libavcodec/vp9dsp_template.c | 63 +++---
 3 files changed, 63 insertions(+), 59 deletions(-)



OK.

Do I need to queue them up?


Yes, that'd be appreciated.


I thought they would be merged automagically from Libav...


In principle, but the merging is quite far behind at the moment. I've 
included the commit hashes of all included commits to make it clear which 
commits can be no-oped in future merges at least.


Also for the record, it has been tested on linux, iOS and with the MSVC 
toolchain (in wine).


// Martin
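
(For context, the alignment change under discussion presumably boils
down to exporting one shared, aligned table instead of per-file static
copies, along these lines -- see DECLARE_ALIGNED in libavutil/mem.h:

/* In vp9dsp.c -- the definition, table contents elided: */
const DECLARE_ALIGNED(16, int16_t, ff_vp9_subpel_filters)[3][16][8] = { /* ... */ };

/* In vp9dsp.h -- plain declaration for the C and asm users: */
extern const int16_t ff_vp9_subpel_filters[3][16][8];)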


[FFmpeg-devel] [PATCH 5/9] arm: vp9: Add NEON loop filters

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

The implementation tries to have smart handling of cases
where no pixels need the full filtering for the 8/16 width
filters, skipping both calculation and writeback of the
unmodified pixels in those cases. The actual effect of this
is hard to test with checkasm though, since it tests the
full filtering, and the benefit depends on how many filtered
blocks use the shortcut.

Examples of relative speedup compared to the C version, from checkasm:
                                 Cortex   A7     A8     A9    A53
vp9_loop_filter_h_4_8_neon:  2.72   2.68   1.78   3.15
vp9_loop_filter_h_8_8_neon:  2.36   2.38   1.70   2.91
vp9_loop_filter_h_16_8_neon: 1.80   1.89   1.45   2.01
vp9_loop_filter_h_16_16_neon:        2.81   2.78   2.18   3.16
vp9_loop_filter_mix2_h_44_16_neon:   2.65   2.67   1.93   3.05
vp9_loop_filter_mix2_h_48_16_neon:   2.46   2.38   1.81   2.85
vp9_loop_filter_mix2_h_84_16_neon:   2.50   2.41   1.73   2.85
vp9_loop_filter_mix2_h_88_16_neon:   2.77   2.66   1.96   3.23
vp9_loop_filter_mix2_v_44_16_neon:   4.28   4.46   3.22   5.70
vp9_loop_filter_mix2_v_48_16_neon:   3.92   4.00   3.03   5.19
vp9_loop_filter_mix2_v_84_16_neon:   3.97   4.31   2.98   5.33
vp9_loop_filter_mix2_v_88_16_neon:   3.91   4.19   3.06   5.18
vp9_loop_filter_v_4_8_neon:  4.53   4.47   3.31   6.05
vp9_loop_filter_v_8_8_neon:  3.58   3.99   2.92   5.17
vp9_loop_filter_v_16_8_neon: 3.40   3.50   2.81   4.68
vp9_loop_filter_v_16_16_neon:        4.66   4.41   3.74   6.02

The speedup vs C code is around 2-6x. The numbers are quite
inconclusive though, since the checkasm test runs multiple filterings
on top of each other, so later rounds might end up with different
codepaths (different decisions on which filter to apply, based
on input pixel differences). Disabling the early-exit in the asm
doesn't give a fair comparison either though, since the C code
only does the necessary calculations for each row.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-9x.

This is pretty similar in runtime to the corresponding routines
in libvpx. (This is comparing vpx_lpf_vertical_16_neon,
vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon
to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon
and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal
and vertical is flipped between the libraries.)

In order to have stable, comparable numbers, the early exits in both
asm versions were disabled, forcing the full filtering codepath.

   Cortex   A7  A8  A9 A53
vp9_loop_filter_h_16_8_neon: 597.2   472.0   482.4   415.0
libvpx vpx_lpf_vertical_16_neon: 626.0   464.5   470.7   445.0
vp9_loop_filter_v_16_8_neon: 500.2   422.5   429.7   295.0
libvpx vpx_lpf_horizontal_edge_8_neon:   586.5   414.5   415.6   383.2
vp9_loop_filter_v_16_16_neon:905.0   784.7   791.5   546.0
libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2   751.7   743.5   685.2

Our version is consistently faster on A7 and A53, marginally slower on
A8, and sometimes faster, sometimes slower on A9 (marginally slower in all
three tests in this particular test run).

This is an adapted cherry-pick from libav commit
dd299a2d6d4d1af9528ed35a8131c35946be5973.
---
 libavcodec/arm/Makefile  |   1 +
 libavcodec/arm/vp9dsp_init_arm.c |  60 +++
 libavcodec/arm/vp9lpf_neon.S | 770 +++
 3 files changed, 831 insertions(+)
 create mode 100644 libavcodec/arm/vp9lpf_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 8602e28..7f18daa 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -141,4 +141,5 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
 NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o   \
+  arm/vp9lpf_neon.o \
   arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
index 1d4eabf..05e50d7 100644
--- a/libavcodec/arm/vp9dsp_init_arm.c
+++ b/libavcodec/arm/vp9dsp_init_arm.c
@@ -188,8 +188,68 @@ static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp)
 }
 }
 
+#define define_loop_filter(dir, wd, size) \
+void ff_vp9_loop_filter_##dir##_##wd##_##size##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
+
+#define define_loop_filters(wd, size) \
+define_loop_filter(h, wd, size);  \
+define_loop_filter(v, wd, size)
+
+define_loop_filters(4, 8);
+define_loop_filters(8, 8);
+define_loop_filters(16, 8);
+define_loop_filters(16, 16);
+
+#define lf_mix_fn(dir, wd1, wd2, stridea)   

[FFmpeg-devel] [PATCH 9/9] aarch64: vp9: Implement NEON loop filters

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the loop filters with
16 pixels at a time. The implementation is fully templated, with
a single macro which can generate versions for both 8 and
16 pixels wide, for both 4, 8 and 16 pixels loop filters
(and the 4/8 mixed versions as well).

For the 8 pixel wide versions, it is pretty close in speed (the
v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
and h_8_8 filters seem to get some gain in the load/transpose/store
part). For the 16 pixels wide ones, we get a speedup of around
1.2-1.4x compared to the 32 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                       ARM   AArch64
vp9_loop_filter_h_4_8_neon:          144.0     127.2
vp9_loop_filter_h_8_8_neon:          207.0     182.5
vp9_loop_filter_h_16_8_neon:         415.0     328.7
vp9_loop_filter_h_16_16_neon:        672.0     558.6
vp9_loop_filter_mix2_h_44_16_neon:   302.0     203.5
vp9_loop_filter_mix2_h_48_16_neon:   365.0     305.2
vp9_loop_filter_mix2_h_84_16_neon:   365.0     305.2
vp9_loop_filter_mix2_h_88_16_neon:   376.0     305.2
vp9_loop_filter_mix2_v_44_16_neon:   193.2     128.2
vp9_loop_filter_mix2_v_48_16_neon:   246.7     218.4
vp9_loop_filter_mix2_v_84_16_neon:   248.0     218.5
vp9_loop_filter_mix2_v_88_16_neon:   302.0     218.2
vp9_loop_filter_v_4_8_neon:           89.0      88.7
vp9_loop_filter_v_8_8_neon:          141.0     137.7
vp9_loop_filter_v_16_8_neon:         295.0     272.7
vp9_loop_filter_v_16_16_neon:        546.0     453.7

The speedup vs C code in checkasm tests is around 2-7x, which is
pretty much the same as for the 32 bit version. Even though these functions
are faster than their 32 bit equivalents, the C code we compare against
also became around 1.3-1.7x faster than the C code built for 32 bit.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-5x.

Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
                                A57 gcc-5.3    neon
loop_filter_h_4_8_neon:               256.6    93.4
loop_filter_h_8_8_neon:               307.3   139.1
loop_filter_h_16_8_neon:              340.1   254.1
loop_filter_h_16_16_neon:             827.0   407.9
loop_filter_mix2_h_44_16_neon:        524.5   155.4
loop_filter_mix2_h_48_16_neon:        644.5   173.3
loop_filter_mix2_h_84_16_neon:        630.5   222.0
loop_filter_mix2_h_88_16_neon:        697.3   222.0
loop_filter_mix2_v_44_16_neon:        598.5   100.6
loop_filter_mix2_v_48_16_neon:        651.5   127.0
loop_filter_mix2_v_84_16_neon:        591.5   167.1
loop_filter_mix2_v_88_16_neon:        855.1   166.7
loop_filter_v_4_8_neon:               271.7    65.3
loop_filter_v_8_8_neon:               312.5   106.9
loop_filter_v_16_8_neon:              473.3   206.5
loop_filter_v_16_16_neon:             976.1   327.8

The speed-up compared to the C functions is 2.5-6x, and the Cortex-A57
is again 30-50% faster than the Cortex-A53.

This is an adapted cherry-pick from libav commits
9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and
31756abe29eb039a11c59a42cb12e0cc2aef3b97.
---
 libavcodec/aarch64/Makefile  |1 +
 libavcodec/aarch64/vp9dsp_init_aarch64.c |   48 ++
 libavcodec/aarch64/vp9lpf_neon.S | 1355 ++
 3 files changed, 1404 insertions(+)
 create mode 100644 libavcodec/aarch64/vp9lpf_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index e8a7f7a..b7bb898 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -43,4 +43,5 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o \
+                                   aarch64/vp9lpf_neon.o   \
                                    aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
index 2848608..7e34375 100644
--- a/libavcodec/aarch64/vp9dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -201,8 +201,56 @@ static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp, int bpp)
 }
 }
 
+#define define_loop_filter(dir, wd, len) \
+void ff_vp9_loop_filter_##dir##_##wd##_##len##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H)
+
+#define define_loop_filters(wd, len) \
+define_loop_filter(h, wd, len);  \
+define_loop_filter(v, wd, len)
+
+define_loop_filters(4, 8);
+define_loop_filters(8, 8);
+define_loop_filters(16, 8);
+
+define_loop_filters(16, 16);
+
+define_loop_filters(44, 16);
+define_loop_filters(48, 16);
+define_loop_filters(84, 16);
+define_loop_filters(88, 16);
+
+static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (bpp != 8)
+

[FFmpeg-devel] [PATCH 7/9] aarch64: vp9: Add NEON optimizations of VP9 MC functions

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                    ARM   AArch64
vp9_avg4_neon:                     27.2      23.7
vp9_avg8_neon:                     56.5      54.7
vp9_avg16_neon:                   169.9     167.4
vp9_avg32_neon:                   585.8     585.2
vp9_avg64_neon:                  2460.3    2294.7
vp9_avg_8tap_smooth_4h_neon:      132.7     125.2
vp9_avg_8tap_smooth_4hv_neon:     478.8     442.0
vp9_avg_8tap_smooth_4v_neon:      126.0      93.7
vp9_avg_8tap_smooth_8h_neon:      241.7     234.2
vp9_avg_8tap_smooth_8hv_neon:     690.9     646.5
vp9_avg_8tap_smooth_8v_neon:      245.0     205.5
vp9_avg_8tap_smooth_64h_neon:   11273.2   11280.1
vp9_avg_8tap_smooth_64hv_neon:  22980.6   22184.1
vp9_avg_8tap_smooth_64v_neon:   11549.7   10781.1
vp9_put4_neon:                     18.0      17.2
vp9_put8_neon:                     40.2      37.7
vp9_put16_neon:                    97.4      99.5
vp9_put32_neon/armv8:             346.0     307.4
vp9_put64_neon/armv8:            1319.0    1107.5
vp9_put_8tap_smooth_4h_neon:      126.7     118.2
vp9_put_8tap_smooth_4hv_neon:     465.7     434.0
vp9_put_8tap_smooth_4v_neon:      113.0      86.5
vp9_put_8tap_smooth_8h_neon:      229.7     221.6
vp9_put_8tap_smooth_8hv_neon:     658.9     621.3
vp9_put_8tap_smooth_8v_neon:      215.0     187.5
vp9_put_8tap_smooth_64h_neon:   10636.7   10627.8
vp9_put_8tap_smooth_64hv_neon:  21076.8   21026.9
vp9_put_8tap_smooth_64v_neon:    9635.0    9632.4

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.

This is an adapted cherry-pick from libav commit
383d96aa2229f644d9bd77b821ed3a309da5e9fc.
---
 libavcodec/aarch64/Makefile  |   2 +
 libavcodec/aarch64/vp9dsp_init_aarch64.c | 156 +++
 libavcodec/aarch64/vp9mc_neon.S  | 676 +++
 libavcodec/vp9.c |   8 +-
 libavcodec/vp9dsp.c  |   1 +
 libavcodec/vp9dsp.h  |   1 +
 6 files changed, 840 insertions(+), 4 deletions(-)
 create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/vp9mc_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index c3df887..e7db95e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -16,6 +16,7 @@ OBJS-$(CONFIG_DCA_DECODER)  += aarch64/synth_filter_init.o
 OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o
 OBJS-$(CONFIG_VC1DSP)   += aarch64/vc1dsp_init_aarch64.o
 OBJS-$(CONFIG_VORBIS_DECODER)   += aarch64/vorbisdsp_init.o
+OBJS-$(CONFIG_VP9_DECODER)  += aarch64/vp9dsp_init_aarch64.o
 
 # ARMv8 optimizations
 
@@ -41,3 +42,4 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
+NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
new file mode 100644
index 000..4adf363
--- /dev/null
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -0,0 +1,156 @@
+/*
+ * Copyright (c) 2016 Google Inc.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+
+#include "libavutil/attributes.h"
+#include "libavutil/aarch64/cpu.h"
+#include "libavcodec/vp9dsp.h"
+
+#define 

[FFmpeg-devel] [PATCH 3/9] arm: vp9: Add NEON optimizations of VP9 MC functions

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

The filter coefficients are signed values, where the product of the
multiplication with one individual filter coefficient doesn't
overflow a 16 bit signed value (the largest filter coefficient is
127). But when the products are accumulated, the resulting sum can
overflow the 16 bit signed range. Instead of accumulating in 32 bit,
we accumulate the largest product (either index 3 or 4) last with a
saturated addition.

(The VP8 MC asm does something similar, but slightly simpler, by
accumulating each half of the filter separately. In the VP9 MC
filters, each half of the filter can also overflow though, so the
largest component has to be handled individually.)
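
A rough scalar model of that accumulation order (illustrative only - the
asm does this for eight pixels at once with 16 bit multiply-accumulates
and a final saturated add, and the largest coefficient is assumed to sit
at index 3 here):

    #include <stdint.h>
    #include "libavutil/common.h"   /* av_clip_int16() */

    static int16_t filter_tap_sum(const uint8_t *src, const int16_t *filt)
    {
        int16_t sum = 0;
        /* The running sum of the seven smaller products stays within
         * the signed 16 bit range... */
        for (int i = 0; i < 8; i++)
            if (i != 3)
                sum += filt[i] * src[i];
        /* ...so only the largest product needs the saturated addition. */
        return av_clip_int16(sum + filt[3] * src[3]);
    }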

Examples of relative speedup compared to the C version, from checkasm:
                                   Cortex  A7     A8     A9    A53
vp9_avg4_neon:   1.71   1.15   1.42   1.49
vp9_avg8_neon:   2.51   3.63   3.14   2.58
vp9_avg16_neon:  2.95   6.76   3.01   2.84
vp9_avg32_neon:  3.29   6.64   2.85   3.00
vp9_avg64_neon:  3.47   6.67   3.14   2.80
vp9_avg_8tap_smooth_4h_neon: 3.22   4.73   2.76   4.67
vp9_avg_8tap_smooth_4hv_neon:3.67   4.76   3.28   4.71
vp9_avg_8tap_smooth_4v_neon: 5.52   7.60   4.60   6.31
vp9_avg_8tap_smooth_8h_neon: 6.22   9.04   5.12   9.32
vp9_avg_8tap_smooth_8hv_neon:6.38   8.21   5.72   8.17
vp9_avg_8tap_smooth_8v_neon: 9.22  12.66   8.15  11.10
vp9_avg_8tap_smooth_64h_neon:7.02  10.23   5.54  11.58
vp9_avg_8tap_smooth_64hv_neon:   6.76   9.46   5.93   9.40
vp9_avg_8tap_smooth_64v_neon:   10.76  14.13   9.46  13.37
vp9_put4_neon:   1.11   1.47   1.00   1.21
vp9_put8_neon:   1.23   2.17   1.94   1.48
vp9_put16_neon:  1.63   4.02   1.73   1.97
vp9_put32_neon:  1.56   4.92   2.00   1.96
vp9_put64_neon:  2.10   5.28   2.03   2.35
vp9_put_8tap_smooth_4h_neon: 3.11   4.35   2.63   4.35
vp9_put_8tap_smooth_4hv_neon:3.67   4.69   3.25   4.71
vp9_put_8tap_smooth_4v_neon: 5.45   7.27   4.49   6.52
vp9_put_8tap_smooth_8h_neon: 5.97   8.18   4.81   8.56
vp9_put_8tap_smooth_8hv_neon:6.39   7.90   5.64   8.15
vp9_put_8tap_smooth_8v_neon: 9.03  11.84   8.07  11.51
vp9_put_8tap_smooth_64h_neon:6.78   9.48   4.88  10.89
vp9_put_8tap_smooth_64hv_neon:   6.99   8.87   5.94   9.56
vp9_put_8tap_smooth_64v_neon:   10.69  13.30   9.43  14.34

For the larger 8tap filters, the speedup vs C code is around 5-14x.

This is significantly faster than libvpx's implementation of the same
functions, at least when comparing the put_8tap_smooth_64 functions
(compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
libvpx).

Absolute runtimes from checkasm:
                                  Cortex A7        A8        A9       A53
vp9_put_8tap_smooth_64h_neon:       20150.3   14489.4   19733.6   10863.7
libvpx vpx_convolve8_horiz_neon:    52623.3   19736.4   21907.7   25027.7

vp9_put_8tap_smooth_64v_neon:       14455.0   12303.9   13746.4    9628.9
libvpx vpx_convolve8_vert_neon:     42090.0   17706.2   17659.9   16941.2

Thus, on the A9, the horizontal filter is only marginally faster than
libvpx, while our version is significantly faster on the other cores,
and the vertical filter is significantly faster on all cores. The
difference is especially large on the A7.

The libvpx implementation does the accumulation in 32 bit, which
probably explains most of the differences.

This is an adapted cherry-pick from libav commits
ffbd1d2b0002576ef0d976a41ff959c635373fdc,
392caa65df3efa8b2d48a80f08a6af4892c61c08,
557c1675cf0e803b2fee43b4c8b58433842c84d0 and
11623217e3c9b859daee544e31acdd0821b61039.
---
 libavcodec/arm/Makefile  |   2 +
 libavcodec/arm/vp9dsp_init_arm.c | 143 
 libavcodec/arm/vp9mc_neon.S  | 709 +++
 libavcodec/vp9.c |  20 +-
 libavcodec/vp9dsp.c  |   1 +
 libavcodec/vp9dsp.h  |   1 +
 6 files changed, 872 insertions(+), 4 deletions(-)
 create mode 100644 libavcodec/arm/vp9dsp_init_arm.c
 create mode 100644 libavcodec/arm/vp9mc_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index a4ceca7..82b740b 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -44,6 +44,7 @@ OBJS-$(CONFIG_MLP_DECODER) += arm/mlpdsp_init_arm.o
 OBJS-$(CONFIG_RV40_DECODER)+= arm/rv40dsp_init_arm.o
 OBJS-$(CONFIG_VORBIS_DECODER)  += arm/vorbisdsp_init_arm.o
 OBJS-$(CONFIG_VP6_DECODER) += arm/vp6dsp_init_arm.o
+OBJS-$(CONFIG_VP9_DECODER) += arm/vp9dsp_init_arm.o
 
 
 # ARMv5 optimizations
@@ -139,3 +140,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER) 

[FFmpeg-devel] [PATCH 6/9] aarch64: Add an offset parameter to the movrel macro

2016-11-14 Thread Martin Storsjö
With apple tools, the linker fails with errors like these, if the
offset is negative:

ld: in section __TEXT,__text reloc 8: symbol index out of range for 
architecture arm64

This is cherry-picked from libav commit
c44a8a3eabcd6acd2ba79f32ec8a432e6ebe552c.
---
 libavutil/aarch64/asm.S | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/libavutil/aarch64/asm.S b/libavutil/aarch64/asm.S
index ff34e7a..523b8c5 100644
--- a/libavutil/aarch64/asm.S
+++ b/libavutil/aarch64/asm.S
@@ -72,15 +72,21 @@ ELF .size   \name, . - \name
 \name:
 .endm
 
-.macro  movrel rd, val
+.macro  movrel rd, val, offset=0
 #if CONFIG_PIC && defined(__APPLE__)
+.if \offset < 0
 adrp\rd, \val@PAGE
 add \rd, \rd, \val@PAGEOFF
+sub \rd, \rd, -(\offset)
+.else
+adrp\rd, \val+(\offset)@PAGE
+add \rd, \rd, \val+(\offset)@PAGEOFF
+.endif
 #elif CONFIG_PIC
-adrp\rd, \val
-add \rd, \rd, :lo12:\val
+adrp\rd, \val+\offset
+add \rd, \rd, :lo12:\val+\offset
 #else
-ldr \rd, =\val
+ldr \rd, =\val+\offset
 #endif
 .endm
 
-- 
2.7.4



[FFmpeg-devel] [PATCH 4/9] arm: vp9: Add NEON itxfm routines

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

For the transforms up to 8x8, we can fit all the data (including
temporaries) in registers and just do a straightforward transform
of all the data. For 16x16, we do a transform of 4x16 pixels in
4 slices, using a temporary buffer. For 32x32, we transform 4x32
pixels at a time, in two steps of 4x16 pixels each.
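
In rough C pseudocode, the 16x16 case looks like this (the helper names
and exact parameters are hypothetical; pass 1 also transposes each slice
as it stores it, so pass 2 can read rows linearly):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical 1D transform helpers for one 4x16 slice. */
    void txfm16_1d_4x16_pass1(int16_t *out, const int16_t *in, int slice);
    void txfm16_1d_4x16_pass2(uint8_t *dst, ptrdiff_t stride,
                              const int16_t *in, int slice);

    static void itxfm_16x16_add(uint8_t *dst, ptrdiff_t stride, int16_t *src)
    {
        int16_t tmp[16 * 16];
        /* Pass 1: transform vertical 4x16 slices into the temp buffer. */
        for (int i = 0; i < 16; i += 4)
            txfm16_1d_4x16_pass1(tmp + i * 16, src + i, i);
        /* Pass 2: transform the rows of the temp buffer, add into dst. */
        for (int i = 0; i < 16; i += 4)
            txfm16_1d_4x16_pass2(dst + i, stride, tmp + i, i);
    }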

Examples of relative speedup compared to the C version, from checkasm:
                                   Cortex   A7     A8     A9    A53
vp9_inv_adst_adst_4x4_add_neon: 3.39   5.83   4.17   4.01
vp9_inv_adst_adst_8x8_add_neon: 3.79   4.86   4.23   3.98
vp9_inv_adst_adst_16x16_add_neon:   3.33   4.36   4.11   4.16
vp9_inv_dct_dct_4x4_add_neon:   4.06   6.16   4.59   4.46
vp9_inv_dct_dct_8x8_add_neon:   4.61   6.01   4.98   4.86
vp9_inv_dct_dct_16x16_add_neon: 3.35   3.44   3.36   3.79
vp9_inv_dct_dct_32x32_add_neon: 3.89   3.50   3.79   4.42
vp9_inv_wht_wht_4x4_add_neon:   3.22   5.13   3.53   3.77

Thus, the speedup vs C code is around 3-6x.

This is mostly marginally faster than the corresponding routines
in libvpx on most cores, tested with their 32x32 idct (compared to
vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
favour, since their version doesn't clear the input buffer like ours
does (although the effect of that on the total runtime probably is
negligible).

                                     Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_add_neon:        18436.8  16874.1  14235.1  11988.9
libvpx vpx_idct32x32_1024_add_neon     20789.0  13344.3  15049.9  13030.5

Only on the Cortex-A8 is the libvpx function faster; on the other cores,
ours is slightly faster even though it has the source block clearing
integrated.

This is an adapted cherry-pick from libav commits
a67ae67083151f2f9595a1f2d17b601da19b939e and
52d196fb30fb6628921b5f1b31e7bd11eb7e1d9a.
---
 libavcodec/arm/Makefile  |3 +-
 libavcodec/arm/vp9dsp_init_arm.c |   54 +-
 libavcodec/arm/vp9itxfm_neon.S   | 1149 ++
 3 files changed, 1204 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/arm/vp9itxfm_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 82b740b..8602e28 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -140,4 +140,5 @@ NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o
-NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9mc_neon.o
+NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o   \
+  arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
index bd0ac37..1d4eabf 100644
--- a/libavcodec/arm/vp9dsp_init_arm.c
+++ b/libavcodec/arm/vp9dsp_init_arm.c
@@ -94,7 +94,7 @@ define_8tap_2d_funcs(8)
 define_8tap_2d_funcs(4)
 
 
-av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp, int bpp)
 {
 int cpu_flags = av_get_cpu_flags();
 
@@ -141,3 +141,55 @@ av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp)
 init_mc_funcs_dirs(4, 4);
 }
 }
+
+#define define_itxfm(type_a, type_b, sz)   \
+void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_neon(uint8_t *_dst,\
+ ptrdiff_t stride, \
+ int16_t *_block, int eob)
+
+#define define_itxfm_funcs(sz)  \
+define_itxfm(idct,  idct,  sz); \
+define_itxfm(iadst, idct,  sz); \
+define_itxfm(idct,  iadst, sz); \
+define_itxfm(iadst, iadst, sz)
+
+define_itxfm_funcs(4);
+define_itxfm_funcs(8);
+define_itxfm_funcs(16);
+define_itxfm(idct, idct, 32);
+define_itxfm(iwht, iwht, 4);
+
+
+static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (bpp != 8)
+return;
+
+if (have_neon(cpu_flags)) {
+#define init_itxfm(tx, sz) \
+dsp->itxfm_add[tx][DCT_DCT]   = ff_vp9_idct_idct_##sz##_add_neon;  \
+dsp->itxfm_add[tx][DCT_ADST]  = ff_vp9_iadst_idct_##sz##_add_neon; \
+dsp->itxfm_add[tx][ADST_DCT]  = ff_vp9_idct_iadst_##sz##_add_neon; \
+dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_neon
+
+#define init_idct(tx, nm)   \
+dsp->itxfm_add[tx][DCT_DCT]   = \
+dsp->itxfm_add[tx][ADST_DCT]  = \
+dsp->itxfm_add[tx][DCT_ADST]  = \
+dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_neon
+
+init_itxfm(TX_4X4, 4x4);
+init_itxfm(TX_8X8, 8x8);
+init_itxfm(TX_16X16, 16x16);
+init_idct(TX_32X32, idct_idct_32x32);
+init_idct(4, iwht_iwht_4x4);
+}
+}

[FFmpeg-devel] [PATCH 2/9] arm: Clear the gp register alias at the end of functions

2016-11-14 Thread Martin Storsjö
We reset .Lpic_gp to zero at the start of each function, which means
that the logic within movrelx for clearing gp when necessary will
be missed.

This fixes using movrelx in different functions with a different
helper register.

This is cherry-picked from libav commit
824e8c284054f323f854892d1b4739239ed1fdc7.
---
 libavutil/arm/asm.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/libavutil/arm/asm.S b/libavutil/arm/asm.S
index e9b0bca..b0a6e50 100644
--- a/libavutil/arm/asm.S
+++ b/libavutil/arm/asm.S
@@ -77,6 +77,9 @@ ELF .section .note.GNU-stack,"",%progbits @ Mark stack as non-executable
 put_pic %(.Lpic_idx - 1)
 .noaltmacro
   .endif
+  .if .Lpic_gp
+.unreq  gp
+  .endif
 ELF .size   \name, . - \name
 FUNC.endfunc
 .purgem endfunc
-- 
2.7.4



[FFmpeg-devel] [PATCH 8/9] aarch64: vp9: Add NEON itxfm routines

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the 16x16 and 32x32
transforms in slices 8 pixels wide instead of 4. This gives
a speedup of around 1.4x compared to the 32 bit version.

The fact that aarch64 doesn't have the same d/q register
aliasing makes some of the macros quite a bit simpler as well.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                      ARM    AArch64
vp9_inv_adst_adst_4x4_add_neon:      90.0       87.7
vp9_inv_adst_adst_8x8_add_neon:     400.0      354.7
vp9_inv_adst_adst_16x16_add_neon:  2526.5     1827.2
vp9_inv_dct_dct_4x4_add_neon:        74.0       72.7
vp9_inv_dct_dct_8x8_add_neon:       271.0      256.7
vp9_inv_dct_dct_16x16_add_neon:    1960.7     1372.7
vp9_inv_dct_dct_32x32_add_neon:   11988.9     8088.3
vp9_inv_wht_wht_4x4_add_neon:        63.0       57.7

The speedup vs C code (2-4x) is smaller than in the 32 bit case,
mostly because the C code ends up significantly faster (around
1.6x faster, with GCC 5.4) when built for aarch64.

Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3   neon
vp9_inv_adst_adst_4x4_add_neon:   152.2   60.0
vp9_inv_adst_adst_8x8_add_neon:   948.2  288.0
vp9_inv_adst_adst_16x16_add_neon:4830.4 1380.5
vp9_inv_dct_dct_4x4_add_neon: 153.0   58.6
vp9_inv_dct_dct_8x8_add_neon: 789.2  180.2
vp9_inv_dct_dct_16x16_add_neon:  3639.6  917.1
vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0
vp9_inv_wht_wht_4x4_add_neon:  91.0   49.8

The asm is around 3-4x faster than C on the Cortex-A57, and around
30-50% faster on the A57 compared to the A53.

This is an adapted cherry-pick from libav commit
3c9546dfafcdfe8e7860aff9ebbf609318220f29.
---
 libavcodec/aarch64/Makefile  |3 +-
 libavcodec/aarch64/vp9dsp_init_aarch64.c |   54 +-
 libavcodec/aarch64/vp9itxfm_neon.S   | 1116 ++
 3 files changed, 1171 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/vp9itxfm_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index e7db95e..e8a7f7a 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -42,4 +42,5 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)  += aarch64/vorbisdsp_neon.o
-NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9mc_neon.o
+NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o \
+   aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
index 4adf363..2848608 100644
--- a/libavcodec/aarch64/vp9dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -96,7 +96,7 @@ define_8tap_2d_funcs(16)
 define_8tap_2d_funcs(8)
 define_8tap_2d_funcs(4)
 
-av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp)
+static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp)
 {
 int cpu_flags = av_get_cpu_flags();
 
@@ -154,3 +154,55 @@ av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp)
 init_mc_funcs_dirs(4, 4);
 }
 }
+
+#define define_itxfm(type_a, type_b, sz)   \
+void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_neon(uint8_t *_dst,\
+ ptrdiff_t stride, \
+ int16_t *_block, int eob)
+
+#define define_itxfm_funcs(sz)  \
+define_itxfm(idct,  idct,  sz); \
+define_itxfm(iadst, idct,  sz); \
+define_itxfm(idct,  iadst, sz); \
+define_itxfm(iadst, iadst, sz)
+
+define_itxfm_funcs(4);
+define_itxfm_funcs(8);
+define_itxfm_funcs(16);
+define_itxfm(idct, idct, 32);
+define_itxfm(iwht, iwht, 4);
+
+
+static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp, int bpp)
+{
+int cpu_flags = av_get_cpu_flags();
+
+if (bpp != 8)
+return;
+
+if (have_neon(cpu_flags)) {
+#define init_itxfm(tx, sz) \
+dsp->itxfm_add[tx][DCT_DCT]   = ff_vp9_idct_idct_##sz##_add_neon;  \
+dsp->itxfm_add[tx][DCT_ADST]  = ff_vp9_iadst_idct_##sz##_add_neon; \
+dsp->itxfm_add[tx][ADST_DCT]  = ff_vp9_idct_iadst_##sz##_add_neon; \
+dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_neon
+
+#define init_idct(tx, nm)   \
+dsp->itxfm_add[tx][DCT_DCT]   = \
+dsp->itxfm_add[tx][ADST_DCT]  = \
+dsp->itxfm_add[tx][DCT_ADST]  = \
+dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_neon
+
+init_itxfm(TX_4X4, 4x4);
+init_itxfm(TX_8X8, 8x8);
+init_itxfm(TX_16X16, 16x16);
+  

[FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters

2016-11-14 Thread Martin Storsjö
Make them aligned, to allow efficient access to them from simd.
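
(DECLARE_ALIGNED is FFmpeg's portable alignment wrapper; with GCC/Clang
the declaration used below is roughly equivalent to the following, while
other compilers get e.g. __declspec(align(16)) instead:)

    __attribute__((aligned(16)))
    const int16_t ff_vp9_subpel_filters[3][16][8] = { /* ... */ };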

This is an adapted cherry-pick from libav commit
a4cfcddcb0f76e837d5abc06840c2b26c0e8aefc.
---
 libavcodec/vp9dsp.c  | 56 +++
 libavcodec/vp9dsp.h  |  3 +++
 libavcodec/vp9dsp_template.c | 63 +++-
 3 files changed, 63 insertions(+), 59 deletions(-)

diff --git a/libavcodec/vp9dsp.c b/libavcodec/vp9dsp.c
index 54e77e2..6dd49c8 100644
--- a/libavcodec/vp9dsp.c
+++ b/libavcodec/vp9dsp.c
@@ -25,6 +25,62 @@
 #include "libavutil/common.h"
 #include "vp9dsp.h"
 
+const DECLARE_ALIGNED(16, int16_t, ff_vp9_subpel_filters)[3][16][8] = {
+[FILTER_8TAP_REGULAR] = {
+{  0,  0,   0, 128,   0,   0,  0,  0 },
+{  0,  1,  -5, 126,   8,  -3,  1,  0 },
+{ -1,  3, -10, 122,  18,  -6,  2,  0 },
+{ -1,  4, -13, 118,  27,  -9,  3, -1 },
+{ -1,  4, -16, 112,  37, -11,  4, -1 },
+{ -1,  5, -18, 105,  48, -14,  4, -1 },
+{ -1,  5, -19,  97,  58, -16,  5, -1 },
+{ -1,  6, -19,  88,  68, -18,  5, -1 },
+{ -1,  6, -19,  78,  78, -19,  6, -1 },
+{ -1,  5, -18,  68,  88, -19,  6, -1 },
+{ -1,  5, -16,  58,  97, -19,  5, -1 },
+{ -1,  4, -14,  48, 105, -18,  5, -1 },
+{ -1,  4, -11,  37, 112, -16,  4, -1 },
+{ -1,  3,  -9,  27, 118, -13,  4, -1 },
+{  0,  2,  -6,  18, 122, -10,  3, -1 },
+{  0,  1,  -3,   8, 126,  -5,  1,  0 },
+}, [FILTER_8TAP_SHARP] = {
+{  0,  0,   0, 128,   0,   0,  0,  0 },
+{ -1,  3,  -7, 127,   8,  -3,  1,  0 },
+{ -2,  5, -13, 125,  17,  -6,  3, -1 },
+{ -3,  7, -17, 121,  27, -10,  5, -2 },
+{ -4,  9, -20, 115,  37, -13,  6, -2 },
+{ -4, 10, -23, 108,  48, -16,  8, -3 },
+{ -4, 10, -24, 100,  59, -19,  9, -3 },
+{ -4, 11, -24,  90,  70, -21, 10, -4 },
+{ -4, 11, -23,  80,  80, -23, 11, -4 },
+{ -4, 10, -21,  70,  90, -24, 11, -4 },
+{ -3,  9, -19,  59, 100, -24, 10, -4 },
+{ -3,  8, -16,  48, 108, -23, 10, -4 },
+{ -2,  6, -13,  37, 115, -20,  9, -4 },
+{ -2,  5, -10,  27, 121, -17,  7, -3 },
+{ -1,  3,  -6,  17, 125, -13,  5, -2 },
+{  0,  1,  -3,   8, 127,  -7,  3, -1 },
+}, [FILTER_8TAP_SMOOTH] = {
+{  0,  0,   0, 128,   0,   0,  0,  0 },
+{ -3, -1,  32,  64,  38,   1, -3,  0 },
+{ -2, -2,  29,  63,  41,   2, -3,  0 },
+{ -2, -2,  26,  63,  43,   4, -4,  0 },
+{ -2, -3,  24,  62,  46,   5, -4,  0 },
+{ -2, -3,  21,  60,  49,   7, -4,  0 },
+{ -1, -4,  18,  59,  51,   9, -4,  0 },
+{ -1, -4,  16,  57,  53,  12, -4, -1 },
+{ -1, -4,  14,  55,  55,  14, -4, -1 },
+{ -1, -4,  12,  53,  57,  16, -4, -1 },
+{  0, -4,   9,  51,  59,  18, -4, -1 },
+{  0, -4,   7,  49,  60,  21, -3, -2 },
+{  0, -4,   5,  46,  62,  24, -3, -2 },
+{  0, -4,   4,  43,  63,  26, -2, -2 },
+{  0, -3,   2,  41,  63,  29, -2, -2 },
+{  0, -3,   1,  38,  64,  32, -1, -3 },
+}
+};
+
+
 av_cold void ff_vp9dsp_init(VP9DSPContext *dsp, int bpp, int bitexact)
 {
 if (bpp == 8) {
diff --git a/libavcodec/vp9dsp.h b/libavcodec/vp9dsp.h
index 733f5bf..cb43f5e 100644
--- a/libavcodec/vp9dsp.h
+++ b/libavcodec/vp9dsp.h
@@ -120,6 +120,9 @@ typedef struct VP9DSPContext {
 vp9_scaled_mc_func smc[5][4][2];
 } VP9DSPContext;
 
+
+extern const int16_t ff_vp9_subpel_filters[3][16][8];
+
 void ff_vp9dsp_init(VP9DSPContext *dsp, int bpp, int bitexact);
 
 void ff_vp9dsp_init_8(VP9DSPContext *dsp);
diff --git a/libavcodec/vp9dsp_template.c b/libavcodec/vp9dsp_template.c
index 4d810fe..bb54561 100644
--- a/libavcodec/vp9dsp_template.c
+++ b/libavcodec/vp9dsp_template.c
@@ -1991,61 +1991,6 @@ copy_avg_fn(4)
 
 #endif /* BIT_DEPTH != 12 */
 
-static const int16_t vp9_subpel_filters[3][16][8] = {
-[FILTER_8TAP_REGULAR] = {
-{  0,  0,   0, 128,   0,   0,  0,  0 },
-{  0,  1,  -5, 126,   8,  -3,  1,  0 },
-{ -1,  3, -10, 122,  18,  -6,  2,  0 },
-{ -1,  4, -13, 118,  27,  -9,  3, -1 },
-{ -1,  4, -16, 112,  37, -11,  4, -1 },
-{ -1,  5, -18, 105,  48, -14,  4, -1 },
-{ -1,  5, -19,  97,  58, -16,  5, -1 },
-{ -1,  6, -19,  88,  68, -18,  5, -1 },
-{ -1,  6, -19,  78,  78, -19,  6, -1 },
-{ -1,  5, -18,  68,  88, -19,  6, -1 },
-{ -1,  5, -16,  58,  97, -19,  5, -1 },
-{ -1,  4, -14,  48, 105, -18,  5, -1 },
-{ -1,  4, -11,  37, 112, -16,  4, -1 },
-{ -1,  3,  -9,  27, 118, -13,  4, -1 },
-{  0,  2,  -6,  18, 122, -10,  3, -1 },
-{  0,  1,  -3,   8, 126,  -5,  1,  0 },
-}, [FILTER_8TAP_SHARP] = {
-{  0,  0,   0, 128,   0,   0,  0,  0 },
-{ -1,  3,  -7, 127,   8,  -3,  1,  0 },
-{ -2,  5, -13, 125,  17,  -6,  3, -1 },
-{ -3,  7, -17, 121,  27, -10,  

[FFmpeg-devel] [PATCH 01/13] aarch64: vp9: use alternative returns in the core loop filter function

2017-01-09 Thread Martin Storsjö
From: Janne Grunau 

Since aarch64 has enough free general purpose registers, use them to
branch to the appropriate storage code. This is 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a Cortex-A53. Mixed results
(up to 2 cycles faster/slower) on a Cortex-A57.
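
Conceptually, this is like the following C analogy with function pointers
(not the actual mechanism - the asm passes code addresses in x13-x15 and
"returns" with br; all names here are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    typedef void (*writeout_fn)(uint8_t *dst, ptrdiff_t stride);

    /* The caller picks the writeout paths up front; the callee jumps
     * straight to the right one instead of handing back a status code
     * for the caller to branch on. */
    static void loop_filter_core(uint8_t *dst, ptrdiff_t stride,
                                 int flat8in, int flat8out,
                                 writeout_fn inner4, writeout_fn inner6,
                                 writeout_fn full)
    {
        /* ... filtering ... */
        if (!flat8in && !flat8out)
            inner4(dst, stride);
        else if (!flat8out)
            inner6(dst, stride);
        else
            full(dst, stride);
    }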

This is cherrypicked from libav commit
d7595de0b25e7064fd9e06dea5d0425536cef6dc.
---
 libavcodec/aarch64/vp9lpf_neon.S | 48 +++-
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index e727a4d..78aae61 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -410,15 +410,19 @@
 .endif
 // If no pixels needed flat8in nor flat8out, jump to a
 // writeout of the inner 4 pixels
-cbz x5,  7f
+cbnzx5,  1f
+br  x14
+1:
 mov x5,  v7.d[0]
 .ifc \sz, .16b
 mov x6,  v7.d[1]
 orr x5,  x5,  x6
 .endif
 // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
-cbz x5,  8f
+cbnzx5,  1f
+br  x15
 
+1:
 // flat8out
 // This writes all outputs into v2-v17 (skipping v6 and v16).
 // If this part is skipped, the output is read from v21-v26 (which is the input
@@ -549,35 +553,24 @@ endfunc
 
 function vp9_loop_filter_8
 loop_filter 8,  .8b,  0,v16, v17, v18, v19, v28, v29, v30, v31
-mov x5,  #0
 ret
 6:
-mov x5,  #6
-ret
+br  x13
 9:
 br  x10
 endfunc
 
 function vp9_loop_filter_8_16b_mix
 loop_filter 8,  .16b, 88,   v16, v17, v18, v19, v28, v29, v30, v31
-mov x5,  #0
 ret
 6:
-mov x5,  #6
-ret
+br  x13
 9:
 br  x10
 endfunc
 
 function vp9_loop_filter_16
 loop_filter 16, .8b,  0,v8,  v9,  v10, v11, v12, v13, v14, v15
-mov x5,  #0
-ret
-7:
-mov x5,  #7
-ret
-8:
-mov x5,  #8
 ret
 9:
 ldp d8,  d9,  [sp], 0x10
@@ -589,13 +582,6 @@ endfunc
 
 function vp9_loop_filter_16_16b
 loop_filter 16, .16b, 0,v8,  v9,  v10, v11, v12, v13, v14, v15
-mov x5,  #0
-ret
-7:
-mov x5,  #7
-ret
-8:
-mov x5,  #8
 ret
 9:
 ldp d8,  d9,  [sp], 0x10
@@ -614,11 +600,14 @@ endfunc
 .endm
 
 .macro loop_filter_8
+// calculate alternative 'return' targets
+adr x13, 6f
 bl  vp9_loop_filter_8
-cbnzx5,  6f
 .endm
 
 .macro loop_filter_8_16b_mix mix
+// calculate alternative 'return' targets
+adr x13, 6f
 .if \mix == 48
 mov x11, #0x
 .elseif \mix == 84
@@ -627,21 +616,20 @@ endfunc
 mov x11, #0x
 .endif
 bl  vp9_loop_filter_8_16b_mix
-cbnzx5,  6f
 .endm
 
 .macro loop_filter_16
+// calculate alternative 'return' targets
+adr x14, 7f
+adr x15, 8f
 bl  vp9_loop_filter_16
-cmp x5,  7
-b.gt8f
-b.eq7f
 .endm
 
 .macro loop_filter_16_16b
+// calculate alternative 'return' targets
+adr x14, 7f
+adr x15, 8f
 bl  vp9_loop_filter_16_16b
-cmp x5,  7
-b.gt8f
-b.eq7f
 .endm
 
 
-- 
2.7.4



[FFmpeg-devel] [PATCH 10/13] aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32

2017-01-09 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.
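
A rough C model of the pass-1 decision (the min_eob table is the real one
from this patch; the helpers and their signatures are hypothetical):

    #include <stdint.h>

    /* min_eob[i] is the lowest eob for which slice i can still contain
     * nonzero coefficients. */
    static const int16_t min_eob_idct_idct_32[] = { 0, 34, 135, 336 };

    void zero_slices(int16_t *tmp, int first_slice);             /* hypothetical */
    void idct32_1d_8x32_pass1(int16_t *tmp, const int16_t *src); /* hypothetical */

    static void idct32_pass1_sliced(int16_t *tmp, const int16_t *src, int eob)
    {
        for (int i = 0; i < 4; i++) {
            if (i > 0 && eob <= min_eob_idct_idct_32[i]) {
                /* The remaining input is known to be all zeros: just
                 * zero the rest of the temp buffer instead. */
                zero_slices(tmp, i);
                break;
            }
            idct32_1d_8x32_pass1(tmp + i * 8, src + i * 8);
        }
    }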

This is cherrypicked from libav commits
cad42fadcd2c2ae1b3676bb398844a1f521a2d7b and
a0c443a3980dc22eb02b067ac4cb9ffa2f9b04d2.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 61 ++
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index e5fc612..82f1f41 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -588,6 +588,9 @@ endfunc
 .macro store i, dst, inc
 st1 {v\i\().8h},  [\dst], \inc
 .endm
+.macro movi_v i, size, imm
+moviv\i\()\size,  \imm
+.endm
 .macro load_clear i, src, inc
 ld1 {v\i\().8h}, [\src]
 st1 {v2.8h},  [\src], \inc
@@ -596,9 +599,8 @@ endfunc
 // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it,
 // transpose into a horizontal 16x8 slice and store.
 // x0 = dst (temp buffer)
-// x1 = unused
+// x1 = slice offset
 // x2 = src
-// x3 = slice offset
 // x9 = input stride
 .macro itxfm16_1d_funcs txfm
 function \txfm\()16_1d_8x16_pass1_neon
@@ -616,14 +618,14 @@ function \txfm\()16_1d_8x16_pass1_neon
 transpose_8x8H  v24, v25, v26, v27, v28, v29, v30, v31, v2, v3
 
 // Store the transposed 8x8 blocks horizontally.
-cmp x3,  #8
+cmp x1,  #8
 b.eq1f
 .irp i, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31
 store   \i,  x0,  #16
 .endr
 ret
 1:
-// Special case: For the last input column (x3 == 8),
+// Special case: For the last input column (x1 == 8),
 // which would be stored as the last row in the temp buffer,
 // don't store the first 8x8 block, but keep it in registers
 // for the first slice of the second pass (where it is the
@@ -751,13 +753,36 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 
 .irp i, 0, 8
 add x0,  sp,  #(\i*32)
+.ifc \txfm1\()_\txfm2,idct_idct
+.if \i == 8
+cmp w3,  #38
+b.le1f
+.endif
+.endif
+mov x1,  #\i
 add x2,  x6,  #(\i*2)
-mov x3,  #\i
 bl  \txfm1\()16_1d_8x16_pass1_neon
 .endr
 .ifc \txfm1\()_\txfm2,iadst_idct
 ld1 {v0.8h,v1.8h}, [x10]
 .endif
+
+.ifc \txfm1\()_\txfm2,idct_idct
+b   3f
+1:
+// Set v24-v31 to zero, for the in-register passthrough of
+// coefficients to pass 2. Since we only do two slices, this can
+// only ever happen for the second slice. So we only need to store
+// zeros to the temp buffer for the second half of the buffer.
+// Move x0 to the second half, and use x9 == 32 as increment.
+add x0,  x0,  #16
+.irp i, 24, 25, 26, 27, 28, 29, 30, 31
+movi_v  \i,  .16b, #0
+st1 {v24.8h},  [x0], x9
+.endr
+3:
+.endif
+
 .irp i, 0, 8
 add x0,  x4,  #(\i)
 mov x1,  x5
@@ -1073,12 +1098,17 @@ function idct32_1d_8x32_pass2_neon
 ret
 endfunc
 
+const min_eob_idct_idct_32, align=4
+.short  0, 34, 135, 336
+endconst
+
 function ff_vp9_idct_idct_32x32_add_neon, export=1
 cmp w3,  #1
 b.eqidct32x32_dc_add_neon
 
 movrel  x10, idct_coeffs
 add x11, x10, #32
+movrel  x12, min_eob_idct_idct_32, 2
 
 mov x15, x30
 
@@ -1099,9 +1129,30 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 
 .irp i, 0, 8, 16, 24
   

[FFmpeg-devel] [PATCH 09/13] arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32

2017-01-09 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

 Cortex A7   A8   A9  A53
vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3

By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1

I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.

In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.

This is cherrypicked from libav commit
9c8bc74c2b40537b0997f646c87c008042d788c2.
---
 libavcodec/arm/vp9itxfm_neon.S | 75 +-
 tests/checkasm/vp9dsp.c|  6 ++--
 2 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index d5b8495..25f6dde 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -659,9 +659,8 @@ endfunc
 @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
 @ transpose into a horizontal 16x4 slice and store.
 @ r0 = dst (temp buffer)
-@ r1 = unused
+@ r1 = slice offset
 @ r2 = src
-@ r3 = slice offset
 function \txfm\()16_1d_4x16_pass1_neon
 mov r12, #32
 vmov.s16q2, #0
@@ -678,14 +677,14 @@ function \txfm\()16_1d_4x16_pass1_neon
 transpose16_q_4x_4x4 q8,  q9,  q10, q11, q12, q13, q14, q15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, d28, d29, d30, d31
 
 @ Store the transposed 4x4 blocks horizontally.
-cmp r3,  #12
+cmp r1,  #12
 beq 1f
 .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
 vst1.16 {d\i}, [r0,:64]!
 .endr
 bx  lr
 1:
-@ Special case: For the last input column (r3 == 12),
+@ Special case: For the last input column (r1 == 12),
 @ which would be stored as the last row in the temp buffer,
 @ don't store the first 4x4 block, but keep it in registers
 @ for the first slice of the second pass (where it is the
@@ -781,15 +780,22 @@ endfunc
 itxfm16_1d_funcs idct
 itxfm16_1d_funcs iadst
 
+@ This is the minimum eob value for each subpartition, in increments of 4
+const min_eob_idct_idct_16, align=4
+.short  0, 10, 38, 89
+endconst
+
 .macro itxfm_func16x16 txfm1, txfm2
 function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
 cmp r3,  #1
 beq idct16x16_dc_add_neon
 .endif
-push{r4-r7,lr}
+push{r4-r8,lr}
 .ifnc \txfm1\()_\txfm2,idct_idct
 vpush   {q4-q7}
+.else
+movrel  r8,  min_eob_idct_idct_16 + 2
 .endif
 
 @ Align the stack, allocate a temp buffer
@@ -810,10 +816,36 @@ A   and r7,  sp,  #15
 
 .irp i, 0, 4, 8, 12
 add r0,  sp,  #(\i*32)
+.ifc \txfm1\()_\txfm2,idct_idct
+.if \i > 0
+ldrh_post   r1,  r8,  #2
+cmp r3,  r1
+it  le
+movle   r1,  #(16 - \i)/4
+ble 1f
+.endif
+.endif
+mov r1,  #\i
 add r2,  r6,  #(\i*2)
-mov r3,  #\i
 bl  \txfm1\()16_1d_4x16_pass1_neon
 .endr
+
+.ifc \txfm1\()_\txfm2,idct_idct

[FFmpeg-devel] [PATCH 06/13] arm: vp9itxfm: Rename a macro parameter to fit better

2017-01-09 Thread Martin Storsjö
Since the same parameter is used for both input and output,
the name inout is more fitting.

This matches the naming used below in the dmbutterfly macro.

This is cherrypicked from libav commit
79566ec8c77969d5f9be533de04b1349834cca62.
---
 libavcodec/arm/vp9itxfm_neon.S | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index b4cc592..0097f5f 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -125,16 +125,16 @@ endconst
 vmlal.s16   \out4, \in4, \coef1
 .endm
 
-@ in1 = (in1 * coef1 - in2 * coef2 + (1 << 13)) >> 14
-@ in2 = (in1 * coef2 + in2 * coef1 + (1 << 13)) >> 14
-@ in are 2 d registers, tmp are 2 q registers
-.macro mbutterfly in1, in2, coef1, coef2, tmp1, tmp2, neg=0
-mbutterfly_l\tmp1, \tmp2, \in1, \in2, \coef1, \coef2
+@ inout1 = (inout1 * coef1 - inout2 * coef2 + (1 << 13)) >> 14
+@ inout2 = (inout1 * coef2 + inout2 * coef1 + (1 << 13)) >> 14
+@ inout are 2 d registers, tmp are 2 q registers
+.macro mbutterfly inout1, inout2, coef1, coef2, tmp1, tmp2, neg=0
+mbutterfly_l\tmp1, \tmp2, \inout1, \inout2, \coef1, \coef2
 .if \neg > 0
 vneg.s32\tmp2, \tmp2
 .endif
-vrshrn.s32  \in1, \tmp1,  #14
-vrshrn.s32  \in2, \tmp2,  #14
+vrshrn.s32  \inout1, \tmp1,  #14
+vrshrn.s32  \inout2, \tmp2,  #14
 .endm
 
@ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14
-- 
2.7.4



[FFmpeg-devel] [PATCH 05/13] arm/aarch64: vp9itxfm: Fix indentation of macro arguments

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit
721bc37522c5c1d6a8c3cea5e9c3fcde8d256c05.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 16 
 libavcodec/arm/vp9itxfm_neon.S |  8 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 3535c7b..d5165bf 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -969,14 +969,14 @@ function idct32_1d_8x32_pass1_neon
 st1 {v7.8h},  [x0], #16
 .endm
 
-store_rev 31, 23
-store_rev 30, 22
-store_rev 29, 21
-store_rev 28, 20
-store_rev 27, 19
-store_rev 26, 18
-store_rev 25, 17
-store_rev 24, 16
+store_rev   31, 23
+store_rev   30, 22
+store_rev   29, 21
+store_rev   28, 20
+store_rev   27, 19
+store_rev   26, 18
+store_rev   25, 17
+store_rev   24, 16
 .purgem store_rev
 ret
 endfunc
diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index d7a2654..b4cc592 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -1017,10 +1017,10 @@ function idct32_1d_4x32_pass1_neon
 .endr
 .endm
 
-store_rev 31, 27, 23, 19
-store_rev 30, 26, 22, 18
-store_rev 29, 25, 21, 17
-store_rev 28, 24, 20, 16
+store_rev   31, 27, 23, 19
+store_rev   30, 26, 22, 18
+store_rev   29, 25, 21, 17
+store_rev   28, 24, 20, 16
 .purgem store_rev
 bx  lr
 endfunc
-- 
2.7.4



[FFmpeg-devel] [PATCH 13/13] aarch64: vp9mc: Fix a comment to refer to a register with the right name

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit
85ad5ea72ce3983947a3b07e4b35c66cb16dfaba.
---
 libavcodec/aarch64/vp9mc_neon.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S
index 69dad6d..80d1d23 100644
--- a/libavcodec/aarch64/vp9mc_neon.S
+++ b/libavcodec/aarch64/vp9mc_neon.S
@@ -250,7 +250,7 @@ function \type\()_8tap_\size\()h_\idx1\idx2
 .if \size >= 16
 sub x1,  x1,  x5
 .endif
-// size >= 16 loads two qwords and increments r2,
+// size >= 16 loads two qwords and increments x2,
 // for size 4/8 it's enough with one qword and no
 // postincrement
 .if \size >= 16
-- 
2.7.4



[FFmpeg-devel] [PATCH 03/13] arm: vp9itxfm: Simplify the stack alignment code

2017-01-09 Thread Martin Storsjö
From: Janne Grunau 

This is one instruction less for Thumb, and leaves only one or two
ARM/Thumb specific instructions.
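
In C terms, the new prologue/epilogue is roughly the following (sp shown
as a plain integer variable purely for illustration):

    /* Instead of saving the old sp, remember how much was subtracted:
     * the current misalignment plus the temp buffer size. */
    adj = (sp & 15) + 512;   /* 512 byte buffer, 16 byte aligned */
    sp -= adj;               /* prologue: sp is now 16 byte aligned */
    /* ... use the temp buffer at sp ... */
    sp += adj;               /* epilogue: restores the original sp */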

This is cherrypicked from libav commit
e5b0fc170f85b00f7dd0ac514918fb5c95253d39.
---
 libavcodec/arm/vp9itxfm_neon.S | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 06470a3..d7a2654 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -791,15 +791,13 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifnc \txfm1\()_\txfm2,idct_idct
 vpush   {q4-q7}
 .endif
-mov r7,  sp
 
 @ Align the stack, allocate a temp buffer
-T   mov r12, sp
-T   bic r12, r12, #15
-T   sub r12, r12, #512
-T   mov sp,  r12
-A   bic sp,  sp,  #15
-A   sub sp,  sp,  #512
+T   mov r7,  sp
+T   and r7,  r7,  #15
+A   and r7,  sp,  #15
+add r7,  r7,  #512
+sub sp,  sp,  r7
 
 mov r4,  r0
 mov r5,  r1
@@ -828,7 +826,7 @@ A   sub sp,  sp,  #512
 bl  \txfm2\()16_1d_4x16_pass2_neon
 .endr
 
-mov sp,  r7
+add sp,  sp,  r7
 .ifnc \txfm1\()_\txfm2,idct_idct
 vpop{q4-q7}
 .endif
@@ -1117,15 +1115,13 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 beq idct32x32_dc_add_neon
 push{r4-r7,lr}
 vpush   {q4-q7}
-mov r7,  sp
 
 @ Align the stack, allocate a temp buffer
-T   mov r12, sp
-T   bic r12, r12, #15
-T   sub r12, r12, #2048
-T   mov sp,  r12
-A   bic sp,  sp,  #15
-A   sub sp,  sp,  #2048
+T   mov r7,  sp
+T   and r7,  r7,  #15
+A   and r7,  sp,  #15
+add r7,  r7,  #2048
+sub sp,  sp,  r7
 
 mov r4,  r0
 mov r5,  r1
@@ -1143,7 +1139,7 @@ A   sub sp,  sp,  #2048
 bl  idct32_1d_4x32_pass2_neon
 .endr
 
-mov sp,  r7
+add sp,  sp,  r7
 vpop{q4-q7}
 pop {r4-r7,pc}
 endfunc
-- 
2.7.4



[FFmpeg-devel] [PATCH 12/13] aarch64: vp9dsp: Fix vertical alignment in the init file

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit
65074791e8f8397600aacc9801efdd1eb6e3.
---
 libavcodec/aarch64/vp9dsp_init_aarch64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
index 7e34375..0bc200e 100644
--- a/libavcodec/aarch64/vp9dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -103,7 +103,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp)
 if (bpp != 8)
 return;
 
-#define init_fpel(idx1, idx2, sz, type, suffix) \
+#define init_fpel(idx1, idx2, sz, type, suffix)  \
 dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \
 dsp->mc[idx1][FILTER_8TAP_REGULAR][idx2][0][0] = \
 dsp->mc[idx1][FILTER_8TAP_SHARP  ][idx2][0][0] = \
@@ -128,7 +128,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp)
 #define init_mc_func(idx1, idx2, op, filter, fname, dir, mx, my, sz, pfx) \
 dsp->mc[idx1][filter][idx2][mx][my] = pfx##op##_##fname##sz##_##dir##_neon
 
-#define init_mc_funcs(idx, dir, mx, my, sz, pfx) \
+#define init_mc_funcs(idx, dir, mx, my, sz, pfx)                                   \
 init_mc_func(idx, 0, put, FILTER_8TAP_REGULAR, regular, dir, mx, my, sz, pfx); \
 init_mc_func(idx, 0, put, FILTER_8TAP_SHARP,   sharp,   dir, mx, my, sz, pfx); \
 init_mc_func(idx, 0, put, FILTER_8TAP_SMOOTH,  smooth,  dir, mx, my, sz, pfx); \
@@ -136,7 +136,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp)
 init_mc_func(idx, 1, avg, FILTER_8TAP_SHARP,   sharp,   dir, mx, my, sz, pfx); \
 init_mc_func(idx, 1, avg, FILTER_8TAP_SMOOTH,  smooth,  dir, mx, my, sz, pfx)
 
-#define init_mc_funcs_dirs(idx, sz) \
+#define init_mc_funcs_dirs(idx, sz)\
 init_mc_funcs(idx, h,  1, 0, sz, ff_vp9_); \
 init_mc_funcs(idx, v,  0, 1, sz, ff_vp9_); \
 init_mc_funcs(idx, hv, 1, 1, sz,)
-- 
2.7.4



[FFmpeg-devel] [PATCH 08/13] arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination

2017-01-09 Thread Martin Storsjö
This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.

This is similar to what was done in the aarch64 version.

This is cherrypicked from libav commit
3c87039a404c5659ae9bf7454a04e186532eb40b.
---
 libavcodec/arm/vp9itxfm_neon.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 0097f5f..d5b8495 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -814,7 +814,7 @@ A   and r7,  sp,  #15
 mov r3,  #\i
 bl  \txfm1\()16_1d_4x16_pass1_neon
 .endr
-.ifc \txfm2,idct
+.ifc \txfm1\()_\txfm2,iadst_idct
 movrel  r12, idct_coeffs
 vld1.16 {q0-q1}, [r12,:128]
 .endif
-- 
2.7.4



[FFmpeg-devel] [PATCH 07/13] aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit
2f99117f6ff24ce5be2abb9e014cb8b86c2aa0e0.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index d5165bf..e5fc612 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -599,9 +599,9 @@ endfunc
 // x1 = unused
 // x2 = src
 // x3 = slice offset
+// x9 = input stride
 .macro itxfm16_1d_funcs txfm
 function \txfm\()16_1d_8x16_pass1_neon
-mov x9, #32
 moviv2.8h, #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 load_clear  \i,  x2,  x9
@@ -649,8 +649,8 @@ endfunc
 // x1 = dst stride
 // x2 = src (temp buffer)
 // x3 = slice offset
+// x9 = temp buffer stride
 function \txfm\()16_1d_8x16_pass2_neon
-mov x9, #32
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23
 load\i,  x2,  x9
 .endr
@@ -747,6 +747,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifc \txfm1,idct
 ld1 {v0.8h,v1.8h}, [x10]
 .endif
+mov x9, #32
 
 .irp i, 0, 8
 add x0,  sp,  #(\i*32)
@@ -882,13 +883,12 @@ endfunc
 // x0 = dst (temp buffer)
 // x1 = unused
 // x2 = src
+// x9 = double input stride
 // x10 = idct_coeffs
 // x11 = idct_coeffs + 32
 function idct32_1d_8x32_pass1_neon
 ld1 {v0.8h,v1.8h}, [x10]
 
-// Double stride of the input, since we only read every other line
-mov x9,  #128
 moviv4.8h, #0
 
 // v16 = IN(0), v17 = IN(2) ... v31 = IN(30)
@@ -987,12 +987,13 @@ endfunc
 // x0 = dst
 // x1 = dst stride
 // x2 = src (temp buffer)
+// x7 = negative double temp buffer stride
+// x9 = double temp buffer stride
 // x10 = idct_coeffs
 // x11 = idct_coeffs + 32
 function idct32_1d_8x32_pass2_neon
 ld1 {v0.8h,v1.8h}, [x10]
 
-mov x9, #128
 // v16 = IN(0), v17 = IN(2) ... v31 = IN(30)
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 ld1 {v\i\().8h}, [x2], x9
@@ -1001,7 +1002,6 @@ function idct32_1d_8x32_pass2_neon
 
 idct16
 
-mov x9,  #128
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 st1 {v\i\().8h}, [x2], x9
 .endr
@@ -1018,11 +1018,10 @@ function idct32_1d_8x32_pass2_neon
 
 idct32_odd
 
-mov x9,  #128
 .macro load_acc_store a, b, c, d, neg=0
+.if \neg == 0
 ld1 {v4.8h},  [x2], x9
 ld1 {v5.8h},  [x2], x9
-.if \neg == 0
 add v4.8h, v4.8h, v\a\().8h
 ld1 {v6.8h},  [x2], x9
 add v5.8h, v5.8h, v\b\().8h
@@ -1030,10 +1029,12 @@ function idct32_1d_8x32_pass2_neon
 add v6.8h, v6.8h, v\c\().8h
 add v7.8h, v7.8h, v\d\().8h
 .else
+ld1 {v4.8h},  [x2], x7
+ld1 {v5.8h},  [x2], x7
 sub v4.8h, v4.8h, v\a\().8h
-ld1 {v6.8h},  [x2], x9
+ld1 {v6.8h},  [x2], x7
 sub v5.8h, v5.8h, v\b\().8h
-ld1 {v7.8h},  [x2], x9
+ld1 {v7.8h},  [x2], x7
 sub v6.8h, v6.8h, v\c\().8h
 sub v7.8h, v7.8h, v\d\().8h
 .endif
@@ -1064,7 +1065,6 @@ function idct32_1d_8x32_pass2_neon
 load_acc_store  23, 22, 21, 20
 load_acc_store  19, 18, 17, 16
 sub x2,  x2,  x9
-neg x9,  x9
 load_acc_store  16, 17, 18, 19, 1
 load_acc_store  20, 21, 22, 23, 1
 load_acc_store  24, 25, 26, 27, 1
@@ -1093,6 +1093,10 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 mov x5,  x1
 mov x6,  x2
 
+// Double stride of the input, since we only read every other line
+mov x9,  #128
+neg x7,  x9
+
 .irp i, 0, 8, 16, 24
 add x0,  sp,  #(\i*64)
 add x2,  x6,  #(\i*2)
-- 
2.7.4



[FFmpeg-devel] [PATCH 04/13] aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter

2017-01-09 Thread Martin Storsjö
On AArch64, the upper 32 bits of a register holding a 32 bit int
parameter are not guaranteed to be zeroed, so the comparison has to use
w3 instead of x3. The clobbering tests in checkasm are only invoked when
testing correctness, so this bug didn't show up when benchmarking the
dc-only version.

This is cherrypicked from libav commit
4d960a11855f4212eb3a4e470ce890db7f01df29.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 7ce3116..3535c7b 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -204,7 +204,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 
 moviv31.8h, #0
 .ifc \txfm1\()_\txfm2,idct_idct
-cmp x3,  #1
+cmp w3,  #1
 b.ne1f
 // DC-only for idct/idct
 ld1r{v2.4h},  [x2]
@@ -344,7 +344,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 moviv5.16b, #0
 
 .ifc \txfm1\()_\txfm2,idct_idct
-cmp x3,  #1
+cmp w3,  #1
 b.ne1f
 // DC-only for idct/idct
 ld1r{v2.4h},  [x2]
@@ -722,7 +722,7 @@ itxfm16_1d_funcs iadst
 .macro itxfm_func16x16 txfm1, txfm2
 function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
-cmp x3,  #1
+cmp w3,  #1
 b.eqidct16x16_dc_add_neon
 .endif
 mov x15, x30
@@ -1074,7 +1074,7 @@ function idct32_1d_8x32_pass2_neon
 endfunc
 
 function ff_vp9_idct_idct_32x32_add_neon, export=1
-cmp x3,  #1
+cmp w3,  #1
 b.eqidct32x32_dc_add_neon
 
 movrel  x10, idct_coeffs
-- 
2.7.4



[FFmpeg-devel] [PATCH 02/13] aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne}'

2017-01-09 Thread Martin Storsjö
From: Janne Grunau 

The latter is 1 cycle faster on a Cortex A53, and since the operands
are bytewise (or larger) bitmasks (whose sum can never overflow to
zero), both are equivalent.
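As a sanity check of that equivalence (a hedged scalar sketch, not part
of the patch itself): NEON compares produce lanes that are all-ones or
all-zero, so every byte of the transferred mask is 0x00 or 0xff, and the
sum of two such values can only be zero when both are zero.

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Bytewise lane masks, as NEON compares produce them. */
    static int any_set_orr(uint64_t a, uint64_t b)  { return (a | b) != 0; }
    static int any_set_adds(uint64_t a, uint64_t b) { return a + b != 0; }

    int main(void)
    {
        const uint64_t masks[] = {
            0x0000000000000000ull, 0x00000000000000ffull,
            0xff00ff00ff00ff00ull, 0xffffffffffffffffull,
        };
        for (size_t i = 0; i < sizeof(masks)/sizeof(masks[0]); i++)
            for (size_t j = 0; j < sizeof(masks)/sizeof(masks[0]); j++)
                /* the sum cannot wrap around to exactly zero, so both
                 * tests agree on every pair of masks */
                assert(any_set_orr(masks[i], masks[j]) ==
                       any_set_adds(masks[i], masks[j]));
        return 0;
    }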

This is cherrypicked from libav commit
e7ae8f7a715843a5089d18e033afb3ee19ab3057.
---
 libavcodec/aarch64/vp9lpf_neon.S | 31 ---
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index 78aae61..55e1964 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -218,13 +218,15 @@
 xtn_sz  v5, v6.8h,  v7.8h,  \sz
 and v4\sz,  v4\sz,  v5\sz // fm
 
+// If no pixels need filtering, just exit as soon as possible
 mov x5,  v4.d[0]
 .ifc \sz, .16b
 mov x6,  v4.d[1]
-orr x5,  x5,  x6
-.endif
-// If no pixels need filtering, just exit as soon as possible
+adds x5,  x5,  x6
+b.eq 9f
+.else
 cbz x5,  9f
+.endif
 
 .if \wd >= 8
 moviv0\sz,  #1
@@ -344,15 +346,17 @@
 bit v22\sz, v0\sz,  v5\sz   // if (!hev && fm && !flat8in)
 bit v25\sz, v2\sz,  v5\sz
 
+// If no pixels need flat8in, jump to flat8out
+// (or to a writeout of the inner 4 pixels, for wd=8)
 .if \wd >= 8
 mov x5,  v6.d[0]
 .ifc \sz, .16b
 mov x6,  v6.d[1]
-orr x5,  x5,  x6
-.endif
-// If no pixels need flat8in, jump to flat8out
-// (or to a writeout of the inner 4 pixels, for wd=8)
+adds x5,  x5,  x6
+b.eq 6f
+.else
 cbz x5,  6f
+.endif
 
 // flat8in
 uaddl_sz\tmp1\().8h, \tmp2\().8h,  v20, v21, \sz
@@ -406,20 +410,25 @@
 mov x5,  v2.d[0]
 .ifc \sz, .16b
 mov x6,  v2.d[1]
-orr x5,  x5,  x6
+adds x5,  x5,  x6
+b.ne 1f
+.else
+cbnz x5,  1f
 .endif
 // If no pixels needed flat8in nor flat8out, jump to a
 // writeout of the inner 4 pixels
-cbnz x5,  1f
 br  x14
 1:
+
 mov x5,  v7.d[0]
 .ifc \sz, .16b
 mov x6,  v7.d[1]
-orr x5,  x5,  x6
+adds x5,  x5,  x6
+b.ne 1f
+.else
+cbnz x5,  1f
 .endif
 // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
-cbnz x5,  1f
 br  x15
 
 1:
-- 
2.7.4



[FFmpeg-devel] [PATCH 11/13] arm: vp9mc: Fix vertical alignment of operands

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit
c536e5e8698110c139b1c17938998a5547550aa3.
---
 libavcodec/arm/vp9mc_neon.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/arm/vp9mc_neon.S b/libavcodec/arm/vp9mc_neon.S
index 5fe3024..83235ff 100644
--- a/libavcodec/arm/vp9mc_neon.S
+++ b/libavcodec/arm/vp9mc_neon.S
@@ -79,7 +79,7 @@ function ff_vp9_avg32_neon, export=1
 vrhadd.u8   q0,  q0,  q2
 vrhadd.u8   q1,  q1,  q3
 subs r12, r12, #1
-vst1.8  {q0, q1},  [r0, :128], r1
+vst1.8  {q0,  q1},  [r0, :128], r1
 bne 1b
 bx  lr
 endfunc
@@ -407,7 +407,7 @@ function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
 add r12, r12, 256*\offset
 cmp r5,  #8
 add r12, r12, r5, lsl #4
-mov r5, #\size
+mov r5,  #\size
 .if \size >= 16
 bge \type\()_8tap_16h_34
 b   \type\()_8tap_16h_43
@@ -541,7 +541,7 @@ function \type\()_8tap_8v_\idx1\idx2
 sub r2,  r2,  r3
 vld1.16 {q0},  [r12, :128]
 1:
-mov r12,  r4
+mov r12, r4
 
 loadl   q5,  q6,  q7
 loadl   q8,  q9,  q10, q11
-- 
2.7.4



[FFmpeg-devel] [PATCH 01/14] arm: vp9itxfm: Template the quarter/half idct32 function

2017-03-16 Thread Martin Storsjö
This reduces the number of lines and the amount of duplication.

Also simplify the eob check for the half case.

In the half case, we know we will need at least the first three
slices; only the fourth one needs an eob check, so we can hardcode
the value to compare against instead of loading it from the min_eob
array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.
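A hedged C sketch of the resulting half-case control flow (names, the
buffer layout and the function type are illustrative, not the actual
interfaces; the hardcoded 70 is the fourth slice's eob bound visible in
the diff below):

    #include <stdint.h>
    #include <string.h>

    typedef void (*pass1_fn)(int16_t *tmp, const int16_t *in);

    static void idct32_half_pass1_sketch(int16_t *tmp, const int16_t *in,
                                         int eob, pass1_fn slice)
    {
        for (int i = 0; i < 4; i++) {
            /* Slices 0-2 are always needed in the half case; only the
             * fourth is gated on eob, against a hardcoded bound. */
            if (i == 3 && eob <= 70) {
                /* At most this one slice can be skipped, so zeroing its
                 * part of the temp buffer can be fully unrolled. */
                memset(tmp + 3 * 4 * 32, 0, 4 * 32 * sizeof(*tmp));
                return;
            }
            slice(tmp + i * 4 * 32, in + i * 4);  /* one 4x32 slice */
        }
    }

Since the bounds are compile-time constants on this path, the min_eob
pointer never has to be materialized.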

This is cherrypicked from libav commit
98ee855ae0cc118bd1d20921d6bdb14731832462.
---
 libavcodec/arm/vp9itxfm_neon.S | 57 +++---
 1 file changed, 20 insertions(+), 37 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index ebbbda9..adc9896 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -1575,7 +1575,6 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 beq idct32x32_dc_add_neon
 push{r4-r8,lr}
 vpush   {q4-q6}
-movrel  r8,  min_eob_idct_idct_32 + 2
 
 @ Align the stack, allocate a temp buffer
 T   mov r7,  sp
@@ -1597,6 +1596,8 @@ A   and r7,  sp,  #15
 cmp r3,  #135
 ble idct32x32_half_add_neon
 
+movrel  r8,  min_eob_idct_idct_32 + 2
+
 .irp i, 0, 4, 8, 12, 16, 20, 24, 28
 add r0,  sp,  #(\i*64)
 .if \i > 0
@@ -1634,72 +1635,54 @@ A   and r7,  sp,  #15
 pop {r4-r8,pc}
 endfunc
 
-function idct32x32_quarter_add_neon
+.macro idct32_partial size
+function idct32x32_\size\()_add_neon
 .irp i, 0, 4
 add r0,  sp,  #(\i*64)
+.ifc \size,quarter
 .if \i == 4
 cmp r3,  #9
 ble 1f
 .endif
+.endif
 add r2,  r6,  #(\i*2)
-bl  idct32_1d_4x32_pass1_quarter_neon
-.endr
-b   3f
-
-1:
-@ Write zeros to the temp buffer for pass 2
-vmov.i16 q14, #0
-vmov.i16 q15, #0
-.rept 8
-vst1.16 {q14-q15}, [r0,:128]!
-.endr
-3:
-.irp i, 0, 4, 8, 12, 16, 20, 24, 28
-add r0,  r4,  #(\i)
-mov r1,  r5
-add r2,  sp,  #(\i*2)
-bl  idct32_1d_4x32_pass2_quarter_neon
+bl  idct32_1d_4x32_pass1_\size\()_neon
 .endr
 
-add sp,  sp,  r7
-vpop{q4-q6}
-pop {r4-r8,pc}
-endfunc
-
-function idct32x32_half_add_neon
-.irp i, 0, 4, 8, 12
+.ifc \size,half
+.irp i, 8, 12
 add r0,  sp,  #(\i*64)
-.if \i > 0
-ldrh_post   r1,  r8,  #2
-cmp r3,  r1
-it  le
-movle   r1,  #(16 - \i)/2
+.if \i == 12
+cmp r3,  #70
 ble 1f
 .endif
 add r2,  r6,  #(\i*2)
-bl  idct32_1d_4x32_pass1_half_neon
+bl  idct32_1d_4x32_pass1_\size\()_neon
 .endr
+.endif
 b   3f
 
 1:
 @ Write zeros to the temp buffer for pass 2
 vmov.i16 q14, #0
 vmov.i16 q15, #0
-2:
-subs r1,  r1,  #1
-.rept 4
+.rept 8
 vst1.16 {q14-q15}, [r0,:128]!
 .endr
-bne 2b
+
 3:
 .irp i, 0, 4, 8, 12, 16, 20, 24, 28
 add r0,  r4,  #(\i)
 mov r1,  r5
 add r2,  sp,  #(\i*2)
-bl  idct32_1d_4x32_pass2_half_neon
+bl  idct32_1d_4x32_pass2_\size\()_neon
 .endr
 
 add sp,  sp,  r7
 vpop{q4-q6}
 pop {r4-r8,pc}
 endfunc
+.endm
+
+idct32_partial quarter
+idct32_partial half
-- 
2.7.4



[FFmpeg-devel] [PATCH 04/14] arm: vp9itxfm16: Use the right lane size

2017-03-16 Thread Martin Storsjö
This makes the code slightly clearer, but doesn't make any functional
difference.
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index e6e9440..a92f323 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -1082,8 +1082,8 @@ A   and r7,  sp,  #15
 .ifc \txfm1\()_\txfm2,idct_idct
 b   3f
 1:
-vmov.i16 q14, #0
-vmov.i16 q15, #0
+vmov.i32 q14, #0
+vmov.i32 q15, #0
 2:
 subsr1,  r1,  #1
 @ Unroll for 2 lines
-- 
2.7.4



[FFmpeg-devel] [PATCH 06/14] arm: vp9itxfm16: Avoid reloading the idct32 coefficients

2017-03-16 Thread Martin Storsjö
Keep the idct32 coefficients in narrow form in q6-q7, and idct16
coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
q0-q3 in the pass1 function, and squeeze the idct16 coefficients
into q0-q1 in the pass2 function to avoid reloading them.

The idct16 coefficients are clobbered and reloaded within idct32_odd
though, since that turns out to be faster than narrowing them and
swapping them into q6-q7.
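In scalar terms (a hedged analogue, not the actual code), vmovl.s16 is
just a sign extension, so keeping the narrow coefficients resident in
registers trades a reload of the table for a cheap re-widening:

    #include <stdint.h>

    /* Scalar model of vmovl.s16: widen 16-bit coefficients to 32 bits. */
    static void widen8(int32_t wide[8], const int16_t narrow[8])
    {
        for (int i = 0; i < 8; i++)
            wide[i] = narrow[i];  /* implicit sign extension */
    }

    /* With the narrow idct32 coefficients kept live (q6-q7 in the
     * patch), each use pays only widen8(), not a memory access. */
    static void use_coeffs(int32_t tmp[8], const int16_t resident[8])
    {
        widen8(tmp, resident);
        /* ... butterflies on tmp ... */
    }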

Before:                                  Cortex A7        A8        A9       A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0   14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3   24472.2
After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0   13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7   24071.2
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 128 +++
 1 file changed, 69 insertions(+), 59 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 9c02ed9..29d95ca 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -1195,12 +1195,12 @@ endfunc
 
 .macro idct32_odd
 movrel  r12, idct_coeffs
-add r12, r12, #32
-vld1.16 {q0-q1}, [r12,:128]
-vmovl.s16   q2,  d2
-vmovl.s16   q3,  d3
-vmovl.s16   q1,  d1
-vmovl.s16   q0,  d0
+
+@ Overwrite the idct16 coeffs with the stored ones for idct32
+vmovl.s16   q0,  d12
+vmovl.s16   q1,  d13
+vmovl.s16   q2,  d14
+vmovl.s16   q3,  d15
 
 mbutterfly  d16, d31, d0[0], d0[1], q4, q5 @ d16 = t16a, d31 = t31a
 mbutterfly  d24, d23, d1[0], d1[1], q4, q5 @ d24 = t17a, d23 = t30a
@@ -1211,15 +1211,19 @@ endfunc
 mbutterfly  d22, d25, d6[0], d6[1], q4, q5 @ d22 = t22a, d25 = t25a
 mbutterfly  d30, d17, d7[0], d7[1], q4, q5 @ d30 = t23a, d17 = t24a
 
-sub r12, r12, #32
-vld1.16 {q0}, [r12,:128]
+@ Reload the idct16 coefficients. We could swap the coefficients between
+@ q0-q3 and q6-q7 by narrowing/lengthening, but that's slower than just
+@ loading and lengthening.
+vld1.16 {q0-q1}, [r12,:128]
+
+butterfly   d8,  d24, d16, d24 @ d8  = t16, d24 = t17
+butterfly   d9,  d20, d28, d20 @ d9  = t19, d20 = t18
+butterfly   d10, d26, d18, d26 @ d10 = t20, d26 = t21
+butterfly   d11, d22, d30, d22 @ d11 = t23, d22 = t22
+vmovl.s16   q2,  d2
+vmovl.s16   q3,  d3
 vmovl.s16   q1,  d1
 vmovl.s16   q0,  d0
-
-butterfly   d4,  d24, d16, d24 @ d4  = t16, d24 = t17
-butterfly   d5,  d20, d28, d20 @ d5  = t19, d20 = t18
-butterfly   d6,  d26, d18, d26 @ d6  = t20, d26 = t21
-butterfly   d7,  d22, d30, d22 @ d7  = t23, d22 = t22
 butterfly   d28, d25, d17, d25 @ d28 = t24, d25 = t25
 butterfly   d30, d21, d29, d21 @ d30 = t27, d21 = t26
 butterfly   d29, d23, d31, d23 @ d29 = t31, d23 = t30
@@ -1230,34 +1234,34 @@ endfunc
 mbutterfly  d21, d26, d3[0], d3[1], q8, q9 @ d21 = t21a, d26 = t26a
 mbutterfly  d25, d22, d3[0], d3[1], q8, q9, neg=1 @ d25 = t25a, d22 = t22a
 
-butterfly   d16, d5,  d4,  d5  @ d16 = t16a, d5  = t19a
+butterfly   d16, d9,  d8,  d9  @ d16 = t16a, d9  = t19a
 butterfly   d17, d20, d23, d20 @ d17 = t17,  d20 = t18
-butterfly   d18, d6,  d7,  d6  @ d18 = t23a, d6  = t20a
+butterfly   d18, d10, d11, d10 @ d18 = t23a, d10 = t20a
 butterfly   d19, d21, d22, d21 @ d19 = t22,  d21 = t21
-butterfly   d4,  d28, d28, d30 @ d4  = t24a, d28 = t27a
+butterfly   d8,  d28, d28, d30 @ d8  = t24a, d28 = t27a
 butterfly   d23, d26, d25, d26 @ d23 = t25,  d26 = t26
-butterfly   d7,  d29, d29, d31 @ d7  = t31a, d29 = t28a
+butterfly   d11, d29, d29, d31 @ d11 = t31a, d29 = t28a
 butterfly   d22, d27, d24, d27 @ d22 = t30,  d27 = t29
 
 mbutterfly  d27, d20, d1[0], d1[1], q12, q15 @ d27 = t18a, d20 = t29a
-mbutterfly  d29, d5,  d1[0], d1[1], q12, q15 @ d29 = t19,  d5  = t28
-mbutterfly  d28, d6,  d1[0], d1[1], q12, q15, neg=1 @ d28 = t27,  d6  = t20
+mbutterfly  d29, d9,  d1[0], d1[1], q12, q15 @ d29 = t19,  d9  = t28
+mbutterfly  d28, d10, d1[0], d1[1], q12, q15, neg=1 @ d28 = t27,  d10 = t20
 mbutterfly  d26, d21, d1[0], d1[1], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
 
-butterfly   d31, d24, d7,  d4  @ d31 = t31,  d24 = t24
+butterfly   d31, d24, d11, d8  @ d31 = t31,  d24 = t24
 butterfly   d30, d25, 

[FFmpeg-devel] [PATCH 02/14] arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used

2017-03-16 Thread Martin Storsjö
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.
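A hedged sketch of the dispatch this produces (the 38 bound matches the
visible cmp in the diff; the quarter bound and the table contents are
illustrative, and the function shape is not the real interface):

    #include <stddef.h>
    #include <stdint.h>

    /* illustrative stand-in for the real min_eob_idct_idct_16 table */
    static const int16_t min_eob_16_sketch[] = { 0, 10, 38, 89 };

    static const int16_t *pick_min_eob(int eob)
    {
        if (eob <= 10)                    /* quarter case: table unused */
            return NULL;
        if (eob <= 38)                    /* half case: table unused */
            return NULL;
        return min_eob_16_sketch + 1;     /* full case: load it here */
    }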

This is cherrypicked from libav commit
3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 3 ++-
 libavcodec/arm/vp9itxfm_neon.S | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 2c3c002..3e5da08 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -1483,7 +1483,6 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 b.eqidct32x32_dc_add_neon
 
 movrel  x10, idct_coeffs
-movrel  x12, min_eob_idct_idct_32, 2
 
 mov x15, x30
 
@@ -1508,6 +1507,8 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 cmp w3,  #135
 b.leidct32x32_half_add_neon
 
+movrel  x12, min_eob_idct_idct_32, 2
+
 .irp i, 0, 8, 16, 24
 add x0,  sp,  #(\i*64)
 .if \i > 0
diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index adc9896..6d4d765 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -889,8 +889,6 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 push{r4-r8,lr}
 .ifnc \txfm1\()_\txfm2,idct_idct
 vpush   {q4-q7}
-.else
-movrel  r8,  min_eob_idct_idct_16 + 2
 .endif
 
 @ Align the stack, allocate a temp buffer
@@ -914,6 +912,8 @@ A   and r7,  sp,  #15
 ble idct16x16_quarter_add_neon
 cmp r3,  #38
 ble idct16x16_half_add_neon
+
+movrel  r8,  min_eob_idct_idct_16 + 2
 .endif
 
 .irp i, 0, 4, 8, 12
-- 
2.7.4



[FFmpeg-devel] [PATCH 12/14] aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-16 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the
pass2 function.
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 98 
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index de1da55..f30fdd8 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -851,6 +851,55 @@ endfunc
 st1 {v4.4s},  [\src], \inc
 .endm
 
+.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7
+srshr   \coef0, \coef0, #6
+ld1 {v4.4h},   [x0], x1
+srshr   \coef1, \coef1, #6
+ld1 {v4.d}[1], [x3], x1
+srshr   \coef2, \coef2, #6
+ld1 {v5.4h},   [x0], x1
+srshr   \coef3, \coef3, #6
+uaddw   \coef0, \coef0, v4.4h
+ld1 {v5.d}[1], [x3], x1
+srshr   \coef4, \coef4, #6
+uaddw2  \coef1, \coef1, v4.8h
+ld1 {v6.4h},   [x0], x1
+srshr   \coef5, \coef5, #6
+uaddw   \coef2, \coef2, v5.4h
+ld1 {v6.d}[1], [x3], x1
+sqxtun  v4.4h,  \coef0
+srshr   \coef6, \coef6, #6
+uaddw2  \coef3, \coef3, v5.8h
+ld1 {v7.4h},   [x0], x1
+sqxtun2 v4.8h,  \coef1
+srshr   \coef7, \coef7, #6
+uaddw   \coef4, \coef4, v6.4h
+ld1 {v7.d}[1], [x3], x1
+umin v4.8h,  v4.8h,  v8.8h
+sub x0,  x0,  x1, lsl #2
+sub x3,  x3,  x1, lsl #2
+sqxtun  v5.4h,  \coef2
+uaddw2  \coef5, \coef5, v6.8h
+st1 {v4.4h},   [x0], x1
+sqxtun2 v5.8h,  \coef3
+uaddw   \coef6, \coef6, v7.4h
+st1 {v4.d}[1], [x3], x1
+umin v5.8h,  v5.8h,  v8.8h
+sqxtun  v6.4h,  \coef4
+uaddw2  \coef7, \coef7, v7.8h
+st1 {v5.4h},   [x0], x1
+sqxtun2 v6.8h,  \coef5
+st1 {v5.d}[1], [x3], x1
+umin v6.8h,  v6.8h,  v8.8h
+sqxtun  v7.4h,  \coef6
+st1 {v6.4h},   [x0], x1
+sqxtun2 v7.8h,  \coef7
+st1 {v6.d}[1], [x3], x1
+umin v7.8h,  v7.8h,  v8.8h
+st1 {v7.4h},   [x0], x1
+st1 {v7.d}[1], [x3], x1
+.endm
+
 // Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
 // transpose into a horizontal 16x4 slice and store.
 // x0 = dst (temp buffer)
@@ -937,57 +986,8 @@ function \txfm\()16_1d_4x16_pass2_neon
 bl  \txfm\()16
 
 dup v8.8h, w13
-.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7
-srshr   \coef0, \coef0, #6
-ld1 {v4.4h},   [x0], x1
-srshr   \coef1, \coef1, #6
-ld1 {v4.d}[1], [x3], x1
-srshr   \coef2, \coef2, #6
-ld1 {v5.4h},   [x0], x1
-srshr   \coef3, \coef3, #6
-uaddw   \coef0, \coef0, v4.4h
-ld1 {v5.d}[1], [x3], x1
-srshr   \coef4, \coef4, #6
-uaddw2  \coef1, \coef1, v4.8h
-ld1 {v6.4h},   [x0], x1
-srshr   \coef5, \coef5, #6
-uaddw   \coef2, \coef2, v5.4h
-ld1 {v6.d}[1], [x3], x1
-sqxtun  v4.4h,  \coef0
-srshr   \coef6, \coef6, #6
-uaddw2  \coef3, \coef3, v5.8h
-ld1 {v7.4h},   [x0], x1
-sqxtun2 v4.8h,  \coef1
-srshr   \coef7, \coef7, #6
-uaddw   \coef4, \coef4, v6.4h
-ld1 {v7.d}[1], [x3], x1
-umin v4.8h,  v4.8h,  v8.8h
-sub x0,  x0,  x1, lsl #2
-sub x3,  x3,  x1, lsl #2
-sqxtun  v5.4h,  \coef2
-uaddw2  \coef5, \coef5, v6.8h
-st1 {v4.4h},   [x0], x1
-sqxtun2 v5.8h,  \coef3
-uaddw   \coef6, \coef6, v7.4h
-st1 {v4.d}[1], [x3], x1
-umin v5.8h,  v5.8h,  v8.8h
-sqxtun  v6.4h,  \coef4
-uaddw2  \coef7, \coef7, v7.8h
-st1 {v5.4h},   [x0], x1
-sqxtun2 v6.8h,  \coef5
-st1 {v5.d}[1], [x3], x1
-umin v6.8h,  v6.8h,  v8.8h
-sqxtun  v7.4h,  \coef6
-st1 {v6.4h},   [x0], x1
-sqxtun2 v7.8h,  \coef7
-st1 {v6.d}[1], [x3], x1
-umin v7.8h,  v7.8h,  v8.8h
-

[FFmpeg-devel] [PATCH 05/14] arm: vp9itxfm16: Fix vertical alignment

2017-03-16 Thread Martin Storsjö
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index a92f323..9c02ed9 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -1395,25 +1395,25 @@ function idct32_1d_2x32_pass2_neon
 vld1.32 {d4},  [r2,:64], r12
 vld1.32 {d5},  [r2,:64], r12
 .if \neg == 0
-vadd.s32d4, d4, d\a
+vadd.s32d4,  d4,  d\a
 vld1.32 {d6},  [r2,:64], r12
-vadd.s32d5, d5, d\b
+vadd.s32d5,  d5,  d\b
 vld1.32 {d7},  [r2,:64], r12
-vadd.s32d6, d6, d\c
-vadd.s32d7, d7, d\d
+vadd.s32d6,  d6,  d\c
+vadd.s32d7,  d7,  d\d
 .else
-vsub.s32d4, d4, d\a
+vsub.s32d4,  d4,  d\a
 vld1.32 {d6},  [r2,:64], r12
-vsub.s32d5, d5, d\b
+vsub.s32d5,  d5,  d\b
 vld1.32 {d7},  [r2,:64], r12
-vsub.s32d6, d6, d\c
-vsub.s32d7, d7, d\d
+vsub.s32d6,  d6,  d\c
+vsub.s32d7,  d7,  d\d
 .endif
 vld1.32 {d2[]},   [r0,:32], r1
 vld1.32 {d2[1]},  [r0,:32], r1
-vrshr.s32   q2, q2, #6
+vrshr.s32   q2,  q2,  #6
 vld1.32 {d3[]},   [r0,:32], r1
-vrshr.s32   q3, q3, #6
+vrshr.s32   q3,  q3,  #6
 vld1.32 {d3[1]},  [r0,:32], r1
 sub r0,  r0,  r1, lsl #2
 vaddw.u16   q2,  q2,  d2
-- 
2.7.4



[FFmpeg-devel] [PATCH 07/14] aarch64: vp9itxfm16: Fix a typo in a comment

2017-03-16 Thread Martin Storsjö
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index f53e94a..f80604f 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -872,7 +872,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 transpose_4x4s  v24, v25, v26, v27, v4, v5, v6, v7
 transpose_4x4s  v28, v29, v30, v31, v4, v5, v6, v7
 
-// Store the transposed 8x8 blocks horizontally.
+// Store the transposed 4x4 blocks horizontally.
 cmp x1,  #12
 b.eq1f
 .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
-- 
2.7.4



[FFmpeg-devel] [PATCH 14/14] aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 21512 bytes to 31400 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
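The specialized butterflies added below (the dmbutterfly_h1/_h2
variants) are the usual butterfly with one input pinned to zero; in a
hedged scalar model (rounding as in rshrn #14, i.e. add 1 << 13 before
shifting):

    #include <stdint.h>

    /* Full butterfly: o1 = in1*c1 - in2*c2, o2 = in1*c2 + in2*c1,
     * each rounded and narrowed by 14 bits. */
    static void dbutterfly(int32_t *o1, int32_t *o2, int32_t in1,
                           int32_t in2, int32_t c1, int32_t c2)
    {
        *o1 = (int32_t)(((int64_t)in1 * c1 - (int64_t)in2 * c2 + (1 << 13)) >> 14);
        *o2 = (int32_t)(((int64_t)in1 * c2 + (int64_t)in2 * c1 + (1 << 13)) >> 14);
    }

    /* _h1 variant: in2 is known to be zero, so half of the multiplies
     * and the add/sub drop out; _h2 mirrors this for in1 == 0. */
    static void dbutterfly_h1(int32_t *o1, int32_t *o2, int32_t in1,
                              int32_t c1, int32_t c2)
    {
        *o1 = (int32_t)(((int64_t)in1 * c1 + (1 << 13)) >> 14);
        *o2 = (int32_t)(((int64_t)in1 * c2 + (1 << 13)) >> 14);
    }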

Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon:1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon:1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon:9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16179.5

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon:1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon:1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon:6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon:7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon:   9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16190.6
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 605 ---
 1 file changed, 547 insertions(+), 58 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index f30fdd8..0befe38 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -124,6 +124,17 @@ endconst
 .endif
 .endm
 
+// Same as dmbutterfly0 above, but treating the input in in2 as zero,
+// writing the same output into both out1 and out2.
+.macro dmbutterfly0_h out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6
+smull   \tmp1\().2d, \in1\().2s,  v0.s[0]
+smull2  \tmp2\().2d, \in1\().4s,  v0.s[0]
+rshrn   \out1\().2s, \tmp1\().2d, #14
+rshrn2  \out1\().4s, \tmp2\().2d, #14
+rshrn   \out2\().2s, \tmp1\().2d, #14
+rshrn2  \out2\().4s, \tmp2\().2d, #14
+.endm
+
 // out1,out2 = in1 * coef1 - in2 * coef2
 // out3,out4 = in1 * coef2 + in2 * coef1
 // out are 4 x .2d registers, in are 2 x .4s registers
@@ -153,6 +164,43 @@ endconst
 rshrn2  \inout2\().4s, \tmp4\().2d,  #14
 .endm
 
+// Same as dmbutterfly above, but treating the input in inout2 as zero
+.macro dmbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull   \tmp1\().2d, \inout1\().2s, \coef1
+smull2  \tmp2\().2d, \inout1\().4s, \coef1
+smull   \tmp3\().2d, \inout1\().2s, \coef2
+smull2  \tmp4\().2d, \inout1\().4s, \coef2
+rshrn   \inout1\().2s, \tmp1\().2d, #14
+rshrn2  \inout1\().4s, \tmp2\().2d, #14
+rshrn   \inout2\().2s, \tmp3\().2d, #14
+rshrn2  \inout2\().4s, \tmp4\().2d, #14
+.endm
+
+// Same as dmbutterfly above, but treating the input in inout1 as zero
+.macro dmbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull   \tmp1\().2d, \inout2\().2s, \coef2
+smull2  \tmp2\().2d, \inout2\().4s, \coef2
+smull   \tmp3\().2d, \inout2\().2s, \coef1
+smull2  \tmp4\().2d, \inout2\().4s, \coef1
+neg \tmp1\().2d, \tmp1\().2d
+neg \tmp2\().2d, \tmp2\().2d
+rshrn   \inout2\().2s, \tmp3\().2d, #14
+rshrn2  \inout2\().4s, \tmp4\().2d, #14
+rshrn   \inout1\().2s, \tmp1\().2d, #14
+rshrn2  \inout1\().4s, \tmp2\().2d, #14
+.endm
+
+.macro dsmull_h out1, out2, in, coef
+smull   \out1\().2d, \in\().2s, \coef
+smull2  \out2\().2d, \in\().4s, \coef
+.endm
+
+.macro drshrn_h out, in1, in2, shift
+rshrn   \out\().2s, \in1\().2d, \shift
+

[FFmpeg-devel] [PATCH 10/14] arm: vp9itxfm16: Make the larger core transforms standalone functions

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
17500 to 14516 bytes.

This gives a small slowdown of a couple of tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.
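In C terms this is plain outlining (a hedged analogue, not the actual
code): a body that used to be macro-expanded at every call site is
emitted once and reached via a call, trading a few cycles of bl/ret
overhead per use for a much smaller .text:

    #include <stdint.h>

    /* Macro style: the body is duplicated wherever it is used. */
    #define BUTTERFLY(a, b) do { int32_t t_ = (a) + (b); \
                                 (b) = (a) - (b);        \
                                 (a) = t_; } while (0)

    /* Function style: one shared body, one call per use. noinline
     * mimics what the asm does explicitly with bl/ret (the callers
     * first save lr in a spare register, since they are themselves
     * entered via bl). */
    __attribute__((noinline))
    static void butterfly_fn(int32_t *a, int32_t *b)
    {
        int32_t t = *a + *b;
        *b = *a - *b;
        *a = t;
    }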

Before: Cortex A7   A8   A9  A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 43 ++--
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 29d95ca..8350153 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -807,7 +807,7 @@ function idct16x16_dc_add_neon
 endfunc
 .ltorg
 
-.macro idct16
+function idct16
 mbutterfly0 d16, d24, d16, d24, d8, d10, q4,  q5 @ d16 = t0a,  d24 = t1a
 mbutterfly  d20, d28, d1[0], d1[1], q4,  q5  @ d20 = t2a,  d28 = t3a
 mbutterfly  d18, d30, d2[0], d2[1], q4,  q5  @ d18 = t4a,  d30 = t7a
@@ -853,9 +853,10 @@ endfunc
 vmov d8,  d21 @ d8  = t10a
 butterfly   d20, d27, d10, d27   @ d20 = out[4], d27 = out[11]
 butterfly   d21, d26, d26, d8    @ d21 = out[5], d26 = out[10]
-.endm
+bx  lr
+endfunc
 
-.macro iadst16
+function iadst16
 movrel  r12, iadst16_coeffs
 vld1.16 {q0},  [r12,:128]!
 vmovl.s16   q1,  d1
@@ -933,7 +934,8 @@ endfunc
 
 vmovd16, d2
 vmovd30, d4
-.endm
+bx  lr
+endfunc
 
 .macro itxfm16_1d_funcs txfm
 @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it,
@@ -941,6 +943,8 @@ endfunc
 @ r0 = dst (temp buffer)
 @ r2 = src
 function \txfm\()16_1d_2x16_pass1_neon
+push{lr}
+
 mov r12, #64
 vmov.s32 q4,  #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
@@ -948,7 +952,7 @@ function \txfm\()16_1d_2x16_pass1_neon
 vst1.32 {d8},  [r2,:64], r12
 .endr
 
-\txfm\()16
+bl  \txfm\()16
 
 @ Do eight 2x2 transposes. Originally, d16-d31 contain the
 @ 16 rows. Afterwards, d16-d17, d18-d19 etc contain the eight
@@ -959,7 +963,7 @@ function \txfm\()16_1d_2x16_pass1_neon
 .irp i, 16, 18, 20, 22, 24, 26, 28, 30, 17, 19, 21, 23, 25, 27, 29, 31
 vst1.32 {d\i}, [r0,:64]!
 .endr
-bx  lr
+pop {pc}
 endfunc
 
 @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it,
@@ -968,6 +972,8 @@ endfunc
 @ r1 = dst stride
 @ r2 = src (temp buffer)
 function \txfm\()16_1d_2x16_pass2_neon
+push{lr}
+
 mov r12, #64
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 vld1.16 {d\i}, [r2,:64], r12
@@ -975,7 +981,7 @@ function \txfm\()16_1d_2x16_pass2_neon
 
 add r3,  r0,  r1
 lsl r1,  r1,  #1
-\txfm\()16
+bl  \txfm\()16
 
 .macro load_add_store coef0, coef1, coef2, coef3
 vrshr.s32   \coef0, \coef0, #6
@@ -1019,7 +1025,7 @@ function \txfm\()16_1d_2x16_pass2_neon
 load_add_store  q12, q13, q14, q15
 .purgem load_add_store
 
-bx  lr
+pop {pc}
 endfunc
 .endm
 
@@ -1193,7 +1199,7 @@ function idct32x32_dc_add_neon
 pop {r4-r9,pc}
 endfunc
 
-.macro idct32_odd
+function idct32_odd
 movrel  r12, idct_coeffs
 
 @ Overwrite the idct16 coeffs with the stored ones for idct32
@@ -1262,7 +1268,8 @@ endfunc
 mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a
 mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25,  d22 = t22
 mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a
-.endm
+bx  lr
+endfunc
 
 @ Do an 32-point IDCT of a 2x32 slice out of a 32x32 matrix.
 @ We don't have register space to do a single pass IDCT of 2x32 though,
@@ -1274,6 +1281,8 @@ endfunc
 @ r1 = unused
 @ r2 = src
 function idct32_1d_2x32_pass1_neon

[FFmpeg-devel] [PATCH 11/14] aarch64: vp9itxfm16: Make the larger core transforms standalone functions

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from
26288 to 21512 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16154.9

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  16175.9
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 45 
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index a97c1b6..de1da55 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -710,7 +710,7 @@ function idct16x16_dc_add_neon
 ret
 endfunc
 
-.macro idct16
+function idct16
 dmbutterfly0 v16, v24, v16, v24, v4, v5, v6, v7, v8, v9 // v16 = t0a,  v24 = t1a
 dmbutterfly  v20, v28, v0.s[2], v0.s[3], v4, v5, v6, v7 // v20 = t2a,  v28 = t3a
 dmbutterfly  v18, v30, v1.s[0], v1.s[1], v4, v5, v6, v7 // v18 = t4a,  v30 = t7a
@@ -753,9 +753,10 @@ endfunc
 butterfly_4s v19, v28, v5,  v28  // v19 = out[3], v28 = out[12]
 butterfly_4s v20, v27, v6,  v27  // v20 = out[4], v27 = out[11]
 butterfly_4s v21, v26, v26, v9   // v21 = out[5], v26 = out[10]
-.endm
+ret
+endfunc
 
-.macro iadst16
+function iadst16
 ld1 {v0.8h,v1.8h}, [x11]
 sxtlv2.4s,  v1.4h
 sxtl2   v3.4s,  v1.8h
@@ -830,7 +831,8 @@ endfunc
 
 mov v16.16b, v2.16b
 mov v30.16b, v4.16b
-.endm
+ret
+endfunc
 
 // Helper macros; we can't use these expressions directly within
 // e.g. .irp due to the extra concatenation \(). Therefore wrap
@@ -857,12 +859,14 @@ endfunc
 // x9 = input stride
 .macro itxfm16_1d_funcs txfm
 function \txfm\()16_1d_4x16_pass1_neon
+mov x14, x30
+
 movi v4.4s, #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 load_clear  \i,  x2,  x9
 .endr
 
-\txfm\()16
+bl  \txfm\()16
 
 // Do four 4x4 transposes. Originally, v16-v31 contain the
 // 16 rows. Afterwards, v16-v19, v20-v23, v24-v27 and v28-v31
@@ -878,7 +882,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
 store   \i,  x0,  #16
 .endr
-ret
+br  x14
 1:
 // Special case: For the last input column (x1 == 12),
 // which would be stored as the last row in the temp buffer,
@@ -906,7 +910,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 mov v29.16b, v17.16b
 mov v30.16b, v18.16b
 mov v31.16b, v19.16b
-ret
+br  x14
 endfunc
 
 // Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
@@ -917,6 +921,8 @@ endfunc
 // x3 = slice offset
 // x9 = temp buffer stride
 function \txfm\()16_1d_4x16_pass2_neon
+mov x14, x30
+
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27
 load\i,  x2,  x9
 .endr
@@ -928,7 +934,7 @@ function \txfm\()16_1d_4x16_pass2_neon
 
 add x3,  x0,  x1
 lsl x1,  x1,  #1
-\txfm\()16
+bl  \txfm\()16
 
 dup v8.8h, w13
 .macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7
@@ -983,7 +989,7 @@ function \txfm\()16_1d_4x16_pass2_neon
 load_add_store  v24.4s, v25.4s, v26.4s, v27.4s, v28.4s, v29.4s, 
v30.4s, v31.4s
 .purgem load_add_store
 
-ret
+br  x14
 endfunc
 .endm
 
@@ -1158,7 +1164,7 @@ function idct32x32_dc_add_neon
 ret
 endfunc
 
-.macro idct32_odd
+function idct32_odd
 dmbutterfly v16, v31, v10.s[0], v10.s[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
 dmbutterfly v24, v23, v10.s[2], v10.s[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
 dmbutterfly v20, v27, v11.s[0], v11.s[1], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
@@ -1209,7 +1215,8 @@ endfunc
 dmbutterfly0 v26, v21, v26, v21, v4, v5, v6, v7, v8, v9 // v26 = t26a, v21 = t21a
 dmbutterfly0 v25, v22, v25, v22, v4, v5, v6, v7, v8, v9 // v25 = t25,  v22 = t22
 dmbutterfly0 v24, v23, v24, v23, v4, v5, v6, v7, v8, v9 // v24 = t24a, v23 = t23a
-.endm
+ret
+endfunc
 
 // Do an 32-point IDCT of a 4x32 slice out of a 

[FFmpeg-devel] [PATCH 08/14] aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines

2017-03-16 Thread Martin Storsjö
This makes the code a bit more readable.
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index f80604f..86ea29e 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -886,21 +886,21 @@ function \txfm\()16_1d_4x16_pass1_neon
 // for the first slice of the second pass (where it is the
 // last 4x4 block).
 add x0,  x0,  #16
-.irp i, 20, 24, 28
-store   \i,  x0,  #16
-.endr
+st1 {v20.4s},  [x0], #16
+st1 {v24.4s},  [x0], #16
+st1 {v28.4s},  [x0], #16
 add x0,  x0,  #16
-.irp i, 21, 25, 29
-store   \i,  x0,  #16
-.endr
+st1 {v21.4s},  [x0], #16
+st1 {v25.4s},  [x0], #16
+st1 {v29.4s},  [x0], #16
 add x0,  x0,  #16
-.irp i, 22, 26, 30
-store   \i,  x0,  #16
-.endr
+st1 {v22.4s},  [x0], #16
+st1 {v26.4s},  [x0], #16
+st1 {v30.4s},  [x0], #16
 add x0,  x0,  #16
-.irp i, 23, 27, 31
-store   \i,  x0,  #16
-.endr
+st1 {v23.4s},  [x0], #16
+st1 {v27.4s},  [x0], #16
+st1 {v31.4s},  [x0], #16
 
 mov v28.16b, v16.16b
 mov v29.16b, v17.16b
-- 
2.7.4



[FFmpeg-devel] [PATCH 13/14] arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before: Cortex A7   A8   A9  A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 529 +++
 1 file changed, 469 insertions(+), 60 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 8350153..b4f615e 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -82,6 +82,14 @@ endconst
 vrshrn.s64  \out2, \tmpq4, #14
 .endm
 
+@ Same as mbutterfly0 above, but treating the input in in2 as zero,
+@ writing the same output into both out1 and out2.
+.macro mbutterfly0_h out1, out2, in1, in2, tmpd1, tmpd2, tmpq3, tmpq4
+vmull.s32   \tmpq3, \in1, d0[0]
+vrshrn.s64  \out1, \tmpq3, #14
+vrshrn.s64  \out2, \tmpq3, #14
+.endm
+
 @ out1,out2 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
 @ out3,out4 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
 @ Same as mbutterfly0, but with input being 2 q registers, output
@@ -148,6 +156,23 @@ endconst
 vrshrn.s64  \inout2, \tmp2,  #14
 .endm
 
+@ Same as mbutterfly above, but treating the input in inout2 as zero
+.macro mbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2
+vmull.s32   \tmp1,   \inout1, \coef1
+vmull.s32   \tmp2,   \inout1, \coef2
+vrshrn.s64  \inout1, \tmp1,   #14
+vrshrn.s64  \inout2, \tmp2,   #14
+.endm
+
+@ Same as mbutterfly above, but treating the input in inout1 as zero
+.macro mbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2
+vmov.s64\tmp1,   #0
+vmull.s32   \tmp2,   \inout2, \coef1
+vmlsl.s32   \tmp1,   \inout2, \coef2
+vrshrn.s64  \inout2, \tmp2,   #14
+vrshrn.s64  \inout1, \tmp1,   #14
+.endm
+
@ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14
 @ inout3,inout4 = (inout1,inout2 * coef2 + 

[FFmpeg-devel] [PATCH 03/14] arm/aarch64: vp9: Fix vertical alignment

2017-03-16 Thread Martin Storsjö
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfad12002033c73feed422a1cfc62081e8f.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 36 ++--
 libavcodec/arm/vp9itxfm_neon.S | 14 +++---
 libavcodec/arm/vp9lpf_neon.S   |  2 +-
 3 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 3e5da08..b12890f 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -380,7 +380,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
 movrel  x4,  idct_coeffs
 .else
-movrel  x4, iadst8_coeffs
+movrel  x4,  iadst8_coeffs
 ld1 {v1.8h}, [x4], #16
 .endif
 ld1 {v0.8h}, [x4]
@@ -480,23 +480,23 @@ itxfm_func8x8 iadst, iadst
 
 
 function idct16x16_dc_add_neon
-movrel  x4, idct_coeffs
+movrel  x4,  idct_coeffs
 ld1 {v0.4h}, [x4]
 
-movi v1.4h, #0
+movi v1.4h,  #0
 
 ld1 {v2.h}[0], [x2]
-smull   v2.4s,  v2.4h, v0.h[0]
-rshrn   v2.4h,  v2.4s, #14
-smull   v2.4s,  v2.4h, v0.h[0]
-rshrn   v2.4h,  v2.4s, #14
+smull   v2.4s,  v2.4h,  v0.h[0]
+rshrn   v2.4h,  v2.4s,  #14
+smull   v2.4s,  v2.4h,  v0.h[0]
+rshrn   v2.4h,  v2.4s,  #14
 dup v2.8h,  v2.h[0]
 st1 {v1.h}[0], [x2]
 
-srshr   v2.8h, v2.8h, #6
+srshr   v2.8h,  v2.8h,  #6
 
-mov x3, x0
-mov x4, #16
+mov x3,  x0
+mov x4,  #16
 1:
 // Loop to add the constant from v2 into all 16x16 outputs
 subs x4,  x4,  #2
@@ -869,7 +869,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifc \txfm1,idct
 ld1 {v0.8h,v1.8h}, [x10]
 .endif
-mov x9, #32
+mov x9,  #32
 
 .ifc \txfm1\()_\txfm2,idct_idct
 cmp w3,  #10
@@ -1046,10 +1046,10 @@ idct16_partial quarter
 idct16_partial half
 
 function idct32x32_dc_add_neon
-movrel  x4, idct_coeffs
+movrel  x4,  idct_coeffs
 ld1 {v0.4h}, [x4]
 
-movi v1.4h, #0
+movi v1.4h,  #0
 
 ld1 {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h,  v0.h[0]
@@ -1059,10 +1059,10 @@ function idct32x32_dc_add_neon
 dup v2.8h,  v2.h[0]
 st1 {v1.h}[0], [x2]
 
-srshr   v0.8h, v2.8h, #6
+srshr   v0.8h,  v2.8h,  #6
 
-mov x3, x0
-mov x4, #32
+mov x3,  x0
+mov x4,  #32
 1:
 // Loop to add the constant v0 into all 32x32 outputs
 subs x4,  x4,  #2
@@ -1230,7 +1230,7 @@ endfunc
 // x9 = double input stride
 function idct32_1d_8x32_pass1\suffix\()_neon
 mov x14, x30
-movi v2.8h, #0
+movi v2.8h,  #0
 
 // v16 = IN(0), v17 = IN(2) ... v31 = IN(30)
 .ifb \suffix
@@ -1295,7 +1295,7 @@ function idct32_1d_8x32_pass1\suffix\()_neon
 .endif
 add x2,  x2,  #64
 
-movi v2.8h, #0
+movi v2.8h,  #0
 // v16 = IN(1), v17 = IN(3) ... v31 = IN(31)
 .ifb \suffix
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 6d4d765..6c09922 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -530,7 +530,7 @@ function idct16x16_dc_add_neon
 movrel  r12, idct_coeffs
 vld1.16 {d0}, [r12,:64]
 
-vmov.i16 q2, #0
+vmov.i16 q2,  #0
 
 vld1.16 {d16[]}, [r2,:16]
 vmull.s16   q8,  d16, d0[0]
@@ -793,7 +793,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 push{lr}
 
 mov r12, #32
-vmov.s16 q2, #0
+vmov.s16 q2,  #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 vld1.16 {d\i}, [r2,:64]
 vst1.16 {d4},  [r2,:64], r12
@@ -1142,7 +1142,7 @@ function idct32x32_dc_add_neon
 movrel  r12, idct_coeffs
 vld1.16 {d0}, [r12,:64]
 
-vmov.i16 q2, #0
+vmov.i16 q2,  #0
 
 vld1.16 {d16[]}, [r2,:16]
 vmull.s16

[FFmpeg-devel] [PATCH 09/14] aarch64: vp9itxfm16: Restructure the idct32 store macros

2017-03-16 Thread Martin Storsjö
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 90 
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index 86ea29e..a97c1b6 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -1244,27 +1244,27 @@ function idct32_1d_4x32_pass1_neon
 .macro store_rev a, b, c, d
 // There's no rev128 instruction, but we reverse each 64 bit
 // half, and then flip them using an ext with 8 bytes offset.
-rev64   v7.4s, v\d\().4s
-st1 {v\a\().4s},  [x0], #16
+rev64   v7.4s, \d
+st1 {\a},  [x0], #16
 ext v7.16b, v7.16b, v7.16b, #8
-st1 {v\b\().4s},  [x0], #16
-rev64   v6.4s, v\c\().4s
-st1 {v\c\().4s},  [x0], #16
+st1 {\b},  [x0], #16
+rev64   v6.4s, \c
+st1 {\c},  [x0], #16
 ext v6.16b, v6.16b, v6.16b, #8
-st1 {v\d\().4s},  [x0], #16
-rev64   v5.4s, v\b\().4s
+st1 {\d},  [x0], #16
+rev64   v5.4s, \b
 st1 {v7.4s},  [x0], #16
 ext v5.16b, v5.16b, v5.16b, #8
 st1 {v6.4s},  [x0], #16
-rev64   v4.4s, v\a\().4s
+rev64   v4.4s, \a
 st1 {v5.4s},  [x0], #16
 ext v4.16b, v4.16b, v4.16b, #8
 st1 {v4.4s},  [x0], #16
 .endm
-store_rev   16, 20, 24, 28
-store_rev   17, 21, 25, 29
-store_rev   18, 22, 26, 30
-store_rev   19, 23, 27, 31
+store_rev   v16.4s, v20.4s, v24.4s, v28.4s
+store_rev   v17.4s, v21.4s, v25.4s, v29.4s
+store_rev   v18.4s, v22.4s, v26.4s, v30.4s
+store_rev   v19.4s, v23.4s, v27.4s, v31.4s
 sub x0,  x0,  #512
 .purgem store_rev
 
@@ -1290,27 +1290,27 @@ function idct32_1d_4x32_pass1_neon
 // Store the registers a, b, c, d horizontally,
 // adding into the output first, and the mirrored,
 // subtracted from the output.
-.macro store_rev a, b, c, d
+.macro store_rev a, b, c, d, a16b, b16b
 ld1 {v4.4s},  [x0]
-rev64   v9.4s, v\d\().4s
-add v4.4s, v4.4s, v\a\().4s
+rev64   v9.4s, \d
+add v4.4s, v4.4s, \a
 st1 {v4.4s},  [x0], #16
-rev64   v8.4s, v\c\().4s
+rev64   v8.4s, \c
 ld1 {v4.4s},  [x0]
 ext v9.16b, v9.16b, v9.16b, #8
-add v4.4s, v4.4s, v\b\().4s
+add v4.4s, v4.4s, \b
 st1 {v4.4s},  [x0], #16
 ext v8.16b, v8.16b, v8.16b, #8
 ld1 {v4.4s},  [x0]
-rev64   v\b\().4s, v\b\().4s
-add v4.4s, v4.4s, v\c\().4s
+rev64   \b, \b
+add v4.4s, v4.4s, \c
 st1 {v4.4s},  [x0], #16
-rev64   v\a\().4s, v\a\().4s
+rev64   \a, \a
 ld1 {v4.4s},  [x0]
-ext v\b\().16b, v\b\().16b, v\b\().16b, #8
-add v4.4s, v4.4s, v\d\().4s
+ext \b16b, \b16b, \b16b, #8
+add v4.4s, v4.4s, \d
 st1 {v4.4s},  [x0], #16
-ext v\a\().16b, v\a\().16b, v\a\().16b, #8
+ext \a16b, \a16b, \a16b, #8
 ld1 {v4.4s},  [x0]
 sub v4.4s, v4.4s, v9.4s
 st1 {v4.4s},  [x0], #16
@@ -1318,17 +1318,17 @@ function idct32_1d_4x32_pass1_neon
 sub v4.4s, v4.4s, v8.4s
 st1 {v4.4s},  [x0], #16
 ld1 {v4.4s},  [x0]
-sub v4.4s, v4.4s, v\b\().4s
+sub v4.4s, v4.4s, \b
 st1 {v4.4s},  [x0], #16
 ld1 {v4.4s},  [x0]
-sub v4.4s, v4.4s, v\a\().4s
+sub v4.4s, v4.4s, \a
 st1 {v4.4s},  [x0], #16
 .endm
 
-store_rev   31, 27, 23, 19
-store_rev   30, 26, 22, 18
-store_rev   29, 25, 21, 17
-store_rev   28, 24, 20, 16
+store_rev   v31.4s, v27.4s, v23.4s, v19.4s, v31.16b, v27.16b
+store_rev   v30.4s, v26.4s, v22.4s, v18.4s, v30.16b, v26.16b
+store_rev   v29.4s, v25.4s, v21.4s, v17.4s, v29.16b, v25.16b
+store_rev   v28.4s, v24.4s, v20.4s, v16.4s, v28.16b, v24.16b
 .purgem store_rev
 ret
 endfunc
@@ -1370,21 +1370,21 @@ function 

[FFmpeg-devel] [PATCH 08/34] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon:   1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon:   6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon:   6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon:   7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon:   7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon:   8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon:   8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0
vp9_inv_dct_dct_16x16_sub12_add_neon:   1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon:   5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon:   5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon:   7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon:   7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon:   8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon:   8127.2

This is cherrypicked from libav commit
a63da4511d0fee66695ff4afd264ba1dbf1e812d.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 525 -
 1 file changed, 466 insertions(+), 59 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index e45d385..3eb999a 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -75,6 +75,17 @@ endconst
 .endif
 .endm
 
+// Same as dmbutterfly0 above, but treating the input in in2 as zero,
+// writing the same output into both out1 and out2.
+.macro dmbutterfly0_h out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6
+smull   \tmp1\().4s,  \in1\().4h,  v0.h[0]
+smull2  \tmp2\().4s,  \in1\().8h,  v0.h[0]
+rshrn   \out1\().4h,  \tmp1\().4s, #14
+rshrn2  \out1\().8h,  \tmp2\().4s, #14
+rshrn   \out2\().4h,  \tmp1\().4s, #14
+rshrn2  \out2\().8h,  \tmp2\().4s, #14
+.endm
+
 // out1,out2 = in1 * coef1 - in2 * coef2
 // out3,out4 = in1 * coef2 + in2 * coef1
 // out are 4 x .4s registers, in are 2 x .8h registers
@@ -104,6 +115,43 @@ endconst
 rshrn2  \inout2\().8h, \tmp4\().4s,  #14
 .endm
 
+// Same as dmbutterfly above, but treating the input in inout2 as zero
+.macro dmbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull   \tmp1\().4s, \inout1\().4h, \coef1
+smull2  \tmp2\().4s, \inout1\().8h, \coef1
+smull   \tmp3\().4s, \inout1\().4h, \coef2
+smull2  \tmp4\().4s, \inout1\().8h, \coef2
+rshrn   \inout1\().4h, \tmp1\().4s, #14
+rshrn2  \inout1\().8h, \tmp2\().4s, #14
+rshrn   \inout2\().4h, \tmp3\().4s, #14
+rshrn2  \inout2\().8h, \tmp4\().4s, #14
+.endm
+
+// Same as dmbutterfly above, but treating the input in inout1 as zero
+.macro dmbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull   \tmp1\().4s, \inout2\().4h, \coef2
+smull2  \tmp2\().4s, \inout2\().8h, \coef2
+smull   \tmp3\().4s, \inout2\().4h, \coef1
+smull2  \tmp4\().4s, \inout2\().8h, \coef1
+neg \tmp1\().4s, \tmp1\().4s
+neg \tmp2\().4s, \tmp2\().4s
+rshrn   \inout2\().4h, \tmp3\().4s, #14
+rshrn2  \inout2\().8h, \tmp4\().4s, #14
+rshrn   \inout1\().4h, \tmp1\().4s, #14
+rshrn2  \inout1\().8h, \tmp2\().4s, #14
+.endm
+
+.macro dsmull_h out1, out2, in, coef
+smull   \out1\().4s, \in\().4h, \coef
+smull2  \out2\().4s, \in\().8h, \coef
+.endm
+
+.macro drshrn_h out, in1, in2, shift
+rshrn   \out\().4h, \in1\().4s, \shift
+rshrn2  \out\().8h, 

[FFmpeg-devel] [PATCH 07/34] arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:  Cortex A7   A8   A9  A53
vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon: 758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0

This is cherrypicked from libav commit
5eb5aec475aabc884d083566f902876ecbc072cb.
---
 libavcodec/arm/vp9itxfm_neon.S | 591 +
 1 file changed, 537 insertions(+), 54 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 682a82e..33a7af1 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -74,6 +74,14 @@ endconst
 vrshrn.s32  \out2, \tmpq4, #14
 .endm
 
+@ Same as mbutterfly0 above, but treating the input in in2 as zero,
+@ writing the same output into both out1 and out2.
+.macro mbutterfly0_h out1, out2, in1, in2, tmpd1, tmpd2, tmpq3, tmpq4
+vmull.s16   \tmpq3, \in1, d0[0]
+vrshrn.s32  \out1,  \tmpq3, #14
+vrshrn.s32  \out2,  \tmpq3, #14
+.endm
+
 @ out1,out2 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
 @ out3,out4 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
 @ Same as mbutterfly0, but with input being 2 q registers, output
@@ -137,6 +145,23 @@ endconst
 vrshrn.s32  \inout2, \tmp2,  #14
 .endm
 
+@ Same as mbutterfly above, but treating the input in inout2 as zero
+.macro mbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2
+vmull.s16   \tmp1,   \inout1, \coef1
+vmull.s16   \tmp2,   \inout1, \coef2
+vrshrn.s32  \inout1, \tmp1,   #14
+vrshrn.s32  \inout2, \tmp2,   #14
+.endm
+
+@ Same as mbutterfly above, but treating the input in inout1 as zero
+.macro mbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2
+vmull.s16   \tmp1,   \inout2, \coef2
+vmull.s16   \tmp2,   \inout2, \coef1
+vneg.s32\tmp1,   \tmp1
+vrshrn.s32  \inout2, \tmp2,   #14
+vrshrn.s32  \inout1, \tmp1,   #14
+.endm
+
@ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14
@ inout3,inout4 = (inout1,inout2 * coef2 + inout3,inout4 * coef1 + (1 << 13)) >> 14
 @ 

[FFmpeg-devel] [PATCH 17/34] aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter

2017-03-08 Thread Martin Storsjö
No measured speedup on a Cortex A53, but other cores might benefit.

This is cherrypicked from libav commit
388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7.
---
 libavcodec/aarch64/vp9mc_neon.S | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S
index 9403911..82a0f53 100644
--- a/libavcodec/aarch64/vp9mc_neon.S
+++ b/libavcodec/aarch64/vp9mc_neon.S
@@ -202,9 +202,12 @@ endfunc
 ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
 mla \dst2\().8h, v21.8h, v0.h[\offset]
 mla \dst4\().8h, v23.8h, v0.h[\offset]
-.else
+.elseif \size == 8
 mla \dst1\().8h, v20.8h, v0.h[\offset]
 mla \dst3\().8h, v22.8h, v0.h[\offset]
+.else
+mla \dst1\().4h, v20.4h, v0.h[\offset]
+mla \dst3\().4h, v22.4h, v0.h[\offset]
 .endif
 .endm
 // The same as above, but don't accumulate straight into the
@@ -219,16 +222,24 @@ endfunc
 ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
 mul v21.8h, v21.8h, v0.h[\offset]
 mul v23.8h, v23.8h, v0.h[\offset]
-.else
+.elseif \size == 8
 mul v20.8h, v20.8h, v0.h[\offset]
 mul v22.8h, v22.8h, v0.h[\offset]
+.else
+mul v20.4h, v20.4h, v0.h[\offset]
+mul v22.4h, v22.4h, v0.h[\offset]
 .endif
+.if \size == 4
+sqadd   \dst1\().4h, \dst1\().4h, v20.4h
+sqadd   \dst3\().4h, \dst3\().4h, v22.4h
+.else
 sqadd   \dst1\().8h, \dst1\().8h, v20.8h
 sqadd   \dst3\().8h, \dst3\().8h, v22.8h
 .if \size >= 16
 sqadd   \dst2\().8h, \dst2\().8h, v21.8h
 sqadd   \dst4\().8h, \dst4\().8h, v23.8h
 .endif
+.endif
 .endm
 
 
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 16/34] arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter

2017-03-08 Thread Martin Storsjö
Before:                       Cortex A7     A8     A9    A53
vp9_put_8tap_smooth_4h_neon:      378.1  273.2  340.7  229.5
After:
vp9_put_8tap_smooth_4h_neon:      352.1  222.2  290.5  229.5

This is cherrypicked from libav commit
fea92a4b57d1c328b1de226a5f213a629ee63754.
---
 libavcodec/arm/vp9mc_neon.S | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/libavcodec/arm/vp9mc_neon.S b/libavcodec/arm/vp9mc_neon.S
index 83235ff..bd8cda7 100644
--- a/libavcodec/arm/vp9mc_neon.S
+++ b/libavcodec/arm/vp9mc_neon.S
@@ -209,7 +209,7 @@ endfunc
 @ Extract a vector from src1-src2 and src4-src5 (src1-src3 and src4-src6
 @ for size >= 16), and multiply-accumulate into dst1 and dst3 (or
 @ dst1-dst2 and dst3-dst4 for size >= 16)
-.macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
+.macro extmla dst1, dst2, dst3, dst4, dst1d, dst3d, src1, src2, src3, src4, src5, src6, offset, size
 vext.8  q14, \src1, \src2, #(2*\offset)
 vext.8  q15, \src4, \src5, #(2*\offset)
 .if \size >= 16
@@ -219,14 +219,17 @@ endfunc
 vext.8  q6,  \src5, \src6, #(2*\offset)
 vmla_lane   \dst2,  q5,  \offset
 vmla_lane   \dst4,  q6,  \offset
-.else
+.elseif \size == 8
 vmla_lane   \dst1,  q14, \offset
 vmla_lane   \dst3,  q15, \offset
+.else
+vmla_lane   \dst1d, d28, \offset
+vmla_lane   \dst3d, d30, \offset
 .endif
 .endm
 @ The same as above, but don't accumulate straight into the
 @ destination, but use a temp register and accumulate with saturation.
-.macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
+.macro extmulqadd dst1, dst2, dst3, dst4, dst1d, dst3d, src1, src2, src3, src4, src5, src6, offset, size
 vext.8  q14, \src1, \src2, #(2*\offset)
 vext.8  q15, \src4, \src5, #(2*\offset)
 .if \size >= 16
@@ -236,16 +239,24 @@ endfunc
 vext.8  q6,  \src5, \src6, #(2*\offset)
 vmul_lane   q5,  q5,  \offset
 vmul_lane   q6,  q6,  \offset
-.else
+.elseif \size == 8
 vmul_lane   q14, q14, \offset
 vmul_lane   q15, q15, \offset
+.else
+vmul_lane   d28, d28, \offset
+vmul_lane   d30, d30, \offset
 .endif
+.if \size == 4
+vqadd.s16   \dst1d, \dst1d, d28
+vqadd.s16   \dst3d, \dst3d, d30
+.else
 vqadd.s16   \dst1,  \dst1,  q14
 vqadd.s16   \dst3,  \dst3,  q15
 .if \size >= 16
 vqadd.s16   \dst2,  \dst2,  q5
 vqadd.s16   \dst4,  \dst4,  q6
 .endif
+.endif
 .endm
 
 
@@ -308,13 +319,13 @@ function \type\()_8tap_\size\()h_\idx1\idx2
 vmul.s16q2,  q9,  d0[0]
 vmul.s16q4,  q12, d0[0]
 .endif
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, 1,     \size
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, 2,     \size
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, \idx1, \size
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, 5,     \size
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, 6,     \size
-extmla  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, 7,     \size
-extmulqadd  q1,  q2,  q3,  q4,  q8,  q9,  q10, q11, q12, q13, \idx2, \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, 1,     \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, 2,     \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, \idx1, \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, 5,     \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, 6,     \size
+extmla  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, 7,     \size
+extmulqadd  q1,  q2,  q3,  q4,  d2,  d6,  q8,  q9,  q10, q11, q12, q13, \idx2, \size
 
 @ Round, shift and saturate
 vqrshrun.s16d2,  q1,  #7
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 14/34] aarch64: vp9itxfm: Fix incorrect vertical alignment

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 5219d6e..6bb097b 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -225,7 +225,7 @@ endconst
 add v21.4s,v17.4s,v19.4s
 rshrn   \c0\().4h, v20.4s,#14
 add v16.4s,v16.4s,v17.4s
-rshrn   \c1\().4h, v21.4s, #14
+rshrn   \c1\().4h, v21.4s,#14
 sub v16.4s,v16.4s,v19.4s
 rshrn   \c2\().4h, v18.4s,#14
 rshrn   \c3\().4h, v16.4s,#14
@@ -1313,8 +1313,8 @@ function idct32_1d_8x32_pass1\suffix\()_neon
 
 bl  idct32_odd\suffix
 
-transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3
-transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3
+transpose_8x8H  v31, v30, v29, v28, v27, v26, v25, v24, v2, v3
+transpose_8x8H  v23, v22, v21, v20, v19, v18, v17, v16, v2, v3
 
 // Store the registers a, b horizontally,
 // adding into the output first, and the mirrored,
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 04/34] aarch64: vp9itxfm: Make the larger core transforms standalone functions

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon:   8095.7

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:    1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon:   1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon:    5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8125.8

This is cherrypicked from libav commit
115476018d2c97df7e9b4445fe8f6cc7420ab91f.
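
The mechanics are visible in the diff below: the shared body is now
entered with bl, so each pass has to preserve its own return address in a
spare register (the mov x14, x30 / br x14 pairs), since the nested bl
overwrites x30. As a loose C analogy for the size/speed tradeoff itself
(attribute spellings assume GCC/Clang; the body is a stand-in):

    /* A macro is like a forced-inline helper: duplicated at every call
     * site, larger code, no call overhead. */
    static inline __attribute__((always_inline))
    void core_inlined(int16_t *col)
    {
        for (int i = 0; i < 16; i++)
            col[i] = (int16_t)(col[i] << 1); /* stand-in for the transform */
    }

    /* A standalone function keeps one copy and pays a call/return per
     * use, which is the couple-of-tens-of-cycles slowdown above. */
    static __attribute__((noinline))
    void core_called(int16_t *col)
    {
        for (int i = 0; i < 16; i++)
            col[i] = (int16_t)(col[i] << 1);
    }
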
---
 libavcodec/aarch64/vp9itxfm_neon.S | 42 +++---
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 7427963..a37b459 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -463,7 +463,7 @@ function idct16x16_dc_add_neon
 ret
 endfunc
 
-.macro idct16
+function idct16
 dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a,  v24 = t1a
 dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
 dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
@@ -506,9 +506,10 @@ endfunc
 butterfly_8h v19, v28, v5,  v28 // v19 = out[3], v28 = out[12]
 butterfly_8h v20, v27, v6,  v27 // v20 = out[4], v27 = out[11]
 butterfly_8h v21, v26, v26, v3 // v21 = out[5], v26 = out[10]
-.endm
+ret
+endfunc
 
-.macro iadst16
+function iadst16
 ld1 {v0.8h,v1.8h}, [x11]
 
 dmbutterfly_l   v6,  v7,  v4,  v5,  v31, v16, v0.h[1], v0.h[0]   // v6,v7   = t1,   v4,v5   = t0
@@ -577,7 +578,8 @@ endfunc
 
 mov v16.16b, v2.16b
 mov v30.16b, v4.16b
-.endm
+ret
+endfunc
 
 // Helper macros; we can't use these expressions directly within
 // e.g. .irp due to the extra concatenation \(). Therefore wrap
@@ -604,12 +606,14 @@ endfunc
 // x9 = input stride
 .macro itxfm16_1d_funcs txfm
 function \txfm\()16_1d_8x16_pass1_neon
+mov x14, x30
+
 moviv2.8h, #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
 load_clear  \i,  x2,  x9
 .endr
 
-\txfm\()16
+bl  \txfm\()16
 
 // Do two 8x8 transposes. Originally, v16-v31 contain the
 // 16 rows. Afterwards, v16-v23 and v24-v31 contain the two
@@ -623,7 +627,7 @@ function \txfm\()16_1d_8x16_pass1_neon
 .irp i, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31
 store   \i,  x0,  #16
 .endr
-ret
+br  x14
 1:
 // Special case: For the last input column (x1 == 8),
 // which would be stored as the last row in the temp buffer,
@@ -642,7 +646,7 @@ function \txfm\()16_1d_8x16_pass1_neon
 mov v29.16b, v21.16b
 mov v30.16b, v22.16b
 mov v31.16b, v23.16b
-ret
+br  x14
 endfunc
 
 // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it,
@@ -653,6 +657,7 @@ endfunc
 // x3 = slice offset
 // x9 = temp buffer stride
 function \txfm\()16_1d_8x16_pass2_neon
+mov x14, x30
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23
 load\i,  x2,  x9
 .endr
@@ -664,7 +669,7 @@ function \txfm\()16_1d_8x16_pass2_neon
 
 add x3,  x0,  x1
 lsl x1,  x1,  #1
-\txfm\()16
+bl  \txfm\()16
 
 .macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2
 srshr   \coef0, \coef0, #6
@@ -714,7 +719,7 @@ function \txfm\()16_1d_8x16_pass2_neon
 load_add_store  v24.8h, v25.8h, v26.8h, v27.8h, v28.8h, v29.8h, v30.8h, v31.8h, v16.8b, v17.8b
 .purgem load_add_store
 
-ret
+br  x14
 endfunc
 .endm
 
@@ -843,7 +848,7 @@ function idct32x32_dc_add_neon
 ret
 endfunc
 
-.macro idct32_odd
+function idct32_odd
 ld1 {v0.8h,v1.8h}, [x11]
 
 dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
@@ -898,7 +903,8 @@ endfunc
 dmbutterfly0 v26, v21, v26, v21, v2, v3, v4, v5, v6, v7 // v26 = t26a, v21 = t21a
 dmbutterfly0 v25, v22, v25, v22, v2, v3, v4, v5, v6, v7 // v25 = t25,  v22 = t22
 dmbutterfly0 v24, v23, v24, v23, v2, v3, v4, v5, v6, v7 // v24 = t24a, v23 = t23a
-.endm
+ret
+endfunc
 
 // Do an 32-point IDCT of a 8x32 slice out of a 32x32 matrix.
 // The 32-point IDCT can be decomposed into two 16-point IDCTs;
@@ 

[FFmpeg-devel] [PATCH 01/34] arm: vp9itxfm: Avoid .irp when it doesn't save any lines

2017-03-08 Thread Martin Storsjö
This makes it more readable.

This is cherrypicked from libav commit
3bc5b28d5a191864c54bba60646933a63da31656.
---
 libavcodec/arm/vp9itxfm_neon.S | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 25f6dde..93816d2 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -690,21 +690,21 @@ function \txfm\()16_1d_4x16_pass1_neon
 @ for the first slice of the second pass (where it is the
 @ last 4x4 block).
 add r0,  r0,  #8
-.irp i, 20, 24, 28
-vst1.16 {d\i}, [r0,:64]!
-.endr
+vst1.16 {d20}, [r0,:64]!
+vst1.16 {d24}, [r0,:64]!
+vst1.16 {d28}, [r0,:64]!
 add r0,  r0,  #8
-.irp i, 21, 25, 29
-vst1.16 {d\i}, [r0,:64]!
-.endr
+vst1.16 {d21}, [r0,:64]!
+vst1.16 {d25}, [r0,:64]!
+vst1.16 {d29}, [r0,:64]!
 add r0,  r0,  #8
-.irp i, 22, 26, 30
-vst1.16 {d\i}, [r0,:64]!
-.endr
+vst1.16 {d22}, [r0,:64]!
+vst1.16 {d26}, [r0,:64]!
+vst1.16 {d30}, [r0,:64]!
 add r0,  r0,  #8
-.irp i, 23, 27, 31
-vst1.16 {d\i}, [r0,:64]!
-.endr
+vst1.16 {d23}, [r0,:64]!
+vst1.16 {d27}, [r0,:64]!
+vst1.16 {d31}, [r0,:64]!
 vmovd28, d16
 vmovd29, d17
 vmovd30, d18
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 02/34] aarch64: vp9itxfm: Restructure the idct32 store macros

2017-03-08 Thread Martin Storsjö
This avoids concatenation, which can't be used if the whole macro
is wrapped within another macro.

This is also arguably more readable.

This is cherrypicked from libav commit
58d87e0f49bcbbc6f426328f53b657bae7430cd2.
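
For readers more used to C, the same trap exists with preprocessor token
pasting: a pasted name stops being an argument reference once the macro
is invoked through another macro layer, whereas passing the complete name
through works at any nesting depth. A self-contained C illustration (an
analogy, not the gas syntax itself):

    #include <stdio.h>

    #define REG 16
    /* ## suppresses argument expansion: LOAD_PASTED(REG) produces the
     * nonexistent identifier vREG, not v16. */
    #define LOAD_PASTED(n) v##n
    /* Passing the complete name through has no such trap, at any depth. */
    #define LOAD_WHOLE(r)  r

    int v16 = 42;

    int main(void)
    {
        printf("%d\n", LOAD_WHOLE(v16)); /* fine at any wrapping depth */
        return 0;
    }
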
---
 libavcodec/aarch64/vp9itxfm_neon.S | 80 +++---
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 82f1f41..7427963 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -935,23 +935,23 @@ function idct32_1d_8x32_pass1_neon
 .macro store_rev a, b
 // There's no rev128 instruction, but we reverse each 64 bit
 // half, and then flip them using an ext with 8 bytes offset.
-rev64   v1.8h, v\b\().8h
-st1 {v\a\().8h},  [x0], #16
-rev64   v0.8h, v\a\().8h
+rev64   v1.8h, \b
+st1 {\a},  [x0], #16
+rev64   v0.8h, \a
 ext v1.16b, v1.16b, v1.16b, #8
-st1 {v\b\().8h},  [x0], #16
+st1 {\b},  [x0], #16
 ext v0.16b, v0.16b, v0.16b, #8
 st1 {v1.8h},  [x0], #16
 st1 {v0.8h},  [x0], #16
 .endm
-store_rev   16, 24
-store_rev   17, 25
-store_rev   18, 26
-store_rev   19, 27
-store_rev   20, 28
-store_rev   21, 29
-store_rev   22, 30
-store_rev   23, 31
+store_rev   v16.8h, v24.8h
+store_rev   v17.8h, v25.8h
+store_rev   v18.8h, v26.8h
+store_rev   v19.8h, v27.8h
+store_rev   v20.8h, v28.8h
+store_rev   v21.8h, v29.8h
+store_rev   v22.8h, v30.8h
+store_rev   v23.8h, v31.8h
 sub x0,  x0,  #512
 .purgem store_rev
 
@@ -977,14 +977,14 @@ function idct32_1d_8x32_pass1_neon
 // subtracted from the output.
 .macro store_rev a, b
 ld1 {v4.8h},  [x0]
-rev64   v1.8h, v\b\().8h
-add v4.8h, v4.8h, v\a\().8h
-rev64   v0.8h, v\a\().8h
+rev64   v1.8h, \b
+add v4.8h, v4.8h, \a
+rev64   v0.8h, \a
 st1 {v4.8h},  [x0], #16
 ext v1.16b, v1.16b, v1.16b, #8
 ld1 {v5.8h},  [x0]
 ext v0.16b, v0.16b, v0.16b, #8
-add v5.8h, v5.8h, v\b\().8h
+add v5.8h, v5.8h, \b
 st1 {v5.8h},  [x0], #16
 ld1 {v6.8h},  [x0]
 sub v6.8h, v6.8h, v1.8h
@@ -994,14 +994,14 @@ function idct32_1d_8x32_pass1_neon
 st1 {v7.8h},  [x0], #16
 .endm
 
-store_rev   31, 23
-store_rev   30, 22
-store_rev   29, 21
-store_rev   28, 20
-store_rev   27, 19
-store_rev   26, 18
-store_rev   25, 17
-store_rev   24, 16
+store_rev   v31.8h, v23.8h
+store_rev   v30.8h, v22.8h
+store_rev   v29.8h, v21.8h
+store_rev   v28.8h, v20.8h
+store_rev   v27.8h, v19.8h
+store_rev   v26.8h, v18.8h
+store_rev   v25.8h, v17.8h
+store_rev   v24.8h, v16.8h
 .purgem store_rev
 ret
 endfunc
@@ -1047,21 +1047,21 @@ function idct32_1d_8x32_pass2_neon
 .if \neg == 0
 ld1 {v4.8h},  [x2], x9
 ld1 {v5.8h},  [x2], x9
-add v4.8h, v4.8h, v\a\().8h
+add v4.8h, v4.8h, \a
 ld1 {v6.8h},  [x2], x9
-add v5.8h, v5.8h, v\b\().8h
+add v5.8h, v5.8h, \b
 ld1 {v7.8h},  [x2], x9
-add v6.8h, v6.8h, v\c\().8h
-add v7.8h, v7.8h, v\d\().8h
+add v6.8h, v6.8h, \c
+add v7.8h, v7.8h, \d
 .else
 ld1 {v4.8h},  [x2], x7
 ld1 {v5.8h},  [x2], x7
-sub v4.8h, v4.8h, v\a\().8h
+sub v4.8h, v4.8h, \a
 ld1 {v6.8h},  [x2], x7
-sub v5.8h, v5.8h, v\b\().8h
+sub v5.8h, v5.8h, \b
 ld1 {v7.8h},  [x2], x7
-sub v6.8h, v6.8h, v\c\().8h
-sub v7.8h, v7.8h, v\d\().8h
+sub v6.8h, v6.8h, \c
+sub v7.8h, v7.8h, \d
 .endif
 ld1 {v0.8b}, [x0], x1
 ld1 {v1.8b}, [x0], x1
@@ -1085,15 +1085,15 @@ function idct32_1d_8x32_pass2_neon
 st1 {v6.8b}, [x0], x1
 st1 {v7.8b}, [x0], x1
 .endm
-load_acc_store  31, 30, 29, 28
-load_acc_store  27, 26, 25, 24
-

[FFmpeg-devel] [PATCH 31/34] arm: vp9itxfm: Reorder the idct coefficients for better pairing

2017-03-08 Thread Martin Storsjö
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
de06bdfe6c8abd8266d5c6f5c68e4df0060b61fc.
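
The layout change is easiest to see with the table written out: the
scalar multiplies address coefficients as lanes of 64-bit d registers,
and with the old order the pair (1606, 16305) straddled the d1/d2
boundary (visible as the d1[3], d2[0] references in the old idct16 code
below), so any step wanting that pair had to have both halves of the
table loaded. With the unused zero moved to slot 1, every used pair sits
inside one d register and partial loads fall out naturally. In C array
form (values copied from idct_coeffs in the diff):

    /* old: the (1606, 16305) pair spans two d registers */
    static const short idct_coeffs_old[16] = {
        11585,  6270, 15137,  3196,   /* d0 */
        16069, 13623,  9102,  1606,   /* d1 */
        16305, 12665, 10394,  7723,   /* d2 */
        14449, 15679,  4756,     0,   /* d3 */
    };

    /* new: the unused 0 fills slot 1; every used pair is intact
     * within a single d register */
    static const short idct_coeffs_new[16] = {
        11585,     0,  6270, 15137,   /* d0 */
         3196, 16069, 13623,  9102,   /* d1 */
         1606, 16305, 12665, 10394,   /* d2 */
         7723, 14449, 15679,  4756,   /* d3 */
    };
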
---
 libavcodec/arm/vp9itxfm_neon.S | 124 -
 1 file changed, 62 insertions(+), 62 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 9385b01..05e31e6 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -22,7 +22,7 @@
 #include "neon.S"
 
 const itxfm4_coeffs, align=4
-.short  11585, 6270, 15137, 0
+.short  11585, 0, 6270, 15137
 iadst4_coeffs:
 .short  5283, 15212, 9929, 13377
 endconst
@@ -30,8 +30,8 @@ endconst
 const iadst8_coeffs, align=4
 .short  16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679
 idct_coeffs:
-.short  11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606
-.short  16305, 12665, 10394, 7723, 14449, 15679, 4756, 0
+.short  11585, 0, 6270, 15137, 3196, 16069, 13623, 9102
+.short  1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756
 .short  804, 16364, 12140, 11003, 7005, 14811, 15426, 5520
 .short  3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404
 endconst
@@ -224,14 +224,14 @@ endconst
 .endm
 
 .macro idct4 c0, c1, c2, c3
-vmull.s16   q13,  \c1,  d0[2]
-vmull.s16   q11,  \c1,  d0[1]
+vmull.s16   q13,  \c1,  d0[3]
+vmull.s16   q11,  \c1,  d0[2]
 vadd.i16d16,  \c0,  \c2
 vsub.i16d17,  \c0,  \c2
-vmlal.s16   q13,  \c3,  d0[1]
+vmlal.s16   q13,  \c3,  d0[2]
 vmull.s16   q9,   d16,  d0[0]
 vmull.s16   q10,  d17,  d0[0]
-vmlsl.s16   q11,  \c3,  d0[2]
+vmlsl.s16   q11,  \c3,  d0[3]
 vrshrn.s32  d26,  q13,  #14
 vrshrn.s32  d18,  q9,   #14
 vrshrn.s32  d20,  q10,  #14
@@ -350,9 +350,9 @@ itxfm_func4x4 iwht,  iwht
 
 .macro idct8
 dmbutterfly0 d16, d17, d24, d25, q8,  q12, q2, q4, d4, d5, d8, d9, q3, q2, q5, q4 @ q8 = t0a, q12 = t1a
-dmbutterfly d20, d21, d28, d29, d0[1], d0[2], q2,  q3,  q4,  q5 @ q10 = t2a, q14 = t3a
-dmbutterfly d18, d19, d30, d31, d0[3], d1[0], q2,  q3,  q4,  q5 @ q9  = t4a, q15 = t7a
-dmbutterfly d26, d27, d22, d23, d1[1], d1[2], q2,  q3,  q4,  q5 @ q13 = t5a, q11 = t6a
+dmbutterfly d20, d21, d28, d29, d0[2], d0[3], q2,  q3,  q4,  q5 @ q10 = t2a, q14 = t3a
+dmbutterfly d18, d19, d30, d31, d1[0], d1[1], q2,  q3,  q4,  q5 @ q9  = t4a, q15 = t7a
+dmbutterfly d26, d27, d22, d23, d1[2], d1[3], q2,  q3,  q4,  q5 @ q13 = t5a, q11 = t6a
 
 butterfly   q2,  q14, q8,  q14 @ q2 = t0, q14 = t3
 butterfly   q3,  q10, q12, q10 @ q3 = t1, q10 = t2
@@ -386,8 +386,8 @@ itxfm_func4x4 iwht,  iwht
 vneg.s16q15, q15  @ q15 = out[7]
 butterfly   q8,  q9,  q11, q9 @ q8 = out[0], q9 = t2
 
-dmbutterfly_l   q10, q11, q5,  q7,  d4,  d5,  d6,  d7,  d0[1], d0[2] @ q10,q11 = t5a, q5,q7 = t4a
-dmbutterfly_l   q2,  q3,  q13, q14, d12, d13, d8,  d9,  d0[2], d0[1] @ q2,q3 = t6a, q13,q14 = t7a
+dmbutterfly_l   q10, q11, q5,  q7,  d4,  d5,  d6,  d7,  d0[2], d0[3] @ q10,q11 = t5a, q5,q7 = t4a
+dmbutterfly_l   q2,  q3,  q13, q14, d12, d13, d8,  d9,  d0[3], d0[2] @ q2,q3 = t6a, q13,q14 = t7a
 
 dbutterfly_n d28, d29, d8,  d9,  q10, q11, q13, q14, q4,  q6,  q10, q11 @ q14 = out[6], q4 = t7
 
@@ -594,13 +594,13 @@ endfunc
 
 function idct16
 mbutterfly0 d16, d24, d16, d24, d4, d6,  q2,  q3 @ d16 = t0a,  d24 = t1a
-mbutterfly  d20, d28, d0[1], d0[2], q2,  q3  @ d20 = t2a,  d28 = t3a
-mbutterfly  d18, d30, d0[3], d1[0], q2,  q3  @ d18 = t4a,  d30 = t7a
-mbutterfly  d26, d22, d1[1], d1[2], q2,  q3  @ d26 = t5a,  d22 = t6a
-mbutterfly  d17, d31, d1[3], d2[0], q2,  q3  @ d17 = t8a,  d31 = t15a
-mbutterfly  d25, d23, d2[1], d2[2], q2,  q3  @ d25 = t9a,  d23 = t14a
-mbutterfly  d21, d27, d2[3], d3[0], q2,  q3  @ d21 = t10a, d27 = t13a
-mbutterfly  d29, d19, d3[1], d3[2], q2,  q3  @ d29 = t11a, d19 = t12a
+mbutterfly  d20, d28, d0[2], d0[3], q2,  q3  @ d20 = t2a,  d28 = t3a
+mbutterfly  d18, d30, d1[0], d1[1], q2,  q3  @ d18 = t4a,  d30 = t7a
+mbutterfly  d26, d22, d1[2], d1[3], q2,  q3  @ d26 = t5a,  d22 = t6a
+mbutterfly  d17, d31, d2[0], d2[1], q2,  q3  @ d17 = t8a,  d31 = t15a
+mbutterfly  d25, d23, d2[2], d2[3], q2,  q3  @ d25 = t9a,  d23 = t14a
+mbutterfly  d21, d27, d3[0], d3[1], 

[FFmpeg-devel] [PATCH 32/34] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing

2017-03-08 Thread Martin Storsjö
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
09eb88a12e008d10a3f7a6be75d18ad98b368e68.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 124 ++---
 1 file changed, 62 insertions(+), 62 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index dd9fde1..31c6e3c 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -22,7 +22,7 @@
 #include "neon.S"
 
 const itxfm4_coeffs, align=4
-.short  11585, 6270, 15137, 0
+.short  11585, 0, 6270, 15137
 iadst4_coeffs:
 .short  5283, 15212, 9929, 13377
 endconst
@@ -30,8 +30,8 @@ endconst
 const iadst8_coeffs, align=4
 .short  16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679
 idct_coeffs:
-.short  11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606
-.short  16305, 12665, 10394, 7723, 14449, 15679, 4756, 0
+.short  11585, 0, 6270, 15137, 3196, 16069, 13623, 9102
+.short  1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756
 .short  804, 16364, 12140, 11003, 7005, 14811, 15426, 5520
 .short  3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404
 endconst
@@ -192,14 +192,14 @@ endconst
 .endm
 
 .macro idct4 c0, c1, c2, c3
-smull   v22.4s,\c1\().4h, v0.h[2]
-smull   v20.4s,\c1\().4h, v0.h[1]
+smull   v22.4s,\c1\().4h, v0.h[3]
+smull   v20.4s,\c1\().4h, v0.h[2]
 add v16.4h,\c0\().4h, \c2\().4h
 sub v17.4h,\c0\().4h, \c2\().4h
-smlal   v22.4s,\c3\().4h, v0.h[1]
+smlal   v22.4s,\c3\().4h, v0.h[2]
 smull   v18.4s,v16.4h,v0.h[0]
 smull   v19.4s,v17.4h,v0.h[0]
-smlsl   v20.4s,\c3\().4h, v0.h[2]
+smlsl   v20.4s,\c3\().4h, v0.h[3]
 rshrn   v22.4h,v22.4s,#14
 rshrn   v18.4h,v18.4s,#14
 rshrn   v19.4h,v19.4s,#14
@@ -326,9 +326,9 @@ itxfm_func4x4 iwht,  iwht
 
 .macro idct8
 dmbutterfly0 v16, v20, v16, v20, v2, v3, v4, v5, v6, v7 // v16 = t0a, v20 = t1a
-dmbutterfly v18, v22, v0.h[1], v0.h[2], v2, v3, v4, v5 // v18 = t2a, v22 = t3a
-dmbutterfly v17, v23, v0.h[3], v0.h[4], v2, v3, v4, v5 // v17 = t4a, v23 = t7a
-dmbutterfly v21, v19, v0.h[5], v0.h[6], v2, v3, v4, v5 // v21 = t5a, v19 = t6a
+dmbutterfly v18, v22, v0.h[2], v0.h[3], v2, v3, v4, v5 // v18 = t2a, v22 = t3a
+dmbutterfly v17, v23, v0.h[4], v0.h[5], v2, v3, v4, v5 // v17 = t4a, v23 = t7a
+dmbutterfly v21, v19, v0.h[6], v0.h[7], v2, v3, v4, v5 // v21 = t5a, v19 = t6a
 
 butterfly_8hv24, v25, v16, v22 // v24 = t0, v25 = t3
 butterfly_8hv28, v29, v17, v21 // v28 = t4, v29 = t5a
@@ -361,8 +361,8 @@ itxfm_func4x4 iwht,  iwht
 dmbutterfly0v19, v20, v6, v7, v24, v26, v27, v28, v29, v30   // 
v19 = -out[3], v20 = out[4]
 neg v19.8h,   v19.8h  // v19 = out[3]
 
-dmbutterfly_l   v26, v27, v28, v29, v5,  v3,  v0.h[1], v0.h[2]   // v26,v27 = t5a, v28,v29 = t4a
-dmbutterfly_l   v2,  v3,  v4,  v5,  v31, v25, v0.h[2], v0.h[1]   // v2,v3   = t6a, v4,v5   = t7a
+dmbutterfly_l   v26, v27, v28, v29, v5,  v3,  v0.h[2], v0.h[3]   // v26,v27 = t5a, v28,v29 = t4a
+dmbutterfly_l   v2,  v3,  v4,  v5,  v31, v25, v0.h[3], v0.h[2]   // v2,v3   = t6a, v4,v5   = t7a
 dbutterfly_n v17, v30, v28, v29, v2,  v3,  v6,  v7,  v24, v25 // v17 = -out[1], v30 = t6
 dbutterfly_n v22, v31, v26, v27, v4,  v5,  v6,  v7,  v24, v25 // v22 = out[6],  v31 = t7
v22 = out[6],  v31 = t7
@@ -543,13 +543,13 @@ endfunc
 
 function idct16
 dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a,  v24 = t1a
-dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
-dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
-dmbutterfly v26, v22, v0.h[5], v0.h[6], v2, v3, v4, v5 // v26 = t5a,  v22 = t6a
-dmbutterfly v17, v31, v0.h[7], v1.h[0], v2, v3, v4, v5 // v17 = t8a,  v31 = t15a
-dmbutterfly v25, v23, v1.h[1], v1.h[2], v2, v3, v4, v5 // v25 = t9a,  v23 = t14a
-dmbutterfly v21, v27, v1.h[3], v1.h[4], v2, v3, v4, v5 // v21 = t10a, v27 = t13a
-dmbutterfly v29, v19, v1.h[5], v1.h[6], v2, v3, v4, v5 // v29 = t11a, v19 = t12a
+dmbutterfly v20, v28, v0.h[2], v0.h[3], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
+dmbutterfly 

[FFmpeg-devel] [PATCH 15/34] aarch64: vp9mc: Simplify the extmla macro parameters

2017-03-08 Thread Martin Storsjö
Fold the field lengths into the macro.

This makes the macro invocations much more readable, since the
lines are shorter.

This also makes it easier to use only half the registers within
the macro.

This is cherrypicked from libav commit
5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8.
---
 libavcodec/aarch64/vp9mc_neon.S | 50 -
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S
index 80d1d23..9403911 100644
--- a/libavcodec/aarch64/vp9mc_neon.S
+++ b/libavcodec/aarch64/vp9mc_neon.S
@@ -193,41 +193,41 @@ endfunc
 // for size >= 16), and multiply-accumulate into dst1 and dst3 (or
 // dst1-dst2 and dst3-dst4 for size >= 16)
 .macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
-ext v20.16b, \src1, \src2, #(2*\offset)
-ext v22.16b, \src4, \src5, #(2*\offset)
+ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
+ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
 .if \size >= 16
-mla \dst1, v20.8h, v0.h[\offset]
-ext v21.16b, \src2, \src3, #(2*\offset)
-mla \dst3, v22.8h, v0.h[\offset]
-ext v23.16b, \src5, \src6, #(2*\offset)
-mla \dst2, v21.8h, v0.h[\offset]
-mla \dst4, v23.8h, v0.h[\offset]
+mla \dst1\().8h, v20.8h, v0.h[\offset]
+ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
+mla \dst3\().8h, v22.8h, v0.h[\offset]
+ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
+mla \dst2\().8h, v21.8h, v0.h[\offset]
+mla \dst4\().8h, v23.8h, v0.h[\offset]
 .else
-mla \dst1, v20.8h, v0.h[\offset]
-mla \dst3, v22.8h, v0.h[\offset]
+mla \dst1\().8h, v20.8h, v0.h[\offset]
+mla \dst3\().8h, v22.8h, v0.h[\offset]
 .endif
 .endm
 // The same as above, but don't accumulate straight into the
 // destination, but use a temp register and accumulate with saturation.
 .macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size
-ext v20.16b, \src1, \src2, #(2*\offset)
-ext v22.16b, \src4, \src5, #(2*\offset)
+ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset)
+ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset)
 .if \size >= 16
 mul v20.8h, v20.8h, v0.h[\offset]
-ext v21.16b, \src2, \src3, #(2*\offset)
+ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset)
 mul v22.8h, v22.8h, v0.h[\offset]
-ext v23.16b, \src5, \src6, #(2*\offset)
+ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset)
 mul v21.8h, v21.8h, v0.h[\offset]
 mul v23.8h, v23.8h, v0.h[\offset]
 .else
 mul v20.8h, v20.8h, v0.h[\offset]
 mul v22.8h, v22.8h, v0.h[\offset]
 .endif
-sqadd   \dst1, \dst1, v20.8h
-sqadd   \dst3, \dst3, v22.8h
+sqadd   \dst1\().8h, \dst1\().8h, v20.8h
+sqadd   \dst3\().8h, \dst3\().8h, v22.8h
 .if \size >= 16
-sqadd   \dst2, \dst2, v21.8h
-sqadd   \dst4, \dst4, v23.8h
+sqadd   \dst2\().8h, \dst2\().8h, v21.8h
+sqadd   \dst4\().8h, \dst4\().8h, v23.8h
 .endif
 .endm
 
@@ -291,13 +291,13 @@ function \type\()_8tap_\size\()h_\idx1\idx2
 mul v2.8h,  v5.8h,  v0.h[0]
 mul v25.8h, v17.8h, v0.h[0]
 .endif
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, 1, \size
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, 2, \size
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, \idx1, \size
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, 5, \size
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, 6, \size
-extmla  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, 7, \size
-extmulqadd  v1.8h,  v2.8h,  v24.8h, v25.8h, v4.16b,  v5.16b,  v6.16b,  v16.16b, v17.16b, v18.16b, \idx2, \size
+extmla  v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 1,     \size
+extmla  v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, 2,     \size
+extmla  v1,  v2,  v24, v25, v4,  v5,  v6,  v16, v17, v18, \idx1, \size
+  

[FFmpeg-devel] [PATCH 23/34] aarch64: vp9lpf: Interleave the start of flat8in into the calculation above

2017-03-08 Thread Martin Storsjö
This adds lots of extra .ifs, but speeds it up by a couple of cycles
by avoiding stalls.

This is cherrypicked from libav commit
b0806088d3b27044145b20421da8d39089ae0c6a.
---
 libavcodec/aarch64/vp9lpf_neon.S | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index 7fe2c88..cd3e26c 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -338,20 +338,28 @@
 
 uxtl_sz v0.8h,  v1.8h,  v22, \sz// p1
 uxtl_sz v2.8h,  v3.8h,  v25, \sz// q1
+.if \wd >= 8
+mov x5,  v6.d[0]
+.ifc \sz, .16b
+mov x6,  v6.d[1]
+.endif
+.endif
 saddw_szv0.8h,  v1.8h,  v0.8h,  v1.8h,  \tmp3, \sz // p1 + f
 ssubw_szv2.8h,  v3.8h,  v2.8h,  v3.8h,  \tmp3, \sz // q1 - f
 sqxtun_sz   v0,  v0.8h,  v1.8h, \sz // out p1
 sqxtun_sz   v2,  v2.8h,  v3.8h, \sz // out q1
+.if \wd >= 8
+.ifc \sz, .16b
+addsx5,  x5,  x6
+.endif
+.endif
 bit v22\sz, v0\sz,  v5\sz   // if (!hev && fm && !flat8in)
 bit v25\sz, v2\sz,  v5\sz
 
 // If no pixels need flat8in, jump to flat8out
 // (or to a writeout of the inner 4 pixels, for wd=8)
 .if \wd >= 8
-mov x5,  v6.d[0]
 .ifc \sz, .16b
-mov x6,  v6.d[1]
-addsx5,  x5,  x6
 b.eq6f
 .else
 cbz x5,  6f
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 29/34] arm: vp9itxfm: Avoid reloading the idct32 coefficients

2017-03-08 Thread Martin Storsjö
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:  Cortex A7   A8   A9  A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit
402546a17233a8815307df9e14ff88cd70424537.
---
 libavcodec/arm/vp9itxfm_neon.S | 246 -
 1 file changed, 120 insertions(+), 126 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index dee2f05..9385b01 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -1185,58 +1185,51 @@ function idct32x32_dc_add_neon
 endfunc
 
 .macro idct32_end
-butterfly   d16, d5,  d4,  d5  @ d16 = t16a, d5  = t19a
+butterfly   d16, d9,  d8,  d9  @ d16 = t16a, d9  = t19a
 butterfly   d17, d20, d23, d20 @ d17 = t17,  d20 = t18
-butterfly   d18, d6,  d7,  d6  @ d18 = t23a, d6  = t20a
+butterfly   d18, d10, d11, d10 @ d18 = t23a, d10 = t20a
 butterfly   d19, d21, d22, d21 @ d19 = t22,  d21 = t21
-butterfly   d4,  d28, d28, d30 @ d4  = t24a, d28 = t27a
+butterfly   d8,  d28, d28, d30 @ d8  = t24a, d28 = t27a
 butterfly   d23, d26, d25, d26 @ d23 = t25,  d26 = t26
-butterfly   d7,  d29, d29, d31 @ d7  = t31a, d29 = t28a
+butterfly   d11, d29, d29, d31 @ d11 = t31a, d29 = t28a
 butterfly   d22, d27, d24, d27 @ d22 = t30,  d27 = t29
 
 mbutterfly  d27, d20, d0[1], d0[2], q12, q15 @ d27 = t18a, d20 = t29a
-mbutterfly  d29, d5,  d0[1], d0[2], q12, q15 @ d29 = t19,  d5  = t28
-mbutterfly  d28, d6,  d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d6  = t20
+mbutterfly  d29, d9,  d0[1], d0[2], q12, q15 @ d29 = t19,  d9  = t28
+mbutterfly  d28, d10, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d10 = t20
 mbutterfly  d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
 
-butterfly   d31, d24, d7,  d4  @ d31 = t31,  d24 = t24
+butterfly   d31, d24, d11, d8  @ d31 = t31,  d24 = t24
 butterfly   d30, d25, d22, d23 @ d30 = t30a, d25 = t25a
 butterfly_r d23, d16, d16, d18 @ d23 = t23,  d16 = t16
 butterfly_r d22, d17, d17, d19 @ d22 = t22a, d17 = t17a
 butterfly   d18, d21, d27, d21 @ d18 = t18,  d21 = t21
-butterfly_r d27, d28, d5,  d28 @ d27 = t27a, d28 = t28a
-butterfly   d4,  d26, d20, d26 @ d4  = t29,  d26 = t26
-butterfly   d19, d20, d29, d6  @ d19 = t19a, d20 = t20
-vmov        d29, d4 @ d29 = t29
-
-mbutterfly0 d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27,  d20 = t20
-mbutterfly0 d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
-mbutterfly0 d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25,  d22 = t22
-mbutterfly0 d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a
+butterfly_r d27, d28, d9,  d28 @ d27 = t27a, d28 = t28a
+butterfly   d8,  d26, d20, d26 @ d8  = t29,  d26 = t26
+butterfly   d19, d20, d29, d10 @ d19 = t19a, d20 = t20
+vmov        d29, d8 @ d29 = t29
+
+mbutterfly0 d27, d20, d27, d20, d8, d10, q4, q5 @ d27 = t27,  d20 = t20
+mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a
+mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25,  d22 = t22
+mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a
 bx  lr
 .endm
 
 function idct32_odd
-movrel  r12, idct_coeffs
-add r12, r12, #32
-vld1.16 {q0-q1}, [r12,:128]
-
-mbutterfly  d16, d31, d0[0], d0[1], q2, q3 @ d16 = t16a, d31 = t31a
-mbutterfly  d24, d23, d0[2], d0[3], q2, q3 @ d24 = t17a, d23 = t30a
-mbutterfly  d20, d27, d1[0], d1[1], q2, q3 @ d20 = t18a, d27 = t29a
-mbutterfly  d28, d19, d1[2], d1[3], q2, q3 @ d28 = t19a, d19 = t28a
-mbutterfly  d18, d29, d2[0], d2[1], q2, q3 @ d18 = t20a, d29 = t27a
-mbutterfly  d26, d21, 

[FFmpeg-devel] [PATCH 26/34] arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit

2017-03-08 Thread Martin Storsjö
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                        Cortex A7     A8     A9    A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:        143.0  127.7  114.8   88.0         87.7
vp9_loop_filter_v_8_8_neon:        241.0  197.2  173.7  140.0        136.7
vp9_loop_filter_v_16_8_neon:       497.0  419.5  379.7  293.0        275.7
vp9_loop_filter_v_16_16_neon:      965.2  818.7  731.4  579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:        136.0  125.7  112.6   84.0         83.0
vp9_loop_filter_v_8_8_neon:        234.0  195.5  171.5  136.0        133.7
vp9_loop_filter_v_16_8_neon:       490.0  417.5  377.7  289.0        271.0
vp9_loop_filter_v_16_16_neon:      951.2  814.7  732.3  571.0        446.7

This is cherrypicked from libav commit
c582cb8537367721bb399a5d01b652c20142b756.
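
The safety argument, spelled out in C (lane arithmetic modelled with
scalar helpers, names illustrative): the true value of
abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) needs more than 8 bits, but since
E never exceeds 193, a sum that saturates at 255 still compares as
greater than E, exactly like the wide value would, so nothing is lost by
staying in 8-bit lanes:

    #include <stdint.h>

    static uint8_t uqadd8(uint8_t a, uint8_t b)  /* models uqadd */
    {
        unsigned s = (unsigned)a + b;
        return s > 255 ? 255 : (uint8_t)s;
    }

    /* One lane of the fm test. */
    static int fm_lane(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1,
                       uint8_t E)
    {
        uint8_t d0  = p0 > q0 ? p0 - q0 : q0 - p0;  /* abs(p0 - q0) */
        uint8_t d1  = p1 > q1 ? p1 - q1 : q1 - p1;  /* abs(p1 - q1) */
        uint8_t lhs = uqadd8(uqadd8(d0, d0), d1 >> 1);
        return lhs <= E;                            /* models cmhs */
    }
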
---
 libavcodec/aarch64/vp9lpf_neon.S | 40 +---
 libavcodec/arm/vp9lpf_neon.S | 11 +--
 2 files changed, 14 insertions(+), 37 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index ebfd9be..a9eea7f 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -51,13 +51,6 @@
 // see the arm version instead.
 
 
-.macro uabdl_sz dst1, dst2, in1, in2, sz
-uabdl   \dst1,  \in1\().8b,  \in2\().8b
-.ifc \sz, .16b
-uabdl2  \dst2,  \in1\().16b, \in2\().16b
-.endif
-.endm
-
 .macro add_sz dst1, dst2, in1, in2, in3, in4, sz
 add \dst1,  \in1,  \in3
 .ifc \sz, .16b
@@ -86,20 +79,6 @@
 .endif
 .endm
 
-.macro cmhs_sz dst1, dst2, in1, in2, in3, in4, sz
-cmhs\dst1,  \in1,  \in3
-.ifc \sz, .16b
-cmhs\dst2,  \in2,  \in4
-.endif
-.endm
-
-.macro xtn_sz dst, in1, in2, sz
-xtn \dst\().8b,  \in1
-.ifc \sz, .16b
-xtn2\dst\().16b, \in2
-.endif
-.endm
-
 .macro usubl_sz dst1, dst2, in1, in2, sz
 usubl   \dst1,  \in1\().8b,  \in2\().8b
 .ifc \sz, .16b
@@ -179,20 +158,20 @@
 // tmpq2 == tmp3 + tmp4, etc.
 .macro loop_filter wd, sz, mix, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8
 .if \mix == 0
-dup v0.8h,  w2// E
-dup v1.8h,  w2// E
+dup v0\sz,  w2// E
 dup v2\sz,  w3// I
 dup v3\sz,  w4// H
 .else
-dup v0.8h,  w2// E
+dup v0.8b,  w2// E
 dup v2.8b,  w3// I
 dup v3.8b,  w4// H
+lsr w5, w2,  #8
 lsr w6, w3,  #8
 lsr w7, w4,  #8
-ushrv1.8h,  v0.8h, #8 // E
+dup v1.8b,  w5// E
 dup v4.8b,  w6// I
-bic v0.8h,  #255, lsl 8 // E
 dup v5.8b,  w7// H
+trn1v0.2d,  v0.2d,  v1.2d
 trn1v2.2d,  v2.2d,  v4.2d
 trn1v3.2d,  v3.2d,  v5.2d
 .endif
@@ -206,16 +185,15 @@
 umaxv4\sz,  v4\sz,  v5\sz
 umaxv5\sz,  v6\sz,  v7\sz
 umax\tmp1\sz, \tmp1\sz, \tmp2\sz
-uabdl_sz    v6.8h,  v7.8h,  v23, v24, \sz // abs(p0 - q0)
 umax        v4\sz,  v4\sz,  v5\sz
-add_sz      v6.8h,  v7.8h,  v6.8h,  v7.8h,  v6.8h,  v7.8h, \sz // abs(p0 - q0) * 2
+uabd        v6\sz,  v23\sz, v24\sz // abs(p0 - q0)
+uqadd       v6\sz,  v6\sz,  v6\sz // abs(p0 - q0) * 2
 uabd        v5\sz,  v22\sz, v25\sz // abs(p1 - q1)
 umax        v4\sz,  v4\sz,  \tmp1\sz // max(abs(p3 - p2), ..., abs(q2 - q3))
 ushr        v5\sz,  v5\sz,  #1
 cmhs        v4\sz,  v2\sz,  v4\sz // max(abs()) <= I
-uaddw_sz    v6.8h,  v7.8h,  v6.8h,  v7.8h,  v5, \sz // abs(p0 - q0) * 2 + abs(p1 - q1) >> 1
-cmhs_sz     v6.8h,  v7.8h,  v0.8h,  v1.8h,  v6.8h,  v7.8h, \sz
-xtn_sz      v5, v6.8h,  v7.8h,  \sz
+uqadd       v6\sz,  v6\sz,  v5\sz // abs(p0 - q0) * 2 + abs(p1 - q1) >> 1
+cmhs        v5\sz,  v0\sz,  v6\sz
 and         v4\sz,  v4\sz,  v5\sz // fm
 
 // If no pixels need filtering, just exit as soon as possible
diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
index b90c536..2d91092 100644
--- a/libavcodec/arm/vp9lpf_neon.S
+++ b/libavcodec/arm/vp9lpf_neon.S
@@ -51,7 +51,7 @@
 @ and d28-d31 as temp registers, or d8-d15.
 @ tmp1,tmp2 = tmpq1, tmp3,tmp4 = tmpq2, tmp5,tmp6 = tmpq3, tmp7,tmp8 = tmpq4
 .macro loop_filter wd, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8, tmpq1, tmpq2, tmpq3, tmpq4
-vdup.u16q0,  r2 @ E
+vdup.u8 d0,  r2 @ E
 vdup.u8  

[FFmpeg-devel] [PATCH 25/34] aarch64: Add parentheses around the offset parameter in movrel

2017-03-08 Thread Martin Storsjö
This fixes building with clang for Linux with PIC enabled.

This is cherrypicked from libav commit
8847eeaa14189885038140fb2b8a7adc7100.
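
The rule being applied is the same one C programmers know from
function-like macros: an argument may itself be an expression, so it must
be parenthesized at the point of use or the operators rebind. The classic
two-line C illustration of the hazard:

    #define TWICE_BAD(x) (x * 2)   /* TWICE_BAD(1 + 2) -> 1 + 2 * 2 == 5 */
    #define TWICE_OK(x)  ((x) * 2) /* TWICE_OK(1 + 2)  -> (1 + 2) * 2 == 6 */
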
---
 libavutil/aarch64/asm.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavutil/aarch64/asm.S b/libavutil/aarch64/asm.S
index 523b8c5..4289729 100644
--- a/libavutil/aarch64/asm.S
+++ b/libavutil/aarch64/asm.S
@@ -83,8 +83,8 @@ ELF .size   \name, . - \name
 add \rd, \rd, \val+(\offset)@PAGEOFF
 .endif
 #elif CONFIG_PIC
-adrp\rd, \val+\offset
-add \rd, \rd, :lo12:\val+\offset
+adrp\rd, \val+(\offset)
+add \rd, \rd, :lo12:\val+(\offset)
 #else
 ldr \rd, =\val+\offset
 #endif
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 24/34] aarch64: vp9lpf: Fix broken indentation/vertical alignment

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
07b5136c481d394992c7e951967df0cfbb346c0b.
---
 libavcodec/aarch64/vp9lpf_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index cd3e26c..ebfd9be 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -417,7 +417,7 @@
 mov x5,  v2.d[0]
 .ifc \sz, .16b
 mov x6,  v2.d[1]
-adds x5,  x5,  x6
+addsx5,  x5,  x6
 b.ne1f
 .else
 cbnzx5,  1f
@@ -430,7 +430,7 @@
 mov x5,  v7.d[0]
 .ifc \sz, .16b
 mov x6,  v7.d[1]
-adds x5,  x5,  x6
+addsx5,  x5,  x6
 b.ne1f
 .else
 cbnzx5,  1f
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 30/34] aarch64: vp9itxfm: Avoid reloading the idct32 coefficients

2017-03-08 Thread Martin Storsjö
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip pushing d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3

This is cherrypicked from libav commit
65aa002d54433154a6924dc13e498bec98451ad0.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 110 +++--
 1 file changed, 43 insertions(+), 67 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index be65eb7..dd9fde1 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -1123,18 +1123,14 @@ endfunc
 .endm
 
 function idct32_odd
-ld1 {v0.8h,v1.8h}, [x11]
-
-dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = 
t16a, v31 = t31a
-dmbutterfly v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = 
t17a, v23 = t30a
-dmbutterfly v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = 
t18a, v27 = t29a
-dmbutterfly v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = 
t19a, v19 = t28a
-dmbutterfly v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = 
t20a, v29 = t27a
-dmbutterfly v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = 
t21a, v21 = t26a
-dmbutterfly v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = 
t22a, v25 = t25a
-dmbutterfly v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = 
t23a, v17 = t24a
-
-ld1 {v0.8h}, [x10]
+dmbutterfly v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = 
t16a, v31 = t31a
+dmbutterfly v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = 
t17a, v23 = t30a
+dmbutterfly v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = 
t18a, v27 = t29a
+dmbutterfly v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = 
t19a, v19 = t28a
+dmbutterfly v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = 
t20a, v29 = t27a
+dmbutterfly v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = 
t21a, v21 = t26a
+dmbutterfly v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = 
t22a, v25 = t25a
+dmbutterfly v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = 
t23a, v17 = t24a
 
 butterfly_8hv4,  v24, v16, v24 // v4  = t16, v24 = t17
 butterfly_8hv5,  v20, v28, v20 // v5  = t19, v20 = t18
@@ -1153,18 +1149,14 @@ function idct32_odd
 endfunc
 
 function idct32_odd_half
-ld1 {v0.8h,v1.8h}, [x11]
-
-dmbutterfly_h1  v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = 
t16a, v31 = t31a
-dmbutterfly_h2  v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = 
t17a, v23 = t30a
-dmbutterfly_h1  v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = 
t18a, v27 = t29a
-dmbutterfly_h2  v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = 
t19a, v19 = t28a
-dmbutterfly_h1  v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = 
t20a, v29 = t27a
-dmbutterfly_h2  v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = 
t21a, v21 = t26a
-dmbutterfly_h1  v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = 
t22a, v25 = t25a
-dmbutterfly_h2  v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = 
t23a, v17 = t24a
-
-ld1 {v0.8h}, [x10]
+dmbutterfly_h1  v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = 
t16a, v31 = t31a
+dmbutterfly_h2  v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = 
t17a, v23 = t30a
+dmbutterfly_h1  v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = 
t18a, v27 = t29a
+dmbutterfly_h2  v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = 
t19a, v19 = t28a
+dmbutterfly_h1  v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = 
t20a, v29 = t27a
+dmbutterfly_h2  v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = 
t21a, v21 = t26a
+dmbutterfly_h1  v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = 
t22a, v25 = t25a
+dmbutterfly_h2  v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = 
t23a, v17 = t24a
 
 butterfly_8hv4,  v24, v16, v24 // v4  = t16, v24 = t17
 butterfly_8hv5,  v20, v28, v20 // v5  = t19, v20 = t18
@@ -1183,18 +1175,14 @@ function idct32_odd_half
 endfunc
 
 function idct32_odd_quarter
-ld1 {v0.8h,v1.8h}, [x11]
-
-dsmull_hv4,  v5,  v16, v0.h[0]
-dsmull_hv28, v29, v19, v0.h[7]
-dsmull_hv30, v31, v16, v0.h[1]
-dsmull_hv22, v23, v17, v1.h[6]
-dsmull_hv7,  v6,  v17, v1.h[7]
-dsmull_hv26, v27, v19, v0.h[6]
-dsmull_hv20, v21, v18, v1.h[0]
-dsmull_h

[FFmpeg-devel] [PATCH 27/34] aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1

2017-03-08 Thread Martin Storsjö
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

This is cherrypicked from libav commit
3bf9c48320f25f3d5557485b0202f22ae60748b0.
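
What the three instructions compute, modelled in C (the packed 16-bit
input carries the first half's threshold in the low byte and the second
half's in the high byte; the function name is made up):

    #include <stdint.h>

    static void spread_thresholds(uint8_t out[16], uint16_t packed)
    {
        uint8_t v[16], r[16];
        for (int i = 0; i < 8; i++) {      /* dup v0.8h, w2 */
            v[2*i]     = packed & 0xff;
            v[2*i + 1] = packed >> 8;
        }
        for (int i = 0; i < 8; i++) {      /* rev16 v1.16b, v0.16b */
            r[2*i]     = v[2*i + 1];
            r[2*i + 1] = v[2*i];
        }
        for (int i = 0; i < 8; i++) {      /* uzp1 v0.16b, v0.16b, v1.16b */
            out[i]     = v[2*i];           /* lanes 0-7:  low byte  */
            out[8 + i] = r[2*i];           /* lanes 8-15: high byte */
        }
    }
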
---
 libavcodec/aarch64/vp9lpf_neon.S | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index a9eea7f..0878763 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -162,18 +162,15 @@
 dup v2\sz,  w3// I
 dup v3\sz,  w4// H
 .else
-dup v0.8b,  w2// E
-dup v2.8b,  w3// I
-dup v3.8b,  w4// H
-lsr w5, w2,  #8
-lsr w6, w3,  #8
-lsr w7, w4,  #8
-dup v1.8b,  w5// E
-dup v4.8b,  w6// I
-dup v5.8b,  w7// H
-trn1v0.2d,  v0.2d,  v1.2d
-trn1v2.2d,  v2.2d,  v4.2d
-trn1v3.2d,  v3.2d,  v5.2d
+dup v0.8h,  w2// E
+dup v2.8h,  w3// I
+dup v3.8h,  w4// H
+rev16   v1.16b, v0.16b// E
+rev16   v4.16b, v2.16b// I
+rev16   v5.16b, v3.16b// H
+uzp1v0.16b, v0.16b, v1.16b
+uzp1v2.16b, v2.16b, v4.16b
+uzp1v3.16b, v3.16b, v5.16b
 .endif
 
 uabdv4\sz,  v20\sz, v21\sz// abs(p3 - p2)
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 28/34] arm: vp9lpf: Implement the mix2_44 function with one single filter pass

2017-03-08 Thread Martin Storsjö
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.

The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.

Before:  Cortex A7  A8 A9 A53
vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
After:
vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0

This is cherrypicked from libav commit
575e31e931e4178e9f1e24407503c9b4ec0ef9ba.
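
The two filters' parameters arrive packed, one byte per half, and the
loop_filter_q macro below splits them with vdup/lsr pairs. A C model of
that unpacking (the helper name is invented):

    #include <stdint.h>

    static void unpack_thresholds(uint8_t lanes[16], unsigned packed)
    {
        for (int i = 0; i < 8; i++) {
            lanes[i]     = packed & 0xff;  /* vdup.u8 d0, r2      */
            lanes[8 + i] = packed >> 8;    /* lsr; vdup.u8 d1, r2 */
        }
    }
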
---
 libavcodec/arm/vp9dsp_init_arm.c |   7 +-
 libavcodec/arm/vp9lpf_neon.S | 191 +++
 2 files changed, 195 insertions(+), 3 deletions(-)

diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c
index f7b539e..4c57fd6 100644
--- a/libavcodec/arm/vp9dsp_init_arm.c
+++ b/libavcodec/arm/vp9dsp_init_arm.c
@@ -195,6 +195,8 @@ define_loop_filters(8, 8);
 define_loop_filters(16, 8);
 define_loop_filters(16, 16);
 
+define_loop_filters(44, 16);
+
 #define lf_mix_fn(dir, wd1, wd2, stridea)                                    \
 static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst,           \
                                                      ptrdiff_t stride,       \
@@ -208,7 +210,6 @@ static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst,
 lf_mix_fn(h, wd1, wd2, stride) \
 lf_mix_fn(v, wd1, wd2, sizeof(uint8_t))
 
-lf_mix_fns(4, 4)
 lf_mix_fns(4, 8)
 lf_mix_fns(8, 4)
 lf_mix_fns(8, 8)
@@ -228,8 +229,8 @@ static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp)
 dsp->loop_filter_16[0] = ff_vp9_loop_filter_h_16_16_neon;
 dsp->loop_filter_16[1] = ff_vp9_loop_filter_v_16_16_neon;
 
-dsp->loop_filter_mix2[0][0][0] = loop_filter_h_44_16_neon;
-dsp->loop_filter_mix2[0][0][1] = loop_filter_v_44_16_neon;
+dsp->loop_filter_mix2[0][0][0] = ff_vp9_loop_filter_h_44_16_neon;
+dsp->loop_filter_mix2[0][0][1] = ff_vp9_loop_filter_v_44_16_neon;
 dsp->loop_filter_mix2[0][1][0] = loop_filter_h_48_16_neon;
 dsp->loop_filter_mix2[0][1][1] = loop_filter_v_48_16_neon;
 dsp->loop_filter_mix2[1][0][0] = loop_filter_h_84_16_neon;
diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
index 2d91092..8d44d58 100644
--- a/libavcodec/arm/vp9lpf_neon.S
+++ b/libavcodec/arm/vp9lpf_neon.S
@@ -44,6 +44,109 @@
 vtrn.8  \r2,  \r3
 .endm
 
+@ The input to and output from this macro is in the registers q8-q15,
+@ and q0-q7 are used as scratch registers.
+@ p3 = q8, p0 = q11, q0 = q12, q3 = q15
+.macro loop_filter_q
+vdup.u8 d0,  r2  @ E
+lsr r2,  r2,  #8
+vdup.u8 d2,  r3  @ I
+lsr r3,  r3,  #8
+vdup.u8 d1,  r2  @ E
+vdup.u8 d3,  r3  @ I
+
+vabd.u8 q2,  q8,  q9 @ abs(p3 - p2)
+vabd.u8 q3,  q9,  q10@ abs(p2 - p1)
+vabd.u8 q4,  q10, q11@ abs(p1 - p0)
+vabd.u8 q5,  q12, q13@ abs(q0 - q1)
+vabd.u8 q6,  q13, q14@ abs(q1 - q2)
+vabd.u8 q7,  q14, q15@ abs(q2 - q3)
+vmax.u8 q2,  q2,  q3
+vmax.u8 q3,  q4,  q5
+vmax.u8 q4,  q6,  q7
+vabd.u8 q5,  q11, q12@ abs(p0 - q0)
+vmax.u8 q2,  q2,  q3
+vqadd.u8q5,  q5,  q5 @ abs(p0 - q0) * 2
+vabd.u8 q7,  q10, q13@ abs(p1 - q1)
+vmax.u8 q2,  q2,  q4 @ max(abs(p3 - p2), ..., abs(q2 - q3))
+vshr.u8 q7,  q7,  #1
+vcle.u8 q2,  q2,  q1 @ max(abs()) <= I
+vqadd.u8q5,  q5,  q7 @ abs(p0 - q0) * 2 + abs(p1 - q1) >> 1
+vcle.u8 q5,  q5,  q0
+vandq2,  q2,  q5 @ fm
+
+vshrn.u16   d10, q2,  #4
+vmovr2,  r3,  d10
+orrsr2,  r2,  r3
+@ If no pixels need filtering, just exit as soon as possible
+beq 9f
+
+@ Calculate the normal inner loop filter for 2 or 4 pixels
+ldr r3,  [sp, #64]
+vabd.u8 q3,  q10, q11@ abs(p1 - p0)
+vabd.u8 q4,  q13, q12@ abs(q1 - q0)
+
+vsubl.u8q5,  d20, d26@ p1 - q1
+vsubl.u8q6,  d21, d27@ p1 - q1
+vmax.u8 q3,  q3,  q4 @ max(abs(p1 - p0), abs(q1 - q0))
+vqmovn.s16  d10, q5  @ av_clip_int8p(p1 - q1)
+vqmovn.s16  d11, q6  @ av_clip_int8p(p1 - q1)
+vdup.u8 d8,  r3  @ H
+lsr r3,  r3,  #8
+vdup.u8 d9,  r3  @ H
+vsubl.u8q6,  d24, d22@ q0 - p0
+vsubl.u8

[FFmpeg-devel] [PATCH 33/34] arm: vp9itxfm: Reorder iadst16 coeffs

2017-03-08 Thread Martin Storsjö
This matches the order they are in the 16 bpp version.

There they are kept in this order to make sure we access them in the
same order they are declared, which makes it easier to load only half
of the coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
08074c092d8c97d71c5986e5325e97ffc956119d.
---
 libavcodec/arm/vp9itxfm_neon.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 05e31e6..ebbbda9 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -37,8 +37,8 @@ idct_coeffs:
 endconst
 
 const iadst16_coeffs, align=4
-.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
-.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
+.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
+.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
 endconst
 
 @ Do four 4x4 transposes, using q registers for the subtransposes that don't
@@ -678,19 +678,19 @@ function iadst16
 vld1.16 {q0-q1}, [r12,:128]
 
 mbutterfly_lq3,  q2,  d31, d16, d0[1], d0[0] @ q3  = t1,   q2  = t0
-mbutterfly_lq5,  q4,  d23, d24, d2[1], d2[0] @ q5  = t9,   q4  = t8
+mbutterfly_lq5,  q4,  d23, d24, d1[1], d1[0] @ q5  = t9,   q4  = t8
 butterfly_n d31, d24, q3,  q5,  q6,  q5  @ d31 = t1a,  d24 = 
t9a
 mbutterfly_lq7,  q6,  d29, d18, d0[3], d0[2] @ q7  = t3,   q6  = t2
 butterfly_n d16, d23, q2,  q4,  q3,  q4  @ d16 = t0a,  d23 = 
t8a
 
-mbutterfly_lq3,  q2,  d21, d26, d2[3], d2[2] @ q3  = t11,  q2  = 
t10
+mbutterfly_lq3,  q2,  d21, d26, d1[3], d1[2] @ q3  = t11,  q2  = 
t10
 butterfly_n d29, d26, q7,  q3,  q4,  q3  @ d29 = t3a,  d26 = 
t11a
-mbutterfly_lq5,  q4,  d27, d20, d1[1], d1[0] @ q5  = t5,   q4  = t4
+mbutterfly_lq5,  q4,  d27, d20, d2[1], d2[0] @ q5  = t5,   q4  = t4
 butterfly_n d18, d21, q6,  q2,  q3,  q2  @ d18 = t2a,  d21 = 
t10a
 
 mbutterfly_lq7,  q6,  d19, d28, d3[1], d3[0] @ q7  = t13,  q6  = 
t12
 butterfly_n d20, d28, q5,  q7,  q2,  q7  @ d20 = t5a,  d28 = 
t13a
-mbutterfly_lq3,  q2,  d25, d22, d1[3], d1[2] @ q3  = t7,   q2  = t6
+mbutterfly_lq3,  q2,  d25, d22, d2[3], d2[2] @ q3  = t7,   q2  = t6
 butterfly_n d27, d19, q4,  q6,  q5,  q6  @ d27 = t4a,  d19 = 
t12a
 
 mbutterfly_lq5,  q4,  d17, d30, d3[3], d3[2] @ q5  = t15,  q4  = 
t14
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 34/34] aarch64: vp9itxfm: Reorder iadst16 coeffs

2017-03-08 Thread Martin Storsjö
This matches the order they are in the 16 bpp version.

There they are kept in this order to make sure we access them in the
same order they are declared, which makes it easier to load only half
of the coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
b8f66c0838b4c645227f23a35b4d54373da4c60a.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 31c6e3c..2c3c002 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -37,8 +37,8 @@ idct_coeffs:
 endconst
 
 const iadst16_coeffs, align=4
-.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
-.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
+.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
+.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
 endconst
 
 // out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14
@@ -628,19 +628,19 @@ function iadst16
 ld1 {v0.8h,v1.8h}, [x11]
 
 dmbutterfly_l   v6,  v7,  v4,  v5,  v31, v16, v0.h[1], v0.h[0]   // 
v6,v7   = t1,   v4,v5   = t0
-dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v1.h[1], v1.h[0]   // 
v10,v11 = t9,   v8,v9   = t8
+dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v0.h[5], v0.h[4]   // 
v10,v11 = t9,   v8,v9   = t8
 dbutterfly_nv31, v24, v6,  v7,  v10, v11, v12, v13, v10, v11 // 
v31 = t1a,  v24 = t9a
 dmbutterfly_l   v14, v15, v12, v13, v29, v18, v0.h[3], v0.h[2]   // 
v14,v15 = t3,   v12,v13 = t2
 dbutterfly_nv16, v23, v4,  v5,  v8,  v9,  v6,  v7,  v8,  v9  // 
v16 = t0a,  v23 = t8a
 
-dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v1.h[3], v1.h[2]   // 
v6,v7   = t11,  v4,v5   = t10
+dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v0.h[7], v0.h[6]   // 
v6,v7   = t11,  v4,v5   = t10
 dbutterfly_nv29, v26, v14, v15, v6,  v7,  v8,  v9,  v6,  v7  // 
v29 = t3a,  v26 = t11a
-dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v0.h[5], v0.h[4]   // 
v10,v11 = t5,   v8,v9   = t4
+dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v1.h[1], v1.h[0]   // 
v10,v11 = t5,   v8,v9   = t4
 dbutterfly_nv18, v21, v12, v13, v4,  v5,  v6,  v7,  v4,  v5  // 
v18 = t2a,  v21 = t10a
 
 dmbutterfly_l   v14, v15, v12, v13, v19, v28, v1.h[5], v1.h[4]   // 
v14,v15 = t13,  v12,v13 = t12
 dbutterfly_nv20, v28, v10, v11, v14, v15, v4,  v5,  v14, v15 // 
v20 = t5a,  v28 = t13a
-dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v0.h[7], v0.h[6]   // 
v6,v7   = t7,   v4,v5   = t6
+dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v1.h[3], v1.h[2]   // 
v6,v7   = t7,   v4,v5   = t6
 dbutterfly_nv27, v19, v8,  v9,  v12, v13, v10, v11, v12, v13 // 
v27 = t4a,  v19 = t12a
 
 dmbutterfly_l   v10, v11, v8,  v9,  v17, v30, v1.h[7], v1.h[6]   // 
v10,v11 = t15,  v8,v9   = t14
-- 
2.7.4

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 03/34] arm: vp9itxfm: Make the larger core transforms standalone functions

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.

This gives a small slowdown of a couple of tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                                Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_neon:      2063.4   1516.0   1719.5   1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon:     3279.3   2454.5   2525.2   1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon:     10750.0   7955.4   8525.6   6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon:    18574.0  17108.4  14216.7  12010.2

After:
vp9_inv_dct_dct_16x16_sub4_add_neon:      2060.8   1608.5   1735.7   1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon:     3211.2   2443.5   2546.1   1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon:     10682.0   8043.8   8581.3   6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon:    18522.4  17277.4  14286.7  12087.9

This is cherrypicked from libav commit
0331c3f5e8cb6e6b53fab7893e91d1be1bfa979c.
---
 libavcodec/arm/vp9itxfm_neon.S | 43 +-
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 93816d2..328bb01 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -534,7 +534,7 @@ function idct16x16_dc_add_neon
 endfunc
 .ltorg
 
-.macro idct16
+function idct16
 mbutterfly0 d16, d24, d16, d24, d4, d6,  q2,  q3 @ d16 = t0a,  d24 = t1a
 mbutterfly  d20, d28, d0[1], d0[2], q2,  q3  @ d20 = t2a,  d28 = t3a
 mbutterfly  d18, d30, d0[3], d1[0], q2,  q3  @ d18 = t4a,  d30 = t7a
@@ -580,9 +580,10 @@ endfunc
 vmov        d4,  d21 @ d4  = t10a
 butterfly   d20, d27, d6,  d27 @ d20 = out[4], d27 = out[11]
 butterfly   d21, d26, d26, d4 @ d21 = out[5], d26 = out[10]
-.endm
+bx  lr
+endfunc
 
-.macro iadst16
+function iadst16
 movrel  r12, iadst16_coeffs
 vld1.16 {q0-q1}, [r12,:128]
 
@@ -653,7 +654,8 @@ endfunc
 
 vmovd16, d2
 vmovd30, d4
-.endm
+bx  lr
+endfunc
 
 .macro itxfm16_1d_funcs txfm
 @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
@@ -662,6 +664,8 @@ endfunc
 @ r1 = slice offset
 @ r2 = src
 function \txfm\()16_1d_4x16_pass1_neon
+        push            {lr}
+
         mov             r12, #32
         vmov.s16        q2, #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
@@ -669,7 +673,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 vst1.16 {d4},  [r2,:64], r12
 .endr
 
-\txfm\()16
+bl  \txfm\()16
 
 @ Do four 4x4 transposes. Originally, d16-d31 contain the
 @ 16 rows. Afterwards, d16-d19, d20-d23, d24-d27, d28-d31
@@ -682,7 +686,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
 vst1.16 {d\i}, [r0,:64]!
 .endr
-bx  lr
+pop {pc}
 1:
 @ Special case: For the last input column (r1 == 12),
 @ which would be stored as the last row in the temp buffer,
@@ -709,7 +713,7 @@ function \txfm\()16_1d_4x16_pass1_neon
 vmovd29, d17
 vmovd30, d18
 vmovd31, d19
-bx  lr
+pop {pc}
 endfunc
 
 @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
@@ -719,6 +723,7 @@ endfunc
 @ r2 = src (temp buffer)
 @ r3 = slice offset
 function \txfm\()16_1d_4x16_pass2_neon
+        push            {lr}
         mov             r12, #32
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27
 vld1.16 {d\i}, [r2,:64], r12
@@ -732,7 +737,7 @@ function \txfm\()16_1d_4x16_pass2_neon
 
 add r3,  r0,  r1
 lsl r1,  r1,  #1
-\txfm\()16
+bl  \txfm\()16
 
 .macro load_add_store coef0, coef1, coef2, coef3
 vrshr.s16   \coef0, \coef0, #6
@@ -773,7 +778,7 @@ function \txfm\()16_1d_4x16_pass2_neon
 load_add_store  q12, q13, q14, q15
 .purgem load_add_store
 
-bx  lr
+pop {pc}
 endfunc
 .endm
 
@@ -908,7 +913,7 @@ function idct32x32_dc_add_neon
 bx  lr
 endfunc
 
-.macro idct32_odd
+function idct32_odd
 movrel  r12, idct_coeffs
 add r12, r12, #32
 vld1.16 {q0-q1}, [r12,:128]
@@ -967,7 +972,8 @@ endfunc
         mbutterfly0     d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
         mbutterfly0     d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25,  d22 = t22
         mbutterfly0     d24, d23, d24, 

[FFmpeg-devel] [PATCH 22/34] arm: vp9lpf: Interleave the start of flat8in into the calculation above

2017-03-08 Thread Martin Storsjö
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
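
Schematically, the change moves independent scalar work into the gaps
between dependent NEON instructions, guarded so it is only emitted for
the widths that need it (a simplified sketch of the pattern in the
diff below):

        vmovl.u8        q0,  d22        @ NEON work
.if \wd >= 8
        vmov            r2,  r3,  d6    @ started early; result used later
.endif
        vaddw.s8        q0,  q0,  \tmp3 @ more NEON work hides the latency
.if \wd >= 8
        orrs            r2,  r2,  r3    @ consumed after the gap
.endif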

This is cherrypicked from libav commit
e18c39005ad1dbb178b336f691da1de91afd434e.
---
 libavcodec/arm/vp9lpf_neon.S | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
index 3d289e5..b90c536 100644
--- a/libavcodec/arm/vp9lpf_neon.S
+++ b/libavcodec/arm/vp9lpf_neon.S
@@ -182,16 +182,20 @@
 
         vmovl.u8        q0,  d22            @ p1
         vmovl.u8        q1,  d25            @ q1
+.if \wd >= 8
+        vmov            r2,  r3,  d6
+.endif
         vaddw.s8        q0,  q0,  \tmp3     @ p1 + f
         vsubw.s8        q1,  q1,  \tmp3     @ q1 - f
+.if \wd >= 8
+        orrs            r2,  r2,  r3
+.endif
         vqmovun.s16     d0,  q0             @ out p1
         vqmovun.s16     d2,  q1             @ out q1
         vbit            d22, d0,  d5        @ if (!hev && fm && !flat8in)
         vbit            d25, d2,  d5
 
 .if \wd >= 8
-        vmov            r2,  r3,  d6
-        orrs            r2,  r2,  r3
 @ If no pixels need flat8in, jump to flat8out
 @ (or to a writeout of the inner 4 pixels, for wd=8)
 beq 6f
-- 
2.7.4



[FFmpeg-devel] [PATCH 21/34] arm: vp9lpf: Use orrs instead of orr+cmp

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
435cd7bc99671bf561193421a50ac6e9d63c4266.
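
The transformation is mechanical: orrs sets the condition flags as a
side effect, so the separate cmp against zero can be dropped. For
example:

-        orr             r2,  r2,  r3
-        cmp             r2,  #0
+        orrs            r2,  r2,  r3
         beq             9f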
---
 libavcodec/arm/vp9lpf_neon.S | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
index 2761956..3d289e5 100644
--- a/libavcodec/arm/vp9lpf_neon.S
+++ b/libavcodec/arm/vp9lpf_neon.S
@@ -78,8 +78,7 @@
 
 vdup.u8 d3,  r3  @ H
         vmov            r2,  r3,  d4
-        orr             r2,  r2,  r3
-        cmp             r2,  #0
+        orrs            r2,  r2,  r3
 @ If no pixels need filtering, just exit as soon as possible
 beq 9f
 
@@ -192,8 +191,7 @@
 
 .if \wd >= 8
         vmov            r2,  r3,  d6
-        orr             r2,  r2,  r3
-        cmp             r2,  #0
+        orrs            r2,  r2,  r3
 @ If no pixels need flat8in, jump to flat8out
 @ (or to a writeout of the inner 4 pixels, for wd=8)
 beq 6f
@@ -248,14 +246,12 @@
 6:
         vorr            d2,  d6,  d7
         vmov            r2,  r3,  d2
-        orr             r2,  r2,  r3
-        cmp             r2,  #0
+        orrs            r2,  r2,  r3
 @ If no pixels needed flat8in nor flat8out, jump to a
 @ writeout of the inner 4 pixels
 beq 7f
         vmov            r2,  r3,  d7
-        orr             r2,  r2,  r3
-        cmp             r2,  #0
+        orrs            r2,  r2,  r3
 @ If no pixels need flat8out, jump to a writeout of the inner 6 pixels
 beq 8f
 
-- 
2.7.4



[FFmpeg-devel] [PATCH 06/34] aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-08 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the
pass2 function.
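
Schematically (a simplified sketch; the function name here is
illustrative, not the real one): defining the macro at file scope,
instead of inside one function and purging it after use, lets any
number of pass2 variants expand it:

.macro load_add_store c0, c1
        srshr           \c0, \c0, #6     // round and shift down
        srshr           \c1, \c1, #6
        // ... load, accumulate, narrow and store ...
.endm

function example_16x16_pass2_neon
        load_add_store  v16.8h, v17.8h   // now expandable from any variant
        ret
endfunc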

This is cherrypicked from libav commit
79d332ebbde8c0a3e9da094dcfd10abd33ba7378.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 90 +++---
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index a37b459..e45d385 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -598,6 +598,51 @@ endfunc
 st1 {v2.8h},  [\src], \inc
 .endm
 
+.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2
+srshr   \coef0, \coef0, #6
+ld1 {v2.8b},  [x0], x1
+srshr   \coef1, \coef1, #6
+ld1 {v3.8b},  [x3], x1
+srshr   \coef2, \coef2, #6
+ld1 {v4.8b},  [x0], x1
+srshr   \coef3, \coef3, #6
+uaddw   \coef0, \coef0, v2.8b
+ld1 {v5.8b},  [x3], x1
+uaddw   \coef1, \coef1, v3.8b
+srshr   \coef4, \coef4, #6
+ld1 {v6.8b},  [x0], x1
+srshr   \coef5, \coef5, #6
+ld1 {v7.8b},  [x3], x1
+sqxtun  v2.8b,  \coef0
+srshr   \coef6, \coef6, #6
+sqxtun  v3.8b,  \coef1
+srshr   \coef7, \coef7, #6
+uaddw   \coef2, \coef2, v4.8b
+ld1 {\tmp1},  [x0], x1
+uaddw   \coef3, \coef3, v5.8b
+ld1 {\tmp2},  [x3], x1
+sqxtun  v4.8b,  \coef2
+sub x0,  x0,  x1, lsl #2
+sub x3,  x3,  x1, lsl #2
+sqxtun  v5.8b,  \coef3
+uaddw   \coef4, \coef4, v6.8b
+st1 {v2.8b},  [x0], x1
+uaddw   \coef5, \coef5, v7.8b
+st1 {v3.8b},  [x3], x1
+sqxtun  v6.8b,  \coef4
+st1 {v4.8b},  [x0], x1
+sqxtun  v7.8b,  \coef5
+st1 {v5.8b},  [x3], x1
+uaddw   \coef6, \coef6, \tmp1
+st1 {v6.8b},  [x0], x1
+uaddw   \coef7, \coef7, \tmp2
+st1 {v7.8b},  [x3], x1
+sqxtun  \tmp1,  \coef6
+sqxtun  \tmp2,  \coef7
+st1 {\tmp1},  [x0], x1
+st1 {\tmp2},  [x3], x1
+.endm
+
 // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it,
 // transpose into a horizontal 16x8 slice and store.
 // x0 = dst (temp buffer)
@@ -671,53 +716,8 @@ function \txfm\()16_1d_8x16_pass2_neon
 lsl x1,  x1,  #1
 bl  \txfm\()16
 
-.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2
-srshr   \coef0, \coef0, #6
-ld1 {v2.8b},  [x0], x1
-srshr   \coef1, \coef1, #6
-ld1 {v3.8b},  [x3], x1
-srshr   \coef2, \coef2, #6
-ld1 {v4.8b},  [x0], x1
-srshr   \coef3, \coef3, #6
-uaddw   \coef0, \coef0, v2.8b
-ld1 {v5.8b},  [x3], x1
-uaddw   \coef1, \coef1, v3.8b
-srshr   \coef4, \coef4, #6
-ld1 {v6.8b},  [x0], x1
-srshr   \coef5, \coef5, #6
-ld1 {v7.8b},  [x3], x1
-sqxtun  v2.8b,  \coef0
-srshr   \coef6, \coef6, #6
-sqxtun  v3.8b,  \coef1
-srshr   \coef7, \coef7, #6
-uaddw   \coef2, \coef2, v4.8b
-ld1 {\tmp1},  [x0], x1
-uaddw   \coef3, \coef3, v5.8b
-ld1 {\tmp2},  [x3], x1
-sqxtun  v4.8b,  \coef2
-sub x0,  x0,  x1, lsl #2
-sub x3,  x3,  x1, lsl #2
-sqxtun  v5.8b,  \coef3
-uaddw   \coef4, \coef4, v6.8b
-st1 {v2.8b},  [x0], x1
-uaddw   \coef5, \coef5, v7.8b
-st1 {v3.8b},  [x3], x1
-sqxtun  v6.8b,  \coef4
-st1 {v4.8b},  [x0], x1
-sqxtun  v7.8b,  \coef5
-st1 {v5.8b},  [x3], x1
-uaddw   \coef6, \coef6, \tmp1
-st1 {v6.8b},  [x0], x1
-uaddw   \coef7, \coef7, \tmp2
-st1 {v7.8b},  [x3], x1
-sqxtun  \tmp1,  \coef6
-sqxtun  \tmp2,  \coef7
-st1 {\tmp1},  [x0], x1
-st1 {\tmp2},  [x3], x1
-.endm
         load_add_store  v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h, v16.8b, v17.8b
         load_add_store  v24.8h, v25.8h, v26.8h, v27.8h, v28.8h, v29.8h, v30.8h, v31.8h, v16.8b, v17.8b
-.purgem load_add_store
 

[FFmpeg-devel] [PATCH 05/34] arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-08 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the
pass2 function.

This is cherrypicked from libav commit
47b3c2c18d1897f3c753ba0cec4b2d7aa24526af.
---
 libavcodec/arm/vp9itxfm_neon.S | 72 +-
 1 file changed, 36 insertions(+), 36 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 328bb01..682a82e 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -657,6 +657,42 @@ function iadst16
 bx  lr
 endfunc
 
+.macro load_add_store coef0, coef1, coef2, coef3
+vrshr.s16   \coef0, \coef0, #6
+vrshr.s16   \coef1, \coef1, #6
+
+vld1.32 {d4[]},   [r0,:32], r1
+vld1.32 {d4[1]},  [r3,:32], r1
+vrshr.s16   \coef2, \coef2, #6
+vrshr.s16   \coef3, \coef3, #6
+vld1.32 {d5[]},   [r0,:32], r1
+vld1.32 {d5[1]},  [r3,:32], r1
+        vaddw.u8        \coef0, \coef0, d4
+vld1.32 {d6[]},   [r0,:32], r1
+vld1.32 {d6[1]},  [r3,:32], r1
+        vaddw.u8        \coef1, \coef1, d5
+vld1.32 {d7[]},   [r0,:32], r1
+vld1.32 {d7[1]},  [r3,:32], r1
+
+vqmovun.s16 d4,  \coef0
+vqmovun.s16 d5,  \coef1
+sub r0,  r0,  r1, lsl #2
+sub r3,  r3,  r1, lsl #2
+        vaddw.u8        \coef2, \coef2, d6
+        vaddw.u8        \coef3, \coef3, d7
+vst1.32 {d4[0]},  [r0,:32], r1
+vst1.32 {d4[1]},  [r3,:32], r1
+vqmovun.s16 d6,  \coef2
+vst1.32 {d5[0]},  [r0,:32], r1
+vst1.32 {d5[1]},  [r3,:32], r1
+vqmovun.s16 d7,  \coef3
+
+vst1.32 {d6[0]},  [r0,:32], r1
+vst1.32 {d6[1]},  [r3,:32], r1
+vst1.32 {d7[0]},  [r0,:32], r1
+vst1.32 {d7[1]},  [r3,:32], r1
+.endm
+
 .macro itxfm16_1d_funcs txfm
 @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
 @ transpose into a horizontal 16x4 slice and store.
@@ -739,44 +775,8 @@ function \txfm\()16_1d_4x16_pass2_neon
 lsl r1,  r1,  #1
 bl  \txfm\()16
 
-.macro load_add_store coef0, coef1, coef2, coef3
-vrshr.s16   \coef0, \coef0, #6
-vrshr.s16   \coef1, \coef1, #6
-
-vld1.32 {d4[]},   [r0,:32], r1
-vld1.32 {d4[1]},  [r3,:32], r1
-vrshr.s16   \coef2, \coef2, #6
-vrshr.s16   \coef3, \coef3, #6
-vld1.32 {d5[]},   [r0,:32], r1
-vld1.32 {d5[1]},  [r3,:32], r1
-        vaddw.u8        \coef0, \coef0, d4
-vld1.32 {d6[]},   [r0,:32], r1
-vld1.32 {d6[1]},  [r3,:32], r1
-        vaddw.u8        \coef1, \coef1, d5
-vld1.32 {d7[]},   [r0,:32], r1
-vld1.32 {d7[1]},  [r3,:32], r1
-
-vqmovun.s16 d4,  \coef0
-vqmovun.s16 d5,  \coef1
-sub r0,  r0,  r1, lsl #2
-sub r3,  r3,  r1, lsl #2
-        vaddw.u8        \coef2, \coef2, d6
-        vaddw.u8        \coef3, \coef3, d7
-vst1.32 {d4[0]},  [r0,:32], r1
-vst1.32 {d4[1]},  [r3,:32], r1
-vqmovun.s16 d6,  \coef2
-vst1.32 {d5[0]},  [r0,:32], r1
-vst1.32 {d5[1]},  [r3,:32], r1
-vqmovun.s16 d7,  \coef3
-
-vst1.32 {d6[0]},  [r0,:32], r1
-vst1.32 {d6[1]},  [r3,:32], r1
-vst1.32 {d7[0]},  [r0,:32], r1
-vst1.32 {d7[1]},  [r3,:32], r1
-.endm
 load_add_store  q8,  q9,  q10, q11
 load_add_store  q12, q13, q14, q15
-.purgem load_add_store
 
 pop {pc}
 endfunc
-- 
2.7.4



[FFmpeg-devel] [PATCH 18/34] arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

Before:                              Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0
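
The speedup comes from processing two rows per iteration with a second
write pointer trailing the read pointer, so loads for the next row can
issue while the previous row is still being narrowed and stored.
Schematically:

@ Before: one row per iteration, load and store through the same pointer
1:
        vld1.8          {q3},  [r0,:128]
        @ ... add the dc constant ...
        vst1.8          {q3},  [r0,:128], r1
        subs            r12, r12, #1
        bne             1b

@ After: two rows per iteration, r0 reads ahead while r3 writes back
        mov             r3,  r0
1:
        subs            r12, r12, #2
        vld1.8          {q2},  [r0,:128], r1
        vld1.8          {q3},  [r0,:128], r1
        @ ... add the dc constant to both rows ...
        vst1.8          {q2},  [r3,:128], r1
        vst1.8          {q3},  [r3,:128], r1
        bne             1b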

This is cherrypicked from libav commit
a76bf8cf1277ef6feb1580b578f5e6ca327e713c.
---
 libavcodec/arm/vp9itxfm_neon.S | 54 --
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 78fdae6..dee2f05 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -542,16 +542,23 @@ function idct16x16_dc_add_neon
 
 vrshr.s16   q8,  q8,  #6
 
+        mov             r3,  r0
         mov             r12, #16
 1:
         @ Loop to add the constant from q8 into all 16x16 outputs
-        vld1.8          {q3},  [r0,:128]
-        vaddw.u8        q10, q8,  d6
-        vaddw.u8        q11, q8,  d7
-        vqmovun.s16     d6,  q10
-        vqmovun.s16     d7,  q11
-        vst1.8          {q3},  [r0,:128], r1
-        subs            r12, r12, #1
+        subs            r12, r12, #2
+        vld1.8          {q2},  [r0,:128], r1
+        vaddw.u8        q10, q8,  d4
+        vld1.8          {q3},  [r0,:128], r1
+        vaddw.u8        q11, q8,  d5
+        vaddw.u8        q12, q8,  d6
+        vaddw.u8        q13, q8,  d7
+        vqmovun.s16     d4,  q10
+        vqmovun.s16     d5,  q11
+        vqmovun.s16     d6,  q12
+        vst1.8          {q2},  [r3,:128], r1
+        vqmovun.s16     d7,  q13
+        vst1.8          {q3},  [r3,:128], r1
         bne             1b
 
 bx  lr
@@ -1147,20 +1154,31 @@ function idct32x32_dc_add_neon
 
 vrshr.s16   q8,  q8,  #6
 
+        mov             r3,  r0
         mov             r12, #32
 1:
         @ Loop to add the constant from q8 into all 32x32 outputs
-        vld1.8          {q2-q3},  [r0,:128]
-        vaddw.u8        q10, q8,  d4
-        vaddw.u8        q11, q8,  d5
-        vaddw.u8        q12, q8,  d6
-        vaddw.u8        q13, q8,  d7
-        vqmovun.s16     d4,  q10
-        vqmovun.s16     d5,  q11
-        vqmovun.s16     d6,  q12
-        vqmovun.s16     d7,  q13
-        vst1.8          {q2-q3},  [r0,:128], r1
-        subs            r12, r12, #1
+        subs            r12, r12, #2
+        vld1.8          {q0-q1},  [r0,:128], r1
+        vaddw.u8        q9,  q8,  d0
+        vaddw.u8        q10, q8,  d1
+        vld1.8          {q2-q3},  [r0,:128], r1
+        vaddw.u8        q11, q8,  d2
+        vaddw.u8        q12, q8,  d3
+        vaddw.u8        q13, q8,  d4
+        vaddw.u8        q14, q8,  d5
+        vaddw.u8        q15, q8,  d6
+        vqmovun.s16     d0,  q9
+        vaddw.u8        q9,  q8,  d7
+        vqmovun.s16     d1,  q10
+        vqmovun.s16     d2,  q11
+        vqmovun.s16     d3,  q12
+        vqmovun.s16     d4,  q13
+        vqmovun.s16     d5,  q14
+        vst1.8          {q0-q1},  [r3,:128], r1
+        vqmovun.s16     d6,  q15
+        vqmovun.s16     d7,  q9
+        vst1.8          {q2-q3},  [r3,:128], r1
         bne             1b
 
 bx  lr
-- 
2.7.4



[FFmpeg-devel] [PATCH 19/34] aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google.

Before:   Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   235.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   180.2
vp9_inv_dct_dct_32x32_sub1_add_neon:   475.3

This is cherrypicked from libav commit
3fcf788fbbccc4130868e7abe58a88990290f7c1.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 54 +-
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 6bb097b..be65eb7 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -495,16 +495,23 @@ function idct16x16_dc_add_neon
 
 srshr   v2.8h, v2.8h, #6
 
+mov x3, x0
 mov x4, #16
 1:
 // Loop to add the constant from v2 into all 16x16 outputs
-ld1 {v3.16b},  [x0]
-uaddw   v4.8h,  v2.8h,  v3.8b
-uaddw2  v5.8h,  v2.8h,  v3.16b
-sqxtun  v4.8b,  v4.8h
-sqxtun2 v4.16b, v5.8h
-st1 {v4.16b},  [x0], x1
-        subs            x4,  x4,  #1
+        subs            x4,  x4,  #2
+ld1 {v3.16b},  [x0], x1
+ld1 {v4.16b},  [x0], x1
+uaddw   v16.8h, v2.8h,  v3.8b
+uaddw2  v17.8h, v2.8h,  v3.16b
+uaddw   v18.8h, v2.8h,  v4.8b
+uaddw2  v19.8h, v2.8h,  v4.16b
+sqxtun  v3.8b,  v16.8h
+sqxtun2 v3.16b, v17.8h
+sqxtun  v4.8b,  v18.8h
+sqxtun2 v4.16b, v19.8h
+st1 {v3.16b},  [x3], x1
+st1 {v4.16b},  [x3], x1
 b.ne1b
 
 ret
@@ -1054,20 +1061,31 @@ function idct32x32_dc_add_neon
 
 srshr   v0.8h, v2.8h, #6
 
+mov x3, x0
 mov x4, #32
 1:
 // Loop to add the constant v0 into all 32x32 outputs
-ld1 {v1.16b,v2.16b},  [x0]
-uaddw   v3.8h,  v0.8h,  v1.8b
-uaddw2  v4.8h,  v0.8h,  v1.16b
-uaddw   v5.8h,  v0.8h,  v2.8b
-uaddw2  v6.8h,  v0.8h,  v2.16b
-sqxtun  v3.8b,  v3.8h
-sqxtun2 v3.16b, v4.8h
-sqxtun  v4.8b,  v5.8h
-sqxtun2 v4.16b, v6.8h
-st1 {v3.16b,v4.16b},  [x0], x1
-        subs            x4,  x4,  #1
+        subs            x4,  x4,  #2
+ld1 {v1.16b,v2.16b},  [x0], x1
+uaddw   v16.8h, v0.8h,  v1.8b
+uaddw2  v17.8h, v0.8h,  v1.16b
+ld1 {v3.16b,v4.16b},  [x0], x1
+uaddw   v18.8h, v0.8h,  v2.8b
+uaddw2  v19.8h, v0.8h,  v2.16b
+uaddw   v20.8h, v0.8h,  v3.8b
+uaddw2  v21.8h, v0.8h,  v3.16b
+uaddw   v22.8h, v0.8h,  v4.8b
+uaddw2  v23.8h, v0.8h,  v4.16b
+sqxtun  v1.8b,  v16.8h
+sqxtun2 v1.16b, v17.8h
+sqxtun  v2.8b,  v18.8h
+sqxtun2 v2.16b, v19.8h
+sqxtun  v3.8b,  v20.8h
+sqxtun2 v3.16b, v21.8h
+st1 {v1.16b,v2.16b},  [x3], x1
+sqxtun  v4.8b,  v22.8h
+sqxtun2 v4.16b, v23.8h
+st1 {v3.16b,v4.16b},  [x3], x1
 b.ne1b
 
 ret
-- 
2.7.4



[FFmpeg-devel] [PATCH 13/34] aarch64: vp9itxfm: Update a comment to refer to a register with a different name

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
8476eb0d3ab1f7a52317b23346646389c08fb57a.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 3b34749..5219d6e 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -41,8 +41,8 @@ const iadst16_coeffs, align=4
 .short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
 endconst
 
-// out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
-// out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
+// out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14
+// out2 = ((in1 - in2) * v0[0] + (1 << 13)) >> 14
 // in/out are .8h registers; this can do with 4 temp registers, but is
 // more efficient if 6 temp registers are available.
.macro dmbutterfly0 out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, neg=0
-- 
2.7.4



[FFmpeg-devel] [PATCH 11/34] aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible

2017-03-08 Thread Martin Storsjö
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.
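
The difference, in isolation:

-        ld1r            {v2.4h},  [x2]     // broadcast-load into all 4 lanes
+        ld1             {v2.h}[0], [x2]    // load into lane 0 only

Only lane 0 of the result is used afterwards, so the broadcast
semantics of ld1r are unnecessary here.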

This is cherrypicked from libav commit
ed8d293306e12c9b79022d37d39f48825ce7f2fa.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index df178d2..e42cc2d 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -255,7 +255,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 cmp w3,  #1
 b.ne1f
 // DC-only for idct/idct
-        ld1r            {v2.4h},  [x2]
+        ld1             {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -287,8 +287,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 
 \txfm2\()4  v4,  v5,  v6,  v7
 2:
-        ld1r            {v0.2s},   [x0], x1
-        ld1r            {v1.2s},   [x0], x1
+        ld1             {v0.s}[0], [x0], x1
+        ld1             {v1.s}[0], [x0], x1
 .ifnc \txfm1,iwht
 srshr   v4.4h,  v4.4h,  #4
 srshr   v5.4h,  v5.4h,  #4
@@ -297,8 +297,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 .endif
 uaddw   v4.8h,  v4.8h,  v0.8b
 uaddw   v5.8h,  v5.8h,  v1.8b
-        ld1r            {v2.2s},   [x0], x1
-        ld1r            {v3.2s},   [x0], x1
+        ld1             {v2.s}[0], [x0], x1
+        ld1             {v3.s}[0], [x0], x1
 sqxtun  v0.8b,  v4.8h
 sqxtun  v1.8b,  v5.8h
 sub x0,  x0,  x1, lsl #2
@@ -394,7 +394,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 cmp w3,  #1
 b.ne1f
 // DC-only for idct/idct
-        ld1r            {v2.4h},  [x2]
+        ld1             {v2.h}[0],  [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -485,7 +485,7 @@ function idct16x16_dc_add_neon
 
         movi            v1.4h, #0

-        ld1r            {v2.4h}, [x2]
+        ld1             {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -1044,7 +1044,7 @@ function idct32x32_dc_add_neon
 
         movi            v1.4h, #0

-        ld1r            {v2.4h}, [x2]
+        ld1             {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h,  v0.h[0]
 rshrn   v2.4h,  v2.4s,  #14
 smull   v2.4s,  v2.4h,  v0.h[0]
-- 
2.7.4



[FFmpeg-devel] [PATCH 09/34] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
3933b86bb93aca47f29fbd493075b0f110c1e3f5.
---
 libavcodec/arm/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 33a7af1..78fdae6 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
 movrel  r12, idct_coeffs
 vpush   {q4-q5}
-vld1.16 {q0}, [r12,:128]
 .else
 movrel  r12, iadst8_coeffs
 vld1.16 {q1}, [r12,:128]!
 vpush   {q4-q7}
-vld1.16 {q0}, [r12,:128]
 .endif
+vld1.16 {q0}, [r12,:128]
 
         vmov.i16        q2, #0
         vmov.i16        q3, #0
-- 
2.7.4



[FFmpeg-devel] [PATCH 12/34] aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
3dd7827258ddaa2e51085d0c677d6f3b1be3572f.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index e42cc2d..3b34749 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -385,10 +385,10 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .endif
 ld1 {v0.8h}, [x4]
 
-        movi            v2.16b, #0
-        movi            v3.16b, #0
-        movi            v4.16b, #0
-        movi            v5.16b, #0
+        movi            v2.8h, #0
+        movi            v3.8h, #0
+        movi            v4.8h, #0
+        movi            v5.8h, #0
 
 .ifc \txfm1\()_\txfm2,idct_idct
 cmp w3,  #1
@@ -411,11 +411,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 b   2f
 .endif
 1:
-ld1 {v16.16b,v17.16b,v18.16b,v19.16b},  [x2], #64
-ld1 {v20.16b,v21.16b,v22.16b,v23.16b},  [x2], #64
+ld1 {v16.8h,v17.8h,v18.8h,v19.8h},  [x2], #64
+ld1 {v20.8h,v21.8h,v22.8h,v23.8h},  [x2], #64
 sub x2,  x2,  #128
-st1 {v2.16b,v3.16b,v4.16b,v5.16b},  [x2], #64
-st1 {v2.16b,v3.16b,v4.16b,v5.16b},  [x2], #64
+st1 {v2.8h,v3.8h,v4.8h,v5.8h},  [x2], #64
+st1 {v2.8h,v3.8h,v4.8h,v5.8h},  [x2], #64
 
 \txfm1\()8
 
-- 
2.7.4



[FFmpeg-devel] [PATCH 20/34] arm/aarch64: vp9lpf: Calculate !hev directly

2017-03-08 Thread Martin Storsjö
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.
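
On the arm side the change boils down to using the complementary
comparison, which yields the negated condition directly:

-        vcgt.u8         d5,  d5,  d3    @ hev
...
-        vmvn            d5,  d5         @ !hev
+        vcle.u8         d5,  d5,  d3    @ !hev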

Before:                        Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:        147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:        242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:       500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:      971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:        143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:        241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:       497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:      965.2   818.7   731.4   579.0        452.0

This is cherrypicked from libav commit
e1f9de86f454861b69b199ad801adc2ec6c3b220.
---
 libavcodec/aarch64/vp9lpf_neon.S | 5 ++---
 libavcodec/arm/vp9lpf_neon.S | 5 ++---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index 55e1964..7fe2c88 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -292,7 +292,7 @@
 .if \mix != 0
         sxtl            v1.8h,  v1.8b
 .endif
-        cmhi            v5\sz,  v5\sz,  v3\sz  // hev
+        cmhs            v5\sz,  v3\sz,  v5\sz  // !hev
 .if \wd == 8
 // If a 4/8 or 8/4 mix is used, clear the relevant half of v6
 .if \mix != 0
@@ -306,11 +306,10 @@
 .elseif \wd == 8
 bic v4\sz,  v4\sz,  v6\sz  // fm && !flat8in
 .endif
-mvn v5\sz,  v5\sz  // !hev
+and v5\sz,  v5\sz,  v4\sz  // !hev && fm && !flat8in
 .if \wd == 16
 and v7\sz,  v7\sz,  v6\sz  // flat8out && flat8in && fm
 .endif
-and v5\sz,  v5\sz,  v4\sz  // !hev && fm && !flat8in
 
         mul_sz          \tmp3\().8h,  \tmp4\().8h,  \tmp3\().8h, \tmp4\().8h,  \tmp5\().8h,  \tmp5\().8h, \sz // 3 * (q0 - p0)
         bic             \tmp1\sz,  \tmp1\sz,  v5\sz    // if (!hev) av_clip_int8 = 0
diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
index e96f4db..2761956 100644
--- a/libavcodec/arm/vp9lpf_neon.S
+++ b/libavcodec/arm/vp9lpf_neon.S
@@ -141,7 +141,7 @@
 .if \wd == 8
         vcle.u8         d6,  d6,  d0    @ flat8in
 .endif
-        vcgt.u8         d5,  d5,  d3    @ hev
+        vcle.u8         d5,  d5,  d3    @ !hev
 .if \wd == 8
         vand            d6,  d6,  d4    @ flat8in && fm
 .endif
@@ -151,11 +151,10 @@
 .elseif \wd == 8
         vbic            d4,  d4,  d6    @ fm && !flat8in
 .endif
-        vmvn            d5,  d5         @ !hev
+        vand            d5,  d5,  d4    @ !hev && fm && !flat8in
 .if \wd == 16
         vand            d7,  d7,  d6    @ flat8out && flat8in && fm
 .endif
-        vand            d5,  d5,  d4    @ !hev && fm && !flat8in

         vmul.s16        \tmpq2,  \tmpq2, \tmpq3 @ 3 * (q0 - p0)
         vbic            \tmp1,   \tmp1,   d5    @ if (!hev) av_clip_int8 = 0
-- 
2.7.4



[FFmpeg-devel] [PATCH 10/34] aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit
4da4b2b87f08a1331650c7e36eb7d4029a160776.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 3eb999a..df178d2 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -379,12 +379,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 // idct, so those always need to be loaded.
 .ifc \txfm1\()_\txfm2,idct_idct
 movrel  x4,  idct_coeffs
-ld1 {v0.8h}, [x4]
 .else
 movrel  x4, iadst8_coeffs
 ld1 {v1.8h}, [x4], #16
-ld1 {v0.8h}, [x4]
 .endif
+ld1 {v0.8h}, [x4]
 
         movi            v2.16b, #0
         movi            v3.16b, #0
-- 
2.7.4



Re: [FFmpeg-devel] [PATCHv3 4/4] libavcodec: v4l2: add support for v4l2 mem2mem codecs

2017-08-08 Thread Martin Storsjö

Hi Jorge,

On Mon, 7 Aug 2017, Jorge Ramirez wrote:


On 08/03/2017 01:53 AM, Mark Thompson wrote:

+    default:
+        return 0;
+    }
+
+    SET_V4L_EXT_CTRL(value, qmin, avctx->qmin, "minimum video quantizer scale");
+    SET_V4L_EXT_CTRL(value, qmax, avctx->qmax, "maximum video quantizer scale");
+
+    return 0;
+}
This doesn't set extradata - you need to extract the codec global headers 
(such as H.264 SPS and PPS) at init time to be able to write correct files 
for some codecs (such as H.264) with muxers requiring global headers (such as 
MP4).  It kind of works without it, but the files created will not conform and 
will not be usable on some players.


Ah, that might explain some things (when I play back the encoded video 
the quality is pretty lousy).
Is there already some code I can use as a reference? I might be out of 
my depth here, so any help will be more than welcome.


This is exactly the thing I was trying to tell you about, off list, 
before.


In the OMX driver used on android, this is requested on startup, via an 
ioctl with the following private ioctl value:

V4L2_CID_MPEG_VIDC_VIDEO_REQUEST_SEQ_HEADER

See this code here:
https://android.googlesource.com/platform/hardware/qcom/media/+/63abe022/msm8996/mm-video-v4l2/vidc/venc/src/video_encoder_device_v4l2.cpp#2991

This is a qcom specific, private ioctl. In the android kernel for 
qualcomm, this is handled correctly here:


https://android.googlesource.com/kernel/msm/+/android-7.1.2_r0.33/drivers/media/platform/msm/vidc/msm_venc.c#2987
https://android.googlesource.com/kernel/msm/+/android-7.1.2_r0.33/drivers/media/platform/msm/vidc/msm_vidc_common.c#3767

In the dragonboard kernel snapshot I had been testing, that I referred to 
you before, there are incomplete stubs of handling of this. In the 
debian-qcom-dragonboard410c-16.04 tag in the linaro kernel tree:


http://git.linaro.org/landing-teams/working/qualcomm/kernel.git/tree/drivers/media/platform/msm/vidc/msm_venc-ctrls.c?h=debian-qcom-dragonboard410c-16.04&id=8205f603ceeb02d08a720676d9075c9e75e47b0f#n2116
This increments seq_hdr_reqs, just like in the android kernel tree (where 
this is working). However in this kernel tree, nothing actually ever reads 
the seq_hdr_reqs, so it's a non-functional stub.


Now in the kernel tree you referred me to, in the 
release/db820c/qcomlt-4.11 branch, I don't see anything similar to 
V4L2_CID_MPEG_VIDC_VIDEO_REQUEST_SEQ_HEADER. I can't help you from there, 
you need to figure out what alternative codepath there is, intended 
to replace it - if any. If there aren't any, you first need to fix the 
v4l2 driver before userspace apps can get what they need.


There is a clear need for this, as you witness in the android version of 
the kernel. It just seems to have been removed in the vanilla linux 
version of the driver.
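
For reference, the usual shape of this on the avcodec side, once the
driver can hand back the headers at init time, is roughly the
following (a sketch only - get_coded_headers() is a placeholder for
whatever mechanism the v4l2 driver ends up providing, ctx being the
encoder's private context):

    if (avctx->flags & AV_CODEC_FLAG_GLOBAL_HEADER) {
        uint8_t *hdr;
        int hdr_size = get_coded_headers(ctx, &hdr); /* placeholder */
        avctx->extradata = av_mallocz(hdr_size + AV_INPUT_BUFFER_PADDING_SIZE);
        if (!avctx->extradata)
            return AVERROR(ENOMEM);
        memcpy(avctx->extradata, hdr, hdr_size);     /* e.g. SPS + PPS */
        avctx->extradata_size = hdr_size;
    }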


// Martin


[FFmpeg-devel] [PATCH 1/2] aarch64: vp9: Fix assembling with Xcode 6.2 and older

2017-06-20 Thread Martin Storsjö
From: Memphiz 

Properly use the b.eq/b.ge forms instead of the nonstandard forms
(which both gas and newer clang accept though), and expand the
register list that used a range (which the Xcode 6.2 clang, based
on clang 3.5 svn, didn't support).
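
For example (both forms assemble identically where supported):

        beq     3f                                  // gas/newer clang only
        st1     {v16.8h-v19.8h},  [x0], #64         // register range form

        b.eq    3f                                  // standard form
        st1     {v16.8h,v17.8h,v18.8h,v19.8h},  [x0], #64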

This is cherrypicked from libav commit
a970f9de865c84ed5360dd0398baee7d48d04620.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 2 +-
 libavcodec/aarch64/vp9mc_neon.S| 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index b12890f0db..99413b0f70 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -1531,7 +1531,7 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1
 2:
 subsx1,  x1,  #1
 .rept 4
-st1 {v16.8h-v19.8h},  [x0], #64
+st1 {v16.8h,v17.8h,v18.8h,v19.8h},  [x0], #64
 .endr
 b.ne2b
 3:
diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S
index 82a0f53133..f67624ca04 100644
--- a/libavcodec/aarch64/vp9mc_neon.S
+++ b/libavcodec/aarch64/vp9mc_neon.S
@@ -341,7 +341,7 @@ function \type\()_8tap_\size\()h_\idx1\idx2
         subs            x9,  x9,  #16
 st1 {v1.16b},  [x0], #16
 st1 {v24.16b}, [x6], #16
-beq 3f
+b.eq3f
 mov v4.16b,  v6.16b
 mov v16.16b, v18.16b
 ld1 {v6.16b},  [x2], #16
@@ -388,10 +388,10 @@ function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1
 add x9,  x6,  w5, uxtw #4
 mov x5,  #\size
 .if \size >= 16
-        bge             \type\()_8tap_16h_34
+        b.ge            \type\()_8tap_16h_34
         b               \type\()_8tap_16h_43
 .else
-        bge             \type\()_8tap_\size\()h_34
+        b.ge            \type\()_8tap_\size\()h_34
 b   \type\()_8tap_\size\()h_43
 .endif
 endfunc
-- 
2.11.0 (Apple Git-81)



[FFmpeg-devel] [PATCH 2/2] aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older

2017-06-20 Thread Martin Storsjö
From: Memphiz 

Properly use the b.eq form instead of the nonstandard form (which
both gas and newer clang accept though), and expand the register
lists that used a range (which the Xcode 6.2 clang, based on clang
3.5 svn, didn't support).
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 8 
 libavcodec/aarch64/vp9mc_16bpp_neon.S| 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index 0befe383df..68296d9c40 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -1925,8 +1925,8 @@ function vp9_idct_idct_32x32_add_16_neon
 2:
 subsx1,  x1,  #1
 .rept 4
-st1 {v16.4s-v19.4s},  [x0], #64
-st1 {v16.4s-v19.4s},  [x0], #64
+st1 {v16.4s,v17.4s,v18.4s,v19.4s},  [x0], #64
+st1 {v16.4s,v17.4s,v18.4s,v19.4s},  [x0], #64
 .endr
 b.ne2b
 3:
@@ -1991,8 +1991,8 @@ function idct32x32_\size\()_add_16_neon
         movi            v19.4s,  #0
 
 .rept 4
-st1 {v16.4s-v19.4s},  [x0], #64
-st1 {v16.4s-v19.4s},  [x0], #64
+st1 {v16.4s,v17.4s,v18.4s,v19.4s},  [x0], #64
+st1 {v16.4s,v17.4s,v18.4s,v19.4s},  [x0], #64
 .endr
 
 3:
diff --git a/libavcodec/aarch64/vp9mc_16bpp_neon.S b/libavcodec/aarch64/vp9mc_16bpp_neon.S
index 98ffd2e8a7..cac6428709 100644
--- a/libavcodec/aarch64/vp9mc_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9mc_16bpp_neon.S
@@ -275,7 +275,7 @@ function \type\()_8tap_\size\()h
         subs            x9,  x9,  #32
 st1 {v1.8h,  v2.8h},  [x0], #32
 st1 {v24.8h, v25.8h}, [x6], #32
-beq 3f
+b.eq3f
 mov v5.16b,  v7.16b
 mov v16.16b, v18.16b
 ld1 {v6.8h,  v7.8h},  [x2], #32
-- 
2.11.0 (Apple Git-81)



[FFmpeg-devel] [PATCH 1/3] arm: swscale: Only compile the rgb2yuv asm if .dn aliases are supported

2018-03-30 Thread Martin Storsjö
Vanilla clang supports altmacro since clang 5.0, and thus doesn't
require gas-preprocessor for building the arm assembly any longer.

However, the built-in assembler doesn't support .dn directives.

This readds checks that were removed in d7320ca3ed10f0d, when
the last usage of .dn directives within libav were removed.

Alternatively, the assembly could be rewritten to not use the
.dn directive, making it available to clang users.
---
 configure | 2 ++
 libswscale/arm/rgb2yuv_neon_16.S  | 3 +++
 libswscale/arm/rgb2yuv_neon_32.S  | 3 +++
 libswscale/arm/swscale_unscaled.c | 6 ++
 4 files changed, 14 insertions(+)

diff --git a/configure b/configure
index 99570a1415..81fb3fbf75 100755
--- a/configure
+++ b/configure
@@ -2149,6 +2149,7 @@ SYSTEM_LIBRARIES="
 
 TOOLCHAIN_FEATURES="
 as_arch_directive
+as_dn_directive
 as_fpu_directive
 as_func
 as_object_arch
@@ -5530,6 +5531,7 @@ EOF
 check_inline_asm asm_mod_q '"add r0, %Q0, %R0" :: "r"((long long)0)'
 
 check_as as_arch_directive ".arch armv7-a"
+check_as as_dn_directive   "ra .dn d0.i16"
 check_as as_fpu_directive  ".fpu neon"
 
 # llvm's integrated assembler supports .object_arch from llvm 3.5
diff --git a/libswscale/arm/rgb2yuv_neon_16.S b/libswscale/arm/rgb2yuv_neon_16.S
index 601bc9a9b7..ad7e679ca9 100644
--- a/libswscale/arm/rgb2yuv_neon_16.S
+++ b/libswscale/arm/rgb2yuv_neon_16.S
@@ -18,6 +18,8 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
+#include "config.h"
+#if HAVE_AS_DN_DIRECTIVE
 #include "rgb2yuv_neon_common.S"
 
 /* downsampled R16G16B16 x8 */
@@ -78,3 +80,4 @@ alias_qwc8x8x2, q10
 .endm
 
 loop_420sp  rgbx, nv12, init, kernel_420_16x2, 16
+#endif
diff --git a/libswscale/arm/rgb2yuv_neon_32.S b/libswscale/arm/rgb2yuv_neon_32.S
index f51a5f149f..4fd0f64a09 100644
--- a/libswscale/arm/rgb2yuv_neon_32.S
+++ b/libswscale/arm/rgb2yuv_neon_32.S
@@ -18,6 +18,8 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
+#include "config.h"
+#if HAVE_AS_DN_DIRECTIVE
 #include "rgb2yuv_neon_common.S"
 
 /* downsampled R16G16B16 x8 */
@@ -117,3 +119,4 @@ alias_qwc8x8x2, q10
 
 
 loop_420sp  rgbx, nv12, init, kernel_420_16x2, 32
+#endif
diff --git a/libswscale/arm/swscale_unscaled.c b/libswscale/arm/swscale_unscaled.c
index e1597ab42d..e41f294eac 100644
--- a/libswscale/arm/swscale_unscaled.c
+++ b/libswscale/arm/swscale_unscaled.c
@@ -23,6 +23,7 @@
 #include "libswscale/swscale_internal.h"
 #include "libavutil/arm/cpu.h"
 
+#if HAVE_AS_DN_DIRECTIVE
+extern void rgbx_to_nv12_neon_32(const uint8_t *src, uint8_t *y, uint8_t *chroma,
+                                 int width, int height,
+                                 int y_stride, int c_stride, int src_stride,
@@ -178,3 +179,8 @@ void ff_get_unscaled_swscale_arm(SwsContext *c)
 if (have_neon(cpu_flags))
 get_unscaled_swscale_neon(c);
 }
+#else
+void ff_get_unscaled_swscale_arm(SwsContext *c)
+{
+}
+#endif
-- 
2.15.1 (Apple Git-101)



[FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters

2018-03-30 Thread Martin Storsjö
Clang supports the macro expansion counter (used for making unique
labels within macro expansions), but not when targeting darwin.

Convert uses of the counter into normal local labels, as used
elsewhere.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.
---
 libavcodec/arm/hevcdsp_qpel_neon.S | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/libavcodec/arm/hevcdsp_qpel_neon.S b/libavcodec/arm/hevcdsp_qpel_neon.S
index 86f92cf75a..caa6efa766 100644
--- a/libavcodec/arm/hevcdsp_qpel_neon.S
+++ b/libavcodec/arm/hevcdsp_qpel_neon.S
@@ -667,76 +667,76 @@ endfunc
 
 
 function ff_hevc_put_qpel_h1v1_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_1_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_h2v1_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_1_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_h3v1_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_1_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_h1v2_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_2_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_h2v2_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_2_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_h3v2_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_2_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_h1v3_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_3_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_3_32b
 endfunc
 
 function ff_hevc_put_qpel_h2v3_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_3_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_3_32b
 endfunc
 
 function ff_hevc_put_qpel_h3v3_neon_8, export=1
-hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_3_32b
+hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_3_32b
 endfunc
 
 
 function ff_hevc_put_qpel_uw_h1v1_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_1_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h2v1_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_1_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h3v1_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_1_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_1_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h1v2_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_2_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h2v2_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_2_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h3v2_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_2_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_2_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h1v3_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_3_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_3_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h2v3_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_3_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_3_32b
 endfunc
 
 function ff_hevc_put_qpel_uw_h3v3_neon_8, export=1
-hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_3_32b
+hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_3_32b
 endfunc
 
 .macro init_put_pixels
-- 
2.15.1 (Apple Git-101)



[FFmpeg-devel] [PATCH 2/3] arm: hevcdsp_deblock: Add commas between macro arguments

2018-03-30 Thread Martin Storsjö
When targeting darwin, clang requires commas between arguments,
while the no-comma form is allowed for other targets.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.
---
 libavcodec/arm/hevcdsp_deblock_neon.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/arm/hevcdsp_deblock_neon.S b/libavcodec/arm/hevcdsp_deblock_neon.S
index 166bddb104..7cb7487ef6 100644
--- a/libavcodec/arm/hevcdsp_deblock_neon.S
+++ b/libavcodec/arm/hevcdsp_deblock_neon.S
@@ -152,7 +152,7 @@
 
         and             r9, r8, r7
         cmp             r9, #0
-        beq             weakfilter_\@
+        beq             1f
 
 vadd.i16  q2, q11, q12
 vadd.i16  q4, q9, q8
@@ -210,11 +210,11 @@
 vbit  q13, q3, q5
 vbit  q14, q2, q5
 
-weakfilter_\@:
+1:
 mvn   r8, r8
 and   r9, r8, r7
 cmp   r9, #0
-beq   ready_\@
+beq   2f
 
 vdup.16q4, r2
 
@@ -275,7 +275,7 @@ weakfilter_\@:
 vbit  q11, q0, q5
 vbit  q12, q4, q5
 
-ready_\@:
+2:
 vqmovun.s16 d16, q8
 vqmovun.s16 d18, q9
 vqmovun.s16 d20, q10
-- 
2.15.1 (Apple Git-101)



Re: [FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters

2018-03-31 Thread Martin Storsjö

On Sat, 31 Mar 2018, Hendrik Leppkes wrote:


On Fri, Mar 30, 2018 at 9:14 PM, Martin Storsjö <mar...@martin.st> wrote:

Clang supports the macro expansion counter (used for making unique
labels within macro expansions), but not when targeting darwin.

Convert uses of the counter into normal local labels, as used
elsewhere.

Since Xcode 9.3, the bundled clang supports altmacro and doesn't
require using gas-preprocessor any longer.


Could it be that you mixed up the commit message and the contents of
commits 2/3?


Oops, yes, you're right. Will fix before pushing later today.

// Martin


[FFmpeg-devel] [PATCH 2/2] flvdec: Export unknown metadata packets as opaque data

2018-10-28 Thread Martin Storsjö
---
Removed the option and made this behaviour the default.
---
 libavformat/flv.h|  1 +
 libavformat/flvdec.c | 18 ++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/libavformat/flv.h b/libavformat/flv.h
index 3aabb3adc9..3571b90279 100644
--- a/libavformat/flv.h
+++ b/libavformat/flv.h
@@ -66,6 +66,7 @@ enum {
 FLV_STREAM_TYPE_VIDEO,
 FLV_STREAM_TYPE_AUDIO,
 FLV_STREAM_TYPE_SUBTITLE,
+FLV_STREAM_TYPE_DATA,
 FLV_STREAM_TYPE_NB,
 };
 
diff --git a/libavformat/flvdec.c b/libavformat/flvdec.c
index ffc975f15d..4b9f46902b 100644
--- a/libavformat/flvdec.c
+++ b/libavformat/flvdec.c
@@ -143,7 +143,9 @@ static AVStream *create_stream(AVFormatContext *s, int codec_type)
 st->codecpar->codec_type = codec_type;
     if (s->nb_streams>=3 ||(   s->nb_streams==2
                             && s->streams[0]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE
-                            && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE))
+                            && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE
+                            && s->streams[0]->codecpar->codec_type != AVMEDIA_TYPE_DATA
+                            && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_DATA))
 s->ctx_flags &= ~AVFMTCTX_NOHEADER;
 if (codec_type == AVMEDIA_TYPE_AUDIO) {
 st->codecpar->bit_rate = flv->audio_bit_rate;
@@ -1001,7 +1003,7 @@ retry:
 int type;
 meta_pos = avio_tell(s->pb);
 type = flv_read_metabody(s, next);
-            if (type == 0 && dts == 0 || type < 0 || type == TYPE_UNKNOWN) {
+            if (type == 0 && dts == 0 || type < 0) {
 if (type < 0 && flv->validate_count &&
 flv->validate_index[0].pos > next &&
 flv->validate_index[0].pos - 4 < next
@@ -1015,6 +1017,8 @@ retry:
 return flv_data_packet(s, pkt, dts, next);
 } else if (type == TYPE_ONCAPTION) {
 return flv_data_packet(s, pkt, dts, next);
+} else if (type == TYPE_UNKNOWN) {
+stream_type = FLV_STREAM_TYPE_DATA;
 }
 avio_seek(s->pb, meta_pos, SEEK_SET);
 }
@@ -1054,10 +1058,13 @@ skip:
 } else if (stream_type == FLV_STREAM_TYPE_SUBTITLE) {
 if (st->codecpar->codec_type == AVMEDIA_TYPE_SUBTITLE)
 break;
+} else if (stream_type == FLV_STREAM_TYPE_DATA) {
+if (st->codecpar->codec_type == AVMEDIA_TYPE_DATA)
+break;
 }
 }
 if (i == s->nb_streams) {
-        static const enum AVMediaType stream_types[] = {AVMEDIA_TYPE_VIDEO, AVMEDIA_TYPE_AUDIO, AVMEDIA_TYPE_SUBTITLE};
+        static const enum AVMediaType stream_types[] = {AVMEDIA_TYPE_VIDEO, AVMEDIA_TYPE_AUDIO, AVMEDIA_TYPE_SUBTITLE, AVMEDIA_TYPE_DATA};
 st = create_stream(s, stream_types[stream_type]);
 if (!st)
 return AVERROR(ENOMEM);
@@ -1153,6 +1160,8 @@ retry_duration:
 size -= ret;
 } else if (stream_type == FLV_STREAM_TYPE_SUBTITLE) {
 st->codecpar->codec_id = AV_CODEC_ID_TEXT;
+} else if (stream_type == FLV_STREAM_TYPE_DATA) {
+st->codecpar->codec_id = AV_CODEC_ID_NONE; // Opaque AMF data
 }
 
 if (st->codecpar->codec_id == AV_CODEC_ID_AAC ||
@@ -1253,7 +1262,8 @@ retry_duration:
 
 if (stream_type == FLV_STREAM_TYPE_AUDIO ||
 ((flags & FLV_VIDEO_FRAMETYPE_MASK) == FLV_FRAME_KEY) ||
-stream_type == FLV_STREAM_TYPE_SUBTITLE)
+stream_type == FLV_STREAM_TYPE_SUBTITLE ||
+stream_type == FLV_STREAM_TYPE_DATA)
 pkt->flags |= AV_PKT_FLAG_KEY;
 
 leave:
-- 
2.17.1 (Apple Git-112)



Re: [FFmpeg-devel] [PATCH 2/2] flvdec: Add an option for exporting unknown metadata packets as opaque data

2018-10-28 Thread Martin Storsjö

On Sun, 28 Oct 2018, Michael Niedermayer wrote:


On Sat, Oct 27, 2018 at 09:22:18PM +0300, Martin Storsjö wrote:

On Sat, 27 Oct 2018, Michael Niedermayer wrote:


On Thu, Oct 25, 2018 at 03:59:17PM +0300, Martin Storsjö wrote:

---
libavformat/flv.h|  1 +
libavformat/flvdec.c | 21 +
2 files changed, 18 insertions(+), 4 deletions(-)


[...]

@@ -1290,6 +1302,7 @@ static const AVOption options[] = {
{ "flv_full_metadata", "Dump full metadata of the onMetadata", 
OFFSET(dump_full_metadata), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VD },
{ "flv_ignore_prevtag", "Ignore the Size of previous tag", 
OFFSET(trust_datasize), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VD },
{ "missing_streams", "", OFFSET(missing_streams), AV_OPT_TYPE_INT, { .i64 = 
0 }, 0, 0xFF, VD | AV_OPT_FLAG_EXPORT | AV_OPT_FLAG_READONLY },
+{ "export_opaque_meta", "", OFFSET(export_opaque_meta), AV_OPT_TYPE_BOOL, 
{ .i64 = 0 }, 0, 1, VD },
{ NULL }


I think this together with doc/demuxers.texi (which doesn't document this)
is not enough for a user to actually make use of this option


Oh right, I had forgotten to actually write something here.


Also, why is this conditional? Is there a disadvantage to always
exporting this?


Not sure - I thought it'd be less behaviour change and less risk of
potentially confusing packets for unsuspecting users by not doing it by
default. But as any normal flv stream doesn't contain any such packets, it
might be fine to just expose them all the time.


I don't know enough about these to have an opinion ...

But I just realized another aspect. How do these packets interact with
flvenc?
Should they be preserved by default? Because if so, they would need
to be exported by default.


I guess it depends on what the packets actually are - as it can be 
anything, it's pretty much up to the application what treatment they want 
for them. flvenc right now does write them out properly afaik (a data 
track with codec type AV_CODEC_ID_NONE gets copied straight through into 
FLV_TAG_TYPE_META packets). I guess the sensible default would be to copy 
them, so I guess I'll amend the patch to always export them.
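
In other words, a plain stream copy should carry such a data track
through unchanged, e.g.:

    ffmpeg -i input.flv -map 0 -c copy output.flv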


// Martin


Re: [FFmpeg-devel] [PATCH 1/2] libavutil: Undeprecate the AVFrame reordered_opaque field

2018-10-29 Thread Martin Storsjö

On Mon, 29 Oct 2018, Derek Buitenhuis wrote:


On 29/10/2018 14:10, Martin Storsjö wrote:

I don't understand why this is being used in favour of a proper
pointer field? An integer field is just asking to be misused.
Even the doxygen is really sketchy on it.


It's essentially meant to be used as union { ptr; int64_t } assuming you
don't have pointers larger than 64 bits.


It's not a union in the API, and I'm pretty sure that it violates the C spec
to use a union to get an integer out of a pointer, shove it into an int64_t,
and then get it back out, and change it back via union. Especially for
32-bit pointers. It encourages terrible code.

I just don't think we should revive this as-is purely for convenience.


I also don't understand why this is at the AVCodecContext level
and not packet/frame?


It is on the frame level, but not in the packet struct (probably for
historical reasons) - instead of in the packet, it's in AVCodecContext.
For decoding, you set the value in AVCodecContext before feeding packets
to it, and get the corresponding value reordered into the output AVFrame.
If things were to be redone from scratch, moving it into AVPacket would
probably make more sense, but there's not much point in doing that right
now.


I mean, this is pretty gross, and non-obvious as far as I'm concerned.
Modifying the AVCodecContext on every call is just... eugh.


At some point, the doxygen got markers saying this mechanism was
deprecated and one should use the new pkt_pts instead. Before that,
reordered_opaque was mainly used for getting reordered pts as there
was no other mechanism for it.

But even with the proper pkt_pts field, having a generic opaque field that
travels along with the reordering is useful, which is why the deprecation
doxygen comments were removed in ad1ee5fa7. But that commit just missed to
remove one of the doxygen deprecation.


I agree it's very useful, and something we should have, but not that we should
revive/use this partiular field... it's nasty.


Sorry, I think you've misunderstood this patch altogether.

It's not about reviving this field or not, it's all in full use 
already. It was never deprecated with any active plan to remove it, the 
only steps were a few doxygen comments, never any attributes that would 
actually prompt action.


And a few years later someone noticed that these doxygen comments didn't 
match up with reality, and it was decided (with no objections on either 
project) that these really shouldn't be deprecated as it is the only 
actual mechanism we have for doing exactly this.


It's just that the undeprecation commit, ad1ee5fa7, missed one field. And 
the one I'm removing the stray deprecation comment from is the very 
properly placed one in AVFrame, no less.
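
For what it's worth, the intended usage pattern is roughly this
(sketch only, error handling omitted, my_tag being whatever value the
application wants to track):

    avctx->reordered_opaque = my_tag;          /* set before each packet */
    avcodec_send_packet(avctx, pkt);
    while (avcodec_receive_frame(avctx, frame) >= 0) {
        int64_t tag = frame->reordered_opaque; /* comes back, reordered */
        /* ... */
    }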


// Martin


Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-29 Thread Martin Storsjö

On Mon, 29 Oct 2018, Derek Buitenhuis wrote:


On 25/10/2018 13:58, Martin Storsjö wrote:

+x4->nb_reordered_opaque = x264_encoder_maximum_delayed_frames(x4->enc) + 1;


Is it possible this changes when the encoder is reconfigured (e.g. to 
interlaced)?


Good point. I'm sure it's possible that it changes, if reconfiguring.

As I guess there can be old frames in flight, the only safe option is to 
enlarge, not to shrink it. But in case a realloc moves the array, the old 
pointers end up pretty useless.


Tricky, I guess I'll have to think about it to see if I can come up with 
something which isn't completely terrible.


// Martin


Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-31 Thread Martin Storsjö

On Wed, 31 Oct 2018, Derek Buitenhuis wrote:


On 30/10/2018 19:49, Martin Storsjö wrote:

Hmm, that might make sense, but with a little twist. The max reordered
frames for H.264 is known, but onto that you also get more delay due to
frame threads and other details that this function within x264 knows
about. So that would make it [H264 max reordering] + [threads] +
[constant] or something such?


Looking at the source, it's more complicated than that, with e.g.:

h->frames.i_delay = X264_MAX( h->frames.i_delay, h->param.rc.i_lookahead );

I think you're better off not trying to duplicate this logic.


Indeed, I don't want to duplicate that.

Even though we do allow reconfiguration, it doesn't look like we support 
changing any parameters which would actually affect the delay, only RC 
method and targets (CRF, bitrate, etc). So given that, the current patch 
probably should be safe - what do you think? Or the current patch, with an 
added margin of 16 on top just in case?
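
I.e. something like this, where the 16 is purely a safety margin and
not derived from anything within x264:

    x4->nb_reordered_opaque = x264_encoder_maximum_delayed_frames(x4->enc) + 1 + 16;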


// Martin


Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-11-01 Thread Martin Storsjö

On Thu, 1 Nov 2018, Derek Buitenhuis wrote:


On 31/10/2018 21:41, Martin Storsjö wrote:

Even though we do allow reconfiguration, it doesn't look like we support
changing any parameters which would actually affect the delay, only RC
method and targets (CRF, bitrate, etc). So given that, the current patch
probably should be safe - what do you think? Or the current patch, with an
added margin of 16 on top just in case?


We allow reconfiguring to/from interlaced. I'm not sure if this can modify 
delay?


Not really sure either... So perhaps it'd be safest with some bit of extra 
margin/overestimate of the delay here? It just costs a couple bytes in the 
mapping array anyway.


// Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-30 Thread Martin Storsjö

On Tue, 30 Oct 2018, Derek Buitenhuis wrote:


On 29/10/2018 21:06, Martin Storsjö wrote:

As I guess there can be old frames in flight, the only safe option is to
enlarge, not to shrink it. But in case a realloc moves the array, the old
pointers end up pretty useless.


Just always allocate the max (which is known for H.264), and adjust 
nb_reordered_opaque
as need be, on reconfig, no?


Hmm, that might make sense, but with a little twist. The max reordered 
frames for H.264 is known, but onto that you also get more delay due to 
frame threads and other details that this function within x264 knows 
about. So that would make it [H264 max reordering] + [threads] + 
[constant] or something such?


// Martin

