Re: [FFmpeg-devel] [PATCH] avcodec: Remove libstagefright
On Sun, 3 Jan 2016, Derek Buitenhuis wrote:

> It serves absolutely no purpose other than to confuse potential
> Android developers about how to use hardware acceleration properly on
> the platform. Both stagefright itself, and MediaCodec, have avcodec
> backends already, and this is the correct way to use it.

No, that's unrelated. Yes, people have written avcodec backends for
stagefright/MediaCodec, but that's unrelated and only of interest for
stock Android media players to extend their codec support.

> MediaCodec has a proper JNI API.

wat? (Yes, using MediaCodec, either via the recent C API, or via JNI,
is the correct way to do it.)

> Furthermore, stagefright support in avcodec needs a series of magic
> incantations and version-specific stuff, such that using it actually
> provides downsides compared to just using the actual Android
> frameworks properly, in that it is a lot more work and confusion to
> get it even running. It also leads to a lot of misinformation, like
> these sorts of comments (in [1]) that are absolutely incorrect.

Spot on, +1.

> [1] http://stackoverflow.com/a/29362353/3115956
>
> Signed-off-by: Derek Buitenhuis
> ---
> I am certain there are many more reasons to remove this as well. I
> know its own author despises it, and I know j-b will have similar
> things to say.

Not the direct author, but co-author/mentor.

// Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 2/2] libopenh264: Support building with the 1.6 release
This fixes trac issue #5417. This is cherry-picked from libav commit d825b1a5306576dcd0553b7d0d24a3a46ad92864. --- Updated the commit message to mention the ticket number. --- libavcodec/libopenh264dec.c | 2 ++ libavcodec/libopenh264enc.c | 26 -- 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/libavcodec/libopenh264dec.c b/libavcodec/libopenh264dec.c index f642082..6af60af 100644 --- a/libavcodec/libopenh264dec.c +++ b/libavcodec/libopenh264dec.c @@ -90,7 +90,9 @@ static av_cold int svc_decode_init(AVCodecContext *avctx) (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK, (void *)_function); (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK_CONTEXT, (void *)); +#if !OPENH264_VER_AT_LEAST(1, 6) param.eOutputColorFormat = videoFormatI420; +#endif param.eEcActiveIdc = ERROR_CON_DISABLE; param.sVideoProperty.eVideoBsType = VIDEO_BITSTREAM_DEFAULT; diff --git a/libavcodec/libopenh264enc.c b/libavcodec/libopenh264enc.c index d27fc41..07af31d 100644 --- a/libavcodec/libopenh264enc.c +++ b/libavcodec/libopenh264enc.c @@ -33,6 +33,10 @@ #include "internal.h" #include "libopenh264.h" +#if !OPENH264_VER_AT_LEAST(1, 6) +#define SM_SIZELIMITED_SLICE SM_DYN_SLICE +#endif + typedef struct SVCContext { const AVClass *av_class; ISVCEncoder *encoder; @@ -48,11 +52,20 @@ typedef struct SVCContext { #define OFFSET(x) offsetof(SVCContext, x) #define VE AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM static const AVOption options[] = { +#if OPENH264_VER_AT_LEAST(1, 6) +{ "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_FIXEDSLCNUM_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" }, +#else { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_AUTO_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" }, +#endif { "fixed", "a fixed number of slices", 0, AV_OPT_TYPE_CONST, { .i64 = SM_FIXEDSLCNUM_SLICE }, 0, 0, VE, "slice_mode" }, +#if OPENH264_VER_AT_LEAST(1, 6) +{ 
"dyn", "Size limited (compatibility name)", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" }, +{ "sizelimited", "Size limited", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" }, +#else { "rowmb", "one slice per row of macroblocks", 0, AV_OPT_TYPE_CONST, { .i64 = SM_ROWMB_SLICE }, 0, 0, VE, "slice_mode" }, { "auto", "automatic number of slices according to number of threads", 0, AV_OPT_TYPE_CONST, { .i64 = SM_AUTO_SLICE }, 0, 0, VE, "slice_mode" }, { "dyn", "Dynamic slicing", 0, AV_OPT_TYPE_CONST, { .i64 = SM_DYN_SLICE }, 0, 0, VE, "slice_mode" }, +#endif { "loopfilter", "enable loop filter", OFFSET(loopfilter), AV_OPT_TYPE_INT, { .i64 = 1 }, 0, 1, VE }, { "profile", "set profile restrictions", OFFSET(profile), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, VE }, { "max_nal_size", "set maximum NAL size in bytes", OFFSET(max_nal_size), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE }, @@ -159,15 +172,24 @@ FF_ENABLE_DEPRECATION_WARNINGS s->slice_mode = SM_FIXEDSLCNUM_SLICE; if (s->max_nal_size) -s->slice_mode = SM_DYN_SLICE; +s->slice_mode = SM_SIZELIMITED_SLICE; +#if OPENH264_VER_AT_LEAST(1, 6) +param.sSpatialLayers[0].sSliceArgument.uiSliceMode = s->slice_mode; +param.sSpatialLayers[0].sSliceArgument.uiSliceNum = avctx->slices; +#else param.sSpatialLayers[0].sSliceCfg.uiSliceMode = s->slice_mode; param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceNum = avctx->slices; +#endif -if (s->slice_mode == SM_DYN_SLICE) { +if (s->slice_mode == SM_SIZELIMITED_SLICE) { if (s->max_nal_size){ param.uiMaxNalSize = s->max_nal_size; +#if OPENH264_VER_AT_LEAST(1, 6) +param.sSpatialLayers[0].sSliceArgument.uiSliceSizeConstraint = s->max_nal_size; +#else param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceSizeConstraint = s->max_nal_size; +#endif } else { av_log(avctx, AV_LOG_ERROR, "Invalid -max_nal_size, " "specify a valid max_nal_size to use -slice_mode dyn\n"); -- 2.7.4 (Apple Git-66) ___ ffmpeg-devel mailing list 
ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper
This is cherrypicked from libav, from commits 82b7525173f20702a8cbc26ebedbf4b69b8fecec and d0b1e6049b06ca146ece4d2f199c5dba1565. --- Fixed the issues pointed out by Michael, removed the parts of the commit message as requested by Carl. --- Changelog | 1 + configure | 2 + doc/general.texi| 9 +- libavcodec/Makefile | 3 +- libavcodec/allcodecs.c | 2 +- libavcodec/libopenh264.c| 62 +++ libavcodec/libopenh264.h| 39 +++ libavcodec/libopenh264dec.c | 243 libavcodec/libopenh264enc.c | 48 ++--- libavcodec/version.h| 2 +- 10 files changed, 366 insertions(+), 45 deletions(-) create mode 100644 libavcodec/libopenh264.c create mode 100644 libavcodec/libopenh264.h create mode 100644 libavcodec/libopenh264dec.c diff --git a/Changelog b/Changelog index 479f164..7f536db 100644 --- a/Changelog +++ b/Changelog @@ -10,6 +10,7 @@ version : - curves filter doesn't automatically insert points at x=0 and x=1 anymore - 16-bit support in curves filter - 16-bit support in selectivecolor filter +- OpenH264 decoder wrapper version 3.1: diff --git a/configure b/configure index 1b41303..9f5b31f 100755 --- a/configure +++ b/configure @@ -2771,6 +2771,8 @@ libopencore_amrnb_decoder_deps="libopencore_amrnb" libopencore_amrnb_encoder_deps="libopencore_amrnb" libopencore_amrnb_encoder_select="audio_frame_queue" libopencore_amrwb_decoder_deps="libopencore_amrwb" +libopenh264_decoder_deps="libopenh264" +libopenh264_decoder_select="h264_mp4toannexb_bsf" libopenh264_encoder_deps="libopenh264" libopenjpeg_decoder_deps="libopenjpeg" libopenjpeg_encoder_deps="libopenjpeg" diff --git a/doc/general.texi b/doc/general.texi index 7823dc1..6b5975c 100644 --- a/doc/general.texi +++ b/doc/general.texi @@ -103,12 +103,19 @@ enable it. @section OpenH264 -FFmpeg can make use of the OpenH264 library for H.264 encoding. +FFmpeg can make use of the OpenH264 library for H.264 encoding and decoding. Go to @url{http://www.openh264.org/} and follow the instructions for installing the library. 
Then pass @code{--enable-libopenh264} to configure to enable it. +For decoding, this library is much more limited than the built-in decoder +in libavcodec; currently, this library lacks support for decoding B-frames +and some other main/high profile features. (It currently only supports +constrained baseline profile and CABAC.) Using it is mostly useful for +testing and for taking advantage of Cisco's patent portfolio license +(@url{http://www.openh264.org/BINARY_LICENSE.txt}). + @section x264 FFmpeg can make use of the x264 library for H.264 encoding. diff --git a/libavcodec/Makefile b/libavcodec/Makefile index a548e02..3def3ad 100644 --- a/libavcodec/Makefile +++ b/libavcodec/Makefile @@ -868,7 +868,8 @@ OBJS-$(CONFIG_LIBMP3LAME_ENCODER) += libmp3lame.o mpegaudiodata.o mpegau OBJS-$(CONFIG_LIBOPENCORE_AMRNB_DECODER) += libopencore-amr.o OBJS-$(CONFIG_LIBOPENCORE_AMRNB_ENCODER) += libopencore-amr.o OBJS-$(CONFIG_LIBOPENCORE_AMRWB_DECODER) += libopencore-amr.o -OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o +OBJS-$(CONFIG_LIBOPENH264_DECODER)+= libopenh264dec.o libopenh264.o +OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o libopenh264.o OBJS-$(CONFIG_LIBOPENJPEG_DECODER)+= libopenjpegdec.o OBJS-$(CONFIG_LIBOPENJPEG_ENCODER)+= libopenjpegenc.o OBJS-$(CONFIG_LIBOPUS_DECODER)+= libopusdec.o libopus.o \ diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c index 951e199..a1ae61f 100644 --- a/libavcodec/allcodecs.c +++ b/libavcodec/allcodecs.c @@ -623,7 +623,7 @@ void avcodec_register_all(void) /* external libraries, that shouldn't be used by default if one of the * above is available */ -REGISTER_ENCODER(LIBOPENH264, libopenh264); +REGISTER_ENCDEC (LIBOPENH264, libopenh264); REGISTER_DECODER(H264_CUVID,h264_cuvid); REGISTER_ENCODER(H264_NVENC,h264_nvenc); REGISTER_ENCODER(H264_OMX, h264_omx); diff --git a/libavcodec/libopenh264.c b/libavcodec/libopenh264.c new file mode 100644 index 000..59c61a3 --- /dev/null +++ b/libavcodec/libopenh264.c @@ -0,0 
+1,62 @@ +/* + * OpenH264 shared utils + * Copyright (C) 2014 Martin Storsjo + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the
Re: [FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper
On Tue, 26 Jul 2016, Michael Niedermayer wrote:

> On Tue, Jul 26, 2016 at 09:31:17PM +0300, Martin Storsjö wrote:
>> This is cherrypicked from libav, from commits
>> 82b7525173f20702a8cbc26ebedbf4b69b8fecec and d0b1e6049b06ca146ece4d2f199c5dba1565.
>> ---
>> Fixed the issues pointed out by Michael, removed the parts of the
>> commit message as requested by Carl.
>> ---
>>  Changelog                   |   1 +
>>  configure                   |   2 +
>>  doc/general.texi            |   9 +-
>>  libavcodec/Makefile         |   3 +-
>>  libavcodec/allcodecs.c      |   2 +-
>>  libavcodec/libopenh264.c    |  62 +++
>>  libavcodec/libopenh264.h    |  39 +++
>>  libavcodec/libopenh264dec.c | 243 ++++
>>  libavcodec/libopenh264enc.c |  48 ++---
>>  libavcodec/version.h        |   2 +-
>>  10 files changed, 366 insertions(+), 45 deletions(-)
>>  create mode 100644 libavcodec/libopenh264.c
>>  create mode 100644 libavcodec/libopenh264.h
>>  create mode 100644 libavcodec/libopenh264dec.c
>
> LGTM, please push, unless someone else has more comments
>
> thanks

Pushed both.

// Martin
[FFmpeg-devel] [PATCH 1/2] Add an OpenH264 decoder wrapper
While it is less featureful (and slower) than the built-in H264 decoder, one could potentially want to use it to take advantage of the cisco patent license offer. This is cherrypicked from libav, from commits 82b7525173f20702a8cbc26ebedbf4b69b8fecec and d0b1e6049b06ca146ece4d2f199c5dba1565. --- Changelog | 1 + configure | 2 + doc/general.texi| 9 +- libavcodec/Makefile | 3 +- libavcodec/allcodecs.c | 2 +- libavcodec/libopenh264.c| 62 +++ libavcodec/libopenh264.h| 39 +++ libavcodec/libopenh264dec.c | 245 libavcodec/libopenh264enc.c | 48 ++--- libavcodec/version.h| 2 +- 10 files changed, 368 insertions(+), 45 deletions(-) create mode 100644 libavcodec/libopenh264.c create mode 100644 libavcodec/libopenh264.h create mode 100644 libavcodec/libopenh264dec.c diff --git a/Changelog b/Changelog index 479f164..7f536db 100644 --- a/Changelog +++ b/Changelog @@ -10,6 +10,7 @@ version : - curves filter doesn't automatically insert points at x=0 and x=1 anymore - 16-bit support in curves filter - 16-bit support in selectivecolor filter +- OpenH264 decoder wrapper version 3.1: diff --git a/configure b/configure index 1b41303..9f5b31f 100755 --- a/configure +++ b/configure @@ -2771,6 +2771,8 @@ libopencore_amrnb_decoder_deps="libopencore_amrnb" libopencore_amrnb_encoder_deps="libopencore_amrnb" libopencore_amrnb_encoder_select="audio_frame_queue" libopencore_amrwb_decoder_deps="libopencore_amrwb" +libopenh264_decoder_deps="libopenh264" +libopenh264_decoder_select="h264_mp4toannexb_bsf" libopenh264_encoder_deps="libopenh264" libopenjpeg_decoder_deps="libopenjpeg" libopenjpeg_encoder_deps="libopenjpeg" diff --git a/doc/general.texi b/doc/general.texi index 7823dc1..6b5975c 100644 --- a/doc/general.texi +++ b/doc/general.texi @@ -103,12 +103,19 @@ enable it. @section OpenH264 -FFmpeg can make use of the OpenH264 library for H.264 encoding. +FFmpeg can make use of the OpenH264 library for H.264 encoding and decoding. 
Go to @url{http://www.openh264.org/} and follow the instructions for installing the library. Then pass @code{--enable-libopenh264} to configure to enable it. +For decoding, this library is much more limited than the built-in decoder +in libavcodec; currently, this library lacks support for decoding B-frames +and some other main/high profile features. (It currently only supports +constrained baseline profile and CABAC.) Using it is mostly useful for +testing and for taking advantage of Cisco's patent portfolio license +(@url{http://www.openh264.org/BINARY_LICENSE.txt}). + @section x264 FFmpeg can make use of the x264 library for H.264 encoding. diff --git a/libavcodec/Makefile b/libavcodec/Makefile index a548e02..3def3ad 100644 --- a/libavcodec/Makefile +++ b/libavcodec/Makefile @@ -868,7 +868,8 @@ OBJS-$(CONFIG_LIBMP3LAME_ENCODER) += libmp3lame.o mpegaudiodata.o mpegau OBJS-$(CONFIG_LIBOPENCORE_AMRNB_DECODER) += libopencore-amr.o OBJS-$(CONFIG_LIBOPENCORE_AMRNB_ENCODER) += libopencore-amr.o OBJS-$(CONFIG_LIBOPENCORE_AMRWB_DECODER) += libopencore-amr.o -OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o +OBJS-$(CONFIG_LIBOPENH264_DECODER)+= libopenh264dec.o libopenh264.o +OBJS-$(CONFIG_LIBOPENH264_ENCODER)+= libopenh264enc.o libopenh264.o OBJS-$(CONFIG_LIBOPENJPEG_DECODER)+= libopenjpegdec.o OBJS-$(CONFIG_LIBOPENJPEG_ENCODER)+= libopenjpegenc.o OBJS-$(CONFIG_LIBOPUS_DECODER)+= libopusdec.o libopus.o \ diff --git a/libavcodec/allcodecs.c b/libavcodec/allcodecs.c index 951e199..a1ae61f 100644 --- a/libavcodec/allcodecs.c +++ b/libavcodec/allcodecs.c @@ -623,7 +623,7 @@ void avcodec_register_all(void) /* external libraries, that shouldn't be used by default if one of the * above is available */ -REGISTER_ENCODER(LIBOPENH264, libopenh264); +REGISTER_ENCDEC (LIBOPENH264, libopenh264); REGISTER_DECODER(H264_CUVID,h264_cuvid); REGISTER_ENCODER(H264_NVENC,h264_nvenc); REGISTER_ENCODER(H264_OMX, h264_omx); diff --git a/libavcodec/libopenh264.c b/libavcodec/libopenh264.c 
new file mode 100644 index 000..59c61a3 --- /dev/null +++ b/libavcodec/libopenh264.c @@ -0,0 +1,62 @@ +/* + * OpenH264 shared utils + * Copyright (C) 2014 Martin Storsjo + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more
[FFmpeg-devel] [PATCH 2/2] libopenh264: Support building with the 1.6 release
This is cherry-picked from libav commit d825b1a5306576dcd0553b7d0d24a3a46ad92864. --- libavcodec/libopenh264dec.c | 2 ++ libavcodec/libopenh264enc.c | 26 -- 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/libavcodec/libopenh264dec.c b/libavcodec/libopenh264dec.c index 8388e4e..80dff4c 100644 --- a/libavcodec/libopenh264dec.c +++ b/libavcodec/libopenh264dec.c @@ -90,7 +90,9 @@ static av_cold int svc_decode_init(AVCodecContext *avctx) (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK, (void *)_function); (*s->decoder)->SetOption(s->decoder, DECODER_OPTION_TRACE_CALLBACK_CONTEXT, (void *)); +#if !OPENH264_VER_AT_LEAST(1, 6) param.eOutputColorFormat = videoFormatI420; +#endif param.eEcActiveIdc = ERROR_CON_DISABLE; param.sVideoProperty.eVideoBsType = VIDEO_BITSTREAM_DEFAULT; diff --git a/libavcodec/libopenh264enc.c b/libavcodec/libopenh264enc.c index d27fc41..07af31d 100644 --- a/libavcodec/libopenh264enc.c +++ b/libavcodec/libopenh264enc.c @@ -33,6 +33,10 @@ #include "internal.h" #include "libopenh264.h" +#if !OPENH264_VER_AT_LEAST(1, 6) +#define SM_SIZELIMITED_SLICE SM_DYN_SLICE +#endif + typedef struct SVCContext { const AVClass *av_class; ISVCEncoder *encoder; @@ -48,11 +52,20 @@ typedef struct SVCContext { #define OFFSET(x) offsetof(SVCContext, x) #define VE AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_ENCODING_PARAM static const AVOption options[] = { +#if OPENH264_VER_AT_LEAST(1, 6) +{ "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_FIXEDSLCNUM_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" }, +#else { "slice_mode", "set slice mode", OFFSET(slice_mode), AV_OPT_TYPE_INT, { .i64 = SM_AUTO_SLICE }, SM_SINGLE_SLICE, SM_RESERVED, VE, "slice_mode" }, +#endif { "fixed", "a fixed number of slices", 0, AV_OPT_TYPE_CONST, { .i64 = SM_FIXEDSLCNUM_SLICE }, 0, 0, VE, "slice_mode" }, +#if OPENH264_VER_AT_LEAST(1, 6) +{ "dyn", "Size limited (compatibility name)", 0, AV_OPT_TYPE_CONST, { .i64 = 
SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" }, +{ "sizelimited", "Size limited", 0, AV_OPT_TYPE_CONST, { .i64 = SM_SIZELIMITED_SLICE }, 0, 0, VE, "slice_mode" }, +#else { "rowmb", "one slice per row of macroblocks", 0, AV_OPT_TYPE_CONST, { .i64 = SM_ROWMB_SLICE }, 0, 0, VE, "slice_mode" }, { "auto", "automatic number of slices according to number of threads", 0, AV_OPT_TYPE_CONST, { .i64 = SM_AUTO_SLICE }, 0, 0, VE, "slice_mode" }, { "dyn", "Dynamic slicing", 0, AV_OPT_TYPE_CONST, { .i64 = SM_DYN_SLICE }, 0, 0, VE, "slice_mode" }, +#endif { "loopfilter", "enable loop filter", OFFSET(loopfilter), AV_OPT_TYPE_INT, { .i64 = 1 }, 0, 1, VE }, { "profile", "set profile restrictions", OFFSET(profile), AV_OPT_TYPE_STRING, { .str = NULL }, 0, 0, VE }, { "max_nal_size", "set maximum NAL size in bytes", OFFSET(max_nal_size), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, INT_MAX, VE }, @@ -159,15 +172,24 @@ FF_ENABLE_DEPRECATION_WARNINGS s->slice_mode = SM_FIXEDSLCNUM_SLICE; if (s->max_nal_size) -s->slice_mode = SM_DYN_SLICE; +s->slice_mode = SM_SIZELIMITED_SLICE; +#if OPENH264_VER_AT_LEAST(1, 6) +param.sSpatialLayers[0].sSliceArgument.uiSliceMode = s->slice_mode; +param.sSpatialLayers[0].sSliceArgument.uiSliceNum = avctx->slices; +#else param.sSpatialLayers[0].sSliceCfg.uiSliceMode = s->slice_mode; param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceNum = avctx->slices; +#endif -if (s->slice_mode == SM_DYN_SLICE) { +if (s->slice_mode == SM_SIZELIMITED_SLICE) { if (s->max_nal_size){ param.uiMaxNalSize = s->max_nal_size; +#if OPENH264_VER_AT_LEAST(1, 6) +param.sSpatialLayers[0].sSliceArgument.uiSliceSizeConstraint = s->max_nal_size; +#else param.sSpatialLayers[0].sSliceCfg.sSliceArgument.uiSliceSizeConstraint = s->max_nal_size; +#endif } else { av_log(avctx, AV_LOG_ERROR, "Invalid -max_nal_size, " "specify a valid max_nal_size to use -slice_mode dyn\n"); -- 2.7.4 (Apple Git-66) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks
On Thu, 19 Jan 2017, Michael Niedermayer wrote:

> On Wed, Jan 18, 2017 at 11:45:08PM +0200, Martin Storsjö wrote:
>> This work is sponsored by, and copyright, Google.
>>
>> This is more in line with how it will be extended for more bitdepths.
>> ---
>>  libavcodec/arm/vp9dsp_init_arm.c | 24 +---
>>  1 file changed, 9 insertions(+), 15 deletions(-)
>
> fate passes with this patchset under qemu arm

Pushed, thanks!

// Martin
[FFmpeg-devel] [PATCH 5/8] aarch64: vp9dsp: Restructure the bpp checks
This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. --- libavcodec/aarch64/vp9dsp_init_aarch64.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c index 0bc200e..7b50540 100644 --- a/libavcodec/aarch64/vp9dsp_init_aarch64.c +++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c @@ -96,13 +96,10 @@ define_8tap_2d_funcs(16) define_8tap_2d_funcs(8) define_8tap_2d_funcs(4) -static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - #define init_fpel(idx1, idx2, sz, type, suffix) \ dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \ dsp->mc[idx1][FILTER_8TAP_REGULAR][idx2][0][0] = \ @@ -173,13 +170,10 @@ define_itxfm(idct, idct, 32); define_itxfm(iwht, iwht, 4); -static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - if (have_neon(cpu_flags)) { #define init_itxfm(tx, sz) \ dsp->itxfm_add[tx][DCT_DCT] = ff_vp9_idct_idct_##sz##_add_neon; \ @@ -219,13 +213,10 @@ define_loop_filters(48, 16); define_loop_filters(84, 16); define_loop_filters(88, 16); -static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - if (have_neon(cpu_flags)) { dsp->loop_filter_8[0][1] = ff_vp9_loop_filter_v_4_8_neon; dsp->loop_filter_8[0][0] = ff_vp9_loop_filter_h_4_8_neon; @@ -250,7 +241,10 @@ static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp) av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp) { -vp9dsp_mc_init_aarch64(dsp, bpp); 
-vp9dsp_loopfilter_init_aarch64(dsp, bpp); -vp9dsp_itxfm_init_aarch64(dsp, bpp); +if (bpp != 8) +return; + +vp9dsp_mc_init_aarch64(dsp); +vp9dsp_loopfilter_init_aarch64(dsp); +vp9dsp_itxfm_init_aarch64(dsp); } -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 8/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers on
aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                            ARM    AArch64
vp9_loop_filter_h_4_8_10bpp_neon:         213.2      172.6
vp9_loop_filter_h_8_8_10bpp_neon:         281.2      244.2
vp9_loop_filter_h_16_8_10bpp_neon:        657.0      444.5
vp9_loop_filter_h_16_16_10bpp_neon:      1280.4      877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:  397.7      358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:  465.7      429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:  465.7      428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:  533.7      499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:  271.5      244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:  330.0      305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:  329.0      306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:  386.0      365.0
vp9_loop_filter_v_4_8_10bpp_neon:         150.0      115.2
vp9_loop_filter_v_8_8_10bpp_neon:         209.0      175.5
vp9_loop_filter_v_16_8_10bpp_neon:        492.7      345.2
vp9_loop_filter_v_16_16_10bpp_neon:       951.0      682.7

This is significantly faster than the ARM version in almost all cases
except for the mix2 functions.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.
--- libavcodec/aarch64/Makefile| 1 + .../aarch64/vp9dsp_init_16bpp_aarch64_template.c | 62 ++ libavcodec/aarch64/vp9lpf_16bpp_neon.S | 873 + 3 files changed, 936 insertions(+) create mode 100644 libavcodec/aarch64/vp9lpf_16bpp_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index 715cc6f..37666b4 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -44,6 +44,7 @@ NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \ aarch64/vp9itxfm_neon.o \ + aarch64/vp9lpf_16bpp_neon.o \ aarch64/vp9lpf_neon.o \ aarch64/vp9mc_16bpp_neon.o \ aarch64/vp9mc_neon.o diff --git a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c index 0e86b02..d5649f7 100644 --- a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c +++ b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c @@ -203,8 +203,70 @@ static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp) } } +#define define_loop_filter(dir, wd, size, bpp) \ +void ff_vp9_loop_filter_##dir##_##wd##_##size##_##bpp##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H) + +#define define_loop_filters(wd, size, bpp) \ +define_loop_filter(h, wd, size, bpp); \ +define_loop_filter(v, wd, size, bpp) + +define_loop_filters(4, 8, BPP); +define_loop_filters(8, 8, BPP); +define_loop_filters(16, 8, BPP); + +define_loop_filters(16, 16, BPP); + +define_loop_filters(44, 16, BPP); +define_loop_filters(48, 16, BPP); +define_loop_filters(84, 16, BPP); +define_loop_filters(88, 16, BPP); + +static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp) +{ +int cpu_flags = av_get_cpu_flags(); + +if (have_neon(cpu_flags)) { +#define init_lpf_func_8(idx1, idx2, dir, wd, bpp) \ +dsp->loop_filter_8[idx1][idx2] = ff_vp9_loop_filter_##dir##_##wd##_8_##bpp##_neon + +#define 
init_lpf_func_16(idx, dir, bpp) \ +dsp->loop_filter_16[idx] = ff_vp9_loop_filter_##dir##_16_16_##bpp##_neon + +#define init_lpf_func_mix2(idx1, idx2, idx3, dir, wd, bpp) \ +dsp->loop_filter_mix2[idx1][idx2][idx3] = ff_vp9_loop_filter_##dir##_##wd##_16_##bpp##_neon + +#define init_lpf_funcs_8_wd(idx, wd, bpp) \ +init_lpf_func_8(idx, 0, h, wd, bpp); \ +init_lpf_func_8(idx, 1, v, wd, bpp) + +#define init_lpf_funcs_16(bpp) \ +init_lpf_func_16(0, h, bpp); \ +init_lpf_func_16(1, v, bpp) + +#define init_lpf_funcs_mix2_wd(idx1, idx2, wd, bpp) \ +init_lpf_func_mix2(idx1, idx2, 0, h, wd, bpp); \ +init_lpf_func_mix2(idx1, idx2, 1, v, wd, bpp) + +#define init_lpf_funcs_8(bpp)\ +init_lpf_funcs_8_wd(0, 4, bpp); \ +init_lpf_funcs_8_wd(1, 8, bpp); \ +init_lpf_funcs_8_wd(2, 16, bpp) + +#define init_lpf_funcs_mix2(bpp) \ +init_lpf_funcs_mix2_wd(0, 0, 44, bpp); \ +init_lpf_funcs_mix2_wd(0, 1, 48, bpp); \ +init_lpf_funcs_mix2_wd(1, 0, 84, bpp); \ +init_lpf_funcs_mix2_wd(1, 1, 88, bpp) + +init_lpf_funcs_8(BPP); +init_lpf_funcs_16(BPP); +init_lpf_funcs_mix2(BPP); +} +} +
[FFmpeg-devel] [PATCH 2/8] arm: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google.

The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.

Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.

For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.

For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.

Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
vp9_avg4_10bpp_neon:                         2.25   2.44   3.05   2.16
vp9_avg8_10bpp_neon:                         3.66   8.48   3.86   3.50
vp9_avg16_10bpp_neon:                        3.39   8.26   3.37   2.72
vp9_avg32_10bpp_neon:                        4.03  10.20   4.07   3.42
vp9_avg64_10bpp_neon:                        4.15  10.01   4.13   3.70
vp9_avg_8tap_smooth_4h_10bpp_neon:           3.38   6.22   3.41   4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon:          3.89   6.39   4.30   5.32
vp9_avg_8tap_smooth_4v_10bpp_neon:           5.32   9.73   6.34   7.31
vp9_avg_8tap_smooth_8h_10bpp_neon:           4.45   9.40   4.68   6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon:          4.64   8.91   5.44   6.47
vp9_avg_8tap_smooth_8v_10bpp_neon:           6.44  13.42   8.68   8.79
vp9_avg_8tap_smooth_64h_10bpp_neon:          4.66   9.02   4.84   7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon:         4.61   9.14   4.92   7.10
vp9_avg_8tap_smooth_64v_10bpp_neon:          6.90  14.13   9.57  10.41
vp9_put4_10bpp_neon:                         1.33   1.46   2.09   1.33
vp9_put8_10bpp_neon:                         1.57   3.42   1.83   1.84
vp9_put16_10bpp_neon:                        1.55   4.78   2.17   1.89
vp9_put32_10bpp_neon:                        2.06   5.35   2.14   2.30
vp9_put64_10bpp_neon:                        3.00   2.41   1.95   1.66
vp9_put_8tap_smooth_4h_10bpp_neon:           3.19   5.81   3.31   4.63
vp9_put_8tap_smooth_4hv_10bpp_neon:          3.86   6.22   4.32   5.21
vp9_put_8tap_smooth_4v_10bpp_neon:           5.40   9.77   6.08   7.21
vp9_put_8tap_smooth_8h_10bpp_neon:           4.22   8.41   4.46   6.63
vp9_put_8tap_smooth_8hv_10bpp_neon:          4.56   8.51   5.39   6.25
vp9_put_8tap_smooth_8v_10bpp_neon:           6.60  12.43   8.17   8.89
vp9_put_8tap_smooth_64h_10bpp_neon:          4.41   8.59   4.54   7.49
vp9_put_8tap_smooth_64hv_10bpp_neon:         4.43   8.58   5.34   6.63
vp9_put_8tap_smooth_64v_10bpp_neon:          7.26  13.92   9.27  10.92

For the larger 8tap filters, the speedup vs C code is around 4-14x.
---
 libavcodec/arm/Makefile                         |   5 +-
 libavcodec/arm/vp9dsp_init.h                    |  29 ++
 libavcodec/arm/vp9dsp_init_10bpp_arm.c          |  23 +
 libavcodec/arm/vp9dsp_init_12bpp_arm.c          |  23 +
 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c | 147 ++
 libavcodec/arm/vp9dsp_init_arm.c                |   9 +-
 libavcodec/arm/vp9mc_16bpp_neon.S               | 615
 7 files changed, 849 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/arm/vp9dsp_init.h
 create mode 100644 libavcodec/arm/vp9dsp_init_10bpp_arm.c
 create mode 100644 libavcodec/arm/vp9dsp_init_12bpp_arm.c
 create mode 100644 libavcodec/arm/vp9dsp_init_16bpp_arm_template.c
 create mode 100644 libavcodec/arm/vp9mc_16bpp_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 7f18daa..fb35d25 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -44,7 +44,9 @@ OBJS-$(CONFIG_MLP_DECODER)             += arm/mlpdsp_init_arm.o
 OBJS-$(CONFIG_RV40_DECODER)            += arm/rv40dsp_init_arm.o
 OBJS-$(CONFIG_VORBIS_DECODER)          += arm/vorbisdsp_init_arm.o
 OBJS-$(CONFIG_VP6_DECODER)             += arm/vp6dsp_init_arm.o
-OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_arm.o
+OBJS-$(CONFIG_VP9_DECODER)             += arm/vp9dsp_init_10bpp_arm.o   \
+                                          arm/vp9dsp_init_12bpp_arm.o   \
+                                          arm/vp9dsp_init_arm.o

 # ARMv5 optimizations
@@ -142,4 +144,5 @@ NEON-OBJS-$(CONFIG_VORBIS_DECODER)     += arm/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP6_DECODER)        += arm/vp6dsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)        += arm/vp9itxfm_neon.o           \
                                           arm/vp9lpf_neon.o             \
+                                          arm/vp9mc_16bpp_neon.o        \
                                           arm/vp9mc_neon.o
diff --git a/libavcodec/arm/vp9dsp_init.h b/libavcodec/arm/vp9dsp_init.h
new
file mode 100644 index 000..0dc1c2d --- /dev/null +++
[FFmpeg-devel] [PATCH 3/8] arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google. This is structured similarly to the 8 bit version. In the 8 bit version, the coefficients are 16 bits, and intermediates are 32 bits. Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit content, the intermediates also fit in 32 bits, but for all other transforms (4x4 for 12 bit content, and 8x8 and larger for both 10 and 12 bit) the intermediates are 64 bit. For the existing 8 bit case, the 8x8 transform fit all coefficients in registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8 transform also has to be done in slices of 4 pixels (just as 16x16 and 32x32 for 8 bit). The slice width also shrinks from 4 elements to 2 elements in parallel for the 16x16 and 32x32 cases. The 16 bit coefficients from idct_coeffs and similar tables also need to be lenghtened to 32 bit in order to be used in multiplication with vectors with 32 bit elements. This leads to the fixed coefficient vectors needing more space, leading to more cases where they have to be reloaded within the transform (in iadst16). This technically would need testing in checkasm for subpartitions in increments of 2, but that slows down normal checkasm runs excessively. 
Examples of relative speedup compared to the C version, from checkasm:

                                          Cortex A7      A8      A9     A53
vp9_inv_adst_adst_4x4_sub4_add_10_neon:        4.83   11.36    5.22    6.77
vp9_inv_adst_adst_8x8_sub8_add_10_neon:        4.12    7.60    4.06    4.84
vp9_inv_adst_adst_16x16_sub16_add_10_neon:     3.93    8.16    4.52    5.35
vp9_inv_dct_dct_4x4_sub1_add_10_neon:          1.36    2.57    1.41    1.61
vp9_inv_dct_dct_4x4_sub4_add_10_neon:          4.24    8.66    5.06    5.81
vp9_inv_dct_dct_8x8_sub1_add_10_neon:          2.63    4.18    1.68    2.87
vp9_inv_dct_dct_8x8_sub4_add_10_neon:          4.52    9.47    4.24    5.39
vp9_inv_dct_dct_8x8_sub8_add_10_neon:          3.45    7.34    3.45    4.30
vp9_inv_dct_dct_16x16_sub1_add_10_neon:        3.56    6.21    2.47    4.32
vp9_inv_dct_dct_16x16_sub2_add_10_neon:        5.68   12.73    5.28    7.07
vp9_inv_dct_dct_16x16_sub8_add_10_neon:        4.42    9.28    4.24    5.45
vp9_inv_dct_dct_16x16_sub16_add_10_neon:       3.41    7.29    3.35    4.19
vp9_inv_dct_dct_32x32_sub1_add_10_neon:        4.52    8.35    3.83    6.40
vp9_inv_dct_dct_32x32_sub2_add_10_neon:        5.86   13.19    6.14    7.04
vp9_inv_dct_dct_32x32_sub16_add_10_neon:       4.29    8.11    4.59    5.06
vp9_inv_dct_dct_32x32_sub32_add_10_neon:       3.31    5.70    3.56    3.84
vp9_inv_wht_wht_4x4_sub4_add_10_neon:          1.89    2.80    1.82    1.97

The speedup compared to the C functions is around 1.3 to 7x for the full transforms, even higher for the smaller subpartitions.
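[Editor's aside, not part of the patch: the widening the commit message describes can be sketched in scalar C. The helper name `mul_round_shift` is made up; 11585 is the 14 bit fixed-point cos(pi/4) constant used by the vp9 idct.]

```c
#include <stdint.h>

/* Scalar sketch of why the 10/12 bit transforms need 64 bit
 * intermediates: a 32 bit coefficient times a 14 bit fixed-point
 * transform constant can exceed 32 bits, so the product is computed
 * in 64 bit and rounded back down with the usual (x + 8192) >> 14
 * rounding shift. */
static int32_t mul_round_shift(int32_t a, int32_t coeff)
{
    int64_t t = (int64_t)a * coeff;        /* 64 bit intermediate */
    return (int32_t)((t + (1 << 13)) >> 14);
}
```

In the NEON code this corresponds to widening multiplies on 32 bit vectors followed by rounding narrowing shifts.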
--- libavcodec/arm/Makefile |3 +- libavcodec/arm/vp9dsp_init_16bpp_arm_template.c | 47 + libavcodec/arm/vp9itxfm_16bpp_neon.S| 1515 +++ 3 files changed, 1564 insertions(+), 1 deletion(-) create mode 100644 libavcodec/arm/vp9itxfm_16bpp_neon.S diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile index fb35d25..856c154 100644 --- a/libavcodec/arm/Makefile +++ b/libavcodec/arm/Makefile @@ -142,7 +142,8 @@ NEON-OBJS-$(CONFIG_RV40_DECODER) += arm/rv34dsp_neon.o\ arm/rv40dsp_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o -NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o \ +NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_16bpp_neon.o \ + arm/vp9itxfm_neon.o \ arm/vp9lpf_neon.o \ arm/vp9mc_16bpp_neon.o\ arm/vp9mc_neon.o diff --git a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c index 05efd29..95f2bbc 100644 --- a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c +++ b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c @@ -141,7 +141,54 @@ static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp) } } +#define define_itxfm2(type_a, type_b, sz, bpp) \ +void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_##bpp##_neon(uint8_t *_dst,\ + ptrdiff_t stride, \ + int16_t *_block, int eob) +#define define_itxfm(type_a, type_b, sz, bpp) define_itxfm2(type_a, type_b, sz, bpp) + +#define define_itxfm_funcs(sz, bpp) \ +define_itxfm(idct, idct, sz, bpp); \ +define_itxfm(iadst, idct, sz, bpp); \ +define_itxfm(idct, iadst, sz, bpp); \ +define_itxfm(iadst, iadst, sz, bpp) + +define_itxfm_funcs(4, BPP); +define_itxfm_funcs(8, BPP); +define_itxfm_funcs(16, BPP);
[FFmpeg-devel] [PATCH 6/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:

                                          ARM     AArch64
vp9_avg4_10bpp_neon:                     35.7        30.7
vp9_avg8_10bpp_neon:                     93.5        84.7
vp9_avg16_10bpp_neon:                   324.4       296.6
vp9_avg32_10bpp_neon:                  1236.5      1148.2
vp9_avg64_10bpp_neon:                  4639.6      4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon:      130.0       128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon:     440.0       440.5
vp9_avg_8tap_smooth_4v_10bpp_neon:      114.0       105.5
vp9_avg_8tap_smooth_8h_10bpp_neon:      327.0       314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon:     918.7       865.4
vp9_avg_8tap_smooth_8v_10bpp_neon:      330.0       300.2
vp9_avg_8tap_smooth_16h_10bpp_neon:    1187.5      1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon:   2663.1      2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon:    1107.4      1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon:   17754.6     17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon:  33285.2     33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon:   16066.9     16048.6
vp9_put4_10bpp_neon:                     25.5        21.7
vp9_put8_10bpp_neon:                     56.0        52.0
vp9_put16_10bpp_neon/armv8:             183.0       163.1
vp9_put32_10bpp_neon/armv8:             678.6       563.1
vp9_put64_10bpp_neon/armv8:            2679.9      2195.8
vp9_put_8tap_smooth_4h_10bpp_neon:      120.0       118.0
vp9_put_8tap_smooth_4hv_10bpp_neon:     435.2       435.0
vp9_put_8tap_smooth_4v_10bpp_neon:      107.0        98.2
vp9_put_8tap_smooth_8h_10bpp_neon:      303.0       290.0
vp9_put_8tap_smooth_8hv_10bpp_neon:     893.7       828.7
vp9_put_8tap_smooth_8v_10bpp_neon:      305.5       263.5
vp9_put_8tap_smooth_16h_10bpp_neon:    1089.1      1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon:   2578.8      2452.4
vp9_put_8tap_smooth_16v_10bpp_neon:    1009.5       933.5
vp9_put_8tap_smooth_64h_10bpp_neon:   16223.4     15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon:  32153.0     31016.2
vp9_put_8tap_smooth_64v_10bpp_neon:   14516.5     13748.1

These are generally about as fast as the corresponding ARM routines on the same CPU (at least on
the A53), in most cases marginally faster. The speedup vs C code is around 4-9x. --- libavcodec/aarch64/Makefile| 5 +- libavcodec/aarch64/vp9dsp_init.h | 29 + libavcodec/aarch64/vp9dsp_init_10bpp_aarch64.c | 23 + libavcodec/aarch64/vp9dsp_init_12bpp_aarch64.c | 23 + .../aarch64/vp9dsp_init_16bpp_aarch64_template.c | 163 ++ libavcodec/aarch64/vp9dsp_init_aarch64.c | 9 +- libavcodec/aarch64/vp9mc_16bpp_neon.S | 631 + 7 files changed, 881 insertions(+), 2 deletions(-) create mode 100644 libavcodec/aarch64/vp9dsp_init.h create mode 100644 libavcodec/aarch64/vp9dsp_init_10bpp_aarch64.c create mode 100644 libavcodec/aarch64/vp9dsp_init_12bpp_aarch64.c create mode 100644 libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c create mode 100644 libavcodec/aarch64/vp9mc_16bpp_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index 5593863..0766e90 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -15,7 +15,9 @@ OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_init.o OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o OBJS-$(CONFIG_VC1DSP) += aarch64/vc1dsp_init_aarch64.o OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_init.o -OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9dsp_init_aarch64.o +OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9dsp_init_10bpp_aarch64.o \ + aarch64/vp9dsp_init_12bpp_aarch64.o \ + aarch64/vp9dsp_init_aarch64.o # ARMv8 optimizations @@ -42,4 +44,5 @@ NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o \ aarch64/vp9lpf_neon.o \ + aarch64/vp9mc_16bpp_neon.o \ aarch64/vp9mc_neon.o diff --git a/libavcodec/aarch64/vp9dsp_init.h b/libavcodec/aarch64/vp9dsp_init.h new file mode 100644 index 000..9df1752 --- /dev/null +++ b/libavcodec/aarch64/vp9dsp_init.h @@ -0,0 +1,29 @@ +/* + * Copyright (c) 2017 Google Inc. + * + * This file is part of FFmpeg. 
+ * + * FFmpeg is free software; you can redistribute it and/or
[FFmpeg-devel] [PATCH 7/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2.

Examples of runtimes vs the 32 bit version, on a Cortex A53:

                                               ARM     AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon:      111.0       109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon:      914.0       733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon:  5184.0      3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon:         65.0        65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon:        100.0        96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon:        111.0       119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon:        618.0       494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon:      295.1       284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:     2303.2      1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon:     2984.8      2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:    3890.0      2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon:     1044.4      1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:        1.7      9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon:   18531.3     12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   24470.7     16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon:         83.0        79.7

The larger transforms are significantly faster than the corresponding ARM versions. The speedup vs C code is smaller than in 32 bit mode, probably because the 64 bit intermediates in the C code can be expressed more efficiently in aarch64.
--- libavcodec/aarch64/Makefile|3 +- .../aarch64/vp9dsp_init_16bpp_aarch64_template.c | 47 + libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 1517 3 files changed, 1566 insertions(+), 1 deletion(-) create mode 100644 libavcodec/aarch64/vp9itxfm_16bpp_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index 0766e90..715cc6f 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -42,7 +42,8 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o # decoders/encoders NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o -NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o \ +NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \ + aarch64/vp9itxfm_neon.o \ aarch64/vp9lpf_neon.o \ aarch64/vp9mc_16bpp_neon.o \ aarch64/vp9mc_neon.o diff --git a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c index 4719ea3..0e86b02 100644 --- a/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c +++ b/libavcodec/aarch64/vp9dsp_init_16bpp_aarch64_template.c @@ -157,7 +157,54 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp) } } +#define define_itxfm2(type_a, type_b, sz, bpp) \ +void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_##bpp##_neon(uint8_t *_dst,\ + ptrdiff_t stride, \ + int16_t *_block, int eob) +#define define_itxfm(type_a, type_b, sz, bpp) define_itxfm2(type_a, type_b, sz, bpp) + +#define define_itxfm_funcs(sz, bpp) \ +define_itxfm(idct, idct, sz, bpp); \ +define_itxfm(iadst, idct, sz, bpp); \ +define_itxfm(idct, iadst, sz, bpp); \ +define_itxfm(iadst, iadst, sz, bpp) + +define_itxfm_funcs(4, BPP); +define_itxfm_funcs(8, BPP); +define_itxfm_funcs(16, BPP); +define_itxfm(idct, idct, 32, BPP); +define_itxfm(iwht, iwht, 4, BPP); + + +static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp) +{ +int cpu_flags = av_get_cpu_flags(); + +if 
(have_neon(cpu_flags)) { +#define init_itxfm2(tx, sz, bpp) \ +dsp->itxfm_add[tx][DCT_DCT] = ff_vp9_idct_idct_##sz##_add_##bpp##_neon; \ +dsp->itxfm_add[tx][DCT_ADST] = ff_vp9_iadst_idct_##sz##_add_##bpp##_neon; \ +dsp->itxfm_add[tx][ADST_DCT] = ff_vp9_idct_iadst_##sz##_add_##bpp##_neon; \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_##bpp##_neon +#define init_itxfm(tx, sz, bpp) init_itxfm2(tx, sz, bpp) + +#define init_idct2(tx, nm, bpp) \ +dsp->itxfm_add[tx][DCT_DCT] = \ +dsp->itxfm_add[tx][ADST_DCT] = \ +dsp->itxfm_add[tx][DCT_ADST] = \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_##bpp##_neon +#define init_idct(tx, nm, bpp) init_idct2(tx, nm, bpp) + +init_itxfm(TX_4X4, 4x4, BPP); +init_itxfm(TX_8X8, 8x8, BPP); +
[FFmpeg-devel] [PATCH 4/8] arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is pretty similar to the 8 bpp version, but in some senses simpler. All input pixels are 16 bits, and all intermediates also fit in 16 bits, so there's no lengthening/narrowing in the filter at all.

For the full 16 pixel wide filter, we can only process 4 pixels at a time (using an implementation very similar to the one for 8 bpp), but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with a different implementation of the core filter.

Examples of relative speedup compared to the C version, from checkasm:

                                          Cortex A7      A8      A9     A53
vp9_loop_filter_h_4_8_10bpp_neon:              1.83    2.16    1.40    2.09
vp9_loop_filter_h_8_8_10bpp_neon:              1.39    1.67    1.24    1.70
vp9_loop_filter_h_16_8_10bpp_neon:             1.56    1.47    1.10    1.81
vp9_loop_filter_h_16_16_10bpp_neon:            1.94    1.69    1.33    2.24
vp9_loop_filter_mix2_h_44_16_10bpp_neon:       2.01    2.27    1.67    2.39
vp9_loop_filter_mix2_h_48_16_10bpp_neon:       1.84    2.06    1.45    2.19
vp9_loop_filter_mix2_h_84_16_10bpp_neon:       1.89    2.20    1.47    2.29
vp9_loop_filter_mix2_h_88_16_10bpp_neon:       1.69    2.12    1.47    2.08
vp9_loop_filter_mix2_v_44_16_10bpp_neon:       3.16    3.98    2.50    4.05
vp9_loop_filter_mix2_v_48_16_10bpp_neon:       2.84    3.64    2.25    3.77
vp9_loop_filter_mix2_v_84_16_10bpp_neon:       2.65    3.45    2.16    3.54
vp9_loop_filter_mix2_v_88_16_10bpp_neon:       2.55    3.30    2.16    3.55
vp9_loop_filter_v_4_8_10bpp_neon:              2.85    3.97    2.24    3.68
vp9_loop_filter_v_8_8_10bpp_neon:              2.27    3.19    1.96    3.08
vp9_loop_filter_v_16_8_10bpp_neon:             3.42    2.74    2.26    4.40
vp9_loop_filter_v_16_16_10bpp_neon:            2.86    2.44    1.93    3.88

The speedup vs C code measured in checkasm is around 1.1-4x. These numbers are quite inconclusive though, since the checkasm test runs multiple filterings on top of each other, so later rounds might end up with different codepaths (different decisions on which filter to apply, based on input pixel differences). Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 2-4x.
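[Editor's aside, not part of the patch: the "all intermediates fit in 16 bits" claim can be checked with a little worst-case arithmetic. The expression below follows the rough shape of the vp9 core filter term, 3 * (q0 - p0) plus a clipped p1 - q1; the function is illustrative only.]

```c
#include <stdint.h>

/* Even at 12 bpp, the worst-case core filter term stays inside the
 * int16_t range, since each pixel difference is at most +/-4095,
 * which is why the 10/12 bit loop filter needs no widening. */
enum { MAX12 = (1 << 12) - 1 };    /* largest 12 bit sample value, 4095 */

static int worst_core_term(void)
{
    return 3 * MAX12 + MAX12;      /* 16380, comfortably below 32767 */
}
```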
--- libavcodec/arm/Makefile |1 + libavcodec/arm/vp9dsp_init_16bpp_arm_template.c | 62 ++ libavcodec/arm/vp9lpf_16bpp_neon.S | 1044 +++ 3 files changed, 1107 insertions(+) create mode 100644 libavcodec/arm/vp9lpf_16bpp_neon.S diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile index 856c154..1eeac54 100644 --- a/libavcodec/arm/Makefile +++ b/libavcodec/arm/Makefile @@ -144,6 +144,7 @@ NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_16bpp_neon.o \ arm/vp9itxfm_neon.o \ + arm/vp9lpf_16bpp_neon.o \ arm/vp9lpf_neon.o \ arm/vp9mc_16bpp_neon.o\ arm/vp9mc_neon.o diff --git a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c index 95f2bbc..3620535 100644 --- a/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c +++ b/libavcodec/arm/vp9dsp_init_16bpp_arm_template.c @@ -187,8 +187,70 @@ static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp) } } +#define define_loop_filter(dir, wd, size, bpp) \ +void ff_vp9_loop_filter_##dir##_##wd##_##size##_##bpp##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H) + +#define define_loop_filters(wd, size, bpp) \ +define_loop_filter(h, wd, size, bpp); \ +define_loop_filter(v, wd, size, bpp) + +define_loop_filters(4, 8, BPP); +define_loop_filters(8, 8, BPP); +define_loop_filters(16, 8, BPP); + +define_loop_filters(16, 16, BPP); + +define_loop_filters(44, 16, BPP); +define_loop_filters(48, 16, BPP); +define_loop_filters(84, 16, BPP); +define_loop_filters(88, 16, BPP); + +static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp) +{ +int cpu_flags = av_get_cpu_flags(); + +if (have_neon(cpu_flags)) { +#define init_lpf_func_8(idx1, idx2, dir, wd, bpp) \ +dsp->loop_filter_8[idx1][idx2] = ff_vp9_loop_filter_##dir##_##wd##_8_##bpp##_neon + +#define init_lpf_func_16(idx, dir, bpp) \ +dsp->loop_filter_16[idx] = ff_vp9_loop_filter_##dir##_16_16_##bpp##_neon + +#define 
init_lpf_func_mix2(idx1, idx2, idx3, dir, wd, bpp) \ +dsp->loop_filter_mix2[idx1][idx2][idx3] = ff_vp9_loop_filter_##dir##_##wd##_16_##bpp##_neon + +#define init_lpf_funcs_8_wd(idx, wd, bpp) \ +init_lpf_func_8(idx, 0, h, wd, bpp); \ +init_lpf_func_8(idx, 1, v, wd, bpp) + +#define init_lpf_funcs_16(bpp) \ +init_lpf_func_16(0, h, bpp); \ +init_lpf_func_16(1, v, bpp) + +#define
[FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks
This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. --- libavcodec/arm/vp9dsp_init_arm.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c index 05e50d7..0b76eb1 100644 --- a/libavcodec/arm/vp9dsp_init_arm.c +++ b/libavcodec/arm/vp9dsp_init_arm.c @@ -94,13 +94,10 @@ define_8tap_2d_funcs(8) define_8tap_2d_funcs(4) -static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - if (have_neon(cpu_flags)) { #define init_fpel(idx1, idx2, sz, type) \ dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \ @@ -160,13 +157,10 @@ define_itxfm(idct, idct, 32); define_itxfm(iwht, iwht, 4); -static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - if (have_neon(cpu_flags)) { #define init_itxfm(tx, sz) \ dsp->itxfm_add[tx][DCT_DCT] = ff_vp9_idct_idct_##sz##_add_neon; \ @@ -218,13 +212,10 @@ lf_mix_fns(4, 8) lf_mix_fns(8, 4) lf_mix_fns(8, 8) -static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp) { int cpu_flags = av_get_cpu_flags(); -if (bpp != 8) -return; - if (have_neon(cpu_flags)) { dsp->loop_filter_8[0][1] = ff_vp9_loop_filter_v_4_8_neon; dsp->loop_filter_8[0][0] = ff_vp9_loop_filter_h_4_8_neon; @@ -249,7 +240,10 @@ static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp, int bpp) av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp) { -vp9dsp_mc_init_arm(dsp, bpp); -vp9dsp_loopfilter_init_arm(dsp, bpp); -vp9dsp_itxfm_init_arm(dsp, bpp); +if (bpp != 8) +return; + +vp9dsp_mc_init_arm(dsp); +vp9dsp_loopfilter_init_arm(dsp); +vp9dsp_itxfm_init_arm(dsp); } -- 
2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters
On Mon, 14 Nov 2016, Ronald S. Bultje wrote: Hi, On Mon, Nov 14, 2016 at 5:32 AM, Martin Storsjö <mar...@martin.st> wrote: Make them aligned, to allow efficient access to them from simd. This is an adapted cherry-pick from libav commit a4cfcddcb0f76e837d5abc06840c2b26c0e8aefc. --- libavcodec/vp9dsp.c | 56 +++ libavcodec/vp9dsp.h | 3 +++ libavcodec/vp9dsp_template.c | 63 +++--- -- 3 files changed, 63 insertions(+), 59 deletions(-) OK. Do I need to queue them up? Yes, that'd be appreciated. I thought they would be merged automagically from Libav... In principle, but the merging is quite far behind at the moment. I've included the commit hashes of all included commits to make it clear which commits can be no-oped in future merges at least. Also for the record, it has been tested on linux, iOS and with the MSVC toolchain (in wine). // Martin
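[Editor's aside, not part of the thread: the change under review is about letting SIMD code load whole filter rows with aligned vector loads. A minimal sketch of the idea in C11 follows; FFmpeg itself uses its DECLARE_ALIGNED macro rather than alignas, and the coefficients below are made up, not actual vp9 subpel taps.]

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical 8-tap subpel filter row; aligning it to 16 bytes lets
 * NEON/SSE fetch the whole row with a single aligned 128 bit load
 * instead of an unaligned (or split) access. */
alignas(16) static const int16_t subpel_row[8] = {
    -2, 6, -12, 127, 15, -8, 3, -1    /* made-up taps summing to 128 */
};
```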
[FFmpeg-devel] [PATCH 5/9] arm: vp9: Add NEON loop filters
This work is sponsored by, and copyright, Google.

The implementation tries to have smart handling of cases where no pixels need the full filtering for the 8/16 width filters, skipping both calculation and writeback of the unmodified pixels in those cases. The actual effect of this is hard to test with checkasm though, since it tests the full filtering, and the benefit depends on how many filtered blocks use the shortcut.

Examples of relative speedup compared to the C version, from checkasm:

                                    Cortex A7      A8      A9     A53
vp9_loop_filter_h_4_8_neon:              2.72    2.68    1.78    3.15
vp9_loop_filter_h_8_8_neon:              2.36    2.38    1.70    2.91
vp9_loop_filter_h_16_8_neon:             1.80    1.89    1.45    2.01
vp9_loop_filter_h_16_16_neon:            2.81    2.78    2.18    3.16
vp9_loop_filter_mix2_h_44_16_neon:       2.65    2.67    1.93    3.05
vp9_loop_filter_mix2_h_48_16_neon:       2.46    2.38    1.81    2.85
vp9_loop_filter_mix2_h_84_16_neon:       2.50    2.41    1.73    2.85
vp9_loop_filter_mix2_h_88_16_neon:       2.77    2.66    1.96    3.23
vp9_loop_filter_mix2_v_44_16_neon:       4.28    4.46    3.22    5.70
vp9_loop_filter_mix2_v_48_16_neon:       3.92    4.00    3.03    5.19
vp9_loop_filter_mix2_v_84_16_neon:       3.97    4.31    2.98    5.33
vp9_loop_filter_mix2_v_88_16_neon:       3.91    4.19    3.06    5.18
vp9_loop_filter_v_4_8_neon:              4.53    4.47    3.31    6.05
vp9_loop_filter_v_8_8_neon:              3.58    3.99    2.92    5.17
vp9_loop_filter_v_16_8_neon:             3.40    3.50    2.81    4.68
vp9_loop_filter_v_16_16_neon:            4.66    4.41    3.74    6.02

The speedup vs C code is around 2-6x. The numbers are quite inconclusive though, since the checkasm test runs multiple filterings on top of each other, so later rounds might end up with different codepaths (different decisions on which filter to apply, based on input pixel differences). Disabling the early-exit in the asm doesn't give a fair comparison either though, since the C code only does the necessary calculations for each row. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-9x.

This is pretty similar in runtime to the corresponding routines in libvpx. (This is comparing vpx_lpf_vertical_16_neon, vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon and vp9_loop_filter_v_16_16_neon - note that the naming of horizontal and vertical is flipped between the libraries.)

In order to have stable, comparable numbers, the early exits in both asm versions were disabled, forcing the full filtering codepath.

                                         Cortex A7      A8      A9     A53
vp9_loop_filter_h_16_8_neon:                 597.2   472.0   482.4   415.0
libvpx vpx_lpf_vertical_16_neon:             626.0   464.5   470.7   445.0
vp9_loop_filter_v_16_8_neon:                 500.2   422.5   429.7   295.0
libvpx vpx_lpf_horizontal_edge_8_neon:       586.5   414.5   415.6   383.2
vp9_loop_filter_v_16_16_neon:                905.0   784.7   791.5   546.0
libvpx vpx_lpf_horizontal_edge_16_neon:     1060.2   751.7   743.5   685.2

Our version is consistently faster on A7 and A53, marginally slower on A8, and sometimes faster, sometimes slower on A9 (marginally slower in all three tests in this particular test run).

This is an adapted cherry-pick from libav commit dd299a2d6d4d1af9528ed35a8131c35946be5973.
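[Editor's aside, not part of the patch: the early-exit idea the commit message describes can be sketched in scalar C. The threshold comparisons below follow the vp9 normal-filter mask, with E the edge limit and I the interior limit passed to the assembly functions; the function name and layout are illustrative, not the actual code.]

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */

/* px points at the 8 pixels across the edge: p3 p2 p1 p0 | q0 q1 q2 q3.
 * Returns nonzero when the edge needs filtering at all; a SIMD
 * implementation computes this mask for the whole block first and can
 * skip both the filtering and the writeback when it is all zero. */
static int filter_needed(const uint8_t *px, int E, int I)
{
    return abs(px[0] - px[1]) <= I &&   /* |p3 - p2| */
           abs(px[1] - px[2]) <= I &&   /* |p2 - p1| */
           abs(px[2] - px[3]) <= I &&   /* |p1 - p0| */
           abs(px[5] - px[4]) <= I &&   /* |q1 - q0| */
           abs(px[6] - px[5]) <= I &&   /* |q2 - q1| */
           abs(px[7] - px[6]) <= I &&   /* |q3 - q2| */
           abs(px[3] - px[4]) * 2 + abs(px[2] - px[5]) / 2 <= E;
}
```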
--- libavcodec/arm/Makefile | 1 + libavcodec/arm/vp9dsp_init_arm.c | 60 +++ libavcodec/arm/vp9lpf_neon.S | 770 +++ 3 files changed, 831 insertions(+) create mode 100644 libavcodec/arm/vp9lpf_neon.S diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile index 8602e28..7f18daa 100644 --- a/libavcodec/arm/Makefile +++ b/libavcodec/arm/Makefile @@ -141,4 +141,5 @@ NEON-OBJS-$(CONFIG_RV40_DECODER) += arm/rv34dsp_neon.o\ NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o \ + arm/vp9lpf_neon.o \ arm/vp9mc_neon.o diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c index 1d4eabf..05e50d7 100644 --- a/libavcodec/arm/vp9dsp_init_arm.c +++ b/libavcodec/arm/vp9dsp_init_arm.c @@ -188,8 +188,68 @@ static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp) } } +#define define_loop_filter(dir, wd, size) \ +void ff_vp9_loop_filter_##dir##_##wd##_##size##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H) + +#define define_loop_filters(wd, size) \ +define_loop_filter(h, wd, size); \ +define_loop_filter(v, wd, size) + +define_loop_filters(4, 8); +define_loop_filters(8, 8); +define_loop_filters(16, 8); +define_loop_filters(16, 16); + +#define lf_mix_fn(dir, wd1, wd2, stridea)
[FFmpeg-devel] [PATCH 9/9] aarch64: vp9: Implement NEON loop filters
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; thanks to the larger number of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for 4, 8 and 16 pixel loop filters (and the 4/8 mixed versions as well).

For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:

                                      ARM     AArch64
vp9_loop_filter_h_4_8_neon:         144.0       127.2
vp9_loop_filter_h_8_8_neon:         207.0       182.5
vp9_loop_filter_h_16_8_neon:        415.0       328.7
vp9_loop_filter_h_16_16_neon:       672.0       558.6
vp9_loop_filter_mix2_h_44_16_neon:  302.0       203.5
vp9_loop_filter_mix2_h_48_16_neon:  365.0       305.2
vp9_loop_filter_mix2_h_84_16_neon:  365.0       305.2
vp9_loop_filter_mix2_h_88_16_neon:  376.0       305.2
vp9_loop_filter_mix2_v_44_16_neon:  193.2       128.2
vp9_loop_filter_mix2_v_48_16_neon:  246.7       218.4
vp9_loop_filter_mix2_v_84_16_neon:  248.0       218.5
vp9_loop_filter_mix2_v_88_16_neon:  302.0       218.2
vp9_loop_filter_v_4_8_neon:          89.0        88.7
vp9_loop_filter_v_8_8_neon:         141.0       137.7
vp9_loop_filter_v_16_8_neon:        295.0       272.7
vp9_loop_filter_v_16_16_neon:       546.0       453.7

The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalents, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch):

                                A57 gcc-5.3    neon
loop_filter_h_4_8_neon:               256.6    93.4
loop_filter_h_8_8_neon:               307.3   139.1
loop_filter_h_16_8_neon:              340.1   254.1
loop_filter_h_16_16_neon:             827.0   407.9
loop_filter_mix2_h_44_16_neon:        524.5   155.4
loop_filter_mix2_h_48_16_neon:        644.5   173.3
loop_filter_mix2_h_84_16_neon:        630.5   222.0
loop_filter_mix2_h_88_16_neon:        697.3   222.0
loop_filter_mix2_v_44_16_neon:        598.5   100.6
loop_filter_mix2_v_48_16_neon:        651.5   127.0
loop_filter_mix2_v_84_16_neon:        591.5   167.1
loop_filter_mix2_v_88_16_neon:        855.1   166.7
loop_filter_v_4_8_neon:               271.7    65.3
loop_filter_v_8_8_neon:               312.5   106.9
loop_filter_v_16_8_neon:              473.3   206.5
loop_filter_v_16_16_neon:             976.1   327.8

The speed-up compared to the C functions is 2.5 to 6, and the Cortex A57 is again 30-50% faster than the Cortex A53.

This is an adapted cherry-pick from libav commits 9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and 31756abe29eb039a11c59a42cb12e0cc2aef3b97.
---
 libavcodec/aarch64/Makefile              |    1 +
 libavcodec/aarch64/vp9dsp_init_aarch64.c |   48 ++
 libavcodec/aarch64/vp9lpf_neon.S         | 1355 ++
 3 files changed, 1404 insertions(+)
 create mode 100644 libavcodec/aarch64/vp9lpf_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index e8a7f7a..b7bb898 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -43,4 +43,5 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)        += aarch64/mpegaudiodsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_neon.o \
+                                           aarch64/vp9lpf_neon.o \
                                            aarch64/vp9mc_neon.o
diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c
index 2848608..7e34375 100644
--- a/libavcodec/aarch64/vp9dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c
@@ -201,8 +201,56 @@ static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp,
int bpp) } } +#define define_loop_filter(dir, wd, len) \ +void ff_vp9_loop_filter_##dir##_##wd##_##len##_neon(uint8_t *dst, ptrdiff_t stride, int E, int I, int H) + +#define define_loop_filters(wd, len) \ +define_loop_filter(h, wd, len); \ +define_loop_filter(v, wd, len) + +define_loop_filters(4, 8); +define_loop_filters(8, 8); +define_loop_filters(16, 8); + +define_loop_filters(16, 16); + +define_loop_filters(44, 16); +define_loop_filters(48, 16); +define_loop_filters(84, 16); +define_loop_filters(88, 16); + +static av_cold void vp9dsp_loopfilter_init_aarch64(VP9DSPContext *dsp, int bpp) +{ +int cpu_flags = av_get_cpu_flags(); + +if (bpp != 8) +
[FFmpeg-devel] [PATCH 7/9] aarch64: vp9: Add NEON optimizations of VP9 MC functions
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:

                                     ARM     AArch64
vp9_avg4_neon:                      27.2        23.7
vp9_avg8_neon:                      56.5        54.7
vp9_avg16_neon:                    169.9       167.4
vp9_avg32_neon:                    585.8       585.2
vp9_avg64_neon:                   2460.3      2294.7
vp9_avg_8tap_smooth_4h_neon:       132.7       125.2
vp9_avg_8tap_smooth_4hv_neon:      478.8       442.0
vp9_avg_8tap_smooth_4v_neon:       126.0        93.7
vp9_avg_8tap_smooth_8h_neon:       241.7       234.2
vp9_avg_8tap_smooth_8hv_neon:      690.9       646.5
vp9_avg_8tap_smooth_8v_neon:       245.0       205.5
vp9_avg_8tap_smooth_64h_neon:    11273.2     11280.1
vp9_avg_8tap_smooth_64hv_neon:   22980.6     22184.1
vp9_avg_8tap_smooth_64v_neon:    11549.7     10781.1
vp9_put4_neon:                      18.0        17.2
vp9_put8_neon:                      40.2        37.7
vp9_put16_neon:                     97.4        99.5
vp9_put32_neon/armv8:              346.0       307.4
vp9_put64_neon/armv8:             1319.0      1107.5
vp9_put_8tap_smooth_4h_neon:       126.7       118.2
vp9_put_8tap_smooth_4hv_neon:      465.7       434.0
vp9_put_8tap_smooth_4v_neon:       113.0        86.5
vp9_put_8tap_smooth_8h_neon:       229.7       221.6
vp9_put_8tap_smooth_8hv_neon:      658.9       621.3
vp9_put_8tap_smooth_8v_neon:       215.0       187.5
vp9_put_8tap_smooth_64h_neon:    10636.7     10627.8
vp9_put_8tap_smooth_64hv_neon:   21076.8     21026.9
vp9_put_8tap_smooth_64v_neon:     9635.0      9632.4

These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for the larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit.

This is an adapted cherry-pick from libav commit 383d96aa2229f644d9bd77b821ed3a309da5e9fc.
--- libavcodec/aarch64/Makefile | 2 + libavcodec/aarch64/vp9dsp_init_aarch64.c | 156 +++ libavcodec/aarch64/vp9mc_neon.S | 676 +++ libavcodec/vp9.c | 8 +- libavcodec/vp9dsp.c | 1 + libavcodec/vp9dsp.h | 1 + 6 files changed, 840 insertions(+), 4 deletions(-) create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c create mode 100644 libavcodec/aarch64/vp9mc_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index c3df887..e7db95e 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -16,6 +16,7 @@ OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_init.o OBJS-$(CONFIG_RV40_DECODER) += aarch64/rv40dsp_init_aarch64.o OBJS-$(CONFIG_VC1DSP) += aarch64/vc1dsp_init_aarch64.o OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_init.o +OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9dsp_init_aarch64.o # ARMv8 optimizations @@ -41,3 +42,4 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o # decoders/encoders NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o +NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9mc_neon.o diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c new file mode 100644 index 000..4adf363 --- /dev/null +++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c @@ -0,0 +1,156 @@ +/* + * Copyright (c) 2016 Google Inc. + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include + +#include "libavutil/attributes.h" +#include "libavutil/aarch64/cpu.h" +#include "libavcodec/vp9dsp.h" + +#define
[FFmpeg-devel] [PATCH 3/9] arm: vp9: Add NEON optimizations of VP9 MC functions
This work is sponsored by, and copyright, Google.

The filter coefficients are signed values, where the product of the
multiplication with one individual filter coefficient doesn't overflow
a 16 bit signed value (the largest filter coefficient is 127). But when
the products are accumulated, the resulting sum can overflow the 16 bit
signed range. Instead of accumulating in 32 bit, we accumulate the
largest product (either index 3 or 4) last with a saturated addition.

(The VP8 MC asm does something similar, but slightly simpler, by
accumulating each half of the filter separately. In the VP9 MC filters,
each half of the filter can also overflow though, so the largest
component has to be handled individually.)

Examples of relative speedup compared to the C version, from checkasm:

                                Cortex    A7     A8     A9    A53
vp9_avg4_neon:                          1.71   1.15   1.42   1.49
vp9_avg8_neon:                          2.51   3.63   3.14   2.58
vp9_avg16_neon:                         2.95   6.76   3.01   2.84
vp9_avg32_neon:                         3.29   6.64   2.85   3.00
vp9_avg64_neon:                         3.47   6.67   3.14   2.80
vp9_avg_8tap_smooth_4h_neon:            3.22   4.73   2.76   4.67
vp9_avg_8tap_smooth_4hv_neon:           3.67   4.76   3.28   4.71
vp9_avg_8tap_smooth_4v_neon:            5.52   7.60   4.60   6.31
vp9_avg_8tap_smooth_8h_neon:            6.22   9.04   5.12   9.32
vp9_avg_8tap_smooth_8hv_neon:           6.38   8.21   5.72   8.17
vp9_avg_8tap_smooth_8v_neon:            9.22  12.66   8.15  11.10
vp9_avg_8tap_smooth_64h_neon:           7.02  10.23   5.54  11.58
vp9_avg_8tap_smooth_64hv_neon:          6.76   9.46   5.93   9.40
vp9_avg_8tap_smooth_64v_neon:          10.76  14.13   9.46  13.37
vp9_put4_neon:                          1.11   1.47   1.00   1.21
vp9_put8_neon:                          1.23   2.17   1.94   1.48
vp9_put16_neon:                         1.63   4.02   1.73   1.97
vp9_put32_neon:                         1.56   4.92   2.00   1.96
vp9_put64_neon:                         2.10   5.28   2.03   2.35
vp9_put_8tap_smooth_4h_neon:            3.11   4.35   2.63   4.35
vp9_put_8tap_smooth_4hv_neon:           3.67   4.69   3.25   4.71
vp9_put_8tap_smooth_4v_neon:            5.45   7.27   4.49   6.52
vp9_put_8tap_smooth_8h_neon:            5.97   8.18   4.81   8.56
vp9_put_8tap_smooth_8hv_neon:           6.39   7.90   5.64   8.15
vp9_put_8tap_smooth_8v_neon:            9.03  11.84   8.07  11.51
vp9_put_8tap_smooth_64h_neon:           6.78   9.48   4.88  10.89
vp9_put_8tap_smooth_64hv_neon:          6.99   8.87   5.94   9.56
vp9_put_8tap_smooth_64v_neon:          10.69  13.30   9.43  14.34

For the larger 8tap filters, the speedup vs C code is around 5-14x.

This is significantly faster than libvpx's implementation of the same
functions, at least when comparing the put_8tap_smooth_64 functions
(compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from
libvpx).

Absolute runtimes from checkasm:

                                Cortex      A7       A8       A9      A53
vp9_put_8tap_smooth_64h_neon:          20150.3  14489.4  19733.6  10863.7
libvpx vpx_convolve8_horiz_neon:       52623.3  19736.4  21907.7  25027.7
vp9_put_8tap_smooth_64v_neon:          14455.0  12303.9  13746.4   9628.9
libvpx vpx_convolve8_vert_neon:        42090.0  17706.2  17659.9  16941.2

Thus, on the A9, the horizontal filter is only marginally faster than
libvpx, while our version is significantly faster on the other cores,
and the vertical filter is significantly faster on all cores. The
difference is especially large on the A7. The libvpx implementation
does the accumulation in 32 bit, which probably explains most of the
differences.

This is an adapted cherry-pick from libav commits
ffbd1d2b0002576ef0d976a41ff959c635373fdc,
392caa65df3efa8b2d48a80f08a6af4892c61c08,
557c1675cf0e803b2fee43b4c8b58433842c84d0 and
11623217e3c9b859daee544e31acdd0821b61039.
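The overflow-avoidance scheme described above can be modelled in scalar C. Here `sat_add16` stands in for the NEON saturating add (`vqadd.s16`), and `filter8` is a hypothetical scalar stand-in for the asm (not the actual implementation): all taps except the largest one are accumulated in plain 16 bit, and the largest product is added last with saturation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the NEON saturating addition vqadd.s16. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Hypothetical scalar stand-in for the 8-tap filtering: accumulate all
 * taps except the largest one in 16 bit (each individual product fits,
 * since |coef| <= 127 and pixels are 8 bit), then add the largest
 * product last with a saturated addition. */
static int16_t filter8(const uint8_t *src, const int16_t *coef, int largest)
{
    int16_t sum = 0;
    for (int i = 0; i < 8; i++)
        if (i != largest)
            sum = (int16_t)(sum + src[i] * coef[i]);
    return sat_add16(sum, (int16_t)(src[largest] * coef[largest]));
}
```

With the mid-phase sharp filter { -4, 11, -23, 80, 80, -23, 11, -4 }, an adversarial input (255 under the positive taps, 0 under the negative ones) has the exact sum 255*(11+80+80+11) = 46410, which would wrap if accumulated naively in 16 bit; the saturated final addition clamps it to 32767, which the subsequent rounding shift and narrowing back to 8 bit then handle correctly.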
--- libavcodec/arm/Makefile | 2 + libavcodec/arm/vp9dsp_init_arm.c | 143 libavcodec/arm/vp9mc_neon.S | 709 +++ libavcodec/vp9.c | 20 +- libavcodec/vp9dsp.c | 1 + libavcodec/vp9dsp.h | 1 + 6 files changed, 872 insertions(+), 4 deletions(-) create mode 100644 libavcodec/arm/vp9dsp_init_arm.c create mode 100644 libavcodec/arm/vp9mc_neon.S diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile index a4ceca7..82b740b 100644 --- a/libavcodec/arm/Makefile +++ b/libavcodec/arm/Makefile @@ -44,6 +44,7 @@ OBJS-$(CONFIG_MLP_DECODER) += arm/mlpdsp_init_arm.o OBJS-$(CONFIG_RV40_DECODER)+= arm/rv40dsp_init_arm.o OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_init_arm.o OBJS-$(CONFIG_VP6_DECODER) += arm/vp6dsp_init_arm.o +OBJS-$(CONFIG_VP9_DECODER) += arm/vp9dsp_init_arm.o # ARMv5 optimizations @@ -139,3 +140,4 @@ NEON-OBJS-$(CONFIG_RV40_DECODER) += arm/rv34dsp_neon.o\ arm/rv40dsp_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP6_DECODER)
[FFmpeg-devel] [PATCH 6/9] aarch64: Add an offset parameter to the movrel macro
With Apple tools, the linker fails with errors like these, if the
offset is negative:

ld: in section __TEXT,__text reloc 8: symbol index out of range for architecture arm64

This is cherry-picked from libav commit
c44a8a3eabcd6acd2ba79f32ec8a432e6ebe552c.
---
 libavutil/aarch64/asm.S | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/libavutil/aarch64/asm.S b/libavutil/aarch64/asm.S
index ff34e7a..523b8c5 100644
--- a/libavutil/aarch64/asm.S
+++ b/libavutil/aarch64/asm.S
@@ -72,15 +72,21 @@ ELF     .size   \name, . - \name
 \name:
 .endm
 
-.macro  movrel rd, val
+.macro  movrel rd, val, offset=0
 #if CONFIG_PIC && defined(__APPLE__)
+.if \offset < 0
         adrp            \rd, \val@PAGE
         add             \rd, \rd, \val@PAGEOFF
+        sub             \rd, \rd, -(\offset)
+.else
+        adrp            \rd, \val+(\offset)@PAGE
+        add             \rd, \rd, \val+(\offset)@PAGEOFF
+.endif
 #elif CONFIG_PIC
-        adrp            \rd, \val
-        add             \rd, \rd, :lo12:\val
+        adrp            \rd, \val+\offset
+        add             \rd, \rd, :lo12:\val+\offset
 #else
-        ldr             \rd, =\val
+        ldr             \rd, =\val+\offset
 #endif
 .endm
-- 
2.7.4
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 4/9] arm: vp9: Add NEON itxfm routines
This work is sponsored by, and copyright, Google.

For the transforms up to 8x8, we can fit all the data (including
temporaries) in registers and just do a straightforward transform of
all the data. For 16x16, we do a transform of 4x16 pixels in 4 slices,
using a temporary buffer. For 32x32, we transform 4x32 pixels at a
time, in two steps of 4x16 pixels each.

Examples of relative speedup compared to the C version, from checkasm:

                                Cortex    A7     A8     A9    A53
vp9_inv_adst_adst_4x4_add_neon:         3.39   5.83   4.17   4.01
vp9_inv_adst_adst_8x8_add_neon:         3.79   4.86   4.23   3.98
vp9_inv_adst_adst_16x16_add_neon:       3.33   4.36   4.11   4.16
vp9_inv_dct_dct_4x4_add_neon:           4.06   6.16   4.59   4.46
vp9_inv_dct_dct_8x8_add_neon:           4.61   6.01   4.98   4.86
vp9_inv_dct_dct_16x16_add_neon:         3.35   3.44   3.36   3.79
vp9_inv_dct_dct_32x32_add_neon:         3.89   3.50   3.79   4.42
vp9_inv_wht_wht_4x4_add_neon:           3.22   5.13   3.53   3.77

Thus, the speedup vs C code is around 3-6x.

This is mostly marginally faster than the corresponding routines in
libvpx on most cores, tested with their 32x32 idct (compared to
vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's
favour since their version doesn't clear the input buffer like ours
does (although the effect of that on the total runtime probably is
negligible.)

                                Cortex      A7       A8       A9      A53
vp9_inv_dct_dct_32x32_add_neon:        18436.8  16874.1  14235.1  11988.9
libvpx vpx_idct32x32_1024_add_neon:    20789.0  13344.3  15049.9  13030.5

Only on the Cortex A8 is the libvpx function faster. On the other
cores, ours is slightly faster even though ours has the source block
clearing integrated.

This is an adapted cherry-pick from libav commits
a67ae67083151f2f9595a1f2d17b601da19b939e and
52d196fb30fb6628921b5f1b31e7bd11eb7e1d9a.
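The two-pass, sliced layout described above can be sketched in plain C (a simplified model, not the actual NEON code): each pass reads a few columns at a time, runs a 1-D transform on them, and stores the result transposed into a temp buffer, so the second pass can reuse the identical column-wise routine. The trivial butterfly here is a hypothetical stand-in for the real idct16/iadst16.

```c
#include <assert.h>
#include <stdint.h>

#define N     16
#define SLICE 4   /* columns handled per slice, as in the 4x16 passes above */

/* Hypothetical, trivial in-place 1-D "transform" standing in for the
 * real idct16/iadst16; kept simple so the pass structure is the point. */
static void transform_1d(int16_t *v)
{
    for (int i = 0; i < N / 2; i++) {
        int16_t a = v[i], b = v[N - 1 - i];
        v[i]         = a + b;
        v[N - 1 - i] = a - b;
    }
}

/* One pass: transform SLICE columns at a time and store them transposed,
 * so the next pass can run the same column-wise code on the result. */
static void transform_pass(int16_t in[N][N], int16_t out[N][N])
{
    int16_t col[N];
    for (int s = 0; s < N; s += SLICE)
        for (int x = s; x < s + SLICE; x++) {
            for (int y = 0; y < N; y++)
                col[y] = in[y][x];
            transform_1d(col);
            for (int y = 0; y < N; y++)
                out[x][y] = col[y];        /* transposed store */
        }
}

/* Full separable 2-D transform: vertical pass into a temp buffer, then
 * a second identical pass back into the block. */
static void transform_2d(int16_t block[N][N])
{
    int16_t tmp[N][N];
    transform_pass(block, tmp);
    transform_pass(tmp, block);
}
```

Because each pass stores its output transposed, applying the same pass twice yields the 2-D transform T·A·Tᵀ without ever needing a row-wise variant of the 1-D code.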
--- libavcodec/arm/Makefile |3 +- libavcodec/arm/vp9dsp_init_arm.c | 54 +- libavcodec/arm/vp9itxfm_neon.S | 1149 ++ 3 files changed, 1204 insertions(+), 2 deletions(-) create mode 100644 libavcodec/arm/vp9itxfm_neon.S diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile index 82b740b..8602e28 100644 --- a/libavcodec/arm/Makefile +++ b/libavcodec/arm/Makefile @@ -140,4 +140,5 @@ NEON-OBJS-$(CONFIG_RV40_DECODER) += arm/rv34dsp_neon.o\ arm/rv40dsp_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += arm/vorbisdsp_neon.o NEON-OBJS-$(CONFIG_VP6_DECODER)+= arm/vp6dsp_neon.o -NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9mc_neon.o +NEON-OBJS-$(CONFIG_VP9_DECODER)+= arm/vp9itxfm_neon.o \ + arm/vp9mc_neon.o diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c index bd0ac37..1d4eabf 100644 --- a/libavcodec/arm/vp9dsp_init_arm.c +++ b/libavcodec/arm/vp9dsp_init_arm.c @@ -94,7 +94,7 @@ define_8tap_2d_funcs(8) define_8tap_2d_funcs(4) -av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_mc_init_arm(VP9DSPContext *dsp, int bpp) { int cpu_flags = av_get_cpu_flags(); @@ -141,3 +141,55 @@ av_cold void ff_vp9dsp_init_arm(VP9DSPContext *dsp, int bpp) init_mc_funcs_dirs(4, 4); } } + +#define define_itxfm(type_a, type_b, sz) \ +void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_neon(uint8_t *_dst,\ + ptrdiff_t stride, \ + int16_t *_block, int eob) + +#define define_itxfm_funcs(sz) \ +define_itxfm(idct, idct, sz); \ +define_itxfm(iadst, idct, sz); \ +define_itxfm(idct, iadst, sz); \ +define_itxfm(iadst, iadst, sz) + +define_itxfm_funcs(4); +define_itxfm_funcs(8); +define_itxfm_funcs(16); +define_itxfm(idct, idct, 32); +define_itxfm(iwht, iwht, 4); + + +static av_cold void vp9dsp_itxfm_init_arm(VP9DSPContext *dsp, int bpp) +{ +int cpu_flags = av_get_cpu_flags(); + +if (bpp != 8) +return; + +if (have_neon(cpu_flags)) { +#define init_itxfm(tx, sz) \ +dsp->itxfm_add[tx][DCT_DCT] = ff_vp9_idct_idct_##sz##_add_neon; \ 
+dsp->itxfm_add[tx][DCT_ADST] = ff_vp9_iadst_idct_##sz##_add_neon; \ +dsp->itxfm_add[tx][ADST_DCT] = ff_vp9_idct_iadst_##sz##_add_neon; \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_neon + +#define init_idct(tx, nm) \ +dsp->itxfm_add[tx][DCT_DCT] = \ +dsp->itxfm_add[tx][ADST_DCT] = \ +dsp->itxfm_add[tx][DCT_ADST] = \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_neon + +init_itxfm(TX_4X4, 4x4); +init_itxfm(TX_8X8, 8x8); +init_itxfm(TX_16X16, 16x16); +init_idct(TX_32X32, idct_idct_32x32); +init_idct(4, iwht_iwht_4x4); +} +}
[FFmpeg-devel] [PATCH 2/9] arm: Clear the gp register alias at the end of functions
We reset .Lpic_gp to zero at the start of each function, which means
that the logic within movrelx for clearing gp when necessary will be
missed. This fixes using movrelx in different functions with a
different helper register.

This is cherry-picked from libav commit
824e8c284054f323f854892d1b4739239ed1fdc7.
---
 libavutil/arm/asm.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/libavutil/arm/asm.S b/libavutil/arm/asm.S
index e9b0bca..b0a6e50 100644
--- a/libavutil/arm/asm.S
+++ b/libavutil/arm/asm.S
@@ -77,6 +77,9 @@ ELF     .section .note.GNU-stack,"",%progbits @ Mark stack as non-executable
         put_pic         %(.Lpic_idx - 1)
         .noaltmacro
     .endif
+    .if .Lpic_gp
+        .unreq          gp
+    .endif
 ELF     .size   \name, . - \name
 FUNC    .endfunc
         .purgem endfunc
-- 
2.7.4
[FFmpeg-devel] [PATCH 8/9] aarch64: vp9: Add NEON itxfm routines
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; thanks to the larger amount of
registers available, we can do the 16x16 and 32x32 transforms in
slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x
compared to the 32 bit version.

The fact that aarch64 doesn't have the same d/q register aliasing
makes some of the macros quite a bit simpler as well.

Examples of runtimes vs the 32 bit version, on a Cortex A53:

                                      ARM   AArch64
vp9_inv_adst_adst_4x4_add_neon:      90.0      87.7
vp9_inv_adst_adst_8x8_add_neon:     400.0     354.7
vp9_inv_adst_adst_16x16_add_neon:  2526.5    1827.2
vp9_inv_dct_dct_4x4_add_neon:        74.0      72.7
vp9_inv_dct_dct_8x8_add_neon:       271.0     256.7
vp9_inv_dct_dct_16x16_add_neon:    1960.7    1372.7
vp9_inv_dct_dct_32x32_add_neon:   11988.9    8088.3
vp9_inv_wht_wht_4x4_add_neon:        63.0      57.7

The speedup vs C code (2-4x) is smaller than in the 32 bit case,
mostly because the C code ends up significantly faster (around 1.6x
faster, with GCC 5.4) when built for aarch64.

Examples of runtimes vs C on a Cortex A57 (for a slightly older
version of the patch):

                                  A57 gcc-5.3     neon
vp9_inv_adst_adst_4x4_add_neon:         152.2     60.0
vp9_inv_adst_adst_8x8_add_neon:         948.2    288.0
vp9_inv_adst_adst_16x16_add_neon:      4830.4   1380.5
vp9_inv_dct_dct_4x4_add_neon:           153.0     58.6
vp9_inv_dct_dct_8x8_add_neon:           789.2    180.2
vp9_inv_dct_dct_16x16_add_neon:        3639.6    917.1
vp9_inv_dct_dct_32x32_add_neon:       20462.1   4985.0
vp9_inv_wht_wht_4x4_add_neon:            91.0     49.8

The asm is around a factor 3-4 faster than C on the Cortex A57, and
around 30-50% faster on the A57 compared to the A53.

This is an adapted cherry-pick from libav commit
3c9546dfafcdfe8e7860aff9ebbf609318220f29.
--- libavcodec/aarch64/Makefile |3 +- libavcodec/aarch64/vp9dsp_init_aarch64.c | 54 +- libavcodec/aarch64/vp9itxfm_neon.S | 1116 ++ 3 files changed, 1171 insertions(+), 2 deletions(-) create mode 100644 libavcodec/aarch64/vp9itxfm_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index e7db95e..e8a7f7a 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -42,4 +42,5 @@ NEON-OBJS-$(CONFIG_MPEGAUDIODSP)+= aarch64/mpegaudiodsp_neon.o # decoders/encoders NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o -NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9mc_neon.o +NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_neon.o \ + aarch64/vp9mc_neon.o diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c index 4adf363..2848608 100644 --- a/libavcodec/aarch64/vp9dsp_init_aarch64.c +++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c @@ -96,7 +96,7 @@ define_8tap_2d_funcs(16) define_8tap_2d_funcs(8) define_8tap_2d_funcs(4) -av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp) +static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp) { int cpu_flags = av_get_cpu_flags(); @@ -154,3 +154,55 @@ av_cold void ff_vp9dsp_init_aarch64(VP9DSPContext *dsp, int bpp) init_mc_funcs_dirs(4, 4); } } + +#define define_itxfm(type_a, type_b, sz) \ +void ff_vp9_##type_a##_##type_b##_##sz##x##sz##_add_neon(uint8_t *_dst,\ + ptrdiff_t stride, \ + int16_t *_block, int eob) + +#define define_itxfm_funcs(sz) \ +define_itxfm(idct, idct, sz); \ +define_itxfm(iadst, idct, sz); \ +define_itxfm(idct, iadst, sz); \ +define_itxfm(iadst, iadst, sz) + +define_itxfm_funcs(4); +define_itxfm_funcs(8); +define_itxfm_funcs(16); +define_itxfm(idct, idct, 32); +define_itxfm(iwht, iwht, 4); + + +static av_cold void vp9dsp_itxfm_init_aarch64(VP9DSPContext *dsp, int bpp) +{ +int cpu_flags = av_get_cpu_flags(); + +if (bpp != 8) 
+return; + +if (have_neon(cpu_flags)) { +#define init_itxfm(tx, sz) \ +dsp->itxfm_add[tx][DCT_DCT] = ff_vp9_idct_idct_##sz##_add_neon; \ +dsp->itxfm_add[tx][DCT_ADST] = ff_vp9_iadst_idct_##sz##_add_neon; \ +dsp->itxfm_add[tx][ADST_DCT] = ff_vp9_idct_iadst_##sz##_add_neon; \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_iadst_iadst_##sz##_add_neon + +#define init_idct(tx, nm) \ +dsp->itxfm_add[tx][DCT_DCT] = \ +dsp->itxfm_add[tx][ADST_DCT] = \ +dsp->itxfm_add[tx][DCT_ADST] = \ +dsp->itxfm_add[tx][ADST_ADST] = ff_vp9_##nm##_add_neon + +init_itxfm(TX_4X4, 4x4); +init_itxfm(TX_8X8, 8x8); +init_itxfm(TX_16X16, 16x16); +
[FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters
Make them aligned, to allow efficient access to them from simd. This is an adapted cherry-pick from libav commit a4cfcddcb0f76e837d5abc06840c2b26c0e8aefc. --- libavcodec/vp9dsp.c | 56 +++ libavcodec/vp9dsp.h | 3 +++ libavcodec/vp9dsp_template.c | 63 +++- 3 files changed, 63 insertions(+), 59 deletions(-) diff --git a/libavcodec/vp9dsp.c b/libavcodec/vp9dsp.c index 54e77e2..6dd49c8 100644 --- a/libavcodec/vp9dsp.c +++ b/libavcodec/vp9dsp.c @@ -25,6 +25,62 @@ #include "libavutil/common.h" #include "vp9dsp.h" +const DECLARE_ALIGNED(16, int16_t, ff_vp9_subpel_filters)[3][16][8] = { +[FILTER_8TAP_REGULAR] = { +{ 0, 0, 0, 128, 0, 0, 0, 0 }, +{ 0, 1, -5, 126, 8, -3, 1, 0 }, +{ -1, 3, -10, 122, 18, -6, 2, 0 }, +{ -1, 4, -13, 118, 27, -9, 3, -1 }, +{ -1, 4, -16, 112, 37, -11, 4, -1 }, +{ -1, 5, -18, 105, 48, -14, 4, -1 }, +{ -1, 5, -19, 97, 58, -16, 5, -1 }, +{ -1, 6, -19, 88, 68, -18, 5, -1 }, +{ -1, 6, -19, 78, 78, -19, 6, -1 }, +{ -1, 5, -18, 68, 88, -19, 6, -1 }, +{ -1, 5, -16, 58, 97, -19, 5, -1 }, +{ -1, 4, -14, 48, 105, -18, 5, -1 }, +{ -1, 4, -11, 37, 112, -16, 4, -1 }, +{ -1, 3, -9, 27, 118, -13, 4, -1 }, +{ 0, 2, -6, 18, 122, -10, 3, -1 }, +{ 0, 1, -3, 8, 126, -5, 1, 0 }, +}, [FILTER_8TAP_SHARP] = { +{ 0, 0, 0, 128, 0, 0, 0, 0 }, +{ -1, 3, -7, 127, 8, -3, 1, 0 }, +{ -2, 5, -13, 125, 17, -6, 3, -1 }, +{ -3, 7, -17, 121, 27, -10, 5, -2 }, +{ -4, 9, -20, 115, 37, -13, 6, -2 }, +{ -4, 10, -23, 108, 48, -16, 8, -3 }, +{ -4, 10, -24, 100, 59, -19, 9, -3 }, +{ -4, 11, -24, 90, 70, -21, 10, -4 }, +{ -4, 11, -23, 80, 80, -23, 11, -4 }, +{ -4, 10, -21, 70, 90, -24, 11, -4 }, +{ -3, 9, -19, 59, 100, -24, 10, -4 }, +{ -3, 8, -16, 48, 108, -23, 10, -4 }, +{ -2, 6, -13, 37, 115, -20, 9, -4 }, +{ -2, 5, -10, 27, 121, -17, 7, -3 }, +{ -1, 3, -6, 17, 125, -13, 5, -2 }, +{ 0, 1, -3, 8, 127, -7, 3, -1 }, +}, [FILTER_8TAP_SMOOTH] = { +{ 0, 0, 0, 128, 0, 0, 0, 0 }, +{ -3, -1, 32, 64, 38, 1, -3, 0 }, +{ -2, -2, 29, 63, 41, 2, -3, 0 }, +{ -2, -2, 26, 63, 43, 4, -4, 0 }, +{ -2, -3, 24, 
62, 46, 5, -4, 0 }, +{ -2, -3, 21, 60, 49, 7, -4, 0 }, +{ -1, -4, 18, 59, 51, 9, -4, 0 }, +{ -1, -4, 16, 57, 53, 12, -4, -1 }, +{ -1, -4, 14, 55, 55, 14, -4, -1 }, +{ -1, -4, 12, 53, 57, 16, -4, -1 }, +{ 0, -4, 9, 51, 59, 18, -4, -1 }, +{ 0, -4, 7, 49, 60, 21, -3, -2 }, +{ 0, -4, 5, 46, 62, 24, -3, -2 }, +{ 0, -4, 4, 43, 63, 26, -2, -2 }, +{ 0, -3, 2, 41, 63, 29, -2, -2 }, +{ 0, -3, 1, 38, 64, 32, -1, -3 }, +} +}; + + av_cold void ff_vp9dsp_init(VP9DSPContext *dsp, int bpp, int bitexact) { if (bpp == 8) { diff --git a/libavcodec/vp9dsp.h b/libavcodec/vp9dsp.h index 733f5bf..cb43f5e 100644 --- a/libavcodec/vp9dsp.h +++ b/libavcodec/vp9dsp.h @@ -120,6 +120,9 @@ typedef struct VP9DSPContext { vp9_scaled_mc_func smc[5][4][2]; } VP9DSPContext; + +extern const int16_t ff_vp9_subpel_filters[3][16][8]; + void ff_vp9dsp_init(VP9DSPContext *dsp, int bpp, int bitexact); void ff_vp9dsp_init_8(VP9DSPContext *dsp); diff --git a/libavcodec/vp9dsp_template.c b/libavcodec/vp9dsp_template.c index 4d810fe..bb54561 100644 --- a/libavcodec/vp9dsp_template.c +++ b/libavcodec/vp9dsp_template.c @@ -1991,61 +1991,6 @@ copy_avg_fn(4) #endif /* BIT_DEPTH != 12 */ -static const int16_t vp9_subpel_filters[3][16][8] = { -[FILTER_8TAP_REGULAR] = { -{ 0, 0, 0, 128, 0, 0, 0, 0 }, -{ 0, 1, -5, 126, 8, -3, 1, 0 }, -{ -1, 3, -10, 122, 18, -6, 2, 0 }, -{ -1, 4, -13, 118, 27, -9, 3, -1 }, -{ -1, 4, -16, 112, 37, -11, 4, -1 }, -{ -1, 5, -18, 105, 48, -14, 4, -1 }, -{ -1, 5, -19, 97, 58, -16, 5, -1 }, -{ -1, 6, -19, 88, 68, -18, 5, -1 }, -{ -1, 6, -19, 78, 78, -19, 6, -1 }, -{ -1, 5, -18, 68, 88, -19, 6, -1 }, -{ -1, 5, -16, 58, 97, -19, 5, -1 }, -{ -1, 4, -14, 48, 105, -18, 5, -1 }, -{ -1, 4, -11, 37, 112, -16, 4, -1 }, -{ -1, 3, -9, 27, 118, -13, 4, -1 }, -{ 0, 2, -6, 18, 122, -10, 3, -1 }, -{ 0, 1, -3, 8, 126, -5, 1, 0 }, -}, [FILTER_8TAP_SHARP] = { -{ 0, 0, 0, 128, 0, 0, 0, 0 }, -{ -1, 3, -7, 127, 8, -3, 1, 0 }, -{ -2, 5, -13, 125, 17, -6, 3, -1 }, -{ -3, 7, -17, 121, 27, -10,
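The pattern introduced by this patch, one shared, aligned table with external linkage instead of a static copy per template instantiation, can be illustrated with a miniature C sketch. The `DECLARE_ALIGNED` expansion shown is a simplified gcc/clang-style assumption, not FFmpeg's actual (compiler-dependent) macro, and `example_subpel_phase` is a hypothetical name holding one phase of the regular filter from the table above:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for FFmpeg's DECLARE_ALIGNED (the real macro is
 * compiler-dependent); gcc/clang attribute syntax shown. */
#define DECLARE_ALIGNED(n, t, v) t __attribute__((aligned(n))) v

/* One shared, 16-byte aligned filter phase with external linkage,
 * so SIMD code can use aligned loads on it; values are one phase of
 * the FILTER_8TAP_REGULAR table above. */
const DECLARE_ALIGNED(16, int16_t, example_subpel_phase)[8] =
    { -1, 3, -10, 122, 18, -6, 2, 0 };
```

Exporting the table once (here, as `ff_vp9_subpel_filters` in the patch) also lets the asm DSP init files reference the very same coefficients instead of duplicating them.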
[FFmpeg-devel] [PATCH 01/13] aarch64: vp9: use alternative returns in the core loop filter function
From: Janne Grunau

Since aarch64 has enough free general purpose registers, use them to
branch to the appropriate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.

This is cherrypicked from libav commit
d7595de0b25e7064fd9e06dea5d0425536cef6dc.
---
 libavcodec/aarch64/vp9lpf_neon.S | 48 +++-
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index e727a4d..78aae61 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -410,15 +410,19 @@
 .endif
         // If no pixels needed flat8in nor flat8out, jump to a
         // writeout of the inner 4 pixels
-        cbz             x5,  7f
+        cbnz            x5,  1f
+        br              x14
+1:
         mov             x5,  v7.d[0]
 .ifc \sz, .16b
         mov             x6,  v7.d[1]
         orr             x5,  x5,  x6
 .endif
         // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
-        cbz             x5,  8f
+        cbnz            x5,  1f
+        br              x15
+1:
         // flat8out
         // This writes all outputs into v2-v17 (skipping v6 and v16).
 // If this part is skipped, the output is read from v21-v26 (which is the input
@@ -549,35 +553,24 @@ endfunc
 
 function vp9_loop_filter_8
         loop_filter     8,  .8b,  0,    v16, v17, v18, v19, v28, v29, v30, v31
-        mov             x5,  #0
         ret
 6:
-        mov             x5,  #6
-        ret
+        br              x13
 9:
         br              x10
 endfunc
 
 function vp9_loop_filter_8_16b_mix
         loop_filter     8,  .16b, 88,   v16, v17, v18, v19, v28, v29, v30, v31
-        mov             x5,  #0
         ret
 6:
-        mov             x5,  #6
-        ret
+        br              x13
 9:
         br              x10
 endfunc
 
 function vp9_loop_filter_16
         loop_filter     16, .8b,  0,    v8,  v9,  v10, v11, v12, v13, v14, v15
-        mov             x5,  #0
-        ret
-7:
-        mov             x5,  #7
-        ret
-8:
-        mov             x5,  #8
         ret
 9:
         ldp             d8,  d9,  [sp], 0x10
@@ -589,13 +582,6 @@ endfunc
 
 function vp9_loop_filter_16_16b
         loop_filter     16, .16b, 0,    v8,  v9,  v10, v11, v12, v13, v14, v15
-        mov             x5,  #0
-        ret
-7:
-        mov             x5,  #7
-        ret
-8:
-        mov             x5,  #8
         ret
 9:
         ldp             d8,  d9,  [sp], 0x10
@@ -614,11 +600,14 @@ endfunc
 .endm
 
 .macro loop_filter_8
+        // calculate alternative 'return' targets
+        adr             x13, 6f
         bl              vp9_loop_filter_8
-        cbnz            x5,  6f
 .endm
 
 .macro loop_filter_8_16b_mix mix
+        // calculate alternative 'return' targets
+        adr             x13, 6f
 .if \mix == 48
         mov             x11, #0x
 .elseif \mix == 84
@@ -627,21 +616,20 @@ endfunc
         mov             x11, #0x
 .endif
         bl              vp9_loop_filter_8_16b_mix
-        cbnz            x5,  6f
 .endm
 
 .macro loop_filter_16
+        // calculate alternative 'return' targets
+        adr             x14, 7f
+        adr             x15, 8f
         bl              vp9_loop_filter_16
-        cmp             x5,  7
-        b.gt            8f
-        b.eq            7f
 .endm
 
 .macro loop_filter_16_16b
+        // calculate alternative 'return' targets
+        adr             x14, 7f
+        adr             x15, 8f
         bl              vp9_loop_filter_16_16b
-        cmp             x5,  7
-        b.gt            8f
-        b.eq            7f
 .endm
-- 
2.7.4
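The control-flow change above can be modelled loosely in C (a scalar analogy, not the actual asm): instead of the callee returning a status code that every call site compares and branches on, the call site hands the alternative "return" target to the callee up front, here as a function pointer standing in for the address computed with `adr`. All names in this sketch are hypothetical.

```c
#include <assert.h>

static int writeout_taken;

/* Stand-ins for the different writeout paths. */
static void writeout_full(void)   { writeout_taken = 16; }
static void writeout_inner6(void) { writeout_taken = 6;  }

/* Old scheme: return a code, every caller does the compare + branch. */
static int filter_old(int flat8out)
{
    return flat8out ? 0 : 6;
}

/* New scheme: the caller passes the alternative return target and the
 * filter branches to the right writeout directly, removing the
 * per-call-site compare. */
static void filter_new(int flat8out, void (*alt6)(void))
{
    if (!flat8out) {
        alt6();
        return;
    }
    writeout_full();
}
```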
[FFmpeg-devel] [PATCH 10/13] aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32
This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with the
same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass, we
reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1

vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

I.e. in general a very minor overhead for the full subpartition case
due to the additional cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.

This is cherrypicked from libav commits
cad42fadcd2c2ae1b3676bb398844a1f521a2d7b and
a0c443a3980dc22eb02b067ac4cb9ffa2f9b04d2.
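The skipping logic can be sketched in scalar C (a simplified model, not the asm): each first-pass slice only needs to run if the block's eob exceeds the smallest eob at which that slice could contain a nonzero coefficient in scan order. The threshold values below are the ones the patch adds for 32x32 (the `min_eob_idct_idct_32` table in the diff), where each first-pass slice is 8 columns wide; `slices_to_run` is a hypothetical helper name.

```c
#include <assert.h>

/* Minimum eob at which each 8-wide slice of a 32x32 block can hold a
 * nonzero coefficient (values from the min_eob_idct_idct_32 table). */
static const int min_eob_32[4] = { 0, 34, 135, 336 };

/* Number of first-pass slices that actually need a transform; the
 * remaining slices contribute only zeros and can be skipped, with
 * zeros written to (or assumed in) the temp buffer instead. */
static int slices_to_run(int eob, const int *min_eob, int nslices)
{
    int n = 1;                   /* slice 0 always runs (eob >= 1) */
    while (n < nslices && eob > min_eob[n])
        n++;
    return n;
}
```

With e.g. eob = 34, only the first 8x32 slice is transformed, which matches the sub2/sub4/sub8 rows above all running at roughly the same, much lower, cost.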
--- libavcodec/aarch64/vp9itxfm_neon.S | 61 ++ 1 file changed, 56 insertions(+), 5 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index e5fc612..82f1f41 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -588,6 +588,9 @@ endfunc .macro store i, dst, inc st1 {v\i\().8h}, [\dst], \inc .endm +.macro movi_v i, size, imm +moviv\i\()\size, \imm +.endm .macro load_clear i, src, inc ld1 {v\i\().8h}, [\src] st1 {v2.8h}, [\src], \inc @@ -596,9 +599,8 @@ endfunc // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it, // transpose into a horizontal 16x8 slice and store. // x0 = dst (temp buffer) -// x1 = unused +// x1 = slice offset // x2 = src -// x3 = slice offset // x9 = input stride .macro itxfm16_1d_funcs txfm function \txfm\()16_1d_8x16_pass1_neon @@ -616,14 +618,14 @@ function \txfm\()16_1d_8x16_pass1_neon transpose_8x8H v24, v25, v26, v27, v28, v29, v30, v31, v2, v3 // Store the transposed 8x8 blocks horizontally. -cmp x3, #8 +cmp x1, #8 b.eq1f .irp i, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31 store \i, x0, #16 .endr ret 1: -// Special case: For the last input column (x3 == 8), +// Special case: For the last input column (x1 == 8), // which would be stored as the last row in the temp buffer, // don't store the first 8x8 block, but keep it in registers // for the first slice of the second pass (where it is the @@ -751,13 +753,36 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .irp i, 0, 8 add x0, sp, #(\i*32) +.ifc \txfm1\()_\txfm2,idct_idct +.if \i == 8 +cmp w3, #38 +b.le1f +.endif +.endif +mov x1, #\i add x2, x6, #(\i*2) -mov x3, #\i bl \txfm1\()16_1d_8x16_pass1_neon .endr .ifc \txfm1\()_\txfm2,iadst_idct ld1 {v0.8h,v1.8h}, [x10] .endif + +.ifc \txfm1\()_\txfm2,idct_idct +b 3f +1: +// Set v24-v31 to zero, for the in-register passthrough of +// coefficients to pass 2. 
Since we only do two slices, this can +// only ever happen for the second slice. So we only need to store +// zeros to the temp buffer for the second half of the buffer. +// Move x0 to the second half, and use x9 == 32 as increment. +add x0, x0, #16 +.irp i, 24, 25, 26, 27, 28, 29, 30, 31 +movi_v \i, .16b, #0 +st1 {v24.8h}, [x0], x9 +.endr +3: +.endif + .irp i, 0, 8 add x0, x4, #(\i) mov x1, x5 @@ -1073,12 +1098,17 @@ function idct32_1d_8x32_pass2_neon ret endfunc +const min_eob_idct_idct_32, align=4 +.short 0, 34, 135, 336 +endconst + function ff_vp9_idct_idct_32x32_add_neon, export=1 cmp w3, #1 b.eqidct32x32_dc_add_neon movrel x10, idct_coeffs add x11, x10, #32 +movrel x12, min_eob_idct_idct_32, 2 mov x15, x30 @@ -1099,9 +1129,30 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 .irp i, 0, 8, 16, 24
[FFmpeg-devel] [PATCH 09/13] arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32
This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with the
same runtime:

                              Cortex      A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3

By skipping individual 4x16 or 4x32 pixel slices in the first pass, we
reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1

vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1

I.e. in general a very minor overhead for the full subpartition case
due to the additional loads and cmps, but a significant speedup for
the cases when we only need to process a small part of the actual
input data.

In common VP9 content in a few inspected clips, 70-90% of the
non-dc-only 16x16 and 32x32 IDCTs only have nonzero coefficients in
the upper left 8x8 or 16x16 subpartitions respectively.

This is cherrypicked from libav commit
9c8bc74c2b40537b0997f646c87c008042d788c2.
--- libavcodec/arm/vp9itxfm_neon.S | 75 +- tests/checkasm/vp9dsp.c| 6 ++-- 2 files changed, 70 insertions(+), 11 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index d5b8495..25f6dde 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -659,9 +659,8 @@ endfunc @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it, @ transpose into a horizontal 16x4 slice and store. @ r0 = dst (temp buffer) -@ r1 = unused +@ r1 = slice offset @ r2 = src -@ r3 = slice offset function \txfm\()16_1d_4x16_pass1_neon mov r12, #32 vmov.s16q2, #0 @@ -678,14 +677,14 @@ function \txfm\()16_1d_4x16_pass1_neon transpose16_q_4x_4x4 q8, q9, q10, q11, q12, q13, q14, q15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, d28, d29, d30, d31 @ Store the transposed 4x4 blocks horizontally. -cmp r3, #12 +cmp r1, #12 beq 1f .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31 vst1.16 {d\i}, [r0,:64]! .endr bx lr 1: -@ Special case: For the last input column (r3 == 12), +@ Special case: For the last input column (r1 == 12), @ which would be stored as the last row in the temp buffer, @ don't store the first 4x4 block, but keep it in registers @ for the first slice of the second pass (where it is the @@ -781,15 +780,22 @@ endfunc itxfm16_1d_funcs idct itxfm16_1d_funcs iadst +@ This is the minimum eob value for each subpartition, in increments of 4 +const min_eob_idct_idct_16, align=4 +.short 0, 10, 38, 89 +endconst + .macro itxfm_func16x16 txfm1, txfm2 function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .ifc \txfm1\()_\txfm2,idct_idct cmp r3, #1 beq idct16x16_dc_add_neon .endif -push{r4-r7,lr} +push{r4-r8,lr} .ifnc \txfm1\()_\txfm2,idct_idct vpush {q4-q7} +.else +movrel r8, min_eob_idct_idct_16 + 2 .endif @ Align the stack, allocate a temp buffer @@ -810,10 +816,36 @@ A and r7, sp, #15 .irp i, 0, 4, 8, 12 add r0, sp, #(\i*32) +.ifc \txfm1\()_\txfm2,idct_idct +.if \i > 0 
+ldrh_post r1, r8, #2 +cmp r3, r1 +it le +movle r1, #(16 - \i)/4 +ble 1f +.endif +.endif +mov r1, #\i add r2, r6, #(\i*2) -mov r3, #\i bl \txfm1\()16_1d_4x16_pass1_neon .endr + +.ifc \txfm1\()_\txfm2,idct_idct
[FFmpeg-devel] [PATCH 06/13] arm: vp9itxfm: Rename a macro parameter to fit better
Since the same parameter is used for both input and output, the name
inout is more fitting.

This matches the naming used below in the dmbutterfly macro.

This is cherrypicked from libav commit
79566ec8c77969d5f9be533de04b1349834cca62.
---
 libavcodec/arm/vp9itxfm_neon.S | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index b4cc592..0097f5f 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -125,16 +125,16 @@ endconst
         vmlal.s16       \out4, \in4, \coef1
 .endm
 
-@ in1 = (in1 * coef1 - in2 * coef2 + (1 << 13)) >> 14
-@ in2 = (in1 * coef2 + in2 * coef1 + (1 << 13)) >> 14
-@ in are 2 d registers, tmp are 2 q registers
-.macro mbutterfly in1, in2, coef1, coef2, tmp1, tmp2, neg=0
-        mbutterfly_l    \tmp1, \tmp2, \in1, \in2, \coef1, \coef2
+@ inout1 = (inout1 * coef1 - inout2 * coef2 + (1 << 13)) >> 14
+@ inout2 = (inout1 * coef2 + inout2 * coef1 + (1 << 13)) >> 14
+@ inout are 2 d registers, tmp are 2 q registers
+.macro mbutterfly inout1, inout2, coef1, coef2, tmp1, tmp2, neg=0
+        mbutterfly_l    \tmp1, \tmp2, \inout1, \inout2, \coef1, \coef2
 .if \neg > 0
         vneg.s32        \tmp2, \tmp2
 .endif
-        vrshrn.s32      \in1,  \tmp1,  #14
-        vrshrn.s32      \in2,  \tmp2,  #14
+        vrshrn.s32      \inout1, \tmp1, #14
+        vrshrn.s32      \inout2, \tmp2, #14
 .endm
 
 @ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14
-- 
2.7.4
[FFmpeg-devel] [PATCH 05/13] arm/aarch64: vp9itxfm: Fix indentation of macro arguments
This is cherrypicked from libav commit 721bc37522c5c1d6a8c3cea5e9c3fcde8d256c05. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 libavcodec/arm/vp9itxfm_neon.S | 8 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3535c7b..d5165bf 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -969,14 +969,14 @@ function idct32_1d_8x32_pass1_neon st1 {v7.8h}, [x0], #16 .endm -store_rev 31, 23 -store_rev 30, 22 -store_rev 29, 21 -store_rev 28, 20 -store_rev 27, 19 -store_rev 26, 18 -store_rev 25, 17 -store_rev 24, 16 +store_rev 31, 23 +store_rev 30, 22 +store_rev 29, 21 +store_rev 28, 20 +store_rev 27, 19 +store_rev 26, 18 +store_rev 25, 17 +store_rev 24, 16 .purgem store_rev ret endfunc diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index d7a2654..b4cc592 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -1017,10 +1017,10 @@ function idct32_1d_4x32_pass1_neon .endr .endm -store_rev 31, 27, 23, 19 -store_rev 30, 26, 22, 18 -store_rev 29, 25, 21, 17 -store_rev 28, 24, 20, 16 +store_rev 31, 27, 23, 19 +store_rev 30, 26, 22, 18 +store_rev 29, 25, 21, 17 +store_rev 28, 24, 20, 16 .purgem store_rev bx lr endfunc -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 13/13] aarch64: vp9mc: Fix a comment to refer to a register with the right name
This is cherrypicked from libav commit 85ad5ea72ce3983947a3b07e4b35c66cb16dfaba. --- libavcodec/aarch64/vp9mc_neon.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S index 69dad6d..80d1d23 100644 --- a/libavcodec/aarch64/vp9mc_neon.S +++ b/libavcodec/aarch64/vp9mc_neon.S @@ -250,7 +250,7 @@ function \type\()_8tap_\size\()h_\idx1\idx2 .if \size >= 16 sub x1, x1, x5 .endif -// size >= 16 loads two qwords and increments r2, +// size >= 16 loads two qwords and increments x2, // for size 4/8 it's enough with one qword and no // postincrement .if \size >= 16 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 03/13] arm: vp9itxfm: Simplify the stack alignment code
From: Janne Grunau

This is one instruction less for thumb, and leaves only one or two arm/thumb-specific instructions. This is cherrypicked from libav commit e5b0fc170f85b00f7dd0ac514918fb5c95253d39. --- libavcodec/arm/vp9itxfm_neon.S | 28 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 06470a3..d7a2654 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -791,15 +791,13 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .ifnc \txfm1\()_\txfm2,idct_idct vpush {q4-q7} .endif -mov r7, sp @ Align the stack, allocate a temp buffer -T mov r12, sp -T bic r12, r12, #15 -T sub r12, r12, #512 -T mov sp, r12 -A bic sp, sp, #15 -A sub sp, sp, #512 +T mov r7, sp +T and r7, r7, #15 +A and r7, sp, #15 +add r7, r7, #512 +sub sp, sp, r7 mov r4, r0 mov r5, r1 @@ -828,7 +826,7 @@ A sub sp, sp, #512 bl \txfm2\()16_1d_4x16_pass2_neon .endr -mov sp, r7 +add sp, sp, r7 .ifnc \txfm1\()_\txfm2,idct_idct vpop{q4-q7} .endif @@ -1117,15 +1115,13 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 beq idct32x32_dc_add_neon push{r4-r7,lr} vpush {q4-q7} -mov r7, sp @ Align the stack, allocate a temp buffer -T mov r12, sp -T bic r12, r12, #15 -T sub r12, r12, #2048 -T mov sp, r12 -A bic sp, sp, #15 -A sub sp, sp, #2048 +T mov r7, sp +T and r7, r7, #15 +A and r7, sp, #15 +add r7, r7, #2048 +sub sp, sp, r7 mov r4, r0 mov r5, r1 @@ -1143,7 +1139,7 @@ A sub sp, sp, #2048 bl idct32_1d_4x32_pass2_neon .endr -mov sp, r7 +add sp, sp, r7 vpop{q4-q7} pop {r4-r7,pc} endfunc -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
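The two alignment schemes in this patch compute the same stack pointer; the new one just remembers the adjustment instead of the old sp. A Python sketch of the arithmetic (function names invented here; assumes the buffer size is a multiple of 16, as 512 and 2048 are):

```python
BUF = 512  # temp buffer size, a multiple of 16

def align_old(sp):
    # old scheme: save sp in r7, align down to 16 (bic #15), subtract buffer
    saved = sp
    new_sp = (sp & ~15) - BUF
    return new_sp, saved            # restore with: sp = saved

def align_new(sp):
    # new scheme: r7 = (sp & 15) + BUF; sp -= r7
    r7 = (sp & 15) + BUF
    return sp - r7, r7              # restore with: sp = sp + r7

for sp in range(0x10000, 0x10040):
    old_sp, saved = align_old(sp)
    new_sp, r7 = align_new(sp)
    assert old_sp == new_sp            # identical resulting stack pointer
    assert new_sp % 16 == 0            # still 16-byte aligned
    assert new_sp + r7 == saved == sp  # both schemes restore sp exactly
```

The identity is `(sp & ~15) - BUF == sp - ((sp & 15) + BUF)`, since `sp - (sp & 15) == sp & ~15`; that is why `add sp, sp, r7` can replace `mov sp, r7` on the restore path.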
[FFmpeg-devel] [PATCH 12/13] aarch64: vp9dsp: Fix vertical alignment in the init file
This is cherrypicked from libav commit 65074791e8f8397600aacc9801efdd1eb6e3. --- libavcodec/aarch64/vp9dsp_init_aarch64.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c index 7e34375..0bc200e 100644 --- a/libavcodec/aarch64/vp9dsp_init_aarch64.c +++ b/libavcodec/aarch64/vp9dsp_init_aarch64.c @@ -103,7 +103,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp) if (bpp != 8) return; -#define init_fpel(idx1, idx2, sz, type, suffix) \ +#define init_fpel(idx1, idx2, sz, type, suffix) \ dsp->mc[idx1][FILTER_8TAP_SMOOTH ][idx2][0][0] = \ dsp->mc[idx1][FILTER_8TAP_REGULAR][idx2][0][0] = \ dsp->mc[idx1][FILTER_8TAP_SHARP ][idx2][0][0] = \ @@ -128,7 +128,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp) #define init_mc_func(idx1, idx2, op, filter, fname, dir, mx, my, sz, pfx) \ dsp->mc[idx1][filter][idx2][mx][my] = pfx##op##_##fname##sz##_##dir##_neon -#define init_mc_funcs(idx, dir, mx, my, sz, pfx) \ +#define init_mc_funcs(idx, dir, mx, my, sz, pfx) \ init_mc_func(idx, 0, put, FILTER_8TAP_REGULAR, regular, dir, mx, my, sz, pfx); \ init_mc_func(idx, 0, put, FILTER_8TAP_SHARP, sharp, dir, mx, my, sz, pfx); \ init_mc_func(idx, 0, put, FILTER_8TAP_SMOOTH, smooth, dir, mx, my, sz, pfx); \ @@ -136,7 +136,7 @@ static av_cold void vp9dsp_mc_init_aarch64(VP9DSPContext *dsp, int bpp) init_mc_func(idx, 1, avg, FILTER_8TAP_SHARP, sharp, dir, mx, my, sz, pfx); \ init_mc_func(idx, 1, avg, FILTER_8TAP_SMOOTH, smooth, dir, mx, my, sz, pfx) -#define init_mc_funcs_dirs(idx, sz) \ +#define init_mc_funcs_dirs(idx, sz)\ init_mc_funcs(idx, h, 1, 0, sz, ff_vp9_); \ init_mc_funcs(idx, v, 0, 1, sz, ff_vp9_); \ init_mc_funcs(idx, hv, 1, 1, sz,) -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 08/13] arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination
This avoids reloading them if they haven't been clobbered, if the first pass also was idct. This is similar to what was done in the aarch64 version. This is cherrypicked from libav commit 3c87039a404c5659ae9bf7454a04e186532eb40b. --- libavcodec/arm/vp9itxfm_neon.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 0097f5f..d5b8495 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -814,7 +814,7 @@ A and r7, sp, #15 mov r3, #\i bl \txfm1\()16_1d_4x16_pass1_neon .endr -.ifc \txfm2,idct +.ifc \txfm1\()_\txfm2,iadst_idct movrel r12, idct_coeffs vld1.16 {q0-q1}, [r12,:128] .endif -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 07/13] aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it
This is cherrypicked from libav commit 2f99117f6ff24ce5be2abb9e014cb8b86c2aa0e0. --- libavcodec/aarch64/vp9itxfm_neon.S | 26 +++--- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index d5165bf..e5fc612 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -599,9 +599,9 @@ endfunc // x1 = unused // x2 = src // x3 = slice offset +// x9 = input stride .macro itxfm16_1d_funcs txfm function \txfm\()16_1d_8x16_pass1_neon -mov x9, #32 moviv2.8h, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 load_clear \i, x2, x9 @@ -649,8 +649,8 @@ endfunc // x1 = dst stride // x2 = src (temp buffer) // x3 = slice offset +// x9 = temp buffer stride function \txfm\()16_1d_8x16_pass2_neon -mov x9, #32 .irp i, 16, 17, 18, 19, 20, 21, 22, 23 load\i, x2, x9 .endr @@ -747,6 +747,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .ifc \txfm1,idct ld1 {v0.8h,v1.8h}, [x10] .endif +mov x9, #32 .irp i, 0, 8 add x0, sp, #(\i*32) @@ -882,13 +883,12 @@ endfunc // x0 = dst (temp buffer) // x1 = unused // x2 = src +// x9 = double input stride // x10 = idct_coeffs // x11 = idct_coeffs + 32 function idct32_1d_8x32_pass1_neon ld1 {v0.8h,v1.8h}, [x10] -// Double stride of the input, since we only read every other line -mov x9, #128 moviv4.8h, #0 // v16 = IN(0), v17 = IN(2) ... v31 = IN(30) @@ -987,12 +987,13 @@ endfunc // x0 = dst // x1 = dst stride // x2 = src (temp buffer) +// x7 = negative double temp buffer stride +// x9 = double temp buffer stride // x10 = idct_coeffs // x11 = idct_coeffs + 32 function idct32_1d_8x32_pass2_neon ld1 {v0.8h,v1.8h}, [x10] -mov x9, #128 // v16 = IN(0), v17 = IN(2) ... 
v31 = IN(30) .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 ld1 {v\i\().8h}, [x2], x9 @@ -1001,7 +1002,6 @@ function idct32_1d_8x32_pass2_neon idct16 -mov x9, #128 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 st1 {v\i\().8h}, [x2], x9 .endr @@ -1018,11 +1018,10 @@ function idct32_1d_8x32_pass2_neon idct32_odd -mov x9, #128 .macro load_acc_store a, b, c, d, neg=0 +.if \neg == 0 ld1 {v4.8h}, [x2], x9 ld1 {v5.8h}, [x2], x9 -.if \neg == 0 add v4.8h, v4.8h, v\a\().8h ld1 {v6.8h}, [x2], x9 add v5.8h, v5.8h, v\b\().8h @@ -1030,10 +1029,12 @@ function idct32_1d_8x32_pass2_neon add v6.8h, v6.8h, v\c\().8h add v7.8h, v7.8h, v\d\().8h .else +ld1 {v4.8h}, [x2], x7 +ld1 {v5.8h}, [x2], x7 sub v4.8h, v4.8h, v\a\().8h -ld1 {v6.8h}, [x2], x9 +ld1 {v6.8h}, [x2], x7 sub v5.8h, v5.8h, v\b\().8h -ld1 {v7.8h}, [x2], x9 +ld1 {v7.8h}, [x2], x7 sub v6.8h, v6.8h, v\c\().8h sub v7.8h, v7.8h, v\d\().8h .endif @@ -1064,7 +1065,6 @@ function idct32_1d_8x32_pass2_neon load_acc_store 23, 22, 21, 20 load_acc_store 19, 18, 17, 16 sub x2, x2, x9 -neg x9, x9 load_acc_store 16, 17, 18, 19, 1 load_acc_store 20, 21, 22, 23, 1 load_acc_store 24, 25, 26, 27, 1 @@ -1093,6 +1093,10 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 mov x5, x1 mov x6, x2 +// Double stride of the input, since we only read every other line +mov x9, #128 +neg x7, x9 + .irp i, 0, 8, 16, 24 add x0, sp, #(\i*64) add x2, x6, #(\i*2) -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 04/13] aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter
The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. This is cherrypicked from libav commit 4d960a11855f4212eb3a4e470ce890db7f01df29. --- libavcodec/aarch64/vp9itxfm_neon.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 7ce3116..3535c7b 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -204,7 +204,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 moviv31.8h, #0 .ifc \txfm1\()_\txfm2,idct_idct -cmp x3, #1 +cmp w3, #1 b.ne1f // DC-only for idct/idct ld1r{v2.4h}, [x2] @@ -344,7 +344,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 moviv5.16b, #0 .ifc \txfm1\()_\txfm2,idct_idct -cmp x3, #1 +cmp w3, #1 b.ne1f // DC-only for idct/idct ld1r{v2.4h}, [x2] @@ -722,7 +722,7 @@ itxfm16_1d_funcs iadst .macro itxfm_func16x16 txfm1, txfm2 function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .ifc \txfm1\()_\txfm2,idct_idct -cmp x3, #1 +cmp w3, #1 b.eqidct16x16_dc_add_neon .endif mov x15, x30 @@ -1074,7 +1074,7 @@ function idct32_1d_8x32_pass2_neon endfunc function ff_vp9_idct_idct_32x32_add_neon, export=1 -cmp x3, #1 +cmp w3, #1 b.eqidct32x32_dc_add_neon movrel x10, idct_coeffs -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 02/13] aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne};
From: Janne Grunau

The latter is 1 cycle faster on a Cortex-A53, and since the operands are bytewise (or larger) bitmasks (impossible to overflow to zero), both are equivalent. This is cherrypicked from libav commit e7ae8f7a715843a5089d18e033afb3ee19ab3057. --- libavcodec/aarch64/vp9lpf_neon.S | 31 --- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index 78aae61..55e1964 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -218,13 +218,15 @@ xtn_sz v5, v6.8h, v7.8h, \sz and v4\sz, v4\sz, v5\sz // fm +// If no pixels need filtering, just exit as soon as possible mov x5, v4.d[0] .ifc \sz, .16b mov x6, v4.d[1] -orr x5, x5, x6 -.endif -// If no pixels need filtering, just exit as soon as possible +addsx5, x5, x6 +b.eq9f +.else cbz x5, 9f +.endif .if \wd >= 8 moviv0\sz, #1 @@ -344,15 +346,17 @@ bit v22\sz, v0\sz, v5\sz // if (!hev && fm && !flat8in) bit v25\sz, v2\sz, v5\sz +// If no pixels need flat8in, jump to flat8out +// (or to a writeout of the inner 4 pixels, for wd=8) .if \wd >= 8 mov x5, v6.d[0] .ifc \sz, .16b mov x6, v6.d[1] -orr x5, x5, x6 -.endif -// If no pixels need flat8in, jump to flat8out -// (or to a writeout of the inner 4 pixels, for wd=8) +addsx5, x5, x6 +b.eq6f +.else cbz x5, 6f +.endif // flat8in uaddl_sz\tmp1\().8h, \tmp2\().8h, v20, v21, \sz @@ -406,20 +410,25 @@ mov x5, v2.d[0] .ifc \sz, .16b mov x6, v2.d[1] -orr x5, x5, x6 +adds x5, x5, x6 +b.ne1f +.else +cbnzx5, 1f .endif // If no pixels needed flat8in nor flat8out, jump to a // writeout of the inner 4 pixels -cbnzx5, 1f br x14 1: + mov x5, v7.d[0] .ifc \sz, .16b mov x6, v7.d[1] -orr x5, x5, x6 +adds x5, x5, x6 +b.ne1f +.else +cbnzx5, 1f .endif // If no pixels need flat8out, jump to a writeout of the inner 6 pixels -cbnzx5, 1f br x15 1: -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
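Why the addition can stand in for the OR here: each 64-bit half of the NEON comparison result is a bytewise mask (every byte 0x00 or 0xFF), and the lowest nonzero byte of the sum of two such masks is always 0xFF or 0xFE, so the truncated 64-bit sum is zero only when both halves are zero. A Python model (function names invented here):

```python
import random

MASK64 = (1 << 64) - 1

def any_set_orr(lo, hi):
    # models: orr x5, x5, x6; cbnz x5, ...
    return (lo | hi) != 0

def any_set_adds(lo, hi):
    # models: adds x5, x5, x6; b.ne ...  (flags set from the 64-bit sum)
    return ((lo + hi) & MASK64) != 0

def random_bytewise_mask():
    # each byte independently 0x00 or 0xFF, like a NEON comparison result
    return sum(random.choice((0x00, 0xFF)) << (8 * i) for i in range(8))

random.seed(0)
for _ in range(10000):
    lo, hi = random_bytewise_mask(), random_bytewise_mask()
    assert any_set_orr(lo, hi) == any_set_adds(lo, hi)
```

Note the equivalence relies on the operands being bytewise masks; for arbitrary 64-bit values, two nonzero operands could wrap to a zero sum.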
[FFmpeg-devel] [PATCH 11/13] arm: vp9mc: Fix vertical alignment of operands
This is cherrypicked from libav commit c536e5e8698110c139b1c17938998a5547550aa3. --- libavcodec/arm/vp9mc_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/arm/vp9mc_neon.S b/libavcodec/arm/vp9mc_neon.S index 5fe3024..83235ff 100644 --- a/libavcodec/arm/vp9mc_neon.S +++ b/libavcodec/arm/vp9mc_neon.S @@ -79,7 +79,7 @@ function ff_vp9_avg32_neon, export=1 vrhadd.u8 q0, q0, q2 vrhadd.u8 q1, q1, q3 subsr12, r12, #1 -vst1.8 {q0, q1}, [r0, :128], r1 +vst1.8 {q0, q1}, [r0, :128], r1 bne 1b bx lr endfunc @@ -407,7 +407,7 @@ function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1 add r12, r12, 256*\offset cmp r5, #8 add r12, r12, r5, lsl #4 -mov r5, #\size +mov r5, #\size .if \size >= 16 bge \type\()_8tap_16h_34 b \type\()_8tap_16h_43 @@ -541,7 +541,7 @@ function \type\()_8tap_8v_\idx1\idx2 sub r2, r2, r3 vld1.16 {q0}, [r12, :128] 1: -mov r12, r4 +mov r12, r4 loadl q5, q6, q7 loadl q8, q9, q10, q11 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 01/14] arm: vp9itxfm: Template the quarter/half idct32 function
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. This is cherrypicked from libav commit 98ee855ae0cc118bd1d20921d6bdb14731832462. --- libavcodec/arm/vp9itxfm_neon.S | 57 +++--- 1 file changed, 20 insertions(+), 37 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index ebbbda9..adc9896 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -1575,7 +1575,6 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 beq idct32x32_dc_add_neon push{r4-r8,lr} vpush {q4-q6} -movrel r8, min_eob_idct_idct_32 + 2 @ Align the stack, allocate a temp buffer T mov r7, sp @@ -1597,6 +1596,8 @@ A and r7, sp, #15 cmp r3, #135 ble idct32x32_half_add_neon +movrel r8, min_eob_idct_idct_32 + 2 + .irp i, 0, 4, 8, 12, 16, 20, 24, 28 add r0, sp, #(\i*64) .if \i > 0 @@ -1634,72 +1635,54 @@ A and r7, sp, #15 pop {r4-r8,pc} endfunc -function idct32x32_quarter_add_neon +.macro idct32_partial size +function idct32x32_\size\()_add_neon .irp i, 0, 4 add r0, sp, #(\i*64) +.ifc \size,quarter .if \i == 4 cmp r3, #9 ble 1f .endif +.endif add r2, r6, #(\i*2) -bl idct32_1d_4x32_pass1_quarter_neon -.endr -b 3f - -1: -@ Write zeros to the temp buffer for pass 2 -vmov.i16q14, #0 -vmov.i16q15, #0 -.rept 8 -vst1.16 {q14-q15}, [r0,:128]! 
-.endr -3: -.irp i, 0, 4, 8, 12, 16, 20, 24, 28 -add r0, r4, #(\i) -mov r1, r5 -add r2, sp, #(\i*2) -bl idct32_1d_4x32_pass2_quarter_neon +bl idct32_1d_4x32_pass1_\size\()_neon .endr -add sp, sp, r7 -vpop{q4-q6} -pop {r4-r8,pc} -endfunc - -function idct32x32_half_add_neon -.irp i, 0, 4, 8, 12 +.ifc \size,half +.irp i, 8, 12 add r0, sp, #(\i*64) -.if \i > 0 -ldrh_post r1, r8, #2 -cmp r3, r1 -it le -movle r1, #(16 - \i)/2 +.if \i == 12 +cmp r3, #70 ble 1f .endif add r2, r6, #(\i*2) -bl idct32_1d_4x32_pass1_half_neon +bl idct32_1d_4x32_pass1_\size\()_neon .endr +.endif b 3f 1: @ Write zeros to the temp buffer for pass 2 vmov.i16q14, #0 vmov.i16q15, #0 -2: -subsr1, r1, #1 -.rept 4 +.rept 8 vst1.16 {q14-q15}, [r0,:128]! .endr -bne 2b + 3: .irp i, 0, 4, 8, 12, 16, 20, 24, 28 add r0, r4, #(\i) mov r1, r5 add r2, sp, #(\i*2) -bl idct32_1d_4x32_pass2_half_neon +bl idct32_1d_4x32_pass2_\size\()_neon .endr add sp, sp, r7 vpop{q4-q6} pop {r4-r8,pc} endfunc +.endm + +idct32_partial quarter +idct32_partial half -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
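The hardcoded eob checks that this template introduces (visible as `cmp r3, #9` in the quarter case and `cmp r3, #70` in the half case) can be sketched as follows. This is an illustrative model, not FFmpeg code; it only captures how many first-pass slices get computed versus zero-filled:

```python
def quarter_pass1_slices(eob):
    """Quarter case: slice 0 is always computed; slice 1 (columns 4..7)
    is skipped and zero-filled when eob <= 9 (models: cmp r3, #9; ble 1f)."""
    return 1 if eob <= 9 else 2

def half_pass1_slices(eob):
    """Half case: the first three slices are always needed, so only the
    fourth (columns 12..15) is conditional, against the hardcoded 70
    instead of a value loaded from the min_eob array."""
    return 3 if eob <= 70 else 4

assert quarter_pass1_slices(9) == 1
assert quarter_pass1_slices(10) == 2
assert half_pass1_slices(70) == 3
assert half_pass1_slices(71) == 4
```

Hardcoding the one remaining threshold is what lets the template drop the `min_eob_idct_idct_32` pointer load entirely for these paths.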
[FFmpeg-devel] [PATCH 04/14] arm: vp9itxfm16: Use the right lane size
This makes the code slightly clearer, but doesn't make any functional difference. --- libavcodec/arm/vp9itxfm_16bpp_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S index e6e9440..a92f323 100644 --- a/libavcodec/arm/vp9itxfm_16bpp_neon.S +++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S @@ -1082,8 +1082,8 @@ A and r7, sp, #15 .ifc \txfm1\()_\txfm2,idct_idct b 3f 1: -vmov.i16q14, #0 -vmov.i16q15, #0 +vmov.i32q14, #0 +vmov.i32q15, #0 2: subsr1, r1, #1 @ Unroll for 2 lines -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 06/14] arm: vp9itxfm16: Avoid reloading the idct32 coefficients
Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7. Before:Cortex A7A8A9 A53 vp9_inv_dct_dct_32x32_sub4_add_10_neon:22653.8 18268.4 19598.0 14079.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37699.0 38665.2 32542.3 24472.2 After: vp9_inv_dct_dct_32x32_sub4_add_10_neon:22270.8 18159.3 19531.0 13865.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37523.3 37731.6 32181.7 24071.2 --- libavcodec/arm/vp9itxfm_16bpp_neon.S | 128 +++ 1 file changed, 69 insertions(+), 59 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S index 9c02ed9..29d95ca 100644 --- a/libavcodec/arm/vp9itxfm_16bpp_neon.S +++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S @@ -1195,12 +1195,12 @@ endfunc .macro idct32_odd movrel r12, idct_coeffs -add r12, r12, #32 -vld1.16 {q0-q1}, [r12,:128] -vmovl.s16 q2, d2 -vmovl.s16 q3, d3 -vmovl.s16 q1, d1 -vmovl.s16 q0, d0 + +@ Overwrite the idct16 coeffs with the stored ones for idct32 +vmovl.s16 q0, d12 +vmovl.s16 q1, d13 +vmovl.s16 q2, d14 +vmovl.s16 q3, d15 mbutterfly d16, d31, d0[0], d0[1], q4, q5 @ d16 = t16a, d31 = t31a mbutterfly d24, d23, d1[0], d1[1], q4, q5 @ d24 = t17a, d23 = t30a @@ -1211,15 +1211,19 @@ endfunc mbutterfly d22, d25, d6[0], d6[1], q4, q5 @ d22 = t22a, d25 = t25a mbutterfly d30, d17, d7[0], d7[1], q4, q5 @ d30 = t23a, d17 = t24a -sub r12, r12, #32 -vld1.16 {q0}, [r12,:128] +@ Reload the idct16 coefficients. We could swap the coefficients between +@ q0-q3 and q6-q7 by narrowing/lengthening, but that's slower than just +@ loading and lengthening. 
+vld1.16 {q0-q1}, [r12,:128] + +butterfly d8, d24, d16, d24 @ d8 = t16, d24 = t17 +butterfly d9, d20, d28, d20 @ d9 = t19, d20 = t18 +butterfly d10, d26, d18, d26 @ d10 = t20, d26 = t21 +butterfly d11, d22, d30, d22 @ d11 = t23, d22 = t22 +vmovl.s16 q2, d2 +vmovl.s16 q3, d3 vmovl.s16 q1, d1 vmovl.s16 q0, d0 - -butterfly d4, d24, d16, d24 @ d4 = t16, d24 = t17 -butterfly d5, d20, d28, d20 @ d5 = t19, d20 = t18 -butterfly d6, d26, d18, d26 @ d6 = t20, d26 = t21 -butterfly d7, d22, d30, d22 @ d7 = t23, d22 = t22 butterfly d28, d25, d17, d25 @ d28 = t24, d25 = t25 butterfly d30, d21, d29, d21 @ d30 = t27, d21 = t26 butterfly d29, d23, d31, d23 @ d29 = t31, d23 = t30 @@ -1230,34 +1234,34 @@ endfunc mbutterfly d21, d26, d3[0], d3[1], q8, q9@ d21 = t21a, d26 = t26a mbutterfly d25, d22, d3[0], d3[1], q8, q9, neg=1 @ d25 = t25a, d22 = t22a -butterfly d16, d5, d4, d5 @ d16 = t16a, d5 = t19a +butterfly d16, d9, d8, d9 @ d16 = t16a, d9 = t19a butterfly d17, d20, d23, d20 @ d17 = t17, d20 = t18 -butterfly d18, d6, d7, d6 @ d18 = t23a, d6 = t20a +butterfly d18, d10, d11, d10 @ d18 = t23a, d10 = t20a butterfly d19, d21, d22, d21 @ d19 = t22, d21 = t21 -butterfly d4, d28, d28, d30 @ d4 = t24a, d28 = t27a +butterfly d8, d28, d28, d30 @ d8 = t24a, d28 = t27a butterfly d23, d26, d25, d26 @ d23 = t25, d26 = t26 -butterfly d7, d29, d29, d31 @ d7 = t31a, d29 = t28a +butterfly d11, d29, d29, d31 @ d11 = t31a, d29 = t28a butterfly d22, d27, d24, d27 @ d22 = t30, d27 = t29 mbutterfly d27, d20, d1[0], d1[1], q12, q15@ d27 = t18a, d20 = t29a -mbutterfly d29, d5, d1[0], d1[1], q12, q15@ d29 = t19, d5 = t28 -mbutterfly d28, d6, d1[0], d1[1], q12, q15, neg=1 @ d28 = t27, d6 = t20 +mbutterfly d29, d9, d1[0], d1[1], q12, q15@ d29 = t19, d9 = t28 +mbutterfly d28, d10, d1[0], d1[1], q12, q15, neg=1 @ d28 = t27, d10 = t20 mbutterfly d26, d21, d1[0], d1[1], q12, q15, neg=1 @ d26 = t26a, d21 = t21a -butterfly d31, d24, d7, d4 @ d31 = t31, d24 = t24 +butterfly d31, d24, d11, d8 @ d31 = t31, d24 = t24 
butterfly d30, d25,
[FFmpeg-devel] [PATCH 02/14] arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c. --- libavcodec/aarch64/vp9itxfm_neon.S | 3 ++- libavcodec/arm/vp9itxfm_neon.S | 4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 2c3c002..3e5da08 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -1483,7 +1483,6 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 b.eqidct32x32_dc_add_neon movrel x10, idct_coeffs -movrel x12, min_eob_idct_idct_32, 2 mov x15, x30 @@ -1508,6 +1507,8 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 cmp w3, #135 b.leidct32x32_half_add_neon +movrel x12, min_eob_idct_idct_32, 2 + .irp i, 0, 8, 16, 24 add x0, sp, #(\i*64) .if \i > 0 diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index adc9896..6d4d765 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -889,8 +889,6 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 push{r4-r8,lr} .ifnc \txfm1\()_\txfm2,idct_idct vpush {q4-q7} -.else -movrel r8, min_eob_idct_idct_16 + 2 .endif @ Align the stack, allocate a temp buffer @@ -914,6 +912,8 @@ A and r7, sp, #15 ble idct16x16_quarter_add_neon cmp r3, #38 ble idct16x16_half_add_neon + +movrel r8, min_eob_idct_idct_16 + 2 .endif .irp i, 0, 4, 8, 12 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 12/14] aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the pass2 function. --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 98 1 file changed, 49 insertions(+), 49 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index de1da55..f30fdd8 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S @@ -851,6 +851,55 @@ endfunc st1 {v4.4s}, [\src], \inc .endm +.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7 +srshr \coef0, \coef0, #6 +ld1 {v4.4h}, [x0], x1 +srshr \coef1, \coef1, #6 +ld1 {v4.d}[1], [x3], x1 +srshr \coef2, \coef2, #6 +ld1 {v5.4h}, [x0], x1 +srshr \coef3, \coef3, #6 +uaddw \coef0, \coef0, v4.4h +ld1 {v5.d}[1], [x3], x1 +srshr \coef4, \coef4, #6 +uaddw2 \coef1, \coef1, v4.8h +ld1 {v6.4h}, [x0], x1 +srshr \coef5, \coef5, #6 +uaddw \coef2, \coef2, v5.4h +ld1 {v6.d}[1], [x3], x1 +sqxtun v4.4h, \coef0 +srshr \coef6, \coef6, #6 +uaddw2 \coef3, \coef3, v5.8h +ld1 {v7.4h}, [x0], x1 +sqxtun2 v4.8h, \coef1 +srshr \coef7, \coef7, #6 +uaddw \coef4, \coef4, v6.4h +ld1 {v7.d}[1], [x3], x1 +uminv4.8h, v4.8h, v8.8h +sub x0, x0, x1, lsl #2 +sub x3, x3, x1, lsl #2 +sqxtun v5.4h, \coef2 +uaddw2 \coef5, \coef5, v6.8h +st1 {v4.4h}, [x0], x1 +sqxtun2 v5.8h, \coef3 +uaddw \coef6, \coef6, v7.4h +st1 {v4.d}[1], [x3], x1 +uminv5.8h, v5.8h, v8.8h +sqxtun v6.4h, \coef4 +uaddw2 \coef7, \coef7, v7.8h +st1 {v5.4h}, [x0], x1 +sqxtun2 v6.8h, \coef5 +st1 {v5.d}[1], [x3], x1 +uminv6.8h, v6.8h, v8.8h +sqxtun v7.4h, \coef6 +st1 {v6.4h}, [x0], x1 +sqxtun2 v7.8h, \coef7 +st1 {v6.d}[1], [x3], x1 +uminv7.8h, v7.8h, v8.8h +st1 {v7.4h}, [x0], x1 +st1 {v7.d}[1], [x3], x1 +.endm + // Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it, // transpose into a horizontal 16x4 slice and store. 
// x0 = dst (temp buffer) @@ -937,57 +986,8 @@ function \txfm\()16_1d_4x16_pass2_neon bl \txfm\()16 dup v8.8h, w13 -.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7 -srshr \coef0, \coef0, #6 -ld1 {v4.4h}, [x0], x1 -srshr \coef1, \coef1, #6 -ld1 {v4.d}[1], [x3], x1 -srshr \coef2, \coef2, #6 -ld1 {v5.4h}, [x0], x1 -srshr \coef3, \coef3, #6 -uaddw \coef0, \coef0, v4.4h -ld1 {v5.d}[1], [x3], x1 -srshr \coef4, \coef4, #6 -uaddw2 \coef1, \coef1, v4.8h -ld1 {v6.4h}, [x0], x1 -srshr \coef5, \coef5, #6 -uaddw \coef2, \coef2, v5.4h -ld1 {v6.d}[1], [x3], x1 -sqxtun v4.4h, \coef0 -srshr \coef6, \coef6, #6 -uaddw2 \coef3, \coef3, v5.8h -ld1 {v7.4h}, [x0], x1 -sqxtun2 v4.8h, \coef1 -srshr \coef7, \coef7, #6 -uaddw \coef4, \coef4, v6.4h -ld1 {v7.d}[1], [x3], x1 -uminv4.8h, v4.8h, v8.8h -sub x0, x0, x1, lsl #2 -sub x3, x3, x1, lsl #2 -sqxtun v5.4h, \coef2 -uaddw2 \coef5, \coef5, v6.8h -st1 {v4.4h}, [x0], x1 -sqxtun2 v5.8h, \coef3 -uaddw \coef6, \coef6, v7.4h -st1 {v4.d}[1], [x3], x1 -uminv5.8h, v5.8h, v8.8h -sqxtun v6.4h, \coef4 -uaddw2 \coef7, \coef7, v7.8h -st1 {v5.4h}, [x0], x1 -sqxtun2 v6.8h, \coef5 -st1 {v5.d}[1], [x3], x1 -uminv6.8h, v6.8h, v8.8h -sqxtun v7.4h, \coef6 -st1 {v6.4h}, [x0], x1 -sqxtun2 v7.8h, \coef7 -st1 {v6.d}[1], [x3], x1 -uminv7.8h, v7.8h, v8.8h -
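Per output pixel, the `load_add_store` macro being moved here performs a rounding shift, adds the existing destination pixel, and clamps to the valid pixel range (`sqxtun` saturates negatives to 0, `umin` against v8 caps at the bit-depth maximum loaded via `dup v8.8h, w13`). A rough scalar Python model of one lane (function name invented here; assumes 10-bit content):

```python
def add_residual(coef, dst_pixel, bitdepth=10):
    pixel_max = (1 << bitdepth) - 1   # held in v8 in the asm
    v = (coef + 32) >> 6              # srshr #6: rounding shift right by 6
    v = v + dst_pixel                 # uaddw: widen dst pixel and add
    v = max(v, 0)                     # sqxtun saturates negatives to 0
    return min(v, pixel_max)          # umin clamps to the pixel maximum

assert add_residual(0, 500) == 500          # zero residual leaves dst alone
assert add_residual(-100000, 500) == 0      # large negative residual clamps low
assert add_residual(1 << 16, 500) == 1023   # overflow clamps to the 10-bit max
```

Hoisting this macro out of the pass2 function changes nothing in the emitted instructions; it only makes the sequence reusable by the alternative pass2 implementations added later in the series.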
[FFmpeg-devel] [PATCH 05/14] arm: vp9itxfm16: Fix vertical alignment
--- libavcodec/arm/vp9itxfm_16bpp_neon.S | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S index a92f323..9c02ed9 100644 --- a/libavcodec/arm/vp9itxfm_16bpp_neon.S +++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S @@ -1395,25 +1395,25 @@ function idct32_1d_2x32_pass2_neon vld1.32 {d4}, [r2,:64], r12 vld1.32 {d5}, [r2,:64], r12 .if \neg == 0 -vadd.s32d4, d4, d\a +vadd.s32d4, d4, d\a vld1.32 {d6}, [r2,:64], r12 -vadd.s32d5, d5, d\b +vadd.s32d5, d5, d\b vld1.32 {d7}, [r2,:64], r12 -vadd.s32d6, d6, d\c -vadd.s32d7, d7, d\d +vadd.s32d6, d6, d\c +vadd.s32d7, d7, d\d .else -vsub.s32d4, d4, d\a +vsub.s32d4, d4, d\a vld1.32 {d6}, [r2,:64], r12 -vsub.s32d5, d5, d\b +vsub.s32d5, d5, d\b vld1.32 {d7}, [r2,:64], r12 -vsub.s32d6, d6, d\c -vsub.s32d7, d7, d\d +vsub.s32d6, d6, d\c +vsub.s32d7, d7, d\d .endif vld1.32 {d2[]}, [r0,:32], r1 vld1.32 {d2[1]}, [r0,:32], r1 -vrshr.s32 q2, q2, #6 +vrshr.s32 q2, q2, #6 vld1.32 {d3[]}, [r0,:32], r1 -vrshr.s32 q3, q3, #6 +vrshr.s32 q3, q3, #6 vld1.32 {d3[1]}, [r0,:32], r1 sub r0, r0, r1, lsl #2 vaddw.u16 q2, q2, d2 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 07/14] aarch64: vp9itxfm16: Fix a typo in a comment
--- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index f53e94a..f80604f 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S @@ -872,7 +872,7 @@ function \txfm\()16_1d_4x16_pass1_neon transpose_4x4s v24, v25, v26, v27, v4, v5, v6, v7 transpose_4x4s v28, v29, v30, v31, v4, v5, v6, v7 -// Store the transposed 8x8 blocks horizontally. +// Store the transposed 4x4 blocks horizontally. cmp x1, #12 b.eq1f .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 14/14] aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 21512 bytes to 31400 bytes.

The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:    284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:   1902.7
vp9_inv_dct_dct_16x16_sub4_add_10_neon:   1903.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:   2201.1
vp9_inv_dct_dct_16x16_sub12_add_10_neon:  2510.0
vp9_inv_dct_dct_16x16_sub16_add_10_neon:  2821.3
vp9_inv_dct_dct_32x32_sub1_add_10_neon:   1011.6
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   9716.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   9704.9
vp9_inv_dct_dct_32x32_sub8_add_10_neon:  10641.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:    282.8
vp9_inv_dct_dct_16x16_sub2_add_10_neon:   1142.4
vp9_inv_dct_dct_16x16_sub4_add_10_neon:   1139.0
vp9_inv_dct_dct_16x16_sub8_add_10_neon:   1772.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:  2515.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:  2823.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:   1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   6944.4
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   6944.2
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   7609.8
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  9953.4
vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1
vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8
vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7
vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 605 ---
 1 file changed, 547 insertions(+), 58 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index f30fdd8..0befe38 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -124,6 +124,17 @@ endconst
 .endif
 .endm
+// Same as dmbutterfly0 above, but treating the input in in2 as zero,
+// writing the same output into both out1 and out2.
+.macro dmbutterfly0_h out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6
+smull \tmp1\().2d, \in1\().2s, v0.s[0]
+smull2 \tmp2\().2d, \in1\().4s, v0.s[0]
+rshrn \out1\().2s, \tmp1\().2d, #14
+rshrn2 \out1\().4s, \tmp2\().2d, #14
+rshrn \out2\().2s, \tmp1\().2d, #14
+rshrn2 \out2\().4s, \tmp2\().2d, #14
+.endm
+
 // out1,out2 = in1 * coef1 - in2 * coef2
 // out3,out4 = in1 * coef2 + in2 * coef1
 // out are 4 x .2d registers, in are 2 x .4s registers
@@ -153,6 +164,43 @@ endconst
 rshrn2 \inout2\().4s, \tmp4\().2d, #14
 .endm
+
+// Same as dmbutterfly above, but treating the input in inout2 as zero
+.macro dmbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull \tmp1\().2d, \inout1\().2s, \coef1
+smull2 \tmp2\().2d, \inout1\().4s, \coef1
+smull \tmp3\().2d, \inout1\().2s, \coef2
+smull2 \tmp4\().2d, \inout1\().4s, \coef2
+rshrn \inout1\().2s, \tmp1\().2d, #14
+rshrn2 \inout1\().4s, \tmp2\().2d, #14
+rshrn \inout2\().2s, \tmp3\().2d, #14
+rshrn2 \inout2\().4s, \tmp4\().2d, #14
+.endm
+
+// Same as dmbutterfly above, but treating the input in inout1 as zero
+.macro dmbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4
+smull \tmp1\().2d, \inout2\().2s, \coef2
+smull2 \tmp2\().2d, \inout2\().4s, \coef2
+smull \tmp3\().2d, \inout2\().2s, \coef1
+smull2 \tmp4\().2d, \inout2\().4s, \coef1
+neg \tmp1\().2d, \tmp1\().2d
+neg \tmp2\().2d, \tmp2\().2d
+rshrn \inout2\().2s, \tmp3\().2d, #14 +rshrn2 \inout2\().4s, \tmp4\().2d, #14 +rshrn \inout1\().2s, \tmp1\().2d, #14 +rshrn2 \inout1\().4s, \tmp2\().2d, #14 +.endm + +.macro dsmull_h out1, out2, in, coef +smull \out1\().2d, \in\().2s, \coef +smull2 \out2\().2d, \in\().4s, \coef +.endm + +.macro drshrn_h out, in1, in2, shift +rshrn \out\().2s, \in1\().2d, \shift +
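The dmbutterfly_h1/dmbutterfly_h2 macros added above specialize the generic butterfly rotation (out1 = in1*c1 - in2*c2, out2 = in1*c2 + in2*c1) for the case where one input is known to be zero, halving the multiply count. A scalar C model of the idea (illustrative only, not the codec's implementation; `round14` mimics the `rshrn #14` rounding narrowing shift, and the negation in `butterfly_h2` corresponds to the `neg` instructions in the macro):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the Neon "rshrn #14" rounding narrowing shift. */
static int32_t round14(int64_t v) {
    return (int32_t)((v + (1 << 13)) >> 14);
}

/* Generic butterfly rotation, as in dmbutterfly. */
static void butterfly(int32_t *o1, int32_t *o2,
                      int32_t in1, int32_t in2, int32_t c1, int32_t c2) {
    *o1 = round14((int64_t)in1 * c1 - (int64_t)in2 * c2);
    *o2 = round14((int64_t)in1 * c2 + (int64_t)in2 * c1);
}

/* dmbutterfly_h1: in2 is known to be zero, so each output needs
 * only one multiply. */
static void butterfly_h1(int32_t *o1, int32_t *o2,
                         int32_t in1, int32_t c1, int32_t c2) {
    *o1 = round14((int64_t)in1 * c1);
    *o2 = round14((int64_t)in1 * c2);
}

/* dmbutterfly_h2: in1 is known to be zero; only the first product
 * has to be negated before the rounding shift. */
static void butterfly_h2(int32_t *o1, int32_t *o2,
                         int32_t in2, int32_t c1, int32_t c2) {
    *o1 = round14(-((int64_t)in2 * c2));
    *o2 = round14((int64_t)in2 * c1);
}
```

Both specializations produce bit-identical results to the generic rotation with the corresponding input zeroed, which is why they can be substituted freely in the half/quarter code paths.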
[FFmpeg-devel] [PATCH 10/14] arm: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes.

This gives a small slowdown of a couple of tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37593.2  38862.2  32235.8  24070.9
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 43 ++--
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 29d95ca..8350153 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -807,7 +807,7 @@ function idct16x16_dc_add_neon
 endfunc
 .ltorg
-.macro idct16
+function idct16
 mbutterfly0 d16, d24, d16, d24, d8, d10, q4, q5 @ d16 = t0a, d24 = t1a
 mbutterfly d20, d28, d1[0], d1[1], q4, q5 @ d20 = t2a, d28 = t3a
 mbutterfly d18, d30, d2[0], d2[1], q4, q5 @ d18 = t4a, d30 = t7a
@@ -853,9 +853,10 @@ endfunc
 vmov d8, d21 @ d8 = t10a
 butterfly d20, d27, d10, d27 @ d20 = out[4], d27 = out[11]
 butterfly d21, d26, d26, d8 @ d21 = out[5], d26 = out[10]
-.endm
+bx lr
+endfunc
-.macro iadst16
+function iadst16
 movrel r12, iadst16_coeffs
 vld1.16 {q0}, [r12,:128]!
vmovl.s16 q1, d1 @@ -933,7 +934,8 @@ endfunc vmovd16, d2 vmovd30, d4 -.endm +bx lr +endfunc .macro itxfm16_1d_funcs txfm @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it, @@ -941,6 +943,8 @@ endfunc @ r0 = dst (temp buffer) @ r2 = src function \txfm\()16_1d_2x16_pass1_neon +push{lr} + mov r12, #64 vmov.s32q4, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 @@ -948,7 +952,7 @@ function \txfm\()16_1d_2x16_pass1_neon vst1.32 {d8}, [r2,:64], r12 .endr -\txfm\()16 +bl \txfm\()16 @ Do eight 2x2 transposes. Originally, d16-d31 contain the @ 16 rows. Afterwards, d16-d17, d18-d19 etc contain the eight @@ -959,7 +963,7 @@ function \txfm\()16_1d_2x16_pass1_neon .irp i, 16, 18, 20, 22, 24, 26, 28, 30, 17, 19, 21, 23, 25, 27, 29, 31 vst1.32 {d\i}, [r0,:64]! .endr -bx lr +pop {pc} endfunc @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it, @@ -968,6 +972,8 @@ endfunc @ r1 = dst stride @ r2 = src (temp buffer) function \txfm\()16_1d_2x16_pass2_neon +push{lr} + mov r12, #64 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 vld1.16 {d\i}, [r2,:64], r12 @@ -975,7 +981,7 @@ function \txfm\()16_1d_2x16_pass2_neon add r3, r0, r1 lsl r1, r1, #1 -\txfm\()16 +bl \txfm\()16 .macro load_add_store coef0, coef1, coef2, coef3 vrshr.s32 \coef0, \coef0, #6 @@ -1019,7 +1025,7 @@ function \txfm\()16_1d_2x16_pass2_neon load_add_store q12, q13, q14, q15 .purgem load_add_store -bx lr +pop {pc} endfunc .endm @@ -1193,7 +1199,7 @@ function idct32x32_dc_add_neon pop {r4-r9,pc} endfunc -.macro idct32_odd +function idct32_odd movrel r12, idct_coeffs @ Overwrite the idct16 coeffs with the stored ones for idct32 @@ -1262,7 +1268,8 @@ endfunc mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25, d22 = t22 mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a -.endm +bx lr +endfunc @ Do an 32-point IDCT of a 2x32 slice 
out of a 32x32 matrix. @ We don't have register space to do a single pass IDCT of 2x32 though, @@ -1274,6 +1281,8 @@ endfunc @ r1 = unused @ r2 = src function idct32_1d_2x32_pass1_neon
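The change above turns the idct16/iadst16 macro bodies into standalone functions entered with `bl` and returning via `bx lr`, so both passes of the 2-D transform share a single copy of the 1-D code instead of inlining it twice. The overall two-pass structure can be sketched in C (illustrative only: the real code works on 2- or 4-column register slices and a real 16- or 32-point transform; here a trivial doubling "transform" stands in, and the shared function is modeled by a function pointer):

```c
#include <assert.h>

#define N 4  /* the real code uses 16 or 32 */

typedef void (*txfm_1d)(int *col);  /* in-place 1-D transform on N values */

/* Hypothetical stand-in for idct16: any in-place 1-D transform
 * works for showing the pass structure. */
static void dummy_txfm(int *v) {
    for (int i = 0; i < N; i++)
        v[i] *= 2;
}

/* Pass 1: transform each column of src and store it transposed into
 * the temp buffer, so pass 2 can reuse the same column-oriented code. */
static void pass1(int tmp[N][N], int src[N][N], txfm_1d t) {
    for (int x = 0; x < N; x++) {
        int col[N];
        for (int y = 0; y < N; y++) col[y] = src[y][x];
        t(col);                                  /* was inline; now "bl idct16" */
        for (int y = 0; y < N; y++) tmp[x][y] = col[y];  /* transposed store */
    }
}

/* Pass 2: each column of tmp is a row of the column-transformed data;
 * transform it with the same shared function and accumulate into dst. */
static void pass2(int dst[N][N], int tmp[N][N], txfm_1d t) {
    for (int x = 0; x < N; x++) {
        int row[N];
        for (int j = 0; j < N; j++) row[j] = tmp[j][x];
        t(row);                                  /* same shared function */
        for (int j = 0; j < N; j++) dst[x][j] += row[j];
    }
}
```

With a shared function the 1-D transform exists once in the binary, which is where the code-size reduction in the commit message comes from; the cost is the call/return overhead the benchmarks show.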
[FFmpeg-devel] [PATCH 11/14] aarch64: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes.

This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms.

Before:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:   1887.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:  2801.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   9691.4
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:   1899.5
vp9_inv_dct_dct_16x16_sub16_add_10_neon:  2827.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   9714.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 45
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index a97c1b6..de1da55 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -710,7 +710,7 @@ function idct16x16_dc_add_neon
 ret
 endfunc
-.macro idct16
+function idct16
 dmbutterfly0 v16, v24, v16, v24, v4, v5, v6, v7, v8, v9 // v16 = t0a, v24 = t1a
 dmbutterfly v20, v28, v0.s[2], v0.s[3], v4, v5, v6, v7 // v20 = t2a, v28 = t3a
 dmbutterfly v18, v30, v1.s[0], v1.s[1], v4, v5, v6, v7 // v18 = t4a, v30 = t7a
@@ -753,9 +753,10 @@ endfunc
 butterfly_4s v19, v28, v5, v28 // v19 = out[3], v28 = out[12]
 butterfly_4s v20, v27, v6, v27 // v20 = out[4], v27 = out[11]
 butterfly_4s v21, v26, v26, v9 // v21 = out[5], v26 = out[10]
-.endm
+ret
+endfunc
-.macro iadst16
+function iadst16
 ld1 {v0.8h,v1.8h}, [x11]
 sxtl v2.4s, v1.4h
 sxtl2 v3.4s, v1.8h
@@ -830,7 +831,8 @@ endfunc
 mov v16.16b, v2.16b
 mov v30.16b, v4.16b
-.endm
+ret
+endfunc
 // Helper macros; we can't use these expressions directly within
 // e.g. .irp due to the extra concatenation \().
Therefore wrap @@ -857,12 +859,14 @@ endfunc // x9 = input stride .macro itxfm16_1d_funcs txfm function \txfm\()16_1d_4x16_pass1_neon +mov x14, x30 + moviv4.4s, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 load_clear \i, x2, x9 .endr -\txfm\()16 +bl \txfm\()16 // Do four 4x4 transposes. Originally, v16-v31 contain the // 16 rows. Afterwards, v16-v19, v20-v23, v24-v27 and v28-v31 @@ -878,7 +882,7 @@ function \txfm\()16_1d_4x16_pass1_neon .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31 store \i, x0, #16 .endr -ret +br x14 1: // Special case: For the last input column (x1 == 12), // which would be stored as the last row in the temp buffer, @@ -906,7 +910,7 @@ function \txfm\()16_1d_4x16_pass1_neon mov v29.16b, v17.16b mov v30.16b, v18.16b mov v31.16b, v19.16b -ret +br x14 endfunc // Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it, @@ -917,6 +921,8 @@ endfunc // x3 = slice offset // x9 = temp buffer stride function \txfm\()16_1d_4x16_pass2_neon +mov x14, x30 + .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27 load\i, x2, x9 .endr @@ -928,7 +934,7 @@ function \txfm\()16_1d_4x16_pass2_neon add x3, x0, x1 lsl x1, x1, #1 -\txfm\()16 +bl \txfm\()16 dup v8.8h, w13 .macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7 @@ -983,7 +989,7 @@ function \txfm\()16_1d_4x16_pass2_neon load_add_store v24.4s, v25.4s, v26.4s, v27.4s, v28.4s, v29.4s, v30.4s, v31.4s .purgem load_add_store -ret +br x14 endfunc .endm @@ -1158,7 +1164,7 @@ function idct32x32_dc_add_neon ret endfunc -.macro idct32_odd +function idct32_odd dmbutterfly v16, v31, v10.s[0], v10.s[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a dmbutterfly v24, v23, v10.s[2], v10.s[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a dmbutterfly v20, v27, v11.s[0], v11.s[1], v4, v5, v6, v7 // v20 = t18a, v27 = t29a @@ -1209,7 +1215,8 @@ endfunc dmbutterfly0v26, v21, v26, v21, v4, v5, v6, v7, v8, v9 // v26 = t26a, v21 = t21a dmbutterfly0v25, 
v22, v25, v22, v4, v5, v6, v7, v8, v9 // v25 = t25, v22 = t22 dmbutterfly0v24, v23, v24, v23, v4, v5, v6, v7, v8, v9 // v24 = t24a, v23 = t23a -.endm +ret +endfunc // Do an 32-point IDCT of a 4x32 slice out of a
[FFmpeg-devel] [PATCH 08/14] aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines
This makes the code a bit more readable.
---
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 24
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index f80604f..86ea29e 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -886,21 +886,21 @@ function \txfm\()16_1d_4x16_pass1_neon
 // for the first slice of the second pass (where it is the
 // last 4x4 block).
 add x0, x0, #16
-.irp i, 20, 24, 28
-store \i, x0, #16
-.endr
+st1 {v20.4s}, [x0], #16
+st1 {v24.4s}, [x0], #16
+st1 {v28.4s}, [x0], #16
 add x0, x0, #16
-.irp i, 21, 25, 29
-store \i, x0, #16
-.endr
+st1 {v21.4s}, [x0], #16
+st1 {v25.4s}, [x0], #16
+st1 {v29.4s}, [x0], #16
 add x0, x0, #16
-.irp i, 22, 26, 30
-store \i, x0, #16
-.endr
+st1 {v22.4s}, [x0], #16
+st1 {v26.4s}, [x0], #16
+st1 {v30.4s}, [x0], #16
 add x0, x0, #16
-.irp i, 23, 27, 31
-store \i, x0, #16
-.endr
+st1 {v23.4s}, [x0], #16
+st1 {v27.4s}, [x0], #16
+st1 {v31.4s}, [x0], #16
 mov v28.16b, v16.16b
 mov v29.16b, v17.16b
-- 
2.7.4
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 13/14] arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:                                   Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:      2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:     5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:     15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:     17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:    22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:    31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:    33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:    35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    37438.1  39103.4  32217.8  24067.6
---
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 529 +++
 1 file changed, 469 insertions(+), 60 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 8350153..b4f615e 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -82,6 +82,14 @@ endconst
 vrshrn.s64 \out2, \tmpq4, #14
 .endm
+@ Same as mbutterfly0 above, but treating the input in in2 as zero,
+@ writing the same output into both out1 and out2.
+.macro mbutterfly0_h out1, out2, in1, in2, tmpd1, tmpd2, tmpq3, tmpq4 +vmull.s32 \tmpq3, \in1, d0[0] +vrshrn.s64 \out1, \tmpq3, #14 +vrshrn.s64 \out2, \tmpq3, #14 +.endm + @ out1,out2 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14 @ out3,out4 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14 @ Same as mbutterfly0, but with input being 2 q registers, output @@ -148,6 +156,23 @@ endconst vrshrn.s64 \inout2, \tmp2, #14 .endm +@ Same as mbutterfly above, but treating the input in inout2 as zero +.macro mbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2 +vmull.s32 \tmp1, \inout1, \coef1 +vmull.s32 \tmp2, \inout1, \coef2 +vrshrn.s64 \inout1, \tmp1, #14 +vrshrn.s64 \inout2, \tmp2, #14 +.endm + +@ Same as mbutterfly above, but treating the input in inout1 as zero +.macro mbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2 +vmov.s64\tmp1, #0 +vmull.s32 \tmp2, \inout2, \coef1 +vmlsl.s32 \tmp1, \inout2, \coef2 +vrshrn.s64 \inout2, \tmp2, #14 +vrshrn.s64 \inout1, \tmp1, #14 +.endm + @ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14 @ inout3,inout4 = (inout1,inout2 * coef2 +
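The dispatch idea behind these sub-partition variants: the decoder's eob (end-of-block) count says how many coefficients can be nonzero, and when all of them fall within the first quarter or half of the input rows, a reduced transform that treats the rest as known zeros is run instead, which is what the _h macro variants above support. A minimal C sketch of the selection logic; the threshold values here are illustrative, not necessarily the codec's actual cut-offs:

```c
#include <assert.h>

typedef enum { IDCT_QUARTER, IDCT_HALF, IDCT_FULL } idct_variant;

/* Pick a reduced idct16 variant from the end-of-block coefficient
 * count. Thresholds are hypothetical examples: a quarter variant
 * only needs the first 4 input rows, a half variant the first 8,
 * so pass 1 can also skip zeroing the untouched part of the temp
 * buffer that pass 2 will never read. */
static idct_variant idct16_select(int eob) {
    if (eob <= 10)
        return IDCT_QUARTER;
    if (eob <= 38)
        return IDCT_HALF;
    return IDCT_FULL;
}
```

The benchmark tables above reflect exactly this: the sub2/sub4 cases hit the quarter path and roughly halve their runtime, sub8 hits the half path, and the larger sub-partitions fall through to the unchanged full transform.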
[FFmpeg-devel] [PATCH 03/14] arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfad12002033c73feed422a1cfc62081e8f. --- libavcodec/aarch64/vp9itxfm_neon.S | 36 ++-- libavcodec/arm/vp9itxfm_neon.S | 14 +++--- libavcodec/arm/vp9lpf_neon.S | 2 +- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3e5da08..b12890f 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -380,7 +380,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 .ifc \txfm1\()_\txfm2,idct_idct movrel x4, idct_coeffs .else -movrel x4, iadst8_coeffs +movrel x4, iadst8_coeffs ld1 {v1.8h}, [x4], #16 .endif ld1 {v0.8h}, [x4] @@ -480,23 +480,23 @@ itxfm_func8x8 iadst, iadst function idct16x16_dc_add_neon -movrel x4, idct_coeffs +movrel x4, idct_coeffs ld1 {v0.4h}, [x4] -moviv1.4h, #0 +moviv1.4h, #0 ld1 {v2.h}[0], [x2] -smull v2.4s, v2.4h, v0.h[0] -rshrn v2.4h, v2.4s, #14 -smull v2.4s, v2.4h, v0.h[0] -rshrn v2.4h, v2.4s, #14 +smull v2.4s, v2.4h, v0.h[0] +rshrn v2.4h, v2.4s, #14 +smull v2.4s, v2.4h, v0.h[0] +rshrn v2.4h, v2.4s, #14 dup v2.8h, v2.h[0] st1 {v1.h}[0], [x2] -srshr v2.8h, v2.8h, #6 +srshr v2.8h, v2.8h, #6 -mov x3, x0 -mov x4, #16 +mov x3, x0 +mov x4, #16 1: // Loop to add the constant from v2 into all 16x16 outputs subsx4, x4, #2 @@ -869,7 +869,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1 .ifc \txfm1,idct ld1 {v0.8h,v1.8h}, [x10] .endif -mov x9, #32 +mov x9, #32 .ifc \txfm1\()_\txfm2,idct_idct cmp w3, #10 @@ -1046,10 +1046,10 @@ idct16_partial quarter idct16_partial half function idct32x32_dc_add_neon -movrel x4, idct_coeffs +movrel x4, idct_coeffs ld1 {v0.4h}, [x4] -moviv1.4h, #0 +moviv1.4h, #0 ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] @@ -1059,10 +1059,10 @@ function 
idct32x32_dc_add_neon dup v2.8h, v2.h[0] st1 {v1.h}[0], [x2] -srshr v0.8h, v2.8h, #6 +srshr v0.8h, v2.8h, #6 -mov x3, x0 -mov x4, #32 +mov x3, x0 +mov x4, #32 1: // Loop to add the constant v0 into all 32x32 outputs subsx4, x4, #2 @@ -1230,7 +1230,7 @@ endfunc // x9 = double input stride function idct32_1d_8x32_pass1\suffix\()_neon mov x14, x30 -moviv2.8h, #0 +moviv2.8h, #0 // v16 = IN(0), v17 = IN(2) ... v31 = IN(30) .ifb \suffix @@ -1295,7 +1295,7 @@ function idct32_1d_8x32_pass1\suffix\()_neon .endif add x2, x2, #64 -moviv2.8h, #0 +moviv2.8h, #0 // v16 = IN(1), v17 = IN(3) ... v31 = IN(31) .ifb \suffix .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 6d4d765..6c09922 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -530,7 +530,7 @@ function idct16x16_dc_add_neon movrel r12, idct_coeffs vld1.16 {d0}, [r12,:64] -vmov.i16q2, #0 +vmov.i16q2, #0 vld1.16 {d16[]}, [r2,:16] vmull.s16 q8, d16, d0[0] @@ -793,7 +793,7 @@ function \txfm\()16_1d_4x16_pass1_neon push{lr} mov r12, #32 -vmov.s16q2, #0 +vmov.s16q2, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 vld1.16 {d\i}, [r2,:64] vst1.16 {d4}, [r2,:64], r12 @@ -1142,7 +1142,7 @@ function idct32x32_dc_add_neon movrel r12, idct_coeffs vld1.16 {d0}, [r12,:64] -vmov.i16q2, #0 +vmov.i16q2, #0 vld1.16 {d16[]}, [r2,:16] vmull.s16
[FFmpeg-devel] [PATCH 09/14] aarch64: vp9itxfm16: Restructure the idct32 store macros
This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 90 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index 86ea29e..a97c1b6 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S @@ -1244,27 +1244,27 @@ function idct32_1d_4x32_pass1_neon .macro store_rev a, b, c, d // There's no rev128 instruction, but we reverse each 64 bit // half, and then flip them using an ext with 8 bytes offset. -rev64 v7.4s, v\d\().4s -st1 {v\a\().4s}, [x0], #16 +rev64 v7.4s, \d +st1 {\a}, [x0], #16 ext v7.16b, v7.16b, v7.16b, #8 -st1 {v\b\().4s}, [x0], #16 -rev64 v6.4s, v\c\().4s -st1 {v\c\().4s}, [x0], #16 +st1 {\b}, [x0], #16 +rev64 v6.4s, \c +st1 {\c}, [x0], #16 ext v6.16b, v6.16b, v6.16b, #8 -st1 {v\d\().4s}, [x0], #16 -rev64 v5.4s, v\b\().4s +st1 {\d}, [x0], #16 +rev64 v5.4s, \b st1 {v7.4s}, [x0], #16 ext v5.16b, v5.16b, v5.16b, #8 st1 {v6.4s}, [x0], #16 -rev64 v4.4s, v\a\().4s +rev64 v4.4s, \a st1 {v5.4s}, [x0], #16 ext v4.16b, v4.16b, v4.16b, #8 st1 {v4.4s}, [x0], #16 .endm -store_rev 16, 20, 24, 28 -store_rev 17, 21, 25, 29 -store_rev 18, 22, 26, 30 -store_rev 19, 23, 27, 31 +store_rev v16.4s, v20.4s, v24.4s, v28.4s +store_rev v17.4s, v21.4s, v25.4s, v29.4s +store_rev v18.4s, v22.4s, v26.4s, v30.4s +store_rev v19.4s, v23.4s, v27.4s, v31.4s sub x0, x0, #512 .purgem store_rev @@ -1290,27 +1290,27 @@ function idct32_1d_4x32_pass1_neon // Store the registers a, b, c, d horizontally, // adding into the output first, and the mirrored, // subtracted from the output. 
-.macro store_rev a, b, c, d +.macro store_rev a, b, c, d, a16b, b16b ld1 {v4.4s}, [x0] -rev64 v9.4s, v\d\().4s -add v4.4s, v4.4s, v\a\().4s +rev64 v9.4s, \d +add v4.4s, v4.4s, \a st1 {v4.4s}, [x0], #16 -rev64 v8.4s, v\c\().4s +rev64 v8.4s, \c ld1 {v4.4s}, [x0] ext v9.16b, v9.16b, v9.16b, #8 -add v4.4s, v4.4s, v\b\().4s +add v4.4s, v4.4s, \b st1 {v4.4s}, [x0], #16 ext v8.16b, v8.16b, v8.16b, #8 ld1 {v4.4s}, [x0] -rev64 v\b\().4s, v\b\().4s -add v4.4s, v4.4s, v\c\().4s +rev64 \b, \b +add v4.4s, v4.4s, \c st1 {v4.4s}, [x0], #16 -rev64 v\a\().4s, v\a\().4s +rev64 \a, \a ld1 {v4.4s}, [x0] -ext v\b\().16b, v\b\().16b, v\b\().16b, #8 -add v4.4s, v4.4s, v\d\().4s +ext \b16b, \b16b, \b16b, #8 +add v4.4s, v4.4s, \d st1 {v4.4s}, [x0], #16 -ext v\a\().16b, v\a\().16b, v\a\().16b, #8 +ext \a16b, \a16b, \a16b, #8 ld1 {v4.4s}, [x0] sub v4.4s, v4.4s, v9.4s st1 {v4.4s}, [x0], #16 @@ -1318,17 +1318,17 @@ function idct32_1d_4x32_pass1_neon sub v4.4s, v4.4s, v8.4s st1 {v4.4s}, [x0], #16 ld1 {v4.4s}, [x0] -sub v4.4s, v4.4s, v\b\().4s +sub v4.4s, v4.4s, \b st1 {v4.4s}, [x0], #16 ld1 {v4.4s}, [x0] -sub v4.4s, v4.4s, v\a\().4s +sub v4.4s, v4.4s, \a st1 {v4.4s}, [x0], #16 .endm -store_rev 31, 27, 23, 19 -store_rev 30, 26, 22, 18 -store_rev 29, 25, 21, 17 -store_rev 28, 24, 20, 16 +store_rev v31.4s, v27.4s, v23.4s, v19.4s, v31.16b, v27.16b +store_rev v30.4s, v26.4s, v22.4s, v18.4s, v30.16b, v26.16b +store_rev v29.4s, v25.4s, v21.4s, v17.4s, v29.16b, v25.16b +store_rev v28.4s, v24.4s, v20.4s, v16.4s, v28.16b, v24.16b .purgem store_rev ret endfunc @@ -1370,21 +1370,21 @@ function
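The store_rev macro's comment notes there is no rev128 instruction, so reversing the four 32-bit lanes of a vector is done in two steps: `rev64` swaps the lanes within each 64-bit half, and `ext` with an 8-byte offset then swaps the two halves. A scalar model of the two instructions (illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* rev64 on four 32-bit lanes: reverse within each 64-bit half. */
static void rev64_4s(uint32_t v[4]) {
    uint32_t t = v[0]; v[0] = v[1]; v[1] = t;
    t = v[2]; v[2] = v[3]; v[3] = t;
}

/* ext v, v, v, #8: extract 16 bytes starting at byte offset 8 of
 * the doubled vector, i.e. swap the two 64-bit halves. */
static void ext8_4s(uint32_t v[4]) {
    uint32_t t0 = v[0], t1 = v[1];
    v[0] = v[2]; v[1] = v[3];
    v[2] = t0;   v[3] = t1;
}
```

Composing the two gives a full lane reversal, e.g. {1,2,3,4} → rev64 → {2,1,4,3} → ext #8 → {4,3,2,1}, which is what store_rev relies on when mirroring rows into the temp buffer.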
[FFmpeg-devel] [PATCH 08/34] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14740 bytes to 24292 bytes.

The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:
vp9_inv_dct_dct_16x16_sub1_add_neon:   236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:  1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon:  1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon:  1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon:   554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:  5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon:  5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon:  5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   236.7
vp9_inv_dct_dct_16x16_sub2_add_neon:   640.8
vp9_inv_dct_dct_16x16_sub4_add_neon:   639.0
vp9_inv_dct_dct_16x16_sub8_add_neon:   842.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon:   554.1
vp9_inv_dct_dct_32x32_sub2_add_neon:  3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon:  3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon:  3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2
This is cherrypicked from libav commit a63da4511d0fee66695ff4afd264ba1dbf1e812d. --- libavcodec/aarch64/vp9itxfm_neon.S | 525 - 1 file changed, 466 insertions(+), 59 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index e45d385..3eb999a 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -75,6 +75,17 @@ endconst .endif .endm +// Same as dmbutterfly0 above, but treating the input in in2 as zero, +// writing the same output into both out1 and out2. +.macro dmbutterfly0_h out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6 +smull \tmp1\().4s, \in1\().4h, v0.h[0] +smull2 \tmp2\().4s, \in1\().8h, v0.h[0] +rshrn \out1\().4h, \tmp1\().4s, #14 +rshrn2 \out1\().8h, \tmp2\().4s, #14 +rshrn \out2\().4h, \tmp1\().4s, #14 +rshrn2 \out2\().8h, \tmp2\().4s, #14 +.endm + // out1,out2 = in1 * coef1 - in2 * coef2 // out3,out4 = in1 * coef2 + in2 * coef1 // out are 4 x .4s registers, in are 2 x .8h registers @@ -104,6 +115,43 @@ endconst rshrn2 \inout2\().8h, \tmp4\().4s, #14 .endm +// Same as dmbutterfly above, but treating the input in inout2 as zero +.macro dmbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4 +smull \tmp1\().4s, \inout1\().4h, \coef1 +smull2 \tmp2\().4s, \inout1\().8h, \coef1 +smull \tmp3\().4s, \inout1\().4h, \coef2 +smull2 \tmp4\().4s, \inout1\().8h, \coef2 +rshrn \inout1\().4h, \tmp1\().4s, #14 +rshrn2 \inout1\().8h, \tmp2\().4s, #14 +rshrn \inout2\().4h, \tmp3\().4s, #14 +rshrn2 \inout2\().8h, \tmp4\().4s, #14 +.endm + +// Same as dmbutterfly above, but treating the input in inout1 as zero +.macro dmbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2, tmp3, tmp4 +smull \tmp1\().4s, \inout2\().4h, \coef2 +smull2 \tmp2\().4s, \inout2\().8h, \coef2 +smull \tmp3\().4s, \inout2\().4h, \coef1 +smull2 \tmp4\().4s, \inout2\().8h, \coef1 +neg \tmp1\().4s, \tmp1\().4s +neg \tmp2\().4s, \tmp2\().4s +rshrn \inout2\().4h, \tmp3\().4s, #14 +rshrn2 \inout2\().8h, 
\tmp4\().4s, #14 +rshrn \inout1\().4h, \tmp1\().4s, #14 +rshrn2 \inout1\().8h, \tmp2\().4s, #14 +.endm + +.macro dsmull_h out1, out2, in, coef +smull \out1\().4s, \in\().4h, \coef +smull2 \out2\().4s, \in\().8h, \coef +.endm + +.macro drshrn_h out, in1, in2, shift +rshrn \out\().4h, \in1\().4s, \shift +rshrn2 \out\().8h,
[FFmpeg-devel] [PATCH 07/34] arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 12388 bytes to 19784 bytes.

The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy.

Before:                                Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:       273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:      2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:      2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:      2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:     2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:     3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:       758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:     10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:     10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:     12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:    13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:    14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:    15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:    16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:    17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:    18696.2  17049.2  14355.6  12089.7

After:
vp9_inv_dct_dct_16x16_sub1_add_neon:       273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:      1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:      1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:      1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:     2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:     3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:       753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:      7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:      7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:      8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:    10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:    11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:    15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:    16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:    17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:    18531.3  17457.0  14298.6  12080.0

This is cherrypicked from libav commit 5eb5aec475aabc884d083566f902876ecbc072cb.
---
 libavcodec/arm/vp9itxfm_neon.S | 591 +
 1 file changed, 537 insertions(+), 54 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 682a82e..33a7af1 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -74,6 +74,14 @@ endconst
 vrshrn.s32 \out2, \tmpq4, #14
 .endm
+@ Same as mbutterfly0 above, but treating the input in in2 as zero,
+@ writing the same output into both out1 and out2.
+.macro mbutterfly0_h out1, out2, in1, in2, tmpd1, tmpd2, tmpq3, tmpq4 +vmull.s16 \tmpq3, \in1, d0[0] +vrshrn.s32 \out1, \tmpq3, #14 +vrshrn.s32 \out2, \tmpq3, #14 +.endm + @ out1,out2 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14 @ out3,out4 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14 @ Same as mbutterfly0, but with input being 2 q registers, output @@ -137,6 +145,23 @@ endconst vrshrn.s32 \inout2, \tmp2, #14 .endm +@ Same as mbutterfly above, but treating the input in inout2 as zero +.macro mbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2 +vmull.s16 \tmp1, \inout1, \coef1 +vmull.s16 \tmp2, \inout1, \coef2 +vrshrn.s32 \inout1, \tmp1, #14 +vrshrn.s32 \inout2, \tmp2, #14 +.endm + +@ Same as mbutterfly above, but treating the input in inout1 as zero +.macro mbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2 +vmull.s16 \tmp1, \inout2, \coef2 +vmull.s16 \tmp2, \inout2, \coef1 +vneg.s32\tmp1, \tmp1 +vrshrn.s32 \inout2, \tmp2, #14 +vrshrn.s32 \inout1, \tmp1, #14 +.endm + @ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14 @ inout3,inout4 = (inout1,inout2 * coef2 + inout3,inout4 * coef1 + (1 << 13)) >> 14 @
[FFmpeg-devel] [PATCH 17/34] aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
No measured speedup on a Cortex A53, but other cores might benefit. This is cherrypicked from libav commit 388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7. --- libavcodec/aarch64/vp9mc_neon.S | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S index 9403911..82a0f53 100644 --- a/libavcodec/aarch64/vp9mc_neon.S +++ b/libavcodec/aarch64/vp9mc_neon.S @@ -202,9 +202,12 @@ endfunc ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset) mla \dst2\().8h, v21.8h, v0.h[\offset] mla \dst4\().8h, v23.8h, v0.h[\offset] -.else +.elseif \size == 8 mla \dst1\().8h, v20.8h, v0.h[\offset] mla \dst3\().8h, v22.8h, v0.h[\offset] +.else +mla \dst1\().4h, v20.4h, v0.h[\offset] +mla \dst3\().4h, v22.4h, v0.h[\offset] .endif .endm // The same as above, but don't accumulate straight into the @@ -219,16 +222,24 @@ endfunc ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset) mul v21.8h, v21.8h, v0.h[\offset] mul v23.8h, v23.8h, v0.h[\offset] -.else +.elseif \size == 8 mul v20.8h, v20.8h, v0.h[\offset] mul v22.8h, v22.8h, v0.h[\offset] +.else +mul v20.4h, v20.4h, v0.h[\offset] +mul v22.4h, v22.4h, v0.h[\offset] .endif +.if \size == 4 +sqadd \dst1\().4h, \dst1\().4h, v20.4h +sqadd \dst3\().4h, \dst3\().4h, v22.4h +.else sqadd \dst1\().8h, \dst1\().8h, v20.8h sqadd \dst3\().8h, \dst3\().8h, v22.8h .if \size >= 16 sqadd \dst2\().8h, \dst2\().8h, v21.8h sqadd \dst4\().8h, \dst4\().8h, v23.8h .endif +.endif .endm -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 16/34] arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
Before:                       Cortex A7      A8      A9     A53
vp9_put_8tap_smooth_4h_neon:      378.1   273.2   340.7   229.5
After:
vp9_put_8tap_smooth_4h_neon:      352.1   222.2   290.5   229.5
This is cherrypicked from libav commit fea92a4b57d1c328b1de226a5f213a629ee63754. --- libavcodec/arm/vp9mc_neon.S | 33 ++--- 1 file changed, 22 insertions(+), 11 deletions(-) diff --git a/libavcodec/arm/vp9mc_neon.S b/libavcodec/arm/vp9mc_neon.S index 83235ff..bd8cda7 100644 --- a/libavcodec/arm/vp9mc_neon.S +++ b/libavcodec/arm/vp9mc_neon.S @@ -209,7 +209,7 @@ endfunc @ Extract a vector from src1-src2 and src4-src5 (src1-src3 and src4-src6 @ for size >= 16), and multiply-accumulate into dst1 and dst3 (or @ dst1-dst2 and dst3-dst4 for size >= 16) -.macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size +.macro extmla dst1, dst2, dst3, dst4, dst1d, dst3d, src1, src2, src3, src4, src5, src6, offset, size vext.8 q14, \src1, \src2, #(2*\offset) vext.8 q15, \src4, \src5, #(2*\offset) .if \size >= 16 @@ -219,14 +219,17 @@ endfunc vext.8 q6, \src5, \src6, #(2*\offset) vmla_lane \dst2, q5, \offset vmla_lane \dst4, q6, \offset -.else +.elseif \size == 8 vmla_lane \dst1, q14, \offset vmla_lane \dst3, q15, \offset +.else +vmla_lane \dst1d, d28, \offset +vmla_lane \dst3d, d30, \offset .endif .endm @ The same as above, but don't accumulate straight into the @ destination, but use a temp register and accumulate with saturation.
-.macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size +.macro extmulqadd dst1, dst2, dst3, dst4, dst1d, dst3d, src1, src2, src3, src4, src5, src6, offset, size vext.8 q14, \src1, \src2, #(2*\offset) vext.8 q15, \src4, \src5, #(2*\offset) .if \size >= 16 @@ -236,16 +239,24 @@ endfunc vext.8 q6, \src5, \src6, #(2*\offset) vmul_lane q5, q5, \offset vmul_lane q6, q6, \offset -.else +.elseif \size == 8 vmul_lane q14, q14, \offset vmul_lane q15, q15, \offset +.else +vmul_lane d28, d28, \offset +vmul_lane d30, d30, \offset .endif +.if \size == 4 +vqadd.s16 \dst1d, \dst1d, d28 +vqadd.s16 \dst3d, \dst3d, d30 +.else vqadd.s16 \dst1, \dst1, q14 vqadd.s16 \dst3, \dst3, q15 .if \size >= 16 vqadd.s16 \dst2, \dst2, q5 vqadd.s16 \dst4, \dst4, q6 .endif +.endif .endm @@ -308,13 +319,13 @@ function \type\()_8tap_\size\()h_\idx1\idx2 vmul.s16q2, q9, d0[0] vmul.s16q4, q12, d0[0] .endif -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, 1, \size -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, 2, \size -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, \idx1, \size -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, 5, \size -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, 6, \size -extmla q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, 7, \size -extmulqadd q1, q2, q3, q4, q8, q9, q10, q11, q12, q13, \idx2, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, 1, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, 2, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, \idx1, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, 5, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, 6, \size +extmla q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, 7, \size +extmulqadd q1, q2, q3, q4, d2, d6, q8, q9, q10, q11, q12, q13, \idx2, \size @ Round, shift and saturate vqrshrun.s16d2, q1, #7 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org 
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
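The idea behind these two patches is that a 4-pixel-wide block needs only four filtered outputs per row, so accumulating into 4-lane d registers skips the lanes that would be discarded. A scalar sketch of an 8-tap horizontal filter of this shape (the tap values here are illustrative, symmetric and summing to 128; they are not the actual VP9 filter table):

```python
def filter_h_8tap(src, x, taps):
    """One output pixel of an 8-tap horizontal filter: accumulate in
    wider precision, then round and narrow with a shift by 7 (the
    vqrshrun.s16 #7 step), clamping to [0, 255]."""
    acc = sum(src[x + i] * taps[i] for i in range(8))
    return min(max((acc + 64) >> 7, 0), 255)

def put_4h(src, taps):
    # A 4-pixel-wide block needs only 4 outputs per row; the patched
    # NEON code accumulates these in 4-lane d registers instead of
    # computing 8 lanes and throwing half of them away.
    return [filter_h_8tap(src, x, taps) for x in range(4)]

# Illustrative symmetric taps summing to 128 (not the real VP9 table):
example_taps = [-1, 3, 12, 50, 50, 12, 3, -1]
# A flat input must come out unchanged, since the taps sum to 1.0 in Q7:
assert put_4h([100] * 16, example_taps) == [100] * 4
```

The output is identical either way; the patches only remove arithmetic on lanes whose results were never stored.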
[FFmpeg-devel] [PATCH 14/34] aarch64: vp9itxfm: Fix incorrect vertical alignment
This is cherrypicked from libav commit 0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32. --- libavcodec/aarch64/vp9itxfm_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 5219d6e..6bb097b 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -225,7 +225,7 @@ endconst add v21.4s,v17.4s,v19.4s rshrn \c0\().4h, v20.4s,#14 add v16.4s,v16.4s,v17.4s -rshrn \c1\().4h, v21.4s, #14 +rshrn \c1\().4h, v21.4s,#14 sub v16.4s,v16.4s,v19.4s rshrn \c2\().4h, v18.4s,#14 rshrn \c3\().4h, v16.4s,#14 @@ -1313,8 +1313,8 @@ function idct32_1d_8x32_pass1\suffix\()_neon bl idct32_odd\suffix -transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3 -transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3 +transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3 +transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3 // Store the registers a, b horizontally, // adding into the output first, and the mirrored, -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 04/34] aarch64: vp9itxfm: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_neon:1036.7 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2 vp9_inv_dct_dct_32x32_sub4_add_neon:5180.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7 After: vp9_inv_dct_dct_16x16_sub4_add_neon:1051.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1 vp9_inv_dct_dct_32x32_sub4_add_neon:5199.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8 This is cherrypicked from libav commit 115476018d2c97df7e9b4445fe8f6cc7420ab91f. --- libavcodec/aarch64/vp9itxfm_neon.S | 42 +++--- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 7427963..a37b459 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -463,7 +463,7 @@ function idct16x16_dc_add_neon ret endfunc -.macro idct16 +function idct16 dmbutterfly0v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a, v24 = t1a dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a, v28 = t3a dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a, v30 = t7a @@ -506,9 +506,10 @@ endfunc butterfly_8hv19, v28, v5, v28 // v19 = out[3], v28 = out[12] butterfly_8hv20, v27, v6, v27 // v20 = out[4], v27 = out[11] butterfly_8hv21, v26, v26, v3// v21 = out[5], v26 = out[10] -.endm +ret +endfunc -.macro iadst16 +function iadst16 ld1 {v0.8h,v1.8h}, [x11] dmbutterfly_l v6, v7, v4, v5, v31, v16, v0.h[1], v0.h[0] // v6,v7 = t1, v4,v5 = t0 @@ -577,7 +578,8 @@ endfunc mov v16.16b, v2.16b mov v30.16b, v4.16b -.endm +ret +endfunc // Helper macros; we can't use these expressions directly within // e.g. .irp due to the extra concatenation \(). 
Therefore wrap @@ -604,12 +606,14 @@ endfunc // x9 = input stride .macro itxfm16_1d_funcs txfm function \txfm\()16_1d_8x16_pass1_neon +mov x14, x30 + moviv2.8h, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 load_clear \i, x2, x9 .endr -\txfm\()16 +bl \txfm\()16 // Do two 8x8 transposes. Originally, v16-v31 contain the // 16 rows. Afterwards, v16-v23 and v24-v31 contain the two @@ -623,7 +627,7 @@ function \txfm\()16_1d_8x16_pass1_neon .irp i, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31 store \i, x0, #16 .endr -ret +br x14 1: // Special case: For the last input column (x1 == 8), // which would be stored as the last row in the temp buffer, @@ -642,7 +646,7 @@ function \txfm\()16_1d_8x16_pass1_neon mov v29.16b, v21.16b mov v30.16b, v22.16b mov v31.16b, v23.16b -ret +br x14 endfunc // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it, @@ -653,6 +657,7 @@ endfunc // x3 = slice offset // x9 = temp buffer stride function \txfm\()16_1d_8x16_pass2_neon +mov x14, x30 .irp i, 16, 17, 18, 19, 20, 21, 22, 23 load\i, x2, x9 .endr @@ -664,7 +669,7 @@ function \txfm\()16_1d_8x16_pass2_neon add x3, x0, x1 lsl x1, x1, #1 -\txfm\()16 +bl \txfm\()16 .macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2 srshr \coef0, \coef0, #6 @@ -714,7 +719,7 @@ function \txfm\()16_1d_8x16_pass2_neon load_add_store v24.8h, v25.8h, v26.8h, v27.8h, v28.8h, v29.8h, v30.8h, v31.8h, v16.8b, v17.8b .purgem load_add_store -ret +br x14 endfunc .endm @@ -843,7 +848,7 @@ function idct32x32_dc_add_neon ret endfunc -.macro idct32_odd +function idct32_odd ld1 {v0.8h,v1.8h}, [x11] dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a @@ -898,7 +903,8 @@ endfunc dmbutterfly0v26, v21, v26, v21, v2, v3, v4, v5, v6, v7 // v26 = t26a, v21 = t21a dmbutterfly0v25, v22, v25, v22, v2, v3, v4, v5, v6, v7 // v25 = t25, v22 = t22 dmbutterfly0v24, v23, v24, v23, v2, v3, v4, v5, v6, v7 // v24 = t24a, 
v23 = t23a -.endm +ret +endfunc // Do an 32-point IDCT of a 8x32 slice out of a 32x32 matrix. // The 32-point IDCT can be decomposed into two 16-point IDCTs; @@
[FFmpeg-devel] [PATCH 01/34] arm: vp9itxfm: Avoid .irp when it doesn't save any lines
This makes it more readable. This is cherrypicked from libav commit 3bc5b28d5a191864c54bba60646933a63da31656. --- libavcodec/arm/vp9itxfm_neon.S | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 25f6dde..93816d2 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -690,21 +690,21 @@ function \txfm\()16_1d_4x16_pass1_neon @ for the first slice of the second pass (where it is the @ last 4x4 block). add r0, r0, #8 -.irp i, 20, 24, 28 -vst1.16 {d\i}, [r0,:64]! -.endr +vst1.16 {d20}, [r0,:64]! +vst1.16 {d24}, [r0,:64]! +vst1.16 {d28}, [r0,:64]! add r0, r0, #8 -.irp i, 21, 25, 29 -vst1.16 {d\i}, [r0,:64]! -.endr +vst1.16 {d21}, [r0,:64]! +vst1.16 {d25}, [r0,:64]! +vst1.16 {d29}, [r0,:64]! add r0, r0, #8 -.irp i, 22, 26, 30 -vst1.16 {d\i}, [r0,:64]! -.endr +vst1.16 {d22}, [r0,:64]! +vst1.16 {d26}, [r0,:64]! +vst1.16 {d30}, [r0,:64]! add r0, r0, #8 -.irp i, 23, 27, 31 -vst1.16 {d\i}, [r0,:64]! -.endr +vst1.16 {d23}, [r0,:64]! +vst1.16 {d27}, [r0,:64]! +vst1.16 {d31}, [r0,:64]! vmovd28, d16 vmovd29, d17 vmovd30, d18 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 02/34] aarch64: vp9itxfm: Restructure the idct32 store macros
This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. This is cherrypicked from libav commit 58d87e0f49bcbbc6f426328f53b657bae7430cd2. --- libavcodec/aarch64/vp9itxfm_neon.S | 80 +++--- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 82f1f41..7427963 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -935,23 +935,23 @@ function idct32_1d_8x32_pass1_neon .macro store_rev a, b // There's no rev128 instruction, but we reverse each 64 bit // half, and then flip them using an ext with 8 bytes offset. -rev64 v1.8h, v\b\().8h -st1 {v\a\().8h}, [x0], #16 -rev64 v0.8h, v\a\().8h +rev64 v1.8h, \b +st1 {\a}, [x0], #16 +rev64 v0.8h, \a ext v1.16b, v1.16b, v1.16b, #8 -st1 {v\b\().8h}, [x0], #16 +st1 {\b}, [x0], #16 ext v0.16b, v0.16b, v0.16b, #8 st1 {v1.8h}, [x0], #16 st1 {v0.8h}, [x0], #16 .endm -store_rev 16, 24 -store_rev 17, 25 -store_rev 18, 26 -store_rev 19, 27 -store_rev 20, 28 -store_rev 21, 29 -store_rev 22, 30 -store_rev 23, 31 +store_rev v16.8h, v24.8h +store_rev v17.8h, v25.8h +store_rev v18.8h, v26.8h +store_rev v19.8h, v27.8h +store_rev v20.8h, v28.8h +store_rev v21.8h, v29.8h +store_rev v22.8h, v30.8h +store_rev v23.8h, v31.8h sub x0, x0, #512 .purgem store_rev @@ -977,14 +977,14 @@ function idct32_1d_8x32_pass1_neon // subtracted from the output. 
.macro store_rev a, b ld1 {v4.8h}, [x0] -rev64 v1.8h, v\b\().8h -add v4.8h, v4.8h, v\a\().8h -rev64 v0.8h, v\a\().8h +rev64 v1.8h, \b +add v4.8h, v4.8h, \a +rev64 v0.8h, \a st1 {v4.8h}, [x0], #16 ext v1.16b, v1.16b, v1.16b, #8 ld1 {v5.8h}, [x0] ext v0.16b, v0.16b, v0.16b, #8 -add v5.8h, v5.8h, v\b\().8h +add v5.8h, v5.8h, \b st1 {v5.8h}, [x0], #16 ld1 {v6.8h}, [x0] sub v6.8h, v6.8h, v1.8h @@ -994,14 +994,14 @@ function idct32_1d_8x32_pass1_neon st1 {v7.8h}, [x0], #16 .endm -store_rev 31, 23 -store_rev 30, 22 -store_rev 29, 21 -store_rev 28, 20 -store_rev 27, 19 -store_rev 26, 18 -store_rev 25, 17 -store_rev 24, 16 +store_rev v31.8h, v23.8h +store_rev v30.8h, v22.8h +store_rev v29.8h, v21.8h +store_rev v28.8h, v20.8h +store_rev v27.8h, v19.8h +store_rev v26.8h, v18.8h +store_rev v25.8h, v17.8h +store_rev v24.8h, v16.8h .purgem store_rev ret endfunc @@ -1047,21 +1047,21 @@ function idct32_1d_8x32_pass2_neon .if \neg == 0 ld1 {v4.8h}, [x2], x9 ld1 {v5.8h}, [x2], x9 -add v4.8h, v4.8h, v\a\().8h +add v4.8h, v4.8h, \a ld1 {v6.8h}, [x2], x9 -add v5.8h, v5.8h, v\b\().8h +add v5.8h, v5.8h, \b ld1 {v7.8h}, [x2], x9 -add v6.8h, v6.8h, v\c\().8h -add v7.8h, v7.8h, v\d\().8h +add v6.8h, v6.8h, \c +add v7.8h, v7.8h, \d .else ld1 {v4.8h}, [x2], x7 ld1 {v5.8h}, [x2], x7 -sub v4.8h, v4.8h, v\a\().8h +sub v4.8h, v4.8h, \a ld1 {v6.8h}, [x2], x7 -sub v5.8h, v5.8h, v\b\().8h +sub v5.8h, v5.8h, \b ld1 {v7.8h}, [x2], x7 -sub v6.8h, v6.8h, v\c\().8h -sub v7.8h, v7.8h, v\d\().8h +sub v6.8h, v6.8h, \c +sub v7.8h, v7.8h, \d .endif ld1 {v0.8b}, [x0], x1 ld1 {v1.8b}, [x0], x1 @@ -1085,15 +1085,15 @@ function idct32_1d_8x32_pass2_neon st1 {v6.8b}, [x0], x1 st1 {v7.8b}, [x0], x1 .endm -load_acc_store 31, 30, 29, 28 -load_acc_store 27, 26, 25, 24 -
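The store_rev comment above notes that there is no rev128 instruction, so a full 128-bit lane reversal is composed from rev64 (reverse within each 64-bit half) plus an ext with an 8-byte offset (swap the halves). That composition can be modeled on plain lists (a sketch of the lane shuffling only):

```python
def rev64(v):
    """Model of rev64 on a .8h vector: reverse the four 16-bit lanes
    within each 64-bit half, independently."""
    return v[:4][::-1] + v[4:][::-1]

def ext8(v):
    """Model of ext with an 8-byte offset on a 16-byte vector: rotate
    so the two 64-bit halves swap places."""
    return v[4:] + v[:4]

# rev64 followed by ext #8 composes the missing rev128:
v = [0, 1, 2, 3, 4, 5, 6, 7]
assert ext8(rev64(v)) == v[::-1]
```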
[FFmpeg-devel] [PATCH 31/34] arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit de06bdfe6c8abd8266d5c6f5c68e4df0060b61fc. --- libavcodec/arm/vp9itxfm_neon.S | 124 - 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 9385b01..05e31e6 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -22,7 +22,7 @@ #include "neon.S" const itxfm4_coeffs, align=4 -.short 11585, 6270, 15137, 0 +.short 11585, 0, 6270, 15137 iadst4_coeffs: .short 5283, 15212, 9929, 13377 endconst @@ -30,8 +30,8 @@ endconst const iadst8_coeffs, align=4 .short 16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679 idct_coeffs: -.short 11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606 -.short 16305, 12665, 10394, 7723, 14449, 15679, 4756, 0 +.short 11585, 0, 6270, 15137, 3196, 16069, 13623, 9102 +.short 1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756 .short 804, 16364, 12140, 11003, 7005, 14811, 15426, 5520 .short 3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404 endconst @@ -224,14 +224,14 @@ endconst .endm .macro idct4 c0, c1, c2, c3 -vmull.s16 q13, \c1, d0[2] -vmull.s16 q11, \c1, d0[1] +vmull.s16 q13, \c1, d0[3] +vmull.s16 q11, \c1, d0[2] vadd.i16d16, \c0, \c2 vsub.i16d17, \c0, \c2 -vmlal.s16 q13, \c3, d0[1] +vmlal.s16 q13, \c3, d0[2] vmull.s16 q9, d16, d0[0] vmull.s16 q10, d17, d0[0] -vmlsl.s16 q11, \c3, d0[2] +vmlsl.s16 q11, \c3, d0[3] vrshrn.s32 d26, q13, #14 vrshrn.s32 d18, q9, #14 vrshrn.s32 d20, q10, #14 @@ -350,9 +350,9 @@ itxfm_func4x4 iwht, iwht .macro idct8 dmbutterfly0d16, d17, d24, d25, q8, q12, q2, q4, d4, d5, d8, d9, q3, q2, q5, q4 @ q8 = t0a, q12 = t1a -dmbutterfly d20, d21, d28, d29, d0[1], d0[2], q2, q3, q4, q5 
@ q10 = t2a, q14 = t3a -dmbutterfly d18, d19, d30, d31, d0[3], d1[0], q2, q3, q4, q5 @ q9 = t4a, q15 = t7a -dmbutterfly d26, d27, d22, d23, d1[1], d1[2], q2, q3, q4, q5 @ q13 = t5a, q11 = t6a +dmbutterfly d20, d21, d28, d29, d0[2], d0[3], q2, q3, q4, q5 @ q10 = t2a, q14 = t3a +dmbutterfly d18, d19, d30, d31, d1[0], d1[1], q2, q3, q4, q5 @ q9 = t4a, q15 = t7a +dmbutterfly d26, d27, d22, d23, d1[2], d1[3], q2, q3, q4, q5 @ q13 = t5a, q11 = t6a butterfly q2, q14, q8, q14 @ q2 = t0, q14 = t3 butterfly q3, q10, q12, q10 @ q3 = t1, q10 = t2 @@ -386,8 +386,8 @@ itxfm_func4x4 iwht, iwht vneg.s16q15, q15 @ q15 = out[7] butterfly q8, q9, q11, q9 @ q8 = out[0], q9 = t2 -dmbutterfly_l q10, q11, q5, q7, d4, d5, d6, d7, d0[1], d0[2] @ q10,q11 = t5a, q5,q7 = t4a -dmbutterfly_l q2, q3, q13, q14, d12, d13, d8, d9, d0[2], d0[1] @ q2,q3 = t6a, q13,q14 = t7a +dmbutterfly_l q10, q11, q5, q7, d4, d5, d6, d7, d0[2], d0[3] @ q10,q11 = t5a, q5,q7 = t4a +dmbutterfly_l q2, q3, q13, q14, d12, d13, d8, d9, d0[3], d0[2] @ q2,q3 = t6a, q13,q14 = t7a dbutterfly_nd28, d29, d8, d9, q10, q11, q13, q14, q4, q6, q10, q11 @ q14 = out[6], q4 = t7 @@ -594,13 +594,13 @@ endfunc function idct16 mbutterfly0 d16, d24, d16, d24, d4, d6, q2, q3 @ d16 = t0a, d24 = t1a -mbutterfly d20, d28, d0[1], d0[2], q2, q3 @ d20 = t2a, d28 = t3a -mbutterfly d18, d30, d0[3], d1[0], q2, q3 @ d18 = t4a, d30 = t7a -mbutterfly d26, d22, d1[1], d1[2], q2, q3 @ d26 = t5a, d22 = t6a -mbutterfly d17, d31, d1[3], d2[0], q2, q3 @ d17 = t8a, d31 = t15a -mbutterfly d25, d23, d2[1], d2[2], q2, q3 @ d25 = t9a, d23 = t14a -mbutterfly d21, d27, d2[3], d3[0], q2, q3 @ d21 = t10a, d27 = t13a -mbutterfly d29, d19, d3[1], d3[2], q2, q3 @ d29 = t11a, d19 = t12a +mbutterfly d20, d28, d0[2], d0[3], q2, q3 @ d20 = t2a, d28 = t3a +mbutterfly d18, d30, d1[0], d1[1], q2, q3 @ d18 = t4a, d30 = t7a +mbutterfly d26, d22, d1[2], d1[3], q2, q3 @ d26 = t5a, d22 = t6a +mbutterfly d17, d31, d2[0], d2[1], q2, q3 @ d17 = t8a, d31 = t15a +mbutterfly d25, d23, 
d2[2], d2[3], q2, q3 @ d25 = t9a, d23 = t14a +mbutterfly d21, d27, d3[0], d3[1],
[FFmpeg-devel] [PATCH 32/34] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit 09eb88a12e008d10a3f7a6be75d18ad98b368e68. --- libavcodec/aarch64/vp9itxfm_neon.S | 124 ++--- 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index dd9fde1..31c6e3c 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -22,7 +22,7 @@ #include "neon.S" const itxfm4_coeffs, align=4 -.short 11585, 6270, 15137, 0 +.short 11585, 0, 6270, 15137 iadst4_coeffs: .short 5283, 15212, 9929, 13377 endconst @@ -30,8 +30,8 @@ endconst const iadst8_coeffs, align=4 .short 16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679 idct_coeffs: -.short 11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606 -.short 16305, 12665, 10394, 7723, 14449, 15679, 4756, 0 +.short 11585, 0, 6270, 15137, 3196, 16069, 13623, 9102 +.short 1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756 .short 804, 16364, 12140, 11003, 7005, 14811, 15426, 5520 .short 3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404 endconst @@ -192,14 +192,14 @@ endconst .endm .macro idct4 c0, c1, c2, c3 -smull v22.4s,\c1\().4h, v0.h[2] -smull v20.4s,\c1\().4h, v0.h[1] +smull v22.4s,\c1\().4h, v0.h[3] +smull v20.4s,\c1\().4h, v0.h[2] add v16.4h,\c0\().4h, \c2\().4h sub v17.4h,\c0\().4h, \c2\().4h -smlal v22.4s,\c3\().4h, v0.h[1] +smlal v22.4s,\c3\().4h, v0.h[2] smull v18.4s,v16.4h,v0.h[0] smull v19.4s,v17.4h,v0.h[0] -smlsl v20.4s,\c3\().4h, v0.h[2] +smlsl v20.4s,\c3\().4h, v0.h[3] rshrn v22.4h,v22.4s,#14 rshrn v18.4h,v18.4s,#14 rshrn v19.4h,v19.4s,#14 @@ -326,9 +326,9 @@ itxfm_func4x4 iwht, iwht .macro idct8 dmbutterfly0v16, v20, v16, v20, v2, v3, v4, v5, v6, v7 // v16 = t0a, 
v20 = t1a -dmbutterfly v18, v22, v0.h[1], v0.h[2], v2, v3, v4, v5 // v18 = t2a, v22 = t3a -dmbutterfly v17, v23, v0.h[3], v0.h[4], v2, v3, v4, v5 // v17 = t4a, v23 = t7a -dmbutterfly v21, v19, v0.h[5], v0.h[6], v2, v3, v4, v5 // v21 = t5a, v19 = t6a +dmbutterfly v18, v22, v0.h[2], v0.h[3], v2, v3, v4, v5 // v18 = t2a, v22 = t3a +dmbutterfly v17, v23, v0.h[4], v0.h[5], v2, v3, v4, v5 // v17 = t4a, v23 = t7a +dmbutterfly v21, v19, v0.h[6], v0.h[7], v2, v3, v4, v5 // v21 = t5a, v19 = t6a butterfly_8hv24, v25, v16, v22 // v24 = t0, v25 = t3 butterfly_8hv28, v29, v17, v21 // v28 = t4, v29 = t5a @@ -361,8 +361,8 @@ itxfm_func4x4 iwht, iwht dmbutterfly0v19, v20, v6, v7, v24, v26, v27, v28, v29, v30 // v19 = -out[3], v20 = out[4] neg v19.8h, v19.8h // v19 = out[3] -dmbutterfly_l v26, v27, v28, v29, v5, v3, v0.h[1], v0.h[2] // v26,v27 = t5a, v28,v29 = t4a -dmbutterfly_l v2, v3, v4, v5, v31, v25, v0.h[2], v0.h[1] // v2,v3 = t6a, v4,v5 = t7a +dmbutterfly_l v26, v27, v28, v29, v5, v3, v0.h[2], v0.h[3] // v26,v27 = t5a, v28,v29 = t4a +dmbutterfly_l v2, v3, v4, v5, v31, v25, v0.h[3], v0.h[2] // v2,v3 = t6a, v4,v5 = t7a dbutterfly_nv17, v30, v28, v29, v2, v3, v6, v7, v24, v25 // v17 = -out[1], v30 = t6 dbutterfly_nv22, v31, v26, v27, v4, v5, v6, v7, v24, v25 // v22 = out[6], v31 = t7 @@ -543,13 +543,13 @@ endfunc function idct16 dmbutterfly0v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a, v24 = t1a -dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a, v28 = t3a -dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a, v30 = t7a -dmbutterfly v26, v22, v0.h[5], v0.h[6], v2, v3, v4, v5 // v26 = t5a, v22 = t6a -dmbutterfly v17, v31, v0.h[7], v1.h[0], v2, v3, v4, v5 // v17 = t8a, v31 = t15a -dmbutterfly v25, v23, v1.h[1], v1.h[2], v2, v3, v4, v5 // v25 = t9a, v23 = t14a -dmbutterfly v21, v27, v1.h[3], v1.h[4], v2, v3, v4, v5 // v21 = t10a, v27 = t13a -dmbutterfly v29, v19, v1.h[5], v1.h[6], v2, v3, v4, v5 // v29 = t11a, v19 = t12a +dmbutterfly v20, 
v28, v0.h[2], v0.h[3], v2, v3, v4, v5 // v20 = t2a, v28 = t3a +dmbutterfly
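The effect of this reordering can be checked mechanically: a d register holds four consecutive 16-bit lanes, and the commit message's claim is that the old layout splits some coefficient pairs across register boundaries while the new one splits none. A small model (pair_positions/split_pairs are illustrative helper names, not code from the source):

```python
old = [11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606,
       16305, 12665, 10394, 7723, 14449, 15679, 4756, 0]
new = [11585, 0, 6270, 15137, 3196, 16069, 13623, 9102,
       1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756]

def pair_positions(layout):
    """Index pairs of the pairwise-used coefficients (everything except
    the standalone 11585 and the unused zero), in order of appearance."""
    used = [i for i, c in enumerate(layout) if c not in (11585, 0)]
    return list(zip(used[0::2], used[1::2]))

def split_pairs(layout):
    # A d register holds 4 consecutive 16-bit lanes, so a pair is
    # "split" when its two indices fall in different groups of four.
    return [p for p in pair_positions(layout) if p[0] // 4 != p[1] // 4]

# The old layout splits three pairs across d-register boundaries;
# moving the unused zero into slot 1 leaves every pair within one group.
assert split_pairs(old) == [(3, 4), (7, 8), (11, 12)]
assert split_pairs(new) == []
```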
[FFmpeg-devel] [PATCH 15/34] aarch64: vp9mc: Simplify the extmla macro parameters
Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. This is cherrypicked from libav commit 5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8. --- libavcodec/aarch64/vp9mc_neon.S | 50 - 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S index 80d1d23..9403911 100644 --- a/libavcodec/aarch64/vp9mc_neon.S +++ b/libavcodec/aarch64/vp9mc_neon.S @@ -193,41 +193,41 @@ endfunc // for size >= 16), and multiply-accumulate into dst1 and dst3 (or // dst1-dst2 and dst3-dst4 for size >= 16) .macro extmla dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size -ext v20.16b, \src1, \src2, #(2*\offset) -ext v22.16b, \src4, \src5, #(2*\offset) +ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset) +ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset) .if \size >= 16 -mla \dst1, v20.8h, v0.h[\offset] -ext v21.16b, \src2, \src3, #(2*\offset) -mla \dst3, v22.8h, v0.h[\offset] -ext v23.16b, \src5, \src6, #(2*\offset) -mla \dst2, v21.8h, v0.h[\offset] -mla \dst4, v23.8h, v0.h[\offset] +mla \dst1\().8h, v20.8h, v0.h[\offset] +ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset) +mla \dst3\().8h, v22.8h, v0.h[\offset] +ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset) +mla \dst2\().8h, v21.8h, v0.h[\offset] +mla \dst4\().8h, v23.8h, v0.h[\offset] .else -mla \dst1, v20.8h, v0.h[\offset] -mla \dst3, v22.8h, v0.h[\offset] +mla \dst1\().8h, v20.8h, v0.h[\offset] +mla \dst3\().8h, v22.8h, v0.h[\offset] .endif .endm // The same as above, but don't accumulate straight into the // destination, but use a temp register and accumulate with saturation. 
.macro extmulqadd dst1, dst2, dst3, dst4, src1, src2, src3, src4, src5, src6, offset, size -ext v20.16b, \src1, \src2, #(2*\offset) -ext v22.16b, \src4, \src5, #(2*\offset) +ext v20.16b, \src1\().16b, \src2\().16b, #(2*\offset) +ext v22.16b, \src4\().16b, \src5\().16b, #(2*\offset) .if \size >= 16 mul v20.8h, v20.8h, v0.h[\offset] -ext v21.16b, \src2, \src3, #(2*\offset) +ext v21.16b, \src2\().16b, \src3\().16b, #(2*\offset) mul v22.8h, v22.8h, v0.h[\offset] -ext v23.16b, \src5, \src6, #(2*\offset) +ext v23.16b, \src5\().16b, \src6\().16b, #(2*\offset) mul v21.8h, v21.8h, v0.h[\offset] mul v23.8h, v23.8h, v0.h[\offset] .else mul v20.8h, v20.8h, v0.h[\offset] mul v22.8h, v22.8h, v0.h[\offset] .endif -sqadd \dst1, \dst1, v20.8h -sqadd \dst3, \dst3, v22.8h +sqadd \dst1\().8h, \dst1\().8h, v20.8h +sqadd \dst3\().8h, \dst3\().8h, v22.8h .if \size >= 16 -sqadd \dst2, \dst2, v21.8h -sqadd \dst4, \dst4, v23.8h +sqadd \dst2\().8h, \dst2\().8h, v21.8h +sqadd \dst4\().8h, \dst4\().8h, v23.8h .endif .endm @@ -291,13 +291,13 @@ function \type\()_8tap_\size\()h_\idx1\idx2 mul v2.8h, v5.8h, v0.h[0] mul v25.8h, v17.8h, v0.h[0] .endif -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, 1, \size -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, 2, \size -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, \idx1, \size -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, 5, \size -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, 6, \size -extmla v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, 7, \size -extmulqadd v1.8h, v2.8h, v24.8h, v25.8h, v4.16b, v5.16b, v6.16b, v16.16b, v17.16b, v18.16b, \idx2, \size +extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 1, \size +extmla v1, v2, v24, v25, v4, v5, v6, v16, v17, v18, 2, \size +extmla v1, v2, v24, v25, v4, v5, v6, 
v16, v17, v18, \idx1, \size +
[FFmpeg-devel] [PATCH 23/34] aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit b0806088d3b27044145b20421da8d39089ae0c6a. --- libavcodec/aarch64/vp9lpf_neon.S | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index 7fe2c88..cd3e26c 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -338,20 +338,28 @@ uxtl_sz v0.8h, v1.8h, v22, \sz// p1 uxtl_sz v2.8h, v3.8h, v25, \sz// q1 +.if \wd >= 8 +mov x5, v6.d[0] +.ifc \sz, .16b +mov x6, v6.d[1] +.endif +.endif saddw_szv0.8h, v1.8h, v0.8h, v1.8h, \tmp3, \sz // p1 + f ssubw_szv2.8h, v3.8h, v2.8h, v3.8h, \tmp3, \sz // q1 - f sqxtun_sz v0, v0.8h, v1.8h, \sz // out p1 sqxtun_sz v2, v2.8h, v3.8h, \sz // out q1 +.if \wd >= 8 +.ifc \sz, .16b +addsx5, x5, x6 +.endif +.endif bit v22\sz, v0\sz, v5\sz // if (!hev && fm && !flat8in) bit v25\sz, v2\sz, v5\sz // If no pixels need flat8in, jump to flat8out // (or to a writeout of the inner 4 pixels, for wd=8) .if \wd >= 8 -mov x5, v6.d[0] .ifc \sz, .16b -mov x6, v6.d[1] -addsx5, x5, x6 b.eq6f .else cbz x5, 6f -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 29/34] arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16. While keeping these coefficients in registers, we still can skip pushing q7. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_sub32_add_neon: 18553.8 17182.7 14303.3 12089.7 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 18470.3 16717.7 14173.6 11860.8 This is cherrypicked from libav commit 402546a17233a8815307df9e14ff88cd70424537. --- libavcodec/arm/vp9itxfm_neon.S | 246 - 1 file changed, 120 insertions(+), 126 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index dee2f05..9385b01 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -1185,58 +1185,51 @@ function idct32x32_dc_add_neon endfunc .macro idct32_end -butterfly d16, d5, d4, d5 @ d16 = t16a, d5 = t19a +butterfly d16, d9, d8, d9 @ d16 = t16a, d9 = t19a butterfly d17, d20, d23, d20 @ d17 = t17, d20 = t18 -butterfly d18, d6, d7, d6 @ d18 = t23a, d6 = t20a +butterfly d18, d10, d11, d10 @ d18 = t23a, d10 = t20a butterfly d19, d21, d22, d21 @ d19 = t22, d21 = t21 -butterfly d4, d28, d28, d30 @ d4 = t24a, d28 = t27a +butterfly d8, d28, d28, d30 @ d8 = t24a, d28 = t27a butterfly d23, d26, d25, d26 @ d23 = t25, d26 = t26 -butterfly d7, d29, d29, d31 @ d7 = t31a, d29 = t28a +butterfly d11, d29, d29, d31 @ d11 = t31a, d29 = t28a butterfly d22, d27, d24, d27 @ d22 = t30, d27 = t29 mbutterfly d27, d20, d0[1], d0[2], q12, q15@ d27 = t18a, d20 = t29a -mbutterfly d29, d5, d0[1], 
d0[2], q12, q15@ d29 = t19, d5 = t28 -mbutterfly d28, d6, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27, d6 = t20 +mbutterfly d29, d9, d0[1], d0[2], q12, q15@ d29 = t19, d9 = t28 +mbutterfly d28, d10, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27, d10 = t20 mbutterfly d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a -butterfly d31, d24, d7, d4 @ d31 = t31, d24 = t24 +butterfly d31, d24, d11, d8 @ d31 = t31, d24 = t24 butterfly d30, d25, d22, d23 @ d30 = t30a, d25 = t25a butterfly_r d23, d16, d16, d18 @ d23 = t23, d16 = t16 butterfly_r d22, d17, d17, d19 @ d22 = t22a, d17 = t17a butterfly d18, d21, d27, d21 @ d18 = t18, d21 = t21 -butterfly_r d27, d28, d5, d28 @ d27 = t27a, d28 = t28a -butterfly d4, d26, d20, d26 @ d4 = t29, d26 = t26 -butterfly d19, d20, d29, d6 @ d19 = t19a, d20 = t20 -vmovd29, d4@ d29 = t29 - -mbutterfly0 d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27, d20 = t20 -mbutterfly0 d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a -mbutterfly0 d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25, d22 = t22 -mbutterfly0 d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a +butterfly_r d27, d28, d9, d28 @ d27 = t27a, d28 = t28a +butterfly d8, d26, d20, d26 @ d8 = t29, d26 = t26 +butterfly d19, d20, d29, d10 @ d19 = t19a, d20 = t20 +vmovd29, d8@ d29 = t29 + +mbutterfly0 d27, d20, d27, d20, d8, d10, q4, q5 @ d27 = t27, d20 = t20 +mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a +mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25, d22 = t22 +mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a bx lr .endm function idct32_odd -movrel r12, idct_coeffs -add r12, r12, #32 -vld1.16 {q0-q1}, [r12,:128] - -mbutterfly d16, d31, d0[0], d0[1], q2, q3 @ d16 = t16a, d31 = t31a -mbutterfly d24, d23, d0[2], d0[3], q2, q3 @ d24 = t17a, d23 = t30a -mbutterfly d20, d27, d1[0], d1[1], q2, q3 @ d20 = t18a, d27 = t29a -mbutterfly d28, d19, d1[2], d1[3], q2, q3 @ d28 = t19a, d19 = t28a -mbutterfly d18, d29, 
d2[0], d2[1], q2, q3 @ d18 = t20a, d29 = t27a -mbutterfly d26, d21,
[FFmpeg-devel] [PATCH 26/34] arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just saturate the addition to 255.

Before:                       Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:       143.0   127.7   114.8    88.0     87.7
vp9_loop_filter_v_8_8_neon:       241.0   197.2   173.7   140.0    136.7
vp9_loop_filter_v_16_8_neon:      497.0   419.5   379.7   293.0    275.7
vp9_loop_filter_v_16_16_neon:     965.2   818.7   731.4   579.0    452.0
After:
vp9_loop_filter_v_4_8_neon:       136.0   125.7   112.6    84.0     83.0
vp9_loop_filter_v_8_8_neon:       234.0   195.5   171.5   136.0    133.7
vp9_loop_filter_v_16_8_neon:      490.0   417.5   377.7   289.0    271.0
vp9_loop_filter_v_16_16_neon:     951.2   814.7   732.3   571.0    446.7

This is cherrypicked from libav commit c582cb8537367721bb399a5d01b652c20142b756. --- libavcodec/aarch64/vp9lpf_neon.S | 40 +--- libavcodec/arm/vp9lpf_neon.S | 11 +-- 2 files changed, 14 insertions(+), 37 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index ebfd9be..a9eea7f 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -51,13 +51,6 @@ // see the arm version instead. -.macro uabdl_sz dst1, dst2, in1, in2, sz -uabdl \dst1, \in1\().8b, \in2\().8b -.ifc \sz, .16b -uabdl2 \dst2, \in1\().16b, \in2\().16b -.endif -.endm - .macro add_sz dst1, dst2, in1, in2, in3, in4, sz add \dst1, \in1, \in3 .ifc \sz, .16b @@ -86,20 +79,6 @@ .endif .endm -.macro cmhs_sz dst1, dst2, in1, in2, in3, in4, sz -cmhs \dst1, \in1, \in3 -.ifc \sz, .16b -cmhs \dst2, \in2, \in4 -.endif -.endm - -.macro xtn_sz dst, in1, in2, sz -xtn \dst\().8b, \in1 -.ifc \sz, .16b -xtn2 \dst\().16b, \in2 -.endif -.endm - .macro usubl_sz dst1, dst2, in1, in2, sz usubl \dst1, \in1\().8b, \in2\().8b .ifc \sz, .16b @@ -179,20 +158,20 @@ // tmpq2 == tmp3 + tmp4, etc.
.macro loop_filter wd, sz, mix, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8 .if \mix == 0 -dup v0.8h, w2// E -dup v1.8h, w2// E +dup v0\sz, w2// E dup v2\sz, w3// I dup v3\sz, w4// H .else -dup v0.8h, w2// E +dup v0.8b, w2// E dup v2.8b, w3// I dup v3.8b, w4// H +lsr w5, w2, #8 lsr w6, w3, #8 lsr w7, w4, #8 -ushrv1.8h, v0.8h, #8 // E +dup v1.8b, w5// E dup v4.8b, w6// I -bic v0.8h, #255, lsl 8 // E dup v5.8b, w7// H +trn1v0.2d, v0.2d, v1.2d trn1v2.2d, v2.2d, v4.2d trn1v3.2d, v3.2d, v5.2d .endif @@ -206,16 +185,15 @@ umaxv4\sz, v4\sz, v5\sz umaxv5\sz, v6\sz, v7\sz umax\tmp1\sz, \tmp1\sz, \tmp2\sz -uabdl_szv6.8h, v7.8h, v23, v24, \sz // abs(p0 - q0) +uabdv6\sz, v23\sz, v24\sz// abs(p0 - q0) umaxv4\sz, v4\sz, v5\sz -add_sz v6.8h, v7.8h, v6.8h, v7.8h, v6.8h, v7.8h, \sz // abs(p0 - q0) * 2 +uqadd v6\sz, v6\sz, v6\sz // abs(p0 - q0) * 2 uabdv5\sz, v22\sz, v25\sz// abs(p1 - q1) umaxv4\sz, v4\sz, \tmp1\sz // max(abs(p3 - p2), ..., abs(q2 - q3)) ushrv5\sz, v5\sz, #1 cmhsv4\sz, v2\sz, v4\sz // max(abs()) <= I -uaddw_szv6.8h, v7.8h, v6.8h, v7.8h, v5, \sz // abs(p0 - q0) * 2 + abs(p1 - q1) >> 1 -cmhs_sz v6.8h, v7.8h, v0.8h, v1.8h, v6.8h, v7.8h, \sz -xtn_sz v5, v6.8h, v7.8h, \sz +uqadd v6\sz, v6\sz, v5\sz // abs(p0 - q0) * 2 + abs(p1 - q1) >> 1 +cmhsv5\sz, v0\sz, v6\sz and v4\sz, v4\sz, v5\sz // fm // If no pixels need filtering, just exit as soon as possible diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index b90c536..2d91092 100644 --- a/libavcodec/arm/vp9lpf_neon.S +++ b/libavcodec/arm/vp9lpf_neon.S @@ -51,7 +51,7 @@ @ and d28-d31 as temp registers, or d8-d15. @ tmp1,tmp2 = tmpq1, tmp3,tmp4 = tmpq2, tmp5,tmp6 = tmpq3, tmp7,tmp8 = tmpq4 .macro loop_filter wd, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8, tmpq1, tmpq2, tmpq3, tmpq4 -vdup.u16q0, r2 @ E +vdup.u8 d0, r2 @ E vdup.u8
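The correctness argument behind this patch — E never exceeds 193, so letting the left-hand side saturate at 255 can never flip the comparison — can be checked in scalar C. The sketch below (helper names are my own, not FFmpeg API) mirrors the NEON uqadd-based 8-bit path against the exact widened computation the old code performed:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Saturating 8-bit add, the scalar equivalent of NEON uqadd.u8. */
static uint8_t uqadd8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Exact, widened (16-bit) version of the edge threshold test. */
static int fm_wide(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    int lhs = 2 * abs(p0 - q0) + (abs(p1 - q1) >> 1);
    return lhs <= E;
}

/* 8-bit saturating version. Because E <= 193 < 255, a saturated
 * left-hand side (255) always compares as greater than E, so the
 * result matches the widened computation. */
static int fm_narrow(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1, uint8_t E)
{
    uint8_t d0 = p0 > q0 ? p0 - q0 : q0 - p0;   /* abs(p0 - q0) */
    uint8_t d1 = p1 > q1 ? p1 - q1 : q1 - p1;   /* abs(p1 - q1) */
    uint8_t lhs = uqadd8(uqadd8(d0, d0), d1 >> 1);
    return lhs <= E;
}
```

The same argument holds for any E up to 254; 193 leaves plenty of margin.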
[FFmpeg-devel] [PATCH 25/34] aarch64: Add parentheses around the offset parameter in movrel
This fixes building with clang for linux with PIC enabled. This is cherrypicked from libav commit 8847eeaa14189885038140fb2b8a7adc7100. --- libavutil/aarch64/asm.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavutil/aarch64/asm.S b/libavutil/aarch64/asm.S index 523b8c5..4289729 100644 --- a/libavutil/aarch64/asm.S +++ b/libavutil/aarch64/asm.S @@ -83,8 +83,8 @@ ELF .size \name, . - \name add \rd, \rd, \val+(\offset)@PAGEOFF .endif #elif CONFIG_PIC -adrp\rd, \val+\offset -add \rd, \rd, :lo12:\val+\offset +adrp\rd, \val+(\offset) +add \rd, \rd, :lo12:\val+(\offset) #else ldr \rd, =\val+\offset #endif -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
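The parenthesization issue is easiest to see through the analogous C preprocessor pitfall: in text substitution, operator precedence applies to the expanded text, not to the argument as a unit, so an offset expression can regroup with the surrounding `\val+\offset`. A minimal illustration (macro names are invented for this example, not from the FFmpeg source):

```c
#include <assert.h>

/* Without parentheses, REL_BAD(100, 1 << 2) expands to
 * ((100) + 1 << 2), which parses as ((100) + 1) << 2 = 404,
 * because + binds tighter than <<. The parenthesized form keeps
 * the argument grouped, as the movrel fix does for \offset. */
#define REL_BAD(base, off)  ((base) + off)
#define REL_GOOD(base, off) ((base) + (off))
```

Wrapping `\offset` in the asm macro serves the same purpose: whatever expression the caller passes stays a single operand inside the relocation expression.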
[FFmpeg-devel] [PATCH 24/34] aarch64: vp9lpf: Fix broken indentation/vertical alignment
This is cherrypicked from libav commit 07b5136c481d394992c7e951967df0cfbb346c0b. --- libavcodec/aarch64/vp9lpf_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index cd3e26c..ebfd9be 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -417,7 +417,7 @@ mov x5, v2.d[0] .ifc \sz, .16b mov x6, v2.d[1] -adds x5, x5, x6 +addsx5, x5, x6 b.ne1f .else cbnzx5, 1f @@ -430,7 +430,7 @@ mov x5, v7.d[0] .ifc \sz, .16b mov x6, v7.d[1] -adds x5, x5, x6 +addsx5, x5, x6 b.ne1f .else cbnzx5, 1f -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 30/34] aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 This is cherrypicked from libav commit 65aa002d54433154a6924dc13e498bec98451ad0. --- libavcodec/aarch64/vp9itxfm_neon.S | 110 +++-- 1 file changed, 43 insertions(+), 67 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index be65eb7..dd9fde1 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -1123,18 +1123,14 @@ endfunc .endm function idct32_odd -ld1 {v0.8h,v1.8h}, [x11] - -dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a -dmbutterfly v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a -dmbutterfly v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a -dmbutterfly v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a -dmbutterfly v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a -dmbutterfly v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a -dmbutterfly v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a -dmbutterfly v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a - -ld1 {v0.8h}, [x10] +dmbutterfly v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a +dmbutterfly v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a +dmbutterfly v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a +dmbutterfly v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a +dmbutterfly v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a 
+dmbutterfly v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a +dmbutterfly v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a +dmbutterfly v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a butterfly_8hv4, v24, v16, v24 // v4 = t16, v24 = t17 butterfly_8hv5, v20, v28, v20 // v5 = t19, v20 = t18 @@ -1153,18 +1149,14 @@ function idct32_odd endfunc function idct32_odd_half -ld1 {v0.8h,v1.8h}, [x11] - -dmbutterfly_h1 v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a -dmbutterfly_h2 v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a -dmbutterfly_h1 v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a -dmbutterfly_h2 v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a -dmbutterfly_h1 v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a -dmbutterfly_h2 v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a -dmbutterfly_h1 v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a -dmbutterfly_h2 v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a - -ld1 {v0.8h}, [x10] +dmbutterfly_h1 v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a +dmbutterfly_h2 v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a +dmbutterfly_h1 v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a +dmbutterfly_h2 v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a +dmbutterfly_h1 v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a +dmbutterfly_h2 v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a +dmbutterfly_h1 v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a +dmbutterfly_h2 v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a butterfly_8hv4, v24, v16, v24 // v4 = t16, v24 = t17 butterfly_8hv5, v20, v28, v20 // v5 = t19, v20 = t18 @@ -1183,18 +1175,14 @@ function idct32_odd_half 
endfunc function idct32_odd_quarter -ld1 {v0.8h,v1.8h}, [x11] - -dsmull_hv4, v5, v16, v0.h[0] -dsmull_hv28, v29, v19, v0.h[7] -dsmull_hv30, v31, v16, v0.h[1] -dsmull_hv22, v23, v17, v1.h[6] -dsmull_hv7, v6, v17, v1.h[7] -dsmull_hv26, v27, v19, v0.h[6] -dsmull_hv20, v21, v18, v1.h[0] -dsmull_h
[FFmpeg-devel] [PATCH 27/34] aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 This is cherrypicked from libav commit 3bf9c48320f25f3d5557485b0202f22ae60748b0. --- libavcodec/aarch64/vp9lpf_neon.S | 21 + 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index a9eea7f..0878763 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -162,18 +162,15 @@ dup v2\sz, w3// I dup v3\sz, w4// H .else -dup v0.8b, w2// E -dup v2.8b, w3// I -dup v3.8b, w4// H -lsr w5, w2, #8 -lsr w6, w3, #8 -lsr w7, w4, #8 -dup v1.8b, w5// E -dup v4.8b, w6// I -dup v5.8b, w7// H -trn1v0.2d, v0.2d, v1.2d -trn1v2.2d, v2.2d, v4.2d -trn1v3.2d, v3.2d, v5.2d +dup v0.8h, w2// E +dup v2.8h, w3// I +dup v3.8h, w4// H +rev16 v1.16b, v0.16b// E +rev16 v4.16b, v2.16b// I +rev16 v5.16b, v3.16b// H +uzp1v0.16b, v0.16b, v1.16b +uzp1v2.16b, v2.16b, v4.16b +uzp1v3.16b, v3.16b, v5.16b .endif uabdv4\sz, v20\sz, v21\sz// abs(p3 - p2) -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
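For the mix2 case the two 8-bit thresholds arrive packed into one GPR as (E2 << 8) | E1; the new sequence broadcasts the 16-bit value, byte-swaps each pair with rev16, and de-interleaves with uzp1 so the low half of the vector holds E1 and the high half E2. A scalar model of the three instructions (assuming a little-endian host, matching NEON lane order):

```c
#include <stdint.h>
#include <string.h>

/* Model of: dup v0.8h, w2 ; rev16 v1.16b, v0.16b ;
 *           uzp1 v0.16b, v0.16b, v1.16b */
static void splat_two_thresholds(uint16_t packed, uint8_t out[16])
{
    uint8_t v0[16], v1[16];
    /* dup v0.8h: broadcast the 16-bit value into 8 lanes */
    for (int i = 0; i < 8; i++)
        memcpy(&v0[2 * i], &packed, 2);
    /* rev16: swap the two bytes within each 16-bit lane */
    for (int i = 0; i < 8; i++) {
        v1[2 * i]     = v0[2 * i + 1];
        v1[2 * i + 1] = v0[2 * i];
    }
    /* uzp1: even-indexed bytes of v0, then even-indexed bytes of v1 */
    for (int i = 0; i < 8; i++) {
        out[i]     = v0[2 * i];   /* low byte of packed, 8 times  */
        out[8 + i] = v1[2 * i];   /* high byte of packed, 8 times */
    }
}
```

The payoff versus dup+lsr+dup+trn1 is that no extra GPR shifts are needed and each of the three thresholds costs the same three vector ops.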
[FFmpeg-devel] [PATCH 28/34] arm: vp9lpf: Implement the mix2_44 function with one single filter pass
For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though. Before: Cortex A7 A8 A9 A53 vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2 After: vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0 This is cherrypicked from libav commit 575e31e931e4178e9f1e24407503c9b4ec0ef9ba. --- libavcodec/arm/vp9dsp_init_arm.c | 7 +- libavcodec/arm/vp9lpf_neon.S | 191 +++ 2 files changed, 195 insertions(+), 3 deletions(-) diff --git a/libavcodec/arm/vp9dsp_init_arm.c b/libavcodec/arm/vp9dsp_init_arm.c index f7b539e..4c57fd6 100644 --- a/libavcodec/arm/vp9dsp_init_arm.c +++ b/libavcodec/arm/vp9dsp_init_arm.c @@ -195,6 +195,8 @@ define_loop_filters(8, 8); define_loop_filters(16, 8); define_loop_filters(16, 16); +define_loop_filters(44, 16); + #define lf_mix_fn(dir, wd1, wd2, stridea) \ static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst, \ ptrdiff_t stride, \ @@ -208,7 +210,6 @@ static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst, lf_mix_fn(h, wd1, wd2, stride) \ lf_mix_fn(v, wd1, wd2, sizeof(uint8_t)) -lf_mix_fns(4, 4) lf_mix_fns(4, 8) lf_mix_fns(8, 4) lf_mix_fns(8, 8) @@ -228,8 +229,8 @@ static av_cold void vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp) dsp->loop_filter_16[0] = ff_vp9_loop_filter_h_16_16_neon; dsp->loop_filter_16[1] = ff_vp9_loop_filter_v_16_16_neon; -dsp->loop_filter_mix2[0][0][0] = loop_filter_h_44_16_neon; -dsp->loop_filter_mix2[0][0][1] = loop_filter_v_44_16_neon; +dsp->loop_filter_mix2[0][0][0] = ff_vp9_loop_filter_h_44_16_neon; +dsp->loop_filter_mix2[0][0][1] = ff_vp9_loop_filter_v_44_16_neon; dsp->loop_filter_mix2[0][1][0] = loop_filter_h_48_16_neon; dsp->loop_filter_mix2[0][1][1] = loop_filter_v_48_16_neon; dsp->loop_filter_mix2[1][0][0] = loop_filter_h_84_16_neon; diff --git 
a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index 2d91092..8d44d58 100644 --- a/libavcodec/arm/vp9lpf_neon.S +++ b/libavcodec/arm/vp9lpf_neon.S @@ -44,6 +44,109 @@ vtrn.8 \r2, \r3 .endm +@ The input to and output from this macro is in the registers q8-q15, +@ and q0-q7 are used as scratch registers. +@ p3 = q8, p0 = q11, q0 = q12, q3 = q15 +.macro loop_filter_q +vdup.u8 d0, r2 @ E +lsr r2, r2, #8 +vdup.u8 d2, r3 @ I +lsr r3, r3, #8 +vdup.u8 d1, r2 @ E +vdup.u8 d3, r3 @ I + +vabd.u8 q2, q8, q9 @ abs(p3 - p2) +vabd.u8 q3, q9, q10@ abs(p2 - p1) +vabd.u8 q4, q10, q11@ abs(p1 - p0) +vabd.u8 q5, q12, q13@ abs(q0 - q1) +vabd.u8 q6, q13, q14@ abs(q1 - q2) +vabd.u8 q7, q14, q15@ abs(q2 - q3) +vmax.u8 q2, q2, q3 +vmax.u8 q3, q4, q5 +vmax.u8 q4, q6, q7 +vabd.u8 q5, q11, q12@ abs(p0 - q0) +vmax.u8 q2, q2, q3 +vqadd.u8q5, q5, q5 @ abs(p0 - q0) * 2 +vabd.u8 q7, q10, q13@ abs(p1 - q1) +vmax.u8 q2, q2, q4 @ max(abs(p3 - p2), ..., abs(q2 - q3)) +vshr.u8 q7, q7, #1 +vcle.u8 q2, q2, q1 @ max(abs()) <= I +vqadd.u8q5, q5, q7 @ abs(p0 - q0) * 2 + abs(p1 - q1) >> 1 +vcle.u8 q5, q5, q0 +vandq2, q2, q5 @ fm + +vshrn.u16 d10, q2, #4 +vmovr2, r3, d10 +orrsr2, r2, r3 +@ If no pixels need filtering, just exit as soon as possible +beq 9f + +@ Calculate the normal inner loop filter for 2 or 4 pixels +ldr r3, [sp, #64] +vabd.u8 q3, q10, q11@ abs(p1 - p0) +vabd.u8 q4, q13, q12@ abs(q1 - q0) + +vsubl.u8q5, d20, d26@ p1 - q1 +vsubl.u8q6, d21, d27@ p1 - q1 +vmax.u8 q3, q3, q4 @ max(abs(p1 - p0), abs(q1 - q0)) +vqmovn.s16 d10, q5 @ av_clip_int8p(p1 - q1) +vqmovn.s16 d11, q6 @ av_clip_int8p(p1 - q1) +vdup.u8 d8, r3 @ H +lsr r3, r3, #8 +vdup.u8 d9, r3 @ H +vsubl.u8q6, d24, d22@ q0 - p0 +vsubl.u8
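Stripped of the NEON packing, the `fm` mask that loop_filter_q computes (visible in the comments above) is, per pixel column, the following scalar logic — a reference sketch based on the diff's comments, not FFmpeg's actual C code:

```c
#include <stdint.h>
#include <stdlib.h>

/* p[0..7] = p3 p2 p1 p0 q0 q1 q2 q3 for one pixel column.
 * Returns nonzero if the column passes the filter mask:
 * all neighbor differences <= I, and the edge measure
 * 2*|p0-q0| + |p1-q1|/2 <= E. */
static int vp9_filter_mask(const uint8_t p[8], int E, int I)
{
    int m = 0;
    for (int i = 0; i < 3; i++) {
        int a = abs(p[i] - p[i + 1]);       /* p3-p2, p2-p1, p1-p0 */
        int b = abs(p[7 - i] - p[6 - i]);   /* q3-q2, q2-q1, q1-q0 */
        if (a > m) m = a;
        if (b > m) m = b;
    }
    int edge = 2 * abs(p[3] - p[4]) + (abs(p[2] - p[5]) >> 1);
    return m <= I && edge <= E;
}
```

The mix2_44 trick is that this mask (and the wd=4 filter itself) needs few enough temporaries that all 16 columns fit in the q registers at once.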
[FFmpeg-devel] [PATCH 33/34] arm: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit 08074c092d8c97d71c5986e5325e97ffc956119d. --- libavcodec/arm/vp9itxfm_neon.S | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 05e31e6..ebbbda9 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -37,8 +37,8 @@ idct_coeffs: endconst const iadst16_coeffs, align=4 -.short 16364, 804, 15893, 3981, 14811, 7005, 13160, 9760 -.short 11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207 +.short 16364, 804, 15893, 3981, 11003, 12140, 8423, 14053 +.short 14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207 endconst @ Do four 4x4 transposes, using q registers for the subtransposes that don't @@ -678,19 +678,19 @@ function iadst16 vld1.16 {q0-q1}, [r12,:128] mbutterfly_lq3, q2, d31, d16, d0[1], d0[0] @ q3 = t1, q2 = t0 -mbutterfly_lq5, q4, d23, d24, d2[1], d2[0] @ q5 = t9, q4 = t8 +mbutterfly_lq5, q4, d23, d24, d1[1], d1[0] @ q5 = t9, q4 = t8 butterfly_n d31, d24, q3, q5, q6, q5 @ d31 = t1a, d24 = t9a mbutterfly_lq7, q6, d29, d18, d0[3], d0[2] @ q7 = t3, q6 = t2 butterfly_n d16, d23, q2, q4, q3, q4 @ d16 = t0a, d23 = t8a -mbutterfly_lq3, q2, d21, d26, d2[3], d2[2] @ q3 = t11, q2 = t10 +mbutterfly_lq3, q2, d21, d26, d1[3], d1[2] @ q3 = t11, q2 = t10 butterfly_n d29, d26, q7, q3, q4, q3 @ d29 = t3a, d26 = t11a -mbutterfly_lq5, q4, d27, d20, d1[1], d1[0] @ q5 = t5, q4 = t4 +mbutterfly_lq5, q4, d27, d20, d2[1], d2[0] @ q5 = t5, q4 = t4 butterfly_n d18, d21, q6, q2, q3, q2 @ d18 = t2a, d21 = t10a mbutterfly_lq7, q6, d19, d28, d3[1], d3[0] @ q7 = t13, q6 = t12 butterfly_n d20, d28, q5, q7, q2, q7 @ d20 = t5a, d28 = t13a -mbutterfly_lq3, q2, d25, d22, d1[3], d1[2] 
@ q3 = t7, q2 = t6 +mbutterfly_lq3, q2, d25, d22, d2[3], d2[2] @ q3 = t7, q2 = t6 butterfly_n d27, d19, q4, q6, q5, q6 @ d27 = t4a, d19 = t12a mbutterfly_lq5, q4, d17, d30, d3[3], d3[2] @ q5 = t15, q4 = t14 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 34/34] aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit b8f66c0838b4c645227f23a35b4d54373da4c60a. --- libavcodec/aarch64/vp9itxfm_neon.S | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 31c6e3c..2c3c002 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -37,8 +37,8 @@ idct_coeffs: endconst const iadst16_coeffs, align=4 -.short 16364, 804, 15893, 3981, 14811, 7005, 13160, 9760 -.short 11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207 +.short 16364, 804, 15893, 3981, 11003, 12140, 8423, 14053 +.short 14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207 endconst // out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14 @@ -628,19 +628,19 @@ function iadst16 ld1 {v0.8h,v1.8h}, [x11] dmbutterfly_l v6, v7, v4, v5, v31, v16, v0.h[1], v0.h[0] // v6,v7 = t1, v4,v5 = t0 -dmbutterfly_l v10, v11, v8, v9, v23, v24, v1.h[1], v1.h[0] // v10,v11 = t9, v8,v9 = t8 +dmbutterfly_l v10, v11, v8, v9, v23, v24, v0.h[5], v0.h[4] // v10,v11 = t9, v8,v9 = t8 dbutterfly_nv31, v24, v6, v7, v10, v11, v12, v13, v10, v11 // v31 = t1a, v24 = t9a dmbutterfly_l v14, v15, v12, v13, v29, v18, v0.h[3], v0.h[2] // v14,v15 = t3, v12,v13 = t2 dbutterfly_nv16, v23, v4, v5, v8, v9, v6, v7, v8, v9 // v16 = t0a, v23 = t8a -dmbutterfly_l v6, v7, v4, v5, v21, v26, v1.h[3], v1.h[2] // v6,v7 = t11, v4,v5 = t10 +dmbutterfly_l v6, v7, v4, v5, v21, v26, v0.h[7], v0.h[6] // v6,v7 = t11, v4,v5 = t10 dbutterfly_nv29, v26, v14, v15, v6, v7, v8, v9, v6, v7 // v29 = t3a, v26 = t11a -dmbutterfly_l v10, v11, v8, v9, v27, v20, v0.h[5], v0.h[4] // v10,v11 = t5, v8,v9 = t4 +dmbutterfly_l v10, v11, v8, v9, v27, v20, v1.h[1], v1.h[0] // v10,v11 = t5, 
v8,v9 = t4 dbutterfly_nv18, v21, v12, v13, v4, v5, v6, v7, v4, v5 // v18 = t2a, v21 = t10a dmbutterfly_l v14, v15, v12, v13, v19, v28, v1.h[5], v1.h[4] // v14,v15 = t13, v12,v13 = t12 dbutterfly_nv20, v28, v10, v11, v14, v15, v4, v5, v14, v15 // v20 = t5a, v28 = t13a -dmbutterfly_l v6, v7, v4, v5, v25, v22, v0.h[7], v0.h[6] // v6,v7 = t7, v4,v5 = t6 +dmbutterfly_l v6, v7, v4, v5, v25, v22, v1.h[3], v1.h[2] // v6,v7 = t7, v4,v5 = t6 dbutterfly_nv27, v19, v8, v9, v12, v13, v10, v11, v12, v13 // v27 = t4a, v19 = t12a dmbutterfly_l v10, v11, v8, v9, v17, v30, v1.h[7], v1.h[6] // v10,v11 = t15, v8,v9 = t14 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
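The reordering in these two patches is a pure permutation of the eight coefficient pairs: pairs 2 and 3 of the old table swap places with pairs 4 and 5, so each 8-short row now holds exactly the coefficients one half of the transform consumes. A sketch of the mapping (the `pair_map` helper is my own illustration, not code from the patch):

```c
#include <stdint.h>

/* The 8 bpp iadst16 coefficient table after the reorder, as in the diff. */
static const int16_t iadst16_coeffs[16] = {
    16364,   804, 15893,  3981, 11003, 12140,  8423, 14053,
    14811,  7005, 13160,  9760,  5520, 15426,  2404, 16207,
};

/* Where a coefficient at index `old` in the pre-reorder table now lives:
 * pairs 0,1 stay put, pairs 2,3 move to the second row, pairs 4,5 move up,
 * pairs 6,7 stay put. This is the same remap applied to the d-/v-register
 * lane references in the diff (e.g. d2[1],d2[0] becoming d1[1],d1[0]). */
static int new_index(int old)
{
    static const int pair_map[8] = { 0, 1, 4, 5, 2, 3, 6, 7 };
    return pair_map[old / 2] * 2 + (old & 1);
}
```

With this layout the 16 bpp code can load just the first row for the first group of butterflies, which is the motivation the commit message cites.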
[FFmpeg-devel] [PATCH 03/34] arm: vp9itxfm: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from 15324 to 12388 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub4_add_neon:2063.4 1516.0 1719.5 1245.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3 vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2 After: vp9_inv_dct_dct_16x16_sub4_add_neon:2060.8 1608.5 1735.7 1262.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1 vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9 This is cherrypicked from libav commit 0331c3f5e8cb6e6b53fab7893e91d1be1bfa979c. --- libavcodec/arm/vp9itxfm_neon.S | 43 +- 1 file changed, 26 insertions(+), 17 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 93816d2..328bb01 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -534,7 +534,7 @@ function idct16x16_dc_add_neon endfunc .ltorg -.macro idct16 +function idct16 mbutterfly0 d16, d24, d16, d24, d4, d6, q2, q3 @ d16 = t0a, d24 = t1a mbutterfly d20, d28, d0[1], d0[2], q2, q3 @ d20 = t2a, d28 = t3a mbutterfly d18, d30, d0[3], d1[0], q2, q3 @ d18 = t4a, d30 = t7a @@ -580,9 +580,10 @@ endfunc vmovd4, d21 @ d4 = t10a butterfly d20, d27, d6, d27 @ d20 = out[4], d27 = out[11] butterfly d21, d26, d26, d4@ d21 = out[5], d26 = out[10] -.endm +bx lr +endfunc -.macro iadst16 +function iadst16 movrel r12, iadst16_coeffs vld1.16 {q0-q1}, [r12,:128] @@ -653,7 +654,8 @@ endfunc vmovd16, d2 vmovd30, d4 -.endm +bx lr +endfunc .macro itxfm16_1d_funcs txfm @ Read a vertical 4x16 slice out of a 16x16 matrix, do 
a transform on it, @@ -662,6 +664,8 @@ endfunc @ r1 = slice offset @ r2 = src function \txfm\()16_1d_4x16_pass1_neon +push{lr} + mov r12, #32 vmov.s16q2, #0 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 @@ -669,7 +673,7 @@ function \txfm\()16_1d_4x16_pass1_neon vst1.16 {d4}, [r2,:64], r12 .endr -\txfm\()16 +bl \txfm\()16 @ Do four 4x4 transposes. Originally, d16-d31 contain the @ 16 rows. Afterwards, d16-d19, d20-d23, d24-d27, d28-d31 @@ -682,7 +686,7 @@ function \txfm\()16_1d_4x16_pass1_neon .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31 vst1.16 {d\i}, [r0,:64]! .endr -bx lr +pop {pc} 1: @ Special case: For the last input column (r1 == 12), @ which would be stored as the last row in the temp buffer, @@ -709,7 +713,7 @@ function \txfm\()16_1d_4x16_pass1_neon vmovd29, d17 vmovd30, d18 vmovd31, d19 -bx lr +pop {pc} endfunc @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it, @@ -719,6 +723,7 @@ endfunc @ r2 = src (temp buffer) @ r3 = slice offset function \txfm\()16_1d_4x16_pass2_neon +push{lr} mov r12, #32 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27 vld1.16 {d\i}, [r2,:64], r12 @@ -732,7 +737,7 @@ function \txfm\()16_1d_4x16_pass2_neon add r3, r0, r1 lsl r1, r1, #1 -\txfm\()16 +bl \txfm\()16 .macro load_add_store coef0, coef1, coef2, coef3 vrshr.s16 \coef0, \coef0, #6 @@ -773,7 +778,7 @@ function \txfm\()16_1d_4x16_pass2_neon load_add_store q12, q13, q14, q15 .purgem load_add_store -bx lr +pop {pc} endfunc .endm @@ -908,7 +913,7 @@ function idct32x32_dc_add_neon bx lr endfunc -.macro idct32_odd +function idct32_odd movrel r12, idct_coeffs add r12, r12, #32 vld1.16 {q0-q1}, [r12,:128] @@ -967,7 +972,8 @@ endfunc mbutterfly0 d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a mbutterfly0 d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25, d22 = t22 mbutterfly0 d24, d23, d24,
[FFmpeg-devel] [PATCH 22/34] arm: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit e18c39005ad1dbb178b336f691da1de91afd434e. --- libavcodec/arm/vp9lpf_neon.S | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index 3d289e5..b90c536 100644 --- a/libavcodec/arm/vp9lpf_neon.S +++ b/libavcodec/arm/vp9lpf_neon.S @@ -182,16 +182,20 @@ vmovl.u8q0, d22@ p1 vmovl.u8q1, d25@ q1 +.if \wd >= 8 +vmovr2, r3, d6 +.endif vaddw.s8q0, q0, \tmp3 @ p1 + f vsubw.s8q1, q1, \tmp3 @ q1 - f +.if \wd >= 8 +orrsr2, r2, r3 +.endif vqmovun.s16 d0, q0 @ out p1 vqmovun.s16 d2, q1 @ out q1 vbitd22, d0, d5@ if (!hev && fm && !flat8in) vbitd25, d2, d5 .if \wd >= 8 -vmovr2, r3, d6 -orrsr2, r2, r3 @ If no pixels need flat8in, jump to flat8out @ (or to a writeout of the inner 4 pixels, for wd=8) beq 6f -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 21/34] arm: vp9lpf: Use orrs instead of orr+cmp
This is cherrypicked from libav commit 435cd7bc99671bf561193421a50ac6e9d63c4266. --- libavcodec/arm/vp9lpf_neon.S | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index 2761956..3d289e5 100644 --- a/libavcodec/arm/vp9lpf_neon.S +++ b/libavcodec/arm/vp9lpf_neon.S @@ -78,8 +78,7 @@ vdup.u8 d3, r3 @ H vmovr2, r3, d4 -orr r2, r2, r3 -cmp r2, #0 +orrsr2, r2, r3 @ If no pixels need filtering, just exit as soon as possible beq 9f @@ -192,8 +191,7 @@ .if \wd >= 8 vmovr2, r3, d6 -orr r2, r2, r3 -cmp r2, #0 +orrsr2, r2, r3 @ If no pixels need flat8in, jump to flat8out @ (or to a writeout of the inner 4 pixels, for wd=8) beq 6f @@ -248,14 +246,12 @@ 6: vorrd2, d6, d7 vmovr2, r3, d2 -orr r2, r2, r3 -cmp r2, #0 +orrsr2, r2, r3 @ If no pixels needed flat8in nor flat8out, jump to a @ writeout of the inner 4 pixels beq 7f vmovr2, r3, d7 -orr r2, r2, r3 -cmp r2, #0 +orrsr2, r2, r3 @ If no pixels need flat8out, jump to a writeout of the inner 6 pixels beq 8f -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 06/34] aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit 79d332ebbde8c0a3e9da094dcfd10abd33ba7378. --- libavcodec/aarch64/vp9itxfm_neon.S | 90 +++--- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index a37b459..e45d385 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -598,6 +598,51 @@ endfunc st1 {v2.8h}, [\src], \inc .endm +.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2 +srshr \coef0, \coef0, #6 +ld1 {v2.8b}, [x0], x1 +srshr \coef1, \coef1, #6 +ld1 {v3.8b}, [x3], x1 +srshr \coef2, \coef2, #6 +ld1 {v4.8b}, [x0], x1 +srshr \coef3, \coef3, #6 +uaddw \coef0, \coef0, v2.8b +ld1 {v5.8b}, [x3], x1 +uaddw \coef1, \coef1, v3.8b +srshr \coef4, \coef4, #6 +ld1 {v6.8b}, [x0], x1 +srshr \coef5, \coef5, #6 +ld1 {v7.8b}, [x3], x1 +sqxtun v2.8b, \coef0 +srshr \coef6, \coef6, #6 +sqxtun v3.8b, \coef1 +srshr \coef7, \coef7, #6 +uaddw \coef2, \coef2, v4.8b +ld1 {\tmp1}, [x0], x1 +uaddw \coef3, \coef3, v5.8b +ld1 {\tmp2}, [x3], x1 +sqxtun v4.8b, \coef2 +sub x0, x0, x1, lsl #2 +sub x3, x3, x1, lsl #2 +sqxtun v5.8b, \coef3 +uaddw \coef4, \coef4, v6.8b +st1 {v2.8b}, [x0], x1 +uaddw \coef5, \coef5, v7.8b +st1 {v3.8b}, [x3], x1 +sqxtun v6.8b, \coef4 +st1 {v4.8b}, [x0], x1 +sqxtun v7.8b, \coef5 +st1 {v5.8b}, [x3], x1 +uaddw \coef6, \coef6, \tmp1 +st1 {v6.8b}, [x0], x1 +uaddw \coef7, \coef7, \tmp2 +st1 {v7.8b}, [x3], x1 +sqxtun \tmp1, \coef6 +sqxtun \tmp2, \coef7 +st1 {\tmp1}, [x0], x1 +st1 {\tmp2}, [x3], x1 +.endm + // Read a vertical 8x16 slice out of a 16x16 matrix, do a transform on it, // transpose into a horizontal 16x8 slice and store. 
// x0 = dst (temp buffer) @@ -671,53 +716,8 @@ function \txfm\()16_1d_8x16_pass2_neon lsl x1, x1, #1 bl \txfm\()16 -.macro load_add_store coef0, coef1, coef2, coef3, coef4, coef5, coef6, coef7, tmp1, tmp2 -srshr \coef0, \coef0, #6 -ld1 {v2.8b}, [x0], x1 -srshr \coef1, \coef1, #6 -ld1 {v3.8b}, [x3], x1 -srshr \coef2, \coef2, #6 -ld1 {v4.8b}, [x0], x1 -srshr \coef3, \coef3, #6 -uaddw \coef0, \coef0, v2.8b -ld1 {v5.8b}, [x3], x1 -uaddw \coef1, \coef1, v3.8b -srshr \coef4, \coef4, #6 -ld1 {v6.8b}, [x0], x1 -srshr \coef5, \coef5, #6 -ld1 {v7.8b}, [x3], x1 -sqxtun v2.8b, \coef0 -srshr \coef6, \coef6, #6 -sqxtun v3.8b, \coef1 -srshr \coef7, \coef7, #6 -uaddw \coef2, \coef2, v4.8b -ld1 {\tmp1}, [x0], x1 -uaddw \coef3, \coef3, v5.8b -ld1 {\tmp2}, [x3], x1 -sqxtun v4.8b, \coef2 -sub x0, x0, x1, lsl #2 -sub x3, x3, x1, lsl #2 -sqxtun v5.8b, \coef3 -uaddw \coef4, \coef4, v6.8b -st1 {v2.8b}, [x0], x1 -uaddw \coef5, \coef5, v7.8b -st1 {v3.8b}, [x3], x1 -sqxtun v6.8b, \coef4 -st1 {v4.8b}, [x0], x1 -sqxtun v7.8b, \coef5 -st1 {v5.8b}, [x3], x1 -uaddw \coef6, \coef6, \tmp1 -st1 {v6.8b}, [x0], x1 -uaddw \coef7, \coef7, \tmp2 -st1 {v7.8b}, [x3], x1 -sqxtun \tmp1, \coef6 -sqxtun \tmp2, \coef7 -st1 {\tmp1}, [x0], x1 -st1 {\tmp2}, [x3], x1 -.endm load_add_store v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h, v16.8b, v17.8b load_add_store v24.8h, v25.8h, v26.8h, v27.8h, v28.8h, v29.8h, v30.8h, v31.8h, v16.8b, v17.8b -.purgem load_add_store
[FFmpeg-devel] [PATCH 05/34] arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function
This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit 47b3c2c18d1897f3c753ba0cec4b2d7aa24526af. --- libavcodec/arm/vp9itxfm_neon.S | 72 +- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 328bb01..682a82e 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -657,6 +657,42 @@ function iadst16 bx lr endfunc +.macro load_add_store coef0, coef1, coef2, coef3 +vrshr.s16 \coef0, \coef0, #6 +vrshr.s16 \coef1, \coef1, #6 + +vld1.32 {d4[]}, [r0,:32], r1 +vld1.32 {d4[1]}, [r3,:32], r1 +vrshr.s16 \coef2, \coef2, #6 +vrshr.s16 \coef3, \coef3, #6 +vld1.32 {d5[]}, [r0,:32], r1 +vld1.32 {d5[1]}, [r3,:32], r1 +vaddw.u8\coef0, \coef0, d4 +vld1.32 {d6[]}, [r0,:32], r1 +vld1.32 {d6[1]}, [r3,:32], r1 +vaddw.u8\coef1, \coef1, d5 +vld1.32 {d7[]}, [r0,:32], r1 +vld1.32 {d7[1]}, [r3,:32], r1 + +vqmovun.s16 d4, \coef0 +vqmovun.s16 d5, \coef1 +sub r0, r0, r1, lsl #2 +sub r3, r3, r1, lsl #2 +vaddw.u8\coef2, \coef2, d6 +vaddw.u8\coef3, \coef3, d7 +vst1.32 {d4[0]}, [r0,:32], r1 +vst1.32 {d4[1]}, [r3,:32], r1 +vqmovun.s16 d6, \coef2 +vst1.32 {d5[0]}, [r0,:32], r1 +vst1.32 {d5[1]}, [r3,:32], r1 +vqmovun.s16 d7, \coef3 + +vst1.32 {d6[0]}, [r0,:32], r1 +vst1.32 {d6[1]}, [r3,:32], r1 +vst1.32 {d7[0]}, [r0,:32], r1 +vst1.32 {d7[1]}, [r3,:32], r1 +.endm + .macro itxfm16_1d_funcs txfm @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it, @ transpose into a horizontal 16x4 slice and store. 
@@ -739,44 +775,8 @@ function \txfm\()16_1d_4x16_pass2_neon lsl r1, r1, #1 bl \txfm\()16 -.macro load_add_store coef0, coef1, coef2, coef3 -vrshr.s16 \coef0, \coef0, #6 -vrshr.s16 \coef1, \coef1, #6 - -vld1.32 {d4[]}, [r0,:32], r1 -vld1.32 {d4[1]}, [r3,:32], r1 -vrshr.s16 \coef2, \coef2, #6 -vrshr.s16 \coef3, \coef3, #6 -vld1.32 {d5[]}, [r0,:32], r1 -vld1.32 {d5[1]}, [r3,:32], r1 -vaddw.u8\coef0, \coef0, d4 -vld1.32 {d6[]}, [r0,:32], r1 -vld1.32 {d6[1]}, [r3,:32], r1 -vaddw.u8\coef1, \coef1, d5 -vld1.32 {d7[]}, [r0,:32], r1 -vld1.32 {d7[1]}, [r3,:32], r1 - -vqmovun.s16 d4, \coef0 -vqmovun.s16 d5, \coef1 -sub r0, r0, r1, lsl #2 -sub r3, r3, r1, lsl #2 -vaddw.u8\coef2, \coef2, d6 -vaddw.u8\coef3, \coef3, d7 -vst1.32 {d4[0]}, [r0,:32], r1 -vst1.32 {d4[1]}, [r3,:32], r1 -vqmovun.s16 d6, \coef2 -vst1.32 {d5[0]}, [r0,:32], r1 -vst1.32 {d5[1]}, [r3,:32], r1 -vqmovun.s16 d7, \coef3 - -vst1.32 {d6[0]}, [r0,:32], r1 -vst1.32 {d6[1]}, [r3,:32], r1 -vst1.32 {d7[0]}, [r0,:32], r1 -vst1.32 {d7[1]}, [r3,:32], r1 -.endm load_add_store q8, q9, q10, q11 load_add_store q12, q13, q14, q15 -.purgem load_add_store pop {pc} endfunc -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
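Per pixel, the load_add_store macro moved out in these two patches does three things: a rounding shift of the residual by 6 (vrshr.s16 #6), a widening add onto the destination pixel (vaddw.u8), and saturation back to unsigned 8 bit (vqmovun.s16). A scalar model (helper names are mine):

```c
#include <stdint.h>

/* vrshr.s16 #6: rounding arithmetic shift right */
static int16_t vrshr6(int16_t x)
{
    return (int16_t)((x + 32) >> 6);
}

/* vqmovun.s16: saturate a signed 16-bit value to unsigned 8 bit */
static uint8_t vqmovun(int16_t x)
{
    return x < 0 ? 0 : x > 255 ? 255 : (uint8_t)x;
}

/* One pixel of load_add_store: round-shift the inverse-transform
 * residual, add it to the destination pixel, clamp, store back. */
static uint8_t load_add_store_px(uint8_t dst, int16_t coef)
{
    return vqmovun((int16_t)(dst + vrshr6(coef)));
}
```

The NEON macro interleaves the loads, shifts, and stores of several rows so that the memory operations hide behind the arithmetic; the scalar model only shows what each lane computes.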
[FFmpeg-devel] [PATCH 18/34] arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google. Before:Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 752.0 459.2 862.2 553.9 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 226.5 145.0 225.1 171.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 721.2 415.7 727.6 475.0 This is cherrypicked from libav commit a76bf8cf1277ef6feb1580b578f5e6ca327e713c. --- libavcodec/arm/vp9itxfm_neon.S | 54 -- 1 file changed, 36 insertions(+), 18 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 78fdae6..dee2f05 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -542,16 +542,23 @@ function idct16x16_dc_add_neon vrshr.s16 q8, q8, #6 +mov r3, r0 mov r12, #16 1: @ Loop to add the constant from q8 into all 16x16 outputs -vld1.8 {q3}, [r0,:128] -vaddw.u8q10, q8, d6 -vaddw.u8q11, q8, d7 -vqmovun.s16 d6, q10 -vqmovun.s16 d7, q11 -vst1.8 {q3}, [r0,:128], r1 -subsr12, r12, #1 +subsr12, r12, #2 +vld1.8 {q2}, [r0,:128], r1 +vaddw.u8q10, q8, d4 +vld1.8 {q3}, [r0,:128], r1 +vaddw.u8q11, q8, d5 +vaddw.u8q12, q8, d6 +vaddw.u8q13, q8, d7 +vqmovun.s16 d4, q10 +vqmovun.s16 d5, q11 +vqmovun.s16 d6, q12 +vst1.8 {q2}, [r3,:128], r1 +vqmovun.s16 d7, q13 +vst1.8 {q3}, [r3,:128], r1 bne 1b bx lr @@ -1147,20 +1154,31 @@ function idct32x32_dc_add_neon vrshr.s16 q8, q8, #6 +mov r3, r0 mov r12, #32 1: @ Loop to add the constant from q8 into all 32x32 outputs -vld1.8 {q2-q3}, [r0,:128] -vaddw.u8q10, q8, d4 -vaddw.u8q11, q8, d5 -vaddw.u8q12, q8, d6 -vaddw.u8q13, q8, d7 -vqmovun.s16 d4, q10 -vqmovun.s16 d5, q11 -vqmovun.s16 d6, q12 -vqmovun.s16 d7, q13 -vst1.8 {q2-q3}, [r0,:128], r1 -subsr12, r12, #1 +subsr12, r12, #2 +vld1.8 {q0-q1}, [r0,:128], r1 +vaddw.u8q9, q8, d0 +vaddw.u8q10, q8, d1 +vld1.8 {q2-q3}, [r0,:128], r1 +vaddw.u8q11, q8, d2 +vaddw.u8q12, q8, d3 +vaddw.u8q13, q8, d4 +vaddw.u8q14, q8, d5 +vaddw.u8q15, q8, d6 +vqmovun.s16 d0, q9 +vaddw.u8q9, q8, d7 +vqmovun.s16 d1, 
q10 +vqmovun.s16 d2, q11 +vqmovun.s16 d3, q12 +vqmovun.s16 d4, q13 +vqmovun.s16 d5, q14 +vst1.8 {q0-q1}, [r3,:128], r1 +vqmovun.s16 d6, q15 +vqmovun.s16 d7, q9 +vst1.8 {q2-q3}, [r3,:128], r1 bne 1b bx lr -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
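In both the 16x16 and 32x32 sub1 (DC-only) cases, the constant added in the unrolled loop comes from the DC-only idct path: the lone input coefficient is multiplied twice by cospi_16_64 with 14-bit rounding narrows (smull + rshrn #14), then round-shifted down by 6 (vrshr.s16 #6). A scalar sketch of that computation — the value 11585 for cospi_16_64 is an assumption based on the standard VP9 coefficient table, not taken from this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the DC constant computed before the unrolled add loop
 * in idct16x16_dc_add_neon / idct32x32_dc_add_neon. The constant 11585
 * (cospi_16_64) is assumed from the standard VP9 idct coefficients. */
static int dc_add_constant(int16_t coeff)
{
    const int cospi_16_64 = 11585;
    int dc = (coeff * cospi_16_64 + (1 << 13)) >> 14; /* smull + rshrn #14 */
    dc     = (dc    * cospi_16_64 + (1 << 13)) >> 14; /* applied twice */
    return (dc + (1 << 5)) >> 6;                      /* vrshr.s16 #6 */
}
```

Each destination pixel then becomes clip(pixel + dc) via the uaddw/sqxtun pairs; the unrolling only changes how many rows are processed per loop iteration, not this arithmetic.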
[FFmpeg-devel] [PATCH 19/34] aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 This is cherrypicked from libav commit 3fcf788fbbccc4130868e7abe58a88990290f7c1. --- libavcodec/aarch64/vp9itxfm_neon.S | 54 +- 1 file changed, 36 insertions(+), 18 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 6bb097b..be65eb7 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -495,16 +495,23 @@ function idct16x16_dc_add_neon srshr v2.8h, v2.8h, #6 +mov x3, x0 mov x4, #16 1: // Loop to add the constant from v2 into all 16x16 outputs -ld1 {v3.16b}, [x0] -uaddw v4.8h, v2.8h, v3.8b -uaddw2 v5.8h, v2.8h, v3.16b -sqxtun v4.8b, v4.8h -sqxtun2 v4.16b, v5.8h -st1 {v4.16b}, [x0], x1 -subsx4, x4, #1 +subsx4, x4, #2 +ld1 {v3.16b}, [x0], x1 +ld1 {v4.16b}, [x0], x1 +uaddw v16.8h, v2.8h, v3.8b +uaddw2 v17.8h, v2.8h, v3.16b +uaddw v18.8h, v2.8h, v4.8b +uaddw2 v19.8h, v2.8h, v4.16b +sqxtun v3.8b, v16.8h +sqxtun2 v3.16b, v17.8h +sqxtun v4.8b, v18.8h +sqxtun2 v4.16b, v19.8h +st1 {v3.16b}, [x3], x1 +st1 {v4.16b}, [x3], x1 b.ne1b ret @@ -1054,20 +1061,31 @@ function idct32x32_dc_add_neon srshr v0.8h, v2.8h, #6 +mov x3, x0 mov x4, #32 1: // Loop to add the constant v0 into all 32x32 outputs -ld1 {v1.16b,v2.16b}, [x0] -uaddw v3.8h, v0.8h, v1.8b -uaddw2 v4.8h, v0.8h, v1.16b -uaddw v5.8h, v0.8h, v2.8b -uaddw2 v6.8h, v0.8h, v2.16b -sqxtun v3.8b, v3.8h -sqxtun2 v3.16b, v4.8h -sqxtun v4.8b, v5.8h -sqxtun2 v4.16b, v6.8h -st1 {v3.16b,v4.16b}, [x0], x1 -subsx4, x4, #1 +subsx4, x4, #2 +ld1 {v1.16b,v2.16b}, [x0], x1 +uaddw v16.8h, v0.8h, v1.8b +uaddw2 v17.8h, v0.8h, v1.16b +ld1 {v3.16b,v4.16b}, [x0], x1 +uaddw v18.8h, v0.8h, v2.8b +uaddw2 v19.8h, v0.8h, v2.16b +uaddw v20.8h, v0.8h, v3.8b +uaddw2 v21.8h, v0.8h, v3.16b +uaddw v22.8h, v0.8h, v4.8b +uaddw2 
v23.8h, v0.8h, v4.16b +sqxtun v1.8b, v16.8h +sqxtun2 v1.16b, v17.8h +sqxtun v2.8b, v18.8h +sqxtun2 v2.16b, v19.8h +sqxtun v3.8b, v20.8h +sqxtun2 v3.16b, v21.8h +st1 {v1.16b,v2.16b}, [x3], x1 +sqxtun v4.8b, v22.8h +sqxtun2 v4.16b, v23.8h +st1 {v3.16b,v4.16b}, [x3], x1 b.ne1b ret -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 13/34] aarch64: vp9itxfm: Update a comment to refer to a register with a different name
This is cherrypicked from libav commit 8476eb0d3ab1f7a52317b23346646389c08fb57a. --- libavcodec/aarch64/vp9itxfm_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3b34749..5219d6e 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -41,8 +41,8 @@ const iadst16_coeffs, align=4 .short 11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207 endconst -// out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14 -// out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14 +// out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14 +// out2 = ((in1 - in2) * v0[0] + (1 << 13)) >> 14 // in/out are .8h registers; this can do with 4 temp registers, but is // more efficient if 6 temp registers are available. .macro dmbutterfly0 out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, neg=0 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
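The corrected comment describes the dmbutterfly0 computation. In scalar form it is the following, where v0[0] = 11585 (cospi_16_64) is assumed from the idct coefficient table rather than stated in this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar form of the dmbutterfly0 comment:
 *   out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14
 *   out2 = ((in1 - in2) * v0[0] + (1 << 13)) >> 14
 * v0[0] = 11585 (cospi_16_64) is an assumption here. */
static void dmbutterfly0_scalar(int16_t *out1, int16_t *out2, int in1, int in2)
{
    const int c = 11585;
    *out1 = (int16_t)(((in1 + in2) * c + (1 << 13)) >> 14);
    *out2 = (int16_t)(((in1 - in2) * c + (1 << 13)) >> 14);
}
```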
[FFmpeg-devel] [PATCH 11/34] aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible
The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. This is cherrypicked from libav commit ed8d293306e12c9b79022d37d39f48825ce7f2fa. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index df178d2..e42cc2d 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -255,7 +255,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 cmp w3, #1 b.ne1f // DC-only for idct/idct -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -287,8 +287,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 \txfm2\()4 v4, v5, v6, v7 2: -ld1r{v0.2s}, [x0], x1 -ld1r{v1.2s}, [x0], x1 +ld1 {v0.s}[0], [x0], x1 +ld1 {v1.s}[0], [x0], x1 .ifnc \txfm1,iwht srshr v4.4h, v4.4h, #4 srshr v5.4h, v5.4h, #4 @@ -297,8 +297,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 .endif uaddw v4.8h, v4.8h, v0.8b uaddw v5.8h, v5.8h, v1.8b -ld1r{v2.2s}, [x0], x1 -ld1r{v3.2s}, [x0], x1 +ld1 {v2.s}[0], [x0], x1 +ld1 {v3.s}[0], [x0], x1 sqxtun v0.8b, v4.8h sqxtun v1.8b, v5.8h sub x0, x0, x1, lsl #2 @@ -394,7 +394,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 cmp w3, #1 b.ne1f // DC-only for idct/idct -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -485,7 +485,7 @@ function idct16x16_dc_add_neon moviv1.4h, #0 -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -1044,7 +1044,7 @@ function idct32x32_dc_add_neon moviv1.4h, #0 -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] -- 2.7.4 ___ ffmpeg-devel mailing list 
[FFmpeg-devel] [PATCH 09/34] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
This is cherrypicked from libav commit 3933b86bb93aca47f29fbd493075b0f110c1e3f5. --- libavcodec/arm/vp9itxfm_neon.S | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 33a7af1..78fdae6 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 .ifc \txfm1\()_\txfm2,idct_idct movrel r12, idct_coeffs vpush {q4-q5} -vld1.16 {q0}, [r12,:128] .else movrel r12, iadst8_coeffs vld1.16 {q1}, [r12,:128]! vpush {q4-q7} -vld1.16 {q0}, [r12,:128] .endif +vld1.16 {q0}, [r12,:128] vmov.i16q2, #0 vmov.i16q3, #0 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 12/34] aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability
This is cherrypicked from libav commit 3dd7827258ddaa2e51085d0c677d6f3b1be3572f. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index e42cc2d..3b34749 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -385,10 +385,10 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 .endif ld1 {v0.8h}, [x4] -moviv2.16b, #0 -moviv3.16b, #0 -moviv4.16b, #0 -moviv5.16b, #0 +moviv2.8h, #0 +moviv3.8h, #0 +moviv4.8h, #0 +moviv5.8h, #0 .ifc \txfm1\()_\txfm2,idct_idct cmp w3, #1 @@ -411,11 +411,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 b 2f .endif 1: -ld1 {v16.16b,v17.16b,v18.16b,v19.16b}, [x2], #64 -ld1 {v20.16b,v21.16b,v22.16b,v23.16b}, [x2], #64 +ld1 {v16.8h,v17.8h,v18.8h,v19.8h}, [x2], #64 +ld1 {v20.8h,v21.8h,v22.8h,v23.8h}, [x2], #64 sub x2, x2, #128 -st1 {v2.16b,v3.16b,v4.16b,v5.16b}, [x2], #64 -st1 {v2.16b,v3.16b,v4.16b,v5.16b}, [x2], #64 +st1 {v2.8h,v3.8h,v4.8h,v5.8h}, [x2], #64 +st1 {v2.8h,v3.8h,v4.8h,v5.8h}, [x2], #64 \txfm1\()8 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 20/34] arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.889.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0136.7 vp9_loop_filter_v_16_8_neon:500.0 419.5 382.7 293.0275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.888.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0136.7 vp9_loop_filter_v_16_8_neon:497.0 419.5 379.7 293.0275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0452.0 This is cherrypicked from libav commit e1f9de86f454861b69b199ad801adc2ec6c3b220. --- libavcodec/aarch64/vp9lpf_neon.S | 5 ++--- libavcodec/arm/vp9lpf_neon.S | 5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index 55e1964..7fe2c88 100644 --- a/libavcodec/aarch64/vp9lpf_neon.S +++ b/libavcodec/aarch64/vp9lpf_neon.S @@ -292,7 +292,7 @@ .if \mix != 0 sxtlv1.8h, v1.8b .endif -cmhiv5\sz, v5\sz, v3\sz // hev +cmhsv5\sz, v3\sz, v5\sz // !hev .if \wd == 8 // If a 4/8 or 8/4 mix is used, clear the relevant half of v6 .if \mix != 0 @@ -306,11 +306,10 @@ .elseif \wd == 8 bic v4\sz, v4\sz, v6\sz // fm && !flat8in .endif -mvn v5\sz, v5\sz // !hev +and v5\sz, v5\sz, v4\sz // !hev && fm && !flat8in .if \wd == 16 and v7\sz, v7\sz, v6\sz // flat8out && flat8in && fm .endif -and v5\sz, v5\sz, v4\sz // !hev && fm && !flat8in mul_sz \tmp3\().8h, \tmp4\().8h, \tmp3\().8h, \tmp4\().8h, \tmp5\().8h, \tmp5\().8h, \sz // 3 * (q0 - p0) bic \tmp1\sz, \tmp1\sz, v5\sz// if (!hev) av_clip_int8 = 0 diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index e96f4db..2761956 100644 --- a/libavcodec/arm/vp9lpf_neon.S +++ b/libavcodec/arm/vp9lpf_neon.S @@ -141,7 +141,7 @@ .if \wd == 8 vcle.u8 d6, d6, d0@ flat8in .endif -vcgt.u8 
d5, d5, d3@ hev +vcle.u8 d5, d5, d3@ !hev .if \wd == 8 vandd6, d6, d4@ flat8in && fm .endif @@ -151,11 +151,10 @@ .elseif \wd == 8 vbicd4, d4, d6@ fm && !flat8in .endif -vmvnd5, d5 @ !hev +vandd5, d5, d4@ !hev && fm && !flat8in .if \wd == 16 vandd7, d7, d6@ flat8out && flat8in && fm .endif -vandd5, d5, d4@ !hev && fm && !flat8in vmul.s16\tmpq2, \tmpq2, \tmpq3 @ 3 * (q0 - p0) vbic\tmp1, \tmp1, d5@ if (!hev) av_clip_int8 = 0 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
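The transformation relies on the fact that, for the all-ones/all-zeros masks produced by the NEON compares, inverting a greater-than mask equals a higher-or-same compare with the operands swapped — so cmhi followed by mvn (or vcgt.u8 followed by vmvn) collapses into a single cmhs/vcle.u8. Modeled per lane in C, as a sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane model: cmhi yields an all-ones mask for a > b, cmhs for a >= b.
 * Inverting the cmhi mask equals cmhs with swapped operands:
 * ~(a > b) == (b >= a), so the separate mvn instruction can be dropped. */
static uint8_t cmhi_lane(uint8_t a, uint8_t b) { return a > b  ? 0xff : 0x00; }
static uint8_t cmhs_lane(uint8_t a, uint8_t b) { return a >= b ? 0xff : 0x00; }
```

The freed instruction slot is why the benchmark deltas above are small: the negation could previously be scheduled into an idle slot, so the win only shows on some cores.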
[FFmpeg-devel] [PATCH 10/34] aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
This is cherrypicked from libav commit 4da4b2b87f08a1331650c7e36eb7d4029a160776. --- libavcodec/aarch64/vp9itxfm_neon.S | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3eb999a..df178d2 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -379,12 +379,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 // idct, so those always need to be loaded. .ifc \txfm1\()_\txfm2,idct_idct movrel x4, idct_coeffs -ld1 {v0.8h}, [x4] .else movrel x4, iadst8_coeffs ld1 {v1.8h}, [x4], #16 -ld1 {v0.8h}, [x4] .endif +ld1 {v0.8h}, [x4] moviv2.16b, #0 moviv3.16b, #0 -- 2.7.4 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCHv3 4/4] libavcodec: v4l2: add support for v4l2 mem2mem codecs
Hi Jorge, On Mon, 7 Aug 2017, Jorge Ramirez wrote: On 08/03/2017 01:53 AM, Mark Thompson wrote: +default: +return 0; +} + +SET_V4L_EXT_CTRL(value, qmin, avctx->qmin, "minimum video quantizer scale"); +SET_V4L_EXT_CTRL(value, qmax, avctx->qmax, "maximum video quantizer scale"); + +return 0; +} This doesn't set extradata - you need to extract the codec global headers (such as H.264 SPS and PPS) at init time to be able to write correct files for some codecs (such as H.264) with muxers requiring global headers (such as MP4). It kind of works without it, but the files created will not conform and will not be usable on some players. Ah, that might explain some things (when I play back the encoded video the quality is pretty lousy). Is there already some code I can use as a reference? I might be out of my depth here, so any help will be more than welcome. This is exactly the thing I was trying to tell you about, off list, before. In the OMX driver used on Android, this is requested on startup, via an ioctl with the following private ioctl value: V4L2_CID_MPEG_VIDC_VIDEO_REQUEST_SEQ_HEADER See this code here: https://android.googlesource.com/platform/hardware/qcom/media/+/63abe022/msm8996/mm-video-v4l2/vidc/venc/src/video_encoder_device_v4l2.cpp#2991 This is a qcom-specific, private ioctl. In the Android kernel for Qualcomm, this is handled correctly here: https://android.googlesource.com/kernel/msm/+/android-7.1.2_r0.33/drivers/media/platform/msm/vidc/msm_venc.c#2987 https://android.googlesource.com/kernel/msm/+/android-7.1.2_r0.33/drivers/media/platform/msm/vidc/msm_vidc_common.c#3767 In the dragonboard kernel snapshot I had been testing, that I referred you to before, there are incomplete stubs of handling of this.
In the debian-qcom-dragonboard410c-16.04 tag in the linaro kernel tree: http://git.linaro.org/landing-teams/working/qualcomm/kernel.git/tree/drivers/media/platform/msm/vidc/msm_venc-ctrls.c?h=debian-qcom-dragonboard410c-16.04=8205f603ceeb02d08a720676d9075c9e75e47b0f#n2116 This increments seq_hdr_reqs, just like in the Android kernel tree (where this is working). However, in this kernel tree nothing ever actually reads seq_hdr_reqs, so it's a non-functional stub. Now in the kernel tree you referred me to, in the release/db820c/qcomlt-4.11 branch, I don't see anything similar to V4L2_CID_MPEG_VIDC_VIDEO_REQUEST_SEQ_HEADER. I can't help you from there; you need to figure out what alternative codepath there is, intended to replace it - if any. If there isn't any, you first need to fix the v4l2 driver before userspace apps can get what they need. There is a clear need for this, as you witness in the Android version of the kernel. It just seems to have been removed in the vanilla Linux version of the driver. // Martin
[FFmpeg-devel] [PATCH 1/2] aarch64: vp9: Fix assembling with Xcode 6.2 and older
From: MemphizProperly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). This is cherrypicked from libav commit a970f9de865c84ed5360dd0398baee7d48d04620. --- libavcodec/aarch64/vp9itxfm_neon.S | 2 +- libavcodec/aarch64/vp9mc_neon.S| 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index b12890f0db..99413b0f70 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -1531,7 +1531,7 @@ function ff_vp9_idct_idct_32x32_add_neon, export=1 2: subsx1, x1, #1 .rept 4 -st1 {v16.8h-v19.8h}, [x0], #64 +st1 {v16.8h,v17.8h,v18.8h,v19.8h}, [x0], #64 .endr b.ne2b 3: diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S index 82a0f53133..f67624ca04 100644 --- a/libavcodec/aarch64/vp9mc_neon.S +++ b/libavcodec/aarch64/vp9mc_neon.S @@ -341,7 +341,7 @@ function \type\()_8tap_\size\()h_\idx1\idx2 subsx9, x9, #16 st1 {v1.16b}, [x0], #16 st1 {v24.16b}, [x6], #16 -beq 3f +b.eq3f mov v4.16b, v6.16b mov v16.16b, v18.16b ld1 {v6.16b}, [x2], #16 @@ -388,10 +388,10 @@ function ff_vp9_\type\()_\filter\()\size\()_h_neon, export=1 add x9, x6, w5, uxtw #4 mov x5, #\size .if \size >= 16 -bge \type\()_8tap_16h_34 +b.ge\type\()_8tap_16h_34 b \type\()_8tap_16h_43 .else -bge \type\()_8tap_\size\()h_34 +b.ge\type\()_8tap_\size\()h_34 b \type\()_8tap_\size\()h_43 .endif endfunc -- 2.11.0 (Apple Git-81) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 2/2] aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older
From: MemphizProperly use the b.eq form instead of the nonstandard form (which both gas and newer clang accept though), and expand the register lists that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 8 libavcodec/aarch64/vp9mc_16bpp_neon.S| 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index 0befe383df..68296d9c40 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S @@ -1925,8 +1925,8 @@ function vp9_idct_idct_32x32_add_16_neon 2: subsx1, x1, #1 .rept 4 -st1 {v16.4s-v19.4s}, [x0], #64 -st1 {v16.4s-v19.4s}, [x0], #64 +st1 {v16.4s,v17.4s,v18.4s,v19.4s}, [x0], #64 +st1 {v16.4s,v17.4s,v18.4s,v19.4s}, [x0], #64 .endr b.ne2b 3: @@ -1991,8 +1991,8 @@ function idct32x32_\size\()_add_16_neon moviv19.4s, #0 .rept 4 -st1 {v16.4s-v19.4s}, [x0], #64 -st1 {v16.4s-v19.4s}, [x0], #64 +st1 {v16.4s,v17.4s,v18.4s,v19.4s}, [x0], #64 +st1 {v16.4s,v17.4s,v18.4s,v19.4s}, [x0], #64 .endr 3: diff --git a/libavcodec/aarch64/vp9mc_16bpp_neon.S b/libavcodec/aarch64/vp9mc_16bpp_neon.S index 98ffd2e8a7..cac6428709 100644 --- a/libavcodec/aarch64/vp9mc_16bpp_neon.S +++ b/libavcodec/aarch64/vp9mc_16bpp_neon.S @@ -275,7 +275,7 @@ function \type\()_8tap_\size\()h subsx9, x9, #32 st1 {v1.8h, v2.8h}, [x0], #32 st1 {v24.8h, v25.8h}, [x6], #32 -beq 3f +b.eq3f mov v5.16b, v7.16b mov v16.16b, v18.16b ld1 {v6.8h, v7.8h}, [x2], #32 -- 2.11.0 (Apple Git-81) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 1/3] arm: swscale: Only compile the rgb2yuv asm if .dn aliases are supported
Vanilla clang supports altmacro since clang 5.0, and thus doesn't require gas-preprocessor for building the arm assembly any longer. However, the built-in assembler doesn't support .dn directives. This readds checks that were removed in d7320ca3ed10f0d, when the last usage of .dn directives within libav were removed. Alternatively, the assembly could be rewritten to not use the .dn directive, making it available to clang users. --- configure | 2 ++ libswscale/arm/rgb2yuv_neon_16.S | 3 +++ libswscale/arm/rgb2yuv_neon_32.S | 3 +++ libswscale/arm/swscale_unscaled.c | 6 ++ 4 files changed, 14 insertions(+) diff --git a/configure b/configure index 99570a1415..81fb3fbf75 100755 --- a/configure +++ b/configure @@ -2149,6 +2149,7 @@ SYSTEM_LIBRARIES=" TOOLCHAIN_FEATURES=" as_arch_directive +as_dn_directive as_fpu_directive as_func as_object_arch @@ -5530,6 +5531,7 @@ EOF check_inline_asm asm_mod_q '"add r0, %Q0, %R0" :: "r"((long long)0)' check_as as_arch_directive ".arch armv7-a" +check_as as_dn_directive "ra .dn d0.i16" check_as as_fpu_directive ".fpu neon" # llvm's integrated assembler supports .object_arch from llvm 3.5 diff --git a/libswscale/arm/rgb2yuv_neon_16.S b/libswscale/arm/rgb2yuv_neon_16.S index 601bc9a9b7..ad7e679ca9 100644 --- a/libswscale/arm/rgb2yuv_neon_16.S +++ b/libswscale/arm/rgb2yuv_neon_16.S @@ -18,6 +18,8 @@ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ +#include "config.h" +#if HAVE_AS_DN_DIRECTIVE #include "rgb2yuv_neon_common.S" /* downsampled R16G16B16 x8 */ @@ -78,3 +80,4 @@ alias_qwc8x8x2, q10 .endm loop_420sp rgbx, nv12, init, kernel_420_16x2, 16 +#endif diff --git a/libswscale/arm/rgb2yuv_neon_32.S b/libswscale/arm/rgb2yuv_neon_32.S index f51a5f149f..4fd0f64a09 100644 --- a/libswscale/arm/rgb2yuv_neon_32.S +++ b/libswscale/arm/rgb2yuv_neon_32.S @@ -18,6 +18,8 @@ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ +#include "config.h" +#if HAVE_AS_DN_DIRECTIVE #include 
"rgb2yuv_neon_common.S" /* downsampled R16G16B16 x8 */ @@ -117,3 +119,4 @@ alias_qwc8x8x2, q10 loop_420sp rgbx, nv12, init, kernel_420_16x2, 32 +#endif diff --git a/libswscale/arm/swscale_unscaled.c b/libswscale/arm/swscale_unscaled.c index e1597ab42d..e41f294eac 100644 --- a/libswscale/arm/swscale_unscaled.c +++ b/libswscale/arm/swscale_unscaled.c @@ -23,6 +23,7 @@ #include "libswscale/swscale_internal.h" #include "libavutil/arm/cpu.h" +#if HAVE_AS_DN_DIRECTIVE extern void rgbx_to_nv12_neon_32(const uint8_t *src, uint8_t *y, uint8_t *chroma, int width, int height, int y_stride, int c_stride, int src_stride, @@ -178,3 +179,8 @@ void ff_get_unscaled_swscale_arm(SwsContext *c) if (have_neon(cpu_flags)) get_unscaled_swscale_neon(c); } +#else +void ff_get_unscaled_swscale_arm(SwsContext *c) +{ +} +#endif -- 2.15.1 (Apple Git-101) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters
Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into normal local labels, as used elsewhere. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. --- libavcodec/arm/hevcdsp_qpel_neon.S | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/libavcodec/arm/hevcdsp_qpel_neon.S b/libavcodec/arm/hevcdsp_qpel_neon.S index 86f92cf75a..caa6efa766 100644 --- a/libavcodec/arm/hevcdsp_qpel_neon.S +++ b/libavcodec/arm/hevcdsp_qpel_neon.S @@ -667,76 +667,76 @@ endfunc function ff_hevc_put_qpel_h1v1_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_1_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_h2v1_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_1_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_h3v1_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_1_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_h1v2_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_2_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_h2v2_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_2_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_h3v2_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_2_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_h1v3_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_1 qpel_filter_3_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_1, qpel_filter_3_32b endfunc function ff_hevc_put_qpel_h2v3_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_2 qpel_filter_3_32b 
+hevc_put_qpel_hXvY_neon_8 qpel_filter_2, qpel_filter_3_32b endfunc function ff_hevc_put_qpel_h3v3_neon_8, export=1 -hevc_put_qpel_hXvY_neon_8 qpel_filter_3 qpel_filter_3_32b +hevc_put_qpel_hXvY_neon_8 qpel_filter_3, qpel_filter_3_32b endfunc function ff_hevc_put_qpel_uw_h1v1_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_1_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_uw_h2v1_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_1_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_uw_h3v1_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_1_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_1_32b endfunc function ff_hevc_put_qpel_uw_h1v2_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_2_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_uw_h2v2_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_2_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_uw_h3v2_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_2_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_2_32b endfunc function ff_hevc_put_qpel_uw_h1v3_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1 qpel_filter_3_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_1, qpel_filter_3_32b endfunc function ff_hevc_put_qpel_uw_h2v3_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2 qpel_filter_3_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_2, qpel_filter_3_32b endfunc function ff_hevc_put_qpel_uw_h3v3_neon_8, export=1 -hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3 qpel_filter_3_32b +hevc_put_qpel_uw_hXvY_neon_8 qpel_filter_3, qpel_filter_3_32b endfunc .macro init_put_pixels -- 2.15.1 (Apple Git-101) ___ ffmpeg-devel mailing list 
[FFmpeg-devel] [PATCH 2/3] arm: hevcdsp_deblock: Add commas between macro arguments
When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. --- libavcodec/arm/hevcdsp_deblock_neon.S | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/libavcodec/arm/hevcdsp_deblock_neon.S b/libavcodec/arm/hevcdsp_deblock_neon.S index 166bddb104..7cb7487ef6 100644 --- a/libavcodec/arm/hevcdsp_deblock_neon.S +++ b/libavcodec/arm/hevcdsp_deblock_neon.S @@ -152,7 +152,7 @@ andr9, r8, r7 cmpr9, #0 -beqweakfilter_\@ +beq1f vadd.i16 q2, q11, q12 vadd.i16 q4, q9, q8 @@ -210,11 +210,11 @@ vbit q13, q3, q5 vbit q14, q2, q5 -weakfilter_\@: +1: mvn r8, r8 and r9, r8, r7 cmp r9, #0 -beq ready_\@ +beq 2f vdup.16q4, r2 @@ -275,7 +275,7 @@ weakfilter_\@: vbit q11, q0, q5 vbit q12, q4, q5 -ready_\@: +2: vqmovun.s16 d16, q8 vqmovun.s16 d18, q9 vqmovun.s16 d20, q10 -- 2.15.1 (Apple Git-101) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters
On Sat, 31 Mar 2018, Hendrik Leppkes wrote: On Fri, Mar 30, 2018 at 9:14 PM, Martin Storsjö <mar...@martin.st> wrote: Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into normal local labels, as used elsewhere. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Could it be that you mixed up the commit message and the contents of commits 2/3? Oops, yes, you're right. Will fix before pushing later today. // Martin
[FFmpeg-devel] [PATCH 2/2] flvdec: Export unknown metadata packets as opaque data
--- Removed the option and made this behaviour the default. --- libavformat/flv.h| 1 + libavformat/flvdec.c | 18 ++ 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/libavformat/flv.h b/libavformat/flv.h index 3aabb3adc9..3571b90279 100644 --- a/libavformat/flv.h +++ b/libavformat/flv.h @@ -66,6 +66,7 @@ enum { FLV_STREAM_TYPE_VIDEO, FLV_STREAM_TYPE_AUDIO, FLV_STREAM_TYPE_SUBTITLE, +FLV_STREAM_TYPE_DATA, FLV_STREAM_TYPE_NB, }; diff --git a/libavformat/flvdec.c b/libavformat/flvdec.c index ffc975f15d..4b9f46902b 100644 --- a/libavformat/flvdec.c +++ b/libavformat/flvdec.c @@ -143,7 +143,9 @@ static AVStream *create_stream(AVFormatContext *s, int codec_type) st->codecpar->codec_type = codec_type; if (s->nb_streams>=3 ||( s->nb_streams==2 && s->streams[0]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE - && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE)) + && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE + && s->streams[0]->codecpar->codec_type != AVMEDIA_TYPE_DATA + && s->streams[1]->codecpar->codec_type != AVMEDIA_TYPE_DATA)) s->ctx_flags &= ~AVFMTCTX_NOHEADER; if (codec_type == AVMEDIA_TYPE_AUDIO) { st->codecpar->bit_rate = flv->audio_bit_rate; @@ -1001,7 +1003,7 @@ retry: int type; meta_pos = avio_tell(s->pb); type = flv_read_metabody(s, next); -if (type == 0 && dts == 0 || type < 0 || type == TYPE_UNKNOWN) { +if (type == 0 && dts == 0 || type < 0) { if (type < 0 && flv->validate_count && flv->validate_index[0].pos > next && flv->validate_index[0].pos - 4 < next @@ -1015,6 +1017,8 @@ retry: return flv_data_packet(s, pkt, dts, next); } else if (type == TYPE_ONCAPTION) { return flv_data_packet(s, pkt, dts, next); +} else if (type == TYPE_UNKNOWN) { +stream_type = FLV_STREAM_TYPE_DATA; } avio_seek(s->pb, meta_pos, SEEK_SET); } @@ -1054,10 +1058,13 @@ skip: } else if (stream_type == FLV_STREAM_TYPE_SUBTITLE) { if (st->codecpar->codec_type == AVMEDIA_TYPE_SUBTITLE) break; +} else if (stream_type == FLV_STREAM_TYPE_DATA) { +if 
(st->codecpar->codec_type == AVMEDIA_TYPE_DATA) +break; } } if (i == s->nb_streams) { -static const enum AVMediaType stream_types[] = {AVMEDIA_TYPE_VIDEO, AVMEDIA_TYPE_AUDIO, AVMEDIA_TYPE_SUBTITLE}; +static const enum AVMediaType stream_types[] = {AVMEDIA_TYPE_VIDEO, AVMEDIA_TYPE_AUDIO, AVMEDIA_TYPE_SUBTITLE, AVMEDIA_TYPE_DATA}; st = create_stream(s, stream_types[stream_type]); if (!st) return AVERROR(ENOMEM); @@ -1153,6 +1160,8 @@ retry_duration: size -= ret; } else if (stream_type == FLV_STREAM_TYPE_SUBTITLE) { st->codecpar->codec_id = AV_CODEC_ID_TEXT; +} else if (stream_type == FLV_STREAM_TYPE_DATA) { +st->codecpar->codec_id = AV_CODEC_ID_NONE; // Opaque AMF data } if (st->codecpar->codec_id == AV_CODEC_ID_AAC || @@ -1253,7 +1262,8 @@ retry_duration: if (stream_type == FLV_STREAM_TYPE_AUDIO || ((flags & FLV_VIDEO_FRAMETYPE_MASK) == FLV_FRAME_KEY) || -stream_type == FLV_STREAM_TYPE_SUBTITLE) +stream_type == FLV_STREAM_TYPE_SUBTITLE || +stream_type == FLV_STREAM_TYPE_DATA) pkt->flags |= AV_PKT_FLAG_KEY; leave: -- 2.17.1 (Apple Git-112) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
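With this patch, unknown script/metadata tags come out as packets on a data stream with codec id AV_CODEC_ID_NONE, carrying the raw AMF bytes unchanged; any interpretation is left to the application. As a rough illustration of what such opaque payloads look like, here is a minimal, hypothetical reader for a single AMF0 string value (a sketch only — real AMF parsing handles many more type markers):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one AMF0 string value (type marker 0x02,
 * big-endian 16-bit length, then the string bytes) from an opaque data
 * packet. Returns the string length, or -1 on malformed input. */
static int amf0_read_string(const uint8_t *buf, int size,
                            char *out, int outsize)
{
    if (size < 3 || buf[0] != 0x02)     /* 0x02 = AMF0 string marker */
        return -1;
    int len = (buf[1] << 8) | buf[2];   /* big-endian length */
    if (len + 3 > size || len + 1 > outsize)
        return -1;
    memcpy(out, buf + 3, len);
    out[len] = '\0';
    return len;
}
```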
Re: [FFmpeg-devel] [PATCH 2/2] flvdec: Add an option for exporting unknown metadata packets as opaque data
On Sun, 28 Oct 2018, Michael Niedermayer wrote: On Sat, Oct 27, 2018 at 09:22:18PM +0300, Martin Storsjö wrote: On Sat, 27 Oct 2018, Michael Niedermayer wrote: On Thu, Oct 25, 2018 at 03:59:17PM +0300, Martin Storsjö wrote:

---
 libavformat/flv.h    |  1 +
 libavformat/flvdec.c | 21 +++++++++++++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

[...]

@@ -1290,6 +1302,7 @@ static const AVOption options[] = {
     { "flv_full_metadata", "Dump full metadata of the onMetadata", OFFSET(dump_full_metadata), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VD },
     { "flv_ignore_prevtag", "Ignore the Size of previous tag", OFFSET(trust_datasize), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VD },
     { "missing_streams", "", OFFSET(missing_streams), AV_OPT_TYPE_INT, { .i64 = 0 }, 0, 0xFF, VD | AV_OPT_FLAG_EXPORT | AV_OPT_FLAG_READONLY },
+    { "export_opaque_meta", "", OFFSET(export_opaque_meta), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VD },
     { NULL }

I think this, together with doc/demuxers.texi (which doesn't document this), is not enough for a user to be able to use this option.

Oh right, I had forgotten to actually write something there.

Also, why is this conditional? Is there a disadvantage to always exporting this?

Not sure - I thought it'd be less of a behaviour change, and less risk of potentially confusing packets for unsuspecting users, by not doing it by default. But as any normal flv stream doesn't contain any such packets, it might be fine to just expose them all the time.

I don't know enough about these to have an opinion ... but I just realized another aspect: how do these packets interact with flvenc? Should they be preserved by default? Because if so, they would need to be exported by default.

I guess it depends on what the packets actually are - as they can be anything, it's pretty much up to the application what treatment they want for them. flvenc right now does write them out properly afaik (a data track with codec type AV_CODEC_ID_NONE gets copied straight through into FLV_TAG_TYPE_META packets).
I guess the sensible default would be to copy them, so I'll amend the patch to always export them. // Martin
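The "opaque AMF data" being exported here is AMF0-serialized script data; flv_read_metabody classifies a tag by decoding the leading AMF0 string of its payload (e.g. "onMetaData" or "onTextData"), where an AMF0 string is a 0x02 marker followed by a big-endian 16-bit length and the raw bytes. A minimal, self-contained sketch of that decode step (the function name is illustrative, not from flvdec.c):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AMF0_STRING 0x02

/* Reads an AMF0 short string into out (NUL-terminated).
 * Returns the string length, or -1 if buf does not hold one. */
static int read_amf0_string(const unsigned char *buf, size_t buf_len,
                            char *out, size_t out_len)
{
    size_t len;
    if (buf_len < 3 || buf[0] != AMF0_STRING)
        return -1;
    len = (size_t)buf[1] << 8 | buf[2];          /* big-endian u16 length */
    if (buf_len < 3 + len || out_len < len + 1)  /* bounds check */
        return -1;
    memcpy(out, buf + 3, len);
    out[len] = '\0';
    return (int)len;
}
```

When the leading string is not one of the recognized names, the payload past this point is exactly the opaque blob the demuxer now hands to the application unparsed.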
Re: [FFmpeg-devel] [PATCH 1/2] libavutil: Undeprecate the AVFrame reordered_opaque field
On Mon, 29 Oct 2018, Derek Buitenhuis wrote: On 29/10/2018 14:10, Martin Storsjö wrote: I don't understand why this is being used in favour of a proper pointer field? An integer field is just asking to be misused. Even the doxygen is really sketchy on it. It's essentially meant to be used as union { ptr; int64_t }, assuming you don't have pointers larger than 64 bits. It's not a union in the API, and I'm pretty sure it violates the C spec to use a union to get an integer out of a pointer, shove it into an int64_t, get it back out, and change it back into a pointer via the union. Especially for 32-bit pointers. It encourages terrible code. I just don't think we should revive this as-is purely for convenience. I also don't understand why this is at the AVCodecContext level and not packet/frame? It is on the frame level, but not in the packet struct (probably for historical reasons) - instead of in the packet, it's in AVCodecContext. For decoding, you set the value in AVCodecContext before feeding packets to it, and get the corresponding value reordered into the output AVFrame. If things were to be redone from scratch, moving it into AVPacket would probably make more sense, but there's not much point in doing that right now. I mean, this is pretty gross, and non-obvious as far as I'm concerned. Modifying the AVCodecContext on every call is just... eugh. At some point, the doxygen got markers saying this mechanism was deprecated and one should use the new pkt_pts instead. Before that, reordered_opaque was mainly used for getting reordered pts, as there was no other mechanism for it. But even with the proper pkt_pts field, having a generic opaque field that travels along with the reordering is useful, which is why the deprecation doxygen comments were removed in ad1ee5fa7. But that commit just missed removing one of the doxygen deprecation markers. I agree it's very useful, and something we should have, but not that we should revive/use this particular field... it's nasty.
Sorry, I think you've misunderstood this patch altogether. It's not about reviving this field or not - it's already in full use. It was never deprecated with any active plan to remove it; the only steps taken were a few doxygen comments, never any attributes that would actually prompt action. And a few years later someone noticed that these doxygen comments didn't match up with reality, and it was decided (with no objections in either project) that these really shouldn't be deprecated, as it is the only actual mechanism we have for doing exactly this. It's just that the undeprecation commit, ad1ee5fa7, missed one field. And the one I'm removing the stray deprecation comment from is the very properly placed one in AVFrame, no less. // Martin
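To make the contract being discussed concrete: the caller sets AVCodecContext.reordered_opaque before submitting each packet, and each decoded AVFrame carries the value that was current when its packet entered the decoder, surviving the codec's internal buffering and reordering. Below is a self-contained toy model of that behaviour - a fake decoder with a fixed two-frame delay, not the FFmpeg API itself (it models only delay, not B-frame reordering):

```c
#include <assert.h>
#include <stdint.h>

#define DELAY 2  /* frames buffered inside the toy decoder */

struct toy_dec {
    int64_t opaque;        /* caller sets this before each send, like
                            * AVCodecContext.reordered_opaque */
    int64_t queue[DELAY];  /* opaque values riding with buffered frames */
    int     count;
};

/* Feed one frame; returns 1 and sets *out_opaque when a frame comes out,
 * 0 while the decoder is still buffering. */
static int toy_decode(struct toy_dec *d, int64_t *out_opaque)
{
    if (d->count < DELAY) {
        d->queue[d->count++] = d->opaque;  /* still filling the pipeline */
        return 0;
    }
    *out_opaque = d->queue[0];             /* oldest frame comes out... */
    for (int i = 1; i < DELAY; i++)        /* ...shift the rest forward */
        d->queue[i - 1] = d->queue[i];
    d->queue[DELAY - 1] = d->opaque;       /* newest frame enters */
    return 1;
}
```

In real code the int64_t commonly holds a pointer round-tripped through a cast, which is exactly the usage the objection above is about.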
Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder
On Mon, 29 Oct 2018, Derek Buitenhuis wrote: On 25/10/2018 13:58, Martin Storsjö wrote: +x4->nb_reordered_opaque = x264_encoder_maximum_delayed_frames(x4->enc) + 1; Is it possible this changes when the encoder is reconfigured (e.g. to interlaced)? Good point. I'm sure it's possible that it changes, if reconfiguring. As I guess there can be old frames in flight, the only safe option is to enlarge, not to shrink it. But in case a realloc moves the array, the old pointers end up pretty useless. Tricky, I guess I'll have to think about it to see if I can come up with something which isn't completely terrible. // Martin
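The realloc hazard raised above is the standard one: realloc() may move the buffer, so any pointers previously handed out into the old array dangle, while plain indices stay valid across a grow. A minimal sketch of the index-based alternative (hypothetical names, not the actual patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Growable table of in-flight opaque values, addressed by index so that
 * a growing realloc() never invalidates outstanding references. */
struct opaque_table {
    int64_t *vals;
    size_t   size;
};

/* Stores v in the given slot, growing the table if needed.
 * Returns the slot index, or -1 on allocation failure. */
static long opaque_store(struct opaque_table *t, size_t slot, int64_t v)
{
    if (slot >= t->size) {
        size_t new_size = slot + 1;
        int64_t *p = realloc(t->vals, new_size * sizeof(*p));
        if (!p)
            return -1;
        t->vals = p;   /* any old pointers into vals may now dangle... */
        t->size = new_size;
    }
    t->vals[slot] = v; /* ...but the slot index stays valid */
    return (long)slot;
}
```

Frames in flight then carry a slot number rather than a pointer, so growing the table on reconfiguration is safe.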
Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder
On Wed, 31 Oct 2018, Derek Buitenhuis wrote: On 30/10/2018 19:49, Martin Storsjö wrote: Hmm, that might make sense, but with a little twist. The max reordered frames for H.264 is known, but onto that you also get more delay due to frame threads and other details that this function within x264 knows about. So that would make it [H264 max reordering] + [threads] + [constant] or something such? Looking at the source, it's more complicated than that, with e.g.: h->frames.i_delay = X264_MAX( h->frames.i_delay, h->param.rc.i_lookahead ); I think you're better off not trying to duplicate this logic. Indeed, I don't want to duplicate that. Even though we do allow reconfiguration, it doesn't look like we support changing any parameters which would actually affect the delay, only RC method and targets (CRF, bitrate, etc). So given that, the current patch probably should be safe - what do you think? Or the current patch, with an added margin of 16 on top just in case? // Martin
Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder
On Thu, 1 Nov 2018, Derek Buitenhuis wrote: On 31/10/2018 21:41, Martin Storsjö wrote: Even though we do allow reconfiguration, it doesn't look like we support changing any parameters which would actually affect the delay, only RC method and targets (CRF, bitrate, etc). So given that, the current patch probably should be safe - what do you think? Or the current patch, with an added margin of 16 on top just in case? We allow reconfiguring to/from interlaced. I'm not sure if this can modify delay? Not really sure either... So perhaps it'd be safest with a bit of extra margin/overestimate of the delay here? It just costs a couple of bytes in the mapping array anyway. // Martin
Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder
On Tue, 30 Oct 2018, Derek Buitenhuis wrote: On 29/10/2018 21:06, Martin Storsjö wrote: As I guess there can be old frames in flight, the only safe option is to enlarge, not to shrink it. But in case a realloc moves the array, the old pointers end up pretty useless. Just always allocate the max (which is known for H.264), and adjust nb_reordered_opaque as need be, on reconfig, no? Hmm, that might make sense, but with a little twist. The max reordered frames for H.264 is known, but onto that you also get more delay due to frame threads and other details that this function within x264 knows about. So that would make it [H264 max reordering] + [threads] + [constant] or something such? // Martin
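The "always allocate the max" idea amounts to a fixed ring of opaque slots at least as deep as the encoder's worst-case delay ([H264 max reordering] + [threads] + [margin] in the terms above): each input frame stores its value in the next slot and tags the frame with that slot number, and the matching output hands the tag back so the value can be recovered. A self-contained sketch under that sizing assumption (illustrative names, not the actual libx264.c code):

```c
#include <assert.h>
#include <stdint.h>

/* Ring of in-flight opaque values; DEPTH must exceed the encoder's
 * worst-case frame delay (max reordering + threads + safety margin). */
#define DEPTH (16 + 16)

struct opaque_ring {
    int64_t  vals[DEPTH];
    unsigned next;
};

/* Called per input frame: remember v, return the slot to tag the frame with. */
static unsigned ring_push(struct opaque_ring *r, int64_t v)
{
    unsigned slot = r->next++ % DEPTH;
    r->vals[slot] = v;
    return slot;
}

/* Called per output packet: recover the value from the frame's slot tag.
 * Valid as long as fewer than DEPTH frames are in flight at once. */
static int64_t ring_get(const struct opaque_ring *r, unsigned slot)
{
    return r->vals[slot % DEPTH];
}
```

Because the ring is fixed-size, it never reallocates, which sidesteps the dangling-pointer concern from earlier in the thread; the cost is just DEPTH int64_t slots.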