Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-24 Thread Carl Eugen Hoyos
2019-03-24 13:26 GMT+01:00, Lynne:

>> Which toolchain did you test?

>> make libavcodec/aarch64/opusdsp_neon.o
>> AS  libavcodec/aarch64/opusdsp_neon.o
>> /tmp/opusdsp_neon-ac304f.s:86:33: error: invalid operand for instruction
>>  fmul v0.4s, v4.4s, v0.4s[0]
>>  ^
>>
>
> Does the toolchain you use compile fft_neon.S?

(Yes)
This is the Android toolchain (that I thought you had already
tested), please consider it a requirement for this patchset
that you install it and test yourself.
(I absolutely understand that you cannot test arm64 Windows
and if you have no macos system, you also cannot test ios
but testing Android compilation is necessary and reasonable.)

Carl Eugen
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-24 Thread Lynne



24 Mar 2019, 00:25 by ceffm...@gmail.com:

> 2019-03-24 0:26 GMT+01:00, Lynne <d...@lynne.ee>:
>
>>
>> 23 Mar 2019, 22:27 by ceffm...@gmail.com:
>>
>>> 2019-03-23 19:20 GMT+01:00, Lynne <d...@lynne.ee>:
>>>
>>>>>>> Which toolchains did you test?
>>>>>>> (For compilation, not performance.)
>>>>>>>
>>>>>>
>>>>>> gcc 8.2.1 on both aarch64 and x86-64
>>>>>>
>>>>>
>>>>> Please also test Android and tell us if you
>>>>> can test ios compilation.
>>>>> (Assuming you cannot test arm64 for Windows.)
>>>>>
>>>>
>>>> I can't install aarch64 android on the raspberry pi 3
>>>> so I can't test that.
>>>>
>>>
>>> Please test compilation for Android (there is no native
>>> toolchain afaik).
>>>
>>
>> Cross compilation works fine.
>>
>
> Which toolchain did you test?
>
> Carl Eugen
>
> make libavcodec/aarch64/opusdsp_neon.o
> AS  libavcodec/aarch64/opusdsp_neon.o
> /tmp/opusdsp_neon-ac304f.s:86:33: error: invalid operand for instruction
>  fmul v0.4s, v4.4s, v0.4s[0]
>  ^
>

Does the toolchain you use compile fft_neon.S? It uses the same syntax.

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Carl Eugen Hoyos
2019-03-24 0:26 GMT+01:00, Lynne:
>
> 23 Mar 2019, 22:27 by ceffm...@gmail.com:
>
>> 2019-03-23 19:20 GMT+01:00, Lynne <d...@lynne.ee>:
>>
>>>>>> Which toolchains did you test?
>>>>>> (For compilation, not performance.)
>>>>>
>>>>> gcc 8.2.1 on both aarch64 and x86-64
>>>>
>>>> Please also test Android and tell us if you
>>>> can test ios compilation.
>>>> (Assuming you cannot test arm64 for Windows.)
>>>
>>> I can't install aarch64 android on the raspberry pi 3
>>> so I can't test that.
>>
>> Please test compilation for Android (there is no native
>> toolchain afaik).
>
> Cross compilation works fine.

Which toolchain did you test?

Carl Eugen

make libavcodec/aarch64/opusdsp_neon.o
AS  libavcodec/aarch64/opusdsp_neon.o
/tmp/opusdsp_neon-ac304f.s:86:33: error: invalid operand for instruction
fmul v0.4s, v4.4s, v0.4s[0]
^
/tmp/opusdsp_neon-ac304f.s:90:33: error: invalid operand for instruction
fmla v0.4s, v5.4s, v1.4s[0]
^
/tmp/opusdsp_neon-ac304f.s:91:33: error: invalid operand for instruction
fmul v3.4s, v7.4s, v2.4s[2]
^
/tmp/opusdsp_neon-ac304f.s:93:33: error: invalid operand for instruction
fmla v0.4s, v6.4s, v1.4s[1]
^
/tmp/opusdsp_neon-ac304f.s:94:33: error: invalid operand for instruction
fmla v3.4s, v6.4s, v2.4s[1]
^
/tmp/opusdsp_neon-ac304f.s:96:33: error: invalid operand for instruction
fmla v0.4s, v7.4s, v1.4s[2]
^
/tmp/opusdsp_neon-ac304f.s:97:33: error: invalid operand for instruction
fmla v3.4s, v5.4s, v2.4s[0]
^
/tmp/opusdsp_neon-ac304f.s:102:33: error: invalid operand for instruction
fmla v2.4s, v4.4s, v1.4s[3]
^
/tmp/opusdsp_neon-ac304f.s:105:33: error: invalid operand for instruction
fmul v0.4s, v4.4s, v2.4s[3]
^
/tmp/opusdsp_neon-ac304f.s:110:17: error: invalid operand for instruction
mov s0, v2.4s[3]
^
/tmp/opusdsp_neon-ac304f.s:117:20: error: invalid operand for instruction
dup v1.4s, v0.4s[1]
   ^
/tmp/opusdsp_neon-ac304f.s:118:20: error: invalid operand for instruction
dup v2.4s, v0.4s[2]
   ^
/tmp/opusdsp_neon-ac304f.s:119:20: error: invalid operand for instruction
dup v0.4s, v0.4s[0]
   ^
make: *** [libavcodec/aarch64/opusdsp_neon.o] Error 1

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Lynne



23 Mar 2019, 22:27 by ceffm...@gmail.com:

> 2019-03-23 19:20 GMT+01:00, Lynne <d...@lynne.ee>:
>
>>>>> Which toolchains did you test?
>>>>> (For compilation, not performance.)
>>>>>
>>>>
>>>> gcc 8.2.1 on both aarch64 and x86-64
>>>>
>>>
>>> Please also test Android and tell us if you
>>> can test ios compilation.
>>> (Assuming you cannot test arm64 for Windows.)
>>>
>>
>> I can't install aarch64 android on the raspberry pi 3
>> so I can't test that.
>>
>
> Please test compilation for Android (there is no native
> toolchain afaik).
>

Cross compilation works fine.

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Carl Eugen Hoyos
2019-03-23 19:20 GMT+01:00, Lynne:

>>>> Which toolchains did you test?
>>>> (For compilation, not performance.)
>>>
>>> gcc 8.2.1 on both aarch64 and x86-64
>>
>> Please also test Android and tell us if you
>> can test ios compilation.
>> (Assuming you cannot test arm64 for Windows.)
>
> I can't install aarch64 android on the raspberry pi 3
> so I can't test that.

Please test compilation for Android (there is no native
toolchain afaik).

Carl Eugen

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Lynne



23 Mar 2019, 16:20 by ceffm...@gmail.com:

> 2019-03-23 17:16 GMT+01:00, Lynne <d...@lynne.ee>:
>
>> 23 Mar 2019, 15:04 by ceffm...@gmail.com:
>>
>>> 2019-03-23 15:23 GMT+01:00, Lynne <d...@lynne.ee>:
>>>
>>>> 16 Mar 2019, 16:34 by d...@lynne.ee:
>>>>
>>>>> 153372 UNITS in postfilter_c,   65536 runs,  0 skips
>>>>> 73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x
>>>>> speedup
>>>>>
>>>>> 80591 UNITS in deemphasis_c,  131072 runs,  0 skips
>>>>> 43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x
>>>>> speedup
>>>>>
>>>>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>>>>> realtime)
>>>>>
>>>>> Deemphasis SIMD based on the following unrolling:
>>>>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>>>>> float state = coeff;
>>>>>
>>>>> for (int i = 0; i < len; i += 4) {
>>>>>     y[0] = x[0] + c1*state;
>>>>>     y[1] = x[1] + c2*state + c1*x[0];
>>>>>     y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>>>>     y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>>>>
>>>>>     state = y[3];
>>>>>     y += 4;
>>>>>     x += 4;
>>>>> }
>>>>>
>>>>> Unlike the x86 version, duplication is used instead of pslldq so
>>>>> the structure and tables are different.
>>>>> Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps +
>>>>> pslldq)
>>>>> had the same performance, so 3x pslldq was kept as vbroadcastss has a
>>>>> higher latency.
>>>>>
>>>>
>>>> Could someone review the patches?
>>>>
>>>
>>> Which toolchains did you test?
>>> (For compilation, not performance.)
>>>
>>
>> gcc 8.2.1 on both aarch64 and x86-64
>>
>
> Please also test Android and tell us if you
> can test ios compilation.
> (Assuming you cannot test arm64 for Windows.)
>

I can't install aarch64 android on the raspberry pi 3 so I can't test that. I 
don't know if cross-compilation is even possible for aarch64 windows.
I don't have an ios device.

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Carl Eugen Hoyos
2019-03-23 17:16 GMT+01:00, Lynne:
> 23 Mar 2019, 15:04 by ceffm...@gmail.com:
>
>> 2019-03-23 15:23 GMT+01:00, Lynne <d...@lynne.ee>:
>>
>>> 16 Mar 2019, 16:34 by d...@lynne.ee:
>>>
>>>> 153372 UNITS in postfilter_c,   65536 runs,  0 skips
>>>> 73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x
>>>> speedup
>>>>
>>>> 80591 UNITS in deemphasis_c,  131072 runs,  0 skips
>>>> 43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x
>>>> speedup
>>>>
>>>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>>>> realtime)
>>>>
>>>> Deemphasis SIMD based on the following unrolling:
>>>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>>>> float state = coeff;
>>>>
>>>> for (int i = 0; i < len; i += 4) {
>>>>     y[0] = x[0] + c1*state;
>>>>     y[1] = x[1] + c2*state + c1*x[0];
>>>>     y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>>>     y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>>>
>>>>     state = y[3];
>>>>     y += 4;
>>>>     x += 4;
>>>> }
>>>>
>>>> Unlike the x86 version, duplication is used instead of pslldq so
>>>> the structure and tables are different.
>>>> Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps +
>>>> pslldq)
>>>> had the same performance, so 3x pslldq was kept as vbroadcastss has a
>>>> higher latency.
>>>>
>>>
>>> Could someone review the patches?
>>>
>>
>> Which toolchains did you test?
>> (For compilation, not performance.)
>>
>
> gcc 8.2.1 on both aarch64 and x86-64

Please also test Android and tell us if you
can test ios compilation.
(Assuming you cannot test arm64 for Windows.)

Carl Eugen

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Lynne
23 Mar 2019, 15:55 by barsn...@gmx.net:

> On Sat, Mar 16, 2019 at 17:34:49 +0100, Lynne wrote:
>
>> Subject: [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter 
>> and deemphasis
>>
> ^accelerated
>
> Moritz
>

Fixed in attached version.
From 2b5371b660812b4ee24b2beb90dea168dd9675e2 Mon Sep 17 00:00:00 2001
From: Lynne 
Date: Fri, 15 Mar 2019 14:37:31 +
Subject: [PATCH] aarch64/opusdsp: implement NEON accelerated postfilter and
 deemphasis

153372 UNITS in postfilter_c,   65536 runs,  0 skips
73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x speedup

80591 UNITS in deemphasis_c,  131072 runs,  0 skips
43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x speedup

Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)

Deemphasis SIMD based on the following unrolling:
const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
float state = coeff;

for (int i = 0; i < len; i += 4) {
    y[0] = x[0] + c1*state;
    y[1] = x[1] + c2*state + c1*x[0];
    y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
    y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

    state = y[3];
    y += 4;
    x += 4;
}

Unlike the x86 version, duplication is used instead of pslldq so
the structure and tables are different.
Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps + pslldq)
had the same performance, so 3x pslldq was kept as vbroadcastss has a higher
latency.
---
 libavcodec/aarch64/Makefile       |   2 +
 libavcodec/aarch64/opusdsp_init.c |  35 +
 libavcodec/aarch64/opusdsp_neon.S | 113 ++
 libavcodec/opusdsp.c              |   3 +
 libavcodec/opusdsp.h              |   1 +
 5 files changed, 154 insertions(+)
 create mode 100644 libavcodec/aarch64/opusdsp_init.c
 create mode 100644 libavcodec/aarch64/opusdsp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 8bc8bc528c..00f93bf59f 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -15,6 +15,7 @@ OBJS-$(CONFIG_VP8DSP)                   += aarch64/vp8dsp_init_aarch64.o
 OBJS-$(CONFIG_AAC_DECODER)              += aarch64/aacpsdsp_init_aarch64.o \
                                            aarch64/sbrdsp_init_aarch64.o
 OBJS-$(CONFIG_DCA_DECODER)              += aarch64/synth_filter_init.o
+OBJS-$(CONFIG_OPUS_DECODER)             += aarch64/opusdsp_init.o
 OBJS-$(CONFIG_RV40_DECODER)             += aarch64/rv40dsp_init_aarch64.o
 OBJS-$(CONFIG_VC1DSP)                   += aarch64/vc1dsp_init_aarch64.o
 OBJS-$(CONFIG_VORBIS_DECODER)           += aarch64/vorbisdsp_init.o
@@ -49,6 +50,7 @@ NEON-OBJS-$(CONFIG_VP8DSP)              += aarch64/vp8dsp_neon.o
 # decoders/encoders
 NEON-OBJS-$(CONFIG_AAC_DECODER)         += aarch64/aacpsdsp_neon.o
 NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
+NEON-OBJS-$(CONFIG_OPUS_DECODER)        += aarch64/opusdsp_neon.o
 NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
 NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o   \
                                            aarch64/vp9itxfm_neon.o         \
diff --git a/libavcodec/aarch64/opusdsp_init.c b/libavcodec/aarch64/opusdsp_init.c
new file mode 100644
index 0000000000..cc6a1b672d
--- /dev/null
+++ b/libavcodec/aarch64/opusdsp_init.c
@@ -0,0 +1,35 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/aarch64/cpu.h"
+#include "libavcodec/opusdsp.h"
+
+void ff_opus_postfilter_neon(float *data, int period, float *gains, int len);
+float ff_opus_deemphasis_neon(float *out, float *in, float coeff, int len);
+
+av_cold void ff_opus_dsp_init_aarch64(OpusDSP *ctx)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (have_neon(cpu_flags)) {
+        ctx->postfilter = 

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Lynne
23 Mar 2019, 15:04 by ceffm...@gmail.com:

> 2019-03-23 15:23 GMT+01:00, Lynne <d...@lynne.ee>:
>
>> 16 Mar 2019, 16:34 by d...@lynne.ee:
>>
>>> 153372 UNITS in postfilter_c,   65536 runs,  0 skips
>>> 73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x speedup
>>>
>>> 80591 UNITS in deemphasis_c,  131072 runs,  0 skips
>>> 43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x
>>> speedup
>>>
>>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>>> realtime)
>>>
>>> Deemphasis SIMD based on the following unrolling:
>>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>>> float state = coeff;
>>>
>>> for (int i = 0; i < len; i += 4) {
>>>     y[0] = x[0] + c1*state;
>>>     y[1] = x[1] + c2*state + c1*x[0];
>>>     y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>>     y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>>
>>>     state = y[3];
>>>     y += 4;
>>>     x += 4;
>>> }
>>>
>>> Unlike the x86 version, duplication is used instead of pslldq so
>>> the structure and tables are different.
>>> Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps + pslldq)
>>> had the same performance, so 3x pslldq was kept as vbroadcastss has a
>>> higher latency.
>>>
>>
>> Could someone review the patches?
>>
>
> Which toolchains did you test?
> (For compilation, not performance.)
>

gcc 8.2.1 on both aarch64 and x86-64

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Moritz Barsnick
On Sat, Mar 16, 2019 at 17:34:49 +0100, Lynne wrote:
> Subject: [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter 
> and deemphasis
                                                       ^accelerated

Moritz

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Carl Eugen Hoyos
2019-03-23 15:23 GMT+01:00, Lynne:
> 16 Mar 2019, 16:34 by d...@lynne.ee:
>
>> 153372 UNITS in postfilter_c,   65536 runs,  0 skips
>> 73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x speedup
>>
>> 80591 UNITS in deemphasis_c,  131072 runs,  0 skips
>> 43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x
>> speedup
>>
>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>> realtime)
>>
>> Deemphasis SIMD based on the following unrolling:
>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>> float state = coeff;
>>
>> for (int i = 0; i < len; i += 4) {
>>     y[0] = x[0] + c1*state;
>>     y[1] = x[1] + c2*state + c1*x[0];
>>     y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>     y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>
>>     state = y[3];
>>     y += 4;
>>     x += 4;
>> }
>>
>> Unlike the x86 version, duplication is used instead of pslldq so
>> the structure and tables are different.
>> Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps + pslldq)
>> had the same performance, so 3x pslldq was kept as vbroadcastss has a
>> higher latency.
>
> Could someone review the patches?

Which toolchains did you test?
(For compilation, not performance.)

Carl Eugen

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accerelated postfilter and deemphasis

2019-03-23 Thread Lynne
16 Mar 2019, 16:34 by d...@lynne.ee:

> 153372 UNITS in postfilter_c,   65536 runs,  0 skips
> 73164 UNITS in postfilter_neon,   65536 runs,  0 skips -> 2.1x speedup
>
> 80591 UNITS in deemphasis_c,  131072 runs,  0 skips
> 43969 UNITS in deemphasis_neon,  131072 runs,  0 skips -> 1.83x speedup
>
> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x realtime)
>
> Deemphasis SIMD based on the following unrolling:
> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
> float state = coeff;
>
> for (int i = 0; i < len; i += 4) {
>     y[0] = x[0] + c1*state;
>     y[1] = x[1] + c2*state + c1*x[0];
>     y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>     y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>
>     state = y[3];
>     y += 4;
>     x += 4;
> }
>
> Unlike the x86 version, duplication is used instead of pslldq so
> the structure and tables are different.
> Same approach tested on x86 (3x pslldq -> vbroadcastss + shufps + pslldq)
> had the same performance, so 3x pslldq was kept as vbroadcastss has a higher
> latency.
>

Could someone review the patches?
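
The unrolling quoted above expands the first-order recurrence
y[n] = x[n] + c1*y[n-1] over four samples, so each output in a block
depends only on that block's inputs and the state carried in from the
previous block. A minimal C sketch that checks this against the plain
scalar loop is given here. The function names, the helper max_abs_diff
and the value 0.85f for EMPH_COEFF are assumptions made for
illustration and are not taken from the patch; the sketch also assumes
that len is a multiple of 4 and that the input and output buffers do
not overlap.

```c
#include <assert.h>

/* Placeholder coefficient for illustration only; the patch uses
 * CELT_EMPH_COEFF, whose definition is not shown in this thread. */
#define EMPH_COEFF 0.85f

/* Reference form: y[n] = x[n] + c1 * y[n-1], with y[-1] = state.
 * Returns the last output so it can seed the next call. */
static float deemphasis_scalar(float *y, const float *x, float state, int len)
{
    for (int i = 0; i < len; i++) {
        y[i]  = x[i] + EMPH_COEFF * state;
        state = y[i];
    }
    return state;
}

/* Unrolled form from the commit message: four outputs per iteration,
 * each written in terms of x[0..3] and the carried-in state only. */
static float deemphasis_unrolled(float *y, const float *x, float state, int len)
{
    const float c1 = EMPH_COEFF, c2 = c1 * c1, c3 = c2 * c1, c4 = c3 * c1;

    assert(len % 4 == 0); /* this sketch handles whole blocks of 4 only */

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1 * state;
        y[1] = x[1] + c2 * state + c1 * x[0];
        y[2] = x[2] + c3 * state + c1 * x[1] + c2 * x[0];
        y[3] = x[3] + c4 * state + c1 * x[2] + c2 * x[1] + c3 * x[0];

        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}

/* Largest absolute element-wise difference between two buffers. */
static float max_abs_diff(const float *a, const float *b, int len)
{
    float worst = 0.0f;
    for (int i = 0; i < len; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running both functions on the same input with the same starting state
should give outputs that agree to within float rounding, since the
unrolled form only reassociates the same multiply-adds.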