On Mon, 4 Apr 2011, Vitor Sessak wrote:
> On 04/04/2011 03:44 AM, Loren Merritt wrote:
>> On Fri, 1 Apr 2011, Vitor Sessak wrote:

>>> + vextractf128 Z(0), m0, 0
>>> + vextractf128 ZH(0), m1, 0
>>> + vextractf128 Z(1), m0, 1
>>> + vextractf128 ZH(1), m1, 1
>>> + vextractf128 Z(2), m5, 0
>>> + vextractf128 ZH(2), m3, 0
>>> + vextractf128 Z(3), m5, 1
>>> + vextractf128 ZH(3), m3, 1

>> Deinterleave real and imaginary some more, and update imdct to match.

> I'm not sure I understand your suggestion. In my patch, the output of
> fft_avx is already {r0,r1,r2,r3,r4,r5,r6,r7,i0,i1,i2,i3,i4,i5,i6,i7,...}
> (which is what one needs for an efficient pass).

Why didn't imdct_half_sse break? It should assume 4x deinterleave.
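
To spell out what the split format buys: each vextractf128 above writes
one 128-bit lane of a ymm register to memory, so the eight stores
scatter the real and imaginary halves into the output buffer. Once
reals and imaginaries live in separate arrays, a pass butterfly is all
vertical ops with no shuffles. A minimal sketch, assuming hypothetical
pointers arq/aiq/brq/biq for the two halves and wrq/wiq for split
twiddle factors (this is not the actual fft_mmx.asm code):

    ; t = w*b as a complex multiply on split data: full-width loads only
    movaps   m0, [brq]        ; br0..br3
    movaps   m1, [biq]        ; bi0..bi3
    movaps   m2, m0
    movaps   m3, m1
    mulps    m0, [wrq]        ; br*wr
    mulps    m3, [wiq]        ; bi*wi
    mulps    m2, [wiq]        ; br*wi
    mulps    m1, [wrq]        ; bi*wr
    subps    m0, m3           ; tr = br*wr - bi*wi
    addps    m2, m1           ; ti = br*wi + bi*wr
    ; butterfly: a' = a + t, b' = a - t
    movaps   m4, [arq]
    movaps   m5, [aiq]
    movaps   m6, m4
    movaps   m7, m5
    addps    m4, m0
    subps    m6, m0
    addps    m5, m2
    subps    m7, m2
    movaps   [arq], m4
    movaps   [brq], m6
    movaps   [aiq], m5
    movaps   [biq], m7

With interleaved {r,i} pairs the same multiply costs extra
shufps/unpck traffic per butterfly, which is exactly what the split
layout avoids.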

>>> + vinsertf128 m0, m0, Z(4), 0
>>> + vinsertf128 m0, m0, Z(6), 1
>>> + vinsertf128 m1, m1, Z(5), 0
>>> + vinsertf128 m1, m1, Z(7), 1

>> Factor into fft_permute.

> I thought about that, but it seemed like a lot of ugliness (for this
> particular permutation) for little speed gain.

More than 8 LOC?
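
Either way the code in question is small. The in-transform version
generalizes to something like this hypothetical macro (just the four
quoted loads parameterized), versus folding the equivalent reordering
into the revtab that ff_fft_permute applies:

    %macro LOAD_REORDER 3 ; dst_even, dst_odd, base_slot
        vinsertf128 %1, %1, Z(%3),     0
        vinsertf128 %1, %1, Z(%3 + 2), 1
        vinsertf128 %2, %2, Z(%3 + 1), 0
        vinsertf128 %2, %2, Z(%3 + 3), 1
    %endmacro

    ; the quoted lines are then: LOAD_REORDER m0, m1, 4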

>>> + %macro T4_2_AVX 3
>>> + %macro T8_2_AVX 6
>>> + %macro PASS_SMALL_AVX 3

>> These are identical to the sse versions if you apply x86inc's
>> sse-emulation of vex ops, right? And the deinterleaving should make
>> PASS_BIG_AVX identical too.

> I'll try that. Would it work for 3DNow!?

Yes.
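
For context on how that emulation works: x86inc overloads the
instruction mnemonics themselves, so three-operand syntax assembles to
the VEX form under AVX and to a register move plus the two-operand op
everywhere else. A reduced sketch of the idea for one instruction,
keyed off an avx_enabled-style flag (hypothetical macro name; the real
version is generated per instruction and deals with the dst==src2 case
rather than silently clobbering it):

    %macro ADDPS3 3 ; dst, src1, src2
    %if avx_enabled
        vaddps  %1, %2, %3
    %elifidn %1, %2
        addps   %1, %3        ; dst aliases src1: plain 2-op form
    %else
        movaps  %1, %2        ; note: assumes %1 != %3
        addps   %1, %3
    %endif
    %endmacro

Since the fallback path emits nothing but plain pre-VEX instructions,
it assembles identically for any old target, 3DNow! included.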

--Loren Merritt