On Mon, Apr 4, 2011 at 9:58 AM, Loren Merritt <[email protected]> wrote:
> On Mon, 4 Apr 2011, Vitor Sessak wrote:
>>
>> On 04/04/2011 03:44 AM, Loren Merritt wrote:
>>>
>>> On Fri, 1 Apr 2011, Vitor Sessak wrote:
>>>
>>>> + vextractf128 Z(0), m0, 0
>>>> + vextractf128 ZH(0), m1, 0
>>>> + vextractf128 Z(1), m0, 1
>>>> + vextractf128 ZH(1), m1, 1
>>>> + vextractf128 Z(2), m5, 0
>>>> + vextractf128 ZH(2), m3, 0
>>>> + vextractf128 Z(3), m5, 1
>>>> + vextractf128 ZH(3), m3, 1
>>>
>>> Deinterleave real and imaginary some more, and update imdct to match.
>>
>> I'm not sure I understand your suggestion. In my patch, the output of
>> fft_avx is already {r0,r1,r2,r3,r4,r5,r6,r7,i0,i1,i2,i3,i4,i5,i6,i7,...}
>> (which is what one needs for an efficient pass).
>
> Why didn't imdct_half_sse break? It should assume 4x deinterleave.

Because it calls fft_dispatch_sse directly, and thus runs no AVX code.

>>>> + vinsertf128 m0, m0, Z(4), 0
>>>> + vinsertf128 m0, m0, Z(6), 1
>>>> + vinsertf128 m1, m1, Z(5), 0
>>>> + vinsertf128 m1, m1, Z(7), 1
>>>
>>> Factor into fft_permute.
>>
>> I thought about that, but it seemed a lot of ugliness (for this particular
>> permutation) for little speed gain.
>
> More than 8 LOC?

I'll see.

>>>> + %macro T4_2_AVX 3
>>>> + %macro T8_2_AVX 6
>>>> + %macro PASS_SMALL_AVX 3
>>>
>>> These are identical to the sse versions if you apply x86inc's
>>> sse-emulation of vex ops, right?
>>> And the deinterleaving should make PASS_BIG_AVX identical too.
>>
>> I'll try that. Would it work for 3DN?
>
> Yes.

Thanks for the review,
-Vitor
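
For anyone following the layout discussion, here is a minimal scalar sketch
in C of the two data layouts mentioned in this thread. It is not taken from
the patch: the function name deinterleave_blocks and its parameters are
hypothetical, and reading "4x deinterleave" as {r0..r3,i0..i3,...} is an
assumption on my part. With width 8 it produces the
{r0,...,r7,i0,...,i7,...} ordering quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * in:    n interleaved complex floats {r0,i0,r1,i1,...}
 * out:   blocks of `width` reals followed by `width` imaginaries
 * width: 4 for the assumed SSE-style layout, 8 for the AVX-style layout
 * n must be a multiple of width. */
static void deinterleave_blocks(const float *in, float *out,
                                size_t n, size_t width)
{
    assert(width > 0 && n % width == 0);
    for (size_t blk = 0; blk < n; blk += width) {
        for (size_t k = 0; k < width; k++) {
            out[2 * blk + k]         = in[2 * (blk + k)];     /* real part */
            out[2 * blk + width + k] = in[2 * (blk + k) + 1]; /* imaginary part */
        }
    }
}
```

Code that expects the width-4 ordering will read the wrong elements if it is
handed width-8 data, which is why the two paths have to agree on the layout.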
