Hi, dead-thread ping.
On Mon, Sep 5, 2011 at 6:33 PM, Loren Merritt <[email protected]> wrote:
> On Sat, Sep 3, 2011, Vitor Sessak <[email protected]> wrote:
>> On Sat, Sep 3, 2011, Ronald S. Bultje <[email protected]> wrote:
>>> On Thu, Sep 1, 2011, Vitor Sessak <[email protected]> wrote:
>>>
>>>> +%macro LOADA64 2
>>>> +    movlps %1, [%2]
>>>> +    movhps %1, [%2 + 8]
>>>> +%endmacro
>>>> +
>>>> +%macro STOREA64 2
>>>> +    movlps [%1    ], %2
>>>> +    movhps [%1 + 8], %2
>>>> +%endmacro
>>>
>>> Why not movdqa?
>>
>> Because the buffer has only 64-bit alignment. movups would do it too,
>> but it cannot use the fact that the buffer is 64-bit aligned, so I
>> expect it to be slower.
>
> movl/hps doesn't assume alignment either.
>
> movups vs movl/hps:
> Conroe/Penryn, loads: same number of uops, but different execution units.
> Which one is faster depends on what else you're executing, but movups is
> a better bet in general (and confirmed in this case).
> Conroe/Penryn, stores: movl/hps is faster.
> Nehalem/Sandybridge: movups is faster.
> Atom: movl/hps is faster.
> K8: movl/hps is faster.
> K10, loads: movups is faster.
> K10, stores: same throughput in isolation. movl/hps is fewer uops but
> more decoding resources, dunno which is the bottleneck here.
> (Penryn and Sandybridge confirmed by benchmark, the rest are just from
> Agner's instruction tables.)
>
>> +; input %1={x1,x2,x3,x4}, %2={y1,y2,y3,y4}
>> +; output %3={x4,y1,y2,y3}
>> +%macro ROTLEFT_SSE 3
>> +    BUILDINVHIGHLOW %1, %2, %3
>> +    shufps          %3, %3, %2, 0x99
>> +%endmacro
> (and other such macros)
>
> If some macro args can be described as output and some as input, then
> output should come first.
>
>> --- a/libavutil/x86/x86inc.asm
>> +++ b/libavutil/x86/x86inc.asm
>> @@ -790,6 +790,8 @@ AVX_INSTR minps, 1, 0
>>  AVX_INSTR minsd, 1, 0
>>  AVX_INSTR minss, 1, 0
>>  AVX_INSTR mpsadbw, 0, 1
>> +AVX_INSTR movhlps, 1, 0
>> +AVX_INSTR movlhps, 1, 0
>>  AVX_INSTR mulpd, 1, 0
>>  AVX_INSTR mulps, 1, 0
>>  AVX_INSTR mulsd, 1, 0
>
> Alphabetize.

Vitor, will you work on this? I know the thread is old, but these are
minor things to implement (basically have separate load/store
instructions depending on sse vs ssse3 vs avx, and the other stuff is
even more trivial). If not, I can perhaps finish this next weekend.

> On Sat, 27 Aug 2011, Vitor Sessak wrote:
>>
>>>> The main optimization I see is to interleave a few blocks so as to
>>>> simplify the shuffling of data
>>>
>>> Agreed.
>>
>> I'm also attaching a pseudo-SIMD C version of the code. I've done my
>> best to have the minimum number of shuffles, but suggestions are
>> welcome.
>
> Err, what? The interleaved-block version should be exactly the same as
> the scalar C version, except with each scalar instruction replaced by a
> simd instruction on a vector containing 1 coef from each of 4 blocks.
> There should be a transpose on load of in[] and store of buf[], and no
> other shuffles. (out[] prefers the interleaved data, and win[] can be
> statically rearranged.)
> This doesn't quite obsolete the implementation of a single imdct36,
> since we don't always have a multiple of 4 blocks. Though mdct_long_end
> is usually large enough that we wouldn't lose proportionally much by
> rounding it.

I think this should be done long-term and isn't necessary in order to
commit this patch.

Ronald
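
P.S.: to make the load/store part concrete, the sketch below is roughly
what I have in mind. It is untested and only illustrative; USE_MOVUPS is
a placeholder for however we end up selecting the variant per build
(sse vs ssse3 vs avx, following Loren's numbers above), not a flag that
exists in x86inc today.

%macro LOADA64 2 ; dst xmmreg, src address (only 8-byte aligned)
%ifdef USE_MOVUPS
    movups %1, [%2]         ; one unaligned 16-byte load
%else
    movlps %1, [%2]         ; two 8-byte loads, each naturally aligned
    movhps %1, [%2 + 8]
%endif
%endmacro

%macro STOREA64 2 ; dst address (only 8-byte aligned), src xmmreg
%ifdef USE_MOVUPS
    movups [%1], %2         ; one unaligned 16-byte store
%else
    movlps [%1    ], %2     ; two 8-byte stores
    movhps [%1 + 8], %2
%endif
%endmacro

Which variant each build flavor picks can then simply follow Loren's
table.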
