Hi,

Dead thread ping.

On Mon, Sep 5, 2011 at 6:33 PM, Loren Merritt <[email protected]> wrote:
> On Sat, Sep 3, 2011, Vitor Sessak <[email protected]> wrote:
>> On Sat, Sep 3, 2011, Ronald S. Bultje <[email protected]> wrote:
>>> On Thu, Sep 1, 2011, Vitor Sessak <[email protected]> wrote:
>>>
>>>> +%macro LOADA64 2
>>>> +   movlps   %1, [%2]
>>>> +   movhps   %1, [%2 + 8]
>>>> +%endmacro
>>>> +
>>>> +%macro STOREA64 2
>>>> +   movlps   [%1    ], %2
>>>> +   movhps   [%1 + 8], %2
>>>> +%endmacro
>>>
>>> Why not movdqa?
>>
>> Because the buffer has only 64-bit alignment. movups would work too,
>> but it cannot take advantage of the 64-bit alignment, so I expect it
>> to be slower.
>
> movl/hps doesn't assume alignment either.
>
> movups vs movl/hps:
> Conroe/Penryn, loads: same number of uops, but different execution units;
> which one is faster depends on what else you're executing, but movups is
> a better bet in general (and confirmed in this case).
> Conroe/Penryn, stores: movl/hps is faster.
> Nehalem/Sandybridge: movups is faster.
> Atom: movl/hps is faster.
> K8: movl/hps is faster.
> K10, loads: movups is faster.
> K10, stores: same throughput in isolation. movl/hps takes fewer uops but
> more decoding resources; dunno which is the bottleneck here.
> (Penryn and Sandybridge confirmed by benchmark, the rest are just from
> Agner's instruction tables.)
>
>>+; input  %1={x1,x2,x3,x4}, %2={y1,y2,y3,y4}
>>+; output %3={x4,y1,y2,y3}
>>+%macro ROTLEFT_SSE 3
>>+    BUILDINVHIGHLOW %1, %2, %3
>>+    shufps  %3, %3, %2, 0x99
>>+%endmacro
> (and other such macros)
>
> If some macro args can be described as output and some as input, then
> output should come first.
>
>>--- a/libavutil/x86/x86inc.asm
>>+++ b/libavutil/x86/x86inc.asm
>>@@ -790,6 +790,8 @@ AVX_INSTR minps, 1, 0
>> AVX_INSTR minsd, 1, 0
>> AVX_INSTR minss, 1, 0
>> AVX_INSTR mpsadbw, 0, 1
>>+AVX_INSTR movhlps, 1, 0
>>+AVX_INSTR movlhps, 1, 0
>> AVX_INSTR mulpd, 1, 0
>> AVX_INSTR mulps, 1, 0
>> AVX_INSTR mulsd, 1, 0
>
> Alphabetize.

Vitor, will you work on this? I know the thread is old, but these are
minor things to implement (basically, use separate load/store
instructions depending on sse vs ssse3 vs avx; the other stuff is even
more trivial). If not, I can perhaps finish this next weekend.
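
For the load/store part, something like this is what I have in mind
(untested sketch; HAVE_FAST_MOVUPS is a made-up define standing in for
whatever per-CPU selection mechanism we end up using):

%macro LOADA64 2 ; %1 = dst xmm, %2 = src mem (64-bit aligned)
%ifdef HAVE_FAST_MOVUPS
    movups   %1, [%2]          ; one unaligned load: wins on Nehalem/SNB, K10
%else
    movlps   %1, [%2]          ; two 64-bit halves: wins on Atom, K8
    movhps   %1, [%2 + 8]
%endif
%endmacro

%macro STOREA64 2 ; %1 = dst mem (64-bit aligned), %2 = src xmm
%ifdef HAVE_FAST_MOVUPS
    movups   [%1], %2
%else
    movlps   [%1    ], %2
    movhps   [%1 + 8], %2
%endif
%endmacro

Per Loren's numbers above, loads and stores want different cutoffs on
Conroe/Penryn and K10, so the two macros may need separate defines.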

> On Sat, 27 Aug 2011, Vitor Sessak wrote:
>>
>>>> The main optimization i see is to interleave a few blocks so as to
>>>> simplify the shuffling of data
>>>
>>> Agreed.
>>
>> I'm also attaching a pseudo-SIMD C version of the code. I've done my
>> best to have the minimum number of shuffles, but suggestions are
>> welcome.
>
> Err, what? The interleaved-block version should be exactly the same as the
> scalar C version, except with each scalar instruction replaced by a SIMD
> instruction on a vector containing one coef from each of 4 blocks. There
> should be a transpose on load of in[] and store of buf[], and no other
> shuffles. (out[] prefers the interleaved data, and win[] can be statically
> rearranged.)
> This doesn't quite obsolete the implementation of a single imdct36, since
> we don't always have a multiple of 4 blocks. Though mdct_long_end is
> usually large enough that we wouldn't lose proportionally much by rounding
> it.

I think this should be done long-term, but it isn't necessary for
committing this patch.
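
For reference, an untested sketch of the load-side transpose Loren
describes: gather coefficient i of blocks 0..3 into one register, using
the standard 4x4 float transpose (the same shuffle sequence as
_MM_TRANSPOSE4_PS). in0q..in3q are hypothetical pointers to the four
blocks' input coefficients:

    movups   m0, [in0q]      ; x0 x1 x2 x3   (block 0)
    movups   m1, [in1q]      ; y0 y1 y2 y3   (block 1)
    movups   m2, [in2q]      ; z0 z1 z2 z3   (block 2)
    movups   m3, [in3q]      ; w0 w1 w2 w3   (block 3)
    movaps   m4, m0
    unpcklps m0, m1          ; x0 y0 x1 y1
    unpckhps m4, m1          ; x2 y2 x3 y3
    movaps   m5, m2
    unpcklps m2, m3          ; z0 w0 z1 w1
    unpckhps m5, m3          ; z2 w2 z3 w3
    movaps   m1, m0
    movlhps  m0, m2          ; x0 y0 z0 w0 = coef 0 of all 4 blocks
    movhlps  m2, m1          ; x1 y1 z1 w1 = coef 1
    movaps   m3, m4
    movlhps  m4, m5          ; x2 y2 z2 w2 = coef 2
    movhlps  m5, m3          ; x3 y3 z3 w3 = coef 3

After this, the imdct36 body is just the scalar code with every op
vectorized, and the only other shuffle is the matching transpose back on
store of buf[].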

Ronald