Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread James Almer
On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>> read, nobody ever remembers the ordering of operands.
> 
> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
> beyond me; especially with some side cases that are undocumented, see
> below.

fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
without having to think about what number belongs to what operation and what
operand. And it will output either FMA4 or FMA3 depending on the value passed
to INIT_[XY]MM.

> 2. If anything, the macro is harder, since it is not Intel supported,

Of course it won't be there, it's not defined by them. Non-destructive four
operand fma is defined by AMD.

> I can't look it up at
> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.

Neither are any of the dozens of other compat macros in x86utils. And many of
them are also undocumented within x86utils. This point is absurd.

> 3. The macro does not seem to take care of the mov's (if any), still
> requiring explicit thought on the part of the programmer.

Yes, and? It's not an emulation macro like the uppercase ones that become
several instructions. It translates a single FMA4-like instruction into
either an FMA4 or FMA3 one.

fmaddps xmm0, xmm0, xmm1, xmm2

becomes

vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
vfmadd132ps xmm0, xmm2, xmm1 if FMA3

If you try to use it with four different operands, it will work with FMA4
but not FMA3, since as I said it's not trying to emulate anything.
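
The translation described above can be sketched as a small Python model. This is only an illustration, not the x86inc implementation: the function name and the rule "pick the FMA3 form by which source operand equals dst" are assumptions chosen to be consistent with the fmaddps example in this message.

```python
# Simplified model of turning FMA4-like syntax into a single FMA3 instruction.
# Assumption (not taken from x86inc): the FMA3 form is selected by checking
# which source operand is the same register as the destination.

def fma4_syntax_to_fma3(op, suffix, dst, src1, src2, src3):
    """Return FMA3 instruction text for dst = src1 * src2 + src3.

    op is e.g. "fmadd" and suffix e.g. "ps". Raises ValueError when dst
    matches none of the sources, because one FMA3 instruction cannot
    express a non-destructive four-operand FMA.
    """
    if dst == src1:
        # 132 form: op1 = op1 * op3 + op2
        return f"v{op}132{suffix} {dst}, {src3}, {src2}"
    if dst == src2:
        # 213 form: op1 = op2 * op1 + op3
        return f"v{op}213{suffix} {dst}, {src1}, {src3}"
    if dst == src3:
        # 231 form: op1 = op2 * op3 + op1
        return f"v{op}231{suffix} {dst}, {src1}, {src2}"
    raise ValueError("dst must equal one source operand for FMA3")


print(fma4_syntax_to_fma3("fmadd", "ps", "xmm0", "xmm0", "xmm1", "xmm2"))
```

With four distinct operands the model raises ValueError, mirroring the statement that such a call works with FMA4 but not with FMA3.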

> 4. The macro lacks documentation. In particular, it is not a thorough
> fma4 emulation in the spirit of
> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
> 
> Or put in other words, IMO not good.

No, it's good and what's done in every other asm file precisely for being
more flexible and readable. Especially since it allows one to write both
FMA4 and FMA3 functions without duplicating code.

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Henrik Gramner
Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
gets assembled as FMA3) since normal FMA3 opcodes are horrible to
read, nobody ever remembers the ordering of operands.


Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Ganesh Ajjanagadde
On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
> read, nobody ever remembers the ordering of operands.

1. It is very easy to remember: take fmadd231pd x, y, z for instance.
This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
beyond me; especially with some side cases that are undocumented, see
below.
2. If anything, the macro is harder, since it is not Intel supported,
I can't look it up at
https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
3. The macro does not seem to take care of the mov's (if any), still
requiring explicit thought on the part of the programmer.
4. The macro lacks documentation. In particular, it is not a thorough
fma4 emulation in the spirit of
https://gist.github.com/rygorous/22180ced9c7a00bd68dd.

Or put in other words, IMO not good.
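
The digit rule from point 1 can be written out as a short Python sketch. The helper name fma3 is hypothetical, and plain float arithmetic is used, so this models only which operands are multiplied and added, not the single rounding of a real fused instruction.

```python
# Model of the FMA3 digit convention: in vfmaddXYZ op1, op2, op3 the result
# is opX * opY + opZ, and it is always written to op1.

def fma3(order, op1, op2, op3):
    """Evaluate an FMA3-style fmadd for a digit order such as "231"."""
    ops = {"1": op1, "2": op2, "3": op3}
    a, b, c = (ops[digit] for digit in order)
    return a * b + c


x, y, z = 2.0, 3.0, 5.0
print(fma3("231", x, y, z))  # y*z + x
print(fma3("213", x, y, z))  # y*x + z
print(fma3("132", x, y, z))  # x*z + y
```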



Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Henrik Gramner
On Thu, Jan 14, 2016 at 5:26 PM, Ganesh Ajjanagadde  wrote:
> readability still no.

" dst, mult1, mult2, add" is significantly more readable
than " src1, src2, src3" where you need
to mentally parse which source operand corresponds to which
mathematical operator depending on the order of the digits.

Compare the following instruction sequences which are identical (just
a random example I made up on the spot):

; m0 = m2 * m4 + m0
; m1 = m2 * m1 + m3
; m2 = m2 * m3 + m4

fmaddpd m0, m2, m4, m0
fmaddpd m1, m2, m1, m3
fmaddpd m2, m2, m3, m4

vfmadd231pd m0, m2, m4
vfmadd213pd m1, m2, m3
vfmadd132pd m2, m4, m3

In the first section it's immediately clear at a quick glance which
registers get multiplied by which.
The second section on the other hand takes some time to parse.
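
That the two sequences above really are identical can be checked with a small Python model. The register values and helper names are made up for illustration, and ordinary float arithmetic stands in for the fused operation.

```python
# Run the FMA4-like sequence and the raw FMA3 sequence on the same made-up
# register contents and compare the final register state.

def run_fma4_syntax(regs, program):
    """Each entry is (dst, mult1, mult2, add): dst = mult1 * mult2 + add."""
    regs = dict(regs)
    for dst, a, b, c in program:
        regs[dst] = regs[a] * regs[b] + regs[c]
    return regs


def run_fma3(regs, program):
    """Each entry is (digits, op1, op2, op3): op1 = opX * opY + opZ."""
    regs = dict(regs)
    for order, o1, o2, o3 in program:
        ops = {"1": o1, "2": o2, "3": o3}
        a, b, c = (regs[ops[digit]] for digit in order)
        regs[o1] = a * b + c
    return regs


start = {"m0": 2.0, "m1": 3.0, "m2": 5.0, "m3": 7.0, "m4": 11.0}
fma4_like = [
    ("m0", "m2", "m4", "m0"),
    ("m1", "m2", "m1", "m3"),
    ("m2", "m2", "m3", "m4"),
]
fma3_seq = [
    ("231", "m0", "m2", "m4"),
    ("213", "m1", "m2", "m3"),
    ("132", "m2", "m4", "m3"),
]
same = run_fma4_syntax(start, fma4_like) == run_fma3(start, fma3_seq)
print(same)
```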


Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Ganesh Ajjanagadde
On Thu, Jan 14, 2016 at 6:54 PM, Henrik Gramner  wrote:
> On Thu, Jan 14, 2016 at 11:47 PM, Ganesh Ajjanagadde  wrote:
>> BTW, this is why I personally don't like the macro:
>> so I was moving along, replacing one after the other, till I came to this line
>> vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
>> I naturally replace by
>> fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
>> giving error "invalid combination of opcode and operand"
>> I could spend the time seeing why it is broken, but frankly don't
>> care. The point is, the macro is broken, and the lack of documentation
>> just bit back.
>
> Then that's a bug and it should be fixed. For the record I gave the
> code a quick glance and I'm pretty sure I know what the underlying
> problem is, I'll try to make a fix for it when I have the time to do
> so.

Thank you.

>
> The documentation basically states that it's an FMA3-emulation of
> FMA4-syntax, I'm personally not sure how much there is to expand on
> that but if you do have some concrete suggestions on what kind of
> documentation would be beneficial feel free to make your voice heard
> and maybe someone with knowledge of the code will improve it.

I think it is fine, assuming this bug (and possibly others) is fixed.
It may be good to update/add this to FATE, assuming there is infra for
these kinds of things.

>
> Just complaining that something is absurd and broken and how you don't
> care doesn't really accomplish much however.

Sorry, it was a needless rant. I was just unhappy with the general
idea to use the macro, and was justifying my own lack of use of it.



Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Henrik Gramner
On Thu, Jan 14, 2016 at 11:47 PM, Ganesh Ajjanagadde  wrote:
> BTW, this is why I personally don't like the macro:
> so I was moving along, replacing one after the other, till I came to this line
> vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
> I naturally replace by
> fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
> giving error "invalid combination of opcode and operand"
> I could spend the time seeing why it is broken, but frankly don't
> care. The point is, the macro is broken, and the lack of documentation
> just bit back.

Then that's a bug and it should be fixed. For the record I gave the
code a quick glance and I'm pretty sure I know what the underlying
problem is, I'll try to make a fix for it when I have the time to do
so.

The documentation basically states that it's an FMA3-emulation of
FMA4-syntax, I'm personally not sure how much there is to expand on
that but if you do have some concrete suggestions on what kind of
documentation would be beneficial feel free to make your voice heard
and maybe someone with knowledge of the code will improve it.

Just complaining that something is absurd and broken and how you don't
care doesn't really accomplish much however.


Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Ganesh Ajjanagadde
On Thu, Jan 14, 2016 at 11:48 AM, James Almer  wrote:
> On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 11:16 AM, James Almer  wrote:
>>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
>>>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>>>> read, nobody ever remembers the ordering of operands.
>>>>
>>>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>>>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>>>> beyond me; especially with some side cases that are undocumented, see
>>>> below.
>>>
>>> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
>>> without having to think about what number belongs to what operation and what
>>> operand. And it will output either FMA4 or FMA3 depending on the value passed
>>> to INIT_[XY]MM.
>>
>> The fma3/fma4 thing is the only benefit. Even that is generally not a
>> big deal; AMD quickly started supporting fma3.
>
> Nobody is asking you to write an FMA4 version of this function. We're asking
> you to use the x86inc FMA4-like macros for readability purposes.
>
>>
>>>
>>>> 2. If anything, the macro is harder, since it is not Intel supported,
>>>
>>> Of course it won't be there, it's not defined by them. Non-destructive four
>>> operand fma is defined by AMD.
>>
>> Of course I know this.
>>
>>>
>>>> I can't look it up at
>>>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>>>
>>> Neither are any of the dozens of other compat macros in x86utils. And many of
>>> them are also undocumented within x86utils. This point is absurd.
>>
>> How is it absurd? You expect me to use something that lacks clear
>> documentation, and claim that it is "more readable". What other macros
>> have/lack is irrelevant to the point.
>
> If you want documentation for FMA4 look at AMD docs, just like you didn't
> hesitate to look at Intel's.
>
>>
>>>
>>>> 3. The macro does not seem to take care of the mov's (if any), still
>>>> requiring explicit thought on the part of the programmer.
>>>
>>> Yes, and? It's not an emulation macro like the uppercase ones that become
>>> several instructions. It translates a single FMA4-like instruction into
>>> either an FMA4 or FMA3 one.
>>>
>>> fmaddps xmm0, xmm0, xmm1, xmm2
>>>
>>> becomes
>>>
>>> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
>>> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>>>
>>> If you try to use it with four different operands, it will work with FMA4
>>> but not FMA3, since as I said it's not trying to emulate anything.
>>
>> Thanks for mentioning the convention; but this is an important one and
>> AFAIK not mentioned in any documentation within FFmpeg.
>>
>>>
>>>> 4. The macro lacks documentation. In particular, it is not a thorough
>>>> fma4 emulation in the spirit of
>>>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>>>
>>>> Or put in other words, IMO not good.
>>>
>>> No, it's good and what's done in every other asm file precisely for being
>>> more flexible and readable.
>>
>> Flexibility, yes, readability still no.
>
> dst = src1 * src2 + src3
>
> That's all you need to know to read an FMA4-like instruction. Are you going to
> tell me that the clusterfuck that's FMA3 with varying numbers that change the
> order of operations and meaning of operands is easier to read?

BTW, this is why I personally don't like the macro:
so I was moving along, replacing one after the other, till I came to this line
vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
I naturally replace by
fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
giving error "invalid combination of opcode and operand"
I could spend the time seeing why it is broken, but frankly don't
care. The point is, the macro is broken, and the lack of documentation
just bit back.
fmaddpd ymm1, ymm5, ymm1, COVAR(iq,1)
works though (switch order of mult).
And the idea of just looking at the amd docs does not help either,
both are perfectly fine for fma4.
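
One plausible reading of why swapping the multiplicands helps, sketched in Python. The thread does not state the cause, so this is a guess: it assumes the macro picks the FMA3 form by which source equals dst, and relies on the fact that FMA3 instructions accept a memory operand only as their last operand. All names here are hypothetical and "[covar]" merely stands in for COVAR(iq,1).

```python
# Guess at the failure mode: if dst == src1 selects the 132 form, the addend
# (here a memory operand) lands in the second slot, which FMA3 cannot encode.

def to_fma3(dst, src1, src2, src3):
    """Map dst = src1 * src2 + src3 to (digits, op1, op2, op3)."""
    if dst == src1:
        return ("132", dst, src3, src2)
    if dst == src2:
        return ("213", dst, src1, src3)
    if dst == src3:
        return ("231", dst, src1, src2)
    raise ValueError("dst must equal one source operand for FMA3")


def is_memory(operand):
    # Hypothetical convention for this sketch: memory operands are bracketed.
    return operand.startswith("[")


def encodable(insn):
    """Only the third FMA3 operand may be a memory reference."""
    _digits, op1, op2, _op3 = insn
    return not is_memory(op1) and not is_memory(op2)


MEM = "[covar]"
first = to_fma3("ymm1", "ymm1", "ymm5", MEM)   # order that gave the error
second = to_fma3("ymm1", "ymm5", "ymm1", MEM)  # multiplicands swapped
print(first, encodable(first))
print(second, encodable(second))
```

Under these assumptions the swapped order maps back to the 213 form with the memory operand last, which is the instruction the patch used originally.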

All said, patchv2 posted.

>
> With the compat macros in x86inc, as long as two of the four operands are the
> same register then it's going to output the relevant FMA3 instruction for you.
>
>>
>>> Especially since it allows one to write both
>>> FMA4 and FMA3 functions without duplicating code.
>>
>> Fine.
>>

Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Ganesh Ajjanagadde
On Thu, Jan 14, 2016 at 11:16 AM, James Almer  wrote:
> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>> read, nobody ever remembers the ordering of operands.
>>
>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>> beyond me; especially with some side cases that are undocumented, see
>> below.
>
> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
> without having to think about what number belongs to what operation and what
> operand. And it will output either FMA4 or FMA3 depending on the value passed
> to INIT_[XY]MM.

The fma3/fma4 thing is the only benefit. Even that is generally not a
big deal; AMD quickly started supporting fma3.

>
>> 2. If anything, the macro is harder, since it is not Intel supported,
>
> Of course it won't be there, it's not defined by them. Non-destructive four
> operand fma is defined by AMD.

Of course I know this.

>
>> I can't look it up at
>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>
> Neither are any of the dozens of other compat macros in x86utils. And many of
> them are also undocumented within x86utils. This point is absurd.

How is it absurd? You expect me to use something that lacks clear
documentation, and claim that it is "more readable". What other macros
have/lack is irrelevant to the point.

>
>> 3. The macro does not seem to take care of the mov's (if any), still
>> requiring explicit thought on the part of the programmer.
>
> Yes, and? It's not an emulation macro like the uppercase ones that become
> several instructions. It translates a single FMA4-like instruction into
> either an FMA4 or FMA3 one.
>
> fmaddps xmm0, xmm0, xmm1, xmm2
>
> becomes
>
> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>
> If you try to use it with four different operands, it will work with FMA4
> but not FMA3, since as I said it's not trying to emulate anything.

Thanks for mentioning the convention; but this is an important one and
AFAIK not mentioned in any documentation within FFmpeg.

>
>> 4. The macro lacks documentation. In particular, it is not a thorough
>> fma4 emulation in the spirit of
>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>
>> Or put in other words, IMO not good.
>
> No, it's good and what's done in every other asm file precisely for being
> more flexible and readable.

Flexibility, yes, readability still no.

> Especially since it allows one to write both
> FMA4 and FMA3 functions without duplicating code.

Fine.



Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread Ganesh Ajjanagadde
On Thu, Jan 14, 2016 at 11:48 AM, James Almer  wrote:
> On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 11:16 AM, James Almer  wrote:
>>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
[...]

There is no need for discussion; I have already said it is fine and am
amending the patch. It is really a personal thing; I prefer explicit
ops when working at that low a level. Even mova I would have changed;
just wanted to keep code consistency. It is for this reason that I
will change it.

[...]


Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-14 Thread James Almer
On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
> On Thu, Jan 14, 2016 at 11:16 AM, James Almer  wrote:
>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner  wrote:
>>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>>> read, nobody ever remembers the ordering of operands.
>>>
>>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>>> beyond me; especially with some side cases that are undocumented, see
>>> below.
>>
>> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
>> without having to think about what number belongs to what operation and what
>> operand. And it will output either FMA4 or FMA3 depending on the value passed
>> to INIT_[XY]MM.
> 
> The fma3/fma4 thing is the only benefit. Even that is generally not a
> big deal; AMD quickly started supporting fma3.

Nobody is asking you to write an FMA4 version of this function. We're asking
you to use the x86inc FMA4-like macros for readability purposes.

> 
>>
>>> 2. If anything, the macro is harder, since it is not Intel supported,
>>
>> Of course it won't be there, it's not defined by them. Non-destructive four
>> operand fma is defined by AMD.
> 
> Of course I know this.
> 
>>
>>> I can't look it up at
>>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>>
>> Neither are any of the dozens of other compat macros in x86utils. And many of
>> them are also undocumented within x86utils. This point is absurd.
> 
> How is it absurd? You expect me to use something that lacks clear
> documentation, and claim that it is "more readable". What other macros
> have/lack is irrelevant to the point.

If you want documentation for FMA4 look at AMD docs, just like you didn't
hesitate to look at Intel's.

> 
>>
>>> 3. The macro does not seem to take care of the mov's (if any), still
>>> requiring explicit thought on the part of the programmer.
>>
>> Yes, and? It's not an emulation macro like the uppercase ones that become
>> several instructions. It translates a single FMA4-like instruction into
>> either an FMA4 or FMA3 one.
>>
>> fmaddps xmm0, xmm0, xmm1, xmm2
>>
>> becomes
>>
>> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
>> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>>
>> If you try to use it with four different operands, it will work with FMA4
>> but not FMA3, since as I said it's not trying to emulate anything.
> 
> Thanks for mentioning the convention; but this is an important one and
> AFAIK not mentioned in any documentation within FFmpeg.
> 
>>
>>> 4. The macro lacks documentation. In particular, it is not a thorough
>>> fma4 emulation in the spirit of
>>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>>
>>> Or put in other words, IMO not good.
>>
>> No, it's good and what's done in every other asm file precisely for being
>> more flexible and readable.
> 
> Flexibility, yes, readability still no.

dst = src1 * src2 + src3

That's all you need to know to read an FMA4-like instruction. Are you going to
tell me that the clusterfuck that's FMA3 with varying numbers that change the
order of operations and meaning of operands is easier to read?

With the compat macros in x86inc, as long as two of the four operands are the
same register then it's going to output the relevant FMA3 instruction for you.

> 
>> Especially since it allows one to write both
>> FMA4 and FMA3 functions without duplicating code.
> 
> Fine.
> 


[FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-13 Thread Ganesh Ajjanagadde
This improves accuracy (very slightly) and speed for processors having
fma3.
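
Where the small accuracy gain can come from: a fused multiply-add rounds once, while a separate multiply and add round twice. A minimal Python sketch, emulating the single rounding with exact rational arithmetic; the helper name and the input values are illustrative and unrelated to the LLS data.

```python
from fractions import Fraction

# Emulate a fused multiply-add: compute a*b + c exactly, then round once.
def fused_multiply_add(a, b, c):
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c                  # the product is rounded before the add
fused = fused_multiply_add(a, b, c)  # one rounding at the very end
print(unfused, fused)
```

Here the exact product is 1 - 2**-60, which the separate multiply rounds to 1.0, so the unfused result loses the small term that the fused result keeps.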

Sample benchmark (fate flac-16-lpc-cholesky, Haswell):
old:
5993610 decicycles in ff_lpc_calc_coefs,  64 runs,  0 skips
5951528 decicycles in ff_lpc_calc_coefs, 128 runs,  0 skips

new:
5252410 decicycles in ff_lpc_calc_coefs,  64 runs,  0 skips
5232869 decicycles in ff_lpc_calc_coefs, 128 runs,  0 skips
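
For reference, the relative saving implied by the figures above can be computed directly. A small Python sketch; the dictionary layout and function name are just for illustration.

```python
# Decicycle counts copied from the benchmark above, keyed by number of runs.
old = {64: 5993610, 128: 5951528}
new = {64: 5252410, 128: 5232869}


def reduction(before, after):
    """Fraction of decicycles saved relative to the old code."""
    return (before - after) / before


for runs in sorted(old):
    print(f"{runs} runs: {reduction(old[runs], new[runs]):.1%} fewer decicycles")
```

Both rows come out at roughly 12% fewer decicycles in ff_lpc_calc_coefs.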

Tested with FATE and --disable-fma3, also examined contents of
lavu/lls-test.

Signed-off-by: Ganesh Ajjanagadde 
---
 libavutil/x86/lls.asm| 61 ++--
 libavutil/x86/lls_init.c |  4 
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/libavutil/x86/lls.asm b/libavutil/x86/lls.asm
index 769befb..358603a 100644
--- a/libavutil/x86/lls.asm
+++ b/libavutil/x86/lls.asm
@@ -125,8 +125,7 @@ cglobal update_lls, 2,5,8, ctx, var, i, j, covar2
 .ret:
 REP_RET
 
-%if HAVE_AVX_EXTERNAL
-INIT_YMM avx
+%macro UPDATE_LLS 0
 cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 %define covarq ctxq
 mov  countd, [ctxq + LLSModel.indep_count]
@@ -140,6 +139,18 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 vbroadcastsd ymm6, [varq + iq*8 + 16]
 vbroadcastsd ymm7, [varq + iq*8 + 24]
 vextractf128 xmm3, ymm1, 1
+%if cpuflag(fma3)
+mova ymm0, COVAR(iq  ,0)
+mova xmm2, COVAR(iq+2,2)
+vfmadd231pd ymm0, ymm1, ymm4
+vfmadd231pd xmm2, xmm3, xmm6
+vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
+vfmadd213pd xmm3, xmm7, COVAR(iq+2,3)
+mova COVAR(iq  ,0), ymm0
+mova COVAR(iq  ,1), ymm1
+mova COVAR(iq+2,2), xmm2
+mova COVAR(iq+2,3), xmm3
+%else
 vmulpd  ymm0, ymm1, ymm4
 vmulpd  ymm1, ymm1, ymm5
 vmulpd  xmm2, xmm3, xmm6
@@ -148,12 +159,27 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 ADDPD_MEM COVAR(iq  ,1), ymm1
 ADDPD_MEM COVAR(iq+2,2), xmm2
 ADDPD_MEM COVAR(iq+2,3), xmm3
+%endif ; cpuflag(fma3)
 lea jd, [iq + 4]
 cmp jd, count2d
 jg .skip4x4
 .loop4x4:
 ; Compute all 16 pairwise products of a 4x4 block
 mova ymm3, [varq + jq*8]
+%if cpuflag(fma3)
+mova ymm0, COVAR(jq, 0)
+mova ymm1, COVAR(jq, 1)
+mova ymm2, COVAR(jq, 2)
+mova ymm3, COVAR(jq, 3)
+vfmadd231pd ymm0, ymm3, ymm4
+vfmadd231pd ymm1, ymm3, ymm5
+vfmadd231pd ymm2, ymm3, ymm6
+vfmadd231pd ymm3, ymm3, ymm7
+mova COVAR(jq, 0), ymm0
+mova COVAR(jq, 1), ymm1
+mova COVAR(jq, 2), ymm2
+mova COVAR(jq, 3), ymm3
+%else
 vmulpd  ymm0, ymm3, ymm4
 vmulpd  ymm1, ymm3, ymm5
 vmulpd  ymm2, ymm3, ymm6
@@ -162,6 +188,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 ADDPD_MEM COVAR(jq,1), ymm1
 ADDPD_MEM COVAR(jq,2), ymm2
 ADDPD_MEM COVAR(jq,3), ymm3
+%endif ; cpuflag(fma3)
 add jd, 4
 cmp jd, count2d
 jle .loop4x4
@@ -169,6 +196,20 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 cmp jd, countd
 jg .skip2x4
 mova xmm3, [varq + jq*8]
+%if cpuflag(fma3)
+mova xmm0, COVAR(jq, 0)
+mova xmm1, COVAR(jq, 1)
+mova xmm2, COVAR(jq, 2)
+mova xmm3, COVAR(jq, 3)
+vfmadd231pd xmm0, xmm3, xmm4
+vfmadd231pd xmm1, xmm3, xmm5
+vfmadd231pd xmm2, xmm3, xmm6
+vfmadd231pd xmm3, xmm3, xmm7
+mova COVAR(jq, 0), xmm0
+mova COVAR(jq, 1), xmm1
+mova COVAR(jq, 2), xmm2
+mova COVAR(jq, 3), xmm3
+%else
 vmulpd  xmm0, xmm3, xmm4
 vmulpd  xmm1, xmm3, xmm5
 vmulpd  xmm2, xmm3, xmm6
@@ -177,6 +218,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 ADDPD_MEM COVAR(jq,1), xmm1
 ADDPD_MEM COVAR(jq,2), xmm2
 ADDPD_MEM COVAR(jq,3), xmm3
+%endif ; cpuflag(fma3)
 .skip2x4:
 add id, 4
 add covarq, 4*COVAR_STRIDE
@@ -187,14 +229,29 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
 mov jd, id
 .loop2x1:
 vmovddup xmm0, [varq + iq*8]
+%if cpuflag(fma3)
+mova xmm1, [varq + jq*8]
+vfmadd213pd xmm0, xmm1, COVAR(jq,0)
+mova COVAR(jq,0), xmm0
+%else
 vmulpd   xmm0, [varq + jq*8]
 ADDPD_MEM COVAR(jq,0), xmm0
+%endif ; cpuflag(fma3)
 inc id
 add covarq, COVAR_STRIDE
 cmp id, countd
 jle .loop2x1
 .ret:
 REP_RET
+%endmacro ; UPDATE_LLS
+
+%if HAVE_AVX_EXTERNAL
+INIT_YMM avx
+UPDATE_LLS
+%endif
+%if HAVE_FMA3_EXTERNAL
+INIT_YMM fma3
+UPDATE_LLS
 %endif
 
 INIT_XMM sse2
diff --git a/libavutil/x86/lls_init.c b/libavutil/x86/lls_init.c
index 81f141c..9f0d862 100644
--- a/libavutil/x86/lls_init.c
+++ b/libavutil/x86/lls_init.c
@@ -25,6 +25,7 @@
 
 void ff_update_lls_sse2(LLSModel *m, const double *var);
 void ff_update_lls_avx(LLSModel *m, const double *var);
+void ff_update_lls_fma3(LLSModel *m, const double *var);
 double ff_evaluate_lls_sse2(LLSModel *m, const double *var, int order);
 
 av_cold void ff_init_lls_x86(LLSModel *m)
@@ -38,4 +39,7 @@ av_cold void ff_init_lls_x86(LLSModel *m)
 if 

Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls

2016-01-13 Thread Ganesh Ajjanagadde
On Wed, Jan 13, 2016 at 6:59 PM, Ganesh Ajjanagadde
 wrote:
> This improves accuracy (very slightly) and speed for processors having
> fma3.
>
> Sample benchmark (fate flac-16-lpc-cholesky, Haswell):
> old:
> 5993610 decicycles in ff_lpc_calc_coefs,  64 runs,  0 skips
> 5951528 decicycles in ff_lpc_calc_coefs, 128 runs,  0 skips
>
> new:
> 5252410 decicycles in ff_lpc_calc_coefs,  64 runs,  0 skips
> 5232869 decicycles in ff_lpc_calc_coefs, 128 runs,  0 skips
>
> Tested with FATE and --disable-fma3, also examined contents of
> lavu/lls-test.
>
> Signed-off-by: Ganesh Ajjanagadde 
> ---
>  libavutil/x86/lls.asm| 61 
> ++--
>  libavutil/x86/lls_init.c |  4 
>  2 files changed, 63 insertions(+), 2 deletions(-)
>
> diff --git a/libavutil/x86/lls.asm b/libavutil/x86/lls.asm
> index 769befb..358603a 100644
> --- a/libavutil/x86/lls.asm
> +++ b/libavutil/x86/lls.asm
> @@ -125,8 +125,7 @@ cglobal update_lls, 2,5,8, ctx, var, i, j, covar2
>  .ret:
>  REP_RET
>
> -%if HAVE_AVX_EXTERNAL
> -INIT_YMM avx
> +%macro UPDATE_LLS 0
>  cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  %define covarq ctxq
>  mov  countd, [ctxq + LLSModel.indep_count]
> @@ -140,6 +139,18 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  vbroadcastsd ymm6, [varq + iq*8 + 16]
>  vbroadcastsd ymm7, [varq + iq*8 + 24]
>  vextractf128 xmm3, ymm1, 1
> +%if cpuflag(fma3)
> +mova ymm0, COVAR(iq  ,0)
> +mova xmm2, COVAR(iq+2,2)
> +vfmadd231pd ymm0, ymm1, ymm4
> +vfmadd231pd xmm2, xmm3, xmm6
> +vfmadd213pd ymm1, ymm5, COVAR(iq  ,1)
> +vfmadd213pd xmm3, xmm7, COVAR(iq+2,3)
> +mova COVAR(iq  ,0), ymm0
> +mova COVAR(iq  ,1), ymm1
> +mova COVAR(iq+2,2), xmm2
> +mova COVAR(iq+2,3), xmm3
> +%else
>  vmulpd  ymm0, ymm1, ymm4
>  vmulpd  ymm1, ymm1, ymm5
>  vmulpd  xmm2, xmm3, xmm6
> @@ -148,12 +159,27 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  ADDPD_MEM COVAR(iq  ,1), ymm1
>  ADDPD_MEM COVAR(iq+2,2), xmm2
>  ADDPD_MEM COVAR(iq+2,3), xmm3
> +%endif ; cpuflag(fma3)
>  lea jd, [iq + 4]
>  cmp jd, count2d
>  jg .skip4x4
>  .loop4x4:
>  ; Compute all 16 pairwise products of a 4x4 block
>  mova ymm3, [varq + jq*8]
> +%if cpuflag(fma3)
> +mova ymm0, COVAR(jq, 0)
> +mova ymm1, COVAR(jq, 1)
> +mova ymm2, COVAR(jq, 2)
> +mova ymm3, COVAR(jq, 3)
> +vfmadd231pd ymm0, ymm3, ymm4
> +vfmadd231pd ymm1, ymm3, ymm5
> +vfmadd231pd ymm2, ymm3, ymm6
> +vfmadd231pd ymm3, ymm3, ymm7
> +mova COVAR(jq, 0), ymm0
> +mova COVAR(jq, 1), ymm1
> +mova COVAR(jq, 2), ymm2
> +mova COVAR(jq, 3), ymm3
> +%else
>  vmulpd  ymm0, ymm3, ymm4
>  vmulpd  ymm1, ymm3, ymm5
>  vmulpd  ymm2, ymm3, ymm6
> @@ -162,6 +188,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  ADDPD_MEM COVAR(jq,1), ymm1
>  ADDPD_MEM COVAR(jq,2), ymm2
>  ADDPD_MEM COVAR(jq,3), ymm3
> +%endif ; cpuflag(fma3)
>  add jd, 4
>  cmp jd, count2d
>  jle .loop4x4
> @@ -169,6 +196,20 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  cmp jd, countd
>  jg .skip2x4
>  mova xmm3, [varq + jq*8]
> +%if cpuflag(fma3)
> +mova xmm0, COVAR(jq, 0)
> +mova xmm1, COVAR(jq, 1)
> +mova xmm2, COVAR(jq, 2)
> +mova xmm3, COVAR(jq, 3)
> +vfmadd231pd xmm0, xmm3, xmm4
> +vfmadd231pd xmm1, xmm3, xmm5
> +vfmadd231pd xmm2, xmm3, xmm6
> +vfmadd231pd xmm3, xmm3, xmm7
> +mova COVAR(jq, 0), xmm0
> +mova COVAR(jq, 1), xmm1
> +mova COVAR(jq, 2), xmm2
> +mova COVAR(jq, 3), xmm3
> +%else
>  vmulpd  xmm0, xmm3, xmm4
>  vmulpd  xmm1, xmm3, xmm5
>  vmulpd  xmm2, xmm3, xmm6
> @@ -177,6 +218,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  ADDPD_MEM COVAR(jq,1), xmm1
>  ADDPD_MEM COVAR(jq,2), xmm2
>  ADDPD_MEM COVAR(jq,3), xmm3
> +%endif ; cpuflag(fma3)
>  .skip2x4:
>  add id, 4
>  add covarq, 4*COVAR_STRIDE
> @@ -187,14 +229,29 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
>  mov jd, id
>  .loop2x1:
>  vmovddup xmm0, [varq + iq*8]
> +%if cpuflag(fma3)
> +mova xmm1, [varq + jq*8]
> +vfmadd213pd xmm0, xmm1, COVAR(jq,0)
> +mova COVAR(jq,0), xmm0
> +%else
>  vmulpd   xmm0, [varq + jq*8]
>  ADDPD_MEM COVAR(jq,0), xmm0
> +%endif ; cpuflag(fma3)
>  inc id
>  add covarq, COVAR_STRIDE
>  cmp id, countd
>  jle .loop2x1
>  .ret:
>  REP_RET
> +%endmacro ; UPDATE_LLS
> +
> +%if HAVE_AVX_EXTERNAL
> +INIT_YMM avx
> +UPDATE_LLS
> +%endif
> +%if HAVE_FMA3_EXTERNAL
> +INIT_YMM fma3
> +UPDATE_LLS
>  %endif
>
>  INIT_XMM sse2
> diff --git a/libavutil/x86/lls_init.c b/libavutil/x86/lls_init.c
> index 81f141c..9f0d862 100644
> --- a/libavutil/x86/lls_init.c
> +++ b/libavutil/x86/lls_init.c
> @@ -25,6