Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>> read, nobody ever remembers the ordering of operands.
>
> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
> beyond me; especially with some side cases that are undocumented, see
> below.

fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
without having to think about what number belongs to what operation and what
operand. And it will output either FMA4 or FMA3 depending on the value passed
to INIT_[XY]MM.

> 2. If anything, the macro is harder, since it is not Intel supported,

Of course it won't be there, it's not defined by them. Non-destructive four
operand fma is defined by AMD.

> I can't look it up at
> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.

Neither are any of the dozens of other compat macros in x86utils. And many of
them are also undocumented within x86utils. This point is absurd.

> 3. The macro does not seem to take care of the mov's (if any), still
> requiring explicit thought on the part of the programmer.

Yes, and? It's not an emulation macro like the uppercase ones that become
several instructions. It translates a single FMA4-like instruction into
either an FMA4 or FMA3 one.

fmaddps xmm0, xmm0, xmm1, xmm2

becomes

vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
vfmadd132ps xmm0, xmm2, xmm1 if FMA3

If you try to use it with four different operands, it will work with FMA4
but not FMA3, since as I said it's not trying to emulate anything.

> 4. The macro lacks documentation. In particular, it is not a thorough
> fma4 emulation in the spirit of
> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>
> Or put in other words, IMO not good.

No, it's good and what's done in every other asm file precisely for being
more flexible and readable. Especially since it allows one to write both
FMA4 and FMA3 functions without duplicating code.

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
Use the x86inc syntax for FMA instructions (basically FMA4 syntax that gets assembled as FMA3) since normal FMA3 opcodes are horrible to read, nobody ever remembers the ordering of operands.
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
> read, nobody ever remembers the ordering of operands.

1. It is very easy to remember: take fmadd231pd x, y, z for instance.
This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
beyond me; especially with some side cases that are undocumented, see
below.

2. If anything, the macro is harder, since it is not Intel supported,
I can't look it up at
https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.

3. The macro does not seem to take care of the mov's (if any), still
requiring explicit thought on the part of the programmer.

4. The macro lacks documentation. In particular, it is not a thorough
fma4 emulation in the spirit of
https://gist.github.com/rygorous/22180ced9c7a00bd68dd.

Or put in other words, IMO not good.
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 5:26 PM, Ganesh Ajjanagadde wrote:
> readability still no.

"dst, mult1, mult2, add" is significantly more readable than
"src1, src2, src3" where you need to mentally parse which source
operand corresponds to which mathematical operator depending on the
order of the digits.

Compare the following instruction sequences which are identical (just
a random example I made up on the spot):

; m0 = m2 * m4 + m0
; m1 = m2 * m1 + m3
; m2 = m2 * m3 + m4

fmaddpd m0, m2, m4, m0
fmaddpd m1, m2, m1, m3
fmaddpd m2, m2, m3, m4

vfmadd231pd m0, m2, m4
vfmadd213pd m1, m2, m3
vfmadd132pd m2, m4, m3

In the first section it's immediately clear at a quick glance which
registers get multiplied by which. The second section on the other
hand takes some time to parse.
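The claim that the two sequences above are identical can be checked with a small C model. This is only a sketch: each register is modelled as a single double rather than a packed vector, and the helper names (vfmadd132, vfmadd213, vfmadd231, fmadd4, sequences_match) are made up for illustration and are not part of x86inc or FFmpeg.

```c
/* Toy model of the FMA3 operand orderings. Operand 1 is always the
 * destination; the digits in the mnemonic say which operands are
 * multiplied (first two digits) and which one is added (last digit). */
static void vfmadd132(double *op1, double op2, double op3) { *op1 = *op1 * op3 + op2; }
static void vfmadd213(double *op1, double op2, double op3) { *op1 = op2 * *op1 + op3; }
static void vfmadd231(double *op1, double op2, double op3) { *op1 = op2 * op3 + *op1; }

/* FMA4-style four-operand form: dst = mult1 * mult2 + add. */
static void fmadd4(double *dst, double mult1, double mult2, double add)
{
    *dst = mult1 * mult2 + add;
}

/* Runs both sequences from the message above on the same starting
 * values and returns 1 if every register ends up equal, 0 otherwise. */
static int sequences_match(void)
{
    double a[5] = { 2, 3, 5, 7, 11 };  /* m0..m4, four-operand sequence */
    double b[5] = { 2, 3, 5, 7, 11 };  /* m0..m4, FMA3 sequence */
    int i;

    fmadd4(&a[0], a[2], a[4], a[0]);   /* m0 = m2 * m4 + m0 */
    fmadd4(&a[1], a[2], a[1], a[3]);   /* m1 = m2 * m1 + m3 */
    fmadd4(&a[2], a[2], a[3], a[4]);   /* m2 = m2 * m3 + m4 */

    vfmadd231(&b[0], b[2], b[4]);
    vfmadd213(&b[1], b[2], b[3]);
    vfmadd132(&b[2], b[4], b[3]);

    for (i = 0; i < 5; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Small integer inputs are used so that every result is exact whether or not the compiler fuses the multiply and the add.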
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 6:54 PM, Henrik Gramner wrote:
> On Thu, Jan 14, 2016 at 11:47 PM, Ganesh Ajjanagadde wrote:
>> BTW, this is why I personally don't like the macro:
>> so I was moving along, replacing one after the other, till I came to this
>> line
>> vfmadd213pd ymm1, ymm5, COVAR(iq ,1)
>> I naturally replace by
>> fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
>> giving error "invalid combination of opcode and operand"
>> I could spend the time seeing why it is broken, but frankly don't
>> care. The point is, the macro is broken, and the lack of documentation
>> just bit back.
>
> Then that's a bug and it should be fixed. For the record I gave the
> code a quick glance and I'm pretty sure I know what the underlying
> problem is, I'll try to make a fix for it when I have the time to do
> so.

Thank you.

> The documentation basically states that it's an FMA3-emulation of
> FMA4-syntax, I'm personally not sure how much there is to expand on
> that but if you do have some concrete suggestions on what kind of
> documentation would be beneficial feel free to make your voice heard
> and maybe someone with knowledge of the code will improve it.

I think it is fine, assuming this bug (and possibly others) are fixed.
It may be good to update/add this to FATE, assuming there is infra for
these kinds of things.

> Just complaining that something is absurd and broken and how you don't
> care doesn't really accomplish much however.

Sorry, it was a needless rant. I was just unhappy with the general idea
to use the macro, and was justifying my own lack of use of it.
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 11:47 PM, Ganesh Ajjanagadde wrote:
> BTW, this is why I personally don't like the macro:
> so I was moving along, replacing one after the other, till I came to this line
> vfmadd213pd ymm1, ymm5, COVAR(iq ,1)
> I naturally replace by
> fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
> giving error "invalid combination of opcode and operand"
> I could spend the time seeing why it is broken, but frankly don't
> care. The point is, the macro is broken, and the lack of documentation
> just bit back.

Then that's a bug and it should be fixed. For the record I gave the
code a quick glance and I'm pretty sure I know what the underlying
problem is, I'll try to make a fix for it when I have the time to do
so.

The documentation basically states that it's an FMA3-emulation of
FMA4-syntax, I'm personally not sure how much there is to expand on
that but if you do have some concrete suggestions on what kind of
documentation would be beneficial feel free to make your voice heard
and maybe someone with knowledge of the code will improve it.

Just complaining that something is absurd and broken and how you don't
care doesn't really accomplish much however.
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 11:48 AM, James Almer wrote:
> On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 11:16 AM, James Almer wrote:
>>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
>>>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>>>> read, nobody ever remembers the ordering of operands.
>>>>
>>>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>>>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>>>> beyond me; especially with some side cases that are undocumented, see
>>>> below.
>>>
>>> fmaddps dst, src1, src2, src3 is always going to be easier to read for
>>> anyone without having to think about what number belongs to what
>>> operation and what operand. And it will output either FMA4 or FMA3
>>> depending on the value passed to INIT_[XY]MM.
>>
>> The fma3/fma4 thing is the only benefit. Even that is generally not a
>> big deal; AMD quickly started supporting fma3.
>
> Nobody is asking you to write an FMA4 version of this function. We're asking
> you to use the x86inc FMA4-like macros for readability purposes.
>
>>>> 2. If anything, the macro is harder, since it is not Intel supported,
>>>
>>> Of course it won't be there, it's not defined by them. Non-destructive four
>>> operand fma is defined by AMD.
>>
>> Of course I know this.
>>
>>>> I can't look it up at
>>>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>>>
>>> Neither are any of the dozens of other compat macros in x86utils. And many of
>>> them are also undocumented within x86utils. This point is absurd.
>>
>> How is it absurd? You expect me to use something that lacks clear
>> documentation, and claim that it is "more readable". What other macros
>> have/lack is irrelevant to the point.
>
> If you want documentation for FMA4 look at AMD docs, just like you didn't
> hesitate to look at Intel's.
>
>>>> 3. The macro does not seem to take care of the mov's (if any), still
>>>> requiring explicit thought on the part of the programmer.
>>>
>>> Yes, and? It's not an emulation macro like the uppercase ones that become
>>> several instructions. It translates a single FMA4-like instruction into
>>> either an FMA4 or FMA3 one.
>>>
>>> fmaddps xmm0, xmm0, xmm1, xmm2
>>>
>>> becomes
>>>
>>> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
>>> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>>>
>>> If you try to use it with four different operands, it will work with FMA4
>>> but not FMA3, since as I said it's not trying to emulate anything.
>>
>> Thanks for mentioning the convention; but this is an important one and
>> AFAIK not mentioned in any documentation within FFmpeg.
>>
>>>> 4. The macro lacks documentation. In particular, it is not a thorough
>>>> fma4 emulation in the spirit of
>>>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>>>
>>>> Or put in other words, IMO not good.
>>>
>>> No, it's good and what's done in every other asm file precisely for being
>>> more flexible and readable.
>>
>> Flexibility, yes, readability still no.
>
> dst = src1 * src2 + src3
>
> That's all you need to know to read an FMA4-like instruction. Are you going to
> tell me that the clusterfuck that's FMA3 with varying numbers that change the
> order of operations and meaning of operands is easier to read?

BTW, this is why I personally don't like the macro:
so I was moving along, replacing one after the other, till I came to this line
vfmadd213pd ymm1, ymm5, COVAR(iq ,1)
I naturally replace by
fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1)
giving error "invalid combination of opcode and operand"
I could spend the time seeing why it is broken, but frankly don't
care. The point is, the macro is broken, and the lack of documentation
just bit back.

fmaddpd ymm1, ymm5, ymm1, COVAR(iq,1) works though (switch order of mult).
And the idea of just looking at the amd docs does not help either, both
are perfectly fine for fma4.

All said, patchv2 posted.

> With the compat macros in x86inc, as long as two of the four operands are the
> same register then it's going to output the relevant FMA3 instruction for you.
>
>>> Especially since it allows one to write both
>>> FMA4 and FMA3 functions without duplicating code.
>>
>> Fine.
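One plausible reading of the failure reported above, offered as an assumption rather than a diagnosis confirmed anywhere in this thread: FMA3 instructions accept a memory operand only in the third operand slot, and the dst == src1 case quoted earlier (fmaddps xmm0, xmm0, xmm1, xmm2 becoming vfmadd132ps xmm0, xmm2, xmm1) moves src3 into the second slot. The sketch below models only that operand placement. The enum and function names are hypothetical, and the 213 and 231 rows of the mapping are inferred from the instruction semantics rather than taken from the x86inc source.

```c
/* Which source of "fmadd dst, src1, src2, src3" is the same register
 * as dst. */
enum dst_alias { DST_IS_SRC1, DST_IS_SRC2, DST_IS_SRC3 };

/* Operand slot (1, 2 or 3) of the emitted FMA3 instruction that ends up
 * holding src3, under the assumed mapping:
 *   dst == src1 -> vfmadd132 dst, src3, src2
 *   dst == src2 -> vfmadd213 dst, src1, src3
 *   dst == src3 -> vfmadd231 dst, src1, src2 */
static int fma3_slot_of_src3(enum dst_alias alias)
{
    switch (alias) {
    case DST_IS_SRC1: return 2;
    case DST_IS_SRC2: return 3;
    default:          return 1; /* src3 is the destination itself */
    }
}

/* FMA3 can encode a memory operand only in slot 3. Returns 1 when a
 * memory src3 would be encodable for this aliasing, 0 otherwise. */
static int src3_may_be_memory(enum dst_alias alias)
{
    return fma3_slot_of_src3(alias) == 3;
}
```

Under this model, fmaddpd ymm1, ymm1, ymm5, COVAR(iq,1) is the DST_IS_SRC1 case and is not encodable, while the swapped form fmaddpd ymm1, ymm5, ymm1, COVAR(iq,1) is the DST_IS_SRC2 case and is, which agrees with the behaviour described in the message.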
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 11:16 AM, James Almer wrote:
> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>> read, nobody ever remembers the ordering of operands.
>>
>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>> beyond me; especially with some side cases that are undocumented, see
>> below.
>
> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
> without having to think about what number belongs to what operation and what
> operand. And it will output either FMA4 or FMA3 depending on the value passed
> to INIT_[XY]MM.

The fma3/fma4 thing is the only benefit. Even that is generally not a
big deal; AMD quickly started supporting fma3.

>> 2. If anything, the macro is harder, since it is not Intel supported,
>
> Of course it won't be there, it's not defined by them. Non-destructive four
> operand fma is defined by AMD.

Of course I know this.

>> I can't look it up at
>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>
> Neither are any of the dozens of other compat macros in x86utils. And many of
> them are also undocumented within x86utils. This point is absurd.

How is it absurd? You expect me to use something that lacks clear
documentation, and claim that it is "more readable". What other macros
have/lack is irrelevant to the point.

>> 3. The macro does not seem to take care of the mov's (if any), still
>> requiring explicit thought on the part of the programmer.
>
> Yes, and? It's not an emulation macro like the uppercase ones that become
> several instructions. It translates a single FMA4-like instruction into
> either an FMA4 or FMA3 one.
>
> fmaddps xmm0, xmm0, xmm1, xmm2
>
> becomes
>
> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>
> If you try to use it with four different operands, it will work with FMA4
> but not FMA3, since as I said it's not trying to emulate anything.

Thanks for mentioning the convention; but this is an important one and
AFAIK not mentioned in any documentation within FFmpeg.

>> 4. The macro lacks documentation. In particular, it is not a thorough
>> fma4 emulation in the spirit of
>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>
>> Or put in other words, IMO not good.
>
> No, it's good and what's done in every other asm file precisely for being
> more flexible and readable.

Flexibility, yes, readability still no.

> Especially since it allows one to write both
> FMA4 and FMA3 functions without duplicating code.

Fine.
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Thu, Jan 14, 2016 at 11:48 AM, James Almer wrote:
> On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 14, 2016 at 11:16 AM, James Almer wrote:
>>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
[...]

There is no need for discussion; I have already said it is fine and am
amending the patch. It is really a personal thing; I prefer explicit
ops when working at that low a level. Even mova I would have changed;
just wanted to keep code consistency. It is for this reason that I
will change it.

[...]
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On 1/14/2016 1:26 PM, Ganesh Ajjanagadde wrote:
> On Thu, Jan 14, 2016 at 11:16 AM, James Almer wrote:
>> On 1/14/2016 11:12 AM, Ganesh Ajjanagadde wrote:
>>> On Thu, Jan 14, 2016 at 5:02 AM, Henrik Gramner wrote:
>>>> Use the x86inc syntax for FMA instructions (basically FMA4 syntax that
>>>> gets assembled as FMA3) since normal FMA3 opcodes are horrible to
>>>> read, nobody ever remembers the ordering of operands.
>>>
>>> 1. It is very easy to remember: take fmadd231pd x, y, z for instance.
>>> This means 2*3 + 1, so x = y*z+x. How the macro is more readable is
>>> beyond me; especially with some side cases that are undocumented, see
>>> below.
>>
>> fmaddps dst, src1, src2, src3 is always going to be easier to read for anyone
>> without having to think about what number belongs to what operation and what
>> operand. And it will output either FMA4 or FMA3 depending on the value passed
>> to INIT_[XY]MM.
>
> The fma3/fma4 thing is the only benefit. Even that is generally not a
> big deal; AMD quickly started supporting fma3.

Nobody is asking you to write an FMA4 version of this function. We're asking
you to use the x86inc FMA4-like macros for readability purposes.

>>> 2. If anything, the macro is harder, since it is not Intel supported,
>>
>> Of course it won't be there, it's not defined by them. Non-destructive four
>> operand fma is defined by AMD.
>
> Of course I know this.
>
>>> I can't look it up at
>>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf.
>>
>> Neither are any of the dozens of other compat macros in x86utils. And many of
>> them are also undocumented within x86utils. This point is absurd.
>
> How is it absurd? You expect me to use something that lacks clear
> documentation, and claim that it is "more readable". What other macros
> have/lack is irrelevant to the point.

If you want documentation for FMA4 look at AMD docs, just like you didn't
hesitate to look at Intel's.

>>> 3. The macro does not seem to take care of the mov's (if any), still
>>> requiring explicit thought on the part of the programmer.
>>
>> Yes, and? It's not an emulation macro like the uppercase ones that become
>> several instructions. It translates a single FMA4-like instruction into
>> either an FMA4 or FMA3 one.
>>
>> fmaddps xmm0, xmm0, xmm1, xmm2
>>
>> becomes
>>
>> vfmaddps xmm0, xmm0, xmm1, xmm2 if FMA4
>> vfmadd132ps xmm0, xmm2, xmm1 if FMA3
>>
>> If you try to use it with four different operands, it will work with FMA4
>> but not FMA3, since as I said it's not trying to emulate anything.
>
> Thanks for mentioning the convention; but this is an important one and
> AFAIK not mentioned in any documentation within FFmpeg.
>
>>> 4. The macro lacks documentation. In particular, it is not a thorough
>>> fma4 emulation in the spirit of
>>> https://gist.github.com/rygorous/22180ced9c7a00bd68dd.
>>>
>>> Or put in other words, IMO not good.
>>
>> No, it's good and what's done in every other asm file precisely for being
>> more flexible and readable.
>
> Flexibility, yes, readability still no.

dst = src1 * src2 + src3

That's all you need to know to read an FMA4-like instruction. Are you going to
tell me that the clusterfuck that's FMA3 with varying numbers that change the
order of operations and meaning of operands is easier to read?

With the compat macros in x86inc, as long as two of the four operands are the
same register then it's going to output the relevant FMA3 instruction for you.

>> Especially since it allows one to write both
>> FMA4 and FMA3 functions without duplicating code.
>
> Fine.
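The rule described in the message above (one of the sources must be the same register as the destination, and the macro then picks the matching FMA3 form) can be sketched in C as a string-level translation. This is an illustrative model under stated assumptions, not the x86inc implementation: fma4_to_fma3 is a hypothetical helper, only the dst == src1 case is taken directly from the example in this thread, and the other two cases are derived from the FMA3 operand semantics.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not x86inc code). Given the operands of
 *     fmadd<suffix> dst, src1, src2, src3    ; dst = src1 * src2 + src3
 * writes an FMA3 instruction computing the same result into out.
 * Returns 0 on success, or -1 when dst matches none of the sources, in
 * which case a single destructive FMA3 instruction cannot express it. */
static int fma4_to_fma3(const char *suffix, const char *dst,
                        const char *src1, const char *src2, const char *src3,
                        char *out, size_t outsz)
{
    if (strcmp(dst, src1) == 0)        /* dst = dst * src2 + src3 */
        snprintf(out, outsz, "vfmadd132%s %s, %s, %s", suffix, dst, src3, src2);
    else if (strcmp(dst, src2) == 0)   /* dst = src1 * dst + src3 */
        snprintf(out, outsz, "vfmadd213%s %s, %s, %s", suffix, dst, src1, src3);
    else if (strcmp(dst, src3) == 0)   /* dst = src1 * src2 + dst */
        snprintf(out, outsz, "vfmadd231%s %s, %s, %s", suffix, dst, src1, src2);
    else
        return -1;
    return 0;
}
```

Called with ("ps", "xmm0", "xmm0", "xmm1", "xmm2"), the helper produces the same vfmadd132ps xmm0, xmm2, xmm1 line quoted in the message, and with four distinct registers it returns -1, mirroring the statement that four different operands work with FMA4 but not FMA3.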
[FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
This improves accuracy (very slightly) and speed for processors having
fma3.

Sample benchmark (fate flac-16-lpc-cholesky, Haswell):
old:
5993610 decicycles in ff_lpc_calc_coefs, 64 runs, 0 skips
5951528 decicycles in ff_lpc_calc_coefs, 128 runs, 0 skips

new:
5252410 decicycles in ff_lpc_calc_coefs, 64 runs, 0 skips
5232869 decicycles in ff_lpc_calc_coefs, 128 runs, 0 skips

Tested with FATE and --disable-fma3, also examined contents of
lavu/lls-test.

Signed-off-by: Ganesh Ajjanagadde
---
 libavutil/x86/lls.asm    | 61 ++--
 libavutil/x86/lls_init.c |  4
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/libavutil/x86/lls.asm b/libavutil/x86/lls.asm
index 769befb..358603a 100644
--- a/libavutil/x86/lls.asm
+++ b/libavutil/x86/lls.asm
@@ -125,8 +125,7 @@ cglobal update_lls, 2,5,8, ctx, var, i, j, covar2
 .ret:
     REP_RET
 
-%if HAVE_AVX_EXTERNAL
-INIT_YMM avx
+%macro UPDATE_LLS 0
 cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     %define covarq ctxq
     mov countd, [ctxq + LLSModel.indep_count]
@@ -140,6 +139,18 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     vbroadcastsd ymm6, [varq + iq*8 + 16]
     vbroadcastsd ymm7, [varq + iq*8 + 24]
     vextractf128 xmm3, ymm1, 1
+%if cpuflag(fma3)
+    mova ymm0, COVAR(iq ,0)
+    mova xmm2, COVAR(iq+2,2)
+    vfmadd231pd ymm0, ymm1, ymm4
+    vfmadd231pd xmm2, xmm3, xmm6
+    vfmadd213pd ymm1, ymm5, COVAR(iq ,1)
+    vfmadd213pd xmm3, xmm7, COVAR(iq+2,3)
+    mova COVAR(iq ,0), ymm0
+    mova COVAR(iq ,1), ymm1
+    mova COVAR(iq+2,2), xmm2
+    mova COVAR(iq+2,3), xmm3
+%else
     vmulpd ymm0, ymm1, ymm4
     vmulpd ymm1, ymm1, ymm5
     vmulpd xmm2, xmm3, xmm6
@@ -148,12 +159,27 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     ADDPD_MEM COVAR(iq ,1), ymm1
     ADDPD_MEM COVAR(iq+2,2), xmm2
     ADDPD_MEM COVAR(iq+2,3), xmm3
+%endif ; cpuflag(fma3)
     lea jd, [iq + 4]
     cmp jd, count2d
     jg .skip4x4
 .loop4x4:
     ; Compute all 16 pairwise products of a 4x4 block
     mova ymm3, [varq + jq*8]
+%if cpuflag(fma3)
+    mova ymm0, COVAR(jq, 0)
+    mova ymm1, COVAR(jq, 1)
+    mova ymm2, COVAR(jq, 2)
+    mova ymm3, COVAR(jq, 3)
+    vfmadd231pd ymm0, ymm3, ymm4
+    vfmadd231pd ymm1, ymm3, ymm5
+    vfmadd231pd ymm2, ymm3, ymm6
+    vfmadd231pd ymm3, ymm3, ymm7
+    mova COVAR(jq, 0), ymm0
+    mova COVAR(jq, 1), ymm1
+    mova COVAR(jq, 2), ymm2
+    mova COVAR(jq, 3), ymm3
+%else
     vmulpd ymm0, ymm3, ymm4
     vmulpd ymm1, ymm3, ymm5
     vmulpd ymm2, ymm3, ymm6
@@ -162,6 +188,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     ADDPD_MEM COVAR(jq,1), ymm1
     ADDPD_MEM COVAR(jq,2), ymm2
     ADDPD_MEM COVAR(jq,3), ymm3
+%endif ; cpuflag(fma3)
     add jd, 4
     cmp jd, count2d
     jle .loop4x4
@@ -169,6 +196,20 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     cmp jd, countd
     jg .skip2x4
     mova xmm3, [varq + jq*8]
+%if cpuflag(fma3)
+    mova xmm0, COVAR(jq, 0)
+    mova xmm1, COVAR(jq, 1)
+    mova xmm2, COVAR(jq, 2)
+    mova xmm3, COVAR(jq, 3)
+    vfmadd231pd xmm0, xmm3, xmm4
+    vfmadd231pd xmm1, xmm3, xmm5
+    vfmadd231pd xmm2, xmm3, xmm6
+    vfmadd231pd xmm3, xmm3, xmm7
+    mova COVAR(jq, 0), xmm0
+    mova COVAR(jq, 1), xmm1
+    mova COVAR(jq, 2), xmm2
+    mova COVAR(jq, 3), xmm3
+%else
     vmulpd xmm0, xmm3, xmm4
     vmulpd xmm1, xmm3, xmm5
     vmulpd xmm2, xmm3, xmm6
@@ -177,6 +218,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     ADDPD_MEM COVAR(jq,1), xmm1
     ADDPD_MEM COVAR(jq,2), xmm2
     ADDPD_MEM COVAR(jq,3), xmm3
+%endif ; cpuflag(fma3)
 .skip2x4:
     add id, 4
     add covarq, 4*COVAR_STRIDE
@@ -187,14 +229,29 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2
     mov jd, id
 .loop2x1:
     vmovddup xmm0, [varq + iq*8]
+%if cpuflag(fma3)
+    mova xmm1, [varq + jq*8]
+    vfmadd213pd xmm0, xmm1, COVAR(jq,0)
+    mova COVAR(jq,0), xmm0
+%else
     vmulpd xmm0, [varq + jq*8]
     ADDPD_MEM COVAR(jq,0), xmm0
+%endif ; cpuflag(fma3)
     inc id
     add covarq, COVAR_STRIDE
     cmp id, countd
     jle .loop2x1
 .ret:
     REP_RET
+%endmacro ; UPDATE_LLS
+
+%if HAVE_AVX_EXTERNAL
+INIT_YMM avx
+UPDATE_LLS
+%endif
+%if HAVE_FMA3_EXTERNAL
+INIT_YMM fma3
+UPDATE_LLS
 %endif
 
 INIT_XMM sse2
diff --git a/libavutil/x86/lls_init.c b/libavutil/x86/lls_init.c
index 81f141c..9f0d862 100644
--- a/libavutil/x86/lls_init.c
+++ b/libavutil/x86/lls_init.c
@@ -25,6 +25,7 @@
 
 void ff_update_lls_sse2(LLSModel *m, const double *var);
 void ff_update_lls_avx(LLSModel *m, const double *var);
+void ff_update_lls_fma3(LLSModel *m, const double *var);
 double ff_evaluate_lls_sse2(LLSModel *m, const double *var, int order);
 
 av_cold void ff_init_lls_x86(LLSModel *m)
@@ -38,4 +39,7 @@ av_cold void ff_init_lls_x86(LLSModel *m)
     if
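The accuracy remark in the commit message is consistent with how a fused multiply-add rounds: the product is not rounded before the addition. As a brief worked example (the values of a, b and c are chosen here purely for illustration and do not come from the patch), write fl(.) for rounding to the nearest double:

```latex
\begin{align*}
\text{separate multiply and add:}\quad
  & \operatorname{fl}\bigl(\operatorname{fl}(a \cdot b) + c\bigr)
  && \text{(two roundings)} \\
\text{fused multiply-add:}\quad
  & \operatorname{fl}(a \cdot b + c)
  && \text{(one rounding)}
\end{align*}

Take $a = 1 + 2^{-27}$, $b = 1 - 2^{-27}$, $c = -1$. Then
\begin{align*}
a \cdot b &= 1 - 2^{-54} \quad \text{(exact)}, \\
\operatorname{fl}(a \cdot b) &= 1
  \quad \text{(a tie between } 1 - 2^{-53} \text{ and } 1 \text{, rounded to even)}, \\
\operatorname{fl}\bigl(\operatorname{fl}(a \cdot b) + c\bigr) &= 0, \\
\operatorname{fl}(a \cdot b + c) &= -2^{-54}.
\end{align*}
```

The fused form returns the exact result here, while the two-step form loses it entirely. For typical data the difference is far smaller, which fits the "very slightly" wording above.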
Re: [FFmpeg-devel] [PATCH] lavu/x86/lls: add fma3 optimizations for update_lls
On Wed, Jan 13, 2016 at 6:59 PM, Ganesh Ajjanagaddewrote: > This improves accuracy (very slightly) and speed for processors having > fma3. > > Sample benchmark (fate flac-16-lpc-cholesky, Haswell): > old: > 5993610 decicycles in ff_lpc_calc_coefs, 64 runs, 0 skips > 5951528 decicycles in ff_lpc_calc_coefs, 128 runs, 0 skips > > new: > 5252410 decicycles in ff_lpc_calc_coefs, 64 runs, 0 skips > 5232869 decicycles in ff_lpc_calc_coefs, 128 runs, 0 skips > > Tested with FATE and --disable-fma3, also examined contents of > lavu/lls-test. > > Signed-off-by: Ganesh Ajjanagadde > --- > libavutil/x86/lls.asm| 61 > ++-- > libavutil/x86/lls_init.c | 4 > 2 files changed, 63 insertions(+), 2 deletions(-) > > diff --git a/libavutil/x86/lls.asm b/libavutil/x86/lls.asm > index 769befb..358603a 100644 > --- a/libavutil/x86/lls.asm > +++ b/libavutil/x86/lls.asm > @@ -125,8 +125,7 @@ cglobal update_lls, 2,5,8, ctx, var, i, j, covar2 > .ret: > REP_RET > > -%if HAVE_AVX_EXTERNAL > -INIT_YMM avx > +%macro UPDATE_LLS 0 > cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > %define covarq ctxq > mov countd, [ctxq + LLSModel.indep_count] > @@ -140,6 +139,18 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > vbroadcastsd ymm6, [varq + iq*8 + 16] > vbroadcastsd ymm7, [varq + iq*8 + 24] > vextractf128 xmm3, ymm1, 1 > +%if cpuflag(fma3) > +mova ymm0, COVAR(iq ,0) > +mova xmm2, COVAR(iq+2,2) > +vfmadd231pd ymm0, ymm1, ymm4 > +vfmadd231pd xmm2, xmm3, xmm6 > +vfmadd213pd ymm1, ymm5, COVAR(iq ,1) > +vfmadd213pd xmm3, xmm7, COVAR(iq+2,3) > +mova COVAR(iq ,0), ymm0 > +mova COVAR(iq ,1), ymm1 > +mova COVAR(iq+2,2), xmm2 > +mova COVAR(iq+2,3), xmm3 > +%else > vmulpd ymm0, ymm1, ymm4 > vmulpd ymm1, ymm1, ymm5 > vmulpd xmm2, xmm3, xmm6 > @@ -148,12 +159,27 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > ADDPD_MEM COVAR(iq ,1), ymm1 > ADDPD_MEM COVAR(iq+2,2), xmm2 > ADDPD_MEM COVAR(iq+2,3), xmm3 > +%endif ; cpuflag(fma3) > lea jd, [iq + 4] > cmp jd, count2d > jg .skip4x4 > 
.loop4x4: > ; Compute all 16 pairwise products of a 4x4 block > movaymm3, [varq + jq*8] > +%if cpuflag(fma3) > +mova ymm0, COVAR(jq, 0) > +mova ymm1, COVAR(jq, 1) > +mova ymm2, COVAR(jq, 2) > +mova ymm3, COVAR(jq, 3) > +vfmadd231pd ymm0, ymm3, ymm4 > +vfmadd231pd ymm1, ymm3, ymm5 > +vfmadd231pd ymm2, ymm3, ymm6 > +vfmadd231pd ymm3, ymm3, ymm7 > +mova COVAR(jq, 0), ymm0 > +mova COVAR(jq, 1), ymm1 > +mova COVAR(jq, 2), ymm2 > +mova COVAR(jq, 3), ymm3 > +%else > vmulpd ymm0, ymm3, ymm4 > vmulpd ymm1, ymm3, ymm5 > vmulpd ymm2, ymm3, ymm6 > @@ -162,6 +188,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > ADDPD_MEM COVAR(jq,1), ymm1 > ADDPD_MEM COVAR(jq,2), ymm2 > ADDPD_MEM COVAR(jq,3), ymm3 > +%endif ; cpuflag(fma3) > add jd, 4 > cmp jd, count2d > jle .loop4x4 > @@ -169,6 +196,20 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > cmp jd, countd > jg .skip2x4 > movaxmm3, [varq + jq*8] > +%if cpuflag(fma3) > +mova xmm0, COVAR(jq, 0) > +mova xmm1, COVAR(jq, 1) > +mova xmm2, COVAR(jq, 2) > +mova xmm3, COVAR(jq, 3) > +vfmadd231pd xmm0, xmm3, xmm4 > +vfmadd231pd xmm1, xmm3, xmm5 > +vfmadd231pd xmm2, xmm3, xmm6 > +vfmadd231pd xmm3, xmm3, xmm7 > +mova COVAR(jq, 0), xmm0 > +mova COVAR(jq, 1), xmm1 > +mova COVAR(jq, 2), xmm2 > +mova COVAR(jq, 3), xmm3 > +%else > vmulpd xmm0, xmm3, xmm4 > vmulpd xmm1, xmm3, xmm5 > vmulpd xmm2, xmm3, xmm6 > @@ -177,6 +218,7 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > ADDPD_MEM COVAR(jq,1), xmm1 > ADDPD_MEM COVAR(jq,2), xmm2 > ADDPD_MEM COVAR(jq,3), xmm3 > +%endif ; cpuflag(fma3) > .skip2x4: > add id, 4 > add covarq, 4*COVAR_STRIDE > @@ -187,14 +229,29 @@ cglobal update_lls, 3,6,8, ctx, var, count, i, j, count2 > mov jd, id > .loop2x1: > vmovddup xmm0, [varq + iq*8] > +%if cpuflag(fma3) > +mova xmm1, [varq + jq*8] > +vfmadd213pd xmm0, xmm1, COVAR(jq,0) > +mova COVAR(jq,0), xmm0 > +%else > vmulpd xmm0, [varq + jq*8] > ADDPD_MEM COVAR(jq,0), xmm0 > +%endif ; cpuflag(fma3) > inc id > add covarq, COVAR_STRIDE > 
cmp id, countd > jle .loop2x1 > .ret: > REP_RET > +%endmacro ; UPDATE_LLS > + > +%if HAVE_AVX_EXTERNAL > +INIT_YMM avx > +UPDATE_LLS > +%endif > +%if HAVE_FMA3_EXTERNAL > +INIT_YMM fma3 > +UPDATE_LLS > %endif > > INIT_XMM sse2 > diff --git a/libavutil/x86/lls_init.c b/libavutil/x86/lls_init.c > index 81f141c..9f0d862 100644 > --- a/libavutil/x86/lls_init.c > +++ b/libavutil/x86/lls_init.c > @@ -25,6