Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

J. Gareth Moreton Wed, 23 Oct 2019 14:08:20 -0700

Hmmm, that is unfortunate if the horizontal operations are inefficient. I had a look at them at https://www.agner.org/optimize/instruction_tables.pdf - you are right in that HADDPS has a surprisingly high latency (approximately how many cycles it takes to execute), although HADDPD isn't as bad, probably because it's only dealing with 2 Doubles instead of 4 Singles, and it seems mostly equivalent in speed to the multiplication instructions.


Using just SSE2:


mulpd %xmm0,%xmm0
shufpd %xmm0,%xmm1,1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

Ultimately it's not much better than what you have:

shufpd %xmm0,%xmm1,1 { Only needed if both fields are in %xmm0 }
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

If you measure the dependencies between the instructions (shufpd and the first mulsd can run simultaneously, or equivalently, the two mulsd instructions), it still amounts to 4 cycles, assuming each instruction takes an equal amount of time to execute (which they don't, but it's a reasonable approximation). The subroutines are also probably too small to get accurate timing metrics on them. It might be something to experiment on though - I would hope at the very least that the horizontal operations have improved in later years.
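The 4-cycle figure can be checked with a small sketch that treats each sequence as a dependency graph and finds its longest chain. This is only a model under the same assumption as above (every instruction costs one cycle); TInstr, CriticalPath, PackedSeq and ScalarSeq are hypothetical names made up for illustration, and the dependency lists are my reading of the two listings.

```pascal
program DepChain;
{$mode objfpc}

type
  { Hypothetical model: one instruction plus the indices of up to two
    earlier instructions whose results it needs (-1 = no dependency). }
  TInstr = record
    Name: string;
    Dep1, Dep2: Integer;
  end;

const
  { SSE2 version: mulpd, shufpd, addsd, sqrtsd }
  PackedSeq: array[0..3] of TInstr = (
    (Name: 'mulpd';  Dep1: -1; Dep2: -1),
    (Name: 'shufpd'; Dep1: 0;  Dep2: -1),
    (Name: 'addsd';  Dep1: 0;  Dep2: 1),
    (Name: 'sqrtsd'; Dep1: 2;  Dep2: -1)
  );
  { Scalar version: shufpd and the first mulsd are independent of each
    other (assuming register renaming removes the write-after-read on xmm0). }
  ScalarSeq: array[0..4] of TInstr = (
    (Name: 'shufpd';     Dep1: -1; Dep2: -1),
    (Name: 'mulsd xmm0'; Dep1: -1; Dep2: -1),
    (Name: 'mulsd xmm1'; Dep1: 0;  Dep2: -1),
    (Name: 'addsd';      Dep1: 1;  Dep2: 2),
    (Name: 'sqrtsd';     Dep1: 3;  Dep2: -1)
  );

{ Length of the longest dependency chain, counting one cycle per instruction.
  Dependencies must point at earlier entries in the array. }
function CriticalPath(const Code: array of TInstr): Integer;
var
  Depth: array of Integer;
  i, d: Integer;
begin
  Result := 0;
  SetLength(Depth, Length(Code));
  for i := 0 to High(Code) do
  begin
    d := 0;
    if (Code[i].Dep1 >= 0) and (Depth[Code[i].Dep1] > d) then
      d := Depth[Code[i].Dep1];
    if (Code[i].Dep2 >= 0) and (Depth[Code[i].Dep2] > d) then
      d := Depth[Code[i].Dep2];
    Depth[i] := d + 1;
    if Depth[i] > Result then
      Result := Depth[i];
  end;
end;

begin
  WriteLn('packed: ', CriticalPath(PackedSeq));
  WriteLn('scalar: ', CriticalPath(ScalarSeq));
end.
```

Both sequences come out with a chain of 4 under this model. Plugging in real latencies from the instruction tables instead of 1 would be the obvious next step, but I have not done that here.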

I know though that vectorising instructions is, by and large, a net gain. For example, let's go to a simpler example of adding two complex numbers together:


  operator + (z1, z2 : complex) z : complex; vectorcall;
  {$ifdef TEST_INLINE}
  inline;
  {$endif TEST_INLINE}
    { addition : z := z1 + z2 }
    begin
       z.re := z1.re + z2.re;
       z.im := z1.im + z2.im;
    end;

No horizontal adds here, just a simple packed addition and storing the result into %xmm0 as opposed to two scalar additions and then combining the result in whatever way is demanded (if aligned, it's all in %xmm0, if unaligned, I think %xmm0 and %xmm1 are supposed to be used). Mind you, in this case the function is inlined, so the parameter passing doesn't always apply.
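For anyone wanting to try this outside the unit, here is a minimal self-contained sketch of the same operator. It assumes a plain record of two Doubles; TComplexD and MakeComplex are hypothetical names for illustration, and vectorcall is left out so the target's default calling convention applies. Whether the compiler turns the two additions into a single packed add depends on the optimisation work being discussed, so the sketch only shows the source-level operation.

```pascal
program ComplexAdd;
{$mode objfpc}

type
  { Hypothetical stand-in for the complex record discussed in this thread }
  TComplexD = record
    re, im: Double;
  end;

{ Illustrative helper, not part of any existing unit }
function MakeComplex(aRe, aIm: Double): TComplexD;
begin
  Result.re := aRe;
  Result.im := aIm;
end;

{ addition : z := z1 + z2, field by field }
operator + (z1, z2: TComplexD) z: TComplexD; inline;
begin
  z.re := z1.re + z2.re;
  z.im := z1.im + z2.im;
end;

var
  c: TComplexD;
begin
  c := MakeComplex(1.0, 2.0) + MakeComplex(3.0, -0.5);
  WriteLn(c.re:0:2, ' ', c.im:0:2);  { prints 4.00 1.50 }
end.
```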

Once again though, I was surprised at how inefficient HADDPS is once you pointed it out. The double-precision versions aren't nearly as bad though, so maybe they can still be used.


Gareth aka. Kit

P.S. As far as 128-bit aligned vector types are concerned, vectorcall and the System V ABI can be considered equivalent. Vectorcall can use more MM registers for return values and more complex aggregates as parameters, but in our examples, we don't have to worry about that yet.



On 23/10/2019 21:20, Florian Klämpfl wrote:

On 22.10.19 at 05:01, J. Gareth Moreton wrote:
mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }

Unfortunately, those horizontal operations are normally not very efficient IIRC.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


