Hmmm, that is unfortunate if the horizontal operations are inefficient.  I looked them up in Agner Fog's instruction tables at https://www.agner.org/optimize/instruction_tables.pdf - you are right that HADDPS has a surprisingly high latency (roughly the number of cycles before its result is available to a dependent instruction), although HADDPD isn't as bad, probably because it only deals with 2 doubles instead of 4 singles, and it looks roughly on par with the multiplication instructions.

Using just SSE2:

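mulpd   %xmm0,%xmm0 { re*re in the low half, im*im in the high half }
movhlps %xmm0,%xmm1 { Copies the high half (im*im) into the low half of %xmm1; shufpd always takes its low half from the destination, so it cannot do this in one step }
addsd   %xmm1,%xmm0 { re*re + im*im }
sqrtsd  %xmm0,%xmm0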

Ultimately it's not much better than what you have:

movhlps %xmm0,%xmm1 { Only needed if both fields are in %xmm0; copies the high half of %xmm0 into the low half of %xmm1 }
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

If you follow the dependencies between the instructions (movhlps and the first mulsd can run simultaneously, or equivalently, the two mulsd instructions), the longest chain is still 4 instructions, the same as the packed version, so it still amounts to 4 cycles if you assume each instruction takes an equal amount of time to execute (which they don't, sqrtsd in particular being far slower, but it is common to both versions, so it's a reasonable approximation for comparing them).  The routines are also probably too small to time accurately.  It might be something to experiment on though - I would hope at the very least that the horizontal operations have improved on more recent processors.
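To make the comparison a bit more concrete, here is a rough sketch of both sequences using C intrinsics (C rather than Pascal only because it is easy to compile standalone; it assumes an x86-64 compiler with SSE2).  The names modulus_packed and modulus_scalar are made up for illustration, and the compiler is free to pick a different shuffle instruction from the one in the listings, so treat it as a model of the data flow rather than of the exact code FPC would emit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */

/* Packed variant: a single mulpd squares both fields at once. */
double modulus_packed(double re, double im)
{
    __m128d z  = _mm_set_pd(im, re);          /* low lane = re, high lane = im */
    __m128d sq = _mm_mul_pd(z, z);            /* mulpd: (re*re, im*im) */
    __m128d hi = _mm_unpackhi_pd(sq, sq);     /* shuffle: bring im*im down to the low lane */
    __m128d s  = _mm_add_sd(sq, hi);          /* addsd: re*re + im*im in the low lane */
    return _mm_cvtsd_f64(_mm_sqrt_sd(s, s));  /* sqrtsd, then read the low lane */
}

/* Scalar variant: split the fields first, then square each with mulsd. */
double modulus_scalar(double re, double im)
{
    __m128d z  = _mm_set_pd(im, re);          /* both fields in one register */
    __m128d hi = _mm_unpackhi_pd(z, z);       /* shuffle: im into the low lane of a second register */
    __m128d a  = _mm_mul_sd(z, z);            /* mulsd: re*re */
    __m128d b  = _mm_mul_sd(hi, hi);          /* mulsd: im*im (does not depend on the line above) */
    __m128d s  = _mm_add_sd(a, b);            /* addsd */
    return _mm_cvtsd_f64(_mm_sqrt_sd(s, s));  /* sqrtsd */
}
```

Both return 5.0 for re = 3, im = 4.  In each function the longest chain of dependent operations is four long (multiply, shuffle, add, square root in the first; shuffle, multiply, add, square root in the second), which is where the count of 4 comes from.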

I do know that vectorising is, by and large, a net gain.  To take a simpler example, consider adding two complex numbers together:

  operator + (z1, z2 : complex) z : complex; vectorcall;
  {$ifdef TEST_INLINE}
  inline;
  {$endif TEST_INLINE}
    { addition : z := z1 + z2 }
    begin
       z.re := z1.re + z2.re;
       z.im := z1.im + z2.im;
    end;

No horizontal adds here, just a single packed addition that leaves the result in %xmm0, as opposed to two scalar additions followed by combining the results in whatever way the calling convention demands (if the type is aligned, it all goes in %xmm0; if unaligned, I think %xmm0 and %xmm1 are supposed to be used).  Mind you, in this case the function is inlined, so the parameter passing doesn't always apply.
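As a rough illustration of the packed version, here is a sketch in C intrinsics, assuming an x86-64 compiler with SSE2.  complex_t and complex_add_packed are hypothetical names standing in for the Pascal record and operator, not anything from the RTL.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */

/* Hypothetical stand-in for the Pascal "complex" record. */
typedef struct { double re, im; } complex_t;

/* z := z1 + z2 using one packed addition instead of two scalar ones. */
complex_t complex_add_packed(complex_t z1, complex_t z2)
{
    __m128d a = _mm_set_pd(z1.im, z1.re);   /* low lane = re, high lane = im */
    __m128d b = _mm_set_pd(z2.im, z2.re);
    double out[2];
    _mm_storeu_pd(out, _mm_add_pd(a, b));   /* addpd: both fields in one instruction */
    complex_t z = { out[0], out[1] };       /* out[0] = re sum, out[1] = im sum */
    return z;
}
```

Adding 1 + 2i and 3 + 4i this way gives 4 + 6i from a single addpd.  The _mm_set_pd calls and the store only move the fields in and out of the register; with vectorcall and an aligned type I would expect the operands to be sitting in XMM registers already.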

Once again though, I was surprised at how inefficient HADDPS is once you pointed it out.  The double-precision versions aren't nearly as bad though, so maybe they can still be used.

Gareth aka. Kit

P.S. As far as 128-bit aligned vector types are concerned, vectorcall and the System V ABI can be considered equivalent. Vectorcall can use more XMM registers for return values and supports more complex aggregates as parameters, but in our examples, we don't have to worry about that yet.


On 23/10/2019 21:20, Florian Klämpfl wrote:
On 22.10.19 at 05:01, J. Gareth Moreton wrote:

mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }

Unfortunately, those horizontal operations are normally not very efficient IIRC.
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

