[fpc-devel] State of SSE/AVX intrinsics

J. Gareth Moreton Mon, 20 Apr 2020 19:47:27 -0700

Hi everyone,

So to start the story, I'm planning to make use of the nodes that were introduced for the SSE and AVX intrinsics as part of some vectorisation code, since they can help manage the code generation and contain some sanity checks. I noticed though that some intrinsics are missing; for example, the VMASKMOV instructions, which were introduced with AVX and don't have a direct SSE equivalent (MASKMOVDQU and MASKMOVQ can only write to memory and work on integers rather than floating-point values, and mixing MM integer and floating-point instructions incurs a CPU state switch penalty).

Instructions like VMASKMOV are very useful because they would allow vectorisation of 3-component arrays (e.g. Cartesian coordinates), so I plan to look at introducing nodes for these instructions. Would this be okay to do?

On another note, I'm wondering if the RTL could benefit from some initializers for the MM types for ease of use. For example, one of the registers used in VMASKMOV is a mask, and for a programmer using intrinsics it would be handy to be able to write something like "mmval := x86_vmaskmovps(Coord, [True, True, True, False]);". Granted, I have to think about performance, since all those Boolean constants should ideally be merged into a single 128-bit memory block (FFFFFFFFFFFFFFFFFFFFFFFF00000000 in memory order) that is loaded into an XMM register with VMOVAPS.
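To make the mask idea concrete, here is a minimal sketch in plain C of how an array of Booleans could fold into a single 128-bit constant. The names mask128, make_mask and lane_enabled are hypothetical and exist only for this illustration; nothing here is an existing FPC or RTL API. It models the mask as four 32-bit lanes, each set to all ones for True and zero for False. The instruction itself only tests the top bit of each lane.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit mask: four 32-bit lanes, lane 0 at the lowest address. */
typedef struct { uint32_t lane[4]; } mask128;

/* Fold up to four Booleans into a mask: all ones for true, zero for false.
   Lanes beyond 'count' stay zero (disabled). */
static mask128 make_mask(const bool *flags, size_t count)
{
    mask128 m = { {0, 0, 0, 0} };
    assert(count <= 4);
    for (size_t i = 0; i < count; i++)
        m.lane[i] = flags[i] ? UINT32_C(0xFFFFFFFF) : 0;
    return m;
}

/* VMASKMOVPS only looks at the most significant bit of each lane. */
static bool lane_enabled(mask128 m, size_t i)
{
    assert(i < 4);
    return (m.lane[i] >> 31) != 0;
}
```

With flags of { true, true, true, false }, the lanes come out as FFFFFFFF, FFFFFFFF, FFFFFFFF, 00000000, which is the constant described above. A compiler that sees only constant Booleans could emit that block directly into the data section instead of building it at run time.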

What would you suggest? I'm just speaking a bit from experience in that using C++ intrinsics can get a little cumbersome sometimes and easy to get wrong (at least as far as performance and alignment are concerned, for example), and having the FPC ones be friendlier would make a world of difference.

I'm still working out quite a few things and experimenting a lot. I'll be sure to be doing a lot of documentation. However, any help or insight into the current design practices for intrinsics and their respective nodes will be greatly appreciated.


Gareth aka. Kit

P.S. Regarding vectorisation challenges, I'm looking at sequences like "V.X*V.X + V.Y*V.Y + V.Z*V.Z" (squared length of a 3-dimensional vector), which I would love to be able to naturally compile into:


VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VMULPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0

And then maybe take it further to produce:

VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VDPPS XMM0, XMM0, XMM0, $71 { 01110001b }

(This could be an optimisation at the node level rather than a peephole optimisation, although if it doesn't know exactly what VMASKMOVPS is doing, then the immediate in (V)DPPS will be forced to be $FF)
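To illustrate what the sequence computes, here is a scalar model in plain C of a masked load followed by a dot product with an immediate. It is only a sketch: vec4, masked_load, dot_product and squared_length3 are hypothetical names, and the model sums the products left to right, so rounding may differ from the real instruction. In C the corresponding intrinsics are _mm_maskload_ps and _mm_dp_ps.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float lane[4]; } vec4;

/* Model of a VMASKMOVPS load: a lane is read only if the top bit of its
   mask element is set; disabled lanes become zero and their memory is
   never touched, so a 3-float array is safe to load. */
static vec4 masked_load(const float *src, const uint32_t mask[4])
{
    vec4 r = { {0.0f, 0.0f, 0.0f, 0.0f} };
    for (int i = 0; i < 4; i++)
        if (mask[i] >> 31)
            r.lane[i] = src[i];
    return r;
}

/* Model of (V)DPPS: bits 4..7 of imm8 select which lane products are
   summed; bits 0..3 select which result lanes receive the sum (the
   others are zeroed). */
static vec4 dot_product(vec4 a, vec4 b, unsigned imm8)
{
    vec4 r = { {0.0f, 0.0f, 0.0f, 0.0f} };
    float sum = 0.0f;
    assert(imm8 <= 0xFFu);
    for (int i = 0; i < 4; i++)
        if (imm8 & (0x10u << i))
            sum += a.lane[i] * b.lane[i];
    for (int i = 0; i < 4; i++)
        if (imm8 & (1u << i))
            r.lane[i] = sum;
    return r;
}

/* X*X + Y*Y + Z*Z for a 3-component vector, mirroring the
   mask / masked load / dot product sequence. */
static float squared_length3(const float v[3])
{
    static const uint32_t mask_1110[4] =
        { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0u };
    vec4 x = masked_load(v, mask_1110);
    return dot_product(x, x, 0x71u).lane[0];
}
```

For V = (1, 2, 3), squared_length3 gives 14. Because the masked load zeroes the fourth lane, an immediate of $FF yields the same sum (broadcast to every lane), so the narrower $71 is a refinement that depends on knowing what the mask was.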


