Hi everyone,

To start the story, I'm planning to use the nodes that were introduced for the SSE and AVX intrinsics as part of some vectorisation code, since they can help manage the code generation and contain some sanity checks.  I noticed, though, that some intrinsics are missing; for example, the VMASKMOV instructions, which were introduced with AVX and don't have a direct SSE equivalent (MASKMOVDQU and MASKMOVQ can only write to memory and work on integers rather than floating-point values, and mixing MM integer and floating-point instructions incurs a CPU state switch penalty).

Instructions like VMASKMOV are very useful because they would allow vectorisation of 3-component arrays (e.g. Cartesian coordinates), so I plan to look at introducing nodes for these instructions. Would this be okay to do?

On another note, I'm wondering if the RTL could benefit from some initializers for the MM types for ease of use.  For example, one of the registers used in VMASKMOV is a mask, and for a programmer using intrinsics, it would be convenient to be able to write something like "mmval := x86_vmaskmovps(Coord, [True, True, True, False]);".  Granted, I have to think about performance, since all those Boolean constants should ideally be merged into a single 128-bit memory block (as FFFFFFFFFFFFFFFFFFFFFFFF00000000) that is loaded into an XMM register with VMOVAPS.
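
To make the constant-merging idea a bit more concrete, here is a rough sketch in ordinary Pascal of the folding I have in mind. TMask128 and MakeMask are names I made up for illustration; they are not existing RTL types or compiler routines, and ideally the compiler would perform this step at compile time rather than at run time.

```pascal
program MaskSketch;
{$mode objfpc}

type
  { Hypothetical stand-in for a 128-bit mask; not an existing RTL type }
  TMask128 = array[0..3] of LongWord;

{ Turns up to four Booleans into four 32-bit lanes.
  A True lane becomes $FFFFFFFF; VMASKMOVPS itself only tests the
  top bit of each lane, so all-ones is simply a convenient encoding. }
function MakeMask(const Lanes: array of Boolean): TMask128;
var
  i: Integer;
begin
  for i := 0 to 3 do
    if (i <= High(Lanes)) and Lanes[i] then
      Result[i] := $FFFFFFFF
    else
      Result[i] := 0;
end;

var
  M: TMask128;
  i: Integer;
begin
  M := MakeMask([True, True, True, False]);
  { Print the lanes in memory order, lowest lane first }
  for i := 0 to 3 do
    Write(HexStr(M[i], 8));
  WriteLn;
end.
```

The lanes are written lowest-first, which is the order in which I wrote the 128-bit constant above.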

What would you suggest? I'm speaking a bit from experience here: C++ intrinsics can be a little cumbersome to use and easy to get wrong (at least as far as performance and alignment are concerned), and having the FPC ones be friendlier would make a world of difference.

I'm still working out quite a few things and experimenting a lot, and I'll be sure to write plenty of documentation.  However, any help or insight into the current design practices for intrinsics and their respective nodes would be greatly appreciated.

Gareth aka. Kit

P.S. Regarding vectorisation challenges, I'm looking at sequences like "V.X*V.X + V.Y*V.Y + V.Z*V.Z" (squared length of a 3-dimensional vector), which I would love to be able to naturally compile into:

VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VMULPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0

And then maybe take it further to produce:

VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VDPPS XMM0, XMM0, XMM0, $71 { 01110001b }

(This could be an optimisation at the node level rather than a peephole optimisation, although if the optimiser doesn't know exactly what VMASKMOVPS is doing, then the immediate in (V)DPPS will be forced to be $FF.)
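
For anyone not familiar with the (V)DPPS immediate, here is a small scalar model of how I understand it, written as plain Pascal. EmulateDpps and TVec4 are illustrative names of my own, not existing intrinsics or RTL types, and the model ignores the exact order in which the hardware adds the products.

```pascal
program DppsSketch;
{$mode objfpc}

type
  TVec4 = array[0..3] of Single;

{ Scalar model of (V)DPPS (illustrative only):
  bits 4..7 of Imm choose which lane products are summed,
  bits 0..3 choose which result lanes receive the sum (others are 0). }
function EmulateDpps(const A, B: TVec4; Imm: Byte): TVec4;
var
  i: Integer;
  Sum: Single;
begin
  Sum := 0;
  for i := 0 to 3 do
    if (Imm and (1 shl (i + 4))) <> 0 then
      Sum := Sum + A[i] * B[i];
  for i := 0 to 3 do
    if (Imm and (1 shl i)) <> 0 then
      Result[i] := Sum
    else
      Result[i] := 0;
end;

const
  { The fourth lane holds junk on purpose, to show what each immediate does with it }
  V: TVec4 = (1.0, 2.0, 3.0, 100.0);
var
  R: TVec4;
begin
  R := EmulateDpps(V, V, $71);  { lanes 0..2 only: 1 + 4 + 9 }
  WriteLn(R[0]:0:1);
  R := EmulateDpps(V, V, $FF);  { all four lanes: the junk lane is included }
  WriteLn(R[0]:0:1);
end.
```

With $71 the fourth lane never contributes, whereas with $FF the result is only correct if the masked load has already zeroed that lane.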



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
