One issue with the current state of the intrinsics is that they don't
really follow the style common among other languages, and there's no
consensus yet about which path to take. Implementing the more common
style would be a lot of work, though, compared to the current
autogenerated approach.
Adding the AVX/AVX2 intrinsics isn't hard. I think I have done it on a
branch somewhere including a bunch of fixes.
On 4/21/20 8:29 PM, J. Gareth Moreton wrote:
Hi everyone,
I hope this doesn't become a monthly podcast for me or something, but
during my bursts of motivation, inspiration and creativity, I start to
plan and research things. There are a few things I'd like to develop
for FPC, mostly together because there's a lot of interdependency.
* SSE/AVX intrinsics
Most of the node types for the SSE instructions have been implemented,
as well as some wrapper functions that are disabled by default while
their format is finalised. The nodes that the compiler generates
would be useful when it comes to vectorisation, since a lot of things
like parameters and type checks will be already handled by them.
There are some gaps though. For example, AVX introduced more powerful
'mask move' instructions that allow you to read as well as write
partial vectors, which would be very useful when it comes to, say,
optimising algorithms that deal with 3-component vectors (very common
because 3-component vectors could represent 3D Cartesian coordinates
or an RGB triplet, for example).
* Vectorisation
I think this is probably the next big iteration for the compiler and
optimiser. Besides the obvious loop unrolling vectorisation, there
are a number of common algorithms that are logically easy to vectorise
but which may take some careful analysis to actually detect. One of
my test cases is the classic dot product. In raybench.pas, a
3-dimensional dot product appears as part of a function that returns a
vector's length: Sqrt(V.X*V.X + V.Y*V.Y + V.Z*V.Z). Under AVX, the
expression inside the square root can be optimised into a mask move
(so only the first 3 components of an XMM register are loaded with the
fields of V and the 4th component set to zero) and then all the
additions and multiplications are performed with a single instruction:
VDPPS XMM0, XMM0, XMM0, $71 ($71 specifically says 'only multiply
and horizontally add the first three components, and then store the
result only in the 1st component'; $FF will still work since the 4th
component is equal to zero and only the 1st component is read for the
result, but is a little more clumsy in my opinion).
My intention, at least for these kinds of algorithms, is to make use
of the new intrinsic nodes for specific SSE and AVX instructions,
although there are some intrinsics missing, like the aforementioned
mask move.
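
To make the $71 immediate concrete, here is a minimal scalar sketch of what the masked load followed by VDPPS computes. It is only a model in plain Pascal, not compiler output; TVec3, TVec4, LoadVec3 and DotPS are hypothetical names invented for this illustration, and the hardware may add the products in a different order, so rounding can differ slightly.

```pascal
program VdppsModel;
{$mode objfpc}

type
  TVec3 = record X, Y, Z: Single; end;
  TVec4 = array[0..3] of Single;

{ Models the masked load: three lanes come from V, the fourth is zeroed. }
function LoadVec3(const V: TVec3): TVec4;
begin
  Result[0] := V.X;
  Result[1] := V.Y;
  Result[2] := V.Z;
  Result[3] := 0.0;
end;

{ Scalar model of VDPPS on one 128-bit register.
  Bits 4..7 of Imm8 choose which products enter the sum;
  bits 0..3 choose which result lanes receive it (the rest become 0). }
function DotPS(const A, B: TVec4; Imm8: Byte): TVec4;
var
  I: Integer;
  Sum: Single;
begin
  Sum := 0.0;
  for I := 0 to 3 do
    if (Imm8 and (16 shl I)) <> 0 then
      Sum := Sum + A[I] * B[I];
  for I := 0 to 3 do
    if (Imm8 and (1 shl I)) <> 0 then
      Result[I] := Sum
    else
      Result[I] := 0.0;
end;

var
  V: TVec3;
  R: TVec4;
begin
  V.X := 1.0; V.Y := 2.0; V.Z := 2.0;
  { $71 = %0111_0001: multiply lanes 0..2, store the sum in lane 0 only. }
  R := DotPS(LoadVec3(V), LoadVec3(V), $71);
  WriteLn('Length = ', Sqrt(R[0]):0:3);
end.
```

With $FF instead of $71 the sum is the same here, because lane 3 is zero after the masked load; the only difference is that the result is broadcast to all four lanes.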
* Pure functions
It might be overly ambitious, but I seek to make the SSE/AVX
intrinsics much easier to use (it easily becomes inefficient in C++ if
you haven't got data alignments correct). One example I came up with
is using masks in SSE/AVX instructions. If you want to call, say,
x86_vmaskmovps (an intrinsic for VMASKMOVPS), you would have to set up
an additional _m128 store and load in a custom-made mask (e.g. const
M128Mask: _m128 = (-1.0; -1.0; -1.0; 0.0); ...
x86_vmaskmovps(DestAddr, M128Data, x86_movaps(M128Mask));). This
becomes more problematic if you need to specifically represent
$80000000 or $FFFFFFFF in one of the floating-point fields (the former
is negative zero, and the latter is one of millions of quiet NaN
representations). An example of a much cleaner solution could be
x86_vmaskmovps(DestAddr, M128Data, [True, True, True, False]);, with
an explicit typecast/assignment operator that converts an array of
Booleans into a mask that could be defined and implemented somewhere
in the RTL. Normally, this would be a prohibitively slow function to
execute, but if the typecast/assignment operator was defined as a pure
function, then it could be evaluated at compile time and the resultant
_m128 stored as an implicit constant that is loaded directly into an
XMM register when needed, without tasking the programmer with
floating-point bit manipulation in order to create said constant in
the code.
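
As a rough illustration of the Boolean-to-mask conversion described above, the sketch below builds the lane bit patterns with ordinary integer code. TMask128 and BoolsToMask are hypothetical names, not existing RTL declarations, and an array of LongWord stands in for the real 128-bit type. If a routine like this were marked pure, the loop could in principle be folded into a constant by the compiler.

```pascal
program MaskFromBooleans;
{$mode objfpc}

type
  { Hypothetical stand-in for a 128-bit mask: four 32-bit lanes. }
  TMask128 = array[0..3] of LongWord;

{ Converts up to four Booleans into lane masks: all bits set for True,
  no bits set for False. VMASKMOVPS only inspects the top bit of each
  lane, so $FFFFFFFF is one valid choice among several. }
function BoolsToMask(const Flags: array of Boolean): TMask128;
var
  I: Integer;
begin
  for I := 0 to 3 do
    if (I <= High(Flags)) and Flags[I] then
      Result[I] := $FFFFFFFF
    else
      Result[I] := 0;
end;

var
  M: TMask128;
  I: Integer;
begin
  M := BoolsToMask([True, True, True, False]);
  for I := 0 to 3 do
    WriteLn('Lane ', I, ': $', HexStr(M[I], 8));
end.
```

Working on integer lanes sidesteps the problem of spelling $80000000 or $FFFFFFFF as floating-point literals.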
* Aligned Allocation
This couples with SSE and AVX specifically, but has other uses such as
with paging, for example. Following in the footsteps of C11, I would
like to propose a couple of new intrinsic operations: GetMemAligned
and ReallocMemAligned, that allow you to reserve memory with an
alignment of your choice (with the constraint that it has to be a
power of 2 and at least the size of a Pointer). Having such intrinsics
will also allow the FPC language itself to better support aligned
dynamic arrays, for example.
C11's "aligned_alloc" is compatible with "free", while Microsoft's own
"_aligned_malloc" is not compatible with "free" and requires its own
"_aligned_free" call to properly release. Ideally I would rather find a
solution where GetMemAligned and ReallocMemAligned will work with
FreeMem without having unpredictable effects. This would be quite an
undertaking though since it would involve deep research into the
memory manager and ensuring all platforms have a means with which to
support it.
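
For comparison, the usual way to get aligned memory on top of an unmodified allocator is sketched below. GetMemAlignedSketch and FreeMemAlignedSketch are hypothetical helpers written for this illustration, not the proposed intrinsics. The sketch over-allocates, rounds the address up, and hides the original pointer just below the aligned block, which is exactly why a dedicated release routine is needed: FreeMem has to be given the pointer GetMem returned, not the adjusted one.

```pascal
program AlignedAllocSketch;
{$mode objfpc}

{ Hypothetical helper, not an existing RTL routine. Returns nil if
  Alignment is not a power of two or is smaller than a Pointer. }
function GetMemAlignedSketch(Size, Alignment: PtrUInt): Pointer;
var
  Raw: Pointer;
  Addr: PtrUInt;
begin
  Result := nil;
  if (Alignment < SizeOf(Pointer)) or
     ((Alignment and (Alignment - 1)) <> 0) then
    Exit;
  { Reserve room for the padding and for one hidden pointer. }
  Raw := GetMem(Size + Alignment + SizeOf(Pointer));
  if Raw = nil then
    Exit;
  { Round up to the next multiple of Alignment past the hidden slot. }
  Addr := (PtrUInt(Raw) + SizeOf(Pointer) + Alignment - 1)
          and not (Alignment - 1);
  { Remember what GetMem returned, directly below the aligned block. }
  PPointer(Addr - SizeOf(Pointer))^ := Raw;
  Result := Pointer(Addr);
end;

{ The matching release: recovers the original pointer for FreeMem. }
procedure FreeMemAlignedSketch(P: Pointer);
begin
  if P <> nil then
    FreeMem(PPointer(PtrUInt(P) - SizeOf(Pointer))^);
end;

var
  P: Pointer;
begin
  P := GetMemAlignedSketch(100, 32);
  WriteLn('Aligned to 32 bytes: ', (PtrUInt(P) and 31) = 0);
  FreeMemAlignedSketch(P);
end.
```

Passing the aligned pointer straight to FreeMem would hand the memory manager an address it never issued, so making plain FreeMem work would presumably need support inside the memory manager itself rather than a wrapper like this.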
----
I haven't fully organised myself with this yet. Looking at these
proposals as a dependency graph, I feel that pure functions is the
feature that doesn't depend on anything else, so I should focus my
efforts here first. I'll be writing up design specifications so
hopefully everyone else can understand what's going on and either
throw in suggestions, note where performance can be improved or plain
shoot something down if it's a very bad idea.
My personal vision... I would like to see Free Pascal being relatively
easy to use while still allowing access to powerful features like
intrinsics and having a powerful optimising compiler so games and
scientific programming can greatly benefit.
What are everyone's thoughts?
Gareth aka. Kit
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel