On 21.10.19 at 00:57, J. Gareth Moreton wrote:
Hi everyone,

I'm trying to make some optimisation improvements to UComplex so the compiler can take advantage of SSE2 or AVX features without needing to write specialised code (other than using the "vectorcall" directive under Win64).  I am having some difficulty though.

The record type "complex" is defined as follows:

type complex = record
                      re : real;
                      im : real;
                    end;

(Real is equivalent to Double on x86_64)

This also corresponds with how a complex number is defined for Extended Pascal.  Currently, when compiled under x86_64-win64, the fields are placed on 8-byte boundaries, but because the type as a whole is also on an 8-byte boundary (not 16-byte), the compiler cannot take advantage of the XMM registers when passing such a construct as a parameter or return value, and hence has to pass it by reference.  For high-speed scientific programming, this quickly adds up to a notable penalty.  For example, the compiled assembly language for adding together two complex numbers on x86_64-win64 ("Z := Z + X;"):

     movsd    U_$P$COMPLEX_$$_Z(%rip),%xmm0
     addsd    U_$P$COMPLEX_$$_X(%rip),%xmm0
     movsd    %xmm0,40(%rsp)
     movsd    U_$P$COMPLEX_$$_Z+8(%rip),%xmm0
     addsd    U_$P$COMPLEX_$$_X+8(%rip),%xmm0
     movsd    %xmm0,48(%rsp)
     movq    40(%rsp),%rax
     movq    %rax,U_$P$COMPLEX_$$_Z(%rip)
     movq    48(%rsp),%rax
     movq    %rax,U_$P$COMPLEX_$$_Z+8(%rip)

Even if the reads and writes to memory cannot be removed, treating the complex data type as an aligned array of doubles should be able to yield far more efficient code (might require some compiler quirks so it detects the component-wise addition in the inlined + operator for the complex type):

     movapd   U_$P$COMPLEX_$$_Z(%rip),%xmm0
     addpd    U_$P$COMPLEX_$$_X(%rip),%xmm0
     movapd   %xmm0,U_$P$COMPLEX_$$_Z(%rip)

The problem here is that there's no practical way to force the entire record's alignment onto a 16-byte boundary (a requirement for "vectorcall") without also snapping each individual field to such a boundary.  Strictly speaking, I don't think the 16-byte boundary is a requirement for the System V ABI (the Unix calling convention for 64-bit Intel processors),

The stack is 16-byte aligned; aligning data is up to the compiler.

and there are unaligned move instructions to accommodate this (which have traditionally been slightly slower than their aligned counterparts), but currently the Free Pascal Compiler demands the alignment, mainly because of shared compiler code between Windows and non-Windows builds.

Each target can have its own alignment requirements.

The only way to enforce a construct where the record is on a 16-byte boundary but the two 8-byte fields are packed is to use an array element, e.g.:

   {$codealign RECORDMIN=16}
type complex = record
                      part: array[0..1] of real;
                    end;

Mapping "re" to "part[0]" and "im" to "part[1]" using a union is impossible because "im" will be put on the next 16-byte boundary and be its own separate entity.  Other constructs such as nested unions are possible, but this will break backward compatibility with code that uses the uComplex unit.
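
The layout claim above can be checked directly. The following is only a minimal probe, not code from the uComplex unit: the program name and the ALIGN16 define are made up for illustration, and the variant record is one possible way to write the union being discussed. Compile it once as is and once with -dALIGN16, then compare the offset reported for "im".

```pascal
program variantprobe;
{$mode objfpc}

{ ALIGN16 is a hypothetical define used only by this probe;
  pass -dALIGN16 to the compiler to enable the directive. }
{$ifdef ALIGN16}
  {$codealign RECORDMIN=16}
{$endif}

type
  { Union (variant record) mapping re/im onto part[0]/part[1]. }
  complex = record
    case Boolean of
      False: (re, im : real);
      True:  (part : array[0..1] of real);
  end;

var
  z : complex;
begin
  z.re := 1.5;
  z.im := -2.0;
  WriteLn('SizeOf(complex)    = ', SizeOf(complex));
  WriteLn('offset of re       = ', PtrUInt(@z.re) - PtrUInt(@z));
  WriteLn('offset of im       = ', PtrUInt(@z.im) - PtrUInt(@z));
  WriteLn('offset of part[1]  = ', PtrUInt(@z.part[1]) - PtrUInt(@z));
  { True only if im and part[1] really share the same storage. }
  WriteLn('part[1] aliases im = ', z.part[1] = z.im);
end.
```

If "im" and "part[1]" report different offsets in the ALIGN16 build, the union no longer aliases the two views, which is the situation described above.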

A while ago I requested a means to specify an alignment on a per-type basis so that it is easier for third-party programmers to take advantage of the extra efficiency brought about by vectorcall and the System V ABI. This effectively boils down to being able to define something akin to the following:

type complex = record
                      re : real;
                      im : real;
end {$ifdef CPUX86_64} align 16 {$endif CPUX86_64};

It was assigned to Maciej last year, but hasn't seen any progress since.

If not that alignment feature, is there any other way to cleanly enforce a 16-byte boundary for such a packed type without having to completely redesign it to the point that it breaks compatibility?

What's the problem with

{$codealign RECORDMIN=16}
type complex = record
                      re : real;
                      im : real;
                    end;
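
Whether this directive keeps the two fields packed or pushes "im" to the next 16-byte boundary is exactly the point in dispute in this thread, so it is worth measuring rather than assuming. Below is a small sketch of such a check; the program name is arbitrary and nothing here comes from the uComplex sources.

```pascal
program alignprobe;
{$mode objfpc}

{$codealign RECORDMIN=16}
type
  complex = record
    re : real;
    im : real;
  end;

var
  z : complex;
begin
  { Total size: 16 would mean the fields are packed. }
  WriteLn('SizeOf(complex) = ', SizeOf(complex));
  { Field offsets relative to the start of the record. }
  WriteLn('offset of re    = ', PtrUInt(@z.re) - PtrUInt(@z));
  WriteLn('offset of im    = ', PtrUInt(@z.im) - PtrUInt(@z));
  { 0 here means this particular variable sits on a 16-byte boundary. }
  WriteLn('address mod 16  = ', PtrUInt(@z) mod 16);
end.
```

The results may differ between targets and compiler versions, so the probe should be run on each platform of interest (for example x86_64-win64 and x86_64-linux).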

