This is a long read, so strap in!

Well, I finally got it to work - the required type definition was as follows:

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
  { This record forces "complex" to be aligned to a 16-byte boundary }
  type align_dummy = record
    filler: array[0..1] of real;
  end;
{$pop}

  type complex = record
                   case Byte of
                   0: (
                        alignment: align_dummy;
                      );
                   1: (
                        re : real;
                        im : real;
                      );
                 end;

This is so, so easy to get wrong. If align_dummy's field is 1, 2, 4 or 8 bytes in size, it is classified as an integer under Windows, which overrides the Double fields in the union and causes the entire record to still be passed by reference.  Additionally, the dummy field has to be of type Single, Double or Real: if it is an integral type (e.g. "array[0..15] of Byte"), it is once again classified as an integer and overrides the Double type under the System V ABI parameter-classification rules - in other words, the entire thing would get passed by reference under both x86_64-win64 and x86_64-linux etc.  Long story short, this is an absolute minefield!

I still seriously think that having an alignment attribute or some such would make life so much easier for third-party developers who may not know the exact quirks of how x86_64 classifies its parameters.  To me, this trick feels incredibly hacky and very hard to get right.

The compiled code isn't perfect though - for example, when moving parameters to and from the relevant XMM registers, the "movdqa" instruction is used instead of "movapd", which incurs a performance penalty because the CPU has to switch internally between its integer and double-precision domains (this is why, for example, there are separate VINSERTF128 and VINSERTI128 instructions, even though they superficially do the same thing).  Additionally, inlined vectorcall routines still seem to fall back on using movq to transfer 8 bytes at a time between a function result and wherever it is to be stored, but this is because everything is decomposed at the node level and the compiler currently lacks any decent vectorisation algorithms.

Nevertheless, I think I'm ready to prepare a patch for uComplex for evaluation, and it's given me some things to play with to see if the compiler can be made to work with packed data better.  I figure the uComplex unit is a good place to start because it's an array of 2 Doubles internally and a lot of the operations like addition are component-wise.

Bigger challenges would be optimising the modulus of a complex number:

  function cmod (z : complex): real; vectorcall;
    { modulus : r = |z| }
    begin
       with z do
         cmod := sqrt((re * re) + (im * im));
    end;

A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):

mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
sqrtsd    %xmm0, %xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

Maybe I'm in the minority here, and I'm definitely getting ahead of myself, but seeing ways of improving the compiled assembly language excites me!  Even without vectorcall, I want to see if I can get my deep optimiser into a workable form, because sequences like "movq %rsp,%rax" followed by mere reads through %rax are completely unnecessary.  So are things like this:

...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
...

Just... why?!  Just do "movsd %xmm0,%xmm1"!  The peephole optimiser may struggle to spot this because of the inefficient mixing of integer and floating-point XMM instructions - and of course, the original contents of %xmm0 might be needed later - which is where my deep optimiser, or some other form of data-flow analysis, would come into play.  Playing the logical flow through in my head, I can see it optimising the triplet as follows:

1. Notice that %rax = %rsp and rewrite the movsd instruction to read via %rsp instead, minimising a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction would get changed too):

...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rsp),%xmm1
...

2. Notice that %rax is now never used, so "movq %rsp,%rax" can be safely removed:

...
movdqa    %xmm0,(%rsp)
movsd    (%rsp),%xmm1
...

3. Note that what's being read from the stack is equal to %xmm0 at this point, so just read from %xmm0 directly to prevent a pipeline stall:

...
movdqa    %xmm0,(%rsp)
movsd    %xmm0,%xmm1
...

It might not be able to remove the movdqa instruction because a later instruction reads from 8(%rsp), but vectorisation improvements will help mitigate this.

Okay, enough theorising, but I think my contagious enthusiasm is back!

Gareth aka. Kit



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
