This is a long read, so strap in!

Well, I finally got it to work - the required type definition was as follows:

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
  { This record forces "complex" to be aligned to a 16-byte boundary }
  type align_dummy = record
    filler: array[0..1] of real;
  end;
{$pop}

  type complex = record
                   case Byte of
                   0: (
                        alignment: align_dummy;
                      );
                   1: (
                        re : real;
                        im : real;
                      );
                 end;

This is so, so easy to get wrong. If align_dummy's field is 1, 2, 4 or 8 bytes in size, it is classified as an integer under Windows, which overrides the Double fields in the union and causes the entire record to still be passed by reference.  Additionally, the dummy field has to be of type Single, Double or Real: if it is an integral type (e.g. "array[0..15] of Byte"), it is once again classified as an integer and overrides the Double type under the System V ABI parameter-classification rules - in other words, the entire thing would get passed by reference under both x86_64-win64 and x86_64-linux etc.  Long story short, this is an absolute minefield!

I still seriously think that having an alignment attribute or some such would make life so much easier for third-party developers who may not know the exact quirks of how x86_64 classifies its parameters.  To me, this trick feels incredibly hacky and very hard to get right.

The compiled code isn't perfect though - for example, when moving parameters to and from the relevant XMM registers, the "movdqa" instruction is used instead of "movapd", which incurs a performance penalty because the CPU has to switch internally between its integer and double-precision domains (this is why, for example, there are separate VINSERTF128 and VINSERTI128 instructions, even though they superficially do the same thing).  Additionally, inlined vectorcall routines still seem to fall back on using movq to transfer 8 bytes at a time between a function result and wherever it is to be stored, but this is because everything is decomposed at the node level and the compiler currently lacks any decent vectorisation algorithms.

Nevertheless, I think I'm ready to prepare a patch for uComplex for evaluation, and it's given me some things to play with to see if the compiler can be made to work with packed data better.  I figure the uComplex unit is a good place to start because it's an array of 2 Doubles internally and a lot of the operations like addition are component-wise.

Bigger challenges would be optimising the modulus of a complex number:

  function cmod (z : complex): real; vectorcall;
    { modulus : r = |z| }
    begin
       with z do
         cmod := sqrt((re * re) + (im * im));
    end;

A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):

mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd    %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
sqrtsd    %xmm0, %xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

Maybe I'm in the minority here, and I'm definitely getting ahead of myself, but seeing ways of improving the compiled assembly language excites me!  Even without vectorcall, I want to see if I can get my deep optimiser into a workable form, because sequences like "movq %rsp,%rax" followed by mere reads through %rax are completely unnecessary.  So are things like this:

...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
...

Just... why?!  Just do "movsd %xmm0,%xmm1"!  The peephole optimiser may struggle to spot this because of the inefficient mixing of integer and floating-point XMM instructions - and of course, the original contents of %xmm0 might be needed later - which is where my deep optimiser, or some other form of data-flow analysis, would come into play.  Playing the logical flow through in my head, I can see it optimising the triplet as follows:

1. Notice that %rax = %rsp and rewrite the movsd instruction to read via %rsp instead, minimising a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction would get changed too):

...
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rsp),%xmm1
...

2. Notice that %rax is now never used, so "movq %rsp,%rax" can be safely removed:

...
movdqa    %xmm0,(%rsp)
movsd    (%rsp),%xmm1
...

3. Note that what's being read from the stack is equal to %xmm0 at this point, so just read from %xmm0 directly to prevent a pipeline stall:

...
movdqa    %xmm0,(%rsp)
movsd    %xmm0,%xmm1
...

It might not be able to remove the movdqa instruction because a later instruction reads from 8(%rsp), but vectorisation improvements will help mitigate this.

Okay, enough theorising, but I think my contagious enthusiasm is back!

Gareth aka. Kit



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
