Re: [fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

J. Gareth Moreton Tue, 22 Oct 2019 16:15:03 -0700

That's definitely a marked improvement. Under the System V ABI and vectorcall, both fields of a complex type would be passed through xmm0. Splitting it up into two separate registers would require something like:

shufpd %xmm0,%xmm1,3 { Copy the high-order Double into the low-order position - an immediate operand of "1" will also work, since we're not concerned with the upper 64 bits of %xmm1 }

After which your compiled code will work correctly (since it looks like %xmm1 was undefined before):


mulsd    %xmm0,%xmm0
mulsd    %xmm1,%xmm1

addsd %xmm0,%xmm1 { In terms of register usage, the most optimal combination of instructions here would be "addsd %xmm1,%xmm0" then "sqrtsd %xmm0,%xmm0", since %xmm1 is released for other purposes one instruction sooner }

sqrtsd    %xmm1,%xmm0
ret

Otherwise you'd have to load in the data from reference (%rcx under win64, and %rdi under other x86_64 platforms) - for example:


movsd    (%rcx),%xmm0
movsd    8(%rcx),%xmm1

I would be interested to see the patch when it's ready.

Under SSE2 (no horizontal add), I think the most optimal set of instructions (assuming the entirety of the parameter is passed through %xmm0) is:


mulpd    %xmm0,%xmm0
shufpd    %xmm0,%xmm1,3
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
ret

The main motivation in my eyes is the fact that it removes one of the multiplication instructions - mind you, on a modern processor, a pair of "mulsd" instructions working on independent data will be executed simultaneously, in which case the only time a cycle-counting improvement becomes visible is if the core is hyperthreaded and another thread is using the ALUs. Of course, a sufficiently-skilled assembler programmer will be able to beat the compiler in many cases, but it's still a target to strive for.
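For anyone who wants to experiment with the packed-multiply idea outside the compiler, here is a rough sketch in C using SSE2 intrinsics. It is only an illustration under assumptions: the complex_t struct and the cmod_sse2 name are made up for this example, and it uses _mm_unpackhi_pd (unpckhpd) rather than shufpd to bring the high lane down, so it is not instruction-for-instruction the sequence above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical stand-in for the Pascal "complex" record:
   re ends up in the low lane, im in the high lane. */
typedef struct { double re, im; } complex_t;

/* |z| = sqrt(re*re + im*im) with a single packed multiply. */
static double cmod_sse2(complex_t z)
{
    __m128d v   = _mm_set_pd(z.im, z.re);      /* v   = [re, im] (low, high)   */
    __m128d sq  = _mm_mul_pd(v, v);            /* sq  = [re*re, im*im] (mulpd) */
    __m128d hi  = _mm_unpackhi_pd(sq, sq);     /* hi  = [im*im, im*im]         */
    __m128d sum = _mm_add_sd(sq, hi);          /* low lane = re*re + im*im     */
    return _mm_cvtsd_f64(_mm_sqrt_sd(sum, sum)); /* sqrtsd on the low lane     */
}
```

Calling cmod_sse2 with re = 3.0 and im = 4.0 should give 5.0. Whether a given C compiler actually emits the short sequence for this depends on its optimiser and calling convention, so treat it as a way to check the arithmetic rather than as a statement about code generation.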


Gareth aka. Kit

On 22/10/2019 22:03, Florian Klämpfl wrote:

On 22.10.19 at 05:01, J. Gareth Moreton wrote:


Bigger challenges would be optimising the modulus of a complex number:

   function cmod (z : complex): real; vectorcall;
     { module : r = |z| }
     begin
        with z do
          cmod := sqrt((re * re) + (im * im));
     end;

A perfect compiler with permission to use SSE3 (for haddpd) should generate the following (note that no stack frame is required):

mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
haddpd %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }

sqrtsd    %xmm0,%xmm0
ret

Currently, with vectorcall, the routine compiles into this:

leaq    -24(%rsp),%rsp
movdqa    %xmm0,(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

And without vectorcall (or with an unaligned record type):

leaq    -24(%rsp),%rsp
movq    %rcx,%rax
movq    (%rax),%rdx
movq    %rdx,(%rsp)
movq    8(%rax),%rax
movq    %rax,8(%rsp)
movq    %rsp,%rax
movsd    (%rax),%xmm1
mulsd    %xmm1,%xmm1
movsd    8(%rax),%xmm0
mulsd    %xmm0,%xmm0
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
leaq    24(%rsp),%rsp
ret

With a few additions (the git patch is less than 500 lines) in the compiler I get (it is not ready for committing yet):


.section .text.n_p$program_$$_cmod$complex$$real,"ax"
    .balign 16,0x90
.globl    P$PROGRAM_$$_CMOD$COMPLEX$$REAL
    .type    P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
.Lc2:
# Var $result located in register xmm0
# Var z located in register xmm0
# [test.pp]
# [20] begin
# [22] cmod := sqrt((re * re) + (im * im));
    mulsd    %xmm0,%xmm0
    mulsd    %xmm1,%xmm1
    addsd    %xmm0,%xmm1
    sqrtsd    %xmm1,%xmm0
# Var $result located in register xmm0
.Lc3:
# [23] end;
    ret
.Lc1:
.Le0:

    .size    P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - P$PROGRAM_$$_CMOD$COMPLEX$$REAL

It mainly keeps records in mm registers. I am not sure about the right approach yet. But to allocate one register to each field of suitable records seems to be a reasonable approach.

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


