Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Gabriel Ravier via Gcc Thu, 05 Aug 2021 06:18:21 -0700


On 8/5/21 11:42 AM, Gabriel Paubert wrote:

On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:

Hi,

targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:

                                 .text
    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
                         4: R_X86_64_PC32        .rdata
    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
                         c: R_X86_64_PC32        .rdata
   10:   66 0f 28 d8             movapd %xmm0, %xmm3
   14:   66 0f 28 c8             movapd %xmm0, %xmm1
   18:   66 0f 54 da             andpd  %xmm2, %xmm3
   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
   20:   76 16                   jbe    38 <_trunc+0x38>
   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
   27:   66 0f ef c0             pxor   %xmm0, %xmm0
   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
   38:   c3                      retq

                                 .rdata
                                 .align 8
    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
         00 00 30 43
         00 00 00 00
         00 00 00 00
                                 .align 16
   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
         ff ff ff 7f
   18:   00 00 00 00             .quad  0.0
         00 00 00 00
                                 .end

JFTR: in the best case, the memory accesses cost several cycles,
       while in the worst case they yield a page fault!


Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:

                               .intel_syntax
                               .text
    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
    5:   48 f7 d8              neg     rax
                         #     jz      .L0          # argument zero?
    8:   70 16                 jo      .L0          # argument indefinite?
                                                    # argument overflows 64-bit integer?
    a:   48 f7 d8              neg     rax
    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
   20:   c3              .L0:  ret
                               .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

`-fno-signed-zeros` does that, if you need it
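
To make the sign-of-zero case concrete, here is a small C sketch of the bare
conversion round trip with no sign fix-up. Both function names are
hypothetical, and the round trip is only meaningful for arguments that fit
in a signed 64-bit integer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 1 if the IEEE 754 sign bit of x is set, else 0. */
int sign_bit_set(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 63);
}

/* Bare conversion round trip, no sign fix-up.
   Only meaningful for |x| < 2^63; larger inputs are out of range. */
double trunc_roundtrip(double x)
{
    return (double)(int64_t)x;
}
```

For x = -0.5 the round trip yields +0.0 (sign bit clear), where trunc() is
specified to return -0.0. For any other in-range negative input the sign
already survives the conversion, so zero is the only case the fix-up is for.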


Potentially generating a spurious invalid-operation exception and then
carefully taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
        mov rcx,rax # make a second copy to a scratch register
        neg rcx
        jo .L0
        cvtsi2sd xmm1,rax

The reason is latency: in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd

where the ? means that the following instructions are speculated.


With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
          |->mov->neg->jump

Actually, some OoO cores simply eliminate register copies in the register
renaming stage. But even this is probably completely irrelevant in
this case, where the latency is dominated by the two conversion
instructions.

        Regards,
        Gabriel

regards
Stefan

--
_________________________
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.rav...@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG
