Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Gabriel Paubert Thu, 05 Aug 2021 02:44:26 -0700

On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
> 
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> 
>                                 .text
>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>                         4: R_X86_64_PC32        .rdata
>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>                         c: R_X86_64_PC32        .rdata
>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>   20:   76 16                   jbe    38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>   38:   c3                      retq
> 
>                                 .rdata
>                                 .align 8
>    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
>         00 00 30 43
>         00 00 00 00
>         00 00 00 00
>                                 .align 16
>   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
>         ff ff ff 7f
>   18:   00 00 00 00             .quad  0.0
>         00 00 00 00
>                                 .end
> 
> JFTR: in the best case, the memory accesses cost several cycles,
>       while in the worst case they yield a page fault!
> 
> 
> Properly optimized, shorter and faster code, using but only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
> 
>                               .intel_syntax
>                               .text
>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>    5:   48 f7 d8              neg     rax
>                         #     jz      .L0          # argument zero?
>    8:   70 16                 jo      .L0          # argument indefinite?
>                                                    # argument overflows 64-bit integer?
>    a:   48 f7 d8              neg     rax
>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>   20:   c3              .L0:  ret
>                               .end


There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.
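
As a rough illustration of such an exponent check, here is a minimal C
sketch. The helper name to_int64_checked is made up for this example and
is not something GCC provides; it assumes IEEE-754 doubles and only
performs the conversion when the truncated value is certain to fit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the thread: convert x to int64_t only when
   the truncated value is certain to fit, so the conversion itself cannot
   raise the invalid exception. */
bool to_int64_checked(double x, int64_t *out)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* raw IEEE-754 encoding */

    unsigned biased_exp = (unsigned)((bits >> 52) & 0x7FF);

    /* A biased exponent of 1023 + 63 or more means |x| >= 2^63, and 0x7FF
       means infinity or NaN.  This also rejects -2^63, which would fit:
       the test is deliberately conservative. */
    if (biased_exp >= 1023 + 63)
        return false;

    *out = (int64_t)x;                           /* truncates toward zero */
    return true;
}
```

When the check fails for a finite argument, the argument is already an
integer and can be returned unchanged. A tighter bound of 1023 + 52 would
also work for trunc, since every double of magnitude 2^52 or more is
integral.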

The last part of your code then goes on to handle the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially raising a spurious invalid operation exception and then
carefully taking the sign of zero into account does not seem very consistent.
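
For reference, the sign handling in the last part of the quoted code can
be sketched in C roughly as follows. The function name restore_sign is
invented for this illustration, and the sketch assumes IEEE-754 doubles.
It mirrors the psrlq/psllq/orpd sequence: keep only the sign bit of the
argument and OR it into the converted result.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the thread: merge the sign bit of the
   original argument x into the already truncated result r. */
double restore_sign(double r, double x)
{
    uint64_t xb, rb;
    memcpy(&xb, &x, sizeof xb);
    memcpy(&rb, &r, sizeof rb);

    xb = (xb >> 63) << 63;   /* psrlq 63 ; psllq 63: keep the sign bit only */
    rb |= xb;                /* orpd: merge that sign into the result */

    memcpy(&r, &rb, sizeof r);
    return r;
}
```

Without this step, an argument such as -0.5 makes the round trip through
the integer 0 and comes back as +0.0 instead of -0.0.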

Apart from this, in your code, after cvttsd2si I'd rather use:
        mov rcx,rax # make a second copy to a scratch register
        neg rcx
        jo .L0
        cvtsi2sd xmm1,rax

The reason is latency: in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
              
where the ? means that the following instructions are speculated.  

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
         |->mov->neg->jump

Actually, some OoO cores simply eliminate register copies through the
register renaming mechanism. But even this is probably completely
irrelevant in this case, where the latency is dominated by the two
conversion instructions.
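
For readers less familiar with the neg/jo pair used in both variants:
when the conversion is invalid, cvttsd2si returns the "integer indefinite"
value 0x8000000000000000, and that is the only 64-bit value whose negation
overflows. A minimal C sketch of the same test follows. It assumes GCC or
Clang for the __builtin_sub_overflow builtin, and the function name
neg_overflows is made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: C-level equivalent of
   "neg reg ; jo label".  Returns true exactly when v == INT64_MIN. */
bool neg_overflows(int64_t v)
{
    int64_t negated;
    /* GCC/Clang builtin: returns true when 0 - v does not fit in int64_t. */
    return __builtin_sub_overflow((int64_t)0, v, &negated);
}
```

An argument of exactly -2^63 also takes the jump, which is harmless
because that value is already integral and is returned unchanged.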

        Regards,
        Gabriel



> 
> regards
> Stefan
