On 8/5/21 11:42 AM, Gabriel Paubert wrote:
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
Hi,
targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
following code (13 instructions using 57 bytes, plus 4 quadwords
using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
.text
0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
4: R_X86_64_PC32 .rdata
8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
c: R_X86_64_PC32 .rdata
10: 66 0f 28 d8 movapd %xmm0, %xmm3
14: 66 0f 28 c8 movapd %xmm0, %xmm1
18: 66 0f 54 da andpd %xmm2, %xmm3
1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
20: 76 16 jbe 38 <_trunc+0x38>
22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
27: 66 0f ef c0 pxor %xmm0, %xmm0
2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
34: 66 0f 56 c2 orpd %xmm2, %xmm0
38: c3 retq
.rdata
.align 8
0: 00 00 00 00 .LC0: .quad 0x1.0p52
00 00 30 43
00 00 00 00
00 00 00 00
.align 16
10: ff ff ff ff .LC1: .quad ~(-0.0)
ff ff ff 7f
18: 00 00 00 00 .quad 0.0
00 00 00 00
.end
JFTR: in the best case, the memory accesses cost several cycles,
while in the worst case they yield a page fault!
Properly optimized, shorter and faster code, using only 9 instructions
in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
and saving at least 16 + 32 bytes, follows:
.intel_syntax
.text
0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument)
5: 48 f7 d8 neg rax
# jz .L0 # argument zero?
8: 70 16 jo .L0 # argument indefinite?
# argument overflows 64-bit integer?
a: 48 f7 d8 neg rax
d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument)
12: 66 0f 73 d0 3f psrlq xmm0, 63
17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument)
20: c3 .L0: ret
.end
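For readers who prefer C, the idea behind the shorter sequence can be sketched roughly as follows. This is only an illustration, not the code from the thread: the name trunc_roundtrip is made up, and the range test is done before the conversion (the assembly tests afterwards with neg/jo), because an out-of-range double-to-integer cast is undefined behaviour in C.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the thread: trunc() via an int64 round trip. */
double trunc_roundtrip(double x)
{
    /* NaN and magnitudes >= 2^63 are returned unchanged; every double that
       large is already integral.  The comparison is false for NaN, so NaN
       takes this path as well. */
    if (!(x > -0x1.0p63 && x < 0x1.0p63))
        return x;

    int64_t i = (int64_t)x;   /* corresponds to cvttsd2si: truncate toward zero */
    double r = (double)i;     /* corresponds to cvtsi2sd */

    /* Corresponds to psrlq/psllq/orpd: isolate the sign bit of the argument
       and OR it into the result, so that e.g. -0.3 yields -0.0, not +0.0. */
    uint64_t xbits, rbits;
    memcpy(&xbits, &x, sizeof xbits);
    memcpy(&rbits, &r, sizeof rbits);
    rbits |= (xbits >> 63) << 63;
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```

With this sketch, trunc_roundtrip(-2.7) gives -2.0 and trunc_roundtrip(-0.3) gives a zero with the sign bit set.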
There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer. So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.
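One possible shape for such an exponent check, as a minimal C sketch: it reads the 11-bit biased exponent field of the double and accepts only magnitudes below 2^63. The name fits_in_int64 is hypothetical, and the sketch assumes the usual IEEE 754 binary64 layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when |x| < 2^63, i.e. when the conversion
   to a signed 64-bit integer cannot raise the invalid exception. */
int fits_in_int64(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Bits 52..62 hold the biased exponent; the bias is 1023. */
    unsigned biased = (unsigned)((bits >> 52) & 0x7ff);

    /* Infinities and NaNs have the field at 0x7ff and are rejected too. */
    return biased < 1023 + 63;
}
```

The check also rejects -2^63, which would convert exactly; that is harmless here, since that value is already integral and can be returned unchanged.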
The last part of your code then takes into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).
`-fno-signed-zeros` does that, if you need it
Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.
Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax
The reason is latency: in an OoO engine, splitting the two paths is
almost always a win.
With your patch:
cvttsd2si-->neg-?->neg-->cvtsi2sd
where the ? means that the following instructions are speculated.
With an auxiliary register there are two dependency chains:
cvttsd2si-?->cvtsi2sd
         |->mov->neg->jump
Actually some OoO cores just eliminate register copies using the register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.
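As an aside on why neg followed by jo works as the test in both variants: for NaN and out-of-range input, cvttsd2si returns the "integer indefinite" value 0x8000000000000000, and that is the only 64-bit value whose negation overflows. A small C sketch of that property, assuming GCC or Clang for __builtin_sub_overflow; the name neg_overflows is made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: true exactly when `neg` applied to v would set the
   overflow flag, which happens only for INT64_MIN (0x8000000000000000). */
bool neg_overflows(int64_t v)
{
    int64_t negated;  /* result is discarded; only the overflow report matters */
    return __builtin_sub_overflow((int64_t)0, v, &negated);
}
```

Because the test only reads the integer, it can run on a copy while the conversion back to double proceeds from the original register, which is the split into two dependency chains described above.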
Regards,
Gabriel
regards
Stefan
--
_________________________
Gabriel RAVIER
First year student at Epitech