On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
>
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>
>                                 .text
>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>                         4: R_X86_64_PC32        .rdata
>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>                         c: R_X86_64_PC32        .rdata
>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>   20:   76 16                   jbe    38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>   38:   c3                      retq
>
>                                 .rdata
>                                 .align  8
>    0:   00 00 00 00     .LC0:   .quad   0x1.0p52
>         00 00 30 43
>         00 00 00 00
>         00 00 00 00
>                                 .align  16
>   10:   ff ff ff ff     .LC1:   .quad   ~(-0.0)
>         ff ff ff 7f
>   18:   00 00 00 00             .quad   0.0
>         00 00 00 00
>                                 .end
>
> JFTR: in the best case, the memory accesses cost several cycles,
>       while in the worst case they yield a page fault!
>
>
> Properly optimized, shorter and faster code, using but only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
>
>                               .intel_syntax
>                               .text
>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0   # rax = trunc(argument)
>    5:   48 f7 d8              neg     rax
>                         #     jz      .L0           # argument zero?
>    8:   70 16                 jo      .L0           # argument indefinite?
>                                                     # argument overflows 64-bit integer?
>    a:   48 f7 d8              neg     rax
>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax    # xmm1 = trunc(argument)
>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>   17:   66 0f 73 f0 3f        psllq   xmm0, 63      # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>   1c:   66 0f 56 c1           orpd    xmm0, xmm1    # xmm0 = trunc(argument)
>   20:   c3              .L0:  ret
>                               .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer. So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes on to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating a spurious invalid operation exception and then
carefully taking into account the sign of zero does not seem very
consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:

	mov rcx,rax   # make a second copy to a scratch register
	neg rcx
	jo .L0
	cvtsi2sd xmm1,rax

The reason is latency: in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd

where the ? means that the following instructions are speculated.

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
         |->mov->neg->jump

Actually some OoO cores just eliminate register copies using the
register renaming mechanism. But even this is probably completely
irrelevant in this case, where the latency is dominated by the two
conversion instructions.

	Regards,
	Gabriel

>
> regards
> Stefan
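
P.S. To make the "check on the exponent before cvttsd2si" remark concrete, here is a minimal C sketch of the idea. It is only an illustration under my own assumptions, not what GCC emits: the name trunc_via_int64 is made up, and I use a magnitude comparison against 2^52 (the same constant as .LC0 in the quoted GCC output) instead of inspecting the exponent bits directly.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not GCC's actual expansion. */
static double trunc_via_int64(double x)
{
    /* Every double with |x| >= 2^52 is already integral.  isless() is a
       quiet comparison, so a quiet NaN also takes this early exit and the
       conversion below never sees a value it cannot represent. */
    if (!isless(fabs(x), 0x1.0p52))
        return x;

    int64_t i = (int64_t)x;         /* truncating conversion (cvttsd2si) */

    /* Convert back (cvtsi2sd) and restore the sign, so that inputs in
       (-1, 0] yield -0.0 rather than +0.0. */
    return copysign((double)i, x);
}
```

With the guard in place the conversion is only reached for values that fit in a signed 64-bit integer, so it should not raise a spurious invalid exception. The cost is that the constant and the compare come back, which is presumably why the shorter sequence would need an option such as -ffast-math.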