On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak 
<stefan.kant...@nexgo.de> wrote:
>Michael Matz <m...@suse.de> wrote:
>
>
>> Hello,
>> 
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>> 
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>> 
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> |argument| > 2**52 are already integer, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>> 
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants are actually faster than the equivalent sequence of shifting
>> and masking for the >= compares.  That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that can't load a 16-byte wide XMM register
>in one go, but had to split the load into 2 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
>        movsd  .LC1(%rip), %xmm2
>        movsd  .LC0(%rip), %xmm4
>        movapd %xmm0, %xmm3
>        movapd %xmm0, %xmm1
>        andpd  %xmm2, %xmm3
>        ucomisd %xmm3, %xmm4
>        jbe    38 <_trunc+0x38>
> 
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
>  handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
>        movq   rax, xmm0

The XMM-to-GPR move costs you an extra cycle of latency. Shifts also tend to
be port-constrained. The original sequences are also somewhat straightforward
to vectorize.

>        add    rax, rax
>        shr    rax, 53
>        cmp    eax, 53+1023
>        jae    return
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan
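For readers following along, the two range checks being compared can be
sketched in C (function names and the exact boundary are mine, not from the
patch; the posted sequence compares against 53+1023, while 52+1023 is the
smallest biased exponent at which every double is already integral):

```c
/* Sketch of the two "is trunc(x) == x guaranteed?" fast-path tests:
 *  - GCC's current code masks off the sign (andpd) and compares the
 *    magnitude against 2^52 (ucomisd), loading both constants from memory;
 *  - the proposed code moves the bits to a GPR, shifts out the sign bit,
 *    and compares the biased exponent directly, using only immediates.
 * Both return true for |x| >= 2^52 (and for Inf/NaN, whose exponent
 * field is all ones), where the identity trunc(x) == x holds. */
#include <math.h>
#include <stdint.h>
#include <string.h>

static int fast_path_const(double x)
{
    /* andpd + ucomisd: magnitude >= 2^52 means no fraction bits left.
       The negated '<' mirrors jbe: unordered (NaN) also takes the branch. */
    return !(fabs(x) < 0x1p52);
}

static int fast_path_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* movq rax, xmm0 */
    bits += bits;                     /* add rax, rax: shifts out the sign */
    return (unsigned)(bits >> 53)     /* shr rax, 53: biased exponent */
           >= 52 + 1023;              /* cmp/jae */
}
```

The point of the second variant is that it needs no memory-resident
constants, only immediates, which is what the cycle counts above are about.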