Richard Biener <richard.guent...@gmail.com> wrote:

> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak
> <stefan.kant...@nexgo.de> wrote:
>> Michael Matz <m...@suse.de> wrote:
>>> Btw, have you made speed measurements with your improvements?
>>
>> No.
[...]
>> If the constant happens to be present in L1 cache, it MAY load as fast
>> as an immediate.
>> BUT: on current CPUs, the code GCC generates
>>
>>         movsd   .LC1(%rip), %xmm2
>>         movsd   .LC0(%rip), %xmm4
>>         movapd  %xmm0, %xmm3
>>         movapd  %xmm0, %xmm1
>>         andpd   %xmm2, %xmm3
>>         ucomisd %xmm3, %xmm4
>>         jbe     38 <_trunc+0x38>
>>
>> needs
>> - 4 cycles if the movsd are executed in parallel and the movapd are
>>   handled by the register renamer,
>> - 5 cycles if the movsd and the movapd are executed in parallel,
>> - 7 cycles else,
>> plus an unknown number of cycles if the constants are not in L1.
>> The proposed
>>
>>         movq    rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port constrained. The original sequences are also somewhat
> straightforward to vectorize.

Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD! These are the
bottlenecks in the current code.

If you want the code for trunc() and its cousins to be vectorizable, you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural variant.

Stefan