Richard Biener <richard.guent...@gmail.com> wrote:

> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak
> <stefan.kant...@nexgo.de> wrote:
>> Michael Matz <m...@suse.de> wrote:
>>> Btw, have you made speed measurements with your improvements?
>>
>> No.
[...]
>> If the constant happens to be present in L1 cache, it MAY load as fast
>> as an immediate.
>> BUT: on current CPUs, the code GCC generates
>>
>>         movsd   .LC1(%rip), %xmm2
>>         movsd   .LC0(%rip), %xmm4
>>         movapd  %xmm0, %xmm3
>>         movapd  %xmm0, %xmm1
>>         andpd   %xmm2, %xmm3
>>         ucomisd %xmm3, %xmm4
>>         jbe     38 <_trunc+0x38>
>>
>> needs
>> - 4 cycles if the movsd are executed in parallel and the movapd are
>>   handled by the register renamer,
>> - 5 cycles if the movsd and the movapd are executed in parallel,
>> - 7 cycles else,
>> plus an unknown number of cycles if the constants are not in L1.
>> The proposed
>>
>>         movq    rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port constrained. The original sequences are also somewhat
> straightforward to vectorize.

Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD! These are the
bottlenecks in the current code.

If you want the code for trunc() and its cousins to be vectorizable, you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural variant.

Stefan