Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-09 Thread Joseph Myers
On Sat, 7 Aug 2021, Stefan Kanthak wrote:

> Joseph Myers  wrote:
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
> 
> I'll do so as soon as GCC drops support for all C dialects before C2x!
> 
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past, you CAN'T neglect all software, written before C2x
> modified sNaN handling, that relies on the documented behaviour at the
> time it was written.

Pre-C2x versions of C don't cover signaling NaNs at all; they use "NaN" to 
mean "quiet NaN" (so signaling NaNs are trap representations).  Software 
written before C2x thus can't rely on any particular sNaN handling.

The POSIX description of signaling NaNs ("On implementations that support 
the IEC 60559:1989 standard floating point, functions with signaling NaN 
argument(s) shall be treated as if the function were called with an 
argument that is a required domain error and shall return a quiet NaN 
result, except where stated otherwise.") is consistent with C2x as regards 
trunc(sNaN) needing to return a quiet NaN with INVALID raised.  The 
problems are (a) that POSIX fails to "state otherwise" for the cases (e.g. 
fabs, copysign) where a signaling NaN argument should not result in a 
quiet NaN with INVALID raised (as per the IEEE semantics for those 
operations), and (b) that the POSIX rule about setting errno to EDOM when 
(math_errhandling & MATH_ERRNO) is nonzero is inappropriate for sNaN 
arguments, since it is incompatible with the normal approach of generating 
INVALID and a quiet NaN by passing NaN arguments through arithmetic; the 
C2x approach of leaving it implementation-defined whether an sNaN input is 
a domain error is more appropriate.
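
For concreteness, here is a small self-contained C test of the behaviour
described above (my illustration, not from the original mail; it uses the
GCC extension __builtin_nans to obtain a signaling NaN, and strictly ISO C
would also want "#pragma STDC FENV_ACCESS ON" here):

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double snan = __builtin_nans("");   /* signaling NaN, GCC extension */
        feclearexcept(FE_ALL_EXCEPT);
        double r = trunc(snan);
        /* C2x Annex F (and a strict reading of POSIX) wants a quiet NaN
           and INVALID raised; whether that actually happens with a given
           code sequence is exactly the point at issue in this thread. */
        printf("result is NaN: %d, INVALID raised: %d\n",
               isnan(r), fetestexcept(FE_INVALID) != 0);
        return 0;
    }

(Build with something like "gcc -std=gnu2x test.c -lm".)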

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-08 Thread Vincent Lefevre
On 2021-08-07 14:32:32 +0200, Stefan Kanthak wrote:
> Joseph Myers  wrote:
> > On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> 
> PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!
> 
> >> > I don't know what the standard says about NaNs in this case, I seem to
> >> > remember that arithmetic instructions typically produce QNaN when one of
> >> > the inputs is a NaN, whether signaling or not. 
> >> 
> >> 
> >> and its cousins as well as the C standard say
> >> 
> >> | If x is NaN, a NaN shall be returned.
> >> 
> >> That's why I mentioned that the code GCC generates also doesn't
> >> quiet SNaNs.
> > 
> > You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;
> 
> I'll do so as soon as GCC drops support for all C dialects before C2x!
> 
> Unless you use a time machine and fix the POSIX and ISO C standards
> written in the past, you CAN'T neglect all software, written before C2x
> modified sNaN handling, that relies on the documented behaviour at the
> time it was written.

Before C2x:

  This specification does not define the behavior of signaling NaNs.365)
  It generally uses the term NaN to denote quiet NaNs.

  365) Since NaNs created by IEC 60559 operations are always quiet,
  quiet NaNs (along with infinities) are sufficient for closure of
  the arithmetic.

(in Annex F).

You can't create signaling NaNs with C operations, but you may get
them when reading data from memory. So, IMHO, they should really be
supported in practice, at least in some sense. I would expect that
when an sNaN occurs as an input, it is handled either like an sNaN
(see IEC 60559 / IEEE 754) or like a qNaN. Propagating the
signaling status (forbidden in IEEE 754 for almost all operations)
could be acceptable (this means that an implementation may ignore
whether a NaN is quiet or signaling), but it should be avoided.
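
As a minimal illustration of how an sNaN can enter a program purely by
reading bits from memory, with no C arithmetic involved (the bit pattern
and the memcpy harness are mine, not Vincent's):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Exponent all ones, quiet bit (mantissa MSB) clear, payload
           nonzero: a signaling NaN in IEEE 754 binary64 encoding. */
        uint64_t bits = 0x7FF0000000000001ull;
        double d;
        memcpy(&d, &bits, sizeof d);   /* no FP operation, no quieting */
        printf("isnan: %d\n", isnan(d));
        /* Passing d through arithmetic (d + 0.0, trunc(d), ...) is where
           implementations differ in how the signaling status is treated. */
        return 0;
    }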

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-07 Thread Stefan Kanthak
Joseph Myers  wrote:


> On Fri, 6 Aug 2021, Stefan Kanthak wrote:

PLEASE DON'T STRIP ATTRIBUTION LINES: I did not write the following paragraph!

>> > I don't know what the standard says about NaNs in this case, I seem to
>> > remember that arithmetic instructions typically produce QNaN when one of
>> > the inputs is a NaN, whether signaling or not. 
>> 
>> 
>> and its cousins as well as the C standard say
>> 
>> | If x is NaN, a NaN shall be returned.
>> 
>> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.
> 
> You should be looking at TS 18661-3 / C2x Annex F for sNaN handling;

I'll do so as soon as GCC drops support for all C dialects before C2x!

Unless you use a time machine and fix the POSIX and ISO C standards
written in the past, you CAN'T neglect all software, written before C2x
modified sNaN handling, that relies on the documented behaviour at the
time it was written.

> the POSIX attempts to deal with signaling NaNs aren't well thought out.

[...]

> Though in C2x mode, these SSE2 code sequences won't be used by default
> anyway, except for rint (C2x implies -fno-fp-int-builtin-inexact).

regards
Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Joseph Myers
On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> > I don't know what the standard says about NaNs in this case, I seem to
> > remember that arithmetic instructions typically produce QNaN when one of
> > the inputs is a NaN, whether signaling or not. 
> 
> 
> and its cousins as well as the C standard say
> 
> | If x is NaN, a NaN shall be returned.
> 
> That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

You should be looking at TS 18661-3 / C2x Annex F for sNaN handling; the 
POSIX attempts to deal with signaling NaNs aren't well thought out.  (And 
possibly adding flag_signaling_nans conditions where appropriate, so that 
-fsignaling-nans disables any of these to-integer sequences that doesn't 
produce a qNaN with INVALID raised from an sNaN input.)  Though in C2x 
mode, these SSE2 code sequences won't be used by default anyway, except 
for rint (C2x implies -fno-fp-int-builtin-inexact).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Richard Biener  wrote:

> On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak wrote:
>>Michael Matz  wrote:

>>> Btw, have you made speed measurements with your improvements?
>>
>>No.
[...]
>>If the constant happens to be present in L1 cache, it MAY load as fast
>>as an immediate.
>>BUT: on current CPUs, the code GCC generates
>>
>>movsd  .LC1(%rip), %xmm2
>>movsd  .LC0(%rip), %xmm4
>>movapd %xmm0, %xmm3
>>movapd %xmm0, %xmm1
>>andpd  %xmm2, %xmm3
>>ucomisd %xmm3, %xmm4
>>jbe    38 <_trunc+0x38>
>> 
>>needs
>>- 4 cycles if the movsd are executed in parallel and the movapd are
>>  handled by the register renamer,
>>- 5 cycles if the movsd and the movapd are executed in parallel,
>>- 7 cycles else,
>>plus an unknown number of cycles if the constants are not in L1.
>>The proposed
>>
>>movq   rax, xmm0
>
> The xmm to GPR move costs you an extra cycle in latency. Shifts also
> tend to be port constrained. The original sequences are also somewhat
> straightforward to vectorize.

Please show how GCC vectorizes CVT[T]SD2SI and CVTSI2SD!
These are the bottlenecks in the current code.

If you want the code for trunc() and cousins to be vectorizable, you
should stay with the alternative code I presented some posts before,
which GCC should be (able to) generate from its other procedural
variant.
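
For illustration, this is the kind of loop where vectorizability of the
sequence matters (a trivial harness of mine, not code from this thread):

    #include <stddef.h>

    /* With the constant-based SSE2 sequence (packed andpd/compare against
       2**52) GCC can turn this loop into packed operations; a variant
       built around CVTTSD2SI/CVTSI2SD is much harder to vectorize, since
       there is no packed double <-> 64-bit integer conversion before
       AVX-512. */
    void trunc_array(double *restrict dst, const double *restrict src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = __builtin_trunc(src[i]);
    }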

Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Richard Biener via Gcc
On August 6, 2021 4:32:48 PM GMT+02:00, Stefan Kanthak wrote:
>Michael Matz  wrote:
>
>
>> Hello,
>> 
>> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
>> 
>>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
>> 
>> This claim would need to be proven in the wild.
>
>I should have left the "when" after the "and" which I originally had
>written...
>
>> Values with |argument| > 2**52 are already integers, and shouldn't generate a spurious
>> exception from the various to-int conversions, not even in fast-math mode
>> for some relevant set of applications (at least SPECcpu).
>> 
>> Btw, have you made speed measurements with your improvements?
>
>No.
>
>> The size improvements are obvious, but speed changes can be fairly
>> unintuitive, e.g. there were old K8 CPUs where the memory loads for
>> constants were actually faster than the equivalent sequence of shifting
>> and masking for the >= compares.  That's an irrelevant CPU now, but it
>> shows that intuition about speed consequences can be wrong.
>
>I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
>in one go, but had to split the load into two 8-byte loads.
>
>If the constant happens to be present in L1 cache, it MAY load as fast
>as an immediate.
>BUT: on current CPUs, the code GCC generates
>
>movsd  .LC1(%rip), %xmm2
>movsd  .LC0(%rip), %xmm4
>movapd %xmm0, %xmm3
>movapd %xmm0, %xmm1
>andpd  %xmm2, %xmm3
>ucomisd %xmm3, %xmm4
>jbe    38 <_trunc+0x38>
> 
>needs
>- 4 cycles if the movsd are executed in parallel and the movapd are
>  handled by the register renamer,
>- 5 cycles if the movsd and the movapd are executed in parallel,
>- 7 cycles else,
>plus an unknown number of cycles if the constants are not in L1.
>The proposed
>
>movq   rax, xmm0

The xmm to GPR move costs you an extra cycle in latency. Shifts also tend to be
port constrained. The original sequences are also somewhat straightforward to
vectorize.

>addrax, rax
>shrrax, 53
>cmpeax, 53+1023
>jaereturn
>
>needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
>register renamer).
>
>Stefan



Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Michael Matz via Gcc
Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> >> For -ffast-math, where the sign of -0.0 is not handled and the 
> >> spurious invalid floating-point exception for |argument| >= 2**63 is 
> >> acceptable,
> > 
> > This claim would need to be proven in the wild.
> 
> I should have left the "when" after the "and" which I originally had 
> written...
> 
> > Values with |argument| > 2**52 are already integers, and shouldn't generate a 
> > spurious exception from the various to-int conversions, not even in 
> > fast-math mode for some relevant set of applications (at least 
> > SPECcpu).
> > 
> > Btw, have you made speed measurements with your improvements?
> 
> No.
> 
> > The size improvements are obvious, but speed changes can be fairly 
> > unintuitive, e.g. there were old K8 CPUs where the memory loads for 
> > constants were actually faster than the equivalent sequence of shifting 
> > and masking for the >= compares.  That's an irrelevant CPU now, but it 
> > shows that intuition about speed consequences can be wrong.
> 
> I know. I also know of CPUs that couldn't load a 16-byte wide XMM register 
> in one go, but had to split the load into two 8-byte loads.
> 
> If the constant happens to be present in L1 cache, it MAY load as fast
> as an immediate.
> BUT: on current CPUs, the code GCC generates
> 
> movsd  .LC1(%rip), %xmm2
> movsd  .LC0(%rip), %xmm4
> movapd %xmm0, %xmm3
> movapd %xmm0, %xmm1
> andpd  %xmm2, %xmm3
> ucomisd %xmm3, %xmm4
> jbe    38 <_trunc+0x38>
>  
> needs
> - 4 cycles if the movsd are executed in parallel and the movapd are
>   handled by the register renamer,
> - 5 cycles if the movsd and the movapd are executed in parallel,
> - 7 cycles else,
> plus an unknown number of cycles if the constants are not in L1.

You also need to consider the case where the to-int converters are called 
in a loop (which is ultimately the only interesting case for performance): 
there it's possible to load the constants before the loop and keep them in 
registers (at the expense of occupying two registers, of course), 
effectively removing the loads from cost considerations.  It's all tough 
choices, which is why stuff needs to be measured in some contexts 
:-)
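
To make the loop scenario concrete, here is a hand-hoisted C sketch of the
constant-based sequence (my illustration, not code from the thread; the
integer cast is safe because it is guarded by the |x| < 2**52 comparison):

    #include <math.h>
    #include <stddef.h>

    void trunc_loop(double *dst, const double *src, size_t n)
    {
        const double limit = 0x1.0p52;         /* GCC's .LC0, hoisted */
        for (size_t i = 0; i < n; i++) {
            double x = src[i];
            if (fabs(x) < limit)               /* andpd + ucomisd */
                x = (double)(long long)x;      /* cvttsd2si + cvtsi2sd */
            dst[i] = copysign(x, src[i]);      /* restores the sign of -0.0 */
        }
    }

Kept in registers across iterations, limit (and the mask behind fabs)
costs nothing per iteration, which is the point above.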

(I do like your sequences btw, it's just not 100% clearcut that they are 
always a speed improvement).


Ciao,
Michael.

> The proposed
> 
> movq   rax, xmm0
> addrax, rax
> shrrax, 53
> cmpeax, 53+1023
> jaereturn
> 
> needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
> register renamer).
> 
> Stefan
> 


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Gabriel Paubert  wrote:


> On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert  wrote:
>> 
>> > Hi,
>> > 
>> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:

[...]

>> >> The whole idea behind these implementations is to get rid of loading
>> >> floating-point constants to perform comparisons.
>> > 
>> > Indeed, but what I had in mind was something along the following lines:
>> > 
>> > movq rax,xmm0   # and copy rax to say rcx, if needed later
>> > shrq rax,52 # move sign and exponent to 12 LSBs 
>> > andl eax,0x7ff  # mask the sign
>> > cmpl eax,0x434  # value to be checked
>> > ja return   # exponent too large, we're done (what about NaNs?)
>> > cvttsd2si rax,xmm0 # safe after exponent check
>> > cvtsi2sd xmm0,rax  # conversion done
>> > 
>> > and a bit more to handle the corner cases (essentially preserve the
>> > sign to be correct between -1 and -0.0).
>> 
>> The sign of -0.0 is the only corner case and already handled in my code.
>> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
>> preserved, both in the code GCC generates and in my code.
> 
> I don't know what the standard says about NaNs in this case, I seem to
> remember that arithmetic instructions typically produce QNaN when one of
> the inputs is a NaN, whether signaling or not. 


and its cousins as well as the C standard say

| If x is NaN, a NaN shall be returned.

That's why I mentioned that the code GCC generates also doesn't quiet SNaNs.

>> > But the CPU can (speculatively) start the conversions early, so the
>> > dependency chain is rather short.
>> 
>> Correct.
>>  
>> > I don't know if it's faster than your new code,
>> 
>> It should be faster.
>> 
>> > I'm almost sure that it's shorter.
>> 
>> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
>> 5+4+5+5+2=21 bytes.
>> 
>> JFTR: better use "add rax,rax; shr rax,53" instead of
>>   "shr rax,52; and eax,0x7ff" and save 2 bytes.
> 
> Indeed, I don't have the exact size of instructions in my head,
> especially since I've not written x86 assembly since the mid 90s.
> 
> In any case, with your last improvement, the code is now down to a
> single 32 bit immediate constant. And I don't see how to eliminate it...
> 
>> 
>> Complete properly optimized code for __builtin_trunc is then as follows
>> (11 instructions, 44 bytes):
>> 
>> .code64
>> .intel_syntax
>> .equ    BIAS, 1023
>> .text
>> movq    rax, xmm0    # rax = argument
>> add rax, rax
>> shr rax, 53  # rax = exponent of |argument|
>> cmp eax, BIAS + 53
>> jae .Lexit   # argument indefinite?
> 
> Maybe s/.Lexit/.L0/

Surely!

>>  # |argument| >= 0x1.0p53?
>> cvttsd2si rax, xmm0  # rax = trunc(argument)
>> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> psrlq   xmm0, 63
>> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> .L0:    ret
>> .end
>> 
> 
> This looks nice.

Let's see how to convince GCC to generate such code sequences...
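
In the meantime, here is a C-level sketch of the branchy sequence above
(my rendition, with memcpy standing in for the XMM<->GPR moves; like the
assembly, it returns a NaN argument unchanged, i.e. it does not quiet an
sNaN):

    #include <stdint.h>
    #include <string.h>

    double trunc_sketch(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);           /* movq rax, xmm0 */
        uint64_t exponent = (bits + bits) >> 53;  /* add rax,rax ; shr rax,53 */
        if (exponent >= 1023 + 53)                /* cmp eax,BIAS+53 ; jae */
            return x;     /* already integral, Inf, or NaN: unchanged */
        double t = (double)(int64_t)x;            /* cvttsd2si ; cvtsi2sd */
        uint64_t tbits;
        memcpy(&tbits, &t, sizeof tbits);
        tbits |= bits & 0x8000000000000000ull;    /* psrlq/psllq ; orpd */
        memcpy(&t, &tbits, sizeof t);             /* keeps the sign of -0.0 */
        return t;
    }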

Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Michael Matz  wrote:


> Hello,
> 
> On Fri, 6 Aug 2021, Stefan Kanthak wrote:
> 
>> For -ffast-math, where the sign of -0.0 is not handled and the spurious
>> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> 
> This claim would need to be proven in the wild.

I should have left the "when" after the "and" which I originally had
written...

> Values with |argument| > 2**52 are already integers, and shouldn't generate a spurious
> exception from the various to-int conversions, not even in fast-math mode
> for some relevant set of applications (at least SPECcpu).
> 
> Btw, have you made speed measurements with your improvements?

No.

> The size improvements are obvious, but speed changes can be fairly
> unintuitive, e.g. there were old K8 CPUs where the memory loads for
> constants were actually faster than the equivalent sequence of shifting
> and masking for the >= compares.  That's an irrelevant CPU now, but it
> shows that intuition about speed consequences can be wrong.

I know. I also know of CPUs that couldn't load a 16-byte wide XMM register
in one go, but had to split the load into two 8-byte loads.

If the constant happens to be present in L1 cache, it MAY load as fast
as an immediate.
BUT: on current CPUs, the code GCC generates

movsd  .LC1(%rip), %xmm2
movsd  .LC0(%rip), %xmm4
movapd %xmm0, %xmm3
movapd %xmm0, %xmm1
andpd  %xmm2, %xmm3
ucomisd %xmm3, %xmm4
jbe    38 <_trunc+0x38>
 
needs
- 4 cycles if the movsd are executed in parallel and the movapd are
  handled by the register renamer,
- 5 cycles if the movsd and the movapd are executed in parallel,
- 7 cycles else,
plus an unknown number of cycles if the constants are not in L1.
The proposed

movq   rax, xmm0
addrax, rax
shrrax, 53
cmpeax, 53+1023
jaereturn

needs 5 cycles (moves from XMM to GPR are AFAIK not handled by the
register renamer).

Stefan


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Michael Matz via Gcc
Hello,

On Fri, 6 Aug 2021, Stefan Kanthak wrote:

> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,

This claim would need to be proven in the wild.  Values with |argument| > 
2**52 are already integers, and shouldn't generate a spurious exception from the 
various to-int conversions, not even in fast-math mode for some relevant 
set of applications (at least SPECcpu).

Btw, have you made speed measurements with your improvements?  The size 
improvements are obvious, but speed changes can be fairly unintuitive, 
e.g. there were old K8 CPUs where the memory loads for constants were 
actually faster than the equivalent sequence of shifting and masking for 
the >= compares.  That's an irrelevant CPU now, but it shows that 
intuition about speed consequences can be wrong.


Ciao,
Michael.


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Gabriel Paubert
On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> > Hi,
> > 
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert  wrote:
> >> 
> >> 
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> 
> >> >>    .intel_syntax
> >> >>    .text
> >> >>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>  5:   48 f7 d8         neg rax
> >> >>                      # jz  .L0              # argument zero?
> >> >>  8:   70 16            jo  .L0              # argument indefinite?
> >> >>                                             # argument overflows 64-bit integer?
> >> >>  a:   48 f7 d8         neg rax
> >> >>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
> >> >> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >> 20:   c3         .L0:  ret
> >> >>    .end
> >> > 
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >> 
> >> Right, I overlooked this fault. Thanks for pointing out.
> >> 
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >> 
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> > 
> > Indeed, but what I had in mind was something along the following lines:
> > 
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs 
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return   # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> > 
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
> 
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, both in the code GCC generates and in my code.

I don't know what the standard says about NaNs in this case, I seem to
remember that arithmetic instructions typically produce QNaN when one of
the inputs is a NaN, whether signaling or not. 

> 
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
> 
> Correct.
>  
> > I don't know if it's faster than your new code,
> 
> It should be faster.
> 
> > I'm almost sure that it's shorter.
> 
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
> 
> JFTR: better use "add rax,rax; shr rax,53" instead of
>   "shr rax,52; and eax,0x7ff" and save 2 bytes.

Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.

In any case, with your last improvement, the code is now down to a
single 32 bit immediate constant. And I don't see how to eliminate it...

> 
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
> movq    rax, xmm0    # rax = argument
> add rax, rax
> shr rax, 53  # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit   # argument indefinite?

Maybe s/.Lexit/.L0/

>  # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> psrlq   xmm0, 63
> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end
> 

This looks nice.

> @Richard Biener (et. al.):
> 
> 1. Is a primitive for "floating-point > 2**x", which generates such
>an "integer" code sequence, already available, at least for
>float/binary32 and double/binary64?
> 
> 2. the procedural code generator for __builtin_trunc() etc.  uses
>__builtin_fabs() and __builtin_copysign() as building blocks.
>These would need to (and of course should) be modified to generate
>psllq/psrlq pairs instead of andpd/andnpd referencing a memory
> location with either -0.0 or ~(-0.0).
> 
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
> 
> .code64

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Richard Biener via Gcc
On Fri, Aug 6, 2021 at 2:47 PM Stefan Kanthak  wrote:
>
> Gabriel Paubert  wrote:
>
> > Hi,
> >
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert  wrote:
> >>
> >>
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>
> >> >>    .intel_syntax
> >> >>    .text
> >> >>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>  5:   48 f7 d8         neg rax
> >> >>                      # jz  .L0              # argument zero?
> >> >>  8:   70 16            jo  .L0              # argument indefinite?
> >> >>                                             # argument overflows 64-bit integer?
> >> >>  a:   48 f7 d8         neg rax
> >> >>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
> >> >> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >> 20:   c3         .L0:  ret
> >> >>    .end
> >> >
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >>
> >> Right, I overlooked this fault. Thanks for pointing out.
> >>
> >> > So using your code may require some option (-ffast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >>
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisons.
> >
> > Indeed, but what I had in mind was something along the following lines:
> >
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return   # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> >
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
>
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, both in the code GCC generates and in my code.
>
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
>
> Correct.
>
> > I don't know if it's faster than your new code,
>
> It should be faster.
>
> > I'm almost sure that it's shorter.
>
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
> 5+4+5+5+2=21 bytes.
>
> JFTR: better use "add rax,rax; shr rax,53" instead of
>   "shr rax,52; and eax,0x7ff" and save 2 bytes.
>
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
> movq    rax, xmm0    # rax = argument
> add rax, rax
> shr rax, 53  # rax = exponent of |argument|
> cmp eax, BIAS + 53
> jae .Lexit   # argument indefinite?
>  # |argument| >= 0x1.0p53?
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> psrlq   xmm0, 63
> psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end
>
> @Richard Biener (et. al.):
>
> 1. Is a primitive for "floating-point > 2**x", which generates such
>an "integer" code sequence, already available, at least for
>float/binary32 and double/binary64?

Not that I know, but it should be possible to craft that.

> 2. the procedural code generator for __builtin_trunc() etc.  uses
>__builtin_fabs() and __builtin_copysign() as building blocks.
>These would need to (and of course should) be modified to generate
>psllq/psrlq pairs instead of andpd/andnpd referencing a memory
>location with either -0.0 or ~(-0.0).
>
> For -ffast-math, where the sign of -0.0 is not handled and the spurious
> invalid floating-point exception for |argument| >= 2**63 is acceptable,
> it boils down to:
>
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
> cvttsd2si rax, xmm0  # rax = trunc(argument)
> jo  .Lexit   # argument indefinite?
>  # |argument| > 0x1.0p63?
> cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
> .L0:    ret
> .end
>
> [...]
>
> >> Right, the conversions dominate both the original and the code I posted.
> >> It's easy to get rid of them, with still slightly shorter and faster
> >> branchless code (17 

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Stefan Kanthak
Gabriel Paubert  wrote:

> Hi,
> 
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert  wrote:
>> 
>> 
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:

>> >>    .intel_syntax
>> >>    .text
>> >>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
>> >>  5:   48 f7 d8         neg rax
>> >>                      # jz  .L0              # argument zero?
>> >>  8:   70 16            jo  .L0              # argument indefinite?
>> >>                                             # argument overflows 64-bit integer?
>> >>  a:   48 f7 d8         neg rax
>> >>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> >> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
>> >> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> >> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> >> 20:   c3         .L0:  ret
>> >>    .end
>> > 
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>> 
>> Right, I overlooked this fault. Thanks for pointing out.
>> 
>> > So using your code may require some option (-ffast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>> 
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisons.
> 
> Indeed, but what I had in mind was something along the following lines:
> 
> movq rax,xmm0   # and copy rax to say rcx, if needed later
> shrq rax,52 # move sign and exponent to 12 LSBs 
> andl eax,0x7ff  # mask the sign
> cmpl eax,0x434  # value to be checked
> ja return   # exponent too large, we're done (what about NaNs?)
> cvttsd2si rax,xmm0 # safe after exponent check
> cvtsi2sd xmm0,rax  # conversion done
> 
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).

The sign of -0.0 is the only corner case and already handled in my code.
Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
preserved, both in the code GCC generates and in my code.

> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.

Correct.
 
> I don't know if it's faster than your new code,

It should be faster.

> I'm almost sure that it's shorter.

"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
5+4+5+5+2=21 bytes.

JFTR: better use "add rax,rax; shr rax,53" instead of
  "shr rax,52; and eax,0x7ff" and save 2 bytes.

Complete properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):

.code64
.intel_syntax
.equ    BIAS, 1023
.text
movq    rax, xmm0    # rax = argument
add rax, rax
shr rax, 53  # rax = exponent of |argument|
cmp eax, BIAS + 53
jae .Lexit   # argument indefinite?
 # |argument| >= 0x1.0p53?
cvttsd2si rax, xmm0  # rax = trunc(argument)
cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
psrlq   xmm0, 63
psllq   xmm0, 63 # xmm0 = (argument & -0.0) ? -0.0 : 0.0
orpd    xmm0, xmm1   # xmm0 = trunc(argument)
.L0:    ret
.end

@Richard Biener (et. al.):

1. Is a primitive for "floating-point > 2**x", which generates such
   an "integer" code sequence, already available, at least for
   float/binary32 and double/binary64?

2. the procedural code generator for __builtin_trunc() etc.  uses
   __builtin_fabs() and __builtin_copysign() as building blocks.
   These would need to (and of course should) be modified to generate
   psllq/psrlq pairs instead of andpd/andnpd referencing a memory
   location with either -0.0 or ~(-0.0).

For -ffast-math, where the sign of -0.0 is not handled and the spurious
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:

.code64
.intel_syntax
.equ    BIAS, 1023
.text
cvttsd2si rax, xmm0  # rax = trunc(argument)
jo  .Lexit   # argument indefinite?
 # |argument| > 0x1.0p63?
cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
.L0:    ret
.end

[...]

>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>> 
>> .code64
>> .intel_syntax noprefix
>> .text
>>0:   48 b8 00 00 00 00 00 00 30 43   mov rax, 0x4330
>>a:   66 48 0f 6e d0  

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> 
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >> 
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >> 
> >> .text
> >>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >>                 4: R_X86_64_PC32   .rdata
> >>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >>                 c: R_X86_64_PC32   .rdata
> >>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
> >>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
> >>   18:   66 0f 54 da andpd  %xmm2, %xmm3
> >>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
> >>   20:   76 16   jbe    38 <_trunc+0x38>
> >>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
> >>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
> >>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
> >>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
> >>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
> >>   38:   c3  retq
> >> 
> >> .rdata
> >> .align 8
> >>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> >> 00 00 30 43
> >> 00 00 00 00
> >> 00 00 00 00
> >> .align 16
> >>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> >> ff ff ff 7f
> >>   18:   00 00 00 00 .quad  0.0
> >> 00 00 00 00
> >> .end
> >> 
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >>   while in the worst case they yield a page fault!
> >> 
> >> 
> >> Properly optimized, shorter and faster code, using only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
> >> accesses and saving at least 16 + 32 bytes, follows:
> >> 
> >>    .intel_syntax
> >>    .text
> >>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
> >>  5:   48 f7 d8         neg rax
> >>                      # jz  .L0              # argument zero?
> >>  8:   70 16            jo  .L0              # argument indefinite?
> >>                                             # argument overflows 64-bit integer?
> >>  a:   48 f7 d8         neg rax
> >>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
> >> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> 20:   c3         .L0:  ret
> >>    .end
> > 
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
> 
> Right, I overlooked this fault. Thanks for pointing out.
> 
> > So using your code may require some option (-ffast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
> 
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisons.

Indeed, but what I had in mind was something along the following lines:

movq rax,xmm0   # and copy rax to say rcx, if needed later
shrq rax,52 # move sign and exponent to 12 LSBs 
andl eax,0x7ff  # mask the sign
cmpl eax,0x434  # value to be checked
ja return   # exponent too large, we're done (what about NaNs?)
cvttsd2si rax,xmm0 # safe after exponent check
cvtsi2sd xmm0,rax  # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively) 
start the conversions early, so the dependency chain is rather short.

I don't know if it's faster than your new code, but I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.

> 
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
> 
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
> 
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> > 
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd 

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Ravier via Gcc



On 8/5/21 11:42 AM, Gabriel Paubert wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> .text
>>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>>                 4: R_X86_64_PC32   .rdata
>>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>>                 c: R_X86_64_PC32   .rdata
>>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
>>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
>>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
>>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
>>   20:   76 16                   jbe    38 <_trunc+0x38>
>>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
>>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
>>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
>>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
>>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
>>   38:   c3                      retq
>>
>> .rdata
>> .align 8
>>    0:   00 00 00 00   .LC0:   .quad  0x1.0p52
>>         00 00 30 43
>>         00 00 00 00
>>         00 00 00 00
>> .align 16
>>   10:   ff ff ff ff   .LC1:   .quad  ~(-0.0)
>>         ff ff ff 7f
>>   18:   00 00 00 00           .quad  0.0
>>         00 00 00 00
>> .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>>       while in the worst case they yield a page fault!
>>
>> Properly optimized, shorter and faster code, using only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
>> accesses and saving at least 16 + 32 bytes, follows:
>>
>>    .intel_syntax
>>    .text
>>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
>>  5:   48 f7 d8         neg rax
>>                      # jz  .L0              # argument zero?
>>  8:   70 16            jo  .L0              # argument indefinite?
>>                                             # argument overflows 64-bit integer?
>>  a:   48 f7 d8         neg rax
>>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
>> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> 20:   c3         .L0:  ret
>>    .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-ffast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

`-fno-signed-zeros` does that, if you need it

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
>  |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using the register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
>
> Regards,
> Gabriel
>
>> regards
>> Stefan


--
_
Gabriel RAVIER
First year student at Epitech
+33 6 36 46 16 43
gabriel.rav...@epitech.eu
11 Quai Finkwiller
67000 STRASBOURG



Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Stefan Kanthak
Gabriel Paubert  wrote:


> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>> 
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>> 
>> .text
>>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>>                 4: R_X86_64_PC32   .rdata
>>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>>                 c: R_X86_64_PC32   .rdata
>>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>>   20:   76 16   jbe    38 <_trunc+0x38>
>>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
>>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
>>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
>>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
>>   38:   c3  retq
>> 
>> .rdata
>> .align 8
>>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
>> 00 00 30 43
>> 00 00 00 00
>> 00 00 00 00
>> .align 16
>>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
>> ff ff ff 7f
>>   18:   00 00 00 00 .quad  0.0
>> 00 00 00 00
>> .end
>> 
>> JFTR: in the best case, the memory accesses cost several cycles,
>>   while in the worst case they yield a page fault!
>> 
>> 
>> Properly optimized, shorter and faster code, using only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
>> and saving at least 16 + 32 bytes, follows:
>> 
>>    .intel_syntax
>>    .text
>>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
>>  5:   48 f7 d8         neg rax
>>                      # jz  .L0              # argument zero?
>>  8:   70 16            jo  .L0              # argument indefinite?
>>                                             # argument overflows 64-bit integer?
>>  a:   48 f7 d8         neg rax
>>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
>> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> 20:   c3         .L0:  ret
>>    .end
> 
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.

Right, I overlooked this fault. Thanks for pointing out.

> So using your code may require some option (-ffast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.

The whole idea behind these implementations is to get rid of loading
floating-point constants to perform comparisons.

> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).

Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.
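
A quick self-check of that requirement (C Annex F: trunc(±0) returns ±0,
and a result that truncates to zero takes the sign of the argument); the
tiny harness is mine:

    #include <assert.h>
    #include <math.h>

    int main(void)
    {
        assert(signbit(trunc(-0.0)));   /* trunc(-0.0) is -0.0 */
        assert(signbit(trunc(-0.5)));   /* truncates to -0.0, not +0.0 */
        assert(!signbit(trunc(0.5)));   /* truncates to +0.0 */
        return 0;
    }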

> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
> 
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax

I don't know how GCC generates the code for builtins, and what kind of
templates it uses: the second goal was to minimize register usage.

> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
> 
> With your patch:
> 
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>  
> where the ? means that the following instructions are speculated.  
> 
> With an auxiliary register there are two dependency chains:
> 
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump

Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.

> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them, with still slightly shorter and faster
branchless code (17 instructions, 84 bytes, instead of 

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Richard Biener via Gcc
On Thu, Aug 5, 2021 at 11:44 AM Gabriel Paubert  wrote:
>
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> > Hi,
> >
> > targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> > following code (13 instructions using 57 bytes, plus 4 quadwords
> > using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >
> > .text
> >    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >                 4: R_X86_64_PC32   .rdata
> >    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >                 c: R_X86_64_PC32   .rdata
> >   10:   66 0f 28 d8 movapd %xmm0, %xmm3
> >   14:   66 0f 28 c8 movapd %xmm0, %xmm1
> >   18:   66 0f 54 da andpd  %xmm2, %xmm3
> >   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
> >   20:   76 16   jbe    38 <_trunc+0x38>
> >   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
> >   27:   66 0f ef c0 pxor   %xmm0, %xmm0
> >   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
> >   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
> >   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
> >   38:   c3  retq
> >
> > .rdata
> > .align 8
> >0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> > 00 00 30 43
> > 00 00 00 00
> > 00 00 00 00
> > .align 16
> >   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> > ff ff ff 7f
> >   18:   00 00 00 00 .quad  0.0
> > 00 00 00 00
> > .end
> >
> > JFTR: in the best case, the memory accesses cost several cycles,
> >   while in the worst case they yield a page fault!
> >
> >
> > Properly optimized, shorter and faster code, using only 9 instructions
> > in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
> > accesses and saving at least 16 + 32 bytes, follows:
> >
> >    .intel_syntax
> >    .text
> >  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
> >  5:   48 f7 d8         neg rax
> >                      # jz  .L0              # argument zero?
> >  8:   70 16            jo  .L0              # argument indefinite?
> >                                             # argument overflows 64-bit integer?
> >  a:   48 f7 d8         neg rax
> >  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> > 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
> > 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> > 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> > 20:   c3         .L0:  ret
> >    .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.  So
> using your code may require some option (-ffast-math comes to mind), or
> you need at least a check on the exponent before cvttsd2si.
>
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
>
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
>
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
>  |->mov->neg->jump
>
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.

Btw, the code to emit these sequences is in
gcc/config/i386/i386-expand.c:ix86_expand_trunc
and friends.

Richard.

> Regards,
> Gabriel
>
>
>
> >
> > regards
> > Stefan
>
>


Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
> 
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> 
> .text
>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
>                 4: R_X86_64_PC32   .rdata
>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
>                 c: R_X86_64_PC32   .rdata
>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>   20:   76 16   jbe    38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
>   38:   c3  retq
> 
> .rdata
> .align 8
>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> ff ff ff 7f
>   18:   00 00 00 00 .quad  0.0
> 00 00 00 00
> .end
> 
> JFTR: in the best case, the memory accesses cost several cycles,
>   while in the worst case they yield a page fault!
> 
> 
> Properly optimized, shorter and faster code, using only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
> 
>    .intel_syntax
>    .text
>  0:   f2 48 0f 2c c0   cvttsd2si rax, xmm0  # rax = trunc(argument)
>  5:   48 f7 d8         neg rax
>                      # jz  .L0              # argument zero?
>  8:   70 16            jo  .L0              # argument indefinite?
>                                             # argument overflows 64-bit integer?
>  a:   48 f7 d8         neg rax
>  d:   f2 48 0f 2a c8   cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> 12:   66 0f 73 d0 3f   psrlq   xmm0, 63
> 17:   66 0f 73 f0 3f   psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> 1c:   66 0f 56 c1      orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> 20:   c3         .L0:  ret
>    .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-ffast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax

The reason is latency, in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
  
where the ? means that the following instructions are speculated.  

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
 |->mov->neg->jump

Actually some OoO cores just eliminate register copies using register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.

Regards,
Gabriel



> 
> regards
> Stefan