Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-20 Thread Jakub Jelinek via Gcc
On Tue, Jun 20, 2023 at 03:03:19PM +, Michael Matz via Gcc wrote:
> Hello,
> 
> On Tue, 20 Jun 2023, Jakub Jelinek via Gcc wrote:
> 
> > ce1 pass results in emit_conditional_move with
> > (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
> > operands in the GCC 11 case and so is successfully matched by
> > ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
> > But, in GCC 12+, emit_conditional_move is called with
> > (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 
> > 0.0 [0x0.0p+0])
> > instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 
> > [0x0.0p+0])
> > value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
> > directly in a move.  The reason for the difference is exactly that
> > because cheap SSE constant can be moved directly into a reg, it is done so
> > instead of reusing a pseudo that contains that value already.
> 
> But reg84 is _also_ used as operand of emit_conditional_move, so there's 
> no reason to not also use it as third operand.  It seems more canonical to 
> call a function with
> 
>   X-containing-B, A, B
> 
> than with
> 
>   X-containing-B, A, something-equal-to-B-but-not-B
> 
> so either the (const_double) RTL should be used in both, or reg84 should, 
> but not a mix.  Exactly so to ...

If ifcvt.cc knows that, then sure, but it can't be (easily) worked around on the
backend side...

Jakub



Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-20 Thread Michael Matz via Gcc
Hello,

On Tue, 20 Jun 2023, Jakub Jelinek via Gcc wrote:

> ce1 pass results in emit_conditional_move with
> (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
> operands in the GCC 11 case and so is successfully matched by
> ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
> But, in GCC 12+, emit_conditional_move is called with
> (gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 
> 0.0 [0x0.0p+0])
> instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 [0x0.0p+0])
> value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
> directly in a move.  The reason for the difference is exactly that
> because cheap SSE constant can be moved directly into a reg, it is done so
> instead of reusing a pseudo that contains that value already.

But reg84 is _also_ used as operand of emit_conditional_move, so there's 
no reason to not also use it as third operand.  It seems more canonical to 
call a function with

  X-containing-B, A, B

than with

  X-containing-B, A, something-equal-to-B-but-not-B

so either the (const_double) RTL should be used in both, or reg84 should, 
but not a mix.  Exactly so to ...

> actually a minmax.  Even if it allowed the cheap SSE constants,
> it wouldn't know that r84 is also zero (unless the expander checks that
> it is a pseudo with a single setter and verifies it or something similar).

... not have this problem.


Ciao,
Michael.


Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-20 Thread Jakub Jelinek via Gcc
On Tue, Jun 20, 2023 at 10:15:37AM +0200, Richard Biener wrote:
> On Mon, Jun 19, 2023 at 9:45 PM Jakub Jelinek via Gcc  wrote:
> >
> > On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> > > I noticed that a simple function like
> > > auto relu( float x ) {
> > > return x > 0.f ? x : 0.f;
> > > }
> > > compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> > > -O3 -mavx2 the former compiles above function to
> >
> > Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
> > if you are convinced that loading the constant from memory is faster.
> > Another possibility is
> > vxorps xmm1, xmm1, xmm1
> > vmaxss xmm0, xmm0, xmm1
> > ret
> > which doesn't need to wait for the memory.
> > This changed with https://gcc.gnu.org/r12-7693
> 
> I guess we previously were able to see that one operand of
> the comparison was not NaN.  Maybe some REG_EQUAL
> note can come to the rescue here?

ce1 pass results in emit_conditional_move with
(gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (reg:SF 84)
operands in the GCC 11 case and so is successfully matched by
ix86_expand_fp_movcc as ix86_expand_sse_fp_minmax.
But, in GCC 12+, emit_conditional_move is called with
(gt (reg/v:SF 83 [ x ]) (reg:SF 84)), (reg/v:SF 83 [ x ]), (const_double:SF 0.0 
[0x0.0p+0])
instead (reg:SF 84 in both cases contains the (const_double:SF 0.0 [0x0.0p+0])
value, in the GCC 11 case loaded from memory, in the GCC 12+ case set
directly in a move.  The reason for the difference is exactly that
because cheap SSE constant can be moved directly into a reg, it is done so
instead of reusing a pseudo that contains that value already.
In the latter case ix86_expand_fp_movcc is called even not with the
const_double because the expander doesn't allow immediates, but with
it forced into some other register, so it can't really find out it is
actually a minmax.  Even if it allowed the cheap SSE constants,
it wouldn't know that r84 is also zero (unless the expander checks that
it is a pseudo with a single setter and verifies it or something similar).

Jakub



Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-20 Thread Richard Biener via Gcc
On Mon, Jun 19, 2023 at 9:45 PM Jakub Jelinek via Gcc  wrote:
>
> On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> > I noticed that a simple function like
> > auto relu( float x ) {
> > return x > 0.f ? x : 0.f;
> > }
> > compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> > -O3 -mavx2 the former compiles above function to
>
> Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
> if you are convinced that loading the constant from memory is faster.
> Another possibility is
> vxorps xmm1, xmm1, xmm1
> vmaxss xmm0, xmm0, xmm1
> ret
> which doesn't need to wait for the memory.
> This changed with https://gcc.gnu.org/r12-7693

I guess we previously were able to see that one operand of
the comparison was not NaN.  Maybe some REG_EQUAL
note can come to the rescue here?

> >
> > relu(float):
> > vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
> > ret
> > .LC0:
> > .long 0
> >
> > which is what I would naively expect and what also clang essentially does
> > (clang actually uses an xor before the maxss to get the zero). The latter,
> > however, compiles the function to
> >
> > relu(float):
> > vxorps xmm1, xmm1, xmm1
> > vcmpltss xmm2, xmm1, xmm0
> > vblendvps xmm0, xmm1, xmm0, xmm2
> > ret
> >
> > which looks like a missed optimisation. Does anyone know if there's a
> > reason for the changed behaviour?
>
> Jakub
>


Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-19 Thread Marc Glisse via Gcc

On Mon, 19 Jun 2023, André Günther via Gcc wrote:


I noticed that a simple function like
auto relu( float x ) {
   return x > 0.f ? x : 0.f;
}
compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
-O3 -mavx2 the former compiles above function to

relu(float):
   vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
   ret
.LC0:
   .long 0

which is what I would naively expect and what also clang essentially does
(clang actually uses an xor before the maxss to get the zero). The latter,
however, compiles the function to

relu(float):
   vxorps xmm1, xmm1, xmm1
   vcmpltss xmm2, xmm1, xmm0
   vblendvps xmm0, xmm1, xmm0, xmm2
   ret

which looks like a missed optimisation. Does anyone know if there's a
reason for the changed behaviour?


With -fno-signed-zeros -ffinite-math-only, gcc-12 still uses max instead 
of cmp+blend. So the first thing to check would be if both versions give 
the same result on negative 0 and NaN.


--
Marc Glisse


Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-19 Thread Jakub Jelinek via Gcc
On Mon, Jun 19, 2023 at 09:10:53PM +0200, André Günther via Gcc wrote:
> I noticed that a simple function like
> auto relu( float x ) {
> return x > 0.f ? x : 0.f;
> }
> compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
> -O3 -mavx2 the former compiles above function to

Such reports should go into gcc.gnu.org/bugzilla/, not to the mailing list,
if you are convinced that loading the constant from memory is faster.
Another possibility is
vxorps xmm1, xmm1, xmm1
vmaxss xmm0, xmm0, xmm1
ret
which doesn't need to wait for the memory.
This changed with https://gcc.gnu.org/r12-7693

> 
> relu(float):
> vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
> ret
> .LC0:
> .long 0
> 
> which is what I would naively expect and what also clang essentially does
> (clang actually uses an xor before the maxss to get the zero). The latter,
> however, compiles the function to
> 
> relu(float):
> vxorps xmm1, xmm1, xmm1
> vcmpltss xmm2, xmm1, xmm0
> vblendvps xmm0, xmm1, xmm0, xmm2
> ret
> 
> which looks like a missed optimisation. Does anyone know if there's a
> reason for the changed behaviour?

Jakub