[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #8 from Hongtao.liu  ---

> But, if you're going to improve constant generation, please make it so that
> it can recognize not only the particular pattern described in this bug. More
> importantly, it should recognize the all-ones case (as a single pcmpeq) as a
> starting point. Then it can apply shifts to achieve the final result from
> the all-ones vector - shifts of any width, length or direction, including
> psrldq/pslldq. This would improve generated code in a wider range of cases.
Yes, we will try to do that. Generally, folding intrinsics into compiler IR
helps performance, and for this case we need to optimize codegen for this
special immediate broadcast (all-ones + shift).

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-16 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #7 from andysem at mail dot ru ---
To be clear, I'm not asking the compiler to recognize the particular pattern of
alternating 0x00 and 0xFF bytes, because hardcoding this particular pattern
won't improve generated code in other cases.

Rather, I'm asking to tune down code transformations for intrinsics. If the
developer wrote a sequence of intrinsics to generate a constant then he
probably wanted that sequence instead of a simple _mm_set1_epi32 or a load from
memory.

But, if you're going to improve constant generation, please make it so that it
can recognize not only the particular pattern described in this bug. More
importantly, it should recognize the all-ones case (as a single pcmpeq) as a
starting point. Then it can apply shifts to achieve the final result from the
all-ones vector - shifts of any width, length or direction, including
psrldq/pslldq. This would improve generated code in a wider range of cases.

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-16 Thread andysem at mail dot ru via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #6 from andysem at mail dot ru ---
(In reply to Andrew Pinski from comment #1)
> >and gcc 12 generates a worse code:
> 
> it is not really worse; it depends on how fast moving between the register
> sets is.

I meant "worse" compared to vpcmpeq+vpsrlw pair.

(Side note about the broadcast version: it could have been smaller if it used a
32-bit constant and vpbroadcastd. vpcmpeq+vpsrlw would still be better in this
particular case, but if broadcast is needed, a smaller footprint code is
preferred.)

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Richard Biener  changed:

   What              |Removed      |Added
   Last reconfirmed  |             |2023-01-16
   Status            |UNCONFIRMED  |NEW
   Ever confirmed    |0            |1

--- Comment #5 from Richard Biener  ---
Confirmed.  We expand from

  return { 71777214294589695, 71777214294589695, 71777214294589695,
71777214294589695 };

where we could reduce the DImode broadcast to a HImode one (if that exists).
But sure, the x86 backend could implement the intrinsic suggested way to
generate this particular pattern.

I'll also note that -O0 produces quite bad code here.

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-15 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Alexander Monakov  changed:

   What              |Removed      |Added
   CC                |             |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
(In reply to Hongtao.liu from comment #3)
> > and gcc 12 generates a worse code:
> > 
> > movabs  rax, 71777214294589695
> > vmovq   xmm1, rax
> > vpbroadcastq ymm0, xmm1
> > ret
> > 
> 
> It's intentional, introduced by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09,
> since microbenchmarks show that moving from an immediate is faster than
> loading from memory.

But the bug is not asking you to reinstate loading from memory. The bug is
asking you to notice that the result can be constructed via cmpeq+psrlw, which
is even better than a broadcast (cmpeq with dst same as src is usually a
dependency-breaking instruction that does not occupy an execution port).

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-15 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Hongtao.liu  changed:

   What              |Removed      |Added
   CC                |             |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu  ---

> and gcc 12 generates a worse code:
> 
> movabs  rax, 71777214294589695
> vmovq   xmm1, rax
> vpbroadcastq ymm0, xmm1
> ret
> 

It's intentional, introduced by commit edafb35bdadf309ebb9d1eddc5549f9e1ad49c09,
since microbenchmarks show that moving from an immediate is faster than loading
from memory.

> In all cases, the compiler flags are: -O3 -march=haswell
> 
> Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9
> 
> I think the compiler should follow the code in intrinsics more closely since
> despite the apparent equivalence, the choice of instructions can have
> performance implications. The original code that is written by the developer
> is better anyway, so it's not clear why the compiler is being so creative in
> this case.

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #2 from Andrew Pinski  ---
r12-1958-gedafb35bdadf30 changed the behavior in GCC 12 for the better
(see the commit message, which shows this is better than doing a memory load).

[Bug target/108401] gcc defeats vector constant generation with intrinsics

2023-01-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #1 from Andrew Pinski  ---
>and gcc 12 generates a worse code:


it is not really worse; it depends on how fast moving between the register
sets is.