[Bug target/100711] Miss optimization for pandn

2023-07-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #15 from CVS Commits  ---
The master branch has been updated by Jan Beulich :

https://gcc.gnu.org/g:e007369c8b67bcabd57c4fed8cff2a6db82e78e6

commit r14-2314-ge007369c8b67bcabd57c4fed8cff2a6db82e78e6
Author: Jan Beulich 
Date:   Wed Jul 5 09:49:16 2023 +0200

x86: yet more PR target/100711-like splitting

Following two-operand bitwise operations, add another splitter to also
deal with not followed by broadcast all on its own, which can be
expressed as simple embedded broadcast instead once a broadcast operand
is actually permitted in the respective insn. While there also permit
a broadcast operand in the corresponding expander.

gcc/

PR target/100711
* config/i386/sse.md: New splitters to simplify
not;vec_duplicate as a singular vpternlog.
(one_cmpl2): Allow broadcast for operand 1.
(one_cmpl2): Likewise.

gcc/testsuite/

PR target/100711
* gcc.target/i386/pr100711-6.c: New test.

[Bug target/100711] Miss optimization for pandn

2023-07-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #14 from CVS Commits  ---
The master branch has been updated by Jan Beulich :

https://gcc.gnu.org/g:fa58c2871a1235cb5ba475303a2bd11ae90416d5

commit r14-2313-gfa58c2871a1235cb5ba475303a2bd11ae90416d5
Author: Jan Beulich 
Date:   Wed Jul 5 09:48:47 2023 +0200

x86: further PR target/100711-like splitting

With respective two-operand bitwise operations now expressable by a
single VPTERNLOG, add splitters to also deal with ior and xor
counterparts of the original and-only case. Note that the splitters need
to be separate, as the placement of "not" differs in the final insns
(*iornot3, *xnor3) which are intended to pick up one half of
the result.

gcc/

PR target/100711
* config/i386/sse.md: New splitters to simplify
not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}.

gcc/testsuite/

PR target/100711
* gcc.target/i386/pr100711-4.c: New test.
* gcc.target/i386/pr100711-5.c: New test.

[Bug target/100711] Miss optimization for pandn

2023-07-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #13 from CVS Commits  ---
The master branch has been updated by Jan Beulich :

https://gcc.gnu.org/g:3186ef0cb9e2d25e8455f9990e50187e3d1eee19

commit r14-2312-g3186ef0cb9e2d25e8455f9990e50187e3d1eee19
Author: Jan Beulich 
Date:   Wed Jul 5 09:48:19 2023 +0200

x86: allow memory operand for AVX2 splitter for PR target/100711

The intended broadcast (with AVX512) can very well be done right from
memory.

gcc/

PR target/100711
* config/i386/sse.md: Permit non-immediate operand 1 in AVX2
form of splitter for PR target/100711.

[Bug target/100711] Miss optimization for pandn

2023-05-27 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #12 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:ed6a9a35799c9298321d1589533767c2bb6f8d42

commit r14-1307-ged6a9a35799c9298321d1589533767c2bb6f8d42
Author: liuhongt 
Date:   Thu May 25 16:14:14 2023 +0800

Split notl + pbraodcast + pand to pbroadcast + pandn more modes.

r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitter to
transform notl + pbroadcast + pand to pbroadcast + pandn for
VI124_AVX2 which leaves out all DI-element-size ones as
well as all 512-bit ones.
This patch extend the splitter to VI_AVX2 which will handle DImode for
AVX2, and V64QImode,V32HImode,V16SImode,V8DImode for AVX512.

gcc/ChangeLog:

PR target/100711
* config/i386/sse.md (*andnot3): Extend below splitter
to VI_AVX2 to cover more modes.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr100711-2.c: Add v4di/v2di testcases.
* gcc.target/i386/pr100711-3.c: New test.

[Bug target/100711] Miss optimization for pandn

2023-05-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #11 from Hongtao.liu  ---
(In reply to jbeulich from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > We don't have single instruction for V8HI/V16QImode broadcast without AVX2,
> > that's why the first splitter only have VI48_128.
> 
> Does this matter? The splitters are about subsuming the "not". How the
> "vec_duplicate" is carried out isn't really relevant here, is it?

It's splitted to 2 patterns, but there's no V8HI/V16QImode define_insn for the
first pattern  w/o AVX2, there will be ICE of unrecognisable insn.

17110  [(set (match_dup 3)
17111(vec_duplicate:VI48_128 (match_dup 1)))
17112   (set (match_dup 0)
17113(and:VI48_128 (not:VI48_128 (match_dup 3))
17114  (match_dup 2)))]

[Bug target/100711] Miss optimization for pandn

2023-05-25 Thread jbeulich at suse dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #10 from jbeulich at suse dot com ---
(In reply to Hongtao.liu from comment #9)
> We don't have single instruction for V8HI/V16QImode broadcast without AVX2,
> that's why the first splitter only have VI48_128.

Does this matter? The splitters are about subsuming the "not". How the
"vec_duplicate" is carried out isn't really relevant here, is it?

[Bug target/100711] Miss optimization for pandn

2023-05-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #9 from Hongtao.liu  ---
(In reply to jbeulich from comment #8)
> Since the commit doesn't really explain it (maybe it's obvious to others,
> but it isn't to me), may I ask why two splitters were introduced, yet then
> still not covering all possible modes? VI48_128 only covers two of the four
> possible SSE2 modes, while VI124_AVX2 leaves out all DI-element-size ones as
> well as all 512-bit ones. Shouldn't both be folded, using VI_AVX2 as the
> mode iterator?
We don't have single instruction for V8HI/V16QImode broadcast without AVX2,
that's why the first splitter only have VI48_128.
And yes, for the second splitter, I think we should use VI_AVX2 to cover all
modes.
> 
> As an aside, it is also interesting that the 1st splitter uses TARGET_SSE
> without the corresponding testcase limiting itself to just SSE. When
> building that testcase with SSE2 turned off, foo() uses shufps and andnps as
> expected, but the splitter doesn't appear to come into play at all for
> bar(), when really it is only the broadcast that needs synthesizing, while
> andnps can be used regardless of mode.

[Bug target/100711] Miss optimization for pandn

2023-05-24 Thread jbeulich at suse dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

jbeulich at suse dot com changed:

   What|Removed |Added

 CC||jbeulich at suse dot com

--- Comment #8 from jbeulich at suse dot com ---
Since the commit doesn't really explain it (maybe it's obvious to others, but
it isn't to me), may I ask why two splitters were introduced, yet then still
not covering all possible modes? VI48_128 only covers two of the four possible
SSE2 modes, while VI124_AVX2 leaves out all DI-element-size ones as well as all
512-bit ones. Shouldn't both be folded, using VI_AVX2 as the mode iterator?

As an aside, it is also interesting that the 1st splitter uses TARGET_SSE
without the corresponding testcase limiting itself to just SSE. When building
that testcase with SSE2 turned off, foo() uses shufps and andnps as expected,
but the splitter doesn't appear to come into play at all for bar(), when really
it is only the broadcast that needs synthesizing, while andnps can be used
regardless of mode.

[Bug target/100711] Miss optimization for pandn

2021-11-30 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

Hongtao.liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #7 from Hongtao.liu  ---
Fixed in GCC12.

[Bug target/100711] Miss optimization for pandn

2021-11-30 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #6 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:c39d77f252e895306ef88c1efb3eff04e4232554

commit r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554
Author: Roger Sayle 
Date:   Tue Nov 30 08:35:39 2021 +

x86_64: PR target/100711: Splitters for pandn

This patch addresses PR target/100711 by introducing define_split
patterns so that not/broadcast/pand may be simplified (by combine)
to broadcast/pandn.  This introduces two splitters one for optimizing
pandn on TARGET_SSE for V4SI and V2DI, and another for vpandn on
TARGET_AVX2 for V16QI, V8HI, V32QI, V16HI and V8SI.  Each splitter
has its own new testcase.

I've also confirmed that not/broadcast/pandn is already getting
simplified to broadcast/pand by the middle-end optimizers.

2021-11-30  Roger Sayle  
Uroš Bizjak  

gcc/ChangeLog
PR target/100711
* config/i386/sse.md (define_split): New splitters to simplify
not;vec_duplicate;and as vec_duplicate;andn.

gcc/testsuite/ChangeLog
PR target/100711
* gcc.target/i386/pr100711-1.c: New test case.
* gcc.target/i386/pr100711-2.c: New test case.

[Bug target/100711] Miss optimization for pandn

2021-05-25 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #5 from Segher Boessenkool  ---
(In reply to Hongtao.liu from comment #4)
> > Even w/ canonical RTL, i think a combine splitter is also needed here, the
> > canonical RTL only helps combine/forwprop to match more possibility but
> > won't split patterns by itselies.
> 
> I was wrong, i thought combine only support n->1 combining, but actually
> pass_combine also support 3->2 combining which means a define_split is not
> needed here.

->1 and ->2, yes.  But note that combine can often split
RTL
without having an explicit define_split; and also note the opposite, combine
does
not always pick the best spot to split, "manual" help (a define_split) can be
needed for good results.

[Bug target/100711] Miss optimization for pandn

2021-05-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #4 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #3)
> 
> (In reply to Segher Boessenkool from comment #2)
> > (In reply to Richard Biener from comment #1)
> > > I suppose we're confused about the vec_duplicate.  Would generally 
> Even w/ canonical RTL, i think a combine splitter is also needed here, the
> canonical RTL only helps combine/forwprop to match more possibility but
> won't split patterns by itselies.

I was wrong, i thought combine only support n->1 combining, but actually
pass_combine also support 3->2 combining which means a define_split is not
needed here.

[Bug target/100711] Miss optimization for pandn

2021-05-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

--- Comment #3 from Hongtao.liu  ---


(In reply to Segher Boessenkool from comment #2)
> (In reply to Richard Biener from comment #1)
> > I suppose we're confused about the vec_duplicate.  Would generally swapping
> > the duplicate and the bit_not be profitable?  Eventually it's a
> > simplification
> > combine could try - I belive it has some cases where it tries variants of 
> > the
> > original instructions when combining.  Adding a combine helper pattern
> > looks like putting too much burden on the backend IMHO.
> > 
> > We don't have a generic nand optab so handling this in ISEL on gimple
> > isn't straight-forward.
> > 
> > But combine and/or forwprop could do this.
> 
> Combine never tries anything.  Combine makes *one* result; if that does not
> work,
> it does not do the combination.  (This is not completely true, but in essence
> that is how it works, and it has to to not have exponential complexity).
> 
> It would be good to define a canonical form for anything vec_duplicate.  It
> probably is a good idea to pull the vec_duplicate as far outside as possible?
> 
> Canonical forms hugely reduce the amount of work needed.  Compare to how
> "andc"
> is represented (canonically with the inverted input first), or how "nand" is
> (we
> write that as an "orcc", an "or" with both inputs inverted, in canonical
> RTL).
> Because only one form is allowed, we only have to check for that one form
> everywhere.
> 
> Confirmed.

Even w/ canonical RTL, i think a combine splitter is also needed here, the
canonical RTL only helps combine/forwprop to match more possibility but won't
split patterns by itselies.

[Bug target/100711] Miss optimization for pandn

2021-05-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

Segher Boessenkool  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2021-05-21

--- Comment #2 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #1)
> I suppose we're confused about the vec_duplicate.  Would generally swapping
> the duplicate and the bit_not be profitable?  Eventually it's a
> simplification
> combine could try - I belive it has some cases where it tries variants of the
> original instructions when combining.  Adding a combine helper pattern
> looks like putting too much burden on the backend IMHO.
> 
> We don't have a generic nand optab so handling this in ISEL on gimple
> isn't straight-forward.
> 
> But combine and/or forwprop could do this.

Combine never tries anything.  Combine makes *one* result; if that does not
work,
it does not do the combination.  (This is not completely true, but in essence
that is how it works, and it has to to not have exponential complexity).

It would be good to define a canonical form for anything vec_duplicate.  It
probably is a good idea to pull the vec_duplicate as far outside as possible?

Canonical forms hugely reduce the amount of work needed.  Compare to how "andc"
is represented (canonically with the inverted input first), or how "nand" is
(we
write that as an "orcc", an "or" with both inputs inverted, in canonical RTL).
Because only one form is allowed, we only have to check for that one form
everywhere.

Confirmed.

[Bug target/100711] Miss optimization for pandn

2021-05-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711

Richard Biener  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
7: r82:SI=~r89:SI
  REG_DEAD r89:SI
8: r88:V4SI=vec_duplicate(r82:SI)
  REG_DEAD r82:SI
9: r87:V4SI=r88:V4SI:V4SI
  REG_DEAD r90:V4SI
  REG_DEAD r88:V4SI

I suppose we're confused about the vec_duplicate.  Would generally swapping
the duplicate and the bit_not be profitable?  Eventually it's a simplification
combine could try - I belive it has some cases where it tries variants of the
original instructions when combining.  Adding a combine helper pattern
looks like putting too much burden on the backend IMHO.

We don't have a generic nand optab so handling this in ISEL on gimple
isn't straight-forward.

But combine and/or forwprop could do this.