[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #15 from CVS Commits --- The master branch has been updated by Jan Beulich : https://gcc.gnu.org/g:e007369c8b67bcabd57c4fed8cff2a6db82e78e6 commit r14-2314-ge007369c8b67bcabd57c4fed8cff2a6db82e78e6 Author: Jan Beulich Date: Wed Jul 5 09:49:16 2023 +0200 x86: yet more PR target/100711-like splitting Following two-operand bitwise operations, add another splitter to also deal with not followed by broadcast all on its own, which can be expressed as simple embedded broadcast instead once a broadcast operand is actually permitted in the respective insn. While there also permit a broadcast operand in the corresponding expander. gcc/ PR target/100711 * config/i386/sse.md: New splitters to simplify not;vec_duplicate as a singular vpternlog. (one_cmpl2): Allow broadcast for operand 1. (one_cmpl2): Likewise. gcc/testsuite/ PR target/100711 * gcc.target/i386/pr100711-6.c: New test.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #14 from CVS Commits --- The master branch has been updated by Jan Beulich : https://gcc.gnu.org/g:fa58c2871a1235cb5ba475303a2bd11ae90416d5 commit r14-2313-gfa58c2871a1235cb5ba475303a2bd11ae90416d5 Author: Jan Beulich Date: Wed Jul 5 09:48:47 2023 +0200 x86: further PR target/100711-like splitting With respective two-operand bitwise operations now expressable by a single VPTERNLOG, add splitters to also deal with ior and xor counterparts of the original and-only case. Note that the splitters need to be separate, as the placement of "not" differs in the final insns (*iornot3, *xnor3) which are intended to pick up one half of the result. gcc/ PR target/100711 * config/i386/sse.md: New splitters to simplify not;vec_duplicate;{ior,xor} as vec_duplicate;{iornot,xnor}. gcc/testsuite/ PR target/100711 * gcc.target/i386/pr100711-4.c: New test. * gcc.target/i386/pr100711-5.c: New test.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #13 from CVS Commits --- The master branch has been updated by Jan Beulich : https://gcc.gnu.org/g:3186ef0cb9e2d25e8455f9990e50187e3d1eee19 commit r14-2312-g3186ef0cb9e2d25e8455f9990e50187e3d1eee19 Author: Jan Beulich Date: Wed Jul 5 09:48:19 2023 +0200 x86: allow memory operand for AVX2 splitter for PR target/100711 The intended broadcast (with AVX512) can very well be done right from memory. gcc/ PR target/100711 * config/i386/sse.md: Permit non-immediate operand 1 in AVX2 form of splitter for PR target/100711.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #12 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:ed6a9a35799c9298321d1589533767c2bb6f8d42 commit r14-1307-ged6a9a35799c9298321d1589533767c2bb6f8d42 Author: liuhongt Date: Thu May 25 16:14:14 2023 +0800 Split notl + pbraodcast + pand to pbroadcast + pandn more modes. r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitter to transform notl + pbroadcast + pand to pbroadcast + pandn for VI124_AVX2 which leaves out all DI-element-size ones as well as all 512-bit ones. This patch extend the splitter to VI_AVX2 which will handle DImode for AVX2, and V64QImode,V32HImode,V16SImode,V8DImode for AVX512. gcc/ChangeLog: PR target/100711 * config/i386/sse.md (*andnot3): Extend below splitter to VI_AVX2 to cover more modes. gcc/testsuite/ChangeLog: * gcc.target/i386/pr100711-2.c: Add v4di/v2di testcases. * gcc.target/i386/pr100711-3.c: New test.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #11 from Hongtao.liu --- (In reply to jbeulich from comment #10) > (In reply to Hongtao.liu from comment #9) > > We don't have single instruction for V8HI/V16QImode broadcast without AVX2, > > that's why the first splitter only have VI48_128. > > Does this matter? The splitters are about subsuming the "not". How the > "vec_duplicate" is carried out isn't really relevant here, is it? It's splitted to 2 patterns, but there's no V8HI/V16QImode define_insn for the first pattern w/o AVX2, there will be ICE of unrecognisable insn. 17110 [(set (match_dup 3) 17111(vec_duplicate:VI48_128 (match_dup 1))) 17112 (set (match_dup 0) 17113(and:VI48_128 (not:VI48_128 (match_dup 3)) 17114 (match_dup 2)))]
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #10 from jbeulich at suse dot com --- (In reply to Hongtao.liu from comment #9) > We don't have single instruction for V8HI/V16QImode broadcast without AVX2, > that's why the first splitter only have VI48_128. Does this matter? The splitters are about subsuming the "not". How the "vec_duplicate" is carried out isn't really relevant here, is it?
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #9 from Hongtao.liu --- (In reply to jbeulich from comment #8) > Since the commit doesn't really explain it (maybe it's obvious to others, > but it isn't to me), may I ask why two splitters were introduced, yet then > still not covering all possible modes? VI48_128 only covers two of the four > possible SSE2 modes, while VI124_AVX2 leaves out all DI-element-size ones as > well as all 512-bit ones. Shouldn't both be folded, using VI_AVX2 as the > mode iterator? We don't have single instruction for V8HI/V16QImode broadcast without AVX2, that's why the first splitter only have VI48_128. And yes, for the second splitter, I think we should use VI_AVX2 to cover all modes. > > As an aside, it is also interesting that the 1st splitter uses TARGET_SSE > without the corresponding testcase limiting itself to just SSE. When > building that testcase with SSE2 turned off, foo() uses shufps and andnps as > expected, but the splitter doesn't appear to come into play at all for > bar(), when really it is only the broadcast that needs synthesizing, while > andnps can be used regardless of mode.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 jbeulich at suse dot com changed: What|Removed |Added CC||jbeulich at suse dot com --- Comment #8 from jbeulich at suse dot com --- Since the commit doesn't really explain it (maybe it's obvious to others, but it isn't to me), may I ask why two splitters were introduced, yet then still not covering all possible modes? VI48_128 only covers two of the four possible SSE2 modes, while VI124_AVX2 leaves out all DI-element-size ones as well as all 512-bit ones. Shouldn't both be folded, using VI_AVX2 as the mode iterator? As an aside, it is also interesting that the 1st splitter uses TARGET_SSE without the corresponding testcase limiting itself to just SSE. When building that testcase with SSE2 turned off, foo() uses shufps and andnps as expected, but the splitter doesn't appear to come into play at all for bar(), when really it is only the broadcast that needs synthesizing, while andnps can be used regardless of mode.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 Hongtao.liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #7 from Hongtao.liu --- Fixed in GCC12.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #6 from CVS Commits --- The master branch has been updated by Roger Sayle : https://gcc.gnu.org/g:c39d77f252e895306ef88c1efb3eff04e4232554 commit r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 Author: Roger Sayle Date: Tue Nov 30 08:35:39 2021 + x86_64: PR target/100711: Splitters for pandn This patch addresses PR target/100711 by introducing define_split patterns so that not/broadcast/pand may be simplified (by combine) to broadcast/pandn. This introduces two splitters one for optimizing pandn on TARGET_SSE for V4SI and V2DI, and another for vpandn on TARGET_AVX2 for V16QI, V8HI, V32QI, V16HI and V8SI. Each splitter has its own new testcase. I've also confirmed that not/broadcast/pandn is already getting simplified to broadcast/pand by the middle-end optimizers. 2021-11-30 Roger Sayle Uroš Bizjak gcc/ChangeLog PR target/100711 * config/i386/sse.md (define_split): New splitters to simplify not;vec_duplicate;and as vec_duplicate;andn. gcc/testsuite/ChangeLog PR target/100711 * gcc.target/i386/pr100711-1.c: New test case. * gcc.target/i386/pr100711-2.c: New test case.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #5 from Segher Boessenkool --- (In reply to Hongtao.liu from comment #4) > > Even w/ canonical RTL, i think a combine splitter is also needed here, the > > canonical RTL only helps combine/forwprop to match more possibility but > > won't split patterns by itselies. > > I was wrong, i thought combine only support n->1 combining, but actually > pass_combine also support 3->2 combining which means a define_split is not > needed here. ->1 and ->2, yes. But note that combine can often split RTL without having an explicit define_split; and also note the opposite, combine does not always pick the best spot to split, "manual" help (a define_split) can be needed for good results.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #4 from Hongtao.liu --- (In reply to Hongtao.liu from comment #3) > > (In reply to Segher Boessenkool from comment #2) > > (In reply to Richard Biener from comment #1) > > > I suppose we're confused about the vec_duplicate. Would generally > Even w/ canonical RTL, i think a combine splitter is also needed here, the > canonical RTL only helps combine/forwprop to match more possibility but > won't split patterns by itselies. I was wrong, i thought combine only support n->1 combining, but actually pass_combine also support 3->2 combining which means a define_split is not needed here.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 --- Comment #3 from Hongtao.liu --- (In reply to Segher Boessenkool from comment #2) > (In reply to Richard Biener from comment #1) > > I suppose we're confused about the vec_duplicate. Would generally swapping > > the duplicate and the bit_not be profitable? Eventually it's a > > simplification > > combine could try - I belive it has some cases where it tries variants of > > the > > original instructions when combining. Adding a combine helper pattern > > looks like putting too much burden on the backend IMHO. > > > > We don't have a generic nand optab so handling this in ISEL on gimple > > isn't straight-forward. > > > > But combine and/or forwprop could do this. > > Combine never tries anything. Combine makes *one* result; if that does not > work, > it does not do the combination. (This is not completely true, but in essence > that is how it works, and it has to to not have exponential complexity). > > It would be good to define a canonical form for anything vec_duplicate. It > probably is a good idea to pull the vec_duplicate as far outside as possible? > > Canonical forms hugely reduce the amount of work needed. Compare to how > "andc" > is represented (canonically with the inverted input first), or how "nand" is > (we > write that as an "orcc", an "or" with both inputs inverted, in canonical > RTL). > Because only one form is allowed, we only have to check for that one form > everywhere. > > Confirmed. Even w/ canonical RTL, i think a combine splitter is also needed here, the canonical RTL only helps combine/forwprop to match more possibility but won't split patterns by itselies.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 Segher Boessenkool changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2021-05-21 --- Comment #2 from Segher Boessenkool --- (In reply to Richard Biener from comment #1) > I suppose we're confused about the vec_duplicate. Would generally swapping > the duplicate and the bit_not be profitable? Eventually it's a > simplification > combine could try - I belive it has some cases where it tries variants of the > original instructions when combining. Adding a combine helper pattern > looks like putting too much burden on the backend IMHO. > > We don't have a generic nand optab so handling this in ISEL on gimple > isn't straight-forward. > > But combine and/or forwprop could do this. Combine never tries anything. Combine makes *one* result; if that does not work, it does not do the combination. (This is not completely true, but in essence that is how it works, and it has to to not have exponential complexity). It would be good to define a canonical form for anything vec_duplicate. It probably is a good idea to pull the vec_duplicate as far outside as possible? Canonical forms hugely reduce the amount of work needed. Compare to how "andc" is represented (canonically with the inverted input first), or how "nand" is (we write that as an "orcc", an "or" with both inputs inverted, in canonical RTL). Because only one form is allowed, we only have to check for that one form everywhere. Confirmed.
[Bug target/100711] Miss optimization for pandn
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100711 Richard Biener changed: What|Removed |Added CC||segher at gcc dot gnu.org --- Comment #1 from Richard Biener --- 7: r82:SI=~r89:SI REG_DEAD r89:SI 8: r88:V4SI=vec_duplicate(r82:SI) REG_DEAD r82:SI 9: r87:V4SI=r88:V4SI:V4SI REG_DEAD r90:V4SI REG_DEAD r88:V4SI I suppose we're confused about the vec_duplicate. Would generally swapping the duplicate and the bit_not be profitable? Eventually it's a simplification combine could try - I belive it has some cases where it tries variants of the original instructions when combining. Adding a combine helper pattern looks like putting too much burden on the backend IMHO. We don't have a generic nand optab so handling this in ISEL on gimple isn't straight-forward. But combine and/or forwprop could do this.