[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 rsandifo at gcc dot gnu.org changed: What|Removed |Added CC||rsandifo at gcc dot

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #17 from Alexander Monakov --- To me this suggests that in fact it's okay to carry the combined form in RTL up to register allocation, but RA should decompose it to load+fma instead of inserting a register copy that preserves the

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #16 from Alexander Monakov --- Mostly because prior to register allocation the compiler does not naturally see that x = *mem + a*b will need an extra mov when both 'a' and 'b' are live (as in that case registers allocated for them
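
The `x = *mem + a*b` shape from comment #16 can be sketched in C as follows. This is only an illustration, not the test case from the bug report: the function name `mem_plus_ab_both_live` is hypothetical, and the instruction sequences in the comments are hand-written examples of the two forms under discussion, not actual GCC output. FMA3 instructions overwrite one of their register operands, so when the addend comes from memory and both multiplicands are still needed afterwards, one of them has to be copied first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper illustrating x = *mem + a*b where both a and b
 * stay live after the multiply-add.
 *
 * Illustrative instruction choices (Intel syntax, hand-written, not
 * compiler output):
 *
 *   load + fma (no copy needed):
 *       vmovsd      t, [mem]        ; t = *mem
 *       vfmadd231sd t, a, b         ; t = a*b + t
 *
 *   fma with memory operand (needs a copy because a and b are live):
 *       vmovapd     t, a            ; preserve a
 *       vfmadd213sd t, b, [mem]     ; t = b*t + *mem
 */
double mem_plus_ab_both_live(const double *mem, double a, double b)
{
    assert(mem != NULL);
    double x = *mem + a * b;   /* candidate for an FMA3 instruction */
    return x + a + b;          /* a and b are used again, so both are live */
}
```

Calling `mem_plus_ab_both_live(&m, 2.0, 4.0)` with `m = 1.0` computes `1 + 2*4`, then adds `a` and `b` back in; the values are chosen to be exactly representable so the result does not depend on whether the compiler fuses the multiply and add.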

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #15 from Michael_S --- (In reply to Hongtao.liu from comment #14) > > Still I don't understand why compiler does not compare the cost of full loop > > body after combining to the cost before combining and does not come to > >

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #14 from Hongtao.liu --- > Still I don't understand why compiler does not compare the cost of full loop > body after combining to the cost before combining and does not come to > conclusion that combining increased the cost. As

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #13 from Michael_S --- (In reply to Hongtao.liu from comment #11) > (In reply to Michael_S from comment #10) > > (In reply to Hongtao.liu from comment #9) > > > (In reply to Michael_S from comment #8) > > > > What are values of gcc

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #12 from Hongtao.liu ---
Correct AVX256 load cost outside of register allocation and vectorizer
> they are
> 1. AVX256 Load 16
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove --- 2
> 4. FMA3 mem,ymm,ymm --- 32
That's why
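
As a rough worked example of the comparison Michael_S asks about, the sketch below adds up the four figures quoted in comment #12. The constant and function names are hypothetical, and summing per-instruction costs this way is an assumption made for illustration; it is not a description of how GCC's combine pass or register allocator actually evaluates costs. Note also that comment #12 appears to be correcting the load figure, so the numbers are used here only as quoted.

```c
#include <assert.h>

/* Cost figures as quoted in comment #12; illustrative only. */
enum {
    COST_AVX256_LOAD    = 16,  /* AVX256 Load         */
    COST_FMA3_REG       = 16,  /* FMA3 ymm,ymm,ymm    */
    COST_AVX256_REGMOVE = 2,   /* AVX256 Regmove      */
    COST_FMA3_MEM       = 32   /* FMA3 mem,ymm,ymm    */
};

/* Separate load followed by a register-only FMA. */
int cost_load_plus_fma(void)
{
    return COST_AVX256_LOAD + COST_FMA3_REG;
}

/* FMA with a memory operand, plus a register copy when both
 * multiplicands are still live (needs_copy is 0 or 1). */
int cost_fma_mem(int needs_copy)
{
    assert(needs_copy == 0 || needs_copy == 1);
    return COST_FMA3_MEM + (needs_copy ? COST_AVX256_REGMOVE : 0);
}
```

Under these assumptions the two forms tie at 32 when no copy is needed, and the combined form comes out at 34 once the register copy is counted. That is consistent with the point in comment #16 that the extra mov is not visible before register allocation.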

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #11 from Hongtao.liu --- (In reply to Michael_S from comment #10) > (In reply to Hongtao.liu from comment #9) > > (In reply to Michael_S from comment #8) > > > What are values of gcc "loop" cost of the relevant instructions now? > >

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #10 from Michael_S ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3.

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-23 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #9 from Hongtao.liu ---
(In reply to Michael_S from comment #8)
> What are values of gcc "loop" cost of the relevant instructions now?
> 1. AVX256 Load
> 2. FMA3 ymm,ymm,ymm
> 3. AVX256 Regmove
> 4. FMA3 mem,ymm,ymm
For skylake,

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-23 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #8 from Michael_S ---
What are values of gcc "loop" cost of the relevant instructions now?
1. AVX256 Load
2. FMA3 ymm,ymm,ymm
3. AVX256 Regmove
4. FMA3 mem,ymm,ymm

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #7 from Hongtao.liu --- (In reply to Michael_S from comment #6) > Why do you see it as addition of peephole pattern? > I see it as removal. Like, "do what's written in the source and don't try to > be tricky". > Probably, I am too

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-22 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #6 from Michael_S --- Why do you see it as addition of peephole pattern? I see it as removal. Like, "do what's written in the source and don't try to be tricky". Probably, I am too removed from how compilers work :( Or, may be,

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #5 from Hongtao.liu --- (In reply to Michael_S from comment #3) > (In reply to Alexander Monakov from comment #2) > > Richard, though register moves are resolved by renaming, they still occupy a > > uop in all stages except

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #4 from Alexander Monakov --- > More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs > 4 entries by de-transformed sequence. I don't think this is true for the test at hand? With base+offset memory operand

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread already5chosen at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #3 from Michael_S --- (In reply to Alexander Monakov from comment #2) > Richard, though register moves are resolved by renaming, they still occupy a > uop in all stages except execution, and since renaming is one of the > narrowest

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org ---

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
Richard Biener changed:
  What   | Removed     | Added
  Target | i386,x86-64 | x86_64-*-* i?86-*-*