[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #13 from Hongtao.liu ---
;; Function fn (fn, funcdef_no=5484, decl_uid=32317, cgraph_uid=5485, symbol_order=5484)

int fn (const int * px, const int * py, const int * pz, const int * pw,
        const int * pa, const int * pb, const int * pc, const int * pd)
{
  vector(16) short unsigned int _3;
  vector(16) short unsigned int _5;
  vector(16) short int _7;
  vector(16) short int _9;
  vector(32) char _12;
  vector(32) unsigned char _14;
  vector(16) short unsigned int _16;
  vector(16) short unsigned int _17;
  vector(16) short int _18;
  vector(16) short int _19;
  vector(32) char _20;
  vector(32) unsigned char _21;
  vector(16) short unsigned int _22;
  vector(16) short unsigned int _23;
  vector(16) short int _24;
  vector(16) short int _25;
  vector(32) char _26;
  vector(32) unsigned char _27;
  vector(16) short unsigned int _28;
  vector(16) short unsigned int _29;
  vector(16) short int _30;
  vector(16) short int _31;
  int _32;
  vector(4) int _33;
  vector(8) int _34;
  vector(32) unsigned char _35;
  vector(32) char _36;
  vector(16) short unsigned int _37;
  vector(16) short unsigned int _38;
  vector(16) short unsigned int _39;
  vector(16) short unsigned int _40;
  vector(16) short unsigned int _41;
  vector(16) short unsigned int _42;
  vector(16) short unsigned int _43;
  vector(16) short unsigned int _44;
  vector(16) short unsigned int _45;
  vector(16) short unsigned int _46;
  vector(16) short unsigned int _47;
  vector(16) short unsigned int _48;
  vector(16) short unsigned int _50;
  vector(16) short unsigned int _51;
  vector(16) short unsigned int _53;
  vector(16) short unsigned int _54;
  vector(16) short unsigned int _56;
  vector(16) short unsigned int _57;
  vector(16) short unsigned int _59;
  vector(16) short unsigned int _60;
  vector(16) short int _62;
  vector(16) short int _63;
  vector(16) short unsigned int _64;
  vector(16) short unsigned int _65;
  vector(32) unsigned char _66;
  vector(32) char _67;
  vector(16) short int _68;
  vector(16) short int _69;
  vector(16) short unsigned int _70;
  vector(16) short unsigned int _71;
  vector(32) unsigned char _72;
  vector(32) char _73;
  vector(16) short int _74;
  vector(16) short int _75;
  vector(16) short unsigned int _76;
  vector(16) short unsigned int _77;
  vector(32) unsigned char _78;
  vector(32) char _79;
  vector(16) short int _80;
  vector(16) short int _81;
  vector(16) short unsigned int _82;
  vector(16) short unsigned int _83;
  vector(32) unsigned char _84;
  vector(32) char _85;
  vector(16) short int _86;
  vector(16) short int _87;
  vector(16) short unsigned int _88;
  vector(16) short unsigned int _89;
  vector(32) unsigned char _90;
  vector(32) char _91;
  vector(4) long long int _92;
  vector(4) long long int _93;
  vector(4) long long int _94;
  vector(4) long long int _95;
  vector(4) long long int _96;
  vector(4) long long int _97;
  vector(4) long long int _98;
  vector(4) long long int _99;
  vector(4) long long int _100;
  vector(4) long long int _101;
  vector(16) short unsigned int _107;
  vector(16) short unsigned int _108;
  vector(16) short unsigned int _109;
  vector(16) short unsigned int _110;
  vector(16) short unsigned int _111;

  [local count: 1073741824]:
  _101 = MEM[(const __m256i_u * {ref-all})px_2(D)];
  _100 = MEM[(const __m256i_u * {ref-all})py_4(D)];
  _99 = MEM[(const __m256i_u * {ref-all})pz_6(D)];
  _98 = MEM[(const __m256i_u * {ref-all})pw_8(D)];
  _97 = MEM[(const __m256i_u * {ref-all})pa_10(D)];
  _96 = MEM[(const __m256i_u * {ref-all})pb_11(D)];
  _95 = MEM[(const __m256i_u * {ref-all})pc_13(D)];
  _94 = MEM[(const __m256i_u * {ref-all})pd_15(D)];
  _93 = MEM[(const __m256i_u * {ref-all})pc_13(D) + 32B];
  _92 = MEM[(const __m256i_u * {ref-all})pd_15(D) + 32B];
  _86 = VIEW_CONVERT_EXPR(_96);
  _87 = VIEW_CONVERT_EXPR(_101);
  _88 = (vector(16) short unsigned int) _87;
  _89 = (vector(16) short unsigned int) _86;
  _90 = VEC_PACK_SAT_EXPR <_88, _89>;
  _91 = (vector(32) char) _90;
  _80 = VIEW_CONVERT_EXPR(_95);
  _81 = VIEW_CONVERT_EXPR(_100);
  _82 = (vector(16) short unsigned int) _81;
  _83 = (vector(16) short unsigned int) _80;
  _84 = VEC_PACK_SAT_EXPR <_82, _83>;
  _85 = (vector(32) char) _84;
  _74 = VIEW_CONVERT_EXPR(_94);
  _75 = VIEW_CONVERT_EXPR(_99);
  _76 = (vector(16) short unsigned int) _75;
  _77 = (vector(16) short unsigned int) _74;
  _78 = VEC_PACK_SAT_EXPR <_76, _77>;
  _79 = (vector(32) char) _78;
  _68 = VIEW_CONVERT_EXPR(_93);
  _69 = VIEW_CONVERT_EXPR(_98);
  _70 = (vector(16) short unsigned int) _69;
  _71 = (vector(16) short unsigned int) _68;
  _72 = VEC_PACK_SAT_EXPR <_70, _71>;
  _73 = (vector(32) char) _72;
  _62 = VIEW_CONVERT_EXPR(_92);
  _63 = VIEW_CONVERT_EXPR(_97);
  _64 = (vector(16) short unsigned int) _63;
  _65 = (vector(16) short unsigned int) _62;
  _66 = VEC_PACK_SAT_EXPR <_64, _65>;
  _67 = (vector(32) char) _66;
  _59 = VIEW_CONVERT_EXPR(_91);
  _60 =
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #12 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Andrew Pinski from comment #10)
> > (In reply to Andrew Pinski from comment #9)
> > > (In reply to Hongtao.liu from comment #8)
> > > > Do we have IR for unsigned/signed saturation at the gimple level?
> > >
> > > Not yet. I was just looking for that today because of PR 51492.
> >
> > But there is an RFC out for it:
> > https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html
>
> Oh, VEC_PACK_SAT_EXPR is exactly what I needed for _mm256_packus_epi16,
> thanks for the pointer.

And ‘vec_pack_ssat_m’, ‘vec_pack_usat_m’ for the optab.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #11 from Hongtao.liu ---
(In reply to Andrew Pinski from comment #10)
> (In reply to Andrew Pinski from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > Do we have IR for unsigned/signed saturation at the gimple level?
> >
> > Not yet. I was just looking for that today because of PR 51492.
>
> But there is an RFC out for it:
> https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html

Oh, VEC_PACK_SAT_EXPR is exactly what I needed for _mm256_packus_epi16,
thanks for the pointer.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #10 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > Do we have IR for unsigned/signed saturation at the gimple level?
>
> Not yet. I was just looking for that today because of PR 51492.

But there is an RFC out for it:
https://gcc.gnu.org/pipermail/gcc/2021-May/236015.html
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #9 from Andrew Pinski ---
(In reply to Hongtao.liu from comment #8)
> Do we have IR for unsigned/signed saturation at the gimple level?

Not yet. I was just looking for that today because of PR 51492.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #8 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #7)
> (In reply to Andrew Pinski from comment #5)
> > clang can now produce:
> >         mov     eax, dword ptr [esp + 16]
> >         mov     ecx, dword ptr [esp + 28]
> >         vmovdqu xmm0, xmmword ptr [ecx + 32]
> >         vmovdqu xmm1, xmmword ptr [eax]
> >         vpackuswb       xmm2, xmm1, xmm0
> >         vpsubw  xmm0, xmm1, xmm0
> >         vpaddw  xmm0, xmm0, xmm2
> >         vpackuswb       xmm0, xmm0, xmm0
> >         vpackuswb       xmm0, xmm0, xmm0
> >         vpextrd eax, xmm0, 1
> >         ret
> >
> > I suspect that if the back-end is able to "fold" the builtins into gimple
> > at the gimple level, GCC will do a much better job.
> > Currently we have stuff like:
> > _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
> > _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
> >
> > I think both are really just a BIT_FIELD_REF, and they can even be
> > simplified to a single bitfield extraction rather than what we do now:
> > vpackuswb %ymm1, %ymm0, %ymm0
> > vpextrd $1, %xmm0, %eax
> >
> > Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang
> > is able to remove half of the code due to only needing the 128-bit stuff :).
>
> Yes, let me try this.

Do we have IR for unsigned/signed saturation at the gimple level?
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #7 from Hongtao.liu ---
(In reply to Andrew Pinski from comment #5)
> clang can now produce:
>         mov     eax, dword ptr [esp + 16]
>         mov     ecx, dword ptr [esp + 28]
>         vmovdqu xmm0, xmmword ptr [ecx + 32]
>         vmovdqu xmm1, xmmword ptr [eax]
>         vpackuswb       xmm2, xmm1, xmm0
>         vpsubw  xmm0, xmm1, xmm0
>         vpaddw  xmm0, xmm0, xmm2
>         vpackuswb       xmm0, xmm0, xmm0
>         vpackuswb       xmm0, xmm0, xmm0
>         vpextrd eax, xmm0, 1
>         ret
>
> I suspect that if the back-end is able to "fold" the builtins into gimple
> at the gimple level, GCC will do a much better job.
> Currently we have stuff like:
> _27 = __builtin_ia32_vextractf128_si256 (_28, 0);
> _26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]
>
> I think both are really just a BIT_FIELD_REF, and they can even be
> simplified to a single bitfield extraction rather than what we do now:
> vpackuswb %ymm1, %ymm0, %ymm0
> vpextrd $1, %xmm0, %eax
>
> Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang
> is able to remove half of the code due to only needing the 128-bit stuff :).

Yes, let me try this.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #6 from Petr ---
Yes, the code is not really doing anything useful; I only wrote it to
demonstrate the spills problem. Clang actually outsmarted me by removing half
of the code :)

I think this issue can be closed; I cannot repro this with the newest GCC.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Andrew Pinski changed:

           What            |Removed              |Added
------------------------------------------------------------------
          Component        |rtl-optimization     |target

--- Comment #5 from Andrew Pinski ---
clang can now produce:
        mov     eax, dword ptr [esp + 16]
        mov     ecx, dword ptr [esp + 28]
        vmovdqu xmm0, xmmword ptr [ecx + 32]
        vmovdqu xmm1, xmmword ptr [eax]
        vpackuswb       xmm2, xmm1, xmm0
        vpsubw  xmm0, xmm1, xmm0
        vpaddw  xmm0, xmm0, xmm2
        vpackuswb       xmm0, xmm0, xmm0
        vpackuswb       xmm0, xmm0, xmm0
        vpextrd eax, xmm0, 1
        ret

I suspect that if the back-end is able to "fold" the builtins into gimple at
the gimple level, GCC will do a much better job.
Currently we have stuff like:
_27 = __builtin_ia32_vextractf128_si256 (_28, 0);
_26 = __builtin_ia32_vec_ext_v4si (_27, 1); [tail call]

I think both are really just a BIT_FIELD_REF, and they can even be simplified
to a single bitfield extraction rather than what we do now:
vpackuswb %ymm1, %ymm0, %ymm0
vpextrd $1, %xmm0, %eax

Plus it looks like with __builtin_ia32_vextractf128_si256 (_28, 0), clang is
able to remove half of the code due to only needing the 128-bit stuff :).
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #2 from Petr ---
With '-mtune=intel' the push/pop sequence is gone, but the YMM register
management remains the same: 24 more memory accesses than clang.
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Andrew Pinski changed:

           What            |Removed              |Added
------------------------------------------------------------------
          Component        |c++                  |target

--- Comment #1 from Andrew Pinski ---
Try adding -march=intel or -mtune=intel. The default tuning for GCC is
generic, which is a combination of Intel and AMD tuning. Because AMD tuning
prefers not to use GPRs and SIMD registers at the same time (spilling is
faster there), it tunes for that.