[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #12 from Hongtao.liu ---
> That's pretty good, but VMOVD eax, xmm0 would be more efficient than
> VPEXTRW when we don't need to avoid high garbage (because it's a return
> value in this case).

And TARGET_AVX512FP16 has vmovw.
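For illustration, the two extraction choices look like this from C intrinsics
(a sketch; the function names are mine, not from the bug):

    #include <emmintrin.h>
    #include <stdint.h>

    static int16_t extract_low16_movd(__m128i v)
    {
        /* vmovd eax, xmm0: 1 uop.  The caller truncates to 16 bits,
           so the high garbage in eax is harmless here. */
        return (int16_t)_mm_cvtsi128_si32(v);
    }

    static int16_t extract_low16_pextrw(__m128i v)
    {
        /* vpextrw eax, xmm0, 0: zero-extends, but costs 2 uops. */
        return (int16_t)_mm_extract_epi16(v, 0);
    }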
--- Comment #11 from Peter Cordes ---
Also, horizontal byte sums are generally best done with VPSADBW against a
zero vector, even if that means some fiddling to flip to unsigned first and
then undo the bias.

simde_vaddlv_s8:
        vpxor   xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80) flip to unsigned 0..255 range
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
        vmovd   eax, xmm0              # we only wanted the low half anyway
        sub     eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
        ret

This is so much shorter we'd still be ahead if we generated the vector
constant on the fly instead of loading it.  (3 instructions: vpcmpeqd
same,same / vpabsb / vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb
same,same to saturate to -128.)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

        ...
        vpsadbw ...
        vpshufd xmm1, xmm0, 0xfe       # shuffle upper 64 bits to the bottom
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        sub     eax, 16 * 128

This works efficiently with only SSE2.  Actually with AVX2, we should
unpack the top half with VUNPCKHQDQ to save a byte (no immediate operand),
since we don't need PSHUFD's copy-and-shuffle.  Or movd / pextrw / scalar
add, but that's more uops: pextrw is 2 uops on its own.
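A minimal SSE2 sketch of the same trick in C intrinsics, assuming the v8qi
input arrives in the low 8 bytes of an XMM register (function name is mine,
not from the bug):

    #include <emmintrin.h>
    #include <stdint.h>

    static int16_t addlv_s8_sse2(__m128i v)  /* low 8 bytes hold the v8qi */
    {
        __m128i k80  = _mm_set1_epi8((char)0x80);
        __m128i u    = _mm_xor_si128(v, k80);      /* flip to unsigned 0..255 */
        /* horizontal byte sum within each 64-bit half */
        __m128i sums = _mm_sad_epu8(u, _mm_setzero_si128());
        int lo = _mm_cvtsi128_si32(sums);          /* only the low half matters */
        return (int16_t)(lo - 8 * 128);            /* undo the bias */
    }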
Peter Cordes changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #10 from Peter Cordes ---
Current trunk with -fopenmp is still not good: https://godbolt.org/z/b3jjhcvTa
It is still doing two separate sign extensions, plus two stores and one
wider reload (a store-forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
        push    rbp
        vpmovsxbw xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        ... then asm using byte-shifts

including stuff like

        movdqa  xmm1, xmm0
        psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.
And ARM64 goes scalar.

Current trunk *without* -fopenmp produces decent asm:
https://godbolt.org/z/h1KEKPTW9
For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3):

simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc -O3 -march=skylake:

simde_vaddlv_s8:
        vpmovsxbw xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret

That's pretty good, but VMOVD eax, xmm0 would be more efficient than
VPEXTRW when we don't need to avoid high garbage (because it's a return
value in this case).  VPEXTRW zero-extends into RAX, so it's not directly
helpful if we need to sign-extend to 32 or 64-bit for some reason; we'd
still need a scalar movsx.  Or with BMI2, go scalar before the last
shift / VPADDW step, e.g.

        ...
        vmovd   eax, xmm0
        rorx    edx, eax, 16
        add     eax, edx
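That BMI2 tail, sketched with intrinsics (compile with -mbmi2; the function
name, and the assumption that the low 32 bits already hold two 16-bit
partial sums, are mine):

    #include <immintrin.h>
    #include <stdint.h>

    static int16_t scalar_tail_bmi2(__m128i v)  /* low 32 bits = two 16-bit partial sums */
    {
        uint32_t x  = (uint32_t)_mm_cvtsi128_si32(v);  /* vmovd eax, xmm0 */
        uint32_t hi = _rorx_u32(x, 16);                /* rorx edx, eax, 16 */
        return (int16_t)(x + hi);                      /* add eax, edx */
    }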
--- Comment #9 from CVS Commits ---
The master branch has been updated by Hongtao Liu:

https://gcc.gnu.org/g:77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486

commit r12-4241-g77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486
Author: liuhongt
Date:   Tue Sep 28 12:55:10 2021 +0800

    Support reduc_{plus,smax,smin,umax,umin}_scal_v4hi.

    gcc/ChangeLog:

            PR target/102494
            * config/i386/i386-expand.c (emit_reduc_half): Handle V4HImode.
            * config/i386/mmx.md (reduc_plus_scal_v4hi): New.
            (reduc_<code>_scal_v4hi): New.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/mmx-reduce-op-1.c: New test.
            * gcc.target/i386/mmx-reduce-op-2.c: New test.
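A reduction in the spirit of what the new tests exercise (my sketch, not the
committed mmx-reduce-op-1.c):

    #include <stdint.h>

    typedef int16_t v4hi __attribute__((vector_size(8)));

    /* With the new optab this should lower through reduc_plus_scal_v4hi
       instead of four element extracts and scalar adds. */
    int16_t reduce_plus_v4hi (v4hi a)
    {
        int16_t sum = 0;
        for (int i = 0; i != 4; i++)
          sum += a[i];
        return sum;
    }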
--- Comment #8 from rguenther at suse dot de ---
On Tue, 28 Sep 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
>
> --- Comment #7 from Hongtao.liu ---
> After supporting v4hi reduce, the gimple seems suboptimal when converting
> v8qi to v8hi.
>
>   vector(4) short int vect__21.36;
>   vector(4) unsigned short vect__2.31;
>   int16_t stmp_r_17.17;
>   vector(8) short int vect__16.15;
>   int16_t D.2229[8];
>   vector(8) short int _50;
>   vector(8) short int _51;
>   vector(8) short int _52;
>   vector(8) short int _53;
>   vector(8) short int _54;
>   vector(8) short int _55;
>
>   [local count: 189214783]:
>   vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
>   vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
>   vect__21.36_105 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_97);
>   vect__21.36_106 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_98);
>   MEM <vector(4) short int> [(short int *)&D.2229] = vect__21.36_105;
>   MEM <vector(4) short int> [(short int *)&D.2229 + 8B] = vect__21.36_106;

so the above could possibly use a V8QI -> V8HI conversion; the loop
vectorizer isn't good at producing those, though.  And of course the
appropriate conversion optab has to exist.

>   vect__16.15_47 = MEM <vector(8) short int> [(short int *)&D.2229];

Here's a lack of "CSE" - I do have patches somewhere to turn this into

  vect__16.15_47 = { vect__21.36_105, vect__21.36_106 };

but I'm not sure that's going to be profitable (well, the code as-is will
get a STLF hit).  There's also store-merging that could instead merge the
stores similarly (but then there's no CSE after store-merging, so the load
would remain).
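The constructor above corresponds to building the 128-bit vector directly
from the two halves instead of bouncing them through the stack; a
hypothetical intrinsics rendering (names are mine):

    #include <emmintrin.h>

    static __m128i concat_v4hi_halves(__m128i lo, __m128i hi)
    {
        /* vect__16.15_47 = { vect__21.36_105, vect__21.36_106 };
           with no store/reload round trip, so no STLF stall. */
        return _mm_unpacklo_epi64(lo, hi);
    }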
--- Comment #7 from Hongtao.liu ---
After supporting v4hi reduce, the gimple seems suboptimal when converting
v8qi to v8hi.

  vector(4) short int vect__21.36;
  vector(4) unsigned short vect__2.31;
  int16_t stmp_r_17.17;
  vector(8) short int vect__16.15;
  int16_t D.2229[8];
  vector(8) short int _50;
  vector(8) short int _51;
  vector(8) short int _52;
  vector(8) short int _53;
  vector(8) short int _54;
  vector(8) short int _55;

  [local count: 189214783]:
  vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
  vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
  vect__21.36_105 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_97);
  vect__21.36_106 = VIEW_CONVERT_EXPR<vector(4) short int>(vect__2.31_98);
  MEM <vector(4) short int> [(short int *)&D.2229] = vect__21.36_105;
  MEM <vector(4) short int> [(short int *)&D.2229 + 8B] = vect__21.36_106;
  vect__16.15_47 = MEM <vector(8) short int> [(short int *)&D.2229];
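For reference, the one-step conversion the dump is missing can be written
directly with SSE4.1 intrinsics (a sketch, assuming the v8qi sits in the
low 8 bytes of an XMM register; function name is mine):

    #include <smmintrin.h>

    static __m128i widen_v8qi_to_v8hi(__m128i a)  /* low 8 bytes hold the v8qi */
    {
        /* one vpmovsxbw: sign-extend 8 bytes to 8 shorts, no memory round trip */
        return _mm_cvtepi8_epi16(a);
    }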
--- Comment #6 from Richard Biener ---
The vectorizer looks for a way to "shift" the whole vector by either
vec_shr or a corresponding vec_perm with constant shuffle operands.  When
the target provides none of those, you get element extracts and scalar
adds.  So yes, the vectorizer does the work for you, but only if you hand
it the pieces.  It could possibly use a larger vector, doing only the
"tail" of its final reduction - so try with v8hi instead of v4hi - but
it's not really clear whether such a strategy would be good in general.
A sketch of the shift-and-add reduction follows.
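The shift-and-add strategy in GCC generic-vector C (a sketch only; requires
__builtin_shufflevector, i.e. GCC 12 or later, and the names are mine):

    #include <stdint.h>

    typedef int16_t v4hi __attribute__((vector_size(8)));

    static int16_t reduc_plus_v4hi_sketch(v4hi v)
    {
        /* "shift" the high half down and add: lanes become { a+c, b+d, ... } */
        v4hi t = __builtin_shufflevector(v, v, 2, 3, 2, 3);
        v += t;
        /* "shift" by one lane and add: lane 0 now holds a+b+c+d */
        t = __builtin_shufflevector(v, v, 1, 1, 1, 1);
        v += t;
        return v[0];  /* BIT_FIELD_REF of the low element */
    }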
--- Comment #5 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #4)
> > But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> > v4hi.
>
> We need to add (define_expand "reduc_plus_scal_v4hi") just like
> (define_expand "reduc_plus_scal_v8qi") in mmx.md.

Also for reduc_{umax,umin,smax,smin}_scal_v4hi.
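A max-reduction testcase in the same spirit (my sketch, assuming the same
v4hi typedef as above):

    #include <stdint.h>

    typedef int16_t v4hi __attribute__((vector_size(8)));

    /* Would lower through reduc_smax_scal_v4hi once that expander exists. */
    int16_t reduce_smax_v4hi (v4hi a)
    {
        int16_t m = a[0];
        for (int i = 1; i != 4; i++)
          m = a[i] > m ? a[i] : m;
        return m;
    }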
--- Comment #4 from Hongtao.liu ---
> But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> v4hi.

We need to add (define_expand "reduc_plus_scal_v4hi") just like
(define_expand "reduc_plus_scal_v8qi") in mmx.md.
--- Comment #3 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #2)
> It seems x86 doesn't support optab reduc_plus_scal_v8hi yet.

The vectorizer does the work for the backend:

typedef short v8hi __attribute__((vector_size(16)));
short foo1 (v8hi p, int n)
{
    short sum = 0;
    for (int i = 0; i != 8; i++)
      sum += p[i];
    return sum;
}

  # sum_21 = PHI
  # vect_sum_9.26_5 = PHI
  _22 = (vector(8) unsigned short) vect_sum_9.26_5;
  _23 = VEC_PERM_EXPR <_22, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  _24 = _23 + _22;
  _25 = VEC_PERM_EXPR <_24, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  _26 = _25 + _24;
  _27 = VEC_PERM_EXPR <_26, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  _28 = _27 + _26;
  stmp_sum_9.27_29 = BIT_FIELD_REF <_28, 16, 0>;

But for the case in the PR, it's v8qi -> 2 x v4hi, and there is no vector
reduction for v4hi.