[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #12 from Hongtao.liu  ---

> That's pretty good, but  VMOVD eax, xmm0  would be more efficient than 
> VPEXTRW when we don't need to avoid high garbage (because it's a return
> value in this case). 
And TARGET_AVX512FP16 has vmovw.

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #11 from Peter Cordes  ---
Also, horizontal byte sums are generally best done with  VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
 vpxor    xmm0, xmm0, .LC0[rip]   # set1_epi8(0x80): flip to unsigned 0..255 range
 vpxor    xmm1, xmm1, xmm1
 vpsadbw  xmm0, xmm0, xmm1        # horizontal byte sum within each 64-bit half
 vmovd    eax, xmm0               # we only wanted the low half anyway
 sub      eax, 8 * 128            # subtract the bias we added earlier by flipping sign bits
 ret

This is so much shorter we'd still be ahead if we generated the vector constant
on the fly instead of loading it.  (3 instructions: vpcmpeqd same,same / vpabsb
/ vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128)
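
For reference, the same bias trick in SSE2 intrinsics (a minimal sketch; the
helper name and structure are mine, not taken from this report or from SIMDe):

#include <emmintrin.h>

/* Sketch: horizontal sum of the 8 signed bytes in the low half of an SSE
   register via PSADBW, using the flip-to-unsigned bias trick above.  */
static int
hsum_low8_i8 (__m128i v)
{
  __m128i biased = _mm_xor_si128 (v, _mm_set1_epi8 ((char) 0x80)); /* now 0..255 */
  __m128i sums = _mm_sad_epu8 (biased, _mm_setzero_si128 ());      /* per-64-bit byte sums */
  return _mm_cvtsi128_si32 (sums) - 8 * 128;                       /* undo the 8*128 bias */
}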

If we had wanted a 128-bit (16 byte) vector sum, we'd need

  ...
  vpsadbw ...

  vpshufd  xmm1, xmm0, 0xfe # shuffle upper 64 bits to the bottom
  vpaddd   xmm0, xmm0, xmm1
  vmovd    eax, xmm0
  sub  eax, 16 * 128

Works efficiently with only SSE2.  Actually with AVX2, we should unpack the top
half with VUNPCKHQDQ to save a byte (no immediate operand), since we don't need
PSHUFD copy-and-shuffle.

Or movd / pextrw / scalar add, but that's more uops: pextrw is 2 uops on its own.
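
A corresponding SSE2-intrinsics sketch of the 16-byte variant (a hypothetical
helper mirroring the pshufd/paddd tail above):

#include <emmintrin.h>

/* Sketch: horizontal sum of all 16 signed bytes via PSADBW, then add the
   two 64-bit halves together and undo the 16*128 bias.  */
static int
hsum16_i8 (__m128i v)
{
  __m128i biased = _mm_xor_si128 (v, _mm_set1_epi8 ((char) 0x80));
  __m128i sums = _mm_sad_epu8 (biased, _mm_setzero_si128 ());
  __m128i hi = _mm_shuffle_epi32 (sums, 0xfe);   /* upper 64-bit sum to the bottom */
  __m128i total = _mm_add_epi32 (sums, hi);
  return _mm_cvtsi128_si32 (total) - 16 * 128;
}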

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #10 from Peter Cordes  ---
Current trunk with -fopenmp is still not good: https://godbolt.org/z/b3jjhcvTa
It's still doing two separate sign extensions and two stores plus a wider reload
(store-forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
push    rbp
vpmovsxbw   xmm2, xmm0
vpsrlq  xmm0, xmm0, 32
mov rbp, rsp
vpmovsxbw   xmm3, xmm0
and rsp, -32
vmovq   QWORD PTR [rsp-16], xmm2
vmovq   QWORD PTR [rsp-8], xmm3
vmovdqa xmm4, XMMWORD PTR [rsp-16]
   ... then asm using byte-shifts

Including stuff like
   movdqa  xmm1, xmm0
   psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.

And ARM64 goes scalar.
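
For context, the function being compiled is roughly the following plain-C
fallback (a sketch only, assuming the SIMDe portable path with an OpenMP simd
reduction; the real SIMDe code uses its own vector types and macros):

#include <stdint.h>

typedef struct { int8_t values[8]; } simde_int8x8_sketch_t;  /* stand-in for SIMDe's type */

int16_t
simde_vaddlv_s8_sketch (simde_int8x8_sketch_t a)
{
  int16_t r = 0;
  #pragma omp simd reduction(+:r)
  for (int i = 0; i < 8; i++)
    r += a.values[i];
  return r;
}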



Current trunk *without* -fopenmp produces decent asm
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3)
simde_vaddlv_s8:
sxtl    v0.8h, v0.8b
addv    h0, v0.8h
umov    w0, v0.h[0]
ret

x86-64 gcc  -O3 -march=skylake
simde_vaddlv_s8:
vpmovsxbw   xmm1, xmm0
vpsrlq  xmm0, xmm0, 32
vpmovsxbw   xmm0, xmm0
vpaddw  xmm0, xmm1, xmm0
vpsrlq  xmm1, xmm0, 32
vpaddw  xmm0, xmm0, xmm1
vpsrlq  xmm1, xmm0, 16
vpaddw  xmm0, xmm0, xmm1
vpextrw eax, xmm0, 0
ret


That's pretty good, but  VMOVD eax, xmm0  would be more efficient than  VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.
  ...
  vmovd  eax, xmm0
  rorx   edx, eax, 16
  add    eax, edx

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #9 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486

commit r12-4241-g77ca2cfcdcccee3c8e8aeaf1d03e9920893d2486
Author: liuhongt 
Date:   Tue Sep 28 12:55:10 2021 +0800

Support reduc_{plus,smax,smin,umax,umin}_scal_v4hi.

gcc/ChangeLog:

PR target/102494
* config/i386/i386-expand.c (emit_reduc_half): Handle V4HImode.
* config/i386/mmx.md (reduc_plus_scal_v4hi): New.
(reduc_<code>_scal_v4hi): New.

gcc/testsuite/ChangeLog:

* gcc.target/i386/mmx-reduce-op-1.c: New test.
* gcc.target/i386/mmx-reduce-op-2.c: New test.
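
A minimal example of the kind of reduction the new v4hi patterns target (a
sketch, not the actual test added by the commit):

typedef short v4hi __attribute__((vector_size (8)));

short
reduce_plus_v4hi (v4hi p)
{
  short sum = 0;
  for (int i = 0; i != 4; i++)
    sum += p[i];
  return sum;
}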

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #8 from rguenther at suse dot de  ---
On Tue, 28 Sep 2021, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
> 
> --- Comment #7 from Hongtao.liu  ---
> After supporting the v4hi reduction, the gimple for converting v8qi to v8hi
> seems suboptimal.
> 
>  6  vector(4) short int vect__21.36;
>  7  vector(4) unsigned short vect__2.31;
>  8  int16_t stmp_r_17.17;
>  9  vector(8) short int vect__16.15;
> 10  int16_t D.2229[8];
> 11  vector(8) short int _50;
> 12  vector(8) short int _51;
> 13  vector(8) short int _52;
> 14  vector(8) short int _53;
> 15  vector(8) short int _54;
> 16  vector(8) short int _55;
> 
> 18   [local count: 189214783]:
> 19  vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
> 20  vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
> 21  vect__21.36_105 = VIEW_CONVERT_EXPR(vect__2.31_97);
> 22  vect__21.36_106 = VIEW_CONVERT_EXPR(vect__2.31_98);
> 23  MEM  [(short int *)] = vect__21.36_105;
> 24  MEM  [(short int *) + 8B] = vect__21.36_106;

so the above could possibly use a V8QI -> V8HI conversion; the loop
vectorizer isn't good at producing those, though.  And of course the
appropriate conversion optab has to exist.

> 25  vect__16.15_47 = MEM  [(short int *)];

Here there's a lack of "CSE" - I do have patches somewhere to turn this into

  vect__16.15_47 = { vect__21.36_105, vect__21.36_106 };

but I'm not sure that's going to be profitable (well, the code as-is
will get a STLF hit).

There's also store-merging that could instead merge the stores
similarly (but then there's no CSE after store-merging so the load
would remain).

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-28 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #7 from Hongtao.liu  ---
After supporting the v4hi reduction, the gimple for converting v8qi to v8hi seems suboptimal.

 6  vector(4) short int vect__21.36;
 7  vector(4) unsigned short vect__2.31;
 8  int16_t stmp_r_17.17;
 9  vector(8) short int vect__16.15;
10  int16_t D.2229[8];
11  vector(8) short int _50;
12  vector(8) short int _51;
13  vector(8) short int _52;
14  vector(8) short int _53;
15  vector(8) short int _54;
16  vector(8) short int _55;

18   [local count: 189214783]:
19  vect__2.31_97 = [vec_unpack_lo_expr] a_90(D);
20  vect__2.31_98 = [vec_unpack_hi_expr] a_90(D);
21  vect__21.36_105 = VIEW_CONVERT_EXPR(vect__2.31_97);
22  vect__21.36_106 = VIEW_CONVERT_EXPR(vect__2.31_98);
23  MEM  [(short int *)] = vect__21.36_105;
24  MEM  [(short int *) + 8B] = vect__21.36_106;
25  vect__16.15_47 = MEM  [(short int *)];

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #6 from Richard Biener  ---
The vectorizer looks for a way to "shift" the whole vector by either vec_shr
or a corresponding vec_perm with constant shuffle operands.  When the target
provides none of those you get element extracts and scalar adds.

So yes, the vectorizer does the work for you but only if you hand it the
pieces.

It could possibly use a larger vector, doing only the "tail" of its final
reduction, so try with v8hi instead of v4hi, but it's not really clear whether
such a strategy would be good in general.
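
For illustration, the "shift the whole vector" reduction the vectorizer wants
to emit corresponds roughly to the following, written with GCC vector
extensions (a sketch; __builtin_shufflevector needs GCC 12 or Clang, and this
mirrors the VEC_PERM_EXPR sequence shown in comment #3):

typedef short v8hi __attribute__((vector_size (16)));

/* Each step shifts the live elements down by half and adds, so log2(8) = 3
   shuffle+add steps reduce the whole vector into element 0.  */
short
reduce_v8hi_sketch (v8hi v)
{
  v8hi z = { 0 };
  v += __builtin_shufflevector (v, z, 4, 5, 6, 7, 8, 9, 10, 11);
  v += __builtin_shufflevector (v, z, 2, 3, 4, 5, 6, 7, 8, 9);
  v += __builtin_shufflevector (v, z, 1, 2, 3, 4, 5, 6, 7, 8);
  return v[0];
}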

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #5 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #4)
> > 
> > But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> > v4hi.
> 
> We need to add a (define_expand "reduc_plus_scal_v4hi") just like the existing
> (define_expand "reduc_plus_scal_v8qi") in mmx.md.

Also for reduc_{umax,umin,smax,smin}_scal_v4hi

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #4 from Hongtao.liu  ---

> 
> But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for
> v4hi.

We need to add a (define_expand "reduc_plus_scal_v4hi") just like the existing
(define_expand "reduc_plus_scal_v8qi") in mmx.md.

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-09-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #2)
> It seems x86 doesn't support optab reduc_plus_scal_v8hi yet.
The vectorizer does the work for the backend:

typedef short v8hi __attribute__((vector_size(16)));
short
foo1 (v8hi p, int n)
{
  short sum = 0;
  for (int i = 0; i != 8; i++)
    sum += p[i];
  return sum;
}

  # sum_21 = PHI 
  # vect_sum_9.26_5 = PHI 
  _22 = (vector(8) unsigned short) vect_sum_9.26_5;
  _23 = VEC_PERM_EXPR <_22, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
  _24 = _23 + _22;
  _25 = VEC_PERM_EXPR <_24, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
  _26 = _25 + _24;
  _27 = VEC_PERM_EXPR <_26, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
  _28 = _27 + _26;
  stmp_sum_9.27_29 = BIT_FIELD_REF <_28, 16, 0>;


But for the case in PR, it's v8qi -> 2 v4hi, and no vector reduction for v4hi.
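
For reference, the v8qi case from the PR is essentially the widening byte sum
below (a sketch mirroring foo1 above, not the exact testcase from the PR):

typedef signed char v8qi __attribute__((vector_size (8)));

short
foo2 (v8qi p)
{
  short sum = 0;
  for (int i = 0; i != 8; i++)
    sum += p[i];   /* each signed byte is widened to short before accumulating */
  return sum;
}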