[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-12-02 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #13 from Hongtao.liu  ---
(In reply to Richard Biener from comment #10)
> Hmm, but
> 
> DEF_INTERNAL_INT_FN (POPCOUNT, ECF_CONST | ECF_NOTHROW, popcount, unary)
> 
> so there's clearly a mismatch between either the vectorizer's interpretation
> or the optab.  But as far as I can see this is not a direct internal fn so
> vectorizable_internal_function shouldn't apply and I do not see the x86
> backend handle POPCOUNT in the vectorizable function target hook.
> 
> So w/o a capable compiler I can't trace how the vectorizer vectorizes this
> and thus I have no idea where it goes wrong ...

A capable compiler is now ready.

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-12-02 cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #12 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:81d590760c31e11e3a09135f4e182aea232035f2

commit r11-5693-g81d590760c31e11e3a09135f4e182aea232035f2
Author: Hongyu Wang 
Date:   Wed Nov 11 09:41:13 2020 +0800

Add popcount expander to enable popcount auto vectorization under
AVX512BITALG/AVX512POPCNTDQ target.

gcc/ChangeLog

PR target/97770
* config/i386/sse.md (popcount<mode>2): New expander
for SI/DI vector modes.
(popcount<mode>2): Likewise for QI/HI vector modes.

gcc/testsuite/ChangeLog

PR target/97770
* gcc.target/i386/avx512bitalg-pr97770-1.c: New test.
* gcc.target/i386/avx512vpopcntdq-pr97770-1.c: Likewise.
* gcc.target/i386/avx512vpopcntdq-pr97770-2.c: Likewise.
* gcc.target/i386/avx512vpopcntdqvl-pr97770-1.c: Likewise.

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-12 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #11 from Hongtao.liu  ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/558777.html

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-11 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #10 from Richard Biener  ---
Hmm, but

DEF_INTERNAL_INT_FN (POPCOUNT, ECF_CONST | ECF_NOTHROW, popcount, unary)

so there's clearly a mismatch between either the vectorizer's interpretation
or the optab.  But as far as I can see this is not a direct internal fn so
vectorizable_internal_function shouldn't apply and I do not see the x86
backend handle POPCOUNT in the vectorizable function target hook.

So w/o a capable compiler I can't trace how the vectorizer vectorizes this
and thus I have no idea where it goes wrong ...

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-11 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #9 from Hongtao.liu  ---

> I guess that the vectorized popcount IFN is defined to be VnDI -> VnDI
> but we want to have VnSImode results.  This means the instruction is
> wrongly modeled in vectorized form?
> 

Yes, because we have __builtin_popcount{l,ll} defined as {BT_FN_INT_ULONG,
BT_FN_INT_ULONGLONG},

but in vectorized form GCC requires the modes of src and dest to be the same.

popcountm2:

Store into operand 0 the number of 1-bits in operand 1.
m is either a scalar or vector integer mode. When it is a scalar, operand 1 has
mode m but operand 0 can have whatever scalar integer mode is suitable for the
target. The compiler will insert conversion instructions as necessary
(typically to convert the result to the same width as int). When m is a vector,
both operands must have mode m. This pattern is not allowed to FAIL.
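
The difference between the scalar builtin signature and the vector optab rule
quoted above can be illustrated with a small sketch. The function names
popcount_scalar and popcount_same_mode are hypothetical, and the plain loop
only models the "both operands have mode m" rule; it is not how GCC
implements the vector pattern.

```c
#include <stdint.h>

/* Scalar builtin: the operand is 64 bits wide, but the result type is int. */
static int popcount_scalar(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Vector-style semantics, modeled with a plain loop (an assumption made for
   illustration): every result element has the same width as its source
   element, i.e. DI -> DI rather than DI -> SI. */
static void popcount_same_mode(uint64_t *dest, const uint64_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = (uint64_t) __builtin_popcountll(src[i]);
}
```

With the scalar form the compiler has to insert a conversion whenever the
result is stored into a 64-bit location; with the same-mode form no
conversion is needed.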

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-10 ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #8 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #4)
> What's missing is middle-end folding support to narrow popcount to the
> appropriate internal function call with byte/half-word width when target
> support
> is available.  But I'm quite sure there's no scalar popcount instruction
> operating on half-word or byte pieces of a GPR?

x86 has a popcnt form that operates on a 16-bit register.

https://www.felixcloutier.com/x86/popcnt

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-10 tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #7 from Thomas Koenig  ---
Some literature:

https://arxiv.org/pdf/1611.07612

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-10 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener  ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #4)
> > What's missing is middle-end folding support to narrow popcount to the
> > appropriate internal function call with byte/half-word width when target
> > support
> > is available.  But I'm quite sure there's no scalar popcount instruction
> > operating on half-word or byte pieces of a GPR?
> > 
> > Alternatively the vectorizer can use patterns to do this.
> 
> Yes, but for 64-bit width, the vectorizer generates suboptimal code.
> 
> see #c3
> 
>   vector(2) long long unsigned int vect__4.6;
>   vector(2) long long unsigned int vect__4.5;
>   vector(2) long long unsigned int _8;
>   vector(2) long long unsigned int _26;
> 
>   ...
>   ...
> 
>   _8 = .POPCOUNT (vect__4.5_16);
>   _26 = .POPCOUNT (vect__4.6_9);
>   vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>; --- Why do we do this?
>   vector(4) int vect__5.7;
> 
> 
> It could generate directly
> 
>   v4di = .POPCOUNT (v4di);

I guess that the vectorized popcount IFN is defined to be VnDI -> VnDI
but we want to have VnSImode results.  This means the instruction is
wrongly modeled in vectorized form?

Note the vectorizer isn't very good at handling narrowing operations here.

If you can push the missing patterns I can have a look.  Bonus points for
a correctness testcase (from the above I think we're generating wrong code).
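
A correctness testcase along those lines could look like the sketch below.
The helper names popcount_ref and check_fooq are hypothetical, and
__builtin_popcountll is used instead of __builtin_popcountl so that the
operand is 64 bits wide on every target; a real testsuite entry would also
need the appropriate dg- directives and target options.

```c
/* Loop under test, shaped like fooq from comment #3. */
__attribute__((noinline)) static void
fooq(unsigned long long *__restrict dest, const unsigned long long *src)
{
    for (int i = 0; i != 4; i++)
        dest[i] = __builtin_popcountll(src[i]);
}

/* Bit-by-bit reference that does not call any popcount builtin. */
static unsigned long long popcount_ref(unsigned long long x)
{
    unsigned long long n = 0;
    for (; x != 0; x >>= 1)
        n += x & 1;
    return n;
}

/* Returns the number of mismatching elements; 0 means the loop is correct. */
static int check_fooq(void)
{
    const unsigned long long src[4] = {
        0ULL, 1ULL, 0x8000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL
    };
    unsigned long long dest[4];
    int mismatches = 0;

    fooq(dest, src);
    for (int i = 0; i < 4; i++)
        if (dest[i] != popcount_ref(src[i]))
            mismatches++;
    return mismatches;
}
```

A caller would abort when check_fooq returns a nonzero value. At higher
optimization levels the compiler may recognize the reference loop as a
popcount idiom, so a real test might need to keep the reference from being
optimized the same way.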

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-10 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #5 from Hongtao.liu  ---
(In reply to Richard Biener from comment #4)
> What's missing is middle-end folding support to narrow popcount to the
> appropriate internal function call with byte/half-word width when target
> support
> is available.  But I'm quite sure there's no scalar popcount instruction
> operating on half-word or byte pieces of a GPR?
> 
> Alternatively the vectorizer can use patterns to do this.

Yes, but for 64-bit width, the vectorizer generates suboptimal code.

see #c3

  vector(2) long long unsigned int vect__4.6;
  vector(2) long long unsigned int vect__4.5;
  vector(2) long long unsigned int _8;
  vector(2) long long unsigned int _26;

  ...
  ...

  _8 = .POPCOUNT (vect__4.5_16);
  _26 = .POPCOUNT (vect__4.6_9);
  vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>; --- Why do we do this?
  vector(4) int vect__5.7;


It could generate directly

  v4di = .POPCOUNT (v4di);
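
Why the pack/unpack pair is redundant for this loop can be checked on scalars.
The sketch below models one lane; the function names are hypothetical, and the
casts only stand in for VEC_PACK_TRUNC_EXPR and vec_unpack_{lo,hi}_expr. A
64-bit popcount is at most 64, so truncating it to 32 bits and sign-extending
it back cannot change the value.

```c
#include <stdint.h>

/* One lane of the current sequence: popcount, narrow to int, widen again. */
static uint64_t popcount_via_narrow(uint64_t x)
{
    uint64_t wide = (uint64_t) __builtin_popcountll(x); /* .POPCOUNT          */
    int32_t narrow = (int32_t) wide;                    /* pack/truncate step */
    return (uint64_t) (int64_t) narrow;                 /* unpack/sign-extend */
}

/* One lane of the direct form: the result stays 64 bits wide throughout. */
static uint64_t popcount_direct(uint64_t x)
{
    return (uint64_t) __builtin_popcountll(x);
}
```

Both functions return the same value for every input, which is the property
that would let the vectorizer keep the result in the wide vector.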

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-10 rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-11-10
     Ever confirmed|0                           |1
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Richard Biener  ---
What's missing is middle-end folding support to narrow popcount to the
appropriate internal function call with byte/half-word width when target
support is available.  But I'm quite sure there's no scalar popcount
instruction operating on half-word or byte pieces of a GPR?

Alternatively the vectorizer can use patterns to do this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-09 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #3 from Hongtao.liu  ---
> But for vector byte/word/quadword, vectorizer still use vpopcntd, but not
> vpopcnt{b,w,q}, missing corresponding ifn?

We don't have __builtin_popcount{w,b}, but we have __builtin_popcountl.

For the testcase
---
void
fooq(unsigned long long* __restrict dest, unsigned long long* src)
{
  for (int i = 0; i != 4; i++)
    dest[i] = __builtin_popcountl (src[i]);
}


icc/clang generate
---
_Z4fooqPxS_:                            # @_Z4fooqPxS_
        vpopcntq        ymm0, ymmword ptr [rsi]
        vmovdqu ymmword ptr [rdi], ymm0
        vzeroupper
        ret
---

But gcc generates
---
fooq:
.LFB0:
        .cfi_startproc
        vpopcntq        16(%rsi), %xmm1
        vpopcntq        (%rsi), %xmm0
        vshufps $136, %xmm1, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm0
        vmovdqu %xmm1, (%rdi)
        vmovdqu %xmm0, 16(%rdi)
        ret
        .cfi_endproc
---

dump for 164.vect

---
;; Function fooq (fooq, funcdef_no=0, decl_uid=4228, cgraph_uid=1, symbol_order=0)

Merging blocks 2 and 6
fooq (long long unsigned int * restrict dest, long long unsigned int * src)
{
  vector(2) long long unsigned int * vectp_dest.10;
  vector(2) long long unsigned int * vectp_dest.9;
  vector(2) long long unsigned int vect__7.8;
  vector(4) int vect__5.7;
  vector(2) long long unsigned int vect__4.6;
  vector(2) long long unsigned int vect__4.5;
  vector(2) long long unsigned int * vectp_src.4;
  vector(2) long long unsigned int * vectp_src.3;
  int i;
  long unsigned int _1;
  long unsigned int _2;
  long long unsigned int * _3;
  long long unsigned int _4;
  int _5;
  long long unsigned int * _6;
  long long unsigned int _7;
  vector(2) long long unsigned int _8;
  vector(2) long long unsigned int _26;
  unsigned int ivtmp_30;
  unsigned int ivtmp_31;
  unsigned int ivtmp_36;
  unsigned int ivtmp_37;

   [local count: 214748368]:

   [local count: 214748371]:
  # i_18 = PHI 
  # ivtmp_31 = PHI 
  # vectp_src.3_20 = PHI 
  # vectp_dest.9_24 = PHI 
  # ivtmp_36 = PHI 
  _1 = (long unsigned int) i_18;
  _2 = _1 * 8;
  _3 = src_11(D) + _2;
  vect__4.5_16 = MEM  [(long long unsigned
int *)vectp_src.3_20];
  vectp_src.3_15 = vectp_src.3_20 + 16;
  vect__4.6_9 = MEM  [(long long unsigned int
*)vectp_src.3_15];
  _4 = *_3;
  _8 = .POPCOUNT (vect__4.5_16);
  _26 = .POPCOUNT (vect__4.6_9);
  vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>; --- Why do we do this?
  _5 = 0;
  _6 = dest_12(D) + _2;
  vect__7.8_23 = [vec_unpack_lo_expr] vect__5.7_22;
  vect__7.8_25 = [vec_unpack_hi_expr] vect__5.7_22;
  _7 = (long long unsigned int) _5;
  MEM  [(long long unsigned int
*)vectp_dest.9_24] = vect__7.8_23;
  vectp_dest.9_34 = vectp_dest.9_24 + 16;
  MEM  [(long long unsigned int
*)vectp_dest.9_34] = vect__7.8_25;
  i_14 = i_18 + 1;
  ivtmp_30 = ivtmp_31 - 1;
  vectp_src.3_17 = vectp_src.3_15 + 16;
  vectp_dest.9_32 = vectp_dest.9_34 + 16;
  ivtmp_37 = ivtmp_36 + 1;
  if (ivtmp_37 < 1)
goto ; [0.00%]
  else
goto ; [100.00%]

   [local count: 0]:
  goto ; [100.00%]

   [local count: 214748368]:
  return;

}
---

[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt

2020-11-09 crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #2 from Hongtao.liu  ---
After adding the expander, the loop is vectorized successfully.
---
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index b153a87fb98..e8159997c40 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -22678,6 +22678,12 @@ (define_insn "avx5124vnniw_vp4dpwssds_maskz"
 (set_attr ("prefix") ("evex"))
 (set_attr ("mode") ("TI"))])

+(define_expand "popcount<mode>2"
+  [(set (match_operand:VI48_AVX512VL 0 "register_operand")
+   (popcount:VI48_AVX512VL
+ (match_operand:VI48_AVX512VL 1 "nonimmediate_operand")))]
+  "TARGET_AVX512VPOPCNTDQ")
+
 (define_insn "vpopcount"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
(popcount:VI48_AVX512VL
@@ -22722,6 +22728,12 @@ (define_insn "*restore_multiple_leave_return"
   "TARGET_SSE && TARGET_64BIT"
   "jmp\t%P1")

+(define_insn "popcount<mode>2"
+  [(set (match_operand:VI12_AVX512VL 0 "register_operand" "=v")
+   (popcount:VI12_AVX512VL
+ (match_operand:VI12_AVX512VL 1 "nonimmediate_operand" "vm")))]
+  "TARGET_AVX512BITALG")
+
 (define_insn "vpopcount"
   [(set (match_operand:VI12_AVX512VL 0 "register_operand" "=v")
(popcount:VI12_AVX512VL

---

But for vector byte/word/quadword elements, the vectorizer still uses vpopcntd
rather than vpopcnt{b,w,q}. Is the corresponding ifn missing?

void
fooq(long long* __restrict dest, long long* src)
{
  for (int i = 0; i != 4; i++)
    dest[i] = __builtin_popcount (src[i]);
}

void
foow(short* __restrict dest, short* src)
{
  for (int i = 0; i != 16; i++)
    dest[i] = __builtin_popcount (src[i]);
}

void
foob(char* __restrict dest, char* src)
{
  for (int i = 0; i != 32; i++)
    dest[i] = __builtin_popcount (src[i]);
}
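
One detail of these test functions is worth spelling out, sketched below with
hypothetical helper names and assuming a 16-bit short and a 32-bit int.
__builtin_popcount takes an unsigned int, so a signed short argument is
sign-extended before the bits are counted. For negative inputs the result
therefore differs from a 16-bit popcount, which suggests a vpopcntw-style
narrowing would only be a drop-in replacement when the source is unsigned or
known to be non-negative.

```c
/* What foow computes per element: the short is converted to the builtin's
   unsigned int parameter, which sign-extends negative values. */
static int popcount_promoted(short x)
{
    return __builtin_popcount(x);
}

/* What a 16-bit popcount would compute: only the low 16 bits are counted. */
static int popcount_16bit(short x)
{
    return __builtin_popcount((unsigned short) x);
}
```

For x = -1 the first function counts 32 set bits and the second counts 16;
for non-negative inputs the two agree. Similarly, calling __builtin_popcount
on a long long, as fooq above does, converts the argument to unsigned int and
counts only its low 32 bits.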


dump of test.c.164.vect

;; Function foow (foow, funcdef_no=0, decl_uid=4228, cgraph_uid=1, symbol_order=0)

Merging blocks 2 and 6
foow (short int * restrict dest, short int * src)
{
  vector(8) short int * vectp_dest.10;
  vector(8) short int * vectp_dest.9;
  vector(8) short int vect__8.8;
  vector(4) int vect__6.7;
  vector(4) unsigned int vect__5.6;
  vector(8) short int vect__4.5;
  vector(8) short int * vectp_src.4;
  vector(8) short int * vectp_src.3;
  int i;
  long unsigned int _1;
  long unsigned int _2;
  short int * _3;
  short int _4;
  unsigned int _5;
  int _6;
  short int * _7;
  short int _8;
  unsigned int ivtmp_26;
  unsigned int ivtmp_28;
  unsigned int ivtmp_34;
  unsigned int ivtmp_35;

   [local count: 119292720]:

   [local count: 119292719]:
  # i_19 = PHI 
  # ivtmp_35 = PHI 
  # vectp_src.3_24 = PHI 
  # vectp_dest.9_9 = PHI 
  # ivtmp_26 = PHI 
  _1 = (long unsigned int) i_19;
  _2 = _1 * 2;
  _3 = src_12(D) + _2;
  vect__4.5_22 = MEM  [(short int *)vectp_src.3_24];
  _4 = *_3;
  vect__5.6_21 = [vec_unpack_lo_expr] vect__4.5_22;
  vect__5.6_18 = [vec_unpack_hi_expr] vect__4.5_22;
  _5 = (unsigned int) _4;
  vect__6.7_17 = .POPCOUNT (vect__5.6_21);
  vect__6.7_16 = .POPCOUNT (vect__5.6_18);
  _6 = 0;
  _7 = dest_13(D) + _2;
  vect__8.8_10 = VEC_PACK_TRUNC_EXPR ;
  _8 = (short int) _6;
  MEM  [(short int *)vectp_dest.9_9] = vect__8.8_10;
  i_15 = i_19 + 1;
  ivtmp_34 = ivtmp_35 - 1;
  vectp_src.3_23 = vectp_src.3_24 + 16;
  vectp_dest.9_29 = vectp_dest.9_9 + 16;
  ivtmp_28 = ivtmp_26 + 1;
  if (ivtmp_28 < 1)
goto ; [0.00%]
  else
goto ; [100.00%]

   [local count: 0]:
  goto ; [100.00%]

   [local count: 119292720]:
  return;

}