[PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support

2011-10-12 Thread Jakub Jelinek
Hi!

This patch started with noticing while working on PR50596 that
#define N 1024
long long a[N];
char b[N];
void
foo (void)
{
  int i;
  for (i = 0; i < N; i++)
    b[i] = a[i];
}
is, even with -O3 -mavx2, vectorized with just 16-byte vectors
instead of 32-byte vectors; the patch also contains various fixes
for things I noticed while diving into it.  Vector permutations with
AVX2 aren't easy, because some instructions don't shuffle across
lanes, and some do, but only for some modes.  The patch adds AVX2
vec_pack_trunc* expanders so that the above can be vectorized,
and implements a couple of permutation sequences, including, for
a single operand __builtin_vec_shuffle, a 4 insn sequence that
handles arbitrary V32QI/V16HI constant permutations (and some
cases where 1 insn is possible too), as well as variable mask
V{32Q,16H,8S,4D}I permutations.
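
To make the vec_pack_trunc* semantics mentioned above concrete, here is a
minimal scalar sketch for the V4DI -> V8SI case.  The function name
model_pack_trunc_v4di is hypothetical and only models the result (two input
vectors are narrowed by truncation and concatenated, first operand in the
low half); it says nothing about which insns the expander emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a V4DI -> V8SI pack-with-truncation:
   each 64-bit element of LO and HI is truncated to 32 bits, and the
   narrowed elements of LO are followed by those of HI.  */
static void
model_pack_trunc_v4di (uint32_t out[8], const uint64_t lo[4],
		       const uint64_t hi[4])
{
  for (size_t i = 0; i < 4; i++)
    {
      out[i] = (uint32_t) lo[i];	/* keep only the low 32 bits */
      out[i + 4] = (uint32_t) hi[i];
    }
}
```

For the long long to char loop above, the vectorizer would presumably chain
such narrowing steps (64 -> 32 -> 16 -> 8 bits), which is why all three
element widths need an expander.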
I think we badly need a testcase which will try all possible
constant permutations (probably one testcase per mode);
even for V32QImode that's just 32x32 plus 32x64 tests (if
split into 32 tests in a function times 96 noinline functions).
But with that I'd like to wait for Richard's permutation improvements,
because although the backend currently signals that it can handle
some constant argument permutations, e.g. for V32QImode, there is no
V32QImode permutation builtin, so __builtin_shuffle emits it as a
variable mask operation.
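
One possible reading of the 32x32 plus 32x64 count above is "each of the 32
result slots takes each possible source index once", for the one-operand and
two-operand cases respectively.  The sketch below follows that reading; the
names make_mask, ref_shuffle2 and count_masks are hypothetical, and the
reference routine is a plain scalar model to compare compiler output
against, not an existing testsuite helper.

```c
#include <assert.h>
#include <stdint.h>

enum { NELT = 32 };	/* number of elements in V32QImode */

/* Hypothetical generator: the identity mask with slot POS replaced by
   IDX.  IDX < 32 selects from the first operand, 32..63 from the
   second.  */
static void
make_mask (uint8_t mask[NELT], int pos, int idx)
{
  assert (pos >= 0 && pos < NELT);
  assert (idx >= 0 && idx < 2 * NELT);
  for (int i = 0; i < NELT; i++)
    mask[i] = (uint8_t) i;
  mask[pos] = (uint8_t) idx;
}

/* Scalar reference for a two-operand permutation of A and B.  */
static void
ref_shuffle2 (uint8_t out[NELT], const uint8_t a[NELT],
	      const uint8_t b[NELT], const uint8_t mask[NELT])
{
  for (int i = 0; i < NELT; i++)
    {
      unsigned idx = mask[i] % (2 * NELT);
      out[i] = idx < NELT ? a[idx] : b[idx - NELT];
    }
}

/* Number of masks when each slot tries NVALUES different indices.  */
static int
count_masks (int nvalues)
{
  return NELT * nvalues;
}
```

Under this reading the total is 32*32 + 32*64 = 3072 masks, which matches
32 tests in each of 96 functions.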

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2011-10-12  Jakub Jelinek  ja...@redhat.com

* config/i386/i386.md (UNSPEC_VPERMDI): Remove.
* config/i386/i386.c (ix86_expand_vec_perm): Handle
V16QImode and V32QImode for TARGET_AVX2.
(MAX_VECT_LEN): Increase to 32.
(expand_vec_perm_blend): Add support for 32-byte integer
vectors with TARGET_AVX2.
(valid_perm_using_mode_p): New function.
(expand_vec_perm_pshufb): Add support for 32-byte integer
vectors with TARGET_AVX2.
(expand_vec_perm_vpshufb2_vpermq): New function.
(expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
(expand_vec_perm_even_odd_1): Handle 32-byte integer vectors
with TARGET_AVX2.
(ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
and expand_vec_perm_vpshufb2_vpermq_even_odd.
* config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
32-byte integer vector modes.
(vec_pack_trunc_mode): Use VI248_AVX2 instead of VI248_128.
(avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
(avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
4 new operands.
(avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use
match_dup, instead add 4 new operands and require they have
right cross-lane values.
(avx2_permv4di): Change into define_expand.
(avx2_permv4di_1): New instruction.
(avx2_permv2ti): Use nonimmediate_operand instead of register_operand
for xm constrained operand.
(VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.

--- gcc/config/i386/i386.md.jj  2011-10-06 16:42:12.0 +0200
+++ gcc/config/i386/i386.md 2011-10-11 10:07:04.0 +0200
@@ -235,7 +235,6 @@ (define_c_enum unspec [
   UNSPEC_VPERMSI
   UNSPEC_VPERMDF
   UNSPEC_VPERMSF
-  UNSPEC_VPERMDI
   UNSPEC_VPERMTI
   UNSPEC_GATHER
 
--- gcc/config/i386/i386.c.jj   2011-10-10 09:41:28.0 +0200
+++ gcc/config/i386/i386.c  2011-10-12 11:05:06.0 +0200
@@ -19334,7 +19334,7 @@ ix86_expand_vec_perm (rtx operands[])
   rtx op0 = operands[1];
   rtx op1 = operands[2];
   rtx mask = operands[3];
-  rtx t1, t2, vt, vec[16];
+  rtx t1, t2, t3, t4, vt, vt2, vec[32];
   enum machine_mode mode = GET_MODE (op0);
   enum machine_mode maskmode = GET_MODE (mask);
   int w, e, i;
@@ -19343,50 +19343,72 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 16);
+  gcc_assert (w <= 32);
 
   if (TARGET_AVX2)
 {
-  if (mode == V4DImode || mode == V4DFmode)
+  if (mode == V4DImode || mode == V4DFmode || mode == V16HImode)
{
  /* Unfortunately, the VPERMQ and VPERMPD instructions only support
 an constant shuffle operand.  With a tiny bit of effort we can
 use VPERMD instead.  A re-interpretation stall for V4DFmode is
-unfortunate but there's no avoiding it.  */
- t1 = gen_reg_rtx (V8SImode);
+unfortunate but there's no avoiding it.
+Similarly for V16HImode we don't have instructions for variable
+shuffling, while for V32QImode we can use after preparing suitable
+masks vpshufb; vpshufb; vpermq; vpor.  */
+
+ if (mode == V16HImode)
+   {
+ maskmode = mode = V32QImode;
+ w = 32;
+ e = 1;
+   }
+ else
+   {
+ maskmode = mode = V8SImode;
+

Re: [PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support

2011-10-12 Thread Richard Henderson
On 10/12/2011 09:09 AM, Jakub Jelinek wrote:
> /* Multiply the shuffle indicies by two.  */
> -   emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
> +   if (maskmode == V8SImode)
> + emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
> +   else
> + emit_insn (gen_addv32qi3 (t1, t1, t1));

I guess this would be cleaner to use plus always.  And thus
expand_simple_binop instead of (a couple of) these mode tests.
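
A small scalar sketch of why an element-wise add is the mode-independent way
to double indices: x + x gives the same result as a shift by one within each
element, whatever the element width, while a shift done on a wider unit only
agrees as long as no element carries into its neighbour.  The helper names
here are hypothetical and the 4-byte array merely stands in for a vector.

```c
#include <assert.h>
#include <stdint.h>

/* Double each byte with a per-element add (wraps modulo 256).  */
static void
double_bytes (uint8_t v[4])
{
  for (int i = 0; i < 4; i++)
    v[i] = (uint8_t) (v[i] + v[i]);
}

/* Pack four bytes into one 32-bit word, byte 0 in the low bits.  */
static uint32_t
pack4 (const uint8_t v[4])
{
  return (uint32_t) v[0] | ((uint32_t) v[1] << 8)
	 | ((uint32_t) v[2] << 16) | ((uint32_t) v[3] << 24);
}

/* Double by shifting the whole 32-bit word, as a dword shift would.  */
static uint32_t
shift_dword (uint32_t x)
{
  return x << 1;
}
```

With bytes { 1, 2, 3, 4 } both approaches agree; with a byte of 0x80 the
dword shift carries into the next byte while the per-byte add wraps to 0.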

  
> + case V32QImode:
> +   t1 = gen_reg_rtx (V32QImode);
> +   t2 = gen_reg_rtx (V32QImode);
> +   t3 = gen_reg_rtx (V32QImode);
> +   vt2 = GEN_INT (128);
> +   for (i = 0; i < 32; i++)
> + vec[i] = vt2;
> +   vt = gen_rtx_CONST_VECTOR (V32QImode, gen_rtvec_v (32, vec));
> +   vt = force_reg (V32QImode, vt);
> +   for (i = 0; i < 32; i++)
> + vec[i] = i < 16 ? vt2 : const0_rtx;
> +   vt2 = gen_rtx_CONST_VECTOR (V32QImode, gen_rtvec_v (32, vec));
> +   vt2 = force_reg (V32QImode, vt2);
> +   emit_insn (gen_avx2_lshlv4di3 (gen_lowpart (V4DImode, t1),
> +  gen_lowpart (V4DImode, mask),
> +  GEN_INT (3)));
> +   emit_insn (gen_avx2_andnotv32qi3 (t2, vt, mask));
> +   emit_insn (gen_xorv32qi3 (t1, t1, vt2));
> +   emit_insn (gen_andv32qi3 (t1, t1, vt));
> +   emit_insn (gen_iorv32qi3 (t3, t1, t2));
> +   emit_insn (gen_xorv32qi3 (t1, t1, vt));
> +   emit_insn (gen_avx2_permv4di_1 (gen_lowpart (V4DImode, t3),
> +   gen_lowpart (V4DImode, t3),
> +   const2_rtx, GEN_INT (3),
> +   const0_rtx, const1_rtx));
> +   emit_insn (gen_iorv32qi3 (t1, t1, t2));

Some commentary here is required.  I might have expected to see a compare,
or something, but the logical operations here are less than obvious.
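
As a possible aid to reading the logical operations quoted above, here is a
scalar sketch of the general idea as I understand it: shifting the mask left
by 3 moves index bit 4 (the lane select) into bit 7, which vpshufb treats as
"write zero", and an xor with 128 in one lane then tells same-lane from
cross-lane elements.  This models the technique only, not the exact insn
order of the patch; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of 256-bit vpshufb: each 16-byte lane is shuffled on
   its own, and a control byte with bit 7 set produces zero.  */
static void
model_vpshufb (uint8_t dst[32], const uint8_t src[32],
	       const uint8_t ctrl[32])
{
  for (int i = 0; i < 32; i++)
    dst[i] = (ctrl[i] & 0x80) ? 0 : src[(i & 16) + (ctrl[i] & 15)];
}

/* Scalar model of vpermq with selector { 2, 3, 0, 1 }: swap the two
   128-bit lanes.  */
static void
model_swap_lanes (uint8_t dst[32], const uint8_t src[32])
{
  for (int i = 0; i < 32; i++)
    dst[i] = src[i ^ 16];
}

/* Variable one-operand shuffle: dst[i] = src[mask[i] & 31], built
   from two in-lane shuffles, one lane swap and an or.  */
static void
model_var_shuffle (uint8_t dst[32], const uint8_t src[32],
		   const uint8_t mask[32])
{
  uint8_t same[32], cross[32], swapped[32], r1[32], r2[32];

  for (int i = 0; i < 32; i++)
    {
      /* Index bit 4 moved to bit 7: which lane the source lives in.  */
      uint8_t lane_bit = (uint8_t) ((mask[i] << 3) & 0x80);
      /* 0x80 iff the source lane differs from this element's lane.  */
      uint8_t is_cross = lane_bit ^ (i < 16 ? 0 : 0x80);

      same[i] = (uint8_t) ((mask[i] & 15) | is_cross);
      cross[i] = (uint8_t) ((mask[i] & 15) | (is_cross ^ 0x80));
    }

  model_swap_lanes (swapped, src);
  model_vpshufb (r1, src, same);	/* same-lane elements, else 0 */
  model_vpshufb (r2, swapped, cross);	/* cross-lane elements, else 0 */
  for (int i = 0; i < 32; i++)
    dst[i] = r1[i] | r2[i];
}
```

Each result byte is nonzero in at most one of the two partial shuffles, so
the final or merges them without a compare or blend.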

I believe I've commented on everything else in the previous messages.


r~


[PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support (take 2)

2011-10-12 Thread Jakub Jelinek
On Wed, Oct 12, 2011 at 10:49:33AM -0700, Richard Henderson wrote:
> I believe I've commented on everything else in the previous messages.

Here is an updated patch which should incorporate your comments from both
mails (thanks for them).  Bootstrapped/regtested on x86_64-linux and
i686-linux, ok for trunk?

2011-10-12  Jakub Jelinek  ja...@redhat.com

* config/i386/i386.md (UNSPEC_VPERMDI): Remove.
* config/i386/i386.c (ix86_expand_vec_perm): Handle
V16QImode and V32QImode for TARGET_AVX2.
(MAX_VECT_LEN): Increase to 32.
(expand_vec_perm_blend): Add support for 32-byte integer
vectors with TARGET_AVX2.
(valid_perm_using_mode_p): New function.
(expand_vec_perm_pshufb): Add support for 32-byte integer
vectors with TARGET_AVX2.
(expand_vec_perm_vpshufb2_vpermq): New function.
(expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
(expand_vec_perm_even_odd_1): Handle 32-byte integer vectors
with TARGET_AVX2.
(ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
and expand_vec_perm_vpshufb2_vpermq_even_odd.
* config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
32-byte integer vector modes.
(vec_pack_trunc_mode): Use VI248_AVX2 instead of VI248_128.
(avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
(avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
4 new operands.
(avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use
match_dup, instead add 4 new operands and require they have
right cross-lane values.
(avx2_permv4di): Change into define_expand.
(avx2_permv4di_1): New instruction.
(avx2_permv2ti): Use nonimmediate_operand instead of register_operand
for xm constrained operand.
(VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.

--- gcc/config/i386/i386.md.jj  2011-10-12 20:28:19.0 +0200
+++ gcc/config/i386/i386.md 2011-10-12 20:30:00.0 +0200
@@ -235,7 +235,6 @@ (define_c_enum unspec [
   UNSPEC_VPERMSI
   UNSPEC_VPERMDF
   UNSPEC_VPERMSF
-  UNSPEC_VPERMDI
   UNSPEC_VPERMTI
   UNSPEC_GATHER
 
--- gcc/config/i386/i386.c.jj   2011-10-12 20:28:19.0 +0200
+++ gcc/config/i386/i386.c  2011-10-12 21:01:39.0 +0200
@@ -19334,7 +19334,7 @@ ix86_expand_vec_perm (rtx operands[])
   rtx op0 = operands[1];
   rtx op1 = operands[2];
   rtx mask = operands[3];
-  rtx t1, t2, vt, vec[16];
+  rtx t1, t2, t3, t4, vt, vt2, vec[32];
   enum machine_mode mode = GET_MODE (op0);
   enum machine_mode maskmode = GET_MODE (mask);
   int w, e, i;
@@ -19343,50 +19343,68 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 16);
+  gcc_assert (w <= 32);
 
   if (TARGET_AVX2)
 {
-  if (mode == V4DImode || mode == V4DFmode)
+  if (mode == V4DImode || mode == V4DFmode || mode == V16HImode)
{
  /* Unfortunately, the VPERMQ and VPERMPD instructions only support
 an constant shuffle operand.  With a tiny bit of effort we can
 use VPERMD instead.  A re-interpretation stall for V4DFmode is
-unfortunate but there's no avoiding it.  */
- t1 = gen_reg_rtx (V8SImode);
+unfortunate but there's no avoiding it.
+Similarly for V16HImode we don't have instructions for variable
+shuffling, while for V32QImode we can use after preparing suitable
+masks vpshufb; vpshufb; vpermq; vpor.  */
+
+ if (mode == V16HImode)
+   {
+ maskmode = mode = V32QImode;
+ w = 32;
+ e = 1;
+   }
+ else
+   {
+ maskmode = mode = V8SImode;
+ w = 8;
+ e = 4;
+   }
+ t1 = gen_reg_rtx (maskmode);
 
  /* Replicate the low bits of the V4DImode mask into V8SImode:
   mask = { A B C D }
   t1 = { A A B B C C D D }.  */
- for (i = 0; i < 4; ++i)
+ for (i = 0; i < w / 2; ++i)
vec[i*2 + 1] = vec[i*2] = GEN_INT (i * 2);
- vt = gen_rtx_CONST_VECTOR (V8SImode, gen_rtvec_v (8, vec));
- vt = force_reg (V8SImode, vt);
- mask = gen_lowpart (V8SImode, mask);
- emit_insn (gen_avx2_permvarv8si (t1, vt, mask));
+ vt = gen_rtx_CONST_VECTOR (maskmode, gen_rtvec_v (w, vec));
+ vt = force_reg (maskmode, vt);
+ mask = gen_lowpart (maskmode, mask);
+ if (maskmode == V8SImode)
+   emit_insn (gen_avx2_permvarv8si (t1, vt, mask));
+ else
+   emit_insn (gen_avx2_pshufbv32qi3 (t1, mask, vt));
 
  /* Multiply the shuffle indicies by two.  */
- emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
+ t1 = expand_simple_binop (maskmode, PLUS, t1, t1, t1, 1,
+
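
For readers following the V4DImode path in the hunk above, a scalar sketch
of how a variable 64-bit permutation can be expressed through a 32-bit one:
each qword index A becomes the dword index pair { 2*A, 2*A + 1 }.  The
replicate and multiply-by-two steps are named in the comments of the hunk;
the final "+1 on odd elements" step is my assumption about what must follow,
since the hunk is cut off here.  Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vpermd: dst[i] = src[idx[i] & 7].  */
static void
model_vpermd (uint32_t dst[8], const uint32_t src[8],
	      const uint32_t idx[8])
{
  for (int i = 0; i < 8; i++)
    dst[i] = src[idx[i] & 7];
}

/* Variable V4DI permutation, dst[i] = src[mask[i] & 3], done with a
   V8SI permutation.  */
static void
model_permute_v4di (uint64_t dst[4], const uint64_t src[4],
		    const uint64_t mask[4])
{
  uint32_t s32[8], d32[8], idx[8];

  memcpy (s32, src, sizeof s32);	/* view the qwords as dwords */
  for (int i = 0; i < 4; i++)
    {
      uint32_t a = (uint32_t) mask[i];	/* low bits of the qword mask */
      idx[2 * i] = 2 * a;		/* first half of qword A */
      idx[2 * i + 1] = 2 * a + 1;	/* second half (assumed step) */
    }
  model_vpermd (d32, s32, idx);
  memcpy (dst, d32, sizeof d32);
}
```

Because both halves of a qword always move together and in order, the
result does not depend on which half is the low one.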

Re: [PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support (take 2)

2011-10-12 Thread Richard Henderson
On 10/12/2011 02:23 PM, Jakub Jelinek wrote:
> 2011-10-12  Jakub Jelinek  ja...@redhat.com
>
>   * config/i386/i386.md (UNSPEC_VPERMDI): Remove.
>   * config/i386/i386.c (ix86_expand_vec_perm): Handle
>   V16QImode and V32QImode for TARGET_AVX2.
>   (MAX_VECT_LEN): Increase to 32.
>   (expand_vec_perm_blend): Add support for 32-byte integer
>   vectors with TARGET_AVX2.
>   (valid_perm_using_mode_p): New function.
>   (expand_vec_perm_pshufb): Add support for 32-byte integer
>   vectors with TARGET_AVX2.
>   (expand_vec_perm_vpshufb2_vpermq): New function.
>   (expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
>   (expand_vec_perm_even_odd_1): Handle 32-byte integer vectors
>   with TARGET_AVX2.
>   (ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
>   and expand_vec_perm_vpshufb2_vpermq_even_odd.
>   * config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
>   32-byte integer vector modes.
>   (vec_pack_trunc_mode): Use VI248_AVX2 instead of VI248_128.
>   (avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
>   (avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
>   4 new operands.
>   (avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use
>   match_dup, instead add 4 new operands and require they have
>   right cross-lane values.
>   (avx2_permv4di): Change into define_expand.
>   (avx2_permv4di_1): New instruction.
>   (avx2_permv2ti): Use nonimmediate_operand instead of register_operand
>   for xm constrained operand.
>   (VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.

Ok.


r~