[PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support
Hi!

This patch started with noticing, while working on PR50596, that

#define N 1024
long long a[N];
char b[N];

void
foo (void)
{
  int i;
  for (i = 0; i < N; i++)
    b[i] = a[i];
}

is even with -O3 -mavx2 vectorized just with 16-byte vectors instead of
32-byte vectors, and it contains various fixes for problems I've noticed
when diving into it.

Vector permutations with AVX2 aren't very easy, because some instructions
don't shuffle cross-lane, and some do, but only for some modes.  The patch
adds AVX2 vec_pack_trunc* expanders so that the above can be vectorized,
and implements a couple of permutation sequences, including, for a
single-operand __builtin_vec_shuffle, a 4-insn sequence that handles
arbitrary V32QI/V16HI constant permutations (and some cases where 1 insn
is possible too), and also variable-mask V{32Q,16H,8S,4D}I permutations.

I think we badly need a testcase which will try all possible constant
permutations (probably one testcase per mode); even for V32QImode that's
just 32x32 plus 32x64 tests (if split into 32 tests in a function times
96 noinline functions).  But with that I'd like to wait for Richard's
permutation improvements, because although currently the backend signals
it can handle some constant-argument permutation, e.g. for V32QImode,
as there is no V32QImode permutation builtin, __builtin_shuffle emits it
as a variable-mask operation.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2011-10-12  Jakub Jelinek  <ja...@redhat.com>

	* config/i386/i386.md (UNSPEC_VPERMDI): Remove.
	* config/i386/i386.c (ix86_expand_vec_perm): Handle V16QImode and
	V32QImode for TARGET_AVX2.
	(MAX_VECT_LEN): Increase to 32.
	(expand_vec_perm_blend): Add support for 32-byte integer vectors
	with TARGET_AVX2.
	(valid_perm_using_mode_p): New function.
	(expand_vec_perm_pshufb): Add support for 32-byte integer vectors
	with TARGET_AVX2.
	(expand_vec_perm_vpshufb2_vpermq): New function.
	(expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
	(expand_vec_perm_even_odd_1): Handle 32-byte integer vectors with
	TARGET_AVX2.
	(ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
	and expand_vec_perm_vpshufb2_vpermq_even_odd.
	* config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
	32-byte integer vector modes.
	(vec_pack_trunc_<mode>): Use VI248_AVX2 instead of VI248_128.
	(avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
	(avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
	4 new operands.
	(avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use match_dup,
	instead add 4 new operands and require they have right cross-lane
	values.
	(avx2_permv4di): Change into define_expand.
	(avx2_permv4di_1): New instruction.
	(avx2_permv2ti): Use nonimmediate_operand instead of register_operand
	for xm constrained operand.
	(VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.

--- gcc/config/i386/i386.md.jj	2011-10-06 16:42:12.0 +0200
+++ gcc/config/i386/i386.md	2011-10-11 10:07:04.0 +0200
@@ -235,7 +235,6 @@ (define_c_enum unspec [
    UNSPEC_VPERMSI
    UNSPEC_VPERMDF
    UNSPEC_VPERMSF
-   UNSPEC_VPERMDI
    UNSPEC_VPERMTI
    UNSPEC_GATHER
--- gcc/config/i386/i386.c.jj	2011-10-10 09:41:28.0 +0200
+++ gcc/config/i386/i386.c	2011-10-12 11:05:06.0 +0200
@@ -19334,7 +19334,7 @@ ix86_expand_vec_perm (rtx operands[])
   rtx op0 = operands[1];
   rtx op1 = operands[2];
   rtx mask = operands[3];
-  rtx t1, t2, vt, vec[16];
+  rtx t1, t2, t3, t4, vt, vt2, vec[32];
   enum machine_mode mode = GET_MODE (op0);
   enum machine_mode maskmode = GET_MODE (mask);
   int w, e, i;
@@ -19343,50 +19343,72 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 16);
+  gcc_assert (w <= 32);
 
   if (TARGET_AVX2)
     {
-      if (mode == V4DImode || mode == V4DFmode)
+      if (mode == V4DImode || mode == V4DFmode || mode == V16HImode)
 	{
 	  /* Unfortunately, the VPERMQ and VPERMPD instructions only
	     support an constant shuffle operand.
	     With a tiny bit of effort we can use VPERMD instead.  A
	     re-interpretation stall for V4DFmode is
-	     unfortunate but there's no avoiding it.  */
-	  t1 = gen_reg_rtx (V8SImode);
+	     unfortunate but there's no avoiding it.
+	     Similarly for V16HImode we don't have instructions for variable
+	     shuffling, while for V32QImode we can use after preparing suitable
+	     masks vpshufb; vpshufb; vpermq; vpor.  */
+
+	  if (mode == V16HImode)
+	    {
+	      maskmode = mode = V32QImode;
+	      w = 32;
+	      e = 1;
+	    }
+	  else
+	    {
+	      maskmode = mode = V8SImode;
+
Re: [PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support
On 10/12/2011 09:09 AM, Jakub Jelinek wrote:
> 	  /* Multiply the shuffle indicies by two.  */
> -	  emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
> +	  if (maskmode == V8SImode)
> +	    emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
> +	  else
> +	    emit_insn (gen_addv32qi3 (t1, t1, t1));

I guess this would be cleaner to use plus always.  And thus
expand_simple_binop instead of (a couple of) these mode tests.

> +	case V32QImode:
> +	  t1 = gen_reg_rtx (V32QImode);
> +	  t2 = gen_reg_rtx (V32QImode);
> +	  t3 = gen_reg_rtx (V32QImode);
> +	  vt2 = GEN_INT (128);
> +	  for (i = 0; i < 32; i++)
> +	    vec[i] = vt2;
> +	  vt = gen_rtx_CONST_VECTOR (V32QImode, gen_rtvec_v (32, vec));
> +	  vt = force_reg (V32QImode, vt);
> +	  for (i = 0; i < 32; i++)
> +	    vec[i] = i < 16 ? vt2 : const0_rtx;
> +	  vt2 = gen_rtx_CONST_VECTOR (V32QImode, gen_rtvec_v (32, vec));
> +	  vt2 = force_reg (V32QImode, vt2);
> +	  emit_insn (gen_avx2_lshlv4di3 (gen_lowpart (V4DImode, t1),
> +					 gen_lowpart (V4DImode, mask),
> +					 GEN_INT (3)));
> +	  emit_insn (gen_avx2_andnotv32qi3 (t2, vt, mask));
> +	  emit_insn (gen_xorv32qi3 (t1, t1, vt2));
> +	  emit_insn (gen_andv32qi3 (t1, t1, vt));
> +	  emit_insn (gen_iorv32qi3 (t3, t1, t2));
> +	  emit_insn (gen_xorv32qi3 (t1, t1, vt));
> +	  emit_insn (gen_avx2_permv4di_1 (gen_lowpart (V4DImode, t3),
> +					  gen_lowpart (V4DImode, t3),
> +					  const2_rtx, GEN_INT (3),
> +					  const0_rtx, const1_rtx));
> +	  emit_insn (gen_iorv32qi3 (t1, t1, t2));

Some commentary here is required.  I might have expected to see a compare,
or something, but the logical operations here are less than obvious.

I believe I've commented on everything else in the previous messages.


r~
[PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support (take 2)
On Wed, Oct 12, 2011 at 10:49:33AM -0700, Richard Henderson wrote:
> I believe I've commented on everything else in the previous messages.

Here is an updated patch which should incorporate your comments from both
mails (thanks for them).

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2011-10-12  Jakub Jelinek  <ja...@redhat.com>

	* config/i386/i386.md (UNSPEC_VPERMDI): Remove.
	* config/i386/i386.c (ix86_expand_vec_perm): Handle V16QImode and
	V32QImode for TARGET_AVX2.
	(MAX_VECT_LEN): Increase to 32.
	(expand_vec_perm_blend): Add support for 32-byte integer vectors
	with TARGET_AVX2.
	(valid_perm_using_mode_p): New function.
	(expand_vec_perm_pshufb): Add support for 32-byte integer vectors
	with TARGET_AVX2.
	(expand_vec_perm_vpshufb2_vpermq): New function.
	(expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
	(expand_vec_perm_even_odd_1): Handle 32-byte integer vectors with
	TARGET_AVX2.
	(ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
	and expand_vec_perm_vpshufb2_vpermq_even_odd.
	* config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
	32-byte integer vector modes.
	(vec_pack_trunc_<mode>): Use VI248_AVX2 instead of VI248_128.
	(avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
	(avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
	4 new operands.
	(avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use match_dup,
	instead add 4 new operands and require they have right cross-lane
	values.
	(avx2_permv4di): Change into define_expand.
	(avx2_permv4di_1): New instruction.
	(avx2_permv2ti): Use nonimmediate_operand instead of register_operand
	for xm constrained operand.
	(VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.
--- gcc/config/i386/i386.md.jj	2011-10-12 20:28:19.0 +0200
+++ gcc/config/i386/i386.md	2011-10-12 20:30:00.0 +0200
@@ -235,7 +235,6 @@ (define_c_enum unspec [
    UNSPEC_VPERMSI
    UNSPEC_VPERMDF
    UNSPEC_VPERMSF
-   UNSPEC_VPERMDI
    UNSPEC_VPERMTI
    UNSPEC_GATHER
--- gcc/config/i386/i386.c.jj	2011-10-12 20:28:19.0 +0200
+++ gcc/config/i386/i386.c	2011-10-12 21:01:39.0 +0200
@@ -19334,7 +19334,7 @@ ix86_expand_vec_perm (rtx operands[])
   rtx op0 = operands[1];
   rtx op1 = operands[2];
   rtx mask = operands[3];
-  rtx t1, t2, vt, vec[16];
+  rtx t1, t2, t3, t4, vt, vt2, vec[32];
   enum machine_mode mode = GET_MODE (op0);
   enum machine_mode maskmode = GET_MODE (mask);
   int w, e, i;
@@ -19343,50 +19343,68 @@ ix86_expand_vec_perm (rtx operands[])
   /* Number of elements in the vector.  */
   w = GET_MODE_NUNITS (mode);
   e = GET_MODE_UNIT_SIZE (mode);
-  gcc_assert (w <= 16);
+  gcc_assert (w <= 32);
 
   if (TARGET_AVX2)
     {
-      if (mode == V4DImode || mode == V4DFmode)
+      if (mode == V4DImode || mode == V4DFmode || mode == V16HImode)
 	{
 	  /* Unfortunately, the VPERMQ and VPERMPD instructions only
	     support an constant shuffle operand.  With a tiny bit of
	     effort we can use VPERMD instead.  A re-interpretation stall
	     for V4DFmode is
-	     unfortunate but there's no avoiding it.  */
-	  t1 = gen_reg_rtx (V8SImode);
+	     unfortunate but there's no avoiding it.
+	     Similarly for V16HImode we don't have instructions for variable
+	     shuffling, while for V32QImode we can use after preparing suitable
+	     masks vpshufb; vpshufb; vpermq; vpor.  */
+
+	  if (mode == V16HImode)
+	    {
+	      maskmode = mode = V32QImode;
+	      w = 32;
+	      e = 1;
+	    }
+	  else
+	    {
+	      maskmode = mode = V8SImode;
+	      w = 8;
+	      e = 4;
+	    }
+	  t1 = gen_reg_rtx (maskmode);
 
 	  /* Replicate the low bits of the V4DImode mask into V8SImode:
	     mask = { A B C D }
	     t1   = { A A B B C C D D }.
	     */
-	  for (i = 0; i < 4; ++i)
+	  for (i = 0; i < w / 2; ++i)
 	    vec[i*2 + 1] = vec[i*2] = GEN_INT (i * 2);
-	  vt = gen_rtx_CONST_VECTOR (V8SImode, gen_rtvec_v (8, vec));
-	  vt = force_reg (V8SImode, vt);
-	  mask = gen_lowpart (V8SImode, mask);
-	  emit_insn (gen_avx2_permvarv8si (t1, vt, mask));
+	  vt = gen_rtx_CONST_VECTOR (maskmode, gen_rtvec_v (w, vec));
+	  vt = force_reg (maskmode, vt);
+	  mask = gen_lowpart (maskmode, mask);
+	  if (maskmode == V8SImode)
+	    emit_insn (gen_avx2_permvarv8si (t1, vt, mask));
+	  else
+	    emit_insn (gen_avx2_pshufbv32qi3 (t1, mask, vt));
 
 	  /* Multiply the shuffle indicies by two.  */
-	  emit_insn (gen_avx2_lshlv8si3 (t1, t1, const1_rtx));
+	  t1 = expand_simple_binop (maskmode, PLUS, t1, t1, t1, 1,
+
Re: [PATCH] AVX2 vector permutation fixes plus vec_pack_trunc_{v16hi,v8si,v4di} support (take 2)
On 10/12/2011 02:23 PM, Jakub Jelinek wrote:
> 2011-10-12  Jakub Jelinek  <ja...@redhat.com>
>
> 	* config/i386/i386.md (UNSPEC_VPERMDI): Remove.
> 	* config/i386/i386.c (ix86_expand_vec_perm): Handle V16QImode and
> 	V32QImode for TARGET_AVX2.
> 	(MAX_VECT_LEN): Increase to 32.
> 	(expand_vec_perm_blend): Add support for 32-byte integer vectors
> 	with TARGET_AVX2.
> 	(valid_perm_using_mode_p): New function.
> 	(expand_vec_perm_pshufb): Add support for 32-byte integer vectors
> 	with TARGET_AVX2.
> 	(expand_vec_perm_vpshufb2_vpermq): New function.
> 	(expand_vec_perm_vpshufb2_vpermq_even_odd): New function.
> 	(expand_vec_perm_even_odd_1): Handle 32-byte integer vectors with
> 	TARGET_AVX2.
> 	(ix86_expand_vec_perm_builtin_1): Try expand_vec_perm_vpshufb2_vpermq
> 	and expand_vec_perm_vpshufb2_vpermq_even_odd.
> 	* config/i386/sse.md (VEC_EXTRACT_EVENODD_MODE): Add for TARGET_AVX2
> 	32-byte integer vector modes.
> 	(vec_pack_trunc_<mode>): Use VI248_AVX2 instead of VI248_128.
> 	(avx2_interleave_highv32qi, avx2_interleave_lowv32qi): Remove pasto.
> 	(avx2_pshufdv3, avx2_pshuflwv3, avx2_pshufhwv3): Generate
> 	4 new operands.
> 	(avx2_pshufd_1, avx2_pshuflw_1, avx2_pshufhw_1): Don't use match_dup,
> 	instead add 4 new operands and require they have right cross-lane
> 	values.
> 	(avx2_permv4di): Change into define_expand.
> 	(avx2_permv4di_1): New instruction.
> 	(avx2_permv2ti): Use nonimmediate_operand instead of register_operand
> 	for xm constrained operand.
> 	(VEC_PERM_AVX2): Add V32QI and V16QI for TARGET_AVX2.

Ok.


r~