Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 10:43 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> For PR62128, IMHO the right fix is attached.  Note, this is again covered
> in the vshuf-*.c tests (test 22 in both vshuf-v32*.c and vshuf-v16*.c).
> With that attached patch, pr62128.c (aka test_22 in vshuf-v32qi.c) changes:
>
> -       vpshufb .LC0(%rip), %ymm0, %ymm1
> -       vpshufb .LC1(%rip), %ymm0, %ymm0
> -       vpermq  $78, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vpermq  $78, %ymm0, %ymm1
> +       vpalignr        $1, %ymm0, %ymm1, %ymm0
>         ret
>
> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>
>         PR target/62128
>         * config/i386/i386.c (expand_vec_perm_1): Try expand_vec_perm_palignr
>         if it expands to a single insn only.
>         (expand_vec_perm_palignr): Add SINGLE_INSN_ONLY_P argument.  If true,
>         fail unless in_order is true.  Add forward declaration.
>         (expand_vec_perm_vperm2f128): Fix up comment about which permutation
>         is useful for one_operand_p.
>         (ix86_expand_vec_perm_const_1): Adjust expand_vec_perm_palignr caller.
>
> Now bootstrapped/regtested on x86_64-linux and i686-linux (and additionally
> tested also with --target_board=unix/-mavx2), ok for trunk?

OK.

Thanks,
Uros.
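For reference, the permutation behind pr62128 / test_22 is a rotation of a
32-byte vector by one byte.  A minimal reproducer sketch (the type and
function names here are made up; it assumes a GCC built with this patch and
-mavx2, where it should compile to the vpermq/vpalignr pair shown above):

  /* Rotate the 32 bytes of X by one: element i takes byte (i + 1) % 32.  */
  typedef unsigned char V32 __attribute__ ((vector_size (32)));

  V32
  rotate_bytes (V32 x)
  {
    return __builtin_shuffle (x, (V32) { 1, 2, 3, 4, 5, 6, 7, 8,
                                         9, 10, 11, 12, 13, 14, 15, 16,
                                         17, 18, 19, 20, 21, 22, 23, 24,
                                         25, 26, 27, 28, 29, 30, 31, 0 });
  }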
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
Getting back to the initial patch, is it ok?
It fixes gcc.target/i386/pr52252-atom.c for the AVX2 make check.
x86 bootstrap is also ok.

2014-10-01  Evgeny Stupachenko  <evstu...@gmail.com>

        * config/i386/i386.c (expand_vec_perm_palignr): Extend to use AVX2
        PALIGNR instruction.
        * config/i386/i386.c (ix86_expand_vec_perm_const_1): Add palignr
        try for AVX2.

On Wed, Sep 17, 2014 at 9:26 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> The test in pr62128 is exactly TEST 22 from gcc.dg/torture/vshuf-v32qi.c.
> It will check if the pattern is correct or not.
> Resubmitting the patch looks good, as the current mail thread is already
> too complicated.
>
> On Wed, Sep 17, 2014 at 6:49 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
>> On Wed, Sep 17, 2014 at 6:01 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> It fixes gcc.target/i386/pr52252-atom.c in the core-avx2 make check,
>>> and pr62128.
>>
>> I suggest you resubmit the patch as a bug fix for pr62128 with testcases
>> from pr62128 as well as gcc.target/i386/pr52252-atom.c.
>>
>> --
>> H.J.

[attachment: palignr_avx2.patch]
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 12:16 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> Getting back to the initial patch, is it ok?

IMO, we should start with Jakub's proposed patch [1].

[1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00010.html

Uros.

> It fixes gcc.target/i386/pr52252-atom.c for the AVX2 make check.
> x86 bootstrap is also ok.
>
> 2014-10-01  Evgeny Stupachenko  <evstu...@gmail.com>
>
>         * config/i386/i386.c (expand_vec_perm_palignr): Extend to use AVX2
>         PALIGNR instruction.
>         * config/i386/i386.c (ix86_expand_vec_perm_const_1): Add palignr
>         try for AVX2.
>
> On Wed, Sep 17, 2014 at 9:26 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> The test in pr62128 is exactly TEST 22 from gcc.dg/torture/vshuf-v32qi.c.
>> It will check if the pattern is correct or not.
>> Resubmitting the patch looks good, as the current mail thread is already
>> too complicated.
>>
>> On Wed, Sep 17, 2014 at 6:49 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
>>> On Wed, Sep 17, 2014 at 6:01 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>>> It fixes gcc.target/i386/pr52252-atom.c in the core-avx2 make check,
>>>> and pr62128.
>>>
>>> I suggest you resubmit the patch as a bug fix for pr62128 with testcases
>>> from pr62128 as well as gcc.target/i386/pr52252-atom.c.
>>>
>>> --
>>> H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 12:28:51PM +0200, Uros Bizjak wrote:
> On Wed, Oct 1, 2014 at 12:16 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> Getting back to the initial patch, is it ok?
>
> IMO, we should start with Jakub's proposed patch [1]
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00010.html

That doesn't compile, will post a new version; I got interrupted when I
found that in

GCC_TEST_RUN_EXPENSIVE=1 make check-gcc RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'

one test is miscompiled even with an unpatched compiler; debugging that now.

That said, my patch will not do anything about the case Mark mentioned, the
{ 1, 2, 3, ..., 31, 0 } permutation.  For that we can't do vpalignr followed
by vpshufb or similar, but need to do some permutation first and then
vpalignr on the result.  So it would need a new routine.  It is still a
2-insn permutation, not 6, and needs a different algorithm, so sharing the
same routine for that is undesirable.

        Jakub
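The two-insn shape described above can be sketched with AVX2 intrinsics (not
part of any patch in this thread; rotate1 is a made-up name).  The lane swap
supplies each 128-bit lane of the per-lane vpalignr with the byte it has to
pull across the lane boundary:

  #include <immintrin.h>

  /* { 1, 2, ..., 31, 0 }: swap the 128-bit lanes of X, then shift each
     lane pair right by one byte.  Lane 0 pulls byte 16 from the swapped
     copy, lane 1 pulls byte 0.  */
  static __m256i
  rotate1 (__m256i x)
  {
    __m256i swapped = _mm256_permute4x64_epi64 (x, 0x4e); /* vpermq $78 */
    return _mm256_alignr_epi8 (swapped, x, 1);            /* vpalignr $1 */
  }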
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
This is not a fix for the case { 1, 2, 3, ..., 31, 0 }.  This patch is an
extension of expand_vec_perm_palignr to the AVX2 case.

For the case { 1, 2, 3, ..., 31, 0 } we should use a separate
function/pattern.  I like the split, as it is similar to the already handled
SSE byte rotate { 1, 2, 3, ..., 15, 0 } (ssse3_palignr<mode>_perm) and the
AVX2 split *avx_vperm_broadcast_<mode>.

On Wed, Oct 1, 2014 at 2:35 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 12:28:51PM +0200, Uros Bizjak wrote:
>> On Wed, Oct 1, 2014 at 12:16 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> Getting back to the initial patch, is it ok?
>>
>> IMO, we should start with Jakub's proposed patch [1]
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00010.html
>
> That doesn't compile, will post a new version; I got interrupted when I
> found that in GCC_TEST_RUN_EXPENSIVE=1 make check-gcc
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
> one test is miscompiled even with an unpatched compiler; debugging that now.
>
> That said, my patch will not do anything about the case Mark mentioned, the
> { 1, 2, 3, ..., 31, 0 } permutation.  For that we can't do vpalignr followed
> by vpshufb or similar, but need to do some permutation first and then
> vpalignr on the result.  So it would need a new routine.  It is still a
> 2-insn permutation, not 6, and needs a different algorithm, so sharing the
> same routine for that is undesirable.
>
>         Jakub
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 12:35:14PM +0200, Jakub Jelinek wrote:
> On Wed, Oct 01, 2014 at 12:28:51PM +0200, Uros Bizjak wrote:
>> On Wed, Oct 1, 2014 at 12:16 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> Getting back to the initial patch, is it ok?
>>
>> IMO, we should start with Jakub's proposed patch [1]
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00010.html
>
> That doesn't compile, will post a new version; I got interrupted when I
> found that in GCC_TEST_RUN_EXPENSIVE=1 make check-gcc
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
> one test is miscompiled even with an unpatched compiler; debugging that now.

Let's start with the bugfix.  The || doesn't make any sense, and we really
want to fill in 4 bits (0, 1, 4, 5) of the immediate, not just two, anyway.
valid_perm_using_mode_p (V2TImode, d) should already guarantee that it is
possible to permute it as V2TI, so all we care about are the values of
d->perm[0] and d->perm[nelt / 2], but we care not just which lane it is,
but also which operand (src1 or src2).

Tested with GCC_TEST_RUN_EXPENSIVE=1 make check-gcc
RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'

Ok for trunk/4.9/4.8?

2014-10-01  Jakub Jelinek  <ja...@redhat.com>

        PR target/63428
        * config/i386/i386.c (expand_vec_perm_pshufb): Fix up rperm[0]
        argument to avx2_permv2ti.

        * gcc.dg/torture/vshuf-4.inc: Move test 122 from EXPTESTS
        to test 24 in TESTS.

--- gcc/config/i386/i386.c.jj   2014-10-01 11:22:09.0 +0200
+++ gcc/config/i386/i386.c      2014-10-01 13:00:30.835809031 +0200
@@ -42961,8 +42961,8 @@ expand_vec_perm_pshufb (struct expand_ve
       op0 = gen_lowpart (V4DImode, d->op0);
       op1 = gen_lowpart (V4DImode, d->op1);
       rperm[0]
-       = GEN_INT (((d->perm[0] & (nelt / 2)) ? 1 : 0)
-                  || ((d->perm[nelt / 2] & (nelt / 2)) ? 2 : 0));
+       = GEN_INT ((d->perm[0] / (nelt / 2))
+                  | ((d->perm[nelt / 2] / (nelt / 2)) * 16));
       emit_insn (gen_avx2_permv2ti (target, op0, op1, rperm[0]));
       if (target != d->target)
        emit_move_insn (d->target, gen_lowpart (d->vmode, target));
--- gcc/testsuite/gcc.dg/torture/vshuf-4.inc.jj 2012-03-20 08:51:25.0 +0100
+++ gcc/testsuite/gcc.dg/torture/vshuf-4.inc    2014-10-01 13:23:07.163090945 +0200
@@ -23,7 +23,8 @@ T (19, 3, 2, 1, 0) \
 T (20, 0, 4, 1, 5) \
 T (21, 2, 6, 3, 7) \
 T (22, 1, 2, 3, 0) \
-T (23, 2, 1, 0, 3)
+T (23, 2, 1, 0, 3) \
+T (24, 2, 5, 6, 3)
 #define EXPTESTS \
 T (116, 1, 2, 4, 3) \
 T (117, 7, 3, 3, 0) \
@@ -31,7 +32,6 @@ T (118, 5, 3, 2, 7) \
 T (119, 0, 3, 5, 6) \
 T (120, 0, 0, 1, 5) \
 T (121, 4, 6, 2, 1) \
-T (122, 2, 5, 6, 3) \
 T (123, 4, 6, 3, 2) \
 T (124, 4, 7, 5, 6) \
 T (125, 0, 4, 2, 4) \

        Jakub
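Restated outside RTL (a sketch; permv2ti_imm is a made-up helper, not GCC
code): vperm2i128 picks one of the four 128-bit lanes of the op0:op1 pair
for each half of the result, with the selector for the low half in immediate
bits 0-1 and for the high half in bits 4-5, so each selector is simply the
element index divided by the per-lane element count:

  /* perm0 and perm_mid stand for d->perm[0] and d->perm[nelt / 2];
     nelt / 2 is the number of elements per 128-bit lane.  */
  static unsigned
  permv2ti_imm (unsigned perm0, unsigned perm_mid, unsigned nelt)
  {
    return (perm0 / (nelt / 2)) | ((perm_mid / (nelt / 2)) * 16);
  }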
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 1:38 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> That doesn't compile, will post a new version; I got interrupted when I
>> found that in GCC_TEST_RUN_EXPENSIVE=1 make check-gcc
>> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
>> one test is miscompiled even with an unpatched compiler; debugging that now.
>
> Let's start with the bugfix.  The || doesn't make any sense, and we really
> want to fill in 4 bits (0, 1, 4, 5) of the immediate, not just two, anyway.
> valid_perm_using_mode_p (V2TImode, d) should already guarantee that it is
> possible to permute it as V2TI, so all we care about are the values of
> d->perm[0] and d->perm[nelt / 2], but we care not just which lane it is,
> but also which operand (src1 or src2).
>
> Tested with GCC_TEST_RUN_EXPENSIVE=1 make check-gcc
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
>
> Ok for trunk/4.9/4.8?
>
> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>
>         PR target/63428
>         * config/i386/i386.c (expand_vec_perm_pshufb): Fix up rperm[0]
>         argument to avx2_permv2ti.
>
>         * gcc.dg/torture/vshuf-4.inc: Move test 122 from EXPTESTS
>         to test 24 in TESTS.

OK.

Thanks,
Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 01:45:54PM +0200, Uros Bizjak wrote:
> OK.

Thanks.  Second step is a tiny optimization, for the simplified 122 (now 24)
vshuf-v4di.c testcase:

typedef unsigned long long V __attribute__ ((vector_size (32)));

V a, b, c, d;

int
main ()
{
  int i;
  for (i = 0; i < 4; ++i)
    {
      a[i] = i + 2;
      b[i] = 4 + i + 2;
    }
  asm volatile ("" : : : "memory");
  c = __builtin_shuffle (a, b, (V) { 2, 5, 6, 3 });
  d = __builtin_shuffle ((V) { 2, 3, 4, 5 }, (V) { 6, 7, 8, 9 },
                         (V) { 2, 5, 6, 3 });
  if (__builtin_memcmp (&c, &d, sizeof (c)))
    __builtin_abort ();
  return 0;
}

this patch allows better code to be generated:

-       vmovdqa b(%rip), %ymm0
+       vpermq  $238, a(%rip), %ymm1
        movl    $32, %edx
-       movl    $d, %esi
-       vmovdqa a(%rip), %ymm1
+       vmovdqa b(%rip), %ymm0
+       movl    $d, %esi
        movl    $c, %edi
-       vperm2i128      $17, %ymm0, %ymm1, %ymm1
        vpblendd        $195, %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, c(%rip)

That is because vperm2i128 $17 unnecessarily uses two operands when all the
data it grabs are from a single one.  So, by canonicalizing the permutation
we can emit vpermq $238 instead.  Perhaps more places might benefit from
extra canonicalize_perm calls (two spots already use that beyond the single
one on the expansion/testing entry point).

Tested again with GCC_TEST_RUN_EXPENSIVE=1 make check-gcc \
RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
on x86_64-linux.  Ok for trunk?

2014-10-01  Jakub Jelinek  <ja...@redhat.com>

        * config/i386/i386.c (expand_vec_perm_vperm2f128): Canonicalize
        dfirst permutation.

--- gcc/config/i386/i386.c.jj   2014-10-01 13:00:30.0 +0200
+++ gcc/config/i386/i386.c      2014-10-01 13:59:40.061956852 +0200
@@ -43905,15 +43905,16 @@ expand_vec_perm_vperm2f128 (struct expan
            dfirst.perm[i] = (i & (nelt2 - 1))
                             + ((perm >> (2 * (i >= nelt2))) & 3) * nelt2;

+         canonicalize_perm (&dfirst);
          ok = expand_vec_perm_1 (&dfirst);
          gcc_assert (ok);

          /* And dsecond is some single insn shuffle, taking
             d->op0 and result of vperm2f128 (if perm < 16) or d->op1
             and result of vperm2f128 (otherwise).  */
-         dsecond.op1 = dfirst.target;
          if (perm >= 16)
-           dsecond.op0 = dfirst.op1;
+           dsecond.op0 = dsecond.op1;
+         dsecond.op1 = dfirst.target;

          ok = expand_vec_perm_1 (&dsecond);
          gcc_assert (ok);

        Jakub
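The equivalence the canonicalization exploits can be checked with intrinsics
(a sketch; both function names are made up).  When a two-operand vperm2i128
only ever reads lanes of one source, a one-operand vpermq does the same job,
which is what canonicalize_perm lets expand_vec_perm_1 discover:

  #include <immintrin.h>

  /* Both functions broadcast the high 128-bit lane of A.  */
  static __m256i
  dup_hi_2op (__m256i a, __m256i b)
  {
    return _mm256_permute2x128_si256 (a, b, 0x11); /* vperm2i128 $17; B unused */
  }

  static __m256i
  dup_hi_1op (__m256i a)
  {
    return _mm256_permute4x64_epi64 (a, 0xee);     /* vpermq $238 */
  }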
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 2:17 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 01:45:54PM +0200, Uros Bizjak wrote:
>> OK.
>
> Thanks.  Second step is a tiny optimization, for the simplified 122 (now 24)
> vshuf-v4di.c testcase:
>
> typedef unsigned long long V __attribute__ ((vector_size (32)));
>
> V a, b, c, d;
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 4; ++i)
>     {
>       a[i] = i + 2;
>       b[i] = 4 + i + 2;
>     }
>   asm volatile ("" : : : "memory");
>   c = __builtin_shuffle (a, b, (V) { 2, 5, 6, 3 });
>   d = __builtin_shuffle ((V) { 2, 3, 4, 5 }, (V) { 6, 7, 8, 9 },
>                          (V) { 2, 5, 6, 3 });
>   if (__builtin_memcmp (&c, &d, sizeof (c)))
>     __builtin_abort ();
>   return 0;
> }
>
> this patch allows better code to be generated:
>
> -       vmovdqa b(%rip), %ymm0
> +       vpermq  $238, a(%rip), %ymm1
>         movl    $32, %edx
> -       movl    $d, %esi
> -       vmovdqa a(%rip), %ymm1
> +       vmovdqa b(%rip), %ymm0
> +       movl    $d, %esi
>         movl    $c, %edi
> -       vperm2i128      $17, %ymm0, %ymm1, %ymm1
>         vpblendd        $195, %ymm1, %ymm0, %ymm0
>         vmovdqa %ymm0, c(%rip)
>
> That is because vperm2i128 $17 unnecessarily uses two operands when all the
> data it grabs are from a single one.  So, by canonicalizing the permutation
> we can emit vpermq $238 instead.  Perhaps more places might benefit from
> extra canonicalize_perm calls (two spots already use that beyond the single
> one on the expansion/testing entry point).
>
> Tested again with GCC_TEST_RUN_EXPENSIVE=1 make check-gcc \
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
> on x86_64-linux.  Ok for trunk?
>
> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>
>         * config/i386/i386.c (expand_vec_perm_vperm2f128): Canonicalize
>         dfirst permutation.

OK.

Thanks,
Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 02:25:01PM +0200, Uros Bizjak wrote:
> OK.

And now the expand_vec_perm_palignr improvement, tested with
GCC_TEST_RUN_EXPENSIVE=1 make check-gcc \
RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'

E.g.

typedef unsigned long long V __attribute__ ((vector_size (32)));

extern void abort (void);

V a, b, c, d;

void
test_14 (void)
{
  V mask = { 6, 1, 3, 4 };
  int i;
  c = __builtin_shuffle (a, mask);
  d = __builtin_shuffle (a, b, mask);
}

(distilled from test 15 in vshuf-v4di.c) results in:

-       vmovdqa a(%rip), %ymm0
-       vpermq  $54, %ymm0, %ymm1
-       vpshufb .LC1(%rip), %ymm0, %ymm0
-       vmovdqa %ymm1, c(%rip)
-       vmovdqa b(%rip), %ymm1
-       vpshufb .LC0(%rip), %ymm1, %ymm1
-       vpermq  $78, %ymm1, %ymm1
-       vpor    %ymm1, %ymm0, %ymm0
+       vmovdqa a(%rip), %ymm1
+       vpermq  $54, %ymm1, %ymm0
+       vmovdqa %ymm0, c(%rip)
+       vmovdqa b(%rip), %ymm0
+       vpalignr        $8, %ymm1, %ymm0, %ymm0
+       vpermq  $99, %ymm0, %ymm0
        vmovdqa %ymm0, d(%rip)
        vzeroupper
        ret

change (and two fewer .rodata constants).  Ok for trunk?

2014-10-01  Jakub Jelinek  <ja...@redhat.com>

        * config/i386/i386.c (expand_vec_perm_palignr): Handle
        256-bit vectors for TARGET_AVX2.

--- gcc/config/i386/i386.c.jj   2014-10-01 14:24:16.483138899 +0200
+++ gcc/config/i386/i386.c      2014-10-01 14:27:53.577222011 +0200
@@ -43297,44 +43297,75 @@ expand_vec_perm_palignr (struct expand_v
   rtx shift, target;
   struct expand_vec_perm_d dcopy;

-  /* Even with AVX, palignr only operates on 128-bit vectors.  */
-  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
+  /* Even with AVX, palignr only operates on 128-bit vectors,
+     in AVX2 palignr operates on both 128-bit lanes.  */
+  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
+      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))
     return false;

-  min = nelt, max = 0;
+  min = 2 * nelt, max = 0;
   for (i = 0; i < nelt; ++i)
     {
       unsigned e = d->perm[i];
+      if (GET_MODE_SIZE (d->vmode) == 32)
+       e = (e & ((nelt / 2) - 1)) | ((e & nelt) >> 1);
       if (e < min)
        min = e;
       if (e > max)
        max = e;
     }
-  if (min == 0 || max - min >= nelt)
+  if (min == 0
+      || max - min >= (GET_MODE_SIZE (d->vmode) == 32 ? nelt / 2 : nelt))
     return false;

   /* Given that we have SSSE3, we know we'll be able to implement the
-     single operand permutation after the palignr with pshufb.  */
-  if (d->testing_p)
+     single operand permutation after the palignr with pshufb for
+     128-bit vectors.  */
+  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16)
     return true;

   dcopy = *d;
-  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
-  target = gen_reg_rtx (TImode);
-  emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
-                                 gen_lowpart (TImode, d->op0), shift));
-
-  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
-  dcopy.one_operand_p = true;

   in_order = true;
   for (i = 0; i < nelt; ++i)
     {
-      unsigned e = dcopy.perm[i] - min;
+      unsigned e = dcopy.perm[i];
+      if (GET_MODE_SIZE (d->vmode) == 32
+         && e >= nelt
+         && (e & (nelt / 2 - 1)) < min)
+       e = e - min - (nelt / 2);
+      else
+       e = e - min;
       if (e != i)
        in_order = false;
       dcopy.perm[i] = e;
     }
+  dcopy.one_operand_p = true;
+
+  /* For AVX2, test whether we can permute the result in one instruction.  */
+  if (d->testing_p)
+    {
+      if (in_order)
+       return true;
+      dcopy.op1 = dcopy.op0;
+      return expand_vec_perm_1 (&dcopy);
+    }
+
+  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
+  if (GET_MODE_SIZE (d->vmode) == 16)
+    {
+      target = gen_reg_rtx (TImode);
+      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
+                                     gen_lowpart (TImode, d->op0), shift));
+    }
+  else
+    {
+      target = gen_reg_rtx (V2TImode);
+      emit_insn (gen_avx2_palignrv2ti (target,
+                                      gen_lowpart (V2TImode, d->op1),
+                                      gen_lowpart (V2TImode, d->op0),
+                                      shift));
+    }
+
+  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);

   /* Test for the degenerate case where the alignment by itself
      produces the desired permutation.  */
@@ -43345,7 +43376,7 @@ expand_vec_perm_palignr (struct expand_v
     }

   ok = expand_vec_perm_1 (&dcopy);
-  gcc_assert (ok);
+  gcc_assert (ok || GET_MODE_SIZE (d->vmode) == 32);

   return ok;
 }

        Jakub
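The subtle part of the 256-bit path is the index folding before the min/max
window test.  Pulled out as a plain function (a sketch; fold_lane_index is a
made-up name), it maps an element index in [0, 2 * nelt) to its offset within
a 128-bit lane plus an operand bit, since the AVX2 vpalignr shifts both lanes
by the same amount and cannot move data across them:

  /* Keep the in-lane offset (e & (nelt / 2 - 1)) and fold the operand bit
     (e & nelt) down by one position, so op1 indices start at nelt / 2.  */
  static unsigned
  fold_lane_index (unsigned e, unsigned nelt)
  {
    return (e & ((nelt / 2) - 1)) | ((e & nelt) >> 1);
  }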
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 2:56 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> And now the expand_vec_perm_palignr improvement, tested with
> GCC_TEST_RUN_EXPENSIVE=1 make check-gcc \
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
>
> E.g.
>
> typedef unsigned long long V __attribute__ ((vector_size (32)));
> [...]
>
> (distilled from test 15 in vshuf-v4di.c) results in:
> [...]
>
> change (and two fewer .rodata constants).  Ok for trunk?
>
> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>
>         * config/i386/i386.c (expand_vec_perm_palignr): Handle
>         256-bit vectors for TARGET_AVX2.

Please mention PR 62128 and include the testcase from the PR.
Also, please add a version of gcc.target/i386/pr52252-atom.c, compiled
with -mavx2 (perhaps named pr52252-avx2.c).

OK with a small adjustment below.

Thanks,
Uros.

> --- gcc/config/i386/i386.c.jj   2014-10-01 14:24:16.483138899 +0200
> +++ gcc/config/i386/i386.c      2014-10-01 14:27:53.577222011 +0200
> @@ -43297,44 +43297,75 @@ expand_vec_perm_palignr (struct expand_v
>    rtx shift, target;
>    struct expand_vec_perm_d dcopy;
>
> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +  /* Even with AVX, palignr only operates on 128-bit vectors,
> +     in AVX2 palignr operates on both 128-bit lanes.  */
> +  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))

Please simplify the above condition ...

>      return false;
>
> [... rest of the patch as posted ...]
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 03:09:59PM +0200, Uros Bizjak wrote:
>> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>>
>>         * config/i386/i386.c (expand_vec_perm_palignr): Handle
>>         256-bit vectors for TARGET_AVX2.
>
> Please mention PR 62128 and include the testcase from the PR.
> Also, please add a version of gcc.target/i386/pr52252-atom.c, compiled
> with -mavx2 (perhaps named pr52252-avx2.c)

This patch doesn't fix PR62128, and it is already tested (even without
GCC_RUN_EXPENSIVE_TESTS=1) in the vshuf*.c torture tests (several of them).
If you want coverage not just for the default flags, I'd say we should (say
for -O2 only) just add gcc.target/i386/{avx,avx2,avx512}-vshuf-*.c tests
that would #include ../../gcc.dg/torture/vshuf-*.c and be compiled/run with
-mavx/-mavx2/-mavx512 options instead of the default ones.

For PR62128, IMHO the right fix is attached.  Note, this is again covered
in the vshuf-*.c tests (test 22 in both vshuf-v32*.c and vshuf-v16*.c).
With that attached patch, pr62128.c (aka test_22 in vshuf-v32qi.c) changes:

-       vpshufb .LC0(%rip), %ymm0, %ymm1
-       vpshufb .LC1(%rip), %ymm0, %ymm0
-       vpermq  $78, %ymm1, %ymm1
-       vpor    %ymm1, %ymm0, %ymm0
+       vpermq  $78, %ymm0, %ymm1
+       vpalignr        $1, %ymm0, %ymm1, %ymm0
        ret

>> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
>> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
>> +  /* Even with AVX, palignr only operates on 128-bit vectors,
>> +     in AVX2 palignr operates on both 128-bit lanes.  */
>> +  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
>> +      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))
>
> Please simplify the above condition ...

How?  I don't see how that can be simplified, it can be transformed into

  if (!((TARGET_SSSE3 && GET_MODE_SIZE (d->vmode) == 16)
        || (TARGET_AVX2 && GET_MODE_SIZE (d->vmode) == 32)))

but I don't find that simpler.

        Jakub

2014-10-01  Jakub Jelinek  <ja...@redhat.com>

        PR target/62128
        * config/i386/i386.c (expand_vec_perm_1): Try expand_vec_perm_palignr
        if it expands to a single insn only.
        (expand_vec_perm_palignr): Add SINGLE_INSN_ONLY_P argument.  If true,
        fail unless in_order is true.  Add forward declaration.
        (expand_vec_perm_vperm2f128): Fix up comment about which permutation
        is useful for one_operand_p.
        (ix86_expand_vec_perm_const_1): Adjust expand_vec_perm_palignr caller.

--- gcc/config/i386/i386.c.jj   2014-10-01 15:42:28.0 +0200
+++ gcc/config/i386/i386.c      2014-10-01 15:50:06.234079887 +0200
@@ -39636,6 +39636,7 @@ struct expand_vec_perm_d
 static bool canonicalize_perm (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_1 (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_broadcast_1 (struct expand_vec_perm_d *d);
+static bool expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool);

 /* Get a vector mode of the same size as the original
    but with elements twice as wide.  This is only guaranteed
    to apply to integral vectors.  */
@@ -43225,6 +43226,10 @@ expand_vec_perm_1 (struct expand_vec_per
   if (expand_vec_perm_pshufb (d))
     return true;

+  /* Try the AVX2 vpalignr instruction.  */
+  if (expand_vec_perm_palignr (d, true))
+    return true;
+
   /* Try the AVX512F vpermi2 instructions.  */
   rtx vec[64];
   enum machine_mode mode = d->vmode;
@@ -43286,10 +43291,11 @@ expand_vec_perm_pshuflw_pshufhw (struct
    the permutation using the SSSE3 palignr instruction.  This succeeds
    when all of the elements in PERM fit within one vector and we merely
    need to shift them down so that a single vector permutation has a
-   chance to succeed.  */
+   chance to succeed.  If SINGLE_INSN_ONLY_P, succeed if only
+   the vpalignr instruction itself can perform the requested permutation.  */

 static bool
-expand_vec_perm_palignr (struct expand_vec_perm_d *d)
+expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
 {
   unsigned i, nelt = d->nelt;
   unsigned min, max;
@@ -43320,8 +43326,9 @@ expand_vec_perm_palignr (struct expand_v

   /* Given that we have SSSE3, we know we'll be able to implement the
      single operand permutation after the palignr with pshufb for
-     128-bit vectors.  */
-  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16)
+     128-bit vectors.  If SINGLE_INSN_ONLY_P, in_order has to be computed
+     first.  */
+  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16 && !single_insn_only_p)
     return true;

   dcopy = *d;
@@ -43342,6 +43349,9 @@ expand_vec_perm_palignr (struct expand_v
     }
   dcopy.one_operand_p = true;

+  if (single_insn_only_p && !in_order)
+    return false;
+
   /* For AVX2, test whether we can permute the result in one instruction.  */
   if (d->testing_p)
     {
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 4:12 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 03:09:59PM +0200, Uros Bizjak wrote:
>>> 2014-10-01  Jakub Jelinek  <ja...@redhat.com>
>>>
>>>         * config/i386/i386.c (expand_vec_perm_palignr): Handle
>>>         256-bit vectors for TARGET_AVX2.
>>
>> Please mention PR 62128 and include the testcase from the PR.
>> Also, please add a version of gcc.target/i386/pr52252-atom.c, compiled
>> with -mavx2 (perhaps named pr52252-avx2.c)
>
> This patch doesn't fix PR62128, and it is already tested (even without
> GCC_RUN_EXPENSIVE_TESTS=1) in the vshuf*.c torture tests (several of them).

Ah, OK then.

> If you want coverage not just for the default flags, I'd say we should (say
> for -O2 only) just add gcc.target/i386/{avx,avx2,avx512}-vshuf-*.c tests
> that would #include ../../gcc.dg/torture/vshuf-*.c and be compiled/run with
> -mavx/-mavx2/-mavx512 options instead of the default ones.

No, the above should be good for now.  The failure was triggered by the
target that defaults to -mavx2, and for avx512f we can run this suite in the
way you suggest.

> For PR62128, IMHO the right fix is attached.  Note, this is again covered
> in the vshuf-*.c tests (test 22 in both vshuf-v32*.c and vshuf-v16*.c).
> With that attached patch, pr62128.c (aka test_22 in vshuf-v32qi.c) changes:
>
> -       vpshufb .LC0(%rip), %ymm0, %ymm1
> -       vpshufb .LC1(%rip), %ymm0, %ymm0
> -       vpermq  $78, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vpermq  $78, %ymm0, %ymm1
> +       vpalignr        $1, %ymm0, %ymm1, %ymm0
>         ret
>
>>> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
>>> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
>>> +  /* Even with AVX, palignr only operates on 128-bit vectors,
>>> +     in AVX2 palignr operates on both 128-bit lanes.  */
>>> +  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
>>> +      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))
>>
>> Please simplify the above condition ...
>
> How?  I don't see how that can be simplified, it can be transformed into
>
>   if (!((TARGET_SSSE3 && GET_MODE_SIZE (d->vmode) == 16)
>         || (TARGET_AVX2 && GET_MODE_SIZE (d->vmode) == 32)))
>
> but I don't find that simpler.

I was thinking of the above, but you are correct, the change doesn't bring
us anything.  So, the patch is OK as it was.

Thanks,
Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 1, 2014 at 2:56 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 02:25:01PM +0200, Uros Bizjak wrote:
>> OK.
>
> And now the expand_vec_perm_palignr improvement, tested with
> GCC_TEST_RUN_EXPENSIVE=1 make check-gcc \
> RUNTESTFLAGS='--target_board=unix/-mavx2 dg-torture.exp=vshuf*.c'
>
> E.g.
>
> typedef unsigned long long V __attribute__ ((vector_size (32)));
>
> extern void abort (void);
>
> V a, b, c, d;
>
> void
> test_14 (void)
> {
>   V mask = { 6, 1, 3, 4 };
>   int i;
>   c = __builtin_shuffle (a, mask);
>   d = __builtin_shuffle (a, b, mask);
> }
>
> (distilled from test 15 in vshuf-v4di.c) results in:
>
> -       vmovdqa a(%rip), %ymm0
> -       vpermq  $54, %ymm0, %ymm1
> -       vpshufb .LC1(%rip), %ymm0, %ymm0
> -       vmovdqa %ymm1, c(%rip)
> -       vmovdqa b(%rip), %ymm1
> -       vpshufb .LC0(%rip), %ymm1, %ymm1
> -       vpermq  $78, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vmovdqa a(%rip), %ymm1
> +       vpermq  $54, %ymm1, %ymm0
> +       vmovdqa %ymm0, c(%rip)
> +       vmovdqa b(%rip), %ymm0
> +       vpalignr        $8, %ymm1, %ymm0, %ymm0
> +       vpermq  $99, %ymm0, %ymm0
>         vmovdqa %ymm0, d(%rip)
>         vzeroupper
>         ret
>
> change (and two fewer .rodata constants).

On a related note, I would like to point out that gcc.target/i386/pr61403.c
also fails to generate the blend insn with -mavx2.  The new insn sequence
includes lots of new vpshufb insns with memory access.

Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Oct 01, 2014 at 04:12:15PM +0200, Jakub Jelinek wrote:
> For PR62128, IMHO the right fix is attached.  Note, this is again covered
> in the vshuf-*.c tests (test 22 in both vshuf-v32*.c and vshuf-v16*.c).
> With that attached patch, pr62128.c (aka test_22 in vshuf-v32qi.c) changes:
>
> -       vpshufb .LC0(%rip), %ymm0, %ymm1
> -       vpshufb .LC1(%rip), %ymm0, %ymm0
> -       vpermq  $78, %ymm1, %ymm1
> -       vpor    %ymm1, %ymm0, %ymm0
> +       vpermq  $78, %ymm0, %ymm1
> +       vpalignr        $1, %ymm0, %ymm1, %ymm0
>         ret

2014-10-01  Jakub Jelinek  <ja...@redhat.com>

        PR target/62128
        * config/i386/i386.c (expand_vec_perm_1): Try expand_vec_perm_palignr
        if it expands to a single insn only.
        (expand_vec_perm_palignr): Add SINGLE_INSN_ONLY_P argument.  If true,
        fail unless in_order is true.  Add forward declaration.
        (expand_vec_perm_vperm2f128): Fix up comment about which permutation
        is useful for one_operand_p.
        (ix86_expand_vec_perm_const_1): Adjust expand_vec_perm_palignr caller.

Now bootstrapped/regtested on x86_64-linux and i686-linux (and additionally
tested also with --target_board=unix/-mavx2), ok for trunk?

--- gcc/config/i386/i386.c.jj   2014-10-01 15:42:28.0 +0200
+++ gcc/config/i386/i386.c      2014-10-01 15:50:06.234079887 +0200
@@ -39636,6 +39636,7 @@ struct expand_vec_perm_d
 static bool canonicalize_perm (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_1 (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_broadcast_1 (struct expand_vec_perm_d *d);
+static bool expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool);

 /* Get a vector mode of the same size as the original
    but with elements twice as wide.  This is only guaranteed
    to apply to integral vectors.  */
@@ -43225,6 +43226,10 @@ expand_vec_perm_1 (struct expand_vec_per
   if (expand_vec_perm_pshufb (d))
     return true;

+  /* Try the AVX2 vpalignr instruction.  */
+  if (expand_vec_perm_palignr (d, true))
+    return true;
+
   /* Try the AVX512F vpermi2 instructions.  */
   rtx vec[64];
   enum machine_mode mode = d->vmode;
@@ -43286,10 +43291,11 @@ expand_vec_perm_pshuflw_pshufhw (struct
    the permutation using the SSSE3 palignr instruction.  This succeeds
    when all of the elements in PERM fit within one vector and we merely
    need to shift them down so that a single vector permutation has a
-   chance to succeed.  */
+   chance to succeed.  If SINGLE_INSN_ONLY_P, succeed if only
+   the vpalignr instruction itself can perform the requested permutation.  */

 static bool
-expand_vec_perm_palignr (struct expand_vec_perm_d *d)
+expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
 {
   unsigned i, nelt = d->nelt;
   unsigned min, max;
@@ -43320,8 +43326,9 @@ expand_vec_perm_palignr (struct expand_v

   /* Given that we have SSSE3, we know we'll be able to implement the
      single operand permutation after the palignr with pshufb for
-     128-bit vectors.  */
-  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16)
+     128-bit vectors.  If SINGLE_INSN_ONLY_P, in_order has to be computed
+     first.  */
+  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16 && !single_insn_only_p)
     return true;

   dcopy = *d;
@@ -43342,6 +43349,9 @@ expand_vec_perm_palignr (struct expand_v
     }
   dcopy.one_operand_p = true;

+  if (single_insn_only_p && !in_order)
+    return false;
+
   /* For AVX2, test whether we can permute the result in one instruction.  */
   if (d->testing_p)
     {
@@ -43922,7 +43932,8 @@ expand_vec_perm_vperm2f128 (struct expan
          return true;
        }

-      /* For one operand, the only useful vperm2f128 permutation is 0x10.  */
+      /* For one operand, the only useful vperm2f128 permutation is 0x01
+        aka lanes swap.  */
       if (d->one_operand_p)
        return false;
     }
@@ -44811,7 +44822,7 @@ ix86_expand_vec_perm_const_1 (struct exp
       if (expand_vec_perm_pshuflw_pshufhw (d))
        return true;

-      if (expand_vec_perm_palignr (d))
+      if (expand_vec_perm_palignr (d, false))
        return true;

       if (expand_vec_perm_interleave2 (d))

        Jakub
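The fixed comment is easy to verify with intrinsics (a sketch; swap_lanes is
a made-up name): selector 0x01 puts the high lane of the source into the low
half of the result and the low lane into the high half -- a lane swap --
whereas the old 0x10 value merely reproduces a single operand unchanged:

  #include <immintrin.h>

  /* vperm2i128 $0x01 with both sources equal swaps the 128-bit lanes.  */
  static __m256i
  swap_lanes (__m256i a)
  {
    return _mm256_permute2x128_si256 (a, a, 0x01);
  }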
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
It fixes gcc.target/i386/pr52252-atom.c in the core-avx2 make check, and
pr62128.

On Tue, Sep 16, 2014 at 6:51 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Tue, Sep 16, 2014 at 6:15 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> PING 2
>>
>> On Mon, Sep 8, 2014 at 2:03 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> PING
>>>
>>> On Wed, Aug 27, 2014 at 7:50 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>>> The rotate insn appeared right after expand.  I've done it similar to
>>>> the define_insn_and_split *avx_vperm_broadcast_<mode>.  I don't see
>>>> any potential losses on splitting that after reload.
>>>>
>>>> On Tue, Aug 26, 2014 at 8:29 PM, Richard Henderson <r...@redhat.com> wrote:
>>>>> On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
>>>>>> +(define_insn_and_split "avx2_rotate<mode>_perm"
>>>>>> [...]
>>>>>
>>>>> Why are you waiting until after reload to expand this?
>>>>>
>>>>> It's only the vec_select parallel that determines which direction the
>>>>> palignr should be done.  This seems like something you could do during
>>>>> permutation expansion.
>
> Assuming your change is triggered today without any additional changes,
> you should include some testcases.  For now, it doesn't show if it does
> anything useful.
>
> --
> H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Wed, Sep 17, 2014 at 6:01 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> It fixes gcc.target/i386/pr52252-atom.c in the core-avx2 make check, and
> pr62128.

I suggest you resubmit the patch as a bug fix for pr62128 with testcases
from pr62128 as well as gcc.target/i386/pr52252-atom.c.

--
H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
The test in pr62128 is exactly TEST 22 from gcc.dg/torture/vshuf-v32qi.c.
It will check if the pattern is correct or not.
Resubmitting the patch looks good, as the current mail thread is already
too complicated.

On Wed, Sep 17, 2014 at 6:49 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Wed, Sep 17, 2014 at 6:01 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> It fixes gcc.target/i386/pr52252-atom.c in the core-avx2 make check, and
>> pr62128.
>
> I suggest you resubmit the patch as a bug fix for pr62128 with testcases
> from pr62128 as well as gcc.target/i386/pr52252-atom.c.
>
> --
> H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
PING 2

On Mon, Sep 8, 2014 at 2:03 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> PING
>
> On Wed, Aug 27, 2014 at 7:50 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> The rotate insn appeared right after expand.  I've done it similar to
>> the define_insn_and_split *avx_vperm_broadcast_<mode>.  I don't see
>> any potential losses on splitting that after reload.
>>
>> On Tue, Aug 26, 2014 at 8:29 PM, Richard Henderson <r...@redhat.com> wrote:
>>> On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
>>>> +(define_insn_and_split "avx2_rotate<mode>_perm"
>>>> +  [(set (match_operand:V_256 0 "register_operand" "=x")
>>>> +       (vec_select:V_256
>>>> +         (match_operand:V_256 1 "register_operand" "x")
>>>> +         (match_parallel 2 "palignr_operand"
>>>> +           [(match_operand 3 "const_int_operand" "n")])))]
>>>> +  "TARGET_AVX2"
>>>> +  "#"
>>>> +  "&& reload_completed"
>>>> +  [(const_int 0)]
>>>
>>> Why are you waiting until after reload to expand this?
>>>
>>> It's only the vec_select parallel that determines which direction the
>>> palignr should be done.  This seems like something you could do during
>>> permutation expansion.
>>>
>>> r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Tue, Sep 16, 2014 at 6:15 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> PING 2
>
> On Mon, Sep 8, 2014 at 2:03 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> PING
>>
>> On Wed, Aug 27, 2014 at 7:50 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> The rotate insn appeared right after expand.  I've done it similar to
>>> the define_insn_and_split *avx_vperm_broadcast_<mode>.  I don't see
>>> any potential losses on splitting that after reload.
>>>
>>> On Tue, Aug 26, 2014 at 8:29 PM, Richard Henderson <r...@redhat.com> wrote:
>>>> On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
>>>>> +(define_insn_and_split "avx2_rotate<mode>_perm"
>>>>> [...]
>>>>
>>>> Why are you waiting until after reload to expand this?
>>>>
>>>> It's only the vec_select parallel that determines which direction the
>>>> palignr should be done.  This seems like something you could do during
>>>> permutation expansion.

Assuming your change is triggered today without any additional changes,
you should include some testcases.  For now, it doesn't show if it does
anything useful.

--
H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
PING

On Wed, Aug 27, 2014 at 7:50 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> The rotate insn appeared right after expand.  I've done it similar to
> the define_insn_and_split *avx_vperm_broadcast_<mode>.  I don't see
> any potential losses on splitting that after reload.
>
> On Tue, Aug 26, 2014 at 8:29 PM, Richard Henderson <r...@redhat.com> wrote:
>> On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
>>> +(define_insn_and_split "avx2_rotate<mode>_perm"
>>> +  [(set (match_operand:V_256 0 "register_operand" "=x")
>>> +       (vec_select:V_256
>>> +         (match_operand:V_256 1 "register_operand" "x")
>>> +         (match_parallel 2 "palignr_operand"
>>> +           [(match_operand 3 "const_int_operand" "n")])))]
>>> +  "TARGET_AVX2"
>>> +  "#"
>>> +  "&& reload_completed"
>>> +  [(const_int 0)]
>>
>> Why are you waiting until after reload to expand this?
>>
>> It's only the vec_select parallel that determines which direction the
>> palignr should be done.  This seems like something you could do during
>> permutation expansion.
>>
>> r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
The rotate insn appeared right after expand.  I've done it similar to the
define_insn_and_split *avx_vperm_broadcast_<mode>.  I don't see any
potential losses on splitting that after reload.

On Tue, Aug 26, 2014 at 8:29 PM, Richard Henderson <r...@redhat.com> wrote:
> On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
>> +(define_insn_and_split "avx2_rotate<mode>_perm"
>> +  [(set (match_operand:V_256 0 "register_operand" "=x")
>> +       (vec_select:V_256
>> +         (match_operand:V_256 1 "register_operand" "x")
>> +         (match_parallel 2 "palignr_operand"
>> +           [(match_operand 3 "const_int_operand" "n")])))]
>> +  "TARGET_AVX2"
>> +  "#"
>> +  "&& reload_completed"
>> +  [(const_int 0)]
>
> Why are you waiting until after reload to expand this?
>
> It's only the vec_select parallel that determines which direction the
> palignr should be done.  This seems like something you could do during
> permutation expansion.
>
> r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
That is covered by a separate part of the patch (make check and bootstrap
passed: 2 new passes for core-avx2).  Is it ok?

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index d6155cf..68ee65a 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -81,6 +81,7 @@
   ;; For AVX2 support
   UNSPEC_VPERMVAR
   UNSPEC_VPERMTI
+  UNSPEC_VPALIGNRDI
   UNSPEC_GATHER
   UNSPEC_VSIBADDR
@@ -14167,6 +14168,19 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "OI")])

+(define_insn "avx2_palignrv4di"
+  [(set (match_operand:V4DI 0 "register_operand" "=x")
+       (unspec:V4DI
+         [(match_operand:V4DI 1 "register_operand" "x")
+          (match_operand:V4DI 2 "nonimmediate_operand" "xm")
+          (match_operand:SI 3 "const_0_to_255_operand" "n")]
+         UNSPEC_VPALIGNRDI))]
+  "TARGET_AVX2"
+  "vpalignr\t{%3, %2, %1, %0|%0, %1, %2, %3}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "OI")])
+
 (define_insn "avx2_vec_dupv4df"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
        (vec_duplicate:V4DF
@@ -14658,6 +14672,49 @@
    (set_attr "length_immediate" "1")
    (set_attr "prefix" "orig,vex")])

+(define_insn_and_split "avx2_rotate<mode>_perm"
+  [(set (match_operand:V_256 0 "register_operand" "=x")
+       (vec_select:V_256
+         (match_operand:V_256 1 "register_operand" "x")
+         (match_parallel 2 "palignr_operand"
+           [(match_operand 3 "const_int_operand" "n")])))]
+  "TARGET_AVX2"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  enum machine_mode imode = GET_MODE_INNER (<MODE>mode);
+  int shift = INTVAL (operands[3]) * GET_MODE_SIZE (imode);
+  rtx op0 = gen_rtx_REG (V4DImode, REGNO (operands[0]));
+  rtx op1 = gen_rtx_REG (V4DImode, REGNO (operands[1]));
+
+  if (shift == 0)
+    emit_move_insn (operands[0], operands[1]);
+  else
+    {
+      emit_insn (gen_avx2_permv2ti (op0,
+                                   op1,
+                                   op1,
+                                   GEN_INT (33)));
+      if (shift < 16)
+       emit_insn (gen_avx2_palignrv4di (op0,
+                                        op0,
+                                        op1,
+                                        GEN_INT (shift)));
+      else if (shift > 16)
+       emit_insn (gen_avx2_palignrv4di (op0,
+                                        op1,
+                                        op0,
+                                        GEN_INT (shift - 16)));
+    }
+  DONE;
+}
+  [(set_attr "type" "sseishft")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "vex")])
+
 (define_expand "avx_vinsertf128<mode>"
   [(match_operand:V_256 0 "register_operand")
    (match_operand:V_256 1 "register_operand")

The test case covering this is gcc.target/i386/pr52252-atom.c.  It will pass
for -march=core-avx2 when the patch is committed.

On Thu, Aug 14, 2014 at 6:55 PM, H.J. Lu <hjl.to...@gmail.com> wrote:
> On Thu, Aug 14, 2014 at 1:08 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> Ping.
>>
>> On Thu, Jul 10, 2014 at 7:29 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>>> On Mon, Jul 7, 2014 at 6:40 PM, Richard Henderson <r...@redhat.com> wrote:
>>>> insn_num might as well be bool avx2, since it's only ever set to two
>>>> values.
>>> [...]
>>>> ... down here you'll simply ICE from the gcc_assert.
>
> Can you modify your patch to fix
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62128
> with a testcase?
>
> --
> H.J.
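The split's strategy can be sketched with intrinsics for two concrete shifts
(not part of the patch; the function names are made up, and the palignr
immediate must be a compile-time constant in real code): one vperm2i128 lane
swap, then one per-lane vpalignr whose operand order depends on which side of
the 16-byte boundary the shift falls:

  #include <immintrin.h>

  /* Rotate X by 3 bytes (shift < 16): each lane tops up from the swapped
     copy.  Result element i is source byte (i + 3) % 32.  */
  static __m256i
  rotate_by_3 (__m256i x)
  {
    __m256i tmp = _mm256_permute2x128_si256 (x, x, 0x01); /* lane swap */
    return _mm256_alignr_epi8 (tmp, x, 3);
  }

  /* Rotate X by 19 bytes (shift > 16): the operands trade places and the
     immediate drops by 16, matching GEN_INT (shift - 16) in the split.  */
  static __m256i
  rotate_by_19 (__m256i x)
  {
    __m256i tmp = _mm256_permute2x128_si256 (x, x, 0x01);
    return _mm256_alignr_epi8 (x, tmp, 3);
  }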
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On 08/26/2014 05:59 AM, Evgeny Stupachenko wrote:
> +(define_insn_and_split "avx2_rotate<mode>_perm"
> +  [(set (match_operand:V_256 0 "register_operand" "=x")
> +       (vec_select:V_256
> +         (match_operand:V_256 1 "register_operand" "x")
> +         (match_parallel 2 "palignr_operand"
> +           [(match_operand 3 "const_int_operand" "n")])))]
> +  "TARGET_AVX2"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]

Why are you waiting until after reload to expand this?

It's only the vec_select parallel that determines which direction the
palignr should be done.  This seems like something you could do during
permutation expansion.

r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
Ping.

On Thu, Jul 10, 2014 at 7:29 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> On Mon, Jul 7, 2014 at 6:40 PM, Richard Henderson <r...@redhat.com> wrote:
>> On 07/03/2014 02:53 AM, Evgeny Stupachenko wrote:
>>> -expand_vec_perm_palignr (struct expand_vec_perm_d *d)
>>> +expand_vec_perm_palignr (struct expand_vec_perm_d *d, int insn_num)
>>
>> insn_num might as well be bool avx2, since it's only ever set to two values.
>
> Agree.  However:
>   - after the alignment, the one-operand permutation could be just a move
>     and take 2 instructions for AVX2 as well;
>   - for AVX2 there could be other cases when the scheme takes 4 or 5
>     instructions;
>   - we can leave it for a potential avx512 extension.
>
> [...]
>
>> And you'd better check it up here because...
>
> Correct.  The following should resolve the issue:
>
>   /* For AVX2 we need more than 2 instructions when the alignment
>      by itself does not produce the desired permutation.  */
>   if (TARGET_AVX2 && insn_num <= 2)
>     return false;
>
> [...]
>
>> ... down here you'll simply ICE from the gcc_assert.
>>
>> r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Thu, Aug 14, 2014 at 1:08 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> Ping.
>
> On Thu, Jul 10, 2014 at 7:29 PM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> On Mon, Jul 7, 2014 at 6:40 PM, Richard Henderson <r...@redhat.com> wrote:
>>> insn_num might as well be bool avx2, since it's only ever set to two
>>> values.
>>
>> Agree.  [...]
>>
>>> And you'd better check it up here because...
>>
>> Correct.  The following should resolve the issue: [...]
>>
>>> ... down here you'll simply ICE from the gcc_assert.

Can you modify your patch to fix
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62128
with a testcase?

--
H.J.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Mon, Jul 7, 2014 at 6:40 PM, Richard Henderson <r...@redhat.com> wrote:
> On 07/03/2014 02:53 AM, Evgeny Stupachenko wrote:
>> -expand_vec_perm_palignr (struct expand_vec_perm_d *d)
>> +expand_vec_perm_palignr (struct expand_vec_perm_d *d, int insn_num)
>
> insn_num might as well be bool avx2, since it's only ever set to two values.

Agree.  However:
  - after the alignment, the one-operand permutation could be just a move
    and take 2 instructions for AVX2 as well;
  - for AVX2 there could be other cases when the scheme takes 4 or 5
    instructions;
  - we can leave it for a potential avx512 extension.

>> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
>> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
>> +  /* SSSE3 is required to apply PALIGNR on 16 byte operands.  */
>> +  if (GET_MODE_SIZE (d->vmode) == 16)
>> +    {
>> +      if (!TARGET_SSSE3)
>> +       return false;
>> +    }
>> +  /* AVX2 is required to apply PALIGNR on 32 byte operands.  */
>> +  else if (GET_MODE_SIZE (d->vmode) == 32)
>> +    {
>> +      if (!TARGET_AVX2)
>> +       return false;
>> +    }
>> +  /* Other sizes are not supported.  */
>> +  else
>>      return false;
>
> And you'd better check it up here because...

Correct.  The following should resolve the issue:

  /* For AVX2 we need more than 2 instructions when the alignment
     by itself does not produce the desired permutation.  */
  if (TARGET_AVX2 && insn_num <= 2)
    return false;

>> +  /* For SSSE3 we need 1 instruction for palignr plus 1 for one
>> +     operand permutation.  */
>> +  if (insn_num == 2)
>> +    {
>> +      ok = expand_vec_perm_1 (&dcopy);
>> +      gcc_assert (ok);
>> +    }
>> +  /* For AVX2 we need 2 instructions for the shift: vpalignr and
>> +     vperm plus 4 instructions for one operand permutation.  */
>> +  else if (insn_num == 6)
>> +    {
>> +      ok = expand_vec_perm_vpshufb2_vpermq (&dcopy);
>> +      gcc_assert (ok);
>> +    }
>> +  else
>> +    ok = false;
>>
>>    return ok;
>
> ... down here you'll simply ICE from the gcc_assert.
>
> r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On 07/03/2014 02:53 AM, Evgeny Stupachenko wrote:
> -expand_vec_perm_palignr (struct expand_vec_perm_d *d)
> +expand_vec_perm_palignr (struct expand_vec_perm_d *d, int insn_num)

insn_num might as well be bool avx2, since it's only ever set to two values.

> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +  /* SSSE3 is required to apply PALIGNR on 16 byte operands.  */
> +  if (GET_MODE_SIZE (d->vmode) == 16)
> +    {
> +      if (!TARGET_SSSE3)
> +       return false;
> +    }
> +  /* AVX2 is required to apply PALIGNR on 32 byte operands.  */
> +  else if (GET_MODE_SIZE (d->vmode) == 32)
> +    {
> +      if (!TARGET_AVX2)
> +       return false;
> +    }
> +  /* Other sizes are not supported.  */
> +  else
>      return false;

And you'd better check it up here because...

> +  /* For SSSE3 we need 1 instruction for palignr plus 1 for one
> +     operand permutation.  */
> +  if (insn_num == 2)
> +    {
> +      ok = expand_vec_perm_1 (&dcopy);
> +      gcc_assert (ok);
> +    }
> +  /* For AVX2 we need 2 instructions for the shift: vpalignr and
> +     vperm plus 4 instructions for one operand permutation.  */
> +  else if (insn_num == 6)
> +    {
> +      ok = expand_vec_perm_vpshufb2_vpermq (&dcopy);
> +      gcc_assert (ok);
> +    }
> +  else
> +    ok = false;
>
>    return ok;

... down here you'll simply ICE from the gcc_assert.

r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Thu, Jul 3, 2014 at 11:53 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
> The expand_vec_perm_palignr is similar for the SSSE3 and AVX2 cases, but
> AVX2 requires more instructions to complete the scheme.  The patch below
> adds AVX2 support for six instructions, leaving SSSE3 with two.
>
> Is it ok?

The ChangeLog entry is missing.

Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
2014-07-04  Evgeny Stupachenko  <evstu...@gmail.com>

        * config/i386/i386.c (expand_vec_perm_palignr): Extend to use AVX2
        PALIGNR instruction.
        * config/i386/i386.c (ix86_expand_vec_perm_const_1): Add palignr
        try for AVX2.

On Fri, Jul 4, 2014 at 3:22 PM, Uros Bizjak <ubiz...@gmail.com> wrote:
> On Thu, Jul 3, 2014 at 11:53 AM, Evgeny Stupachenko <evstu...@gmail.com> wrote:
>> The expand_vec_perm_palignr is similar for the SSSE3 and AVX2 cases, but
>> AVX2 requires more instructions to complete the scheme.  The patch below
>> adds AVX2 support for six instructions, leaving SSSE3 with two.
>>
>> Is it ok?
>
> The ChangeLog entry is missing.
>
> Uros.
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
The expand_vec_perm_palignr is similar for SSSE3 and AVX2 cases, but AVX2 requires more instructions to complete the scheme. The patch below adds AVX2 support for six instructions, leaving SSSE3 for two. Is it ok? diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 2cffcef..70fc832 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -43130,23 +43130,38 @@ expand_vec_perm_pshuflw_pshufhw (struct expand_vec_perm_d *d) return true; } +static bool +expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d); + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to simplify - the permutation using the SSSE3 palignr instruction. This succeeds + the permutation using the SSSE3/AVX2 palignr instruction. This succeeds when all of the elements in PERM fit within one vector and we merely need to shift them down so that a single vector permutation has a chance to succeed. */ static bool -expand_vec_perm_palignr (struct expand_vec_perm_d *d) +expand_vec_perm_palignr (struct expand_vec_perm_d *d, int insn_num) { unsigned i, nelt = d-nelt; unsigned min, max; bool in_order, ok; - rtx shift, target; + rtx shift, shift1, target, tmp; struct expand_vec_perm_d dcopy; - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d-vmode) != 16) + /* SSSE3 is required to apply PALIGNR on 16 bytes operands. */ + if (GET_MODE_SIZE (d-vmode) == 16) +{ + if (!TARGET_SSSE3) + return false; +} + /* AVX2 is required to apply PALIGNR on 32 bytes operands. */ + else if (GET_MODE_SIZE (d-vmode) == 32) +{ + if (!TARGET_AVX2) + return false; +} + /* Other sizes are not supported. */ + else return false; min = nelt, max = 0; @@ -43168,9 +43183,34 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) dcopy = *d; shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d-vmode))); - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d-op1), - gen_lowpart (TImode, d-op0), shift)); + shift1 = GEN_INT ((min - nelt / 2) + * GET_MODE_BITSIZE (GET_MODE_INNER (d-vmode))); + + if (GET_MODE_SIZE (d-vmode) != 32) +{ + target = gen_reg_rtx (TImode); + emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d-op1), + gen_lowpart (TImode, d-op0), shift)); +} + else +{ + target = gen_reg_rtx (V2TImode); + tmp = gen_reg_rtx (V4DImode); + emit_insn (gen_avx2_permv2ti (tmp, + gen_lowpart (V4DImode, d-op0), + gen_lowpart (V4DImode, d-op1), + GEN_INT (33))); + if (min nelt / 2) +emit_insn (gen_avx2_palignrv2ti (target, +gen_lowpart (V2TImode, tmp), +gen_lowpart (V2TImode, d-op0), +shift)); + else + emit_insn (gen_avx2_palignrv2ti (target, +gen_lowpart (V2TImode, d-op1), +gen_lowpart (V2TImode, tmp), +shift1)); +} dcopy.op0 = dcopy.op1 = gen_lowpart (d-vmode, target); dcopy.one_operand_p = true; @@ -43192,9 +43232,22 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) return true; } - ok = expand_vec_perm_1 (dcopy); - gcc_assert (ok); - + /* For SSSE3 we need 1 instruction for palignr plus 1 for one + operand permutaoin. */ + if (insn_num == 2) +{ + ok = expand_vec_perm_1 (dcopy); + gcc_assert (ok); +} + /* For AVX2 we need 2 instructions for the shift: vpalignr and + vperm plus 4 instructions for one operand permutation. 
*/ + else if (insn_num == 6) + { + ok = expand_vec_perm_vpshufb2_vpermq (dcopy); + gcc_assert (ok); + } + else + ok = false; return ok; } @@ -44627,7 +44680,7 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshuflw_pshufhw (d)) return true; - if (expand_vec_perm_palignr (d)) + if (expand_vec_perm_palignr (d, 2)) return true; if (expand_vec_perm_interleave2 (d)) @@ -44680,6 +44733,10 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_even_odd (d)) return true; + /* Try sequences of six instructions. */ + if (expand_vec_perm_palignr (d, 6)) + return true; + /* Even longer sequences. */ if (expand_vec_perm_vpshufb4_vpermq2 (d)) return true; On Mon, May 19, 2014 at 7:32 PM, Richard Henderson r...@redhat.com wrote: On 05/05/2014 09:49 AM, Evgeny Stupachenko wrote: @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true;
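To make the min < nelt / 2 arm above concrete, here is a hedged intrinsics sketch for V32QI with min fixed at 3 (the helper name and the constant are illustrative; the real code derives the shift from the permutation):

#include <immintrin.h>

/* Shift the 64-byte concatenation op1:op0 down by 3 bytes, mirroring
   the gen_avx2_permv2ti + gen_avx2_palignrv2ti pair in the patch.  */
static __m256i
shift_concat_down_3 (__m256i op0, __m256i op1)
{
  /* tmp = { op0.hi, op1.lo }, i.e. bytes 16..47 of op1:op0; the 0x21
     selector corresponds to GEN_INT (33) in the patch.  */
  __m256i tmp = _mm256_permute2x128_si256 (op0, op1, 0x21);
  /* Per-lane alignr: low lane yields bytes 3..18, high lane bytes
     19..34, so the result is elements 3..34 in order.  */
  return _mm256_alignr_epi8 (tmp, op0, 3);
}

After this shift, dcopy's permutation only needs elements from a single vector, which is what expand_vec_perm_1 or expand_vec_perm_vpshufb2_vpermq then finishes within the stated instruction budget.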
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
Ping. On Mon, May 5, 2014 at 8:49 PM, Evgeny Stupachenko evstu...@gmail.com wrote: Is the following patch ok? It passes bootstrap and make check. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 88142a8..91f6f21 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -42807,6 +42807,8 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d) return true; } +static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d); + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to instantiate D in a single instruction. */ @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true; + /* Try the AVX512F vpermi2 instructions. */ rtx vec[64]; enum machine_mode mode = d->vmode; @@ -43004,7 +43010,7 @@ expand_vec_perm_pshuflw_pshufhw (struct expand_vec_perm_d *d) } /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to simplify - the permutation using the SSSE3 palignr instruction. This succeeds + the permutation using the SSSE3/AVX2 palignr instruction. This succeeds when all of the elements in PERM fit within one vector and we merely need to shift them down so that a single vector permutation has a chance to succeed. */ @@ -43015,14 +43021,26 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) unsigned i, nelt = d->nelt; unsigned min, max; bool in_order, ok; - rtx shift, target; + rtx shift, shift1, target, tmp; struct expand_vec_perm_d dcopy; - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* SSSE3 is required to apply PALIGNR on 16-byte operands. */ + if (GET_MODE_SIZE (d->vmode) == 16) + { + if (!TARGET_SSSE3) + return false; + } + /* AVX2 is required to apply PALIGNR on 32-byte operands. */ + else if (GET_MODE_SIZE (d->vmode) == 32) + { + if (!TARGET_AVX2) + return false; + } + /* Other sizes are not supported. */ + else return false; - min = nelt, max = 0; + min = 2 * nelt, max = 0; for (i = 0; i < nelt; ++i) { unsigned e = d->perm[i]; @@ -43041,9 +43059,35 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) dcopy = *d; shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), - gen_lowpart (TImode, d->op0), shift)); + shift1 = GEN_INT ((min - nelt / 2) * + GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); + + if (GET_MODE_SIZE (d->vmode) != 32) + { + target = gen_reg_rtx (TImode); + emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), + gen_lowpart (TImode, d->op0), shift)); + } + else + { + target = gen_reg_rtx (V2TImode); + tmp = gen_reg_rtx (V4DImode); + emit_insn (gen_avx2_permv2ti (tmp, + gen_lowpart (V4DImode, d->op0), + gen_lowpart (V4DImode, d->op1), + GEN_INT (33))); + if (min < nelt / 2) + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, tmp), + gen_lowpart (V2TImode, d->op0), + shift)); + else + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, d->op1), + gen_lowpart (V2TImode, tmp), + shift1)); + } + dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target); dcopy.one_operand_p = true; On Tue, Apr 29, 2014 at 1:03 AM, Richard Henderson r...@redhat.com wrote: On 04/28/2014 01:43 PM, Evgeny Stupachenko wrote: Agree on checks: /* PALIGNR of 2 128-bit registers takes only 1 instruction. Requires SSSE3.
*/ if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. */ else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; Thanks, much better. r~
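expand_vec_perm_vpshufb2_vpermq, referenced throughout this thread, is the four-instruction one-operand finisher (two vpshufb, one vpermq, one vpor). A hedged sketch of that scheme follows; the selector constants are assumed to be precomputed from the permutation (slots with the high bit set make vpshufb write zero), and the helper name is invented for illustration:

#include <immintrin.h>

/* One-operand V32QI byte permute: SEL_SAME picks bytes that already sit
   in the right 128-bit lane, SEL_OTHER picks bytes that must cross
   lanes; vpermq with 0x4e swaps the halves of the cross-lane partial
   result, and vpor merges the two halves.  Not the GCC routine itself.  */
static __m256i
permute_v32qi (__m256i x, __m256i sel_same, __m256i sel_other)
{
  __m256i same = _mm256_shuffle_epi8 (x, sel_same);
  __m256i other = _mm256_shuffle_epi8 (x, sel_other);
  other = _mm256_permute4x64_epi64 (other, 0x4e);
  return _mm256_or_si256 (same, other);
}

These four instructions on top of the two-instruction PERM + PALIGNR shift are where the six-instruction budget in the later revision of this patch comes from.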
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On 05/05/2014 09:49 AM, Evgeny Stupachenko wrote: @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true; Why is this here? It doesn't expand to 1 insn, which is what expand_vec_perm_1 is intended to check. It's already called from ix86_expand_vec_perm_const_1. If things need to be shuffled around in that function, then that's the right place to do so. It's also clearly unrelated to palignr, so has no business at all within this patch. - min = nelt, max = 0; + min = 2 * nelt, max = 0; This change to min is wrong for this patch. It probably belonged in your 2/2 patch. + shift1 = GEN_INT ((min - nelt / 2) *
+ GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
Coding convention sez
shift1 = GEN_INT ((min - nelt / 2)
* GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
Is the following patch ok? It passes bootstrap and make check. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 88142a8..91f6f21 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -42807,6 +42807,8 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d) return true; } +static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d); + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to instantiate D in a single instruction. */ @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true; + /* Try the AVX512F vpermi2 instructions. */ rtx vec[64]; enum machine_mode mode = d->vmode; @@ -43004,7 +43010,7 @@ expand_vec_perm_pshuflw_pshufhw (struct expand_vec_perm_d *d) } /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to simplify - the permutation using the SSSE3 palignr instruction. This succeeds + the permutation using the SSSE3/AVX2 palignr instruction. This succeeds when all of the elements in PERM fit within one vector and we merely need to shift them down so that a single vector permutation has a chance to succeed. */ @@ -43015,14 +43021,26 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) unsigned i, nelt = d->nelt; unsigned min, max; bool in_order, ok; - rtx shift, target; + rtx shift, shift1, target, tmp; struct expand_vec_perm_d dcopy; - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* SSSE3 is required to apply PALIGNR on 16-byte operands. */ + if (GET_MODE_SIZE (d->vmode) == 16) + { + if (!TARGET_SSSE3) + return false; + } + /* AVX2 is required to apply PALIGNR on 32-byte operands. */ + else if (GET_MODE_SIZE (d->vmode) == 32) + { + if (!TARGET_AVX2) + return false; + } + /* Other sizes are not supported. */ + else return false; - min = nelt, max = 0; + min = 2 * nelt, max = 0; for (i = 0; i < nelt; ++i) { unsigned e = d->perm[i]; @@ -43041,9 +43059,35 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) dcopy = *d; shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), - gen_lowpart (TImode, d->op0), shift)); + shift1 = GEN_INT ((min - nelt / 2) * + GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); + + if (GET_MODE_SIZE (d->vmode) != 32) + { + target = gen_reg_rtx (TImode); + emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), + gen_lowpart (TImode, d->op0), shift)); + } + else + { + target = gen_reg_rtx (V2TImode); + tmp = gen_reg_rtx (V4DImode); + emit_insn (gen_avx2_permv2ti (tmp, + gen_lowpart (V4DImode, d->op0), + gen_lowpart (V4DImode, d->op1), + GEN_INT (33))); + if (min < nelt / 2) + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, tmp), + gen_lowpart (V2TImode, d->op0), + shift)); + else + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, d->op1), + gen_lowpart (V2TImode, tmp), + shift1)); + } + dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target); dcopy.one_operand_p = true; On Tue, Apr 29, 2014 at 1:03 AM, Richard Henderson r...@redhat.com wrote: On 04/28/2014 01:43 PM, Evgeny Stupachenko wrote: Agree on checks: /* PALIGNR of 2 128-bit registers takes only 1 instruction. Requires SSSE3. */ if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: PERM and PALIGNR.
It is more profitable than 2 PSHUFB and PERM. */ else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; Thanks, much better. r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Mon, Apr 28, 2014 at 9:48 AM, Evgeny Stupachenko evstu...@gmail.com wrote: Hi, The patch enables use of palignr with perm instead of 2 pshufb, an or, and a perm on AVX2 for some cases. Bootstrapped and passes make check on x86. Is it ok? 2014-04-28 Evgeny Stupachenko evstu...@gmail.com * config/i386/i386.c (expand_vec_perm_1): Try AVX2 vpshufb. * config/i386/i386.c (expand_vec_perm_palignr): Extend to use the AVX2 PALIGNR instruction. Can you add testcases to verify that AVX2 vpshufb and palignr are properly generated? diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 88142a8..ae80477 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -42807,6 +42807,8 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d) return true; } +static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d); + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to instantiate D in a single instruction. */ @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true; + /* Try the AVX512F vpermi2 instructions. */ rtx vec[64]; enum machine_mode mode = d->vmode; @@ -43004,7 +43010,7 @@ expand_vec_perm_pshuflw_pshufhw (struct expand_vec_perm_d *d) } /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to simplify - the permutation using the SSSE3 palignr instruction. This succeeds + the permutation using the SSSE3/AVX2 palignr instruction. This succeeds when all of the elements in PERM fit within one vector and we merely need to shift them down so that a single vector permutation has a chance to succeed. */ @@ -43015,14 +43021,20 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) unsigned i, nelt = d->nelt; unsigned min, max; bool in_order, ok; - rtx shift, target; + rtx shift, shift1, target, tmp; struct expand_vec_perm_d dcopy; - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: + PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. + PALIGNR of 2 128-bit registers takes only 1 instruction. */ + if (!TARGET_SSSE3 || (GET_MODE_SIZE (d->vmode) != 16 ^^^ This should be only on the next line. + && GET_MODE_SIZE (d->vmode) != 32)) + return false; + /* Only AVX2 or higher supports PALIGNR on 256-bit registers. */ + if (!TARGET_AVX2 && (GET_MODE_SIZE (d->vmode) == 32)) ^ No need for '(' here. Or you can use if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32)) return false; - min = nelt, max = 0; + min = 2 * nelt, max = 0; ^^ Will this change 128-bit vector code?
for (i = 0; i < nelt; ++i) { unsigned e = d->perm[i]; @@ -43041,9 +43053,34 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) dcopy = *d; shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), - gen_lowpart (TImode, d->op0), shift)); + shift1 = GEN_INT ((min - nelt / 2) * + GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); + + if (GET_MODE_SIZE (d->vmode) != 32) + { + target = gen_reg_rtx (TImode); + emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), + gen_lowpart (TImode, d->op0), shift)); + } + else + { + target = gen_reg_rtx (V2TImode); + tmp = gen_reg_rtx (V4DImode); + emit_insn (gen_avx2_permv2ti (tmp, + gen_lowpart (V4DImode, d->op0), + gen_lowpart (V4DImode, d->op1), + GEN_INT (33))); + if (min < nelt / 2) + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, tmp), + gen_lowpart (V2TImode, d->op0), + shift)); + else + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, d->op1), + gen_lowpart (V2TImode, tmp), + shift1)); + } dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target); dcopy.one_operand_p = true; Evgeny -- H.J.
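A compile test along the lines H.J. requests could look like the sketch below (file name, mask, and scan pattern are assumptions, not part of the submitted patch); the mask asks for elements 1..32 of the op1:op0 concatenation, which the patched expander can emit as one vperm2i128 plus one vpalignr:

/* { dg-do compile } */
/* { dg-options "-O2 -mavx2" } */

typedef char v32qi __attribute__ ((vector_size (32)));

v32qi
shift_concat_by_1 (v32qi a, v32qi b)
{
  return __builtin_shuffle (a, b,
			    (v32qi) { 1, 2, 3, 4, 5, 6, 7, 8,
				      9, 10, 11, 12, 13, 14, 15, 16,
				      17, 18, 19, 20, 21, 22, 23, 24,
				      25, 26, 27, 28, 29, 30, 31, 32 });
}

/* { dg-final { scan-assembler "vpalignr" } } */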
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On 04/28/2014 09:48 AM, Evgeny Stupachenko wrote: - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: + PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. + PALIGNR of 2 128-bit registers takes only 1 instruction. */ + if (!TARGET_SSSE3 || (GET_MODE_SIZE (d->vmode) != 16 + && GET_MODE_SIZE (d->vmode) != 32)) + return false; + /* Only AVX2 or higher supports PALIGNR on 256-bit registers. */ + if (!TARGET_AVX2 && (GET_MODE_SIZE (d->vmode) == 32)) return false; This is confusingly written. How about if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; With the comments added into the right places. r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On Mon, Apr 28, 2014 at 9:08 PM, H.J. Lu hjl.to...@gmail.com wrote: On Mon, Apr 28, 2014 at 9:48 AM, Evgeny Stupachenko evstu...@gmail.com wrote: Hi, The patch enables use of palignr with perm instead of 2 pshufb, an or, and a perm on AVX2 for some cases. Bootstrapped and passes make check on x86. Is it ok? 2014-04-28 Evgeny Stupachenko evstu...@gmail.com * config/i386/i386.c (expand_vec_perm_1): Try AVX2 vpshufb. * config/i386/i386.c (expand_vec_perm_palignr): Extend to use the AVX2 PALIGNR instruction. Can you add testcases to verify that AVX2 vpshufb and palignr are properly generated? One of the next patches will have a test case. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 88142a8..ae80477 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -42807,6 +42807,8 @@ expand_vec_perm_pshufb (struct expand_vec_perm_d *d) return true; } +static bool expand_vec_perm_vpshufb2_vpermq (struct expand_vec_perm_d *d); + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to instantiate D in a single instruction. */ @@ -42946,6 +42948,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) if (expand_vec_perm_pshufb (d)) return true; + /* Try the AVX2 vpshufb. */ + if (expand_vec_perm_vpshufb2_vpermq (d)) + return true; + /* Try the AVX512F vpermi2 instructions. */ rtx vec[64]; enum machine_mode mode = d->vmode; @@ -43004,7 +43010,7 @@ expand_vec_perm_pshuflw_pshufhw (struct expand_vec_perm_d *d) } /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to simplify - the permutation using the SSSE3 palignr instruction. This succeeds + the permutation using the SSSE3/AVX2 palignr instruction. This succeeds when all of the elements in PERM fit within one vector and we merely need to shift them down so that a single vector permutation has a chance to succeed. */ @@ -43015,14 +43021,20 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) unsigned i, nelt = d->nelt; unsigned min, max; bool in_order, ok; - rtx shift, target; + rtx shift, shift1, target, tmp; struct expand_vec_perm_d dcopy; - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: + PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. + PALIGNR of 2 128-bit registers takes only 1 instruction. */ + if (!TARGET_SSSE3 || (GET_MODE_SIZE (d->vmode) != 16 ^^^ This should be only on the next line. + && GET_MODE_SIZE (d->vmode) != 32)) + return false; + /* Only AVX2 or higher supports PALIGNR on 256-bit registers. */ + if (!TARGET_AVX2 && (GET_MODE_SIZE (d->vmode) == 32)) ^ No need for '(' here. Or you can use if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32)) return false; - min = nelt, max = 0; + min = 2 * nelt, max = 0; ^^ Will this change 128-bit vector code? No. The only case where all elements in the permutation constant are equal to or greater than nelt will be canonicalized to the one-operand case with all elements less than nelt. So the change actually affects nothing, but it is still more accurate.
for (i = 0; i < nelt; ++i) { unsigned e = d->perm[i]; @@ -43041,9 +43053,34 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d) dcopy = *d; shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), - gen_lowpart (TImode, d->op0), shift)); + shift1 = GEN_INT ((min - nelt / 2) * + GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode))); + + if (GET_MODE_SIZE (d->vmode) != 32) + { + target = gen_reg_rtx (TImode); + emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1), + gen_lowpart (TImode, d->op0), shift)); + } + else + { + target = gen_reg_rtx (V2TImode); + tmp = gen_reg_rtx (V4DImode); + emit_insn (gen_avx2_permv2ti (tmp, + gen_lowpart (V4DImode, d->op0), + gen_lowpart (V4DImode, d->op1), + GEN_INT (33))); + if (min < nelt / 2) + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, tmp), + gen_lowpart (V2TImode, d->op0), + shift)); + else + emit_insn (gen_avx2_palignrv2ti (target, + gen_lowpart (V2TImode, d->op1), + gen_lowpart (V2TImode, tmp), + shift1));
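Evgeny's canonicalization point can be seen directly with __builtin_shuffle (a hedged standalone example, not from the patch):

typedef char v16qi __attribute__ ((vector_size (16)));

/* Every index is >= nelt (16), i.e. selects only from B, so the
   middle end canonicalizes this two-operand permutation to the
   one-operand form with all indices reduced below nelt; min can
   therefore never legitimately remain at 2 * nelt.  */
v16qi
second_operand_only (v16qi a, v16qi b)
{
  return __builtin_shuffle (a, b,
			    (v16qi) { 16, 17, 18, 19, 20, 21, 22, 23,
				      24, 25, 26, 27, 28, 29, 30, 31 });
}

Here the result is simply b, and the backend expander only ever sees indices 0..15.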
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
Agree on checks: /* PALIGNR of 2 128-bit registers takes only 1 instruction. Requires SSSE3. */ if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. */ else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; On Mon, Apr 28, 2014 at 9:32 PM, Richard Henderson r...@redhat.com wrote: On 04/28/2014 09:48 AM, Evgeny Stupachenko wrote: - /* Even with AVX, palignr only operates on 128-bit vectors. */ - if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16) + /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: + PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. + PALIGNR of 2 128-bit registers takes only 1 instruction. */ + if (!TARGET_SSSE3 || (GET_MODE_SIZE (d->vmode) != 16 + && GET_MODE_SIZE (d->vmode) != 32)) + return false; + /* Only AVX2 or higher supports PALIGNR on 256-bit registers. */ + if (!TARGET_AVX2 && (GET_MODE_SIZE (d->vmode) == 32)) return false; This is confusingly written. How about if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; With the comments added into the right places. r~
Re: [PATCH 1/2, x86] Add palignr support for AVX2.
On 04/28/2014 01:43 PM, Evgeny Stupachenko wrote: Agree on checks: /* PALIGNR of 2 128-bit registers takes only 1 instruction. Requires SSSE3. */ if (GET_MODE_SIZE (d->vmode) == 16) { if (!TARGET_SSSE3) return false; } /* PALIGNR of 2 256-bit registers on AVX2 costs only 2 instructions: PERM and PALIGNR. It is more profitable than 2 PSHUFB and PERM. */ else if (GET_MODE_SIZE (d->vmode) == 32) { if (!TARGET_AVX2) return false; } else return false; Thanks, much better. r~