Re: [PATCH] testsuite: Fix up pr111754.c test
> On 28.11.2023 at 09:58, Jakub Jelinek wrote:
>
> On Tue, Nov 28, 2023 at 09:43:52AM +0100, Jakub Jelinek wrote:
>> On Tue, Nov 28, 2023 at 03:56:47PM +0800, juzhe.zh...@rivai.ai wrote:
>>> Hi, there is a regression in RISC-V caused by this patch:
>>>
>>> FAIL: gcc.dg/vect/pr111754.c -flto -ffat-lto-objects scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
>>> FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
>>>
>>> I have checked the dump; it is:
>>>
>>> F foo (F a, F b)
>>> {
>>>   <bb 2> [local count: 1073741824]:
>>>   <retval> = { 0.0, 9.0e+0, 0.0, 0.0 };
>>>   return <retval>;
>>>
>>> }
>>>
>>> The dump IR seems reasonable to me.
>>> I wonder whether we should work around this in the RISC-V backend to
>>> generate the same IR as ARM SVE?
>>> Or should we adjust the test?
>>
>> Note, the test also FAILs on i686-linux (but not e.g. on x86_64-linux):
>> /home/jakub/src/gcc/obj67/gcc/xgcc -B/home/jakub/src/gcc/obj67/gcc/ /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c -fdiagnostics-plain-output -O2 -fdump-tree-optimized -S -o pr111754.s
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c: In function 'foo':
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: note: the ABI for passing parameters with 16-byte alignment has changed in GCC 4.6
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]
>> FAIL: gcc.dg/vect/pr111754.c (test for excess errors)
>> Excess errors:
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
>> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]
>>
>> PASS: gcc.dg/vect/pr111754.c scan-tree-dump-not optimized "VEC_PERM_EXPR"
>> FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
>>
>> So, I think it is wrong to specify
>> /* { dg-options "-O2 -fdump-tree-optimized" } */
>> in the test; it should be dg-additional-options instead, so that it gets
>> the implied vector compilation options, e.g. for i686-linux (-msse2 in that
>> case at least). The question is whether -Wno-psabi should be added as well,
>> and certainly the scan-tree-dump needs to be guarded by an appropriate
>> vect_* effective target (but I don't know which; one which asserts support
>> for V4SFmode and returning it).
>> Alternatively, perhaps don't check the optimized dump but some earlier one
>> before generic vector lowering; then hopefully it could match on all
>> targets? Maybe with the <retval> = ... vs. return ... variants.
>
> All in one patch now.
>
> Tested on x86_64-linux with
> make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m32/-mno-sse,-m64\} vect.exp=pr111754.c'
> Ok for trunk?

Ok

Richard

> 2023-11-28  Jakub Jelinek
>
>         PR middle-end/111754
>         * gcc.dg/vect/pr111754.c: Use dg-additional-options rather than
>         dg-options, add -Wno-psabi and use -fdump-tree-forwprop1 rather than
>         -fdump-tree-optimized.  Scan forwprop1 dump rather than optimized and
>         scan for either direct return or setting of <retval> to the vector.
>
> --- gcc/testsuite/gcc.dg/vect/pr111754.c.jj 2023-11-28 08:46:28.422801989 +0100
> +++ gcc/testsuite/gcc.dg/vect/pr111754.c 2023-11-28 09:52:56.761059292 +0100
> @@ -1,5 +1,6 @@
> +/* PR middle-end/111754 */
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-additional-options "-O2 -fdump-tree-forwprop1 -Wno-psabi" } */
>
>  typedef float __attribute__((__vector_size__ (16))) F;
>
> @@ -9,5 +10,5 @@ F foo (F a, F b)
>    return __builtin_shufflevector (v, v, 1, 0, 1, 2);
>  }
>
> -/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> -/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
> +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "forwprop1" } } */
> +/* { dg-final { scan-tree-dump "(return|<retval> =) \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "forwprop1" } } */
>
>         Jakub
[PATCH] testsuite: Fix up pr111754.c test
On Tue, Nov 28, 2023 at 09:43:52AM +0100, Jakub Jelinek wrote:
> On Tue, Nov 28, 2023 at 03:56:47PM +0800, juzhe.zh...@rivai.ai wrote:
> > Hi, there is a regression in RISC-V caused by this patch:
> >
> > FAIL: gcc.dg/vect/pr111754.c -flto -ffat-lto-objects scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
> > FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
> >
> > I have checked the dump; it is:
> >
> > F foo (F a, F b)
> > {
> >   <bb 2> [local count: 1073741824]:
> >   <retval> = { 0.0, 9.0e+0, 0.0, 0.0 };
> >   return <retval>;
> >
> > }
> >
> > The dump IR seems reasonable to me.
> > I wonder whether we should work around this in the RISC-V backend to
> > generate the same IR as ARM SVE?
> > Or should we adjust the test?
>
> Note, the test also FAILs on i686-linux (but not e.g. on x86_64-linux):
> /home/jakub/src/gcc/obj67/gcc/xgcc -B/home/jakub/src/gcc/obj67/gcc/ /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c -fdiagnostics-plain-output -O2 -fdump-tree-optimized -S -o pr111754.s
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c: In function 'foo':
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: note: the ABI for passing parameters with 16-byte alignment has changed in GCC 4.6
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]
> FAIL: gcc.dg/vect/pr111754.c (test for excess errors)
> Excess errors:
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
> /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]
>
> PASS: gcc.dg/vect/pr111754.c scan-tree-dump-not optimized "VEC_PERM_EXPR"
> FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
>
> So, I think it is wrong to specify
> /* { dg-options "-O2 -fdump-tree-optimized" } */
> in the test; it should be dg-additional-options instead, so that it gets
> the implied vector compilation options, e.g. for i686-linux (-msse2 in that
> case at least). The question is whether -Wno-psabi should be added as well,
> and certainly the scan-tree-dump needs to be guarded by an appropriate
> vect_* effective target (but I don't know which; one which asserts support
> for V4SFmode and returning it).
> Alternatively, perhaps don't check the optimized dump but some earlier one
> before generic vector lowering; then hopefully it could match on all
> targets? Maybe with the <retval> = ... vs. return ... variants.

All in one patch now.

Tested on x86_64-linux with
make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m32/-mno-sse,-m64\} vect.exp=pr111754.c'
Ok for trunk?

2023-11-28  Jakub Jelinek

        PR middle-end/111754
        * gcc.dg/vect/pr111754.c: Use dg-additional-options rather than
        dg-options, add -Wno-psabi and use -fdump-tree-forwprop1 rather than
        -fdump-tree-optimized.  Scan forwprop1 dump rather than optimized and
        scan for either direct return or setting of <retval> to the vector.

--- gcc/testsuite/gcc.dg/vect/pr111754.c.jj 2023-11-28 08:46:28.422801989 +0100
+++ gcc/testsuite/gcc.dg/vect/pr111754.c 2023-11-28 09:52:56.761059292 +0100
@@ -1,5 +1,6 @@
+/* PR middle-end/111754 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-additional-options "-O2 -fdump-tree-forwprop1 -Wno-psabi" } */

 typedef float __attribute__((__vector_size__ (16))) F;

@@ -9,5 +10,5 @@ F foo (F a, F b)
   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
 }

-/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
-/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "forwprop1" } } */
+/* { dg-final { scan-tree-dump "(return|<retval> =) \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "forwprop1" } } */

        Jakub
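To illustrate why the scan pattern gains an alternation, here is a minimal sketch in C using POSIX regcomp/regexec. It is not how DejaGnu evaluates scan-tree-dump (that uses Tcl regular expressions); it only shows that a pattern with two alternatives accepts both dump shapes reported in this thread. It assumes the token stripped from the archived pattern and dump is <retval>, which is how GCC's GIMPLE dumps usually name the return slot. The helper line_matches and the four string constants are hypothetical names, and the dots are escaped here although the test's pattern leaves them unescaped.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if LINE contains a match for the POSIX
   extended regular expression PATTERN.  */
static bool
line_matches (const char *pattern, const char *line)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool matched = regexec (&re, line, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}

/* Pattern before the patch: only the direct "return { ... }" form.  */
static const char old_pattern[]
  = "return \\{ 0\\.0, 9\\.0e\\+0, 0\\.0, 0\\.0 \\}";

/* Pattern after the patch (assumed): either form.  */
static const char new_pattern[]
  = "(return|<retval> =) \\{ 0\\.0, 9\\.0e\\+0, 0\\.0, 0\\.0 \\}";

/* Dump line as quoted for x86_64 with vector support.  */
static const char direct_return[]
  = "  return { 0.0, 9.0e+0, 0.0, 0.0 };";

/* Dump line as reported for RISC-V (assumed <retval> spelling).  */
static const char via_retval[]
  = "  <retval> = { 0.0, 9.0e+0, 0.0, 0.0 };";
```

Calling line_matches (old_pattern, via_retval) is expected to fail while line_matches (new_pattern, via_retval) succeeds, which mirrors the RISC-V FAIL turning into a PASS once the second alternative is added.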
Re: PR111754
On Tue, Nov 28, 2023 at 03:56:47PM +0800, juzhe.zh...@rivai.ai wrote:
> Hi, there is a regression in RISC-V caused by this patch:
>
> FAIL: gcc.dg/vect/pr111754.c -flto -ffat-lto-objects scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
> FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
>
> I have checked the dump; it is:
>
> F foo (F a, F b)
> {
>   <bb 2> [local count: 1073741824]:
>   <retval> = { 0.0, 9.0e+0, 0.0, 0.0 };
>   return <retval>;
>
> }
>
> The dump IR seems reasonable to me.
> I wonder whether we should work around this in the RISC-V backend to
> generate the same IR as ARM SVE?
> Or should we adjust the test?

Note, the test also FAILs on i686-linux (but not e.g. on x86_64-linux):
/home/jakub/src/gcc/obj67/gcc/xgcc -B/home/jakub/src/gcc/obj67/gcc/ /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c -fdiagnostics-plain-output -O2 -fdump-tree-optimized -S -o pr111754.s
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c: In function 'foo':
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: note: the ABI for passing parameters with 16-byte alignment has changed in GCC 4.6
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]
FAIL: gcc.dg/vect/pr111754.c (test for excess errors)
Excess errors:
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:7:1: warning: SSE vector return without SSE enabled changes the ABI [-Wpsabi]
/home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr111754.c:6:3: warning: SSE vector argument without SSE enabled changes the ABI [-Wpsabi]

PASS: gcc.dg/vect/pr111754.c scan-tree-dump-not optimized "VEC_PERM_EXPR"
FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"

So, I think it is wrong to specify
/* { dg-options "-O2 -fdump-tree-optimized" } */
in the test; it should be dg-additional-options instead, so that it gets
the implied vector compilation options, e.g. for i686-linux (-msse2 in that
case at least). The question is whether -Wno-psabi should be added as well,
and certainly the scan-tree-dump needs to be guarded by an appropriate
vect_* effective target (but I don't know which; one which asserts support
for V4SFmode and returning it).
Alternatively, perhaps don't check the optimized dump but some earlier one
before generic vector lowering; then hopefully it could match on all
targets? Maybe with the <retval> = ... vs. return ... variants.

        Jakub
PR111754
Hi, there is a regression in RISC-V caused by this patch:

FAIL: gcc.dg/vect/pr111754.c -flto -ffat-lto-objects scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"
FAIL: gcc.dg/vect/pr111754.c scan-tree-dump optimized "return { 0.0, 9.0e\\+0, 0.0, 0.0 }"

I have checked the dump; it is:

F foo (F a, F b)
{
  <bb 2> [local count: 1073741824]:
  <retval> = { 0.0, 9.0e+0, 0.0, 0.0 };
  return <retval>;

}

The dump IR seems reasonable to me.
I wonder whether we should work around this in the RISC-V backend to generate the same IR as ARM SVE?
Or should we adjust the test?

Thanks.

juzhe.zh...@rivai.ai
Re: PR111754
Prathamesh Kulkarni writes:
> PR111754: Rework encoding of result for VEC_PERM_EXPR with constant input vectors.
>
> gcc/ChangeLog:
>         PR middle-end/111754
>         * fold-const.cc (fold_vec_perm_cst): Set result's encoding to sel's
>         encoding, and set res_nelts_per_pattern to 2 if sel contains stepped
>         sequence but input vectors do not.
>         (test_nunits_min_2): New test Case 8.
>         (test_nunits_min_4): New tests Case 8 and Case 9.
>
> gcc/testsuite/ChangeLog:
>         PR middle-end/111754
>         * gcc.target/aarch64/sve/slp_3.c: Adjust code-gen.
>         * gcc.target/aarch64/sve/slp_4.c: Likewise.
>         * gcc.dg/vect/pr111754.c: New test.

OK, thanks.

Richard

> Co-authored-by: Richard Sandiford
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 332bc8aead2..dff09b81f7b 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10803,27 +10803,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -                             ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -        = std::max (VECTOR_CST_NPATTERNS (arg0),
> -                    std::max (VECTOR_CST_NPATTERNS (arg1),
> -                              sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +         result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +         of N elements, the result is too.
> +
> +     (3) If the selector is N elements followed by an interleaving
> +         of N linear series, the situation is more complex.
> +
> +         valid_mask_for_fold_vec_perm_cst_p detects whether we
> +         can handle this case.  If we can, then each of the N linear
> +         series either (a) selects the same element each time or
> +         (b) selects a linear series from one of the input patterns.
>
> -      res_nelts_per_pattern
> -        = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -                    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -                              sel.encoding ().nelts_per_pattern ()));
> +         If (b) holds for one of the linear series, the result
> +         will contain a linear series, and so the result will have
> +         the same shape as the selector.  If (a) holds for all of
> +         the linear series, the result will be the same as (2) above.
>
> +         (b) can only hold if one of the input patterns has a
> +         stepped encoding.  */
> +
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +        res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> @@ -17622,6 +17633,29 @@ test_nunits_min_2 (machine_mode vmode)
>          tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
>          validate_res (1, 3, res, expected_res);
>        }
> +
> +      /* Case 8: Same as aarch64/sve/slp_3.c:
> +         arg0, arg1 are dup vectors.
> +         sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> +         So res = { arg0[0], arg1[0], ... } // (2, 1)
> +
> +         In this case, since the input vectors are dup, only the first two
> +         elements per pattern in sel are considered significant.  */
> +      {
> +        tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> +        tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> +        poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +        vec_perm_builder builder (len, 2, 3);
> +        poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> +        builder_push_elems (builder, mask_elems);
> +
> +        vec_perm_indices sel (builder, 2, len)
Re: PR111754
rk as before. > > */ > > - if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > > -{ > > - res_npatterns > > - = std::max (VECTOR_CST_NPATTERNS (arg0), > > - std::max (VECTOR_CST_NPATTERNS (arg1), > > - sel.encoding ().npatterns ())); > > + /* First try to implement the fold in a VLA-friendly way. > > + > > + (1) If the selector is simply a duplication of N elements, the > > + result is likewise a duplication of N elements. > > + > > + (2) If the selector is N elements followed by a duplication > > + of N elements, the result is too. > > + > > + (3) If the selector is N elements followed by an interleaving > > + of N linear series, the situation is more complex. > > + > > + valid_mask_for_fold_vec_perm_cst_p detects whether we > > + can handle this case. If we can, then each of the N linear > > + series either (a) selects the same element each time or > > + (b) selects a linear series from one of the input patterns. > > + > > + If (b) holds for one of the linear series, the result > > + will contain a linear series, and so the result will have > > + the same shape as the selector. If (a) holds for all of > > + the lienar series, the result will be the same as (2) above. > > my typo: linear > > > > - res_nelts_per_pattern > > - = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), > > - std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1), > > - sel.encoding ().nelts_per_pattern ())); > > + (b) can only hold if one of the input patterns has a > > + stepped encoding. 
*/ > > > > + if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason)) > > +{ > > + res_npatterns = sel.encoding ().npatterns (); > > + res_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > > + if (res_nelts_per_pattern == 3 > > + && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3 > > + && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3) > > + res_nelts_per_pattern = 2; > >res_nelts = res_npatterns * res_nelts_per_pattern; > > } > >else if (TYPE_VECTOR_SUBPARTS (type).is_constant (_nelts)) > > @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode) > > tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) }; > > validate_res (1, 3, res, expected_res); > >} > > + > > + /* Case 8: Same as aarch64/sve/slp_3.c: > > + arg0, arg1 are dup vectors. > > + sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3) > > + So res = { arg0[0], arg1[0], ... } // (2, 1) > > + > > + In this case, since the input vectors are dup, only the first two > > + elements per pattern in sel are considered significant. */ > > + { > > + tree arg0 = build_vec_cst_rand (vmode, 1, 1); > > + tree arg1 = build_vec_cst_rand (vmode, 1, 1); > > + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > > + > > + vec_perm_builder builder (len, 2, 3); > > + poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 }; > > + builder_push_elems (builder, mask_elems); > > + > > + vec_perm_indices sel (builder, 2, len); > > + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); > > + > > + tree expected_res[] = { ARG0(0), ARG1(0) }; > > + validate_res (2, 1, res, expected_res); > > + } > > } > > } > > > > @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode) > > ASSERT_TRUE (res == NULL_TREE); > > ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns")); > >} > > + > > + /* Case 8: PR111754: When input vector is not a stepped sequence, > > + check that the result is not a stepped sequence either, even > > + if sel has a stepped sequence. 
*/ > > + { > > + tree arg0 = build_vec_cst_rand (vmode, 1, 2); > > + tree arg1 = build_vec_cst_rand (vmode, 1, 2); > > + poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > > + > > + vec_perm_builder builder (len, 1, 3); > > + poly_uint64 mask_elems[] = { 0, 1, 2 }; > > + builder_push_elems (builder, mask_elems); > > + > > + vec_perm_indices sel (builder, 2, len); > > + tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel); > > + > > + tree expected_res[] = { ARG0(0), ARG0(1) }; >
Re: PR111754
same element each time or
> +         (b) selects a linear series from one of the input patterns.
> +
> +         If (b) holds for one of the linear series, the result
> +         will contain a linear series, and so the result will have
> +         the same shape as the selector.  If (a) holds for all of
> +         the lienar series, the result will be the same as (2) above.

my typo: linear

>
> -      res_nelts_per_pattern
> -        = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -                    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -                              sel.encoding ().nelts_per_pattern ()));
> +         (b) can only hold if one of the input patterns has a
> +         stepped encoding.  */
>
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +        res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
>          tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
>          validate_res (1, 3, res, expected_res);
>        }
> +
> +      /* Case 8: Same as aarch64/sve/slp_3.c:
> +         arg0, arg1 are dup vectors.
> +         sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> +         So res = { arg0[0], arg1[0], ... } // (2, 1)
> +
> +         In this case, since the input vectors are dup, only the first two
> +         elements per pattern in sel are considered significant.  */
> +      {
> +        tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> +        tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> +        poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +        vec_perm_builder builder (len, 2, 3);
> +        poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> +        builder_push_elems (builder, mask_elems);
> +
> +        vec_perm_indices sel (builder, 2, len);
> +        tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +        tree expected_res[] = { ARG0(0), ARG1(0) };
> +        validate_res (2, 1, res, expected_res);
> +      }
>      }
>  }
>
> @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode)
>          ASSERT_TRUE (res == NULL_TREE);
>          ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
>        }
> +
> +      /* Case 8: PR111754: When input vector is not a stepped sequence,
> +         check that the result is not a stepped sequence either, even
> +         if sel has a stepped sequence.  */
> +      {
> +        tree arg0 = build_vec_cst_rand (vmode, 1, 2);
> +        tree arg1 = build_vec_cst_rand (vmode, 1, 2);
> +        poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +        vec_perm_builder builder (len, 1, 3);
> +        poly_uint64 mask_elems[] = { 0, 1, 2 };
> +        builder_push_elems (builder, mask_elems);
> +
> +        vec_perm_indices sel (builder, 2, len);
> +        tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +        tree expected_res[] = { ARG0(0), ARG0(1) };
> +        validate_res (sel.encoding ().npatterns (), 2, res, expected_res);

The test is OK, but I think it's worth noting that the fold_vec_perm_cst
arguments aren't canonical.  Since sel selects only from the first input,
the canonical form would be:

  tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);

So OK with a comment, but also OK with the line above instead (and no arg1).

> +      }
> +
> +      /* Case 9: If sel doesn't contain a stepped sequence,
> +         check that the result has same encoding as sel, irrespective
> +         of shape of input vectors.  */
> +      {
> +        tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +        tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +        poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +        vec_perm_builder builder (len, 1, 2);
> +        poly_uint64 mask_elems[] = { 0, len };
> +        builder_push_elems (builder, mask_elems);
> +
> +        vec_perm_indices sel (builder, 2, len);
> +        tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +        tree expected_res[] = { ARG0(0), ARG1(0) };
> +        validate_res (sel.encoding ().npatterns (),
> +                      sel.encoding ().nelts_per_pattern (), res, expected_res);
> +      }
>      }
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
> new file mo
Re: PR111754
On Wed, 15 Nov 2023 at 20:44, Prathamesh Kulkarni wrote: > > On Wed, 8 Nov 2023 at 21:57, Prathamesh Kulkarni > wrote: > > > > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni > > wrote: > > > > > > On Thu, 26 Oct 2023 at 04:09, Richard Sandiford > > > wrote: > > > > > > > > Prathamesh Kulkarni writes: > > > > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford > > > > > wrote: > > > > >> > > > > >> Hi, > > > > >> > > > > >> Sorry the slow review. I clearly didn't think this through properly > > > > >> when doing the review of the original patch, so I wanted to spend > > > > >> some time working on the code to get a better understanding of > > > > >> the problem. > > > > >> > > > > >> Prathamesh Kulkarni writes: > > > > >> > Hi, > > > > >> > For the following test-case: > > > > >> > > > > > >> > typedef float __attribute__((__vector_size__ (16))) F; > > > > >> > F foo (F a, F b) > > > > >> > { > > > > >> > F v = (F) { 9 }; > > > > >> > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > > > >> > } > > > > >> > > > > > >> > Compiling with -O2 results in following ICE: > > > > >> > foo.c: In function ‘foo’: > > > > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > > > > >> > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > > > >> > | ^~ > > > > >> > 0x7f3185 wi::int_traits > > > > >> >>::decompose(long*, unsigned int, std::pair > > > > >> > const&) > > > > >> > ../../gcc/gcc/rtl.h:2314 > > > > >> > 0x7f3185 wide_int_ref_storage > > > >> > false>::wide_int_ref_storage > > > > >> >>(std::pair const&) > > > > >> > ../../gcc/gcc/wide-int.h:1089 > > > > >> > 0x7f3185 generic_wide_int > > > > >> >>::generic_wide_int > > > > >> >>(std::pair const&) > > > > >> > ../../gcc/gcc/wide-int.h:847 > > > > >> > 0x7f3185 poly_int<1u, generic_wide_int > > > >> > false> > >::poly_int > > > > >> >>(poly_int_full, std::pair const&) > > > > >> > ../../gcc/gcc/poly-int.h:467 > > > > >> > 0x7f3185 poly_int<1u, generic_wide_int > > > >> > false> > 
>::poly_int > > > > >> >>(std::pair const&) > > > > >> > ../../gcc/gcc/poly-int.h:453 > > > > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > > > > >> > ../../gcc/gcc/rtl.h:2383 > > > > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > > > > >> > ../../gcc/gcc/rtx-vector-builder.h:122 > > > > >> > 0xfd4e1b vector_builder > > > >> > rtx_vector_builder>::elt(unsigned int) const > > > > >> > ../../gcc/gcc/vector-builder.h:253 > > > > >> > 0xfd4d11 rtx_vector_builder::build() > > > > >> > ../../gcc/gcc/rtx-vector-builder.cc:73 > > > > >> > 0xc21d9c const_vector_from_tree > > > > >> > ../../gcc/gcc/expr.cc:13487 > > > > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > > > >> > expand_modifier, rtx_def**, bool) > > > > >> > ../../gcc/gcc/expr.cc:11059 > > > > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, > > > > >> > expand_modifier) > > > > >> > ../../gcc/gcc/expr.h:310 > > > > >> > 0xaee682 expand_return > > > > >> > ../../gcc/gcc/cfgexpand.cc:3809 > > > > >> > 0xaee682 expand_gimple_stmt_1 > > > > >> > ../../gcc/gcc/cfgexpand.cc:3918 > > > > >> > 0xaee682 expand_gimple_stmt > > > > >> > ../../gcc/gcc/cfgexpand.cc:4044 > > > > >> > 0xaf28f0 expand_gimple_basic_block > > > > >> > ../../gcc/gcc/cfgexpand.cc:6100 > > > > >> > 0xaf4996 execute > > > > >> > ../../gcc/gcc/cfgexpand.cc:6835 > > > > >> > > > > > >> > IIUC, the issue is that fold_vec_perm returns a vector having > > > > >> > float element > > > > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries > > > > >> > to derive element v[3], not present in the encoding, while trying > > > > >> > to > > > > >> > build rtx vector > > > > >> > in rtx_vector_builder::build(): > > > > >> > for (unsigned int i = 0; i < nelts; ++i) > > > > >> > RTVEC_ELT (v, i) = elt (i); > > > > >> > > > > > >> > The attached patch tries to fix this by returning false from > > > > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence 
and > > > > >> > input vector has non-integral element type, so for VLA vectors, it > > > > >> > will only build result with dup sequence (nelts_per_pattern < 3) > > > > >> > for > > > > >> > non-integral element type. > > > > >> > > > > > >> > For VLS vectors, this will still work for stepped sequence since it > > > > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set: > > > > >> > res_npattern = res_nelts and > > > > >> > res_nelts_per_pattern = 1 > > > > >> > > > > > >> > and fold the above case to: > > > > >> > F foo (F a, F b) > > > > >> > { > > > > >> >[local count: 1073741824]: > > > > >> > return { 0.0, 9.0e+0, 0.0, 0.0 }; > > > > >> > } > > > > >> > > > > > >> > But I am not sure if this is entirely correct, since: > > > > >> > tree res =
Re: PR111754
On Wed, 8 Nov 2023 at 21:57, Prathamesh Kulkarni wrote: > > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni > wrote: > > > > On Thu, 26 Oct 2023 at 04:09, Richard Sandiford > > wrote: > > > > > > Prathamesh Kulkarni writes: > > > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford > > > > wrote: > > > >> > > > >> Hi, > > > >> > > > >> Sorry the slow review. I clearly didn't think this through properly > > > >> when doing the review of the original patch, so I wanted to spend > > > >> some time working on the code to get a better understanding of > > > >> the problem. > > > >> > > > >> Prathamesh Kulkarni writes: > > > >> > Hi, > > > >> > For the following test-case: > > > >> > > > > >> > typedef float __attribute__((__vector_size__ (16))) F; > > > >> > F foo (F a, F b) > > > >> > { > > > >> > F v = (F) { 9 }; > > > >> > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > > >> > } > > > >> > > > > >> > Compiling with -O2 results in following ICE: > > > >> > foo.c: In function ‘foo’: > > > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > > > >> > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > > >> > | ^~ > > > >> > 0x7f3185 wi::int_traits > > > >> >>::decompose(long*, unsigned int, std::pair > > > >> > const&) > > > >> > ../../gcc/gcc/rtl.h:2314 > > > >> > 0x7f3185 wide_int_ref_storage > > >> > false>::wide_int_ref_storage > > > >> >>(std::pair const&) > > > >> > ../../gcc/gcc/wide-int.h:1089 > > > >> > 0x7f3185 generic_wide_int > > > >> >>::generic_wide_int > > > >> >>(std::pair const&) > > > >> > ../../gcc/gcc/wide-int.h:847 > > > >> > 0x7f3185 poly_int<1u, generic_wide_int > > >> > false> > >::poly_int > > > >> >>(poly_int_full, std::pair const&) > > > >> > ../../gcc/gcc/poly-int.h:467 > > > >> > 0x7f3185 poly_int<1u, generic_wide_int > > >> > false> > >::poly_int > > > >> >>(std::pair const&) > > > >> > ../../gcc/gcc/poly-int.h:453 > > > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > > > >> > 
../../gcc/gcc/rtl.h:2383 > > > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > > > >> > ../../gcc/gcc/rtx-vector-builder.h:122 > > > >> > 0xfd4e1b vector_builder > > >> > rtx_vector_builder>::elt(unsigned int) const > > > >> > ../../gcc/gcc/vector-builder.h:253 > > > >> > 0xfd4d11 rtx_vector_builder::build() > > > >> > ../../gcc/gcc/rtx-vector-builder.cc:73 > > > >> > 0xc21d9c const_vector_from_tree > > > >> > ../../gcc/gcc/expr.cc:13487 > > > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > > >> > expand_modifier, rtx_def**, bool) > > > >> > ../../gcc/gcc/expr.cc:11059 > > > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, > > > >> > expand_modifier) > > > >> > ../../gcc/gcc/expr.h:310 > > > >> > 0xaee682 expand_return > > > >> > ../../gcc/gcc/cfgexpand.cc:3809 > > > >> > 0xaee682 expand_gimple_stmt_1 > > > >> > ../../gcc/gcc/cfgexpand.cc:3918 > > > >> > 0xaee682 expand_gimple_stmt > > > >> > ../../gcc/gcc/cfgexpand.cc:4044 > > > >> > 0xaf28f0 expand_gimple_basic_block > > > >> > ../../gcc/gcc/cfgexpand.cc:6100 > > > >> > 0xaf4996 execute > > > >> > ../../gcc/gcc/cfgexpand.cc:6835 > > > >> > > > > >> > IIUC, the issue is that fold_vec_perm returns a vector having float > > > >> > element > > > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries > > > >> > to derive element v[3], not present in the encoding, while trying to > > > >> > build rtx vector > > > >> > in rtx_vector_builder::build(): > > > >> > for (unsigned int i = 0; i < nelts; ++i) > > > >> > RTVEC_ELT (v, i) = elt (i); > > > >> > > > > >> > The attached patch tries to fix this by returning false from > > > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > > > >> > input vector has non-integral element type, so for VLA vectors, it > > > >> > will only build result with dup sequence (nelts_per_pattern < 3) for > > > >> > non-integral element type. 
> > > >> > > > > >> > For VLS vectors, this will still work for stepped sequence since it > > > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set: > > > >> > res_npattern = res_nelts and > > > >> > res_nelts_per_pattern = 1 > > > >> > > > > >> > and fold the above case to: > > > >> > F foo (F a, F b) > > > >> > { > > > >> >[local count: 1073741824]: > > > >> > return { 0.0, 9.0e+0, 0.0, 0.0 }; > > > >> > } > > > >> > > > > >> > But I am not sure if this is entirely correct, since: > > > >> > tree res = out_elts.build (); > > > >> > will canonicalize the encoding and may result in a stepped sequence > > > >> > (vector_builder::finalize() may reduce npatterns at the cost of > > > >> > increasing > > > >> > nelts_per_pattern) ? > > > >> > > > > >> > PS: This issue is now latent after PR111648 fix, since > > > >> >
Re: PR111754
our patch.
> Sorry for the delay.

Hi,
Sorry for the slow response. I have rebased your patch and added a couple of
tests. The attached patch resulted in fallout for aarch64/sve/slp_3.c and
aarch64/sve/slp_4.c.

Specifically for slp_3.c, we didn't fold the following case:
arg0, arg1 are dup vectors.
sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2, nelts_per_pattern = 3)

because res_nelts_per_pattern was set to 3, and upon encountering 2,
fold_vec_perm_cst returned false.

With the patch, we set res_nelts_per_pattern = 2 (since the input vectors are
dup), and thus the case gets folded to:
res = { arg0[0], arg1[0], ... } // (2, 1)

This results in using ldrqd for loading the result instead of doing
the permutation at runtime with mov and zip1.
I have adjusted the tests for the new code-gen.
Does it look OK ?

There's also this strange failure observed on x86_64, as well as on aarch64:
New tests that FAIL (1 tests):

libitm.c++/dropref.C -B
/home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
execution test

Looking at dropref.C:
/* { dg-xfail-run-if "unsupported" { *-*-* } } */
#include
char *pp;
int main()
{
  __transaction_atomic {
    _ITM_dropReferences (pp, 555);
  }
  return 0;
}

This doesn't seem relevant to VEC_PERM_EXPR folding ?
The patch otherwise passes bootstrap+test on aarch64-linux-gnu with and
without SVE, and on x86_64-linux-gnu.
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 40767736389..75410869796 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;

-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-                             ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-        = std::max (VECTOR_CST_NPATTERNS (arg0),
-                    std::max (VECTOR_CST_NPATTERNS (arg1),
-                              sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+         result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+         of N elements, the result is too.
+
+     (3) If the selector is N elements followed by an interleaving
+         of N linear series, the situation is more complex.
+
+         valid_mask_for_fold_vec_perm_cst_p detects whether we
+         can handle this case.  If we can, then each of the N linear
+         series either (a) selects the same element each time or
+         (b) selects a linear series from one of the input patterns.
+
+         If (b) holds for one of the linear series, the result
+         will contain a linear series, and so the result will have
+         the same shape as the selector.  If (a) holds for all of
+         the linear series, the result will be the same as (2) above.

-      res_nelts_per_pattern
-        = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-                    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-                              sel.encoding ().nelts_per_pattern ()));
+         (b) can only hold if one of the input patterns has a
+         stepped encoding.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+          && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+          && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+        res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
@@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
        tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
        validate_res (1, 3, res, expected_res);
       }
+
+      /* Case 8: Same as aarch64/sve/slp_3.c:
+         arg0, arg1 are dup vectors.
+         sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
+         So res = { arg0[0], arg1[0], ... } // (2, 1)
+
+         In this case, since the input vectors are dup, only the first two
+         elements per pattern in sel are considered significant.  */
+      {
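To make the slp_3.c case (Case 8 above) concrete, here is a minimal fixed-length sketch of the permutation. It is only an illustration: `vec_perm` and `interleave_low_selector` are hypothetical helper names, not GCC functions, the vectors have one concrete length rather than a variable one, and every selector index is assumed to be in range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-length model of a two-input permute: selector indices below len
// pick from arg0, the remaining ones pick from arg1.  Assumes every index
// is smaller than 2 * len.
std::vector<int> vec_perm(const std::vector<int> &arg0,
                          const std::vector<int> &arg1,
                          const std::vector<std::size_t> &sel) {
  const std::size_t len = arg0.size();
  std::vector<int> res;
  res.reserve(sel.size());
  for (std::size_t idx : sel)
    res.push_back(idx < len ? arg0[idx] : arg1[idx - len]);
  return res;
}

// Build sel = { 0, len, 1, len + 1, 2, len + 2, ... } cut off at len elements.
std::vector<std::size_t> interleave_low_selector(std::size_t len) {
  std::vector<std::size_t> sel;
  sel.reserve(len);
  for (std::size_t i = 0; i < len; ++i)
    sel.push_back(i % 2 == 0 ? i / 2 : len + i / 2);
  return sel;
}
```

With both inputs being dup vectors, every even result lane holds arg0[0] and every odd lane holds arg1[0], which is why two patterns of one element each are enough to describe the result.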
Re: PR111754
On Thu, 26 Oct 2023 at 04:09, Richard Sandiford wrote: > > Prathamesh Kulkarni writes: > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford > > wrote: > >> > >> Hi, > >> > >> Sorry the slow review. I clearly didn't think this through properly > >> when doing the review of the original patch, so I wanted to spend > >> some time working on the code to get a better understanding of > >> the problem. > >> > >> Prathamesh Kulkarni writes: > >> > Hi, > >> > For the following test-case: > >> > > >> > typedef float __attribute__((__vector_size__ (16))) F; > >> > F foo (F a, F b) > >> > { > >> > F v = (F) { 9 }; > >> > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > >> > } > >> > > >> > Compiling with -O2 results in following ICE: > >> > foo.c: In function ‘foo’: > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > >> > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > >> > | ^~ > >> > 0x7f3185 wi::int_traits > >> >>::decompose(long*, unsigned int, std::pair > >> > const&) > >> > ../../gcc/gcc/rtl.h:2314 > >> > 0x7f3185 wide_int_ref_storage >> > false>::wide_int_ref_storage > >> >>(std::pair const&) > >> > ../../gcc/gcc/wide-int.h:1089 > >> > 0x7f3185 generic_wide_int > >> >>::generic_wide_int > >> >>(std::pair const&) > >> > ../../gcc/gcc/wide-int.h:847 > >> > 0x7f3185 poly_int<1u, generic_wide_int >> > false> > >::poly_int > >> >>(poly_int_full, std::pair const&) > >> > ../../gcc/gcc/poly-int.h:467 > >> > 0x7f3185 poly_int<1u, generic_wide_int >> > false> > >::poly_int > >> >>(std::pair const&) > >> > ../../gcc/gcc/poly-int.h:453 > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > >> > ../../gcc/gcc/rtl.h:2383 > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > >> > ../../gcc/gcc/rtx-vector-builder.h:122 > >> > 0xfd4e1b vector_builder >> > rtx_vector_builder>::elt(unsigned int) const > >> > ../../gcc/gcc/vector-builder.h:253 > >> > 0xfd4d11 rtx_vector_builder::build() > >> > 
../../gcc/gcc/rtx-vector-builder.cc:73 > >> > 0xc21d9c const_vector_from_tree > >> > ../../gcc/gcc/expr.cc:13487 > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > >> > expand_modifier, rtx_def**, bool) > >> > ../../gcc/gcc/expr.cc:11059 > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) > >> > ../../gcc/gcc/expr.h:310 > >> > 0xaee682 expand_return > >> > ../../gcc/gcc/cfgexpand.cc:3809 > >> > 0xaee682 expand_gimple_stmt_1 > >> > ../../gcc/gcc/cfgexpand.cc:3918 > >> > 0xaee682 expand_gimple_stmt > >> > ../../gcc/gcc/cfgexpand.cc:4044 > >> > 0xaf28f0 expand_gimple_basic_block > >> > ../../gcc/gcc/cfgexpand.cc:6100 > >> > 0xaf4996 execute > >> > ../../gcc/gcc/cfgexpand.cc:6835 > >> > > >> > IIUC, the issue is that fold_vec_perm returns a vector having float > >> > element > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries > >> > to derive element v[3], not present in the encoding, while trying to > >> > build rtx vector > >> > in rtx_vector_builder::build(): > >> > for (unsigned int i = 0; i < nelts; ++i) > >> > RTVEC_ELT (v, i) = elt (i); > >> > > >> > The attached patch tries to fix this by returning false from > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > >> > input vector has non-integral element type, so for VLA vectors, it > >> > will only build result with dup sequence (nelts_per_pattern < 3) for > >> > non-integral element type. 
> >> > > >> > For VLS vectors, this will still work for stepped sequence since it > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set: > >> > res_npattern = res_nelts and > >> > res_nelts_per_pattern = 1 > >> > > >> > and fold the above case to: > >> > F foo (F a, F b) > >> > { > >> >[local count: 1073741824]: > >> > return { 0.0, 9.0e+0, 0.0, 0.0 }; > >> > } > >> > > >> > But I am not sure if this is entirely correct, since: > >> > tree res = out_elts.build (); > >> > will canonicalize the encoding and may result in a stepped sequence > >> > (vector_builder::finalize() may reduce npatterns at the cost of > >> > increasing > >> > nelts_per_pattern) ? > >> > > >> > PS: This issue is now latent after PR111648 fix, since > >> > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns > >> > false because the corresponding pattern in arg0 is not a natural > >> > stepped sequence, and folds correctly using VLS exception. However, I > >> > guess the underlying issue of dealing with non-integral element types > >> > in fold_vec_perm_cst still remains ? > >> > > >> > The patch passes bootstrap+test with and without SVE on > >> > aarch64-linux-gnu, > >> > and on x86_64-linux-gnu. > >> > >> I think the problem is instead in the way that we're calculating > >> res_npatterns and res_nelts_per_pattern. > >> > >>
Re: PR111754
Prathamesh Kulkarni writes: > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford > wrote: >> >> Hi, >> >> Sorry the slow review. I clearly didn't think this through properly >> when doing the review of the original patch, so I wanted to spend >> some time working on the code to get a better understanding of >> the problem. >> >> Prathamesh Kulkarni writes: >> > Hi, >> > For the following test-case: >> > >> > typedef float __attribute__((__vector_size__ (16))) F; >> > F foo (F a, F b) >> > { >> > F v = (F) { 9 }; >> > return __builtin_shufflevector (v, v, 1, 0, 1, 2); >> > } >> > >> > Compiling with -O2 results in following ICE: >> > foo.c: In function ‘foo’: >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 >> > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); >> > | ^~ >> > 0x7f3185 wi::int_traits >> >>::decompose(long*, unsigned int, std::pair >> > const&) >> > ../../gcc/gcc/rtl.h:2314 >> > 0x7f3185 wide_int_ref_storage> > false>::wide_int_ref_storage >> >>(std::pair const&) >> > ../../gcc/gcc/wide-int.h:1089 >> > 0x7f3185 generic_wide_int >> >>::generic_wide_int >> >>(std::pair const&) >> > ../../gcc/gcc/wide-int.h:847 >> > 0x7f3185 poly_int<1u, generic_wide_int> > false> > >::poly_int >> >>(poly_int_full, std::pair const&) >> > ../../gcc/gcc/poly-int.h:467 >> > 0x7f3185 poly_int<1u, generic_wide_int> > false> > >::poly_int >> >>(std::pair const&) >> > ../../gcc/gcc/poly-int.h:453 >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) >> > ../../gcc/gcc/rtl.h:2383 >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const >> > ../../gcc/gcc/rtx-vector-builder.h:122 >> > 0xfd4e1b vector_builder> > rtx_vector_builder>::elt(unsigned int) const >> > ../../gcc/gcc/vector-builder.h:253 >> > 0xfd4d11 rtx_vector_builder::build() >> > ../../gcc/gcc/rtx-vector-builder.cc:73 >> > 0xc21d9c const_vector_from_tree >> > ../../gcc/gcc/expr.cc:13487 >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, >> > expand_modifier, 
rtx_def**, bool) >> > ../../gcc/gcc/expr.cc:11059 >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) >> > ../../gcc/gcc/expr.h:310 >> > 0xaee682 expand_return >> > ../../gcc/gcc/cfgexpand.cc:3809 >> > 0xaee682 expand_gimple_stmt_1 >> > ../../gcc/gcc/cfgexpand.cc:3918 >> > 0xaee682 expand_gimple_stmt >> > ../../gcc/gcc/cfgexpand.cc:4044 >> > 0xaf28f0 expand_gimple_basic_block >> > ../../gcc/gcc/cfgexpand.cc:6100 >> > 0xaf4996 execute >> > ../../gcc/gcc/cfgexpand.cc:6835 >> > >> > IIUC, the issue is that fold_vec_perm returns a vector having float element >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries >> > to derive element v[3], not present in the encoding, while trying to >> > build rtx vector >> > in rtx_vector_builder::build(): >> > for (unsigned int i = 0; i < nelts; ++i) >> > RTVEC_ELT (v, i) = elt (i); >> > >> > The attached patch tries to fix this by returning false from >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and >> > input vector has non-integral element type, so for VLA vectors, it >> > will only build result with dup sequence (nelts_per_pattern < 3) for >> > non-integral element type. >> > >> > For VLS vectors, this will still work for stepped sequence since it >> > will then use the "VLS exception" in fold_vec_perm_cst, and set: >> > res_npattern = res_nelts and >> > res_nelts_per_pattern = 1 >> > >> > and fold the above case to: >> > F foo (F a, F b) >> > { >> >[local count: 1073741824]: >> > return { 0.0, 9.0e+0, 0.0, 0.0 }; >> > } >> > >> > But I am not sure if this is entirely correct, since: >> > tree res = out_elts.build (); >> > will canonicalize the encoding and may result in a stepped sequence >> > (vector_builder::finalize() may reduce npatterns at the cost of increasing >> > nelts_per_pattern) ? 
>> > >> > PS: This issue is now latent after PR111648 fix, since >> > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns >> > false because the corresponding pattern in arg0 is not a natural >> > stepped sequence, and folds correctly using VLS exception. However, I >> > guess the underlying issue of dealing with non-integral element types >> > in fold_vec_perm_cst still remains ? >> > >> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu, >> > and on x86_64-linux-gnu. >> >> I think the problem is instead in the way that we're calculating >> res_npatterns and res_nelts_per_pattern. >> >> If the selector is a duplication of { a1, ..., an }, then the >> result will be a duplication of n elements, regardless of the shape >> of the other arguments. >> >> Similarly, if the selector is { a1, , an } followed by a >> duplication of { b1, ..., bn }, the result be n elements followed >> by a duplication of n
Re: PR111754
On Wed, 25 Oct 2023 at 02:58, Richard Sandiford wrote: > > Hi, > > Sorry the slow review. I clearly didn't think this through properly > when doing the review of the original patch, so I wanted to spend > some time working on the code to get a better understanding of > the problem. > > Prathamesh Kulkarni writes: > > Hi, > > For the following test-case: > > > > typedef float __attribute__((__vector_size__ (16))) F; > > F foo (F a, F b) > > { > > F v = (F) { 9 }; > > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > } > > > > Compiling with -O2 results in following ICE: > > foo.c: In function ‘foo’: > > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > > | ^~ > > 0x7f3185 wi::int_traits > >>::decompose(long*, unsigned int, std::pair > > const&) > > ../../gcc/gcc/rtl.h:2314 > > 0x7f3185 wide_int_ref_storage > false>::wide_int_ref_storage > >>(std::pair const&) > > ../../gcc/gcc/wide-int.h:1089 > > 0x7f3185 generic_wide_int > >>::generic_wide_int > >>(std::pair const&) > > ../../gcc/gcc/wide-int.h:847 > > 0x7f3185 poly_int<1u, generic_wide_int > false> > >::poly_int > >>(poly_int_full, std::pair const&) > > ../../gcc/gcc/poly-int.h:467 > > 0x7f3185 poly_int<1u, generic_wide_int > false> > >::poly_int > >>(std::pair const&) > > ../../gcc/gcc/poly-int.h:453 > > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > > ../../gcc/gcc/rtl.h:2383 > > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > > ../../gcc/gcc/rtx-vector-builder.h:122 > > 0xfd4e1b vector_builder > rtx_vector_builder>::elt(unsigned int) const > > ../../gcc/gcc/vector-builder.h:253 > > 0xfd4d11 rtx_vector_builder::build() > > ../../gcc/gcc/rtx-vector-builder.cc:73 > > 0xc21d9c const_vector_from_tree > > ../../gcc/gcc/expr.cc:13487 > > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > > expand_modifier, rtx_def**, bool) > > ../../gcc/gcc/expr.cc:11059 > > 0xaee682 expand_expr(tree_node*, 
rtx_def*, machine_mode, expand_modifier) > > ../../gcc/gcc/expr.h:310 > > 0xaee682 expand_return > > ../../gcc/gcc/cfgexpand.cc:3809 > > 0xaee682 expand_gimple_stmt_1 > > ../../gcc/gcc/cfgexpand.cc:3918 > > 0xaee682 expand_gimple_stmt > > ../../gcc/gcc/cfgexpand.cc:4044 > > 0xaf28f0 expand_gimple_basic_block > > ../../gcc/gcc/cfgexpand.cc:6100 > > 0xaf4996 execute > > ../../gcc/gcc/cfgexpand.cc:6835 > > > > IIUC, the issue is that fold_vec_perm returns a vector having float element > > type with res_nelts_per_pattern == 3, and later ICE's when it tries > > to derive element v[3], not present in the encoding, while trying to > > build rtx vector > > in rtx_vector_builder::build(): > > for (unsigned int i = 0; i < nelts; ++i) > > RTVEC_ELT (v, i) = elt (i); > > > > The attached patch tries to fix this by returning false from > > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > > input vector has non-integral element type, so for VLA vectors, it > > will only build result with dup sequence (nelts_per_pattern < 3) for > > non-integral element type. > > > > For VLS vectors, this will still work for stepped sequence since it > > will then use the "VLS exception" in fold_vec_perm_cst, and set: > > res_npattern = res_nelts and > > res_nelts_per_pattern = 1 > > > > and fold the above case to: > > F foo (F a, F b) > > { > >[local count: 1073741824]: > > return { 0.0, 9.0e+0, 0.0, 0.0 }; > > } > > > > But I am not sure if this is entirely correct, since: > > tree res = out_elts.build (); > > will canonicalize the encoding and may result in a stepped sequence > > (vector_builder::finalize() may reduce npatterns at the cost of increasing > > nelts_per_pattern) ? > > > > PS: This issue is now latent after PR111648 fix, since > > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns > > false because the corresponding pattern in arg0 is not a natural > > stepped sequence, and folds correctly using VLS exception. 
However, I > > guess the underlying issue of dealing with non-integral element types > > in fold_vec_perm_cst still remains ? > > > > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu, > > and on x86_64-linux-gnu. > > I think the problem is instead in the way that we're calculating > res_npatterns and res_nelts_per_pattern. > > If the selector is a duplication of { a1, ..., an }, then the > result will be a duplication of n elements, regardless of the shape > of the other arguments. > > Similarly, if the selector is { a1, , an } followed by a > duplication of { b1, ..., bn }, the result be n elements followed > by a duplication of n elements, regardless of the shape of the other > arguments. > > So for these two cases, res_npatterns and res_nelts_per_pattern > can come directly from the
Re: PR111754
Sigh, I knew I should have waited until the morning to proof-read
and send this.

Richard Sandiford writes:
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 40767736389..00fce4945a7 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1,
> const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -                             ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.
> */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -        = std::max (VECTOR_CST_NPATTERNS (arg0),
> -                    std::max (VECTOR_CST_NPATTERNS (arg1),
> -                              sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +         result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +         of N elements, the result is too.
>
> -      res_nelts_per_pattern
> -        = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -                    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -                              sel.encoding ().nelts_per_pattern ()));
> +     (3) If the selector is N elements followed by an interleaving
> +         of N linear series, the situation is more complex.
>
> +         valid_mask_for_fold_vec_perm_cst_p detects whether we
> +         can handle this case.  If we can, then each of the N linear
> +         series either (a) selects the same element each time or
> +         (b) selects a linear series from one of the input patterns.
> +
> +         If (b) holds for one of the linear series, the result
> +         will contain a linear series, and so the result will have
> +         the same shape as the selector.  If (a) holds for all of
> +         the lienar series, the result will be the same as (2) above.

linear

> +
> +         (b) can only hold if one of the inputs pattern has a

input patterns

Sorry for the typos.

Richard

> +         stepped encoding.  */
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +          && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +        res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
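The shape computation in the quoted hunk can be restated as a small standalone function. This is a sketch only: `Shape` and `result_shape` are hypothetical names, plain integers stand in for the VECTOR_CST accessors, and it assumes valid_mask_for_fold_vec_perm_cst_p has already accepted the selector.

```cpp
#include <cassert>

// Hypothetical plain-data stand-in for the result encoding parameters.
struct Shape {
  unsigned npatterns;
  unsigned nelts_per_pattern;
  unsigned encoded_nelts;
};

// Mirror of the logic in the hunk: take the shape from the selector, but
// drop a stepped selector (3 elements per pattern) down to 2 when neither
// input is stepped, since the result then cannot contain a linear series.
Shape result_shape(unsigned sel_npatterns, unsigned sel_nelts_per_pattern,
                   unsigned arg0_nelts_per_pattern,
                   unsigned arg1_nelts_per_pattern) {
  unsigned nelts_per_pattern = sel_nelts_per_pattern;
  if (nelts_per_pattern == 3
      && arg0_nelts_per_pattern < 3
      && arg1_nelts_per_pattern < 3)
    nelts_per_pattern = 2;
  return {sel_npatterns, nelts_per_pattern,
          sel_npatterns * nelts_per_pattern};
}
```

For example, a (2, 3) selector applied to two dup inputs gives four encoded elements, while the same selector with a stepped arg0 keeps three elements per pattern.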
Re: PR111754
Hi, Sorry the slow review. I clearly didn't think this through properly when doing the review of the original patch, so I wanted to spend some time working on the code to get a better understanding of the problem. Prathamesh Kulkarni writes: > Hi, > For the following test-case: > > typedef float __attribute__((__vector_size__ (16))) F; > F foo (F a, F b) > { > F v = (F) { 9 }; > return __builtin_shufflevector (v, v, 1, 0, 1, 2); > } > > Compiling with -O2 results in following ICE: > foo.c: In function ‘foo’: > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314 > 6 | return __builtin_shufflevector (v, v, 1, 0, 1, 2); > | ^~ > 0x7f3185 wi::int_traits >>::decompose(long*, unsigned int, std::pair > const&) > ../../gcc/gcc/rtl.h:2314 > 0x7f3185 wide_int_ref_storage false>::wide_int_ref_storage >>(std::pair const&) > ../../gcc/gcc/wide-int.h:1089 > 0x7f3185 generic_wide_int >>::generic_wide_int >>(std::pair const&) > ../../gcc/gcc/wide-int.h:847 > 0x7f3185 poly_int<1u, generic_wide_int false> > >::poly_int >>(poly_int_full, std::pair const&) > ../../gcc/gcc/poly-int.h:467 > 0x7f3185 poly_int<1u, generic_wide_int false> > >::poly_int >>(std::pair const&) > ../../gcc/gcc/poly-int.h:453 > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode) > ../../gcc/gcc/rtl.h:2383 > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const > ../../gcc/gcc/rtx-vector-builder.h:122 > 0xfd4e1b vector_builder rtx_vector_builder>::elt(unsigned int) const > ../../gcc/gcc/vector-builder.h:253 > 0xfd4d11 rtx_vector_builder::build() > ../../gcc/gcc/rtx-vector-builder.cc:73 > 0xc21d9c const_vector_from_tree > ../../gcc/gcc/expr.cc:13487 > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, > expand_modifier, rtx_def**, bool) > ../../gcc/gcc/expr.cc:11059 > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier) > ../../gcc/gcc/expr.h:310 > 0xaee682 expand_return > ../../gcc/gcc/cfgexpand.cc:3809 > 0xaee682 expand_gimple_stmt_1 > 
../../gcc/gcc/cfgexpand.cc:3918 > 0xaee682 expand_gimple_stmt > ../../gcc/gcc/cfgexpand.cc:4044 > 0xaf28f0 expand_gimple_basic_block > ../../gcc/gcc/cfgexpand.cc:6100 > 0xaf4996 execute > ../../gcc/gcc/cfgexpand.cc:6835 > > IIUC, the issue is that fold_vec_perm returns a vector having float element > type with res_nelts_per_pattern == 3, and later ICE's when it tries > to derive element v[3], not present in the encoding, while trying to > build rtx vector > in rtx_vector_builder::build(): > for (unsigned int i = 0; i < nelts; ++i) > RTVEC_ELT (v, i) = elt (i); > > The attached patch tries to fix this by returning false from > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and > input vector has non-integral element type, so for VLA vectors, it > will only build result with dup sequence (nelts_per_pattern < 3) for > non-integral element type. > > For VLS vectors, this will still work for stepped sequence since it > will then use the "VLS exception" in fold_vec_perm_cst, and set: > res_npattern = res_nelts and > res_nelts_per_pattern = 1 > > and fold the above case to: > F foo (F a, F b) > { >[local count: 1073741824]: > return { 0.0, 9.0e+0, 0.0, 0.0 }; > } > > But I am not sure if this is entirely correct, since: > tree res = out_elts.build (); > will canonicalize the encoding and may result in a stepped sequence > (vector_builder::finalize() may reduce npatterns at the cost of increasing > nelts_per_pattern) ? > > PS: This issue is now latent after PR111648 fix, since > valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns > false because the corresponding pattern in arg0 is not a natural > stepped sequence, and folds correctly using VLS exception. However, I > guess the underlying issue of dealing with non-integral element types > in fold_vec_perm_cst still remains ? > > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu, > and on x86_64-linux-gnu. 
I think the problem is instead in the way that we're calculating
res_npatterns and res_nelts_per_pattern.

If the selector is a duplication of { a1, ..., an }, then the
result will be a duplication of n elements, regardless of the shape
of the other arguments.

Similarly, if the selector is { a1, ..., an } followed by a
duplication of { b1, ..., bn }, the result will be n elements followed
by a duplication of n elements, regardless of the shape of the other
arguments.

So for these two cases, res_npatterns and res_nelts_per_pattern
can come directly from the selector's encoding.

If:

(1) the selector is an n-pattern stepped sequence
(2) the stepped part of each pattern selects from the same input pattern
(3) the stepped part of each pattern does not select the first element
    of the input pattern, or the full input pattern is stepped
    (your previous
PR111754
Hi,
For the following test-case:

typedef float __attribute__((__vector_size__ (16))) F;
F foo (F a, F b)
{
  F v = (F) { 9 };
  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
}

Compiling with -O2 results in the following ICE:
foo.c: In function ‘foo’:
foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
    6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
      |          ^~
0x7f3185 wi::int_traits >::decompose(long*, unsigned int, std::pair const&)
        ../../gcc/gcc/rtl.h:2314
0x7f3185 wide_int_ref_storage::wide_int_ref_storage >(std::pair const&)
        ../../gcc/gcc/wide-int.h:1089
0x7f3185 generic_wide_int >::generic_wide_int >(std::pair const&)
        ../../gcc/gcc/wide-int.h:847
0x7f3185 poly_int<1u, generic_wide_int > >::poly_int >(poly_int_full, std::pair const&)
        ../../gcc/gcc/poly-int.h:467
0x7f3185 poly_int<1u, generic_wide_int > >::poly_int >(std::pair const&)
        ../../gcc/gcc/poly-int.h:453
0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
        ../../gcc/gcc/rtl.h:2383
0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
        ../../gcc/gcc/rtx-vector-builder.h:122
0xfd4e1b vector_builder::elt(unsigned int) const
        ../../gcc/gcc/vector-builder.h:253
0xfd4d11 rtx_vector_builder::build()
        ../../gcc/gcc/rtx-vector-builder.cc:73
0xc21d9c const_vector_from_tree
        ../../gcc/gcc/expr.cc:13487
0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
        ../../gcc/gcc/expr.cc:11059
0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
        ../../gcc/gcc/expr.h:310
0xaee682 expand_return
        ../../gcc/gcc/cfgexpand.cc:3809
0xaee682 expand_gimple_stmt_1
        ../../gcc/gcc/cfgexpand.cc:3918
0xaee682 expand_gimple_stmt
        ../../gcc/gcc/cfgexpand.cc:4044
0xaf28f0 expand_gimple_basic_block
        ../../gcc/gcc/cfgexpand.cc:6100
0xaf4996 execute
        ../../gcc/gcc/cfgexpand.cc:6835

IIUC, the issue is that fold_vec_perm returns a vector having float element
type with res_nelts_per_pattern == 3, and later ICEs when it tries
to derive element v[3], which is not present in the encoding, while trying to
build the rtx vector in rtx_vector_builder::build():
  for (unsigned int i = 0; i < nelts; ++i)
    RTVEC_ELT (v, i) = elt (i);

The attached patch tries to fix this by returning false from
valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and the
input vector has a non-integral element type, so for VLA vectors, it
will only build a result with a dup sequence (nelts_per_pattern < 3) for
non-integral element types.

For VLS vectors, this will still work for a stepped sequence since it
will then use the "VLS exception" in fold_vec_perm_cst, and set:
res_npatterns = res_nelts and
res_nelts_per_pattern = 1

and fold the above case to:
F foo (F a, F b)
{
  [local count: 1073741824]:
  return { 0.0, 9.0e+0, 0.0, 0.0 };
}

But I am not sure if this is entirely correct, since:
tree res = out_elts.build ();
will canonicalize the encoding and may result in a stepped sequence
(vector_builder::finalize() may reduce npatterns at the cost of increasing
nelts_per_pattern) ?

PS: This issue is now latent after the PR111648 fix, since
valid_mask_for_fold_vec_perm_cst with sel = {1, 0, 1, ...} returns
false because the corresponding pattern in arg0 is not a natural
stepped sequence, and the case folds correctly using the VLS exception.
However, I guess the underlying issue of dealing with non-integral element
types in fold_vec_perm_cst still remains ?

The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
and on x86_64-linux-gnu.

Thanks,
Prathamesh

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 82299bb7f1d..cedfc9616e9 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10642,6 +10642,11 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
   if (sel_nelts_per_pattern < 3)
     return true;

+  /* If SEL contains stepped sequence, ensure that we are dealing with
+     integral vector_cst.  */
+  if (!INTEGRAL_TYPE_P (TREE_TYPE (TREE_TYPE (arg0))))
+    return false;
+
   for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
     {
       poly_uint64 a1 = sel[pattern + sel_npatterns];
diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
new file mode 100644
index 000..7c1c16875c7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef float __attribute__((__vector_size__ (16))) F;
+
+F foo (F a, F b)
+{
+  F v = (F) { 9 };
+  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
+/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
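As a rough illustration of what "deriving element v[3], not present in the encoding" means, the sketch below models a pattern-based vector encoding over plain integers. It is a simplified model, not the GCC implementation: `Encoding` and `elt` are hypothetical names, and elements are `int64_t` so that the step between two elements is ordinary subtraction. In the backtrace above the same kind of extrapolation goes through rtx_vector_builder::step, which is where the float case ICEs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of an encoded vector constant.
// nelts_per_pattern: 1 = duplicated element, 2 = leading element followed
// by a duplicated element, 3 = leading element followed by a linear series.
struct Encoding {
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<std::int64_t> elts;  // npatterns * nelts_per_pattern values
};

// Return element i of the full vector, extrapolating beyond the encoded
// elements when needed.
std::int64_t elt(const Encoding &e, unsigned i) {
  const unsigned encoded = e.npatterns * e.nelts_per_pattern;
  if (i < encoded)
    return e.elts[i];

  const unsigned pattern = i % e.npatterns;
  const unsigned count = i / e.npatterns;  // position within the pattern
  const unsigned final_i = encoded - e.npatterns + pattern;
  const std::int64_t final_elt = e.elts[final_i];
  if (e.nelts_per_pattern <= 2)
    return final_elt;  // the last encoded element simply repeats

  // Stepped pattern: continue the series defined by the last two elements.
  const std::int64_t prev = e.elts[final_i - e.npatterns];
  const std::int64_t step = final_elt - prev;
  return final_elt + static_cast<std::int64_t>(count - 2) * step;
}
```

The selector from the test case, 1, 0, 1, 2, fits this model as a single stepped pattern { 1, 0, 1 }: element 3 is not stored and is computed as 1 + (1 - 0) = 2. Computing such a step is only meaningful for integral elements in this model, which matches the concern about float vectors carrying a three-element encoding.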