Re: [PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006
On 10/03/16 16:18, Dominique d'Humières wrote:
> The test gfortran.dg/unconstrained_commons.f fails in the 32 bit mode. It
> needs some regexp

Indeed, confirmed on ARM, sorry for not spotting this earlier. I believe the
variable, if there is one, should always be called 'j', as it is in the
source. So how about this, tested on ARM, AArch64, x86_64?

gcc/testsuite/ChangeLog:

	* gfortran.dg/unconstrained_commons.f: Widen regexp to match j_.
---
 gcc/testsuite/gfortran.dg/unconstrained_commons.f | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
index f9fc471..bee67ab 100644
--- a/gcc/testsuite/gfortran.dg/unconstrained_commons.f
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -17,4 +17,4 @@
 ! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
 ! We should retain both a read and write of mycommon.x.
 ! { dg-final { scan-tree-dump-times " _\[0-9\]+ = mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
-! { dg-final { scan-tree-dump-times " mycommon\\.x\\\[_\[0-9\]+\\\] = _\[0-9\]+;" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times " mycommon\\.x\\\[j?_\[0-9\]+\\\] = _\[0-9\]+;" 1 "dom2" } }
-- 
1.9.1
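[Editorial aside: the only change in the widened pattern is the optional `j?`
before the SSA-name underscore, so the dump line matches whether the store
index is an anonymous SSA name (`_14`) or one derived from the source
variable `j` (`j_5`). A quick way to sanity-check such a pattern outside
DejaGnu — illustrative only, using grep's ERE rather than Tcl's regexp
engine, though `j?` behaves the same in both; the dump lines below are
made-up examples, not real dom2 output:]

```shell
# Two representative dom2 dump lines: one with a plain SSA name,
# one with a 'j'-derived name.  Both should match the widened pattern.
printf 'mycommon.x[_14] = _13;\nmycommon.x[j_5] = _7;\n' \
  | grep -cE 'mycommon\.x\[j?_[0-9]+\]'   # counts 2: both forms match
```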
Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute
On 04/03/16 17:24, Alan Lawrence wrote:
On 26/02/16 14:52, James Greenhalgh wrote:

gcc/ChangeLog:

	* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
	Rewrite, looking one level down for records and arrays.
---
 gcc/config/aarch64/aarch64.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, machine_mode mode,
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type)
 {
-  unsigned int alignment;
+  if (!type)
+    return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+    return 0;
 
-  if (type)
-    {
-      if (!integer_zerop (TYPE_SIZE (type)))
-	{
-	  if (TYPE_MODE (type) == mode)
-	    alignment = TYPE_ALIGN (type);
-	  else
-	    alignment = GET_MODE_ALIGNMENT (mode);
-	}
-      else
-	alignment = 0;
-    }
-  else
-    alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+    return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+    return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+    alignment = std::max (alignment, DECL_ALIGN (field));
 
   return alignment;
 }

Ping.

[snip]

I'm not proposing to backport these AArch64 changes, hence:

Ping^2. (For gcc 7?) Also tests
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01073.html .

Ping^3.

Cheers, Alan
[PATCH] Fix PR70013
In this PR, a packed structure containing bitfields loses part of its
constant-pool initialization in SRA. A fuller explanation is on the PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013#c11. In short, we need to
treat constant-pool entries like function parameters, as both come
'pre-initialized' before the function starts.

Bootstrapped + regtest (gcc, g++) on AArch64; same with addition of Ada, on
ARM and x86. OK for trunk?

gcc/ChangeLog:

	PR tree-optimization/70013
	* tree-sra.c (analyze_access_subtree): Also set grp_unscalarized_data
	for constant-pool entries.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/sra-20.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/sra-20.c | 20
 gcc/tree-sra.c                         |  3 ++-
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-20.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c b/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
new file mode 100644
index 000..5002c24
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -Wall" } */
+/* PR/70013, SRA of constant-pool loads removes initialization of part of d.  */
+#pragma pack (1)
+struct S0 {
+  unsigned f0 : 17;
+};
+
+int c;
+
+int
+main (int argc, char **argv)
+{
+  struct S0 d[] = { { 1 }, { 2 } };
+  struct S0 e = d[1];
+
+  c = d[0].f0;
+  __builtin_printf ("%x\n", e.f0);
+  return 0;
+}
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 72157ed..24eac6a 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2427,7 +2427,8 @@ analyze_access_subtree (struct access *root, struct access *parent,
 
   if (!hole || root->grp_total_scalarization)
     root->grp_covered = 1;
-  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
+  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL
+	   || constant_decl_p (root->base))
     root->grp_unscalarized_data = 1; /* not covered and written to */
   return sth_created;
 }
-- 
1.9.1
Re: [PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006
On 07/03/16 11:02, Alan Lawrence wrote:
On 04/03/16 13:27, Richard Biener wrote:
> I think to make it work with LTO you need to mark it 'Optimization'. Also
> it's about arrays so maybe
>
> 'Assume common declarations may be overridden with ones with a larger
> trailing array'
>
> also if we document it here we should eventually document it in invoke.texi.
>
> Not sure if "unknown commons" is a good term, maybe "unconstrained
> commons" instead?

All done; I doubt there is really a good word, unconstrained seems as good as
any. I've reused much the same wording in invoke.texi, unless you think there
is more to add.

On 04/03/16 13:33, Jakub Jelinek wrote:
> Also, isn't the *.opt description line supposed to end with a full stop?

Ah, yes, thanks. Is this version OK for trunk?

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
      Jakub Jelinek  <ja...@redhat.com>

	* common.opt (funconstrained-commons, flag_unconstrained_commons):
	New.
	* tree.c (array_at_struct_end_p): Do not limit to size of decl for
	DECL_COMMONS if flag_unconstrained_commons is set.
	* tree-dfa.c (get_ref_base_and_extent): Likewise.

And add to that:

	* doc/invoke.texi (Optimize Options): Add -funconstrained-commons.
	(funconstrained-commons): Document.

Thanks, Alan

gcc/testsuite/ChangeLog:

	* gfortran.dg/unconstrained_commons.f: New.
---
 gcc/common.opt                                    |  5 +
 gcc/doc/invoke.texi                               |  8 +++-
 gcc/testsuite/gfortran.dg/unconstrained_commons.f | 20
 gcc/tree-dfa.c                                    | 15 ++-
 gcc/tree.c                                        |  6 --
 5 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unconstrained_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..bbf79ef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2451,6 +2451,11 @@ fsplit-paths
 Common Report Var(flag_split_paths) Init(0) Optimization
 Split paths leading to loop backedges.
 
+funconstrained-commons
+Common Var(flag_unconstrained_commons) Optimization
+Assume common declarations may be overridden with ones with a larger
+trailing array.
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0a2a6f4..68933a1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -407,7 +407,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6659,6 +6659,12 @@
 the loop optimizer itself cannot prove that these assumptions are valid.
 If you use @option{-Wunsafe-loop-optimizations}, the compiler warns you
 if it finds this kind of loop.
 
+@item -funconstrained-commons
+@opindex funconstrained-commons
+This option tells the compiler that variables declared in common blocks
+(e.g. Fortran) may later be overridden with longer trailing arrays.  This
+prevents certain optimizations that depend on knowing the array bounds.
+
 @item -fcrossjumping
 @opindex fcrossjumping
 Perform cross-jumping transformation.
diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
new file mode 100644
index 000..f9fc471
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funconstrained-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funconstrained-commons is passed in.
+      SUBROUTINE FOO
+      IMPLICIT DOUBLE PRECISION (X)
+      INTEGER J
+      COMMON /MYCOMMON / X(1)
+      DO 10 J=1,1024
+         X(J+1)=X(J+7)
+   10 CONTINUE
+      RETURN
+      END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times " _\[0-9\]+ = mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times " mycommon\\.x\\\[_\[0-9\]+\\\] = _\[0-9\]+;" 1 "dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..f133abc 100644
--- a/gcc/tree-dfa
[PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006 (was: Re: [PATCH] Add -funknown-commons ...)
On 04/03/16 13:27, Richard Biener wrote:
> I think to make it work with LTO you need to mark it 'Optimization'. Also
> it's about arrays so maybe
>
> 'Assume common declarations may be overridden with ones with a larger
> trailing array'
>
> also if we document it here we should eventually document it in invoke.texi.
>
> Not sure if "unknown commons" is a good term, maybe "unconstrained
> commons" instead?

All done; I doubt there is really a good word, unconstrained seems as good as
any. I've reused much the same wording in invoke.texi, unless you think there
is more to add.

On 04/03/16 13:33, Jakub Jelinek wrote:
> Also, isn't the *.opt description line supposed to end with a full stop?

Ah, yes, thanks. Is this version OK for trunk?

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
      Jakub Jelinek  <ja...@redhat.com>

	* common.opt (funconstrained-commons, flag_unconstrained_commons):
	New.
	* tree.c (array_at_struct_end_p): Do not limit to size of decl for
	DECL_COMMONS if flag_unconstrained_commons is set.
	* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

	* gfortran.dg/unconstrained_commons.f: New.
---
 gcc/common.opt                                    |  5 +
 gcc/doc/invoke.texi                               |  8 +++-
 gcc/testsuite/gfortran.dg/unconstrained_commons.f | 20
 gcc/tree-dfa.c                                    | 15 ++-
 gcc/tree.c                                        |  6 --
 5 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unconstrained_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..bbf79ef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2451,6 +2451,11 @@ fsplit-paths
 Common Report Var(flag_split_paths) Init(0) Optimization
 Split paths leading to loop backedges.
 
+funconstrained-commons
+Common Var(flag_unconstrained_commons) Optimization
+Assume common declarations may be overridden with ones with a larger
+trailing array.
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0a2a6f4..68933a1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -407,7 +407,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6659,6 +6659,12 @@
 the loop optimizer itself cannot prove that these assumptions are valid.
 If you use @option{-Wunsafe-loop-optimizations}, the compiler warns you
 if it finds this kind of loop.
 
+@item -funconstrained-commons
+@opindex funconstrained-commons
+This option tells the compiler that variables declared in common blocks
+(e.g. Fortran) may later be overridden with longer trailing arrays.  This
+prevents certain optimizations that depend on knowing the array bounds.
+
 @item -fcrossjumping
 @opindex fcrossjumping
 Perform cross-jumping transformation.
diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
new file mode 100644
index 000..f9fc471
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funconstrained-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funconstrained-commons is passed in.
+      SUBROUTINE FOO
+      IMPLICIT DOUBLE PRECISION (X)
+      INTEGER J
+      COMMON /MYCOMMON / X(1)
+      DO 10 J=1,1024
+         X(J+1)=X(J+7)
+   10 CONTINUE
+      RETURN
+      END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times " _\[0-9\]+ = mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times " mycommon\\.x\\\[_\[0-9\]+\\\] = _\[0-9\]+;" 1 "dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..f133abc 100644
--- a/gcc/tree-dfa.c
+++ b/gcc/tree-dfa.c
@@ -612,9 +612,22 @@ get_ref_base_and_extent (tree exp, HOST_WIDE_INT *poffset,
 
   if (DECL_P (exp))
     {
+      if (flag_unconst
Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute
On 26/02/16 14:52, James Greenhalgh wrote:

gcc/ChangeLog:

	* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
	Rewrite, looking one level down for records and arrays.
---
 gcc/config/aarch64/aarch64.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, machine_mode mode,
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type)
 {
-  unsigned int alignment;
+  if (!type)
+    return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+    return 0;
 
-  if (type)
-    {
-      if (!integer_zerop (TYPE_SIZE (type)))
-	{
-	  if (TYPE_MODE (type) == mode)
-	    alignment = TYPE_ALIGN (type);
-	  else
-	    alignment = GET_MODE_ALIGNMENT (mode);
-	}
-      else
-	alignment = 0;
-    }
-  else
-    alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+    return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+    return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+    alignment = std::max (alignment, DECL_ALIGN (field));
 
   return alignment;
 }

Ping. If this is not appropriate for GCC 6, then is it OK for Stage 1 of
GCC 7?

I think this needs to be a GCC 7 fix at this point. Additionally, I'd like to
fully understand PR69841 before we take this patch. In particular: under what
circumstances can the C++ front end set DECL_ALIGN of a type to be wider than
we expect? Can we ever end up with 128-bit alignment of a template parameter
when we were expecting 32- or 64-bit alignment, and if we can, what are the
implications for this patch?

OK, so IIUC, we *should* be able to rely on DECL_ALIGN for the AAPCS64, as
PR/69841 occurred on gcc-5-branch because a C++ frontend change had not been
backported. I'm not proposing to backport these AArch64 changes, hence:

Ping^2. (For gcc 7?) Also tests
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01073.html .

Thanks, Alan
Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006
On 25/02/16 18:00, Alan Lawrence wrote:
On 22/02/16 12:03, Jakub Jelinek wrote:
> (f) A global command-line option, which we check alongside DECL_COMMON and
> further tests (basically, we want only DECL_COMMON decls that either have
> ARRAY_TYPE, or some other aggregate type with flexible array member or some
> other trailing array in the struct), and use the resulting predicate in
> places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
> predicate returns true, we assume the DECL_SIZE is
> "variable"/"unlimited"/whatever.
> The two spots known to me that would need to use this new predicate would be:
> tree.c (array_at_struct_end_p): [...]
> tree-dfa.c (get_ref_base_and_extent): [...]

Close enough to what I meant by (a)/(b), but never mind that, and I hadn't
looked at where the change for aggressive loop opts needed to be. However...

Well you are essentially writing the patch for me at this point (so,
thanks!), but here's a patch that puts that under a global -funknown-commons
flag. Testcase as before.

Bootstrapped (with flag overridden to true) on x86_64 and aarch64, check-gcc,
check-fortran, and tested 416.gamess on AArch64, where this patch enables
running *without* the -fno-aggressive-loop-opts previously required.

In the gcc testsuite, these fail with the option turned on:

gcc.dg/pr53265.c (test for warnings, line 137)
gcc.dg/pr53265.c (test for warnings, line 138)
gcc.dg/pr64277.c scan-tree-dump cunroll ... (*2)
gcc.dg/tree-ssa/cunroll-{1,2,3,4,9,10,11}.c scan-tree-dump cunroll/cunrolli/ivcanon (various)
gcc.dg/tree-ssa/loop-38.c scan-tree-dump cunrolli "Loop 1 iterates at most 11 times"

...which all appear harmless.

Thomas Koenig suggests (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368#c80)
that this be combined with some flag fiddling and warnings in the Fortran
front-end; this patch doesn't do that, as I'm not very familiar with the
frontends, but that can follow in a separate patch. (Thomas?)

OK for trunk?

Cheers, Alan

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
      Jakub Jelinek  <ja...@redhat.com>

	* common.opt (funknown-commons, flag_unknown_commons): New.
	* tree.c (array_at_struct_end_p): Do not limit to size of decl for
	DECL_COMMONS if flag_unknown_commons is set.
	* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

	* gfortran.dg/unknown_commons.f: New.

Ping. (Besides my original patch
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01356.html which we decided was
too fragile, I believe this is the only patch proposed that actually fixes
the miscompare in 416.gamess.)

Thanks, Alan
Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006
On 22/02/16 12:03, Jakub Jelinek wrote:
> (f) A global command-line option, which we check alongside DECL_COMMON and
> further tests (basically, we want only DECL_COMMON decls that either have
> ARRAY_TYPE, or some other aggregate type with flexible array member or some
> other trailing array in the struct), and use the resulting predicate in
> places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
> predicate returns true, we assume the DECL_SIZE is
> "variable"/"unlimited"/whatever.
> The two spots known to me that would need to use this new predicate would be:
> tree.c (array_at_struct_end_p): [...]
> tree-dfa.c (get_ref_base_and_extent): [...]

Close enough to what I meant by (a)/(b), but never mind that, and I hadn't
looked at where the change for aggressive loop opts needed to be. However...

Well you are essentially writing the patch for me at this point (so,
thanks!), but here's a patch that puts that under a global -funknown-commons
flag. Testcase as before.

Bootstrapped (with flag overridden to true) on x86_64 and aarch64, check-gcc,
check-fortran, and tested 416.gamess on AArch64, where this patch enables
running *without* the -fno-aggressive-loop-opts previously required.

In the gcc testsuite, these fail with the option turned on:

gcc.dg/pr53265.c (test for warnings, line 137)
gcc.dg/pr53265.c (test for warnings, line 138)
gcc.dg/pr64277.c scan-tree-dump cunroll ... (*2)
gcc.dg/tree-ssa/cunroll-{1,2,3,4,9,10,11}.c scan-tree-dump cunroll/cunrolli/ivcanon (various)
gcc.dg/tree-ssa/loop-38.c scan-tree-dump cunrolli "Loop 1 iterates at most 11 times"

...which all appear harmless.

Thomas Koenig suggests (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368#c80)
that this be combined with some flag fiddling and warnings in the Fortran
front-end; this patch doesn't do that, as I'm not very familiar with the
frontends, but that can follow in a separate patch. (Thomas?)

OK for trunk?

Cheers, Alan

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
      Jakub Jelinek  <ja...@redhat.com>

	* common.opt (funknown-commons, flag_unknown_commons): New.
	* tree.c (array_at_struct_end_p): Do not limit to size of decl for
	DECL_COMMONS if flag_unknown_commons is set.
	* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

	* gfortran.dg/unknown_commons.f: New.
---
 gcc/common.opt                              |  4
 gcc/testsuite/gfortran.dg/unknown_commons.f | 20
 gcc/tree-dfa.c                              | 15 ++-
 gcc/tree.c                                  |  6 --
 4 files changed, 42 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unknown_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..b5b1282 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2455,6 +2455,10 @@ funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
 
+funknown-commons
+Common Var(flag_unknown_commons)
+Assume common declarations may be overridden with unknown larger sizes
+
 funroll-loops
 Common Report Var(flag_unroll_loops) Optimization
 Perform loop unrolling when iteration count is known.
diff --git a/gcc/testsuite/gfortran.dg/unknown_commons.f b/gcc/testsuite/gfortran.dg/unknown_commons.f
new file mode 100644
index 000..97e3ce3
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unknown_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funknown-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funknown-commons is passed in.
+      SUBROUTINE FOO
+      IMPLICIT DOUBLE PRECISION (X)
+      INTEGER J
+      COMMON /MYCOMMON / X(1)
+      DO 10 J=1,1024
+         X(J+1)=X(J+7)
+   10 CONTINUE
+      RETURN
+      END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times " _\[0-9\]+ = mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times " mycommon\\.x\\\[_\[0-9\]+\\\] = _\[0-9\]+;" 1 "dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..017e98e 100644
--- a/gcc/tree-dfa.c
+++ b/gcc/tree-dfa.c
@@ -612,9 +612,22 @@ get_ref_base_and_extent (tree exp, HOST_WIDE_INT *poffset,
 
   if (DECL_P (exp))
     {
+      if (flag_unknown_commons
+	  && TREE_CODE (exp) == VAR_DECL && DECL_COMMON (exp))
+	{
+	  tree sz_tree = TYPE_SIZE (TREE_TYPE (exp));
+	  /* If size is unknown, or we have read to the end, assume there
Re: [PATCH, PR middle-end/68134] Reject scalar modes in default get_mask_mode hook
On 20/02/16 09:29, Ilya Enkovich wrote:
> 2016-02-19 20:36 GMT+03:00 Alan Lawrence <alan.lawre...@foss.arm.com>:
>> Mostly this is fairly straightforward, relatively little midend code is
>> required, and the backend cleans up quite a bit. However, I get stuck on
>> the case of singleton vectors (64x1). No surprises there, then...
>>
>> The PR/68134 fix makes the 'mask mode' for comparing 64x1 vectors into
>> BLKmode, so that we get past this in expand_vector_operations:
>>
>>   /* A scalar operation pretending to be a vector one.  */
>>   if (VECTOR_BOOLEAN_TYPE_P (type)
>>       && !VECTOR_MODE_P (TYPE_MODE (type))
>>       && TYPE_MODE (type) != BLKmode)
>>     return;
>>
>> and we do the operation piecewise. (Which is what we want; there is only
>> one piece!)
>>
>> However, with my vec_cmp + vcond_mask changes, dropping vconddidi, this
>> means we look for a vcond_maskdiblk and vec_cmpdiblk. Which doesn't really
>> feel right - it feels like the 64x1 mask should be a DImode, just like
>> other 64x1 vectors.
>
> The problem here is to distinguish vector mask of one DI element and DI
> scalar mask. We don't want to lower scalar mask manipulations because they
> are simple integer operations, not vector ones. Probably vector of a single
> DI should have V1DI mode and not pretend to be a scalar? This would make
> things easier.

Thanks for the quick reply, Ilya.

What's the difference between, as you say, a "simple integer operation" and
a "vector" operation of just one element? This is why we do *not* have
V1DImode in the AArch64 (or ARM) backends, but instead treat 64x1 vectors as
DImode - the operations are the same, so keeping them as the same mode
enables CSE and lots of other optimizations, plus we don't have to have two
near-identical copies (DI + V1DI) of many patterns, etc...

If the operations were on a "DI scalar mask", when would the first part of
that test, VECTOR_BOOLEAN_TYPE_P, hold?

Thanks, Alan
Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006
On 22/02/16 12:03, Jakub Jelinek wrote:
> (f) A global command-line option, which we check alongside DECL_COMMON and
> further tests (basically, we want only DECL_COMMON decls that either have
> ARRAY_TYPE, or some other aggregate type with flexible array member or some
> other trailing array in the struct), and use the resulting predicate in
> places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
> predicate returns true, we assume the DECL_SIZE is
> "variable"/"unlimited"/whatever.
> The two spots known to me that would need to use this new predicate would be:
> tree.c (array_at_struct_end_p):
> tree-dfa.c (get_ref_base_and_extent):

Well, with just those two, this fixes 416.gamess, and passes all tests in
gfortran, with only a few scan-dump/warning tests failing in gcc...

--Alan
Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute
On 22/01/16 17:16, Alan Lawrence wrote:
On 21/01/16 17:23, Alan Lawrence wrote:
On 18/01/16 17:10, Eric Botcazou wrote:
> Could you post the list of files that differ? How do they differ exactly?

Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that,
to try to identify exactly what the differences were... and it succeeded,
even with my pre-AAPCS64-update host compiler. So, this is probably a false
alarm; I'm bootstrapping again, after a rebase, to make sure...

--Alan

Ok, rebased onto a more recent build, and bootstrapping with Ada posed no
problems. Sorry for the noise. However, I had to drop the assert that
TYPE_FIELDS was non-null because of some C++ testcases.

Is this version OK for trunk?

--Alan

gcc/ChangeLog:

	* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
	Rewrite, looking one level down for records and arrays.
---
 gcc/config/aarch64/aarch64.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, machine_mode mode,
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type)
 {
-  unsigned int alignment;
+  if (!type)
+    return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+    return 0;
 
-  if (type)
-    {
-      if (!integer_zerop (TYPE_SIZE (type)))
-	{
-	  if (TYPE_MODE (type) == mode)
-	    alignment = TYPE_ALIGN (type);
-	  else
-	    alignment = GET_MODE_ALIGNMENT (mode);
-	}
-      else
-	alignment = 0;
-    }
-  else
-    alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+    return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+    return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+    alignment = std::max (alignment, DECL_ALIGN (field));
 
   return alignment;
 }

Ping. If this is not appropriate for GCC 6, then is it OK for Stage 1 of
GCC 7?

Thanks, Alan
Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006
On 19/02/16 17:52, Jakub Jelinek wrote:
> On Fri, Feb 19, 2016 at 05:42:34PM +0000, Alan Lawrence wrote:
>> This relates to FORTRAN code where different modules give different sizes
>> to the same array in a COMMON block (contrary to the fortran language
>> specification). SPEC have refused to patch the source code
>> (https://www.spec.org/cpu2006/Docs/faq.html#Run.05). Hence, this patch
>> provides a Fortran-specific option -funknown-commons that marks such
>> arrays as having unknown size - that is, NULL_TREE for both TYPE_SIZE and
>> max value of TYPE_DOMAIN. DECL_SIZE is preserved for e.g. output in
>> varasm.c.
>>
>> On AArch64, it fixes the 416.gamess issue, and allows compiling
>> 416.gamess without the -fno-aggressive-loop-optimizations previously
>> required (numerous other PRs relating to 416.gamess).
>>
>> I had to fix up a couple of places to deal with null TYPE_SIZE but
>
> in most cases I think it is wrong to touch TYPE_SIZE/TYPE_SIZE_UNIT, IMHO
> it is much better just to ignore DECL_SIZE/DECL_SIZE_UNIT in the selected
> few places (get_ref_base_and_extent, the tree-ssa-loop-niters.c analysis)
> if the switch is on, for selected decls (aggregates with flexible array
> members and other similar trailing arrays, arrays themselves; all only if
> DECL_COMMON).

So do you see...

(a) A global command-line option, which we check alongside DECL_COMMON, in
(all or some of the) places which currently deal with flexible array
members? (I guess the flag would have no effect for compiling C code... or

(b) do we require it to 'enable' the existing flexible array member
support?)

(c) A new flag on each DECL, which we check in the places dealing with
flexible array members, which we set on those decls in the fortran frontend
if a fortran-frontend-only command-line option is passed?

(d) A new flag on each DECL, which replaces the check for flexible array
members, plus a global command-line option, which controls both setting the
flag (on DECL_COMMONs) in the fortran frontend, and also setting it on
flexible array members in the C frontend? This might be 'cleanest' but is
also probably the most change, and I doubt I'm going to have time to do this
for GCC 6...

(e) Some other scheme that I've not understood yet?

Thanks, Alan
[PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006
This relates to FORTRAN code where different modules give different sizes to
the same array in a COMMON block (contrary to the Fortran language
specification). SPEC have refused to patch the source code
(https://www.spec.org/cpu2006/Docs/faq.html#Run.05). Hence, this patch
provides a Fortran-specific option -funknown-commons that marks such arrays
as having unknown size - that is, NULL_TREE for both TYPE_SIZE and the max
value of TYPE_DOMAIN. DECL_SIZE is preserved for e.g. output in varasm.c.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess
without the -fno-aggressive-loop-optimizations previously required (numerous
other PRs relating to 416.gamess).

I had to fix up a couple of places to deal with null TYPE_SIZE, but in most
cases a test for such was already present. (omp-low.c had too many, so I gave
up: in practice I think it's OK to just not use the new flag at the same time
as -fopenmp.)

Bootstrapped + check-gcc + check-g++ on AArch64 and x86. Also check-fortran
with a variant enabling the option in all cases (to allow compilation of C):
here, only gfortran.dg/gomp/appendix-a/a.24.1.f90 fails; this uses -fopenmp,
mentioned above.

The testcase fails both the FIND scan and one of the assignment scans if
-funknown-commons is removed. OK for trunk?

I'd be happy to move this under -std=legacy if the Fortran folks prefer.
(-funcommon-commons has also been humourously suggested!)

gcc/ChangeLog:

	* expr.c (store_field): Handle null TYPE_SIZE.
	* tree-vect-data-refs.c (vect_analyze_data_refs): Bail out if
	TYPE_SIZE is null.
	* tree-dfa.c (get_ref_base_and_extent): Keep maxsize==-1 when
	reading the whole of a DECL.

gcc/fortran/ChangeLog:

	* lang.opt: Add -funknown-commons.
	* trans-common.c (build_field): If flag_unknown_commons is on, set
	upper bound and size of arrays in common block to NULL_TREE.
	(create_common): Explicitly set DECL_SIZE and DECL_SIZE_UNIT.

gcc/testsuite/ChangeLog:

	* gfortran.dg/unknown_commons.f: New.
---
 gcc/expr.c                                  |  1 +
 gcc/fortran/lang.opt                        |  4
 gcc/fortran/trans-common.c                  | 35 +
 gcc/testsuite/gfortran.dg/unknown_commons.f | 20 +
 gcc/tree-dfa.c                              |  6 -
 gcc/tree-vect-data-refs.c                   |  8 +++
 6 files changed, 73 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unknown_commons.f

diff --git a/gcc/expr.c b/gcc/expr.c
index f4bac36..609bf59 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -6638,6 +6638,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, HOST_WIDE_INT bitpos,
 	 RHS isn't the same size as the bitfield, we must use bitfield
 	 operations.  */
       || (bitsize >= 0
+	  && TYPE_SIZE (TREE_TYPE (exp))
 	  && TREE_CODE (TYPE_SIZE (TREE_TYPE (exp))) == INTEGER_CST
 	  && compare_tree_int (TYPE_SIZE (TREE_TYPE (exp)), bitsize) != 0
 	  /* Except for initialization of full bytes from a CONSTRUCTOR, which
diff --git a/gcc/fortran/lang.opt b/gcc/fortran/lang.opt
index 45428d8..b1b4641 100644
--- a/gcc/fortran/lang.opt
+++ b/gcc/fortran/lang.opt
@@ -686,6 +686,10 @@ funderscoring
 Fortran Var(flag_underscoring) Init(1)
 Append underscores to externally visible names.
 
+funknown-commons
+Fortran Var(flag_unknown_commons)
+Treat arrays in common blocks as having unknown size.
+
 fwhole-file
 Fortran Ignore
 Does nothing. Preserved for backward compatibility.
diff --git a/gcc/fortran/trans-common.c b/gcc/fortran/trans-common.c
index 21d1928..bf99c56 100644
--- a/gcc/fortran/trans-common.c
+++ b/gcc/fortran/trans-common.c
@@ -284,8 +284,41 @@ build_field (segment_info *h, tree union_type, record_layout_info rli)
   unsigned HOST_WIDE_INT desired_align, known_align;
 
   name = get_identifier (h->sym->name);
+
+  gcc_assert (TYPE_P (h->field));
+  tree size = TYPE_SIZE (h->field);
+  tree size_unit = TYPE_SIZE_UNIT (h->field);
+  if (GFC_ARRAY_TYPE_P (h->field)
+      && !h->sym->value
+      && flag_unknown_commons)
+    {
+      gcc_assert (!GFC_DESCRIPTOR_TYPE_P (h->field));
+      gcc_assert (TREE_CODE (h->field) == ARRAY_TYPE);
+
+      /* Could be merged at link time with a larger array whose size we don't
+	 know.  */
+      static tree range_type = NULL_TREE;
+      if (!range_type)
+	range_type = build_range_type (gfc_array_index_type,
+				       gfc_index_zero_node, NULL_TREE);
+      tree type = copy_node (h->field);
+      TYPE_DOMAIN (type) = range_type;
+      TYPE_SIZE (type) = NULL_TREE;
+      TYPE_SIZE_UNIT (type) = NULL_TREE;
+      SET_TYPE_MODE (type, BLKmode);
+      gcc_assert (TYPE_CANONICAL (h->field) == h->field);
+      gcc_assert (TYPE_MAIN_VARIANT (h->field) == h->field);
+      TYPE_CANONICAL (type) = type;
+      TYPE_MAIN_VARIANT (type) = type;
+      layout_type (type);
+      h->field =
Re: [PATCH, PR middle-end/68134] Reject scalar modes in default get_mask_mode hook
On 17/11/15 11:49, Ilya Enkovich wrote: Hi, Default hook for get_mask_mode is supposed to return integer vector modes. This means it should reject scalar modes returned by mode_for_vector. Bootstrapped and regtested on x86_64-unknown-linux-gnu, regtested on aarch64-unknown-linux-gnu. OK for trunk? Thanks, Ilya -- gcc/ 2015-11-17 Ilya Enkovich  PR middle-end/68134 * targhooks.c (default_get_mask_mode): Filter out scalar modes returned by mode_for_vector. I've been playing around with a patch that expands arbitrary VEC_COND_EXPRs using the vec_cmp and vcond_mask optabs - allowing platforms to drop the ugly vcond pattern (for which a cross-product of modes is usually required) where vcond = vec_cmp + vcond_mask. (E.g. ARM and AArch64.) Mostly this is fairly straightforward: relatively little midend code is required, and the backend cleans up quite a bit. However, I get stuck on the case of singleton vectors (64x1). No surprises there, then... The PR/68134 fix makes the 'mask mode' for comparing 64x1 vectors into BLKmode, so that we get past this in expand_vector_operations: /* A scalar operation pretending to be a vector one. */ if (VECTOR_BOOLEAN_TYPE_P (type) && !VECTOR_MODE_P (TYPE_MODE (type)) && TYPE_MODE (type) != BLKmode) return; and we do the operation piecewise. (Which is what we want; there is only one piece!) However, with my vec_cmp + vcond_mask changes, dropping vconddidi, this means we look for a vcond_maskdiblk and vec_cmpdiblk. Which doesn't really feel right - it feels like the 64x1 mask should be a DImode, just like other 64x1 vectors. So, I'm left thinking - is there a better way to look for "scalar operations pretending to be vector ones", such that we can get a DImode vec through the above? What exactly do we want that check to do? In my AArch64 testing, I was able to take it out altogether and get all tests passing...? (I don't have AVX512 hardware.) Thanks, Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 04/02/16 09:53, Dominik Vogt wrote: On Wed, Feb 03, 2016 at 11:41:02AM +, Alan Lawrence wrote: On 26/01/16 12:23, Dominik Vogt wrote: On Mon, Dec 21, 2015 at 01:13:28PM +, Alan Lawrence wrote: ...the test passes with --param sra-max-scalarization-size-Ospeed. Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390. How did you test this on s390? For me, the test still fails unless I add -march=z13 (s390x). Sorry for the slow response, was away last week. On x86 host, I built a compiler: configure --enable-languages=c,c++,lto --target=s390-none-linux-gnu Looks like the test fails only on s390x (64-Bit) without -march=z13 but works on s390 (31-Bit). Ah, yes, I see. The loop is not unrolled for dom2, then the vectorizer kicks in but vectorizes only the initialization/loading, and dom can't see the redundancy in MEM[(int[8] *)] = { 0, 1 }; ... _23 = a[0]; as they aren't reading equivalent chunks of memory. Same problem as on Alpha, and powerpc64 (without -mcpu=power7/power8). Powerpc chose to XFAIL rather than add -mcpu=power7, but I'm OK with any testsuite workaround here, such as yours. Cheers, Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 26/01/16 12:23, Dominik Vogt wrote: On Mon, Dec 21, 2015 at 01:13:28PM +, Alan Lawrence wrote: ...the test passes with --param sra-max-scalarization-size-Ospeed. Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390. How did you test this on s390? For me, the test still fails unless I add -march=z13 (s390x). Sorry for the slow response, was away last week. On x86 host, I built a compiler: configure --enable-languages=c,c++,lto --target=s390-none-linux-gnu make all-gcc make check-gcc RUNTESTFLAGS=tree-ssa.exp=ssa-dom-cse-2.c and that shows the tests passing. gcc -v shows little further: Reading specs from ./gcc/specs COLLECT_GCC=./gcc/xgcc COLLECT_LTO_WRAPPER=./gcc/lto-wrapper Target: s390-none-linux-gnu Configured with: /work/alalaw01/src2/gcc/configure --enable-languages=c,c++,lto --target=s390-none-linux-gnu Thread model: posix gcc version 6.0.0 20151206 (experimental) (GCC) I speculate that perhaps -march=z13 is the default for s390 *linux*??? If you can send me alternative configury, or dumps from your failing compiler (dom{2,3}-details, optimized), I can take a look. Thanks, Alan
[PATCH][Testsuite] Fix PR66877
This is a scan-tree-dump failure in vect-over-widen-3-big-array.c, that occurs only on ARM - the only platform to have vect_widen_shift. Tested on arm-none-eabi (armv8-crypto-neon-fp, plus a non-neon variant), also aarch64 (token platform without vect_widen_shift). gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-over-widen-3-big-array.c: Only look for 1 vect_recog_over_widening_pattern in dump if we have vect_widen_shift. --- gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c index 1ca3128..69773a5 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c @@ -58,6 +58,7 @@ int main (void) return 0; } -/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target { ! vect_widen_shift } } } } */ +/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 1 "vect" { target vect_widen_shift } } } */ /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ -- 1.9.1
Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)
On 20/01/16 21:10, Christophe Lyon wrote: On 19 January 2016 at 15:51, Alan Lawrence <alan.lawre...@foss.arm.com> wrote: On 19/01/16 11:15, Christophe Lyon wrote: For neon_vdupn, I chose to implement neon_vdup_nv4hf and neon_vdup_nv8hf instead of updating the VX iterator because I thought it was not desirable to impact neon_vrev32. Well, the same instruction will suffice for vrev32'ing vectors of HF just as well as vectors of HI, so I think I'd argue that's harmless enough. To gain the benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though. Since this is more intrusive, I'd rather leave that part for later. OK? Sure. +#ifdef __ARM_BIG_ENDIAN + /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the + right value for vectors with 8 lanes. */ +#define __arm_lane(__vec, __idx) (__idx ^ 3) +#else +#define __arm_lane(__vec, __idx) __idx +#endif + Looks right, but sounds... my concern here is that I'm hoping at some point we will move the *other* vget/set_lane intrinsics to use GCC vector extensions too. At which time (unlike __aarch64_lane which can be used everywhere) this will be the wrong formula. Can we name (and/or comment) it to avoid misleading anyone? The key characteristic seems to be that it is for vectors of 16-bit elements only. I'm not sure I follow, here. Looking at the patterns for neon_vget_lane_*internal in neon.md, I can see 2 flavours: one for VD, one for VQ2. The latter uses "halfelts". Do you prefer that I create 2 macros (say __arm_lane and __arm_laneq), that would be similar to the aarch64 ones (by computing the number of lanes of the input vector), but the "q" one would use half the total number of lanes instead? That works for me! 
Sthg like: #define __arm_lane(__vec, __idx) NUM_LANES(__vec) - __idx #define __arm_laneq(__vec, __idx) (__idx & (NUM_LANES(__vec)/2)) + (NUM_LANES(__vec)/2 - __idx) //or similarly #define __arm_laneq(__vec, __idx) (__idx ^ (NUM_LANES(__vec)/2 - 1)) Alternatively I'd been thinking #define __arm_lane_32xN(__idx) __idx ^ 1 #define __arm_lane_16xN(__idx) __idx ^ 3 #define __arm_lane_8xN(__idx) __idx ^ 7 Bear in mind PR64893 that we had on AArch64 :-( Here is a new version, based on the comments above. I've also removed the addition of arm_fp_ok effective target since I added that in my other testsuite patch. OK now? Yes. I realize my worry about PR64893 doesn't apply here since we pass constant lane numbers / vector bounds into __builtin_arm_lane_check. Thanks! --Alan Thanks, Christophe Cheers, Alan
Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute
On 21/01/16 17:23, Alan Lawrence wrote: > On 18/01/16 17:10, Eric Botcazou wrote: >> >> Could you post the list of files that differ? How do they differ exactly? > > Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that, > to > try to identify exactly what the differences were, and it succeeded even > with > my pre-AAPCS64-update host compiler. So, this is probably a false alarm; I'm > bootstrapping again, after a rebase, to make sure... > > --Alan Ok, rebased onto a more recent build, and bootstrapping with Ada posed no problems. Sorry for the noise. However, I had to drop the assert that TYPE_FIELDS was non-null because of some C++ testcases. Is this version OK for trunk? --Alan gcc/ChangeLog: * gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment): Rewrite, looking one level down for records and arrays. --- gcc/config/aarch64/aarch64.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 9142ac0..b084f83 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, machine_mode mode, static unsigned int aarch64_function_arg_alignment (machine_mode mode, const_tree type) { - unsigned int alignment; + if (!type) +return GET_MODE_ALIGNMENT (mode); + if (integer_zerop (TYPE_SIZE (type))) +return 0; - if (type) -{ - if (!integer_zerop (TYPE_SIZE (type))) - { - if (TYPE_MODE (type) == mode) - alignment = TYPE_ALIGN (type); - else - alignment = GET_MODE_ALIGNMENT (mode); - } - else - alignment = 0; -} - else -alignment = GET_MODE_ALIGNMENT (mode); + gcc_assert (TYPE_MODE (type) == mode); + + if (!AGGREGATE_TYPE_P (type)) +return TYPE_ALIGN (TYPE_MAIN_VARIANT (type)); + + if (TREE_CODE (type) == ARRAY_TYPE) +return TYPE_ALIGN (TREE_TYPE (type)); + + unsigned int alignment = 0; + + for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field)) +alignment 
= std::max (alignment, DECL_ALIGN (field)); return alignment; } -- 1.9.1
[PATCH][Testsuite] Fix scan-tree-dump failures with vect_multiple_sizes
Since r230292, these tests in gcc.dg/vect have been failing on ARM, AArch64, and x86_64 with -march=haswell (among others - when prefer_avx128 is true): vect-outer-1-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-1.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-1a-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-1a.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-1b-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-1b.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-2b.c scan-tree-dump-times vect "grouped access in outer loop" 2 vect-outer-3b.c scan-tree-dump-times vect "grouped access in outer loop" 4 vect-reduc-dot-s8b.c scan-tree-dump-times vect "vect_recog_widen_mult_pattern: detected" 2 (plus all the -flto -ffat-lto-objects equivalents). This is because that commit changed vect_analyze_loop{,_2} to bail out early and avoid retrying with a different vector size on finding a fatal error; all the above tests are conditioned on { target vect_multiple_sizes }. Hence, drop the vect_multiple_sizes version of the scan-tree-dump, as we now expect those messages to show up once everywhere. The optional extra would be to add a message that vect_analyze_loop was failing with *fatal* error, and scan for that, but that doesn't really seem warranted. Tested vect.exp on aarch64-none-elf, arm-none-eabi, and x86_64 linux with -march=haswell. OK for trunk? Cheers, Alan gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-outer-1-big-array.c: Drop vect_multiple_sizes; use same scan-tree-dump-times on all platforms. * gcc.dg/vect/vect-outer-1.c: Likewise. * gcc.dg/vect/vect-outer-1a-big-array.c: Likewise. * gcc.dg/vect/vect-outer-1a.c: Likewise. * gcc.dg/vect/vect-outer-1b-big-array.c: Likewise. * gcc.dg/vect/vect-outer-1b.c: Likewise. * gcc.dg/vect/vect-outer-2b.c: Likewise. * gcc.dg/vect/vect-outer-3b.c: Likewise. 
* gcc.dg/vect/vect-reduc-dot-s8b.c: Likewise. --- gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-1.c| 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-1a.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-1b-big-array.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-1b.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-2b.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-outer-3b.c | 3 +-- gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c | 3 +-- 9 files changed, 9 insertions(+), 18 deletions(-) diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c index 6c61b09..63918ad 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c +++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c @@ -22,5 +22,4 @@ foo (){ } /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { target { ! vect_multiple_sizes } } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { target vect_multiple_sizes } } } */ +/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1.c b/gcc/testsuite/gcc.dg/vect/vect-outer-1.c index 5fdaa83..b1bcbc2 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-outer-1.c +++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1.c @@ -22,5 +22,4 @@ foo (){ } /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { target { ! 
vect_multiple_sizes } } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { target vect_multiple_sizes } } } */ +/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c b/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c index 68b25f9..98dfcfb 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c +++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c @@ -20,5 +20,4 @@ foo (){ } /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { target { ! vect_multiple_sizes } } } } */ -/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { target vect_multiple_sizes } } } */ +/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1a.c b/gcc/testsuite/gcc.dg/vect/vect-outer-1a.c index 85649af..0200fb4 100644 ---
Re: [PATCH 0/2][AArch64] Implement AAPCS64 updates for alignment attribute
On 18/01/16 17:10, Eric Botcazou wrote: Similarly to ARM, I note that Ada is affected. Indeed, with a gcc 4.9 host compiler, I saw a bootstrap miscompare iff including Ada; however, I was able to bootstrap Ada successfully, if I first built a GCC including this patch with --disable-bootstrap, and then used that as host compiler. The best explanation I can see for this is mismatched host vs built libraries and compiler being used together, something like Jakub's suggestion http://gcc.gnu.org/ml/gcc-patches/2015-11/msg00338.html. I don't feel I have the expertise for this, and am CCing the Ada maintainers in the hope they can help. That's a bit weird though because this should have also occurred for ARM when the ABI was broken the same way if the Ada bootstrap is not entirely correct. Now, as far as I know, this didn't occur for ARM during bootstrap but only during testing with make -k check. Or else could this be a parallel compilation bug? Could you post the list of files that differ? How do they differ exactly? Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that, to try to identify exactly what the differences were, and it succeeded even with my pre-AAPCS64-update host compiler. So, this is probably a false alarm; I'm bootstrapping again, after a rebase, to make sure... --Alan
Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c
On 19/01/16 09:46, Christophe Lyon wrote: On 19 January 2016 at 04:05, H.J. Lu <hjl.to...@gmail.com> wrote: On Thu, Dec 24, 2015 at 3:55 AM, Alan Lawrence <alan.lawre...@arm.com> wrote: This version changes the test cases to fix failures on some platforms, by rewriting the initializers so that they aren't pushed out to the constant pool. gcc/ChangeLog: * tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF using get_ref_base_and_extent. (equal_mem_array_ref_p): New. (hashable_expr_equal_p): Add call to previous. This caused: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69352 Hi Alan, This patch also caused regressions on arm-none-linux-gnueabihf with GCC configured as: --with-thumb --with-cpu=cortex-a57 --with-fpu=crypto-neon-fp-armv8 These tests now fail: gcc.dg/torture/pr61742.c -O2 (test for excess errors) gcc.dg/torture/pr61742.c -O2 -flto -fno-use-linker-plugin -flto-partition=none (test for excess errors) gcc.dg/torture/pr61742.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (test for excess errors) gcc.dg/torture/pr61742.c -O3 -g (test for excess errors) Hmm, I still see these passing, both natively on arm-none-linux-gnueabihf and with a cross-build. hf implies --with-float=hard, right? Do you see what the error messages are? Thanks, Alan
Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)
On 19/01/16 11:15, Christophe Lyon wrote: For neon_vdupn, I chose to implement neon_vdup_nv4hf and neon_vdup_nv8hf instead of updating the VX iterator because I thought it was not desirable to impact neon_vrev32. Well, the same instruction will suffice for vrev32'ing vectors of HF just as well as vectors of HI, so I think I'd argue that's harmless enough. To gain the benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though. Since this is more intrusive, I'd rather leave that part for later. OK? Sure. +#ifdef __ARM_BIG_ENDIAN + /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the + right value for vectors with 8 lanes. */ +#define __arm_lane(__vec, __idx) (__idx ^ 3) +#else +#define __arm_lane(__vec, __idx) __idx +#endif + Looks right, but sounds... my concern here is that I'm hoping at some point we will move the *other* vget/set_lane intrinsics to use GCC vector extensions too. At which time (unlike __aarch64_lane which can be used everywhere) this will be the wrong formula. Can we name (and/or comment) it to avoid misleading anyone? The key characteristic seems to be that it is for vectors of 16-bit elements only. I'm not sure I follow, here. Looking at the patterns for neon_vget_lane_*internal in neon.md, I can see 2 flavours: one for VD, one for VQ2. The latter uses "halfelts". Do you prefer that I create 2 macros (say __arm_lane and __arm_laneq), that would be similar to the aarch64 ones (by computing the number of lanes of the input vector), but the "q" one would use half the total number of lanes instead? That works for me! 
Sthg like: #define __arm_lane(__vec, __idx) NUM_LANES(__vec) - __idx #define __arm_laneq(__vec, __idx) (__idx & (NUM_LANES(__vec)/2)) + (NUM_LANES(__vec)/2 - __idx) //or similarly #define __arm_laneq(__vec, __idx) (__idx ^ (NUM_LANES(__vec)/2 - 1)) Alternatively I'd been thinking #define __arm_lane_32xN(__idx) __idx ^ 1 #define __arm_lane_16xN(__idx) __idx ^ 3 #define __arm_lane_8xN(__idx) __idx ^ 7 Bear in mind PR64893 that we had on AArch64 :-( Cheers, Alan
[PATCH][ARM] Remove neon_reinterpret, use casts
This cleans up the neon_reinterpret code on ARM in a similar way to AArch64. Rather than a builtin backing onto an expander that emits a mov insn, we can just use a cast, because GCC defines casts of vector types as keeping the same bit pattern. On armeb, this fixes previously-failing test: gcc.target/arm/crypto-vldrq_p128.c scan-assembler vld1.64\t{d[0-9]+-d[0-9]+}.* Bootstrap + check-gcc on arm-none-linux-gnueabihf; cross-tested armeb-none-eabi. OK for trunk? gcc/ChangeLog: * config/arm/arm-protos.h (neon_reinterpret): Remove. * config/arm/arm.c (neon_reinterpret): Remove. * config/arm/arm_neon_builtins.def (vreinterpretv8qi, vreinterpretv4hi, vreinterpretv2si, vreinterpretv2sf, vreinterpretdi, vreinterpretv16qi, vreinterpretv8hi, vreinterpretv4si, vreinterpretv4sf, vreinterpretv2di, vreinterpretti): Remove. * config/arm/neon.md (neon_vreinterpretv8qi, neon_vreinterpretv4hi, neon_vreinterpretv2si, neon_vreinterpretv2sf, neon_vreinterpretdi, neon_vreinterpretti, neon_vreinterpretv16qi, neon_vreinterpretv8hi, neon_vreinterpretv4si, neon_vreinterpretv4sf, neon_vreinterpretv2di): Remove. 
* config/arm/arm_neon.h (vreinterpret_p8_p16, vreinterpret_p8_f32, vreinterpret_p8_p64, vreinterpret_p8_s64, vreinterpret_p8_u64, vreinterpret_p8_s8, vreinterpret_p8_s16, vreinterpret_p8_s32, vreinterpret_p8_u8, vreinterpret_p8_u16, vreinterpret_p8_u32, vreinterpret_p16_p8, vreinterpret_p16_f32, vreinterpret_p16_p64, vreinterpret_p16_s64, vreinterpret_p16_u64, vreinterpret_p16_s8, vreinterpret_p16_s16, vreinterpret_p16_s32, vreinterpret_p16_u8, vreinterpret_p16_u16, vreinterpret_p16_u32, vreinterpret_f32_p8, vreinterpret_f32_p16, vreinterpret_f32_p64, vreinterpret_f32_s64, vreinterpret_f32_u64, vreinterpret_f32_s8, vreinterpret_f32_s16, vreinterpret_f32_s32, vreinterpret_f32_u8, vreinterpret_f32_u16, vreinterpret_f32_u32, vreinterpret_p64_p8, vreinterpret_p64_p16, vreinterpret_p64_f32, vreinterpret_p64_s64, vreinterpret_p64_u64, vreinterpret_p64_s8, vreinterpret_p64_s16, vreinterpret_p64_s32, vreinterpret_p64_u8, vreinterpret_p64_u16, vreinterpret_p64_u32, vreinterpret_s64_p8, vreinterpret_s64_p16, vreinterpret_s64_f32, vreinterpret_s64_p64, vreinterpret_s64_u64, vreinterpret_s64_s8, vreinterpret_s64_s16, vreinterpret_s64_s32, vreinterpret_s64_u8, vreinterpret_s64_u16, vreinterpret_s64_u32, vreinterpret_u64_p8, vreinterpret_u64_p16, vreinterpret_u64_f32, vreinterpret_u64_p64, vreinterpret_u64_s64, vreinterpret_u64_s8, vreinterpret_u64_s16, vreinterpret_u64_s32, vreinterpret_u64_u8, vreinterpret_u64_u16, vreinterpret_u64_u32, vreinterpret_s8_p8, vreinterpret_s8_p16, vreinterpret_s8_f32, vreinterpret_s8_p64, vreinterpret_s8_s64, vreinterpret_s8_u64, vreinterpret_s8_s16, vreinterpret_s8_s32, vreinterpret_s8_u8, vreinterpret_s8_u16, vreinterpret_s8_u32, vreinterpret_s16_p8, vreinterpret_s16_p16, vreinterpret_s16_f32, vreinterpret_s16_p64, vreinterpret_s16_s64, vreinterpret_s16_u64, vreinterpret_s16_s8, vreinterpret_s16_s32, vreinterpret_s16_u8, vreinterpret_s16_u16, vreinterpret_s16_u32, vreinterpret_s32_p8, vreinterpret_s32_p16, vreinterpret_s32_f32, 
vreinterpret_s32_p64, vreinterpret_s32_s64, vreinterpret_s32_u64, vreinterpret_s32_s8, vreinterpret_s32_s16, vreinterpret_s32_u8, vreinterpret_s32_u16, vreinterpret_s32_u32, vreinterpret_u8_p8, vreinterpret_u8_p16, vreinterpret_u8_f32, vreinterpret_u8_p64, vreinterpret_u8_s64, vreinterpret_u8_u64, vreinterpret_u8_s8, vreinterpret_u8_s16, vreinterpret_u8_s32, vreinterpret_u8_u16, vreinterpret_u8_u32, vreinterpret_u16_p8, vreinterpret_u16_p16, vreinterpret_u16_f32, vreinterpret_u16_p64, vreinterpret_u16_s64, vreinterpret_u16_u64, vreinterpret_u16_s8, vreinterpret_u16_s16, vreinterpret_u16_s32, vreinterpret_u16_u8, vreinterpret_u16_u32, vreinterpret_u32_p8, vreinterpret_u32_p16, vreinterpret_u32_f32, vreinterpret_u32_p64, vreinterpret_u32_s64, vreinterpret_u32_u64, vreinterpret_u32_s8, vreinterpret_u32_s16, vreinterpret_u32_s32, vreinterpret_u32_u8, vreinterpret_u32_u16, vreinterpretq_p8_p16, vreinterpretq_p8_f32, vreinterpretq_p8_p64, vreinterpretq_p8_p128, vreinterpretq_p8_s64, vreinterpretq_p8_u64, vreinterpretq_p8_s8, vreinterpretq_p8_s16, vreinterpretq_p8_s32, vreinterpretq_p8_u8, vreinterpretq_p8_u16, vreinterpretq_p8_u32, vreinterpretq_p16_p8, vreinterpretq_p16_f32, vreinterpretq_p16_p64, vreinterpretq_p16_p128, vreinterpretq_p16_s64, vreinterpretq_p16_u64, vreinterpretq_p16_s8, vreinterpretq_p16_s16, vreinterpretq_p16_s32, vreinterpretq_p16_u8, vreinterpretq_p16_u16,
[PATCH][ARM] Add movv4hf/v8hf expanders & later insns; disable VnHF immediates.
This fixes ICEs on armeb for float16x[48]_t vectors, e.g. in check_effective_target_arm_neon_fp_16_ok. At present, without the expander, moving v4hf/v8hf values around is done via subregs. On armeb, this ICEs because of REG_CANNOT_CHANGE_MODE_P. (On arm-*, moving via two subregs is less efficient than one native insn!) However, adding the expanders reveals a latent bug in the V4HF variant of *neon_mov, namely that vector constants are not handled properly in the neon_valid_immediate code. Hence, for now I've used a separate expander that disallows immediates, and disabled VnHF vectors as immediates in neon_valid_immediate_for_move; I'll file a PR for this. Also, to fix the advsimd-intrinsics/vcombine test, I had to add HF vector modes to the VX iterator and V_reg attribute, for vdup_n, as loading a vector of identical HF elements is now done by loading the scalar + vdup rather than forcing the vector out to the constant pool. On armeb, one of the ICEs this fixes is in the test program for check_effective_target_arm_neon_fp_16_ok. This means the advsimd-intrinsics vcvt_f16 test now runs (and passes), and also that the other tests now run with neon-fp16, rather than only neon as previously (on armeb). This reveals that the fp16 cases of vld1_lane and vset_lane are (and were) failing. Since those tests would previously have failed *if fp16 had been passed in*, I think this is still a step forward; one can still run the tests with an explicit non-fp16 multilib if the old behaviour is desired. Note the previous patch removes other uses of VQXMOV (not strictly a dependency, generating V4HF/V8HF reinterpret patterns is harmless, they just aren't used). Bootstrapped + check-gcc on arm-none-linux-gnueabihf; cross-tested armeb-none-eabi. gcc/ChangeLog: * config/arm/arm.c (neon_valid_immediate): Disallow vectors of HFmode. * config/arm/iterators.md (V_HF): New. (VQXMOV): Add V8HF. (VX): Add V4HF, V8HF. (V_reg): Add cases for V4HF, V8HF. * config/arm/vec-common.md (mov V_HF): New. 
--- gcc/config/arm/arm.c | 2 ++ gcc/config/arm/iterators.md | 8 ++-- gcc/config/arm/vec-common.md | 20 3 files changed, 28 insertions(+), 2 deletions(-) diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 3276b03..4fdba38 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -12371,6 +12371,8 @@ neon_valid_immediate (rtx op, machine_mode mode, int inverse, /* Vectors of float constants. */ if (GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT) { + if (GET_MODE_INNER (mode) == HFmode) + return -1; rtx el0 = CONST_VECTOR_ELT (op, 0); const REAL_VALUE_TYPE *r0; diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md index 974cf51..c5db868 100644 --- a/gcc/config/arm/iterators.md +++ b/gcc/config/arm/iterators.md @@ -59,6 +59,9 @@ ;; Integer and float modes supported by Neon and IWMMXT. (define_mode_iterator VALL [V2DI V2SI V4HI V8QI V2SF V4SI V8HI V16QI V4SF]) +;; Vectors of half-precision floats. +(define_mode_iterator V_HF [V4HF V8HF]) + ;; Integer and float modes supported by Neon and IWMMXT, except V2DI. (define_mode_iterator VALLW [V2SI V4HI V8QI V2SF V4SI V8HI V16QI V4SF]) @@ -99,7 +102,7 @@ (define_mode_iterator VQI [V16QI V8HI V4SI]) ;; Quad-width vector modes, with TImode added, for moves. -(define_mode_iterator VQXMOV [V16QI V8HI V4SI V4SF V2DI TI]) +(define_mode_iterator VQXMOV [V16QI V8HI V8HF V4SI V4SF V2DI TI]) ;; Opaque structure types wider than TImode. (define_mode_iterator VSTRUCT [EI OI CI XI]) @@ -160,7 +163,7 @@ (define_mode_iterator VMDQI [V4HI V2SI V8HI V4SI]) ;; Modes with 8-bit and 16-bit elements. -(define_mode_iterator VX [V8QI V4HI V16QI V8HI]) +(define_mode_iterator VX [V8QI V4HI V4HF V16QI V8HI V8HF]) ;; Modes with 8-bit elements. 
(define_mode_iterator VE [V8QI V16QI]) @@ -428,6 +431,7 @@ ;; Register width from element mode (define_mode_attr V_reg [(V8QI "P") (V16QI "q") (V4HI "P") (V8HI "q") +(V4HF "P") (V8HF "q") (V2SI "P") (V4SI "q") (V2SF "P") (V4SF "q") (DI "P") (V2DI "q") diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md index ce98f71..c27578a 100644 --- a/gcc/config/arm/vec-common.md +++ b/gcc/config/arm/vec-common.md @@ -38,6 +38,26 @@ } }) +;; This exists separately from the above pattern to exclude an immediate RHS. + +(define_expand "mov" + [(set (match_operand:V_HF 0 "nonimmediate_operand" "") + (match_operand:V_HF 1 "nonimmediate_operand" ""))] + "TARGET_NEON + || (TARGET_REALLY_IWMMXT && VALID_IWMMXT_REG_MODE (mode))" +{ + if (can_create_pseudo_p ()) +{ + if (!REG_P (operands[0])) + operands[1] = force_reg (mode, operands[1]); + else if (TARGET_NEON && CONSTANT_P (operands[1])) + { + operands[1] =
Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)
Thanks for working on this, Christophe, and sorry I missed the PR. You got further in fixing more things than I did though :). A couple of comments: > For the vec_set_internal and neon_vld1_dup patterns, I > switched to an existing iterator which already had the needed > V4HF/V8HF (so I switched to VD_LANE and VQ2). It's a separate issue, and I hadn't done this either, but looking again - I don't see any reason why we shouldn't apply VD->VD_LANE to the vec_extract standard name pattern too. (At present looks like we have vec_extractv8hf but no vec_extractv4hf ?) > For neon_vdupn, I chose to implement neon_vdup_nv4hf and > neon_vdup_nv8hf instead of updating the VX iterator because I thought > it was not desirable to impact neon_vrev32. Well, the same instruction will suffice for vrev32'ing vectors of HF just as well as vectors of HI, so I think I'd argue that's harmless enough. To gain the benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though. > @@ -5252,12 +5252,22 @@ vget_lane_s32 (int32x2_t __a, const int __b) > were marked always-inline so there were no call sites, the declaration > would nonetheless raise an error. Hence, we must use a macro instead. */ > > + /* For big-endian, GCC's vector indices are the opposite way around > + to the architectural lane indices used by Neon intrinsics. */ Not quite the opposite way around, as you take into account yourself! 'Reversed within each 64 bits', perhaps? > +#ifdef __ARM_BIG_ENDIAN > + /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the > + right value for vectors with 8 lanes. */ > +#define __arm_lane(__vec, __idx) (__idx ^ 3) > +#else > +#define __arm_lane(__vec, __idx) __idx > +#endif > + Looks right, but sounds... my concern here is that I'm hoping at some point we will move the *other* vget/set_lane intrinsics to use GCC vector extensions too. At which time (unlike __aarch64_lane which can be used everywhere) this will be the wrong formula. 
Can we name (and/or comment) it to avoid misleading anyone? The key characteristic seems to be that it is for vectors of 16-bit elements only. > @@ -5334,7 +5344,7 @@ vgetq_lane_s32 (int32x4_t __a, const int __b) > ({ \ >float16x8_t __vec = (__v); \ >__builtin_arm_lane_check (8, __idx); \ > - float16_t __res = __vec[__idx];\ > + float16_t __res = __vec[__arm_lane(__vec, __idx)]; \ In passing - the function name in the @@ header is of course misleading, this is #define vgetq_lane_f16 (and the later hunks) Thanks, Alan
Re: [PATCH 1/4] Make SRA scalarize constant-pool loads
On 24/12/15 11:53, Alan Lawrence wrote: Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on aarch64 and powerpc64. Prior to SRA handling constant pool decls, -fdump-tree-esra-details (at -O1 -g) had shown: : a = *.LC0; # DEBUG a$0 => MEM[(int[3] *)&*.LC0] a$4_3 = MEM[(int[3] *)&*.LC0 + 4B]; # DEBUG a$4 => a$4_3 a$8_4 = MEM[(int[3] *)&*.LC0 + 8B]; The previous patch changed this to: : SR.5_3 = *.LC0[0]; SR.6_4 = *.LC0[1]; SR.7_19 = *.LC0[2]; SR.8_21 = *.LC1[0]; SR.9_22 = *.LC1[1]; SR.10_23 = *.LC1[2]; # DEBUG a$0 => NULL // Note here a$4_24 = SR.6_4; # DEBUG a$4 => a$4_24 a$8_25 = SR.7_19; Turns out the DEBUG a$0 => NULL was produced in load_assign_lhs_subreplacements: if (lacc && lacc->grp_to_be_debug_replaced) { gdebug *ds; tree drhs; struct access *racc = find_access_in_subtree (sad->top_racc, offset, lacc->size); if (racc && racc->grp_to_be_replaced) { if (racc->grp_write) drhs = get_access_replacement (racc); else drhs = NULL; // <=== HERE } ... ds = gimple_build_debug_bind (get_access_replacement (lacc), drhs, gsi_stmt (sad->old_gsi)); Prior to the patch, we'd skipped around load_assign_lhs_subreplacements, because access_has_children_p(racc) (for racc = *.LC0) didn't hold in sra_modify_assign. I also added a constant_decl_p function, combining the two checks, plus some testcase fixes. Bootstrapped + check-gcc,g++ on x86_64, ARM, AArch64, also on powerpc64{,le}-none-linux-gnu *in combination with the other patches in the series* (I haven't tested the individual patches on PPC), plus Ada on ARM and x86_64. gcc/ChangeLog: PR target/63679 * tree-sra.c (disqualified_constants, constant_decl_p): New. (sra_initialize): Allocate disqualified_constants. (sra_deinitialize): Free disqualified_constants. (disqualify_candidate): Update disqualified_constants when appropriate. (create_access): Scan for constant-pool entries as we go along. (scalarizable_type_p): Add check against type_contains_placeholder_p. 
(maybe_add_sra_candidate): Allow constant-pool entries. (load_assign_lhs_subreplacements): Bind debug for constant pool vars. (initialize_constant_pool_replacements): New. (sra_modify_assign): Avoid mangling assignments created by previous, and don't generate writes into constant pool. (sra_modify_function_body): Call initialize_constant_pool_replacements. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/sra-17.c: New. * gcc.dg/tree-ssa/sra-18.c: New. Ping. (The next bit is false, unless you force SRA to happen more widely, but all the above stands) This also fixes a bunch of other guality tests on AArch64 that were failing prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), listed below. > Tests fixed on aarch64-none-linux-gnu: gcc.dg/guality/pr54970.c -O1 line 15 *p == 3 gcc.dg/guality/pr54970.c -O1 line 15 *q == 2 gcc.dg/guality/pr54970.c -O1 line 20 *p == 13 gcc.dg/guality/pr54970.c -O1 line 20 *q == 2 gcc.dg/guality/pr54970.c -O1 line 25 *p == 13 gcc.dg/guality/pr54970.c -O1 line 25 *q == 12 gcc.dg/guality/pr54970.c -O1 line 31 *p == 6 gcc.dg/guality/pr54970.c -O1 line 31 *q == 5 gcc.dg/guality/pr54970.c -O1 line 36 *p == 26 gcc.dg/guality/pr54970.c -O1 line 36 *q == 5 gcc.dg/guality/pr54970.c -O1 line 45 *p == 26 gcc.dg/guality/pr54970.c -O1 line 45 *q == 25 gcc.dg/guality/pr54970.c -O1 line 45 p[-1] == 25 gcc.dg/guality/pr54970.c -O1 line 45 p[-2] == 4 gcc.dg/guality/pr54970.c -O1 line 45 q[-1] == 4 gcc.dg/guality/pr54970.c -O1 line 45 q[1] == 26 gcc.dg/guality/pr56154-1.c -O1 line pr56154-1.c:20 x.a == 6 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s1.f == 5.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s1.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s2.f == 0.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s2.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s1.f == 5.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s1.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s2.f == 5.0 gcc.dg/guality/pr59776.c -O1 
line pr59776.c:20 s2.g == 6.0 gcc.dg/guality/sra-1.c -O1 line 21 a.j == 14 gcc.dg/guality/sra-1.c -O1 line 32 a[1] == 14 gcc.dg/guality/sra-1.c -O1 line 43 a.i == 4 gcc.dg/guality/sra-1.c -O1 line 43 a.j == 14 (and other optimization levels) On ppc64 bigendian, with the rest of the series (but I expect it's this p
Re: [PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.
On 15/01/16 10:07, Richard Biener wrote: On Fri, Jan 15, 2016 at 10:47 AM, Alan Lawrence <alan.lawre...@arm.com> wrote: On Thu, Jan 14, 2016 at 12:30 PM, Richard Biener <richard.guent...@gmail.com> wrote: The vuse test is not necessary + && (gimple_assign_rhs_code (def) == SSA_NAME + || is_gimple_min_invariant (gimple_assign_rhs1 (def and the is_gimple_min_invariant (rhs1) test is not sufficient if you consider - (-INT_MIN) with -ftrapv for example. Thanks, I didn't realize gimple_min_invariant would allow such cases. Well, the invariant would be -INT_MIN but gimple_assign_rhs_code (def) would be NEGATE_EXPR. Basically you forgot about unary operators. Hmm, shouldn't those have get_gimple_rhs_class(gimple_assign_rhs_code(stmt)) == GIMPLE_UNARY_RHS, rather than GIMPLE_SINGLE_RHS as checked for by gimple_assign_single_p? If SINGLE_RHS includes unary operators, the new version of the patch is as flawed as the previous, in that it drops the unary operator altogether. --Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
It seems the conclusion on PowerPC is to XFAIL the test on powerpc64 (there will be XPASSes with -mcpu=power7 or -mcpu=power8). Which is what the original patch does (https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01979.html). So, Ping. Thanks, Alan On 21/12/15 15:33, Bill Schmidt wrote: On Mon, 2015-12-21 at 15:22 +, Alan Lawrence wrote: On 21/12/15 14:59, Bill Schmidt wrote: On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a reduction); however, without that, similar code is generated to Alpha (the vectorizer decides the reduction is not worthwhile without SIMD support), and the test fails; hence, I've XFAILed for powerpc, but I think I could condition the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred? Fun. Does it work with -mcpu=power7? Yes, it works with -mcpu=power7, as well as -mcpu=power8, but not e.g. -mcpu=power6. Bill: What GCC DejaGNU incantation would you like to see? This sounds like more fallout from unaligned accesses being faster on POWER8 than previous hardware. What about conditioning the XFAIL on { powerpc*-*-* && { ! vect_hw_misalign } } -- does this work properly? Not on a stage1 compiler - check_p8vector_hw_available itself requires being able to run executables - I'll check on gcc112. However, both look like they're really about the host (ability to execute an asm instruction), not the target (/ability for gcc to output such an instruction) Hm, that looks like a pervasive problem for powerpc. There are a number of things that are supposed to be testing effective target but rely on check_p8vector_hw_available, which as you note requires executing an instruction and is really about the host. We need to clean that up; I should probably open a bug. Kind of amazed this has gotten past us for a couple of years. For now, just XFAILing for powerpc seems the best alternative for this test. Thanks, Bill --Alan
[PATCH 0/2][AArch64] Implement AAPCS64 updates for alignment attribute
These parallel the updates to ARM https://gcc.gnu.org/ml/gcc-patches/2015-07/msg00214.html following from Richard Earnshaw's proposal for updates to the AAPCS and AAPCS64, https://gcc.gnu.org/ml/gcc/2015-07/msg00040.html . On AArch64 we do not have the problem of broken profiledbootstrap (as originally on ARM), hence I do not propose to backport these to earlier GCC branches. Rather, the GCC 5 -> 6 transition seems a better time for any ABI change. (However, as previously, there should be no changes for types that are naturally aligned, only those using types with explicit alignment specifiers, which was not previously sanctioned by the AAPCS.) Similarly to ARM, I note that Ada is affected. Indeed, with a gcc 4.9 host compiler, I saw a bootstrap miscompare iff including Ada; however, I was able to bootstrap Ada successfully, if I first built a GCC including this patch with --disable-bootstrap, and then used that as host compiler. The best explanation I can see for this is mismatched host vs built libraries and compiler being used together, something like Jakub's suggestion http://gcc.gnu.org/ml/gcc-patches/2015-11/msg00338.html. I don't feel I have the expertise for this, and am CCing the Ada maintainers in the hope they can help. Bootstrapped + check-gcc + check-g++ on aarch64-none-linux-gnu and aarch64_be-none-elf. Also bootstrapped + check-ada on aarch64-none-linux-gnu with --disable-bootstrap host compiler, as above. OK for trunk? --Alan
[PATCH 2/2][AArch64] Tests of AAPCS64 updates for alignment attribute
Here I've added both tests using the abitest.h framework (which verifies values are passed in the correct registers as specified by the AAPCS64), and separate tests which verify that called functions read arguments from the same locations as they are passed. Hence, each test_align-N.c corresponds to rec_align-N.c. I've tried to stay consistent with the existing naming of e.g. test_10.c, test_align-1.c, va_arg-10.c, but would be happy to change to another scheme if preferred! (e.g. func-ret-1.c ?) Cheers, Alan gcc/testsuite/ChangeLog: * gcc.target/aarch64/aapcs64/aapcs64.exp: Also execute rec_*.c * gcc.target/aarch64/aapcs64/rec_align-5.c: New. * gcc.target/aarch64/aapcs64/rec_align-6.c: New. * gcc.target/aarch64/aapcs64/rec_align-7.c: New. * gcc.target/aarch64/aapcs64/rec_align-8.c: New. * gcc.target/aarch64/aapcs64/rec_align-9.c: New. * gcc.target/aarch64/aapcs64/test_align-5.c: New. * gcc.target/aarch64/aapcs64/test_align-6.c: New. * gcc.target/aarch64/aapcs64/test_align-7.c: New. * gcc.target/aarch64/aapcs64/test_align-8.c: New. * gcc.target/aarch64/aapcs64/test_align-9.c: New. * gcc.target/aarch64/aapcs64/rec_vaarg-1.c: New. * gcc.target/aarch64/aapcs64/rec_vaarg-2.c: New. 
--- .../gcc.target/aarch64/aapcs64/aapcs64.exp | 10 + .../gcc.target/aarch64/aapcs64/rec_align-5.c | 44 .../gcc.target/aarch64/aapcs64/rec_align-6.c | 45 + .../gcc.target/aarch64/aapcs64/rec_align-7.c | 47 ++ .../gcc.target/aarch64/aapcs64/rec_align-8.c | 37 + .../gcc.target/aarch64/aapcs64/rec_align-9.c | 41 +++ .../gcc.target/aarch64/aapcs64/rec_align_vaarg-1.c | 38 + .../gcc.target/aarch64/aapcs64/rec_align_vaarg-2.c | 28 + .../gcc.target/aarch64/aapcs64/test_align-5.c | 35 .../gcc.target/aarch64/aapcs64/test_align-6.c | 36 + .../gcc.target/aarch64/aapcs64/test_align-7.c | 38 + .../gcc.target/aarch64/aapcs64/test_align-8.c | 33 +++ .../gcc.target/aarch64/aapcs64/test_align-9.c | 47 ++ 13 files changed, 479 insertions(+) create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-6.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-7.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-9.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align_vaarg-1.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align_vaarg-2.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-5.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-6.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-7.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-8.c create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-9.c diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp b/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp index ac2f089..eb297c5 100644 --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp @@ -38,6 +38,16 @@ foreach src [lsort [glob -nocomplain $srcdir/$subdir/test_*.c]] { } } +# 
Test parameter receiving. +set additional_flags_for_rec $additional_flags +append additional_flags_for_rec " -fno-inline" +foreach src [lsort [glob -nocomplain $srcdir/$subdir/rec_*.c]] { +if {[runtest_file_p $runtests $src]} { + c-torture-execute [list $src] \ + $additional_flags_for_rec +} +} + # Test unnamed argument retrieval via the va_arg macro. foreach src [lsort [glob -nocomplain $srcdir/$subdir/va_arg-*.c]] { if {[runtest_file_p $runtests $src]} { diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c new file mode 100644 index 000..1b42c92 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c @@ -0,0 +1,44 @@ +/* Test AAPCS layout (alignment) for callee. */ + +/* { dg-do run { target aarch64*-*-* } } */ + +extern void abort (void); + +typedef __attribute__ ((__aligned__ (8))) int alignedint; + +alignedint a = 11; +alignedint b = 13; +alignedint c = 17; +alignedint d = 19; +alignedint e = 23; +alignedint f = 29; +alignedint g = 31; +alignedint h = 37; +alignedint i = 41; +alignedint j = 43; + +void +test_passing_many_alignedint (alignedint x0, alignedint x1, alignedint x2, + alignedint x3, alignedint x4, alignedint x5, + alignedint x6,
[PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute
gcc/ChangeLog: * gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment): Rewrite, looking one level down for records and arrays. --- gcc/config/aarch64/aarch64.c | 32 +--- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index ae4cfb3..8eb8c3e 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -1925,22 +1925,24 @@ aarch64_vfp_is_call_candidate (cumulative_args_t pcum_v, machine_mode mode, static unsigned int aarch64_function_arg_alignment (machine_mode mode, const_tree type) { - unsigned int alignment; + if (!type) +return GET_MODE_ALIGNMENT (mode); + if (integer_zerop (TYPE_SIZE (type))) +return 0; - if (type) -{ - if (!integer_zerop (TYPE_SIZE (type))) - { - if (TYPE_MODE (type) == mode) - alignment = TYPE_ALIGN (type); - else - alignment = GET_MODE_ALIGNMENT (mode); - } - else - alignment = 0; -} - else -alignment = GET_MODE_ALIGNMENT (mode); + gcc_assert (TYPE_MODE (type) == mode); + + if (!AGGREGATE_TYPE_P (type)) +return TYPE_ALIGN (TYPE_MAIN_VARIANT (type)); + + if (TREE_CODE (type) == ARRAY_TYPE) +return TYPE_ALIGN (TREE_TYPE (type)); + + unsigned int alignment = 0; + gcc_assert (TYPE_FIELDS (type)); + + for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field)) +alignment = std::max (alignment, DECL_ALIGN (field)); return alignment; } -- 1.9.1
Re: [PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.
On Thu, Jan 14, 2016 at 12:30 PM, Richard Biener wrote: > > The vuse test is not necessary > >> + && (gimple_assign_rhs_code (def) == SSA_NAME >> + || is_gimple_min_invariant (gimple_assign_rhs1 (def > > and the is_gimple_min_invariant (rhs1) test is not sufficient if you > consider - (-INT_MIN) with -ftrapv for example. Thanks, I didn't realize gimple_min_invariant would allow such cases. >> + /* Keep symbolic form, but look through obvious copies for constants. >> */ > > You're also looking for SSA names copied so the comment is slightly wrong. So it occurs to me that we do only need to look for constants: picking one SSA name instead of another (when both are outside the loop) doesn't really help SCEV; and it seems safer/less change to avoid fiddling around with which names are used. Moreover, given we are in LC _SSA_, I think we can look through degenerate phi's even of SSA_NAMES, without violating loop-closed-ness, so long as we only use cases where we eventually reach a constant. Therefore, how about this version? (Bootstrapped + check-gcc + check-g++ on x86_64, AArch64, ARM; again, fixing vectorization of pr65947-2.c). If this is OK - that leaves only the first patch to SRA, https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02117.html (the "extra" fixes I claimed in the guality testsuite were a testing mistake, but this version still fixes the regressions in guality pr54970.c from the original "ok with the comment added and a testcase (or two)" https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01948.html, as well as the comment + 2*testcase). I think I'd still like to propose these as a stage 3/4 fix (with --param sra-max-scalarization-size-Ospeed=32) of PR/63679, a regression against gcc 4.9. Thanks, Alan gcc/ChangeLog: * tree-scalar-evolution.c (follow_copies_to_constant): New. (analyze_initial_condition, analyze_scalar_evolution_1): Call previous. 
--- gcc/tree-scalar-evolution.c | 50 ++--- 1 file changed, 33 insertions(+), 17 deletions(-) diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c index 9b33693..470c3ea 100644 --- a/gcc/tree-scalar-evolution.c +++ b/gcc/tree-scalar-evolution.c @@ -1522,6 +1522,34 @@ analyze_evolution_in_loop (gphi *loop_phi_node, return evolution_function; } +/* Looks to see if VAR is a copy of a constant (via straightforward assignments + or degenerate phi's). If so, returns the constant; else, returns VAR. */ + +static tree +follow_copies_to_constant (tree var) +{ + tree res = var; + while (TREE_CODE (res) == SSA_NAME) +{ + gimple *def = SSA_NAME_DEF_STMT (res); + if (gphi *phi = dyn_cast <gphi *> (def)) + { + if (tree rhs = degenerate_phi_result (phi)) + res = rhs; + else + break; + } + else if (gimple_assign_single_p (def)) + /* Will exit loop if not an SSA_NAME. */ + res = gimple_assign_rhs1 (def); + else + break; +} + if (CONSTANT_CLASS_P (res)) +return res; + return var; +} + /* Given a loop-phi-node, return the initial conditions of the variable on entry of the loop. When the CCP has propagated constants into the loop-phi-node, the initial condition is @@ -1574,21 +1602,9 @@ analyze_initial_condition (gphi *loop_phi_node) if (init_cond == chrec_not_analyzed_yet) init_cond = chrec_dont_know; - /* During early loop unrolling we do not have fully constant propagated IL. - Handle degenerate PHIs here to not miss important unrollings. */ - if (TREE_CODE (init_cond) == SSA_NAME) -{ - gimple *def = SSA_NAME_DEF_STMT (init_cond); - if (gphi *phi = dyn_cast <gphi *> (def)) - { - tree res = degenerate_phi_result (phi); - if (res != NULL_TREE - /* Only allow invariants here, otherwise we may break -loop-closed SSA form. */ - && is_gimple_min_invariant (res)) - init_cond = res; - } -} + /* We may not have fully constant propagated IL. Handle degenerate PHIs here + to not miss important early loop unrollings. 
*/ + init_cond = follow_copies_to_constant (init_cond); if (dump_file && (dump_flags & TDF_SCEV)) { @@ -1968,8 +1984,8 @@ analyze_scalar_evolution_1 (struct loop *loop, tree var, tree res) if (bb == NULL || !flow_bb_inside_loop_p (loop, bb)) { - /* Keep the symbolic form. */ - res = var; + /* Keep symbolic form, but look through obvious copies for constants. */ + res = follow_copies_to_constant (var); goto set_and_end; } -- 1.9.1
Re: [PATCH] Tidy: remove reduc_xxx_optab migration code
On 14/01/16 12:22, Richard Biener wrote: On Thu, Jan 14, 2016 at 11:26 AM, Alan Lawrence <alan.lawre...@arm.com> wrote: If/when mips-ps-3d.md is moved from reduc_* to reduc_*_scal optabs (patch here: https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00311.html ), there will be no uses of the old reduc_* optabs remaining. This patch removes those optabs and the migration path. Bootstrapped + check-gcc + check-g++ on x86_64 and AArch64. I'm not sure whether such a code tidy is acceptable for stage 3, please advise? Yes, this is still very much welcome. Thanks, Richard. I took that as an approval, and committed r232373. Thanks, Alan
[PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.
(Previous message: https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02159.html) On Sat, Dec 26, 2015 at 18:58 PM, Richard Biener wrote: >> I'm not sure whether adding a pass_copy_prop is the right thing here, but >> since >> loop-header-copying can create such opportunities, it seems good to take >> them! > > Aww. I'd rather have general infrastructure (like SCEV) deal with > those non-propagated copies. > > There are likely other missed propagation / folding opportunities > caused from partial peeling. > > Richard. I take your second point, but I am concerned that this leads to duplication of copy-propagation code throughout the compiler? However, this patch does that. Making analyze_initial_condition (alone) follow copies is enough to solve my case of missed vectorization of pr65947-2.c; but I also used the routine in analyze_scalar_evolution_1, as it found copies to follow in both bootstrap and a significant number of testcases: c-c++-common/gomp/pr58472.c gcc.c-torture/execute/2422-1.c gcc.c-torture/execute/20041213-2.c gcc.c-torture/execute/20100430-1.c gcc.c-torture/execute/pr49712.c gcc.c-torture/execute/pr58640.c gcc.dg/Warray-bounds-13.c gcc.dg/graphite/block-{1,5,6}.c gcc.dg/graphite/interchange-12.c gcc.dg/graphite/interchange-mvt.c gcc.dg/graphite/pr19910.c gcc.dg/graphite/uns-block-1.c gcc.dg/loop-unswitch-{2,4}.c gcc.dg/pr59670.c gcc.dg/torture/matrix-{1,2,3,4,5,6}.c gcc.dg/torture/pr24750-1.c gcc.dg/torture/transpose-{1,2,3,4,5,6}.c ...some of which were resolved to constants. (Of course, this approach is not as thorough as adding a pass_copy_prop. E.g. it does *not* enable vectorization of the inlined copy in pr65947-9.c.) Bootstrapped + check-gcc + check-g++ on ARM, AArch64, x86_64. gcc/ChangeLog: * tree-scalar-evolution.c (follow_copies): New. (analyze_initial_condition, analyze_scalar_evolution_1): Call previous. 
--- gcc/tree-scalar-evolution.c | 53 ++--- 1 file changed, 36 insertions(+), 17 deletions(-) diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c index 9b33693..357eb8f 100644 --- a/gcc/tree-scalar-evolution.c +++ b/gcc/tree-scalar-evolution.c @@ -1522,6 +1522,37 @@ analyze_evolution_in_loop (gphi *loop_phi_node, return evolution_function; } +/* While VAR is an SSA_NAME whose definition is a straight copy of another name, + or (perhaps via a degenerate phi) a constant, follows that definition. + Returns the constant, or the earliest SSA_NAME whose definition is neither + constant nor copy. */ + +static tree +follow_copies (tree var) +{ + while (TREE_CODE (var) == SSA_NAME) +{ + gimple *def = SSA_NAME_DEF_STMT (var); + if (gphi *phi = dyn_cast <gphi *> (def)) + { + tree rhs = degenerate_phi_result (phi); + /* Don't follow degenerate phi's of SSA_NAMES, that can break +loop-closed SSA form. */ + if (rhs && is_gimple_min_invariant (rhs)) + var = rhs; + break; + } + else if (gimple_assign_single_p (def) + && !gimple_vuse (def) + && (gimple_assign_rhs_code (def) == SSA_NAME + || is_gimple_min_invariant (gimple_assign_rhs1 (def + var = gimple_assign_rhs1 (def); + else + break; +} + return var; +} + /* Given a loop-phi-node, return the initial conditions of the variable on entry of the loop. When the CCP has propagated constants into the loop-phi-node, the initial condition is @@ -1574,21 +1605,9 @@ analyze_initial_condition (gphi *loop_phi_node) if (init_cond == chrec_not_analyzed_yet) init_cond = chrec_dont_know; - /* During early loop unrolling we do not have fully constant propagated IL. - Handle degenerate PHIs here to not miss important unrollings. */ - if (TREE_CODE (init_cond) == SSA_NAME) -{ - gimple *def = SSA_NAME_DEF_STMT (init_cond); - if (gphi *phi = dyn_cast <gphi *> (def)) - { - tree res = degenerate_phi_result (phi); - if (res != NULL_TREE - /* Only allow invariants here, otherwise we may break -loop-closed SSA form. 
*/ - && is_gimple_min_invariant (res)) - init_cond = res; - } -} + /* We may not have fully constant propagated IL. Handle degenerate PHIs here + to not miss important early loop unrollings. */ + init_cond = follow_copies (init_cond); if (dump_file && (dump_flags & TDF_SCEV)) { @@ -1968,8 +1987,8 @@ analyze_scalar_evolution_1 (struct loop *loop, tree var, tree res) if (bb == NULL || !flow_bb_inside_loop_p (loop, bb)) { - /* Keep the symbolic form. */ - res = var; + /* Keep symbolic form, but look through obvious copies for constants. */ + res = follow_copies (var); goto set_and_end; } -- 1.9.1
Re: [PATCH][MIPS] Migrate reduction optabs in mips-ps-3d.md
On 07/01/16 12:47, Alan Lawrence wrote: Here's an updated version, also covering the min/max patterns I missed before. I've now managed to do some testing with a stage 1 compiler, by compiling all tests in gcc.dg/vect at -O2 -ftree-vectorize -mips3d -march=mips64r2 -mabi=n32 $x -ffast-math -ffinite-math-only. There were no changes in which files compiled (871 did, some did not due to libraries etc. missing in my stage 1 compiler), and no changes in any assembly output. The patterns were triggered on: fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, fast-math-vect-reduc-9.c, no-fast-math-vect16.c, pr66142.c, vect-outer-fir-big-array.c, vect-outer-fir.c, vect-outer-fir-lb-big-array.c, vect-outer-fir-lb.c, vect-reduc-10.c, vect-reduc-6.c I realize this is not completely representative of a 'proper' test run, in which different files are compiled with different options, but it provides reasonable confidence that I'm not changing any code generation. OK for trunk? (stage 3?) Ping. gcc/ChangeLog: * config/mips/mips-ps-3d.md (reduc_splus_v2sf): Remove. (reduc_plus_scal_v2sf): New. (reduc_smax_v2sf): Rename to... (reduc_smax_scal_v2sf): ...here, make result SFmode, add vec_extract. (reduc_smin_v2sf): Rename to... (reduc_smin_scal_v2sf): ...here, make result SFmode, add vec_extract. 
--- gcc/config/mips/mips-ps-3d.md | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/gcc/config/mips/mips-ps-3d.md b/gcc/config/mips/mips-ps-3d.md index 8bc7608..a93a2b9 100644 --- a/gcc/config/mips/mips-ps-3d.md +++ b/gcc/config/mips/mips-ps-3d.md @@ -371,13 +371,17 @@ [(set_attr "type" "fadd") (set_attr "mode" "SF")]) -(define_insn "reduc_splus_v2sf" - [(set (match_operand:V2SF 0 "register_operand" "=f") - (unspec:V2SF [(match_operand:V2SF 1 "register_operand" "f") - (match_dup 1)] -UNSPEC_ADDR_PS))] +(define_expand "reduc_plus_scal_v2sf" + [(match_operand:SF 0 "register_operand" "=f") + (match_operand:V2SF 1 "register_operand" "f")] "TARGET_HARD_FLOAT && TARGET_MIPS3D" - "") + { +rtx temp = gen_reg_rtx (V2SFmode); +emit_insn (gen_mips_addr_ps (temp, operands[1], operands[1])); +rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx; +emit_insn (gen_vec_extractv2sf (operands[0], temp, lane)); +DONE; + }) ; cvt.pw.ps - Floating Point Convert Paired Single to Paired Word (define_insn "mips_cvt_pw_ps" @@ -745,20 +749,26 @@ DONE; }) -(define_expand "reduc_smin_v2sf" - [(match_operand:V2SF 0 "register_operand") +(define_expand "reduc_smin_scal_v2sf" + [(match_operand:SF 0 "register_operand") (match_operand:V2SF 1 "register_operand")] "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT" { - mips_expand_vec_reduc (operands[0], operands[1], gen_sminv2sf3); + rtx temp = gen_reg_rtx (V2SFmode); + mips_expand_vec_reduc (temp, operands[1], gen_sminv2sf3); + rtx lane = BYTES_BIG_ENDIAN ? 
const1_rtx : const0_rtx; + emit_insn (gen_vec_extractv2sf (operands[0], temp, lane)); DONE; }) -(define_expand "reduc_smax_v2sf" - [(match_operand:V2SF 0 "register_operand") +(define_expand "reduc_smax_scal_v2sf" + [(match_operand:SF 0 "register_operand") (match_operand:V2SF 1 "register_operand")] "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT" { - mips_expand_vec_reduc (operands[0], operands[1], gen_smaxv2sf3); + rtx temp = gen_reg_rtx (V2SFmode); + mips_expand_vec_reduc (temp, operands[1], gen_smaxv2sf3); + rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx; + emit_insn (gen_vec_extractv2sf (operands[0], temp, lane)); DONE; })
[PATCH] Tidy: remove reduc_xxx_optab migration code
If/when mips-ps-3d.md is moved from reduc_* to reduc_*_scal optabs (patch here: https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00311.html ), there will be no uses of the old reduc_* optabs remaining. This patch removes those optabs and the migration path. Bootstrapped + check-gcc + check-g++ on x86_64 and AArch64. I'm not sure whether such a code tidy is acceptable for stage 3, please advise? Thanks, Alan gcc/ChangeLog: * doc/md.texi (reduc_smin_@var{m}, reduc_smax_@var{m}, reduc_umin_@var{m}, reduc_umax_@var{m}, reduc_splus_@var{m}, reduc_uplus_@var{m}): Remove. * expr.c (expand_expr_real_2): Remove expansion path for reduc_[us](min|max|plus) optabs. * optabs-tree.c (scalar_reduc_to_vector): Remove. * optabs-tree.h (scalar_reduc_to_vector): Remove. * optabs.def (reduc_smax_optab, reduc_smin_optab, reduc_splus_optab, reduc_umax_optab, reduc_umin_optab, reduc_uplus_optab): Remove. * tree-vect-loop.c (vectorizable_reduction): Remove test for reduc_[us](min|max|plus) optabs. --- gcc/doc/md.texi | 27 --- gcc/expr.c | 39 ++- gcc/optabs-tree.c| 20 gcc/optabs-tree.h| 1 - gcc/optabs.def | 7 --- gcc/tree-vect-loop.c | 12 6 files changed, 14 insertions(+), 92 deletions(-) diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index faf3910..68321dc 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5007,33 +5007,6 @@ raised and a quiet @code{NaN} is returned. All operands have mode @var{m}, which is a scalar or vector floating-point mode. These patterns are not allowed to @code{FAIL}. -@cindex @code{reduc_smin_@var{m}} instruction pattern -@cindex @code{reduc_smax_@var{m}} instruction pattern -@item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}} -Find the signed minimum/maximum of the elements of a vector. The vector is -operand 1, and the result is stored in the least significant bits of -operand 0 (also a vector). The output and input vector should have the same -modes. 
These are legacy optabs, and platforms should prefer to implement -@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}. - -@cindex @code{reduc_umin_@var{m}} instruction pattern -@cindex @code{reduc_umax_@var{m}} instruction pattern -@item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}} -Find the unsigned minimum/maximum of the elements of a vector. The vector is -operand 1, and the result is stored in the least significant bits of -operand 0 (also a vector). The output and input vector should have the same -modes. These are legacy optabs, and platforms should prefer to implement -@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}. - -@cindex @code{reduc_splus_@var{m}} instruction pattern -@cindex @code{reduc_uplus_@var{m}} instruction pattern -@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}} -Compute the sum of the signed/unsigned elements of a vector. The vector is -operand 1, and the result is stored in the least significant bits of operand 0 -(also a vector). The output and input vector should have the same modes. -These are legacy optabs, and platforms should prefer to implement -@samp{reduc_plus_scal_@var{m}}. 
- @cindex @code{reduc_smin_scal_@var{m}} instruction pattern @cindex @code{reduc_smax_scal_@var{m}} instruction pattern @item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}} diff --git a/gcc/expr.c b/gcc/expr.c index 8973893..e1ed44d 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9164,35 +9164,16 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode, this_optab = optab_for_tree_code (code, type, optab_default); machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0)); - if (optab_handler (this_optab, vec_mode) != CODE_FOR_nothing) - { - struct expand_operand ops[2]; - enum insn_code icode = optab_handler (this_optab, vec_mode); - - create_output_operand (&ops[0], target, mode); - create_input_operand (&ops[1], op0, vec_mode); - if (maybe_expand_insn (icode, 2, ops)) - { - target = ops[0].value; - if (GET_MODE (target) != mode) - return gen_lowpart (tmode, target); - return target; - } - } - /* Fall back to optab with vector result, and then extract scalar. */ - this_optab = scalar_reduc_to_vector (this_optab, type); -temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp); -gcc_assert (temp); -/* The tree code produces a scalar result, but (somewhat by convention) - the optab produces a vector with the result in element 0 if - little-endian, or element N-1 if big-endian. So pull the scalar - result out of that element. */ -int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) -
Re: [AARCH64][ACLE] Implement __ARM_FP_FENV_ROUNDING in aarch64 backend.
On 11/01/16 10:28, Marcus Shawcroft wrote:

On 11 January 2016 at 08:12, Bilyan Borisov wrote:

2015-XX-XX  Bilyan Borisov

	* config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins): New macro
	definition.

gcc/testsuite/

2015-XX-XX  Bilyan Borisov

	* gcc.target/aarch64/fesetround-checking-baremetal.c: New.
	* gcc.target/aarch64/fesetround-checking-linux.c: Likewise.

For new test cases in gcc.target/aarch64 use the naming convention here https://gcc.gnu.org/wiki/TestCaseWriting, in this case: fesetround_checking_baremetal_1.c, fesetround_checking_linux_1.c. If we are distinguishing between glibc/musl/bionic/uclibc then the use of the name "linux" in the last test case would seem to be inappropriate (all of glibc/musl/bionic and uclibc can be used on top of a linux kernel). Suggest s/linux/glibc/ in the test case name.

Cheers
/Marcus

However, the test doesn't really look at whether we're using glibc vs musl/bionic/uclibc, only at whether we are targeting -linux-gnu or -none-elf. Indeed, IIUC, the test would fail if one (somehow) built a gcc using musl/bionic/uclibc, as I'm not aware of us having a mechanism to detect that from the testsuite... Just my two cents!

Alan
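For context, a minimal sketch of how user code might key off the macro under discussion; the helper function name is mine, not from the patch, and __ARM_FP_FENV_ROUNDING is only predefined on targets where the compiler believes fesetround support is available:

```c
/* Sketch: report whether the ACLE macro says the rounding mode can be
   changed at runtime via fesetround.  Hypothetical helper name;
   __ARM_FP_FENV_ROUNDING is the ACLE-defined macro the patch adds.  */
int
arm_fenv_rounding_supported (void)
{
#ifdef __ARM_FP_FENV_ROUNDING
  return 1;   /* e.g. a glibc-hosted AArch64 toolchain */
#else
  return 0;   /* e.g. a bare-metal -none-elf toolchain */
#endif
}
```

Whether the two tests should check this by C library or merely by target triplet is exactly the question raised above.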
[PATCH] Cleanup vect testsuite includes
This was an attempt to make more of the vect testsuite compilable with a stage-1 compiler, i.e. without standard header files like stdlib.h, to ease looking for differences in assembly output. (It is still necessary to comment out most of tree-vect.h to do this, but at least such temporary/local changes can be restricted to one file.)

Inclusion of stdlib.h and signal.h is quite inconsistent, with some files explicitly declaring common functions like abort, and others #including the header, sometimes totally unnecessarily. I left files using malloc, calloc and free as is, though I guess the same treatment could be applied there.

Tested (natively) on x86_64-none-linux-gnu and aarch64-none-linux-gnu. Is this OK for trunk?

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/fast-math-bb-slp-call-3.c: Declare functions as 'extern'
	rather than #including math.h & stdlib.h.
	* gcc.dg/vect/pr47001.c: Declare abort as 'extern', remove stdlib.h.
	* gcc.dg/vect/pr49771.c: Likewise.
	* gcc.dg/vect/vect-10-big-array.c: Likewise.
	* gcc.dg/vect/vect-neg-store-1.c: Likewise.
	* gcc.dg/vect/vect-neg-store-2.c: Likewise.
	* gcc.dg/vect/slp-37.c: Change NULL to 0, remove stdlib.h.
	* gcc.dg/vect/pr40254.c: Remove unnecessary include of stdlib.h.
	* gcc.dg/vect/pr44507.c: Likewise.
	* gcc.dg/vect/pr45902.c: Likewise.
	* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
	* gcc.dg/vect/vect-117.c: Likewise.
	* gcc.dg/vect/vect-99.c: Likewise.
	* gcc.dg/vect/vect-aggressive-1.c: Likewise.
	* gcc.dg/vect/vect-cond-1.c: Likewise.
	* gcc.dg/vect/vect-cond-2.c: Likewise.
	* gcc.dg/vect/vect-cond-3.c: Likewise.
	* gcc.dg/vect/vect-cond-4.c: Likewise.
	* gcc.dg/vect/vect-mask-load-1.c: Likewise.
	* gcc.dg/vect/vect-mask-loadstore-1.c: Likewise.
	* gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
	* gcc.dg/vect/vect-over-widen-1.c: Likewise.
	* gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
	* gcc.dg/vect/vect-over-widen-2.c: Likewise.
	* gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
	* gcc.dg/vect/vect-over-widen-3.c: Likewise.
	* gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
	* gcc.dg/vect/vect-over-widen-4.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise.
	* gcc.dg/vect/vect-widen-mult-half.c: Likewise.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c: Remove unnecessary
	include of signal.h.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c: Likewise.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c: Likewise.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c: Likewise.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c: Likewise.
	* gcc.dg/vect/no-trapping-math-vect-ifcvt-16.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-16.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-17.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-2.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-3.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-4.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-5.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-6.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-7.c: Likewise.
	* gcc.dg/vect/vect-ifcvt-9.c: Likewise.
	* gcc.dg/vect/vect-outer-5.c: Likewise.
	* gcc.dg/vect/vect-outer-6.c: Likewise.
	* gcc.dg/vect/vect-strided-u8-i8-gap4-unknown.c: Remove unnecessary
	include of stdio.h.
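The pattern applied throughout the cleanup is roughly the following (a sketch, not a file from the patch):

```c
/* Before: the test pulled in a whole standard header just for one
   prototype, which a stage-1 compiler without headers cannot parse:
     #include <stdlib.h>
   After: declare only what the test actually calls.  */
extern void abort (void);

/* Typical vect-test shape: verify a result and abort on miscompare.  */
int
check_result (int x)
{
  if (x != 2)
    abort ();
  return 0;
}
```

The 'extern' declaration is enough for the test to link against the C library's abort, with no header dependency.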
---
 gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c        | 8 ++--
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c | 1 -
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-dv-2.c               | 1 -
 gcc/testsuite/gcc.dg/vect/pr40254.c                        | 1 -
 gcc/testsuite/gcc.dg/vect/pr44507.c                        | 1 -
 gcc/testsuite/gcc.dg/vect/pr45902.c                        | 1 -
 gcc/testsuite/gcc.dg/vect/pr47001.c                        | 2 +-
 gcc/testsuite/gcc.dg/vect/pr49771.c                        | 3 ++-
 gcc/testsuite/gcc.dg/vect/slp-37.c                         | 5 ++---
 gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c            | 1 -
 gcc/testsuite/gcc.dg/vect/vect-10-big-array.c              | 3 ++-
 gcc/testsuite/gcc.dg/vect/vect-117.c                       | 1 -
 gcc/testsuite/gcc.dg/vect/vect-99.c                        | 1 -
 gcc/testsuite/gcc.dg/vect/vect-aggressive-1.c              | 1 -
 gcc/testsuite/gcc.dg/vect/vect-cond-1.c                    | 1 -
 gcc/testsuite/gcc.dg/vect/vect-cond-2.c
[PATCH] Fix PR68707
Here's an alternative patch, using the hunk from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707#c16, which 'improves' (heh, we think) on the previous by allowing SLP to proceed where the loads are strided, for example, slp-perm-11.c. Also I fix the testsuite failures (looking for SLP where we've now decided to do load/store-lanes instead). I used simple scan-tree-dump (rather than -times) because load_lanes typically appears many times in the log for each instance output, and it felt fragile to look for a specific number.

Note, this doesn't fix PR67323 - that's an example where unpermuted SLP is still (a bit) worse than load-lanes, as it needs unrolling * 2. The previously-posted patch doesn't fix this either, however (https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01620.html).

Bootstrapped + check-gcc + check-g++ on x86_64 (no changes), arm and aarch64 (various new scan-tree-dump tests all passing, and PASS->FAIL for O3-pr36098.c).

gcc/ChangeLog:

2016-01-XX  Alan Lawrence  <alan.lawre...@arm.com>
	    Richard Biener  <rguent...@suse.de>

	* tree-vect-slp.c (vect_analyze_slp_instance): Cancel permuted SLP
	instances that can be handled via vect_load_lanes.

gcc/testsuite/ChangeLog:

	* lib/target-supports.exp (check_effective_target_vect_load_lanes): New.
	* gcc.dg/vect/slp-perm-1.c: Look for vect_load_lanes instead of SLP on
	platforms supporting it.
	* gcc.dg/vect/slp-perm-2.c: Likewise.
	* gcc.dg/vect/slp-perm-3.c: Likewise.
	* gcc.dg/vect/slp-perm-5.c: Likewise.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-perm-8.c: Likewise.
	* gcc.dg/vect/slp-perm-6.c: Look for vect_load_lanes in addition to SLP
	on platforms supporting it.
---
 gcc/testsuite/gcc.dg/vect/slp-perm-1.c |  6 +-
 gcc/testsuite/gcc.dg/vect/slp-perm-2.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-3.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-5.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-6.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-7.c |  8 +---
 gcc/testsuite/gcc.dg/vect/slp-perm-8.c |  7 +--
 gcc/testsuite/lib/target-supports.exp  | 19 +++
 gcc/tree-vect-slp.c                    | 30 ++
 9 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-1.c b/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
index 2b6c134..6d0d66f 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
@@ -58,5 +58,9 @@ int main (int argc, const char* argv[])
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_perm } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm && {! vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "note: Built SLP cancelled: can use load/store-lanes" { target { vect_perm && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-2.c b/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
index da50e16..34b413d 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
@@ -54,5 +54,8 @@ int main (int argc, const char* argv[])
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_perm } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_perm } } } */
-
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_perm && {! vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "note: Built SLP cancelled: can use load/store-lanes" { target { vect_perm && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-3.c b/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
index 1d33f9b..ec13e0f 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
@@ -67,6 +67,9 @@ int main (int argc, const char
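For context, the slp-perm tests exercise grouped, interleaved accesses of roughly the following shape, where a load-lanes target (e.g. AArch64's LD3 structured load) can fetch the interleaved elements directly instead of doing SLP with explicit permutes (a sketch, not one of the actual testcases):

```c
/* Group of 3 interleaved loads per iteration: in[3i], in[3i+1], in[3i+2].
   A vect_load_lanes target can fetch all three lanes with one
   structured load, so the vectorizer now cancels the permuted SLP
   instance in favour of load-lanes.  */
void
avg3 (unsigned char *out, const unsigned char *in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = (in[3 * i] + in[3 * i + 1] + in[3 * i + 2]) / 3;
}
```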
Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
On Tue, Oct 27, 2015 at 2:39 PM, Richard Biener wrote:

On Mon, Oct 26, 2015 at 6:59 AM, sameera wrote:

Richard, we have defined the input language for convenience in the prototype implementation. However, we will be using GIMPLE as our IR. As per the grammar of our tree, a p-tree denotes the permute order associated with the statement, whereas a c-tree is actually the GIMPLE instruction, which performs the compute operation. I tried looking at the structures used in SLP; however, they cannot be used as they are, as the main difference between the current SLP implementation in GCC and our representation is that the permute order in SLP is part of the tree node in current GCC, whereas in our representation the permute order is represented as an independent tree node. Hence, I have created a new tree structure for our pass, which will create p-tree nodes for the permute order, and c-tree nodes which point to the appropriate gimple statement.

Yes, that's the whole purpose - get the vectorizer (and SLP) a better data structure which is explicit about permutes. I somewhat miss an idea on how to deal with irregular SLP cases - that is, does the AST model each (current) SLP node as a single statement or does it model individual vector elements (so you can insert no-op compensation code to make the operations match)? Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.

The pass structure we are having for our optimization is as follows:
- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of target-aware reordering-instruction selection algorithm)
  - Loop transformation (cost-aware gimple code generation)

We might need some optimizations to be performed on the loop body before we start our optimization for reordering-instruction selection, so that the input program can be transformed to fit into our restricted loop structure.
These transformations can be performed by the restricted loop construction subpass. E.g., multi-dimensional arrays, nested loops, reduction chains etc. can be supported by transforming the input loop. The case that you have mentioned can be transformed at this level.

As for implementing the whole thing in GCC, I recently spent some more time thinking and came to the conclusion that the path to the fastest improvements in GCC code-gen would be to build the new AST after the current analysis phase has finished, do the optimization, and drive code-generation off the new AST. Thus keep things like data-ref and dependence analysis as is, as well as group analysis and SLP tree construction. Build the AST from the vectorizer "IL" at the point vect_transform_loop / vect_slp_transform_bb is called.

The current pass structure that we have designed does exactly that. The only difference is that we are planning to use the transform phase as a co-process of our transform pass, rather than reconstruct an AST from our representation to pass to vect_transform_loop.

More and more of the rest of the vectorizer code can then be "slowly" moved to work on the AST, and AST construction moved earlier.

So I think an incremental approach is much more likely to get us some benefit, and I'm wondering how much of the benefit here stems from which bit, and how many of these changes can be broken off. For example, making permutes into explicit operations on the SLP tree, and adding a pass that tries to push them up/down the tree into each other, might be done on the existing codebase, and would be a significant win in itself (bringing some/much(?) of the improvements in interleaving code of your proposal). I'm not really sure how much of the benefit stems from what, though - e.g. does reducing vector length down to fit the target machine only after the other steps bring any benefits besides raising abstraction? (Which certainly can be a benefit!) AFAICT this could be done independently of other changes?
Another limitation of the present SLP structure is it's not understanding sharing (until later CSE, hence costing wrongly); it would be good to fix this (making the SLP tree into a DAG, and potentially reusing subtrees by adding permutes). This is a step we'd like to take anyway, but might tie in with your "Bottom-up commoning of p-node subtrees". And would help with (BB as well as loop) cost calculations. Are there any really fundamental differences that we cannot overcome in steps? ( / are there any steps we shouldn't take in isolation, because we'd have to 'throw them away' later?) Alan
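Richard's irregular-SLP example above can be made concrete; the compensation he describes is a multiply by 1 in the second lane, so that both lanes of the SLP group perform the same operation (a sketch illustrating the idea, with a hypothetical function name):

```c
/* Irregular SLP group: lane 0 multiplies by 3, lane 1 is a plain copy.
   Conceptually the copy is a multiply by 1, which lets the vectorizer
   treat both lanes as one uniform vector multiply by {3, 1}.  */
void
mul3_copy (int *a, const int *b, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      a[i]     = b[i] * 3;
      a[i + 1] = b[i + 1];   /* == b[i + 1] * 1 */
    }
}
```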
[PATCH][RS6000] Migrate reduction optabs in paired.md
There are only a couple of uses of the old reduction optabs remaining (the optabs producing a vector with only one element set). This migrates the uses in gcc/config/rs6000/paired.md. In the absence of a vec_extract pattern, I generate two subreg moves, the same as usually produced by the midend when using the old pattern.

I don't have hardware to properly test this, but using a stage 1 compiler, I have compiled all the tests in gcc.dg/vect, at -O2 -ftree-vectorize -mpaired -S -mno-altivec -ffast-math -ffinite-math-only, on both powerpc-none-linux-gnupaired and ppcel-none-linux-gnupaired. The patterns were triggered on fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, no-fast-math-vect16.c, vect-reduc-6.c.

no-fast-math-vect16.c exhibited some regalloc differences (regs 0 and 12 are swapped):

 addi 9,1,136
+lfs 12,136(29)
 psq_stx 0,0,9,0,0
-lfs 0,136(29)
-lfs 12,140(1)
-fcmpu 7,12,0
+lfs 0,140(1)
+fcmpu 7,0,12

No other assembly was changed. Is this OK for trunk? (stage 3?)

Cheers, Alan

gcc/ChangeLog:

	* config/rs6000/paired.md (reduc_smax_v2sf): Rename to...
	(reduc_smax_scal_v2sf): ...here, make result SFmode, extract element.
	(reduc_smin_v2sf): Rename to...
	(reduc_smin_scal_v2sf): ...here, make result SFmode, extract element.
	(reduc_splus_v2sf): Rename to...
	(reduc_plus_scal_v2sf): ...here, make result SFmode, extract element.
---
 gcc/config/rs6000/paired.md | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/paired.md b/gcc/config/rs6000/paired.md
index 5d094fb..c3f4d66 100644
--- a/gcc/config/rs6000/paired.md
+++ b/gcc/config/rs6000/paired.md
@@ -421,45 +421,62 @@
   DONE;
 })
 
-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "gpc_reg_operand" "=f")
    (match_operand:V2SF 1 "gpc_reg_operand" "f")]
   "TARGET_PAIRED_FLOAT"
 {
   rtx tmp_swap = gen_reg_rtx (V2SFmode);
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
 
   emit_insn (gen_paired_merge10 (tmp_swap, operands[1], operands[1]));
   emit_insn (gen_subv2sf3 (tmp, operands[1], tmp_swap));
-  emit_insn (gen_selv2sf4 (operands[0], tmp, operands[1], tmp_swap, CONST0_RTX (SFmode)));
+  emit_insn (gen_selv2sf4 (vec_res, tmp, operands[1], tmp_swap,
+			   CONST0_RTX (SFmode)));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+						    BYTES_BIG_ENDIAN ? 4 : 0));
 
   DONE;
 })
 
-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "gpc_reg_operand" "=f")
    (match_operand:V2SF 1 "gpc_reg_operand" "f")]
   "TARGET_PAIRED_FLOAT"
 {
   rtx tmp_swap = gen_reg_rtx (V2SFmode);
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
 
   emit_insn (gen_paired_merge10 (tmp_swap, operands[1], operands[1]));
   emit_insn (gen_subv2sf3 (tmp, operands[1], tmp_swap));
-  emit_insn (gen_selv2sf4 (operands[0], tmp, tmp_swap, operands[1], CONST0_RTX (SFmode)));
+  emit_insn (gen_selv2sf4 (vec_res, tmp, tmp_swap, operands[1],
+			   CONST0_RTX (SFmode)));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+						    BYTES_BIG_ENDIAN ? 4 : 0));
 
   DONE;
 })
 
-(define_expand "reduc_splus_v2sf"
-  [(set (match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_plus_scal_v2sf"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
 	(match_operand:V2SF 1 "gpc_reg_operand" "f"))]
   "TARGET_PAIRED_FLOAT"
-  "
 {
-  emit_insn (gen_paired_sum1 (operands[0], operands[1], operands[1], operands[1]));
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
+
+  emit_insn (gen_paired_sum1 (vec_res, operands[1], operands[1], operands[1]));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+						    BYTES_BIG_ENDIAN ? 4 : 0));
   DONE;
-}")
+})
 
 (define_expand "movmisalignv2sf"
   [(set (match_operand:V2SF 0 "nonimmediate_operand" "")
-- 
1.9.1
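A C model of what the two subreg moves compute, under my assumption that the subreg byte offsets map directly onto the in-memory layout of the pair (this is an illustration of the `BYTES_BIG_ENDIAN ? 4 : 0` choice, not code from the patch):

```c
#include <string.h>

/* Reinterpret the V2SF result as 8 bytes (the DImode subreg), then
   take the 4-byte half at byte offset 4 on big-endian targets and
   offset 0 on little-endian ones -- mirroring
   simplify_gen_subreg (SFmode, di_res, DImode, BYTES_BIG_ENDIAN ? 4 : 0).  */
float
extract_scalar_result (const float pair[2])
{
  unsigned char bytes[8];
  memcpy (bytes, pair, sizeof bytes);

  int big_endian = 0;
#if defined (__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  big_endian = 1;
#endif

  float r;
  memcpy (&r, bytes + (big_endian ? 4 : 0), sizeof r);
  return r;
}
```

On a little-endian host this yields pair[0]; on big-endian, the element at byte offset 4, matching where the reduction result lands.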
[PATCH][MIPS] Migrate reduction optabs in mips-ps-3d.md
Here's an updated version, also covering the min/max patterns I missed before.

I've now managed to do some testing with a stage 1 compiler, by compiling all tests in gcc.dg/vect at -O2 -ftree-vectorize -mips3d -march=mips64r2 -mabi=n32 $x -ffast-math -ffinite-math-only. There were no changes in which files compiled (871 did, some did not due to libraries etc. missing in my stage 1 compiler), and no changes in any assembly output. The patterns were triggered on: fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, fast-math-vect-reduc-9.c, no-fast-math-vect16.c, pr66142.c, vect-outer-fir-big-array.c, vect-outer-fir.c, vect-outer-fir-lb-big-array.c, vect-outer-fir-lb.c, vect-reduc-10.c, vect-reduc-6.c.

I realize this is not completely representative of a 'proper' test run, in which different files are compiled with different options, but it provides reasonable confidence that I'm not changing any code generation.

OK for trunk? (stage 3?)

Cheers, Alan

gcc/ChangeLog:

	* config/mips/mips-ps-3d.md (reduc_splus_v2sf): Remove.
	(reduc_plus_scal_v2sf): New.
	(reduc_smax_v2sf): Rename to...
	(reduc_smax_scal_v2sf): ...here, make result SFmode, add vec_extract.
	(reduc_smin_v2sf): Rename to...
	(reduc_smin_scal_v2sf): ...here, make result SFmode, add vec_extract.
---
 gcc/config/mips/mips-ps-3d.md | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/gcc/config/mips/mips-ps-3d.md b/gcc/config/mips/mips-ps-3d.md
index 8bc7608..a93a2b9 100644
--- a/gcc/config/mips/mips-ps-3d.md
+++ b/gcc/config/mips/mips-ps-3d.md
@@ -371,13 +371,17 @@
   [(set_attr "type" "fadd")
    (set_attr "mode" "SF")])
 
-(define_insn "reduc_splus_v2sf"
-  [(set (match_operand:V2SF 0 "register_operand" "=f")
-	(unspec:V2SF [(match_operand:V2SF 1 "register_operand" "f")
-		      (match_dup 1)]
-		     UNSPEC_ADDR_PS))]
+(define_expand "reduc_plus_scal_v2sf"
+  [(match_operand:SF 0 "register_operand" "=f")
+   (match_operand:V2SF 1 "register_operand" "f")]
   "TARGET_HARD_FLOAT && TARGET_MIPS3D"
-  "")
+  {
+    rtx temp = gen_reg_rtx (V2SFmode);
+    emit_insn (gen_mips_addr_ps (temp, operands[1], operands[1]));
+    rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+    emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
+    DONE;
+  })
 
 ; cvt.pw.ps - Floating Point Convert Paired Single to Paired Word
 (define_insn "mips_cvt_pw_ps"
@@ -745,20 +749,26 @@
   DONE;
 })
 
-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V2SF 1 "register_operand")]
   "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
 {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_sminv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_sminv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
   DONE;
 })
 
-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V2SF 1 "register_operand")]
   "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
 {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_smaxv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_smaxv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
  DONE;
 })
-- 
1.9.1
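The contract being migrated to in both of these patches can be stated in C terms: the legacy reduc_* optabs produced a vector with the result in one (endian-dependent) lane, whereas the reduc_*_scal optabs yield the scalar directly, with no lane extraction left to the caller. A sketch of the new scalar semantics for V2SF:

```c
/* Scalar-producing reductions over a two-element float vector,
   matching the semantics of the reduc_plus_scal_v2sf /
   reduc_smin_scal_v2sf / reduc_smax_scal_v2sf expanders: the result
   is a plain SFmode scalar.  */
float reduc_plus_scal_v2sf (const float v[2]) { return v[0] + v[1]; }
float reduc_smin_scal_v2sf (const float v[2]) { return v[0] < v[1] ? v[0] : v[1]; }
float reduc_smax_scal_v2sf (const float v[2]) { return v[0] > v[1] ? v[0] : v[1]; }
```

The BYTES_BIG_ENDIAN lane selection in the patterns is an implementation detail of extracting that scalar from the vector temporary; callers of the new optabs never see it.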
Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c
On 05/01/16 07:29, Richard Biener wrote:

On January 4, 2016 8:08:17 PM GMT+01:00, Jeff Law <l...@redhat.com> wrote:

On 12/21/2015 06:13 AM, Alan Lawrence wrote:

This is a respin of patches https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03266.html and https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03267.html, which were "too quickly" approved before concerns with efficiency were pointed out. I tried to change the hashing just in tree-ssa-dom.c using C++ subclassing, but couldn't cleanly separate this out from tree-ssa-scopedtables and tree-ssa-threadedge.c due to use of avail_exprs_stack. So I figured it was probably appropriate to use the equivalences in jump threading too. Also, using get_ref_base_and_extent unifies handling of MEM_REFs and ARRAY_REFs.

Without looking at the patch: ARRAY_REFs can have non-constant indices, which get_ref_base_and_extent handles conservatively. You should make sure not to regress here.

Thanks for the warning - my understanding is that in such a case, get_ref_base_and_extent returns max_size=(size of the whole array), size=(size of one element); and I only handle cases where size==max_size. Arrays of unknown length have size -1, so will never be equal.

Thanks, Alan
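The kind of equivalence the unified base-and-extent hashing enables looks like this (a sketch along the lines of the ssa-dom-cse testcases in the companion patch, not code from it): an ARRAY_REF and a MEM_REF that resolve to the same base, offset, and size should hash identically so DOM can CSE them.

```c
/* a[1] and *(p + 1) denote the same base + offset + size, so a hash
   keyed on get_ref_base_and_extent treats them as the same expression
   and DOM can CSE the load down to the constant 7.  */
int a[4];

int
same_location (void)
{
  a[1] = 7;
  int *p = a;
  return *(p + 1);
}
```

A variable index such as a[i] is where Richard's caveat applies: there the extent covers the whole array (size != max_size), so such accesses are deliberately left out of the normalization.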
RS6000/PowerPC -mpaired
Can anyone give me any pointers on how to use the -mpaired option ("PowerPC 750CL paired-single instructions") ? I'm trying to get it to use the reduc_s{min,max,plus}_v2sf patterns in gcc/config/rs6000/paired.md, so far I haven't managed to find a configuration that supports loading a vector of floats but that doesn't also have superior altivec/v4sf support... Thanks, Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 24/12/15 19:59, Mike Stump wrote: On Dec 22, 2015, at 8:00 AM, Alan Lawrence <alan.lawre...@foss.arm.com> wrote: On 21/12/15 15:33, Bill Schmidt wrote: Not on a stage1 compiler - check_p8vector_hw_available itself requires being able to run executables - I'll check on gcc112. However, both look like they're really about the host (ability to execute an asm instruction), not the target (/ability for gcc to output such an instruction) Hm, that looks like a pervasive problem for powerpc. There are a number of things that are supposed to be testing effective target but rely on check_p8vector_hw_available, which as you note requires executing an instruction and is really about the host. We need to clean that up; I should probably open a bug. Kind of amazed this has gotten past us for a couple of years. Well, I was about to apologize for making a bogus remark. A really "proper" setup, would be to tell dejagnu to run your execution tests in some kind of emulator/simulator (on your host, perhaps one kind of powerpc) that only/additionally runs instructions for the other, _target_, kind of powerpc...and whatever setup you'd need for all that probably does not live in the GCC repository! I’m not following. dejagnu can already run tests on the target to makes decisions on which tests to run and what to expect from them, if it wants. Some ports already do this. Further, this is pretty typical and standard and easy to do You confuse the issue by mentioning host, but this I think is wrong. These decisions have nothing to do with the host. The are properties of the target execution environment. I’d be happy to help if you’d like. I’d just need the details of what you’d like help with. You're right, which is why I described my first (wrong) remark as bogus. That is, check_p8vector_hw_available is executing an assembly instruction, and on a well-configured test setup, that would potentially invoke an emulator etc. 
- whereas I am just doing 'native' testing on gcc110/gcc112 on the compile farm. So (as Mike says) there is no bug here, but one just needs to be aware that passing -mcpu=power7 (say) is not sufficient to make check_p8vector_hw_available return false when executing on a power8 host; you would also need to set up some kind of power7 emulator/simulator. Hope that's clear! Thanks, Alan
Re: [PATCH 1/4] Make SRA scalarize constant-pool loads
On 24/12/15 11:53, Alan Lawrence wrote:

Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on aarch64 and powerpc64. [snip] This also fixes a bunch of other guality tests on AArch64 that were failing prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), listed below.

Ach, sorry, not quite. That version avoids any regressions (e.g. in pr54970.c), but does not fix all those other tests, unless you also have this hunk (https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01483.html):

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index a3ff2df..2a741b8 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2651,7 +2651,8 @@ analyze_all_variable_accesses (void)
	    && scalarizable_type_p (TREE_TYPE (var)))
	  {
	    if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
-		  <= max_scalarization_size)
+		  <= max_scalarization_size
+		|| DECL_IN_CONSTANT_POOL (var))
	      {
		create_total_scalarization_access (var);
		completely_scalarize (var, TREE_TYPE (var), 0, var);

...which I was using to increase test coverage of the SRA changes. (Alternatively, you can "fix" the tests by running the testsuite with a forced --param sra-max-scalarization-size. But this is only saying that the dwarf info now generated by scalarizing constant-pools is better than whatever dwarf was being generated by whatever other part of the compiler before.)

--Alan
Re: [PATCH][Testsuite]Cleanup logs from gdb tests by adding newlines
Ping. --Alan On 10/12/15 10:31, Alan Lawrence wrote: Runs of the guality testsuite can sometimes end up with gcc.log containing malformed lines like: A debugging session is active.PASS: gcc.dg/guality/pr36728-1.c -O2 line 18 arg4 == 4 A debugging session is active.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:ip == int * Inferior 1 [process 27054] will be killed.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:cicrp == const int * const restrict Inferior 1 [process 27160] will be killed.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:cvirp == int * const volatile restrict This patch just makes sure the PASS/FAIL comes at the beginning of a line. (At the slight cost of adding some extra newlines not in the actual test output.) I moved the remote_close target calls earlier, to avoid any possible race condition of extra output being generated after the newline - this may not be strictly necessary. Tested on aarch64-none-linux-gnu and x86_64-none-linux-gnu. I think this is reasonable for stage 3 - OK for trunk? gcc/testsuite/ChangeLog: * lib/gcc-gdb-test.exp (gdb-test): call remote_close earlier, and send newline to log, before calling pass/fail/unsupported. * lib/gcc-simulate-thread.exp (simulate-thread): Likewise. --- gcc/testsuite/lib/gcc-gdb-test.exp| 15 ++- gcc/testsuite/lib/gcc-simulate-thread.exp | 10 +++--- 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/gcc/testsuite/lib/gcc-gdb-test.exp b/gcc/testsuite/lib/gcc-gdb-test.exp index d3ba6e4..f60cabf 100644 --- a/gcc/testsuite/lib/gcc-gdb-test.exp +++ b/gcc/testsuite/lib/gcc-gdb-test.exp @@ -84,8 +84,9 @@ proc gdb-test { args } { remote_expect target [timeout_value] { # Too old GDB -re "Unhandled dwarf expression|Error in sourced command file|
Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c
This version changes the test cases to fix failures on some platforms, by rewriting the initializers so that they aren't pushed out to the constant pool.

gcc/ChangeLog:

	* tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
	using get_ref_base_and_extent.
	(equal_mem_array_ref_p): New.
	(hashable_expr_equal_p): Add call to previous.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-dom-cse-5.c: New.
	* gcc.dg/tree-ssa/ssa-dom-cse-6.c: New.
	* gcc.dg/tree-ssa/ssa-dom-cse-7.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c | 18 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c | 20 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c | 21 +
 gcc/tree-ssa-scopedtables.c                   | 67 +--
 4 files changed, 123 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
new file mode 100644
index 000..cd38d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
@@ -0,0 +1,18 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-fre -fdump-tree-dom2" } */
+
+#define N 8
+
+int
+main (int argc, char **argv)
+{
+  int a[N];
+  for (int i = 0; i < N; i++)
+    a[i] = 2*i + 1;
+  int *p = &a[0];
+  __builtin_printf ("%d\n", a[argc]);
+  return *(++p);
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
new file mode 100644
index 000..002fd81
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
@@ -0,0 +1,20 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.
 */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-fre -fdump-tree-dom2" } */
+
+int
+main (int argc, char **argv)
+{
+  union {
+    int a[4];
+    int b[2];
+  } u;
+  u.a[0] = 1;
+  u.a[1] = 42;
+  u.a[2] = 3;
+  u.a[3] = 4;
+  __builtin_printf ("%d\n", u.a[argc]);
+  return u.b[1];
+}
+
+/* { dg-final { scan-tree-dump-times "return 42;" 1 "dom2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
new file mode 100644
index 000..151e5d4e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
@@ -0,0 +1,21 @@
+/* Test normalization of MEM_REF expressions in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+
+typedef struct {
+  int a[8];
+} foo;
+
+foo f;
+
+int
+test ()
+{
+  foo g;
+  g.a[0] = 1; g.a[1] = 2; g.a[2] = 3; g.a[3] = 4;
+  g.a[4] = 5; g.a[5] = 6; g.a[6] = 7; g.a[7] = 8;
+  f = g;
+  return f.a[2];
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "optimized" } } */
diff --git a/gcc/tree-ssa-scopedtables.c b/gcc/tree-ssa-scopedtables.c
index 6034f79..8df7125 100644
--- a/gcc/tree-ssa-scopedtables.c
+++ b/gcc/tree-ssa-scopedtables.c
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "fold-const.h"
 #include "tree-eh.h"
 #include "internal-fn.h"
+#include "tree-dfa.h"
 
 static bool hashable_expr_equal_p (const struct hashable_expr *,
				   const struct hashable_expr *);
@@ -209,11 +210,70 @@ avail_expr_hash (class expr_hash_elt *p)
   const struct hashable_expr *expr = p->expr ();
   inchash::hash hstate;
 
+  if (expr->kind == EXPR_SINGLE)
+    {
+      /* T could potentially be a switch index or a goto dest.  */
+      tree t = expr->ops.single.rhs;
+      if (TREE_CODE (t) == MEM_REF || TREE_CODE (t) == ARRAY_REF)
+	{
+	  /* Make equivalent statements of both these kinds hash together.
+	     Dealing with both MEM_REF and ARRAY_REF allows us not to care
+	     about equivalence with other statements not considered here.
*/
+	  bool reverse;
+	  HOST_WIDE_INT offset, size, max_size;
+	  tree base = get_ref_base_and_extent (t, &offset, &size, &max_size,
+					       &reverse);
+	  /* Strictly, we could try to normalize variable-sized accesses too,
+	     but here we just deal with the common case.  */
+	  if (size == max_size)
+	    {
+	      enum tree_code code = MEM_REF;
+	      hstate.add_object (code);
+	      inchash::add_expr (base, hstate);
+	      hstate.add_object (offset);
+	      hstate.add_object (size);
+	      return hstate.end ();
+	    }
+	}
+    }
+
   inchash::add_hashable_expr (expr, hstate);
 
   return hstate.end ();
 }
 
+/* Compares trees T0 and T1 to see if they are MEM_REF or ARRAY_REFs equivalent
+   to each other.  (That is, they return the
Re: [PATCH 1/4] Make SRA scalarize constant-pool loads
Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on aarch64 and powerpc64. Prior to SRA handling constant pool decls, -fdump-tree-esra-details (at -O1 -g) had shown: : a = *.LC0; # DEBUG a$0 => MEM[(int[3] *)&*.LC0] a$4_3 = MEM[(int[3] *)&*.LC0 + 4B]; # DEBUG a$4 => a$4_3 a$8_4 = MEM[(int[3] *)&*.LC0 + 8B]; The previous patch changed this to: : SR.5_3 = *.LC0[0]; SR.6_4 = *.LC0[1]; SR.7_19 = *.LC0[2]; SR.8_21 = *.LC1[0]; SR.9_22 = *.LC1[1]; SR.10_23 = *.LC1[2]; # DEBUG a$0 => NULL // Note here a$4_24 = SR.6_4; # DEBUG a$4 => a$4_24 a$8_25 = SR.7_19; Turns out the DEBUG a$0 => NULL was produced in load_assign_lhs_subreplacements: if (lacc && lacc->grp_to_be_debug_replaced) { gdebug *ds; tree drhs; struct access *racc = find_access_in_subtree (sad->top_racc, offset, lacc->size); if (racc && racc->grp_to_be_replaced) { if (racc->grp_write) drhs = get_access_replacement (racc); else drhs = NULL; // <=== HERE } ... ds = gimple_build_debug_bind (get_access_replacement (lacc), drhs, gsi_stmt (sad->old_gsi)); Prior to the patch, we'd skipped around load_assign_lhs_subreplacements, because access_has_children_p(racc) (for racc = *.LC0) didn't hold in sra_modify_assign. I also added a constant_decl_p function, combining the two checks, plus some testcase fixes. This also fixes a bunch of other guality tests on AArch64 that were failing prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), listed below. Bootstrapped + check-gcc,g++ on x86_64, ARM, AArch64, also on powerpc64{,le}-none-linux-gnu *in combination with the other patches in the series* (I haven't tested the individual patches on PPC), plus Ada on ARM and x86_64. gcc/ChangeLog: PR target/63679 * tree-sra.c (disqualified_constants, constant_decl_p): New. (sra_initialize): Allocate disqualified_constants. (sra_deinitialize): Free disqualified_constants. (disqualify_candidate): Update disqualified_constants when appropriate. 
(create_access): Scan for constant-pool entries as we go along. (scalarizable_type_p): Add check against type_contains_placeholder_p. (maybe_add_sra_candidate): Allow constant-pool entries. (load_assign_lhs_subreplacements): Bind debug for constant pool vars. (initialize_constant_pool_replacements): New. (sra_modify_assign): Avoid mangling assignments created by previous, and don't generate writes into constant pool. (sra_modify_function_body): Call initialize_constant_pool_replacements. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/sra-17.c: New. * gcc.dg/tree-ssa/sra-18.c: New. Tests fixed on aarch64-none-linux-gnu: gcc.dg/guality/pr54970.c -O1 line 15 *p == 3 gcc.dg/guality/pr54970.c -O1 line 15 *q == 2 gcc.dg/guality/pr54970.c -O1 line 20 *p == 13 gcc.dg/guality/pr54970.c -O1 line 20 *q == 2 gcc.dg/guality/pr54970.c -O1 line 25 *p == 13 gcc.dg/guality/pr54970.c -O1 line 25 *q == 12 gcc.dg/guality/pr54970.c -O1 line 31 *p == 6 gcc.dg/guality/pr54970.c -O1 line 31 *q == 5 gcc.dg/guality/pr54970.c -O1 line 36 *p == 26 gcc.dg/guality/pr54970.c -O1 line 36 *q == 5 gcc.dg/guality/pr54970.c -O1 line 45 *p == 26 gcc.dg/guality/pr54970.c -O1 line 45 *q == 25 gcc.dg/guality/pr54970.c -O1 line 45 p[-1] == 25 gcc.dg/guality/pr54970.c -O1 line 45 p[-2] == 4 gcc.dg/guality/pr54970.c -O1 line 45 q[-1] == 4 gcc.dg/guality/pr54970.c -O1 line 45 q[1] == 26 gcc.dg/guality/pr56154-1.c -O1 line pr56154-1.c:20 x.a == 6 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s1.f == 5.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s1.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s2.f == 0.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:17 s2.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s1.f == 5.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s1.g == 6.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s2.f == 5.0 gcc.dg/guality/pr59776.c -O1 line pr59776.c:20 s2.g == 6.0 gcc.dg/guality/sra-1.c -O1 line 21 a.j == 14 gcc.dg/guality/sra-1.c -O1 line 32 a[1] == 14 
gcc.dg/guality/sra-1.c -O1 line 43 a.i == 4 gcc.dg/guality/sra-1.c -O1 line 43 a.j == 14 (and other optimization levels) On ppc64 bigendian, with the rest of the series (but I expect it's this patch): unix/-m32: gcc.dg/guality/pr54970.c -O1 line 15 a[0] == 1 unix/-m32: gcc.dg/guality/pr54970.c -O1 line 20 a[0] == 1 unix/-m32: gcc.dg/guality/pr54970.c -O1 line 25 a[0] == 1 unix/-m32: gcc.dg/guality/pr54970.c -O1 line
Re: [AArch32][NEON] Implementing vmaxnmQ_ST and vminnmQ_ST intrinsics.
On 21/12/15 11:58, Bilyan Borisov wrote:

This patch implements the vmaxnmQ_ST and vminnmQ_ST intrinsics. It also
implements the __ARM_FEATURE_NUMERIC_MAXMIN macro, which is defined when
__ARM_ARCH >= 8, and which enables the intrinsics.

Tested on arm-none-eabi, armeb-none-eabi, arm-none-linux-gnueabihf.

---
gcc/

2015-XX-XX  Bilyan Borisov

	* config/arm/arm-c.c (arm_cpu_builtins): New macro definition.
	* config/arm/arm_neon.h (vmaxnm_f32): New intrinsic.
	(vmaxnmq_f32): Likewise.
	(vminnm_f32): Likewise.
	(vminnmq_f32): Likewise.
	* config/arm/arm_neon_builtins.def (vmaxnm): New builtin.
	(vminnm): Likewise.
	* config/arm/iterators.md (VMAXMINNM): New iterator.
	(maxmin): Updated iterator.
	* config/arm/neon.md (neon_v, VCVTF): New pattern.
	* config/arm/unspecs.md (UNSPEC_VMAXNM): New unspec.
	(UNSPEC_VMINNM): Likewise.

gcc/testsuite/

2015-XX-XX  Bilyan Borisov

	* gcc.target/arm/simd/vmaxnm_f32_1.c: New.
	* gcc.target/arm/simd/vmaxnmq_f32_1.c: Likewise.
	* gcc.target/arm/simd/vminnm_f32_1.c: Likewise.
	* gcc.target/arm/simd/vminnmq_f32_1.c: Likewise.

I note strong similarities between this patch and David Sherwood's
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01560.html

Both add the same UNSPEC_s, and equivalent VMAXMIN(F?)NM. David's adds and
attributes, whereas Bilyan reuses some elements of the existing . AFAICT, the
patterns they add are in other ways equivalent (same type, condition, modes,
alternatives), albeit in different files and constructed using those different
iterators, and David's has the standard names (which IIUC we want, so the
autovectorizer finds them) whereas Bilyan adds the intrinsics (which we also
want)...

--Alan
Re: [PING][Patch 3/3][Arm] Add support for IEEE-conformant versions of scalar fmin* and fmax*
On 16/12/15 09:31, Kyrill Tkachov wrote: Hi David, On 16/12/15 08:53, David Sherwood wrote: Hi, Here is the last patch of the fmin/fmax change, which adds the optabs to the arm backend. Tested: arm-none-eabi: no regressions Good to go? David Sherwood. ChangeLog: 2015-12-08 David Sherwoodgcc/ * config/arm/iterators.md: New iterators. * config/arm/unspecs.md: New unspecs. * config/arm/neon.md: New pattern. * config/arm/vfp.md: Likewise. Please list the new entities you add in parentheses. For example: * config/arm/iterators.md (VMAXMINFNM): New int iterator. (fmaxmin): New int attribute. (fmaxmin): Likewise. Same for the other files. That way one can grep through the ChangeLogs to find when any particular pattern/iterator/whatever was modified. +;; Auto-vectorized forms for the IEEE-754 fmax()/fmin() functions +(define_insn "3" + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") +(unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w") + (match_operand:VCVTF 2 "s_register_operand" "w")] + VMAXMINFNM))] + "TARGET_NEON && TARGET_FPU_ARMV8" + ".\t%0, %1, %2" + [(set_attr "type" "neon_fp_minmax_s")] +) I would just say "Vector forms" rather than "Auto-vectorized". In principle we can get vector types through other means besides auto-vectorisation. +;; Scalar forms for the IEEE-754 fmax()/fmin() functions +(define_insn "3" + [(set (match_operand:SDF 0 "s_register_operand" "=") +(unspec:SDF [(match_operand:SDF 1 "s_register_operand" "") + (match_operand:SDF 2 "s_register_operand" "")] + VMAXMINFNM))] + "TARGET_HARD_FLOAT && TARGET_VFP5 " + ".\\t%0, %1, %2" + [(set_attr "type" "f_minmax") + (set_attr "conds" "unconditional")] +) + I notice your new test doesn't test the SF variant of this. Could you please add something to test it? Looks good to me otherwise. Thanks, Kyrill I note strong similarities between this patch and Bilyan Borisov's https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01971.html Both add the same UNSPEC_s, and equivalent VMAXMIN(F?)NM. 
David's adds and attributes, whereas Bilyan reuses some elements of the existing . AFAICT, the patterns they add are in other ways equivalent (same type, condition, modes, alternatives), albeit in different files and constructed using those different iterators, and David's has the standard names (which IIUC we want, so the autovectorizer finds them) whereas Bilyan adds the intrinsics (which we also want)... --Alan
Re: [PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified
On 21/12/15 13:13, Alan Lawrence wrote: This is a respin of previous patch series: https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html Minus three of the smaller patches having already been committed; with the updated version of the main patch to SRA; and the patches to DOM reworked to avoid constructing gimple stmt's. IMHO this feels quite invasive for stage 3, however, I note the PR/63679 (ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts as to whether we should take this patch set for gcc 6. I've now tested the series on powerpc64-none-linux-gnu (gcc110) and powerpc64le-none-linux-gnu (gcc112). On powerpc64le, this causes the same guality failures as on AArch64: gcc.dg/guality/pr54970.c -O1 line 15 a[0] == 1 gcc.dg/guality/pr54970.c -O1 line 20 a[0] == 1 gcc.dg/guality/pr54970.c -O1 line 25 a[0] == 1 and the same with other optimization flags. (I note that there are quite a lot of guality failures on both powerpc and also aarch64, which are generally not included in the posts on the gcc-testresults mailing list). The same pr54970 tests still seem to pass on powerpc64 big-endian even with the patches. On both powerpc64 and powerpc64le, the ssa-dom-cse-7.c test also fails, because the constant gets pushed to the constant pool :(, which was not the point of that test (I'd tried to construct it to test normalization of MEM_REFs in DOM prior to the SRA changes). So, adding --param sra-max-scalarization-size-Ospeed=32 fixes this once the SRA patch is in (!), or we could xfail on powerpc64*-*-*-*. --Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 21/12/15 15:33, Bill Schmidt wrote: Not on a stage1 compiler - check_p8vector_hw_available itself requires being able to run executables - I'll check on gcc112. However, both look like they're really about the host (ability to execute an asm instruction), not the target (/ability for gcc to output such an instruction) Hm, that looks like a pervasive problem for powerpc. There are a number of things that are supposed to be testing effective target but rely on check_p8vector_hw_available, which as you note requires executing an instruction and is really about the host. We need to clean that up; I should probably open a bug. Kind of amazed this has gotten past us for a couple of years. Well, I was about to apologize for making a bogus remark. A really "proper" setup, would be to tell dejagnu to run your execution tests in some kind of emulator/simulator (on your host, perhaps one kind of powerpc) that only/additionally runs instructions for the other, _target_, kind of powerpc...and whatever setup you'd need for all that probably does not live in the GCC repository! For now, just XFAILing for powerpc seems the best alternative for this test. Ok, thanks. --Alan
Re: [PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified
On 22/12/15 16:05, Bill Schmidt wrote: On Tue, 2015-12-22 at 15:54 +, Alan Lawrence wrote: On 21/12/15 13:13, Alan Lawrence wrote: This is a respin of previous patch series: https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html Minus three of the smaller patches having already been committed; with the updated version of the main patch to SRA; and the patches to DOM reworked to avoid constructing gimple stmt's. IMHO this feels quite invasive for stage 3, however, I note the PR/63679 (ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts as to whether we should take this patch set for gcc 6. I've now tested the series on powerpc64-none-linux-gnu (gcc110) and powerpc64le-none-linux-gnu (gcc112). On powerpc64le, this causes the same guality failures as on AArch64: gcc.dg/guality/pr54970.c -O1 line 15 a[0] == 1 gcc.dg/guality/pr54970.c -O1 line 20 a[0] == 1 gcc.dg/guality/pr54970.c -O1 line 25 a[0] == 1 and the same with other optimization flags. (I note that there are quite a lot of guality failures on both powerpc and also aarch64, which are generally not included in the posts on the gcc-testresults mailing list). That's interesting. I never see these for powerpc64le on my internal build system. I wonder what the difference is? I've ssh'd into gcc112.fsffrance.org, a power8 powerpc64le system; and configure --with-cpu=power8 --disable-libsanitizer --with-long-double-128 --enable-languages=c,c++,lto. Hmmm...ISTR one variable that can make a big difference is the version (or absence!) of gdb...gcc112 has gdb "Fedora 7.8.2-38.fc21", copyright 2014, and GDB 7.8 looks to have been released at the end of 2014. The same pr54970 tests still seem to pass on powerpc64 big-endian even with the patches. Ach, not quite. In fact those three are failing on powerpc64 bigendian even *without* the patches (at all optimization levels besides -O0 where they pass). 
This is on gcc110, configure --enable-languages=c,c++,lto --disable-libsanitizer --with-cpu=power7 --with-long-double-128. GDB is Fedora 7.7.1-21.fc20, a bit older, and I tested both unix/-m32 and unix/-m64 (following Bill Seurer's posts on gcc-testresults). I'll email a list of the failures I'm seeing offlist (summary: 186 on gcc112, 148 on gcc110 with -m32, 395 on gcc112 with -m64); however, I suspect gdb version is the difference we are looking for. Which *may* mean that with a more up-to-date GDB, it's possible those failures may not be introduced on ppc64le. (Or similarly AArch64.) Hmmm --Alan
[PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
...the test passes with --param sra-max-scalarization-size-Ospeed.

Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390.

On alpha, tree-optimized is:

  MEM[(int[8] *)&a] = { 0, 1 };
  MEM[(int[8] *)&a + 8B] = { 2, 3 };
  MEM[(int[8] *)&a + 16B] = { 4, 5 };
  MEM[(int[8] *)&a + 24B] = { 6, 7 };
  _23 = a[0];
  _29 = a[1];
  sum_30 = _23 + _29;
  _36 = a[2];
  sum_37 = sum_30 + _36;

Which is beyond the scope of these changes to DOM to optimize.

On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a
reduction); however, without that, similar code is generated to Alpha (the
vectorizer decides the reduction is not worthwhile without SIMD support), and
the test fails; hence, I've XFAILed for powerpc, but I think I could condition
the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred?

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-dom-cse-2.c: Remove XFAIL for powerpc(32), hppa,
	aarch64, sparc, s390.
	Add --param sra-max-scalarization-size-Ospeed.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
index 9eccdc9..748448e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized --param sra-max-scalarization-size-Ospeed=32" } */
 
 int
 foo ()
@@ -17,7 +17,8 @@ foo ()
 
 /* After late unrolling the above loop completely DOM should be able to
    optimize this to return 28.  */
-/* See PR63679 and PR64159, if the target forces the initializer to memory then
-   DOM is not able to perform this optimization.  */
+/* On alpha, the vectorizer generates writes of two vector elements at once,
+   but the loop reads only one element at a time, and DOM cannot resolve these.
+   The same happens on powerpc depending on the SIMD support available.  */
-/* { dg-final { scan-tree-dump "return 28;" "optimized" { xfail aarch64*-*-* alpha*-*-* hppa*-*-* powerpc*-*-* sparc*-*-* s390*-*-* } } } */
+/* { dg-final { scan-tree-dump "return 28;" "optimized" { xfail alpha*-*-* powerpc64*-*-* } } } */
-- 
1.9.1
[PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c
This is a respin of patches
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03266.html and
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03267.html, which were "too
quickly" approved before concerns with efficiency were pointed out.

I tried to change the hashing just in tree-ssa-dom.c using C++ subclassing,
but couldn't cleanly separate this out from tree-ssa-scopedtables and
tree-ssa-threadedge.c due to use of avail_exprs_stack. So I figured it was
probably appropriate to use the equivalences in jump threading too. Also,
using get_ref_base_and_extent unifies handling of MEM_REFs and ARRAY_REFs
(hence only one patch rather than two).

I've added a couple of testcases that show the improvement in DOM, but in all
cases I had to disable FRE, even PRE, to get any improvement, apart from on
ssa-dom-cse-2.c itself (where on the affected platforms FRE still does not do
the optimization). This makes me wonder if this is the right approach or
whether changing the references output by SRA (as per
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01490.html , judged as a hack to
SRA to work around limitations in DOM - or is it?) would be better.

Also, this causes gcc to miss vectorization of pr65947-2.c, because the
additional equivalents spotted in dom1 lead to an extra jump-thread (partially
peeling the loop's first iter, as this has become entirely predictable);
loop-header-copying (ch_vect) then kicks in, restoring the loop structure but
leading to:

  i_49 = 1;
  ivtmp_46 = 253;
  if (ivtmp_46 != 0) goto <bb>; else goto <bb>;

  <bb>:   <--- OUTSIDE LOOP

  <bb>:   <--- LOOP HEADER
  # i_53 = PHI <...>

and scalar evolution is unable to analyze i_49 (defined outside the loop) even
though it's equal to a constant (analyze_scalar_evolution_1, test of
flow_bb_inside_loop_p). This is fixed in the next patch...

gcc/ChangeLog:

	* tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
	using get_ref_base_and_extent.
	(equal_mem_array_ref_p): New.
	(hashable_expr_equal_p): Add call to previous.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-dom-cse-5.c: New.
	* gcc.dg/tree-ssa/ssa-dom-cse-6.c: New.
	* gcc.dg/tree-ssa/ssa-dom-cse-7.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c | 18 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c | 16 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c | 19
 gcc/tree-ssa-scopedtables.c                   | 67 +--
 4 files changed, 117 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
new file mode 100644
index 000..cd38d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
@@ -0,0 +1,18 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-fre -fdump-tree-dom2" } */
+
+#define N 8
+
+int
+main (int argc, char **argv)
+{
+  int a[N];
+  for (int i = 0; i < N; i++)
+    a[i] = 2*i + 1;
+  int *p = &a[0];
+  __builtin_printf ("%d\n", a[argc]);
+  return *(++p);
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
new file mode 100644
index 000..b0c5138
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
@@ -0,0 +1,16 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-fre -fdump-tree-dom1" } */
+
+int
+main (int argc, char **argv)
+{
+  union {
+    int a[8];
+    int b[2];
+  } u = { .a = { 1, 42, 3, 4, 5, 6, 7, 8 } };
+  __builtin_printf ("%d\n", u.a[argc]);
+  return u.b[1];
+}
+
+/* { dg-final { scan-assembler-times "return 42;" 1 "dom1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
new file mode 100644
index 000..1fc4c5e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
@@ -0,0 +1,19 @@
+/* Test normalization of MEM_REF expressions in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+
+typedef struct {
+  int a[8];
+} foo;
+
+foo f;
+
+int
+test ()
+{
+  foo g = { { 1, 2, 3, 4, 5, 6, 7, 8 } };
+  f=g;
+  return f.a[2];
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "optimized" } } */
diff --git a/gcc/tree-ssa-scopedtables.c b/gcc/tree-ssa-scopedtables.c
index 6034f79..8df7125 100644
--- a/gcc/tree-ssa-scopedtables.c
+++ b/gcc/tree-ssa-scopedtables.c
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "fold-const.h"
 #include "tree-eh.h"
 #include "internal-fn.h"
+#include
[PATCH 3/4] Add a pass_copy_prop following pass_ch_vect
This fixes the missed vectorization of gcc.dg/vect/pr65947-2.c following the
previous patch, by cleaning up:

  i_49 = 1;
  ...
  goto <bb>;

  <bb>:

  <bb>:   <--- LOOP HEADER
  # i_53 = PHI <...>

into:

  <bb>:
  # i_53 = PHI <...>

Allowing scalar evolution and vectorization to proceed.

A similar sequence of partial peeling, header-copying, and constant
propagation also leads to pr65947-9.c successfully vectorizing (it previously
didn't as the iteration count was too high) when inlined and so using known
data.

I'm not sure whether adding a pass_copy_prop is the right thing here, but
since loop-header-copying can create such opportunities, it seems good to take
them!

gcc/ChangeLog:

	* passes.def: Add a pass_copy_prop following pass_ch_vect.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/pr65947-9.c: Add __attribute__ ((noinline)).

--Alan
---
 gcc/passes.def                        | 1 +
 gcc/testsuite/gcc.dg/vect/pr65947-9.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 624d121..1043548 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -269,6 +269,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_expand_omp_ssa);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_ch_vect);
+	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_if_conversion);
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	     Please do not add any other passes in between.  */
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-9.c b/gcc/testsuite/gcc.dg/vect/pr65947-9.c
index d0da13f..c7aac8d 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-9.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-9.c
@@ -7,7 +7,7 @@ extern void abort (void) __attribute__ ((noreturn));
 /* Condition reduction with maximum possible loop size.  Will fail to
    vectorize because the vectorisation requires a slot for default values.  */
 
-char
+__attribute__ ((noinline)) char
 condition_reduction (char *a, char min_v)
 {
   char last = -72;
-- 
1.9.1
[PATCH 1/4] Make SRA scalarize constant-pool loads
This is the same as the version conditionally approved near the end of stage 1 https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01948.html, plus the requested comment and a couple of testcases for platforms where the constants are pushed into the constant pool. (I didn't commit this at the time because without the rest of the series, benchmarking results didn't really show any improvement, and the PR was not fixed.) I've left the disqualified_constants check in, although I could be persuaded to remove it if people felt strongly. Also on AArch64, this still causes FAILs in guality pr54970.c: FAIL: gcc.dg/guality/pr54970.c -Os line 15 a[0] == 1 FAIL: gcc.dg/guality/pr54970.c -Os line 20 a[0] == 1 FAIL: gcc.dg/guality/pr54970.c -Os line 25 a[0] == 1 and I think I must ask the AArch64 maintainers to take these for the time being. Bootstrapped + check-gcc,g++,ada on x86_64, ARM, AArch64. gcc/ChangeLog: PR target/63679 * tree-sra.c (disqualified_constants): New. (sra_initialize): Allocate disqualified_constants. (sra_deinitialize): Free disqualified_constants. (disqualify_candidate): Update disqualified_constants when appropriate. (create_access): Scan for constant-pool entries as we go along. (scalarizable_type_p): Add check against type_contains_placeholder_p. (maybe_add_sra_candidate): Allow constant-pool entries. (initialize_constant_pool_replacements): New. (sra_modify_expr): Avoid mangling assignments created by previous. (sra_modify_function_body): Call initialize_constant_pool_replacements. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/sra-17.c: New. * gcc.dg/tree-ssa/sra-18.c: New. 
--- gcc/testsuite/gcc.dg/tree-ssa/sra-17.c | 19 +++ gcc/testsuite/gcc.dg/tree-ssa/sra-18.c | 28 ++ gcc/tree-sra.c | 99 -- 3 files changed, 143 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-17.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-18.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c b/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c new file mode 100644 index 000..cff868a --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c @@ -0,0 +1,19 @@ +/* { dg-do run { target { aarch64*-*-* alpha*-*-* arm*-*-* hppa*-*-* powerpc*-*-* s390*-*-* } } } */ +/* { dg-options "-O2 -fdump-tree-esra --param sra-max-scalarization-size-Ospeed=16" } */ + +extern void abort (void); + +int +main (int argc, char **argv) +{ + int a[4] = { 7, 19, 11, 255 }; + int tot = 0; + for (int i = 0; i < 4; i++) +tot = (tot*256) + a[i]; + if (tot == 0x07130bff) +return 0; + abort (); +} + +/* { dg-final { scan-tree-dump-times "Removing load: a = \\\*.LC0;" 1 "esra" } } */ +/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\\[" 4 "esra" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c b/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c new file mode 100644 index 000..6c0d52b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c @@ -0,0 +1,28 @@ +/* { dg-do run { target { aarch64*-*-* alpha*-*-* arm*-*-* hppa*-*-* powerpc*-*-* s390*-*-* } } } */ +/* { dg-options "-O2 -fdump-tree-esra --param sra-max-scalarization-size-Ospeed=16" } */ + +extern void abort (void); +struct foo { int x; }; + +struct bar { struct foo f[2]; }; + +struct baz { struct bar b[2]; }; + +int +main (int argc, char **argv) +{ + struct baz a = { { { { { 4 }, { 5 } } }, { { { 6 }, { 7 } } } } }; + int tot = 0; + for (int i = 0; i < 2; i++) +for (int j = 0; j < 2; j++) + tot = (tot*256) + a.b[i].f[j].x; + if (tot == 0x04050607) +return 0; + abort (); +} + +/* { dg-final { scan-tree-dump-times "Removing load: a = \\\*.LC0;" 1 "esra" } } */ +/* { dg-final { 
scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\.b\\\[0\\\]\\.f\\\[0\\\]\\.x" 1 "esra" } } */ +/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\.b\\\[0\\\]\\.f\\\[1\\\]\\.x" 1 "esra" } } */ +/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\.b\\\[1\\\]\\.f\\\[0\\\]\\.x" 1 "esra" } } */ +/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\.b\\\[1\\\]\\.f\\\[1\\\]\\.x" 1 "esra" } } */ diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c index c4fea5b..4ea8550 100644 --- a/gcc/tree-sra.c +++ b/gcc/tree-sra.c @@ -328,6 +328,10 @@ candidate (unsigned uid) those which cannot be (because they are and need be used as a whole). */ static bitmap should_scalarize_away_bitmap, cannot_scalarize_away_bitmap; +/* Bitmap of candidates in the constant pool, which cannot be scalarized + because this would produce non-constant expressions (e.g. Ada). */ +static bitmap disqualified_constants; + /* Obstack for creation of fancy names. */ static struct obstack name_obstack; @@ -652,6 +656,7 @@ sra_initialize (void) (vec_safe_length (cfun->local_decls) / 2); should_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
[PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified
This is a respin of previous patch series: https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html Minus three of the smaller patches having already been committed; with the updated version of the main patch to SRA; and the patches to DOM reworked to avoid constructing gimple stmt's. IMHO this feels quite invasive for stage 3, however, I note the PR/63679 (ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts as to whether we should take this patch set for gcc 6.
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 21/12/15 14:59, Bill Schmidt wrote: On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a reduction); however, without that, similar code is generated to Alpha (the vectorizer decides the reduction is not worthwhile without SIMD support), and the test fails; hence, I've XFAILed for powerpc, but I think I could condition the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred? Fun. Does it work with -mcpu=power7? Yes, it works with -mcpu=power7, as well as -mcpu=power8, but not e.g. -mcpu=power6. Bill: What GCC DejaGNU incantation would you like to see? This sounds like more fallout from unaligned accesses being faster on POWER8 than previous hardware. What about conditioning the XFAIL on { powerpc*-*-* && { ! vect_hw_misalign } } -- does this work properly? Not on a stage1 compiler - check_p8vector_hw_available itself requires being able to run executables - I'll check on gcc112. However, both look like they're really about the host (ability to execute an asm instruction), not the target (/ability for gcc to output such an instruction) --Alan
Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms
On 21/12/15 14:59, Bill Schmidt wrote: On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a reduction); however, without that, similar code is generated to Alpha (the vectorizer decides the reduction is not worthwhile without SIMD support), and the test fails; hence, I've XFAILed for powerpc, but I think I could condition the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred? Fun. Does it work with -mcpu=power7? Bill: What GCC DejaGNU incantation would you like to see? This sounds like more fallout from unaligned accesses being faster on POWER8 than previous hardware. What about conditioning the XFAIL on { powerpc*-*-* && { ! vect_hw_misalign } } -- does this work properly? Right now that's just an alternative way of saying what you suggested It works properly in that it is equivalent to what I suggested. However, the test passes on power7 as well, so my suggestion of check_p8vector_hw_available is itself wrong... --Alan
Re: [PATCH] Fix PR68707, 67323
On 16/12/15 15:01, Richard Biener wrote: The following patch adds a heuristic to prefer store/load-lanes over SLP when vectorizing. Compared to the variant attached to the PR I made the STMT_VINFO_STRIDED_P behavior explicit (matching what you've tested). Not sure I follow this. Compared to the variant attached to the PR - we will now attempt to use load-lanes, if (say) all of the loads are strided, even if we know we don't support load-lanes (for any of them). That sounds the wrong way around and I think rather different to what you proposed earlier? (At the least, the debug message "can use load/store lanes" is potentially misleading, that's not necessarily the case!) There are arguments that we want to do less SLP, generally, on ARM/AArch64 but I think Wilco's permute cost patch https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01469.html is a better way of achieving that? Just my gut feeling at this point - I haven't evaluated this version of the patch on any benchmarks etc... Thanks, Alan
Re: [PATCH] Fix PR68707, 67323
On 17/12/15 10:46, Richard Biener wrote: On Thu, 17 Dec 2015, Alan Lawrence wrote: On 16/12/15 15:01, Richard Biener wrote: The following patch adds a heuristic to prefer store/load-lanes over SLP when vectorizing. Compared to the variant attached to the PR I made the STMT_VINFO_STRIDED_P behavior explicit (matching what you've tested). Not sure I follow this. Compared to the variant attached to the PR - we will now attempt to use load-lanes, if (say) all of the loads are strided, even if we know we don't support load-lanes (for any of them). That sounds the wrong way around and I think rather different to what you proposed earlier? (At the least, the debug message "can use load/store lanes" is potentially misleading, that's not necessarily the case!) Ah, indeed. Note that the whole thing is still guarded by the check that we can use store-lanes for the store. I can also do it the other way around (as previously proposed) which would change outcome for slp-perm-11.c. That proposal would not reject the SLP if there were any strided grouped loads involved. Indeed; the STMT_VINFO_STRIDED_P || !vect_load_lanes_supported approach (as on PR68707) vectorizes slp-perm-11.c with SLP, which works much better than the !STMT_VINFO_STRIDED_P && !vect_load_lanes_supported, which tries to use st2 (and only sort-of works - you get an st2 output, but no ld2, and lots of faff). I think I move for the patch from PR68707, therefore. (Ramana - any thoughts?) Btw, another option is to push the decision past full SLP analysis and thus make the decision globally for all SLP instances - currently SLP instances are cancelled on a one-by-one basis, meaning we might do SLP plus load/store-lanes in the same loop. I don't see anything inherently wrong with doing both in the same loop. On simple loops, I suspect we'll do better committing to one strategy or the other (tho really it's only the VF required I think?), but then, on such simple loops, there are probably not very many SLP instances!
Maybe we have to go all the way to implementing a better vectorization cost hook just for the permutations - the SLP path in theory knows exactly which ones it will generate. Yes, I think this sounds like a good plan for GCC 7. It doesn't require constructing an entire stmt (if you are concerned about the cost of that), and on most targets, probably integrates fairly easily with the expand_vec_perm_const hooks. --Alan
[PATCH][Testsuite]Cleanup logs from gdb tests by adding newlines
Runs of the guality testsuite can sometimes end up with gcc.log containing malformed lines like: A debugging session is active.PASS: gcc.dg/guality/pr36728-1.c -O2 line 18 arg4 == 4 A debugging session is active.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:ip == int * Inferior 1 [process 27054] will be killed.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:cicrp == const int * const restrict Inferior 1 [process 27160] will be killed.PASS: gcc.dg/guality/restrict.c -O2 line 30 type:cvirp == int * const volatile restrict This patch just makes sure the PASS/FAIL comes at the beginning of a line. (At the slight cost of adding some extra newlines not in the actual test output.) I moved the remote_close target calls earlier, to avoid any possible race condition of extra output being generated after the newline - this may not be strictly necessary. Tested on aarch64-none-linux-gnu and x86_64-none-linux-gnu. I think this is reasonable for stage 3 - OK for trunk? gcc/testsuite/ChangeLog: * lib/gcc-gdb-test.exp (gdb-test): call remote_close earlier, and send newline to log, before calling pass/fail/unsupported. * lib/gcc-simulate-thread.exp (simulate-thread): Likewise. --- gcc/testsuite/lib/gcc-gdb-test.exp| 15 ++- gcc/testsuite/lib/gcc-simulate-thread.exp | 10 +++--- 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/gcc/testsuite/lib/gcc-gdb-test.exp b/gcc/testsuite/lib/gcc-gdb-test.exp index d3ba6e4..f60cabf 100644 --- a/gcc/testsuite/lib/gcc-gdb-test.exp +++ b/gcc/testsuite/lib/gcc-gdb-test.exp @@ -84,8 +84,9 @@ proc gdb-test { args } { remote_expect target [timeout_value] { # Too old GDB -re "Unhandled dwarf expression|Error in sourced command file|
Re: [PATCH] Enable libstdc++ numeric conversions on Cygwin
On 13/11/15 14:52, Jonathan Wakely wrote: That patch was wrong, the new macros in include/bits/c++config used "CSTDIO" instead of "STDIO" so it caused several tests to go from PASS to UNSUPPORTED, oops! This is the correct version, tested again more carefully, on powerpc64le-linux and powerpc-aix and x86_64-dragonfly. Committed to trunk. Just as a note, on baremetal ARM and AArch64 (arm-none-eabi / aarch64-none-elf), this causes a bunch more tests to be executed that were previously UNSUPPORTED, including FAIL: 21_strings/basic_string/numeric_conversions/char/stod.cc execution test FAIL: 21_strings/basic_string/numeric_conversions/char/stof.cc execution test FAIL: 21_strings/basic_string/numeric_conversions/char/stold.cc execution test FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stod.cc execution test FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stof.cc execution test FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stold.cc execution test FAIL: 27_io/basic_ostream/inserters_arithmetic/char/hexfloat.cc execution test these were previously gated off by the dg-require-string-conversions, via the "#if !defined(_GLIBCXX_USE_C99) || defined(_GLIBCXX_HAVE_BROKEN_VSWPRINTF)" as GLIBCXX_USE_C99 was not defined. However, now we have "#if !(_GLIBCXX_USE_C99_STDIO && _GLIBCXX_USE_C99_STDLIB && _GLIBCXX_USE_C99_WCHAR)" and all of those are true. (Note the tests would have failed before if we had executed them, there is no change there AFAICS.) Cheers, Alan
[PATCH][contrib] Update download_prerequisites to ISL 0.15.
Since r229889, we now have tests that pass only with ISL 0.15. Although ISL 0.15 is not a requirement, it seems we should make it easy to build the compiler that way. Other opinions? Thanks, Alan contrib/ChangeLog: * download_prerequisites: Update ISL version to 0.15. --- contrib/download_prerequisites | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contrib/download_prerequisites b/contrib/download_prerequisites index 6940330..a685a1d 100755 --- a/contrib/download_prerequisites +++ b/contrib/download_prerequisites @@ -48,7 +48,7 @@ ln -sf $MPC mpc || exit 1 # Necessary to build GCC with the Graphite loop optimizations. if [ "$GRAPHITE_LOOP_OPT" = "yes" ] ; then - ISL=isl-0.14 + ISL=isl-0.15 wget ftp://gcc.gnu.org/pub/gcc/infrastructure/$ISL.tar.bz2 || exit 1 tar xjf $ISL.tar.bz2 || exit 1 -- 1.9.1
Re: [PATCH][contrib] Update download_prerequisites to ISL 0.15.
On 07/12/15 11:54, Markus Trippelsdorf wrote: On 2015.12.07 at 11:52 +, Alan Lawrence wrote: Since r229889, we now have tests that pass only with ISL 0.15. Although ISL 0.15 is not a requirement, it seems we should make it easy to build the compiler that way. This is already fixed by r231329. Ah, yes, sorry, that'd been sitting in my outbox since Friday but I hadn't sent it. Thanks and sorry for the noise. --Alan
Re: [PATCH][1/2] Fix PR68553
On 27/11/15 08:30, Richard Biener wrote: This is part 1 of a fix for PR68533 which shows that some targets cannot can_vec_perm_p on an identity permutation. I chose to fix this in the vectorizer by detecting the identity itself but with the current structure of vect_transform_slp_perm_load this is somewhat awkward. Thus the following no-op patch simplifies it greatly (from the times it was restricted to do interleaving-kind of permutes). It turned out to not be 100% no-op as we now can handle non-adjacent source operands so I split it out from the actual fix. The two adjusted testcases no longer fail to vectorize because of "need three vectors" but unadjusted would fail because there are simply not enough scalar iterations in the loop. I adjusted that and now we vectorize it just fine (running into PR68559 which I filed). Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2015-11-27 Richard Biener <rguent...@suse.de> PR tree-optimization/68553 * tree-vect-slp.c (vect_get_mask_element): Remove. (vect_transform_slp_perm_load): Implement in a simpler way. * gcc.dg/vect/pr45752.c: Adjust. * gcc.dg/vect/slp-perm-4.c: Likewise. On aarch64 and ARM targets, this causes PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0 That is, we now vectorize using SLP, when previously we did not. On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES, without unrolling, but now we unroll * 4, and vectorize using 3 loads and permutes: ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: vect__31.15_94 = VEC_PERM_EXPR ; ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: vect__31.16_95 = VEC_PERM_EXPR ; ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: vect__31.17_96 = VEC_PERM_EXPR ; which *is* a valid vectorization strategy... --Alan
Re: [PATCH] enable loop fusion on isl-15
On 05/11/15 21:43, Sebastian Pop wrote: * graphite-optimize-isl.c (optimize_isl): Call isl_options_set_schedule_maximize_band_depth. * gcc.dg/graphite/fuse-1.c: New. * gcc.dg/graphite/fuse-2.c: New. * gcc.dg/graphite/interchange-13.c: Remove bogus check. I note that the test scan-tree-dump-times forwprop4 "gimple_simplified to[^\\n]*\\^ 12" 1 FAILs under isl-0.14, with which GCC can still be built and generally claims to work. Is it worth trying to detect this in the testsuite, so we can XFAIL it? By which I mean, is there a reasonable testsuite mechanism by which we could do that? Cheers, Alan
Re: [PATCH][1/2] Fix PR68553
On 04/12/15 17:46, Ramana Radhakrishnan wrote: On 04/12/15 16:04, Richard Biener wrote: On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawre...@arm.com> wrote: On 27/11/15 08:30, Richard Biener wrote: This is part 1 of a fix for PR68533 which shows that some targets cannot can_vec_perm_p on an identity permutation. I chose to fix this in the vectorizer by detecting the identity itself but with the current structure of vect_transform_slp_perm_load this is somewhat awkward. Thus the following no-op patch simplifies it greatly (from the times it was restricted to do interleaving-kind of permutes). It turned out to not be 100% no-op as we now can handle non-adjacent source operands so I split it out from the actual fix. The two adjusted testcases no longer fail to vectorize because of "need three vectors" but unadjusted would fail because there are simply not enough scalar iterations in the loop. I adjusted that and now we vectorize it just fine (running into PR68559 which I filed). Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2015-11-27 Richard Biener <rguent...@suse.de> PR tree-optimization/68553 * tree-vect-slp.c (vect_get_mask_element): Remove. (vect_transform_slp_perm_load): Implement in a simpler way. * gcc.dg/vect/pr45752.c: Adjust. * gcc.dg/vect/slp-perm-4.c: Likewise. On aarch64 and ARM targets, this causes PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0 That is, we now vectorize using SLP, when previously we did not. On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES, without unrolling, but now we unroll * 4, and vectorize using 3 loads and permutes: Happens on x86_64 as well with at least Sse4.1. Unfortunately we'll have to start introducing much more fine-grained target-supports for vect_perm to reliably guard all targets. I don't know enough about SSE4.1 to know whether it's a problem there or not. 
This is an actual regression on AArch64 and ARM and not just a testism, you now get:

.L5:
        ldr     q0, [x5, 16]
        add     x4, x4, 48
        ldr     q1, [x5, 32]
        add     w6, w6, 1
        ldr     q4, [x5, 48]
        cmp     w3, w6
        ldr     q2, [x5], 64
        orr     v3.16b, v0.16b, v0.16b
        orr     v5.16b, v4.16b, v4.16b
        orr     v4.16b, v1.16b, v1.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        str     q0, [x4, -32]
        str     q2, [x4, -48]
        str     q4, [x4, -16]
        bhi     .L5

instead of

.L5:
        ld4     {v4.4s - v7.4s}, [x7], 64
        add     w4, w4, 1
        cmp     w3, w4
        orr     v1.16b, v4.16b, v4.16b
        orr     v2.16b, v5.16b, v5.16b
        orr     v3.16b, v6.16b, v6.16b
        st3     {v1.4s - v3.4s}, [x6], 48
        bhi     .L5

LD4 and ST3 do all the permutes without needing actual permute instructions - a strategy that favours generic permutes avoiding the load_lanes case is likely to be more expensive on most implementations. I think worth a PR at least.

regards
Ramana

Yes, quite right. PR 68707. --Alan
Re: [PATCH] Empty redirect_edge_var_map after each pass and function
On 02/12/15 14:13, Jeff Law wrote: On 12/02/2015 01:33 AM, Richard Biener wrote: Right. So the question I have is how/why did DOM leave anything in the map. And if DOM is fixed to not leave stuff lying around, can we then assert that nothing is ever left in those maps between passes? There's certainly no good reason I'm aware of why DOM would leave things in this state. It happens not only with DOM but with all passes doing edge redirection. This is because the map is populated by GIMPLE cfg hooks just in case it might be used. But there is no such thing as a "start CFG manip" and "end CFG manip" to cleanup such dead state. Sigh. IMHO the redirect-edge-var-map stuff is just the very most possible unclean implementation possible. :( (see how remove_edge "clears" stale info from the map to avoid even more "interesting" stale data) Ideally we could assert the map is empty whenever we leave a pass, but as said it triggers all over the place. Even cfg-cleanup causes such stale data. I agree that the patch is only a half-way "solution", but a full solution would require sth more explicit, like we do with initialize_original_copy_tables/free_original_copy_tables. Thus require passes to explicitely request the edge data to be preserved with a initialize_edge_var_map/free_edge_var_map call pair. Not appropriate at this stage IMHO (well, unless it turns out to be a very localized patch). So maybe as a follow-up to aid folks in the future, how about a debugging verify_whatever function that we can call manually if debugging a problem in this space. With a comment indicating why we can't call it unconditionally (yet). jeff I did a (fwiw disable bootstrap) build with the map-emptying code in passes.c (not functions.c), printing out passes after which the map was non-empty (before emptying it, to make sure passes weren't just carrying through stale data from earlier). My (non-exhaustive!) list of passes after which the edge_var_redirect_map can be non-empty stands at... 
aprefetch ccp cddce ch ch_vect copyprop crited crited cselim cunroll cunrolli dce dom ehcleanup einline esra fab fnsplit forwprop fre graphite ifcvt isolate-paths ldist lim local-pure-const mergephi oaccdevlow ompexpssa optimized parloops pcom phicprop phiopt phiprop pre profile profile_estimate sccp sink slsr split-paths sra switchconv tailc tailr tracer unswitch veclower2 vect vrm vrp whole-program

FWIW, the route by which dom added the edge to the redirect map was:

#0  redirect_edge_var_map_add (e=e@entry=0x7fb7a5f508, result=0x7fb725a000, def=0x7fb78eaea0, locus=2147483884) at ../../gcc/gcc/tree-ssa.c:54
#1  0x00cccf58 in ssa_redirect_edge (e=e@entry=0x7fb7a5f508, dest=dest@entry=0x7fb79cc680) at ../../gcc/gcc/tree-ssa.c:158
#2  0x00b00738 in gimple_redirect_edge_and_branch (e=0x7fb7a5f508, dest=0x7fb79cc680) at ../../gcc/gcc/tree-cfg.c:5662
#3  0x006ec678 in redirect_edge_and_branch (e=e@entry=0x7fb7a5f508, dest=) at ../../gcc/gcc/cfghooks.c:356
#4  0x00cb4530 in ssa_fix_duplicate_block_edges (rd=0x1a29f10, local_info=local_info@entry=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1184
#5  0x00cb5520 in ssa_fixup_template_block (slot=, local_info=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1369
#6  traverse_noresize<ssa_local_info_t*, ssa_fixup_template_block> (argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:911
#7  traverse<ssa_local_info_t*, ssa_fixup_template_block> (argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:933
#8  thread_block_1 (bb=bb@entry=0x7fb7485bc8, noloop_only=noloop_only@entry=true, joiners=joiners@entry=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:1592
#9  0x00cb5a40 in thread_block (bb=0x7fb7485bc8, noloop_only=noloop_only@entry=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:1629
#10 0x00cb6bf8 in thread_through_all_blocks (may_peel_loop_headers=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:2736
#11 0x00becf6c in (anonymous namespace)::pass_dominator::execute (this=, fun=0x7fb77d1b28) at ../../gcc/gcc/tree-ssa-dom.c:622
#12 0x009feef4 in execute_one_pass (pass=pass@entry=0x16d1a80) at ../../gcc/gcc/passes.c:2311

The edge is then deleted much later:

#3  0x00f858e4 in free_edge (fn=, e=) at ../../gcc/gcc/cfg.c:91
#4  remove_edge_raw (e=) at ../../gcc/gcc/cfg.c:350
#5  0x006ec814 in remove_edge (e=) at ../../gcc/gcc/cfghooks.c:418
#6  0x006ecaec in delete_basic_block (bb=bb@entry=0x7fb74b3618) at ../../gcc/gcc/cfghooks.c:597
#7  0x00f8d1d4 in try_optimize_cfg (mode=32) at ../../gcc/gcc/cfgcleanup.c:2701
#8  cleanup_cfg (mode=mode@entry=32) at ../../gcc/gcc/cfgcleanup.c:3028
#9  0x0070180c in cfg_layout_initialize (flags=flags@entry=0) at ../../gcc/gcc/cfgrtl.c:4264
#10 0x00f7cdc8 in
Re: [PATCH] Empty redirect_edge_var_map after each pass and function
On 03/12/15 12:58, Richard Biener wrote: On Thu, 3 Dec 2015, Alan Lawrence wrote: On 02/12/15 14:13, Jeff Law wrote: On 12/02/2015 01:33 AM, Richard Biener wrote: Right. So the question I have is how/why did DOM leave anything in the map. And if DOM is fixed to not leave stuff lying around, can we then assert that nothing is ever left in those maps between passes? There's certainly no good reason I'm aware of why DOM would leave things in this state. It happens not only with DOM but with all passes doing edge redirection. This is because the map is populated by GIMPLE cfg hooks just in case it might be used. But there is no such thing as a "start CFG manip" and "end CFG manip" to cleanup such dead state. Sigh. IMHO the redirect-edge-var-map stuff is just the very most possible unclean implementation possible. :( (see how remove_edge "clears" stale info from the map to avoid even more "interesting" stale data) Ideally we could assert the map is empty whenever we leave a pass, but as said it triggers all over the place. Even cfg-cleanup causes such stale data. I agree that the patch is only a half-way "solution", but a full solution would require sth more explicit, like we do with initialize_original_copy_tables/free_original_copy_tables. Thus require passes to explicitely request the edge data to be preserved with a initialize_edge_var_map/free_edge_var_map call pair. Not appropriate at this stage IMHO (well, unless it turns out to be a very localized patch). So maybe as a follow-up to aid folks in the future, how about a debugging verify_whatever function that we can call manually if debugging a problem in this space. With a comment indicating why we can't call it unconditionally (yet). jeff I did a (fwiw disable bootstrap) build with the map-emptying code in passes.c (not functions.c), printing out passes after which the map was non-empty (before emptying it, to make sure passes weren't just carrying through stale data from earlier). My (non-exhaustive!) 
list of passes after which the edge_var_redirect_map can be non-empty stands at... aprefetch ccp cddce ch ch_vect copyprop crited crited cselim cunroll cunrolli dce dom ehcleanup einline esra fab fnsplit forwprop fre graphite ifcvt isolate-paths ldist lim local-pure-const mergephi oaccdevlow ompexpssa optimized parloops pcom phicprop phiopt phiprop pre profile profile_estimate sccp sink slsr split-paths sra switchconv tailc tailr tracer unswitch veclower2 vect vrm vrp whole-program Yeah, exactly my findings... note that most of the above are likely due to cfgcleanup even though it already does sth like e = redirect_edge_and_branch (e, dest); redirect_edge_var_map_clear (e); so eventually placing a redirect_edge_var_map_empty () at the end of the cleanup_tree_cfg function should prune down the above list considerably (well, then assert the map is empty on entry to that function of course) FWIW, the route by which dom added the edge to the redirect map was: #0 redirect_edge_var_map_add (e=e@entry=0x7fb7a5f508, result=0x7fb725a000, def=0x7fb78eaea0, locus=2147483884) at ../../gcc/gcc/tree-ssa.c:54 #1 0x00cccf58 in ssa_redirect_edge (e=e@entry=0x7fb7a5f508, dest=dest@entry=0x7fb79cc680) at ../../gcc/gcc/tree-ssa.c:158 #2 0x00b00738 in gimple_redirect_edge_and_branch (e=0x7fb7a5f508, dest=0x7fb79cc680) at ../../gcc/gcc/tree-cfg.c:5662 #3 0x006ec678 in redirect_edge_and_branch (e=e@entry=0x7fb7a5f508, dest=) at ../../gcc/gcc/cfghooks.c:356 #4 0x00cb4530 in ssa_fix_duplicate_block_edges (rd=0x1a29f10, local_info=local_info@entry=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1184 #5 0x00cb5520 in ssa_fixup_template_block (slot=, local_info=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1369 #6 traverse_noresize<ssa_local_info_t*, ssa_fixup_template_block> ( argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:911 #7 traverse<ssa_local_info_t*, ssa_fixup_template_block> ( argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:933 #8 
thread_block_1 (bb=bb@entry=0x7fb7485bc8, noloop_only=noloop_only@entry=true, joiners=joiners@entry=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:1592 #9 0x00cb5a40 in thread_block (bb=0x7fb7485bc8, noloop_only=noloop_only@entry=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:1629 ---Type to continue, or q to quit--- #10 0x00cb6bf8 in thread_through_all_blocks ( may_peel_loop_headers=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:2736 #11 0x00becf6c in (anonymous namespace)::pass_dominator::execute ( this=, fun=0x7fb77d1b28) at ../../gcc/gcc/tree-ssa-dom.c:622 #12 0x009feef4 in execute_one_pass (pass=pass@entry=0x16d1a80) at ../../gcc/gcc/passes.c:2311 The edge is then deleted much later: #3 0
Re: [PATCH][install.texi] Add note against GNAT 4.8 on ARM targets.
On 16/11/15 15:08, Alan Lawrence wrote: This follows from the discussion here: https://gcc.gnu.org/ml/gcc/2015-10/msg00082.html . OK for trunk? --Alan gcc/ChangeLog: * doc/install.texi: Add note against GNAT 4.8 on ARM targets. --- gcc/doc/install.texi | 2 ++ 1 file changed, 2 insertions(+) diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi index 1fd773e..1ce93d4 100644 --- a/gcc/doc/install.texi +++ b/gcc/doc/install.texi @@ -3481,6 +3481,8 @@ require GNU binutils 2.13 or newer. Such subtargets include: @code{arm-*-netbsdelf}, @code{arm-*-*linux-*} and @code{arm-*-rtemseabi}. +Building the Ada frontend commonly fails (an infinite loop executing @code{xsinfo}) if the host compiler is GNAT 4.8. Host compilers built from the GNAT 4.6, 4.9 or 5 release branches are known to succeed. + @html @end html Ping.
Re: [PING^2][PATCH] Improve C++ loop's backward-jump location
On 24/11/15 14:55, Andreas Arnez wrote: Ping? https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01192.html I guess we want C and C++ to behave the same here? gcc/cp/ChangeLog: * cp-gimplify.c (genericize_cp_loop): Change LOOP_EXPR's location to start of loop body instead of start of loop. gcc/testsuite/ChangeLog: * g++.dg/guality/pr67192.C: New test. Since this, we've been seeing these tests fail natively on AArch64 and ARM:

FAIL: g++.dg/gcov/gcov-1.C -std=gnu++11 gcov: 3 failures in line counts, 0 in branch percentages, 0 in return percentages, 0 in intermediate format
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++11 line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++11 line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++11 line 73: is 162:should be 81
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++14 gcov: 3 failures in line counts, 0 in branch percentages, 0 in return percentages, 0 in intermediate format
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++14 line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++14 line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++14 line 73: is 162:should be 81
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++98 gcov: 3 failures in line counts, 0 in branch percentages, 0 in return percentages, 0 in intermediate format
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++98 line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++98 line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C -std=gnu++98 line 73: is 162:should be 81

I've not had a chance to look any further yet. Thanks, Alan
Re: [PATCH] Fix PR68559
On 27/11/15 14:13, Richard Biener wrote: The following fixes the excessive peeling for gaps we do when doing SLP now that I removed most of the restrictions on having gaps in the first place. This should make low-trip vectorized loops more efficient (sth also the combine-epilogue-with-vectorized-body-by-masking patches claim to do). Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. Richard. 2015-11-27 Richard Biener PR tree-optimization/68559 * tree-vect-data-refs.c (vect_analyze_group_access_1): Move peeling for gap checks ... * tree-vect-stmts.c (vectorizable_load): ... here and relax for SLP. * tree-vect-loop.c (vect_analyze_loop_2): Re-set LOOP_VINFO_PEELING_FOR_GAPS before re-trying without SLP. * gcc.dg/vect/slp-perm-4.c: Adjust again. * gcc.dg/vect/pr45752.c: Likewise. Since this, we have FAIL: gcc.dg/vect/pr45752.c -flto -ffat-lto-objects scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0 FAIL: gcc.dg/vect/pr45752.c scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0 on aarch64 platforms (aarch64-none-linux-gnu, aarch64-none-elf, aarch64_be-none-elf). Thanks, Alan
[PATCH] Empty redirect_edge_var_map after each pass and function
This follows on from discussion at https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03392.html

To recap: Starting in r229479 and continuing until at least 229711, compiling polynom.c from spec2000 on aarch64-none-linux-gnu, with options -O3 -mcpu=cortex-a53 -ffast-math (on both cross, native bootstrapped, and native --disable-bootstrap compilers), produced a verify_gimple ICE after unswitch:

../spec2000/benchspec/CINT2000/254.gap/src/polynom.c: In function 'NormalizeCoeffsListx':
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: incompatible types in PHI argument 0
 TypHandle NormalizeCoeffsListx ( hdC )
           ^
long int
int
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: location references block not in block tree
l1_279 = PHI <1(28), l1_299(33)>
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: invalid PHI argument
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: internal compiler error: tree check: expected class 'type', have 'declaration' (namespace_decl) in useless_type_conversion_p, at gimple-expr.c:84
0xd110ef tree_class_check_failed(tree_node const*, tree_code_class, char const*, int, char const*)
	../../gcc-fsf/gcc/tree.c:9643
0x82561b tree_class_check
	../../gcc-fsf/gcc/tree.h:3042
0x82561b useless_type_conversion_p(tree_node*, tree_node*)
	../../gcc-fsf/gcc/gimple-expr.c:84
0xaca043 verify_gimple_phi
	../../gcc-fsf/gcc/tree-cfg.c:4673
0xaca043 verify_gimple_in_cfg(function*, bool)
	../../gcc-fsf/gcc/tree-cfg.c:4967
0x9c2e0b execute_function_todo
	../../gcc-fsf/gcc/passes.c:1967
0x9c360b do_per_function
	../../gcc-fsf/gcc/passes.c:1659
0x9c3807 execute_todo
	../../gcc-fsf/gcc/passes.c:2022

I was not able to reduce the testcase below about 30k characters: e.g. a '#define T_VOID 0' plus a use of T_VOID produced the ICE, but manually changing that use to 0 prevented it; as did running the preprocessor as a separate step, or a wide variety of options (e.g. -fdump-tree-alias).
In the end I traced this to loop_unswitch reading stale values from the edge redirect map, which is keyed on 'edge' (a pointer to struct edge_def); the map entries had been left there by pass_dominator (on a different function), and by "chance" the edge *pointers* were the same as those of some current edge_defs (even though they pointed to structures created by different allocations, the first of which had since been freed). Hence the fragility of the testcase and environment. While the ICE is prevented merely by adding a call to redirect_edge_var_map_destroy at the end of pass_dominator::execute, given the fragility of the bug, difficulty of reducing the testcase, and the low overhead of emptying an already-empty map, I believe the right fix is to empty the map as often as we can correctly do so, hence this patch - based substantially on Richard's comments in PR/68117. Bootstrapped + check-gcc + check-g++ on x86_64 linux, based on r231105; I've also built SPEC2000 on aarch64-none-linux-gnu by applying this patch (roughly) onto the previously-failing r229711, which also passes aarch64 bootstrap, and a more recent bootstrap on aarch64 is ongoing. Assuming/if no regressions there... Is this ok for trunk? This could also be a candidate for the 5.3 release; backporting depends only on the (fairly trivial) r230357. gcc/ChangeLog: Alan Lawrence <alan.lawre...@arm.com> Richard Biener <richard.guent...@gmail.com> * cfgexpand.c (pass_expand::execute): Replace call to redirect_edge_var_map_destroy with redirect_edge_var_map_empty. * tree-ssa.c (delete_tree_ssa): Likewise. * function.c (set_cfun): Call redirect_edge_var_map_empty. * passes.c (execute_one_ipa_transform_pass, execute_one_pass): Likewise. * tree-ssa.h (redirect_edge_var_map_destroy): Remove. (redirect_edge_var_map_empty): New. * tree-ssa.c (redirect_edge_var_map_destroy): Remove. (redirect_edge_var_map_empty): New.
---
 gcc/cfgexpand.c | 2 +-
 gcc/function.c  | 2 ++
 gcc/passes.c    | 2 ++
 gcc/tree-ssa.c  | 8
 gcc/tree-ssa.h  | 2 +-
 5 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 1990e10..ede1b82 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -6291,7 +6291,7 @@ pass_expand::execute (function *fun)
   expand_phi_nodes ();
 
   /* Release any stale SSA redirection data.  */
-  redirect_edge_var_map_destroy ();
+  redirect_edge_var_map_empty ();
 
   /* Register rtl specific functions for cfg.  */
   rtl_register_cfg_hooks ();
diff --git a/gcc/function.c b/gcc/function.c
index 515d7c0..e452865 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -75,6 +75,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-chkp.h"
 #include "rtl-chkp.h"
 #include "tree-dfa.h"
+#include "tree-ssa.h"
 
 /* So we
Re: [PATCH] Fix PR68067
On 27/11/15 15:07, Alan Lawrence wrote: On 23/11/15 09:43, Richard Biener wrote: On Fri, 20 Nov 2015, Alan Lawrence wrote: ...the asserts you suggested in (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D68117#c27)... >> So I have to ask, how sure are you that those assertions are(/should be!) "correct"? :) Ideally they should be correct but they happen to be not (and I think the intent was that this should be harmless). Basically I tried to assert that nobody creates stale edge redirect data that is not later consumed or cleared. Happens to be too optimistic :/ Maybe so, but it looks like the edge_var_redirect_map is still suspect here. On the ~~28th call to loop_version, from tree_unswitch_loop, the call to lv_flush_pending_stmts executes (tree-cfg.c flush_pending_stmts): def = redirect_edge_var_map_def (vm); add_phi_arg (phi, def, e, redirect_edge_var_map_location(vm)); and BLOCK_LOCATION (redirect_edge_var_map_location(vm)) is < 0x7fb7704a80 side-effects addressable asm_written used protected static visited tree_0 tree_2 tree_5> so yeah, next question, how'd that get there... A. Well, pass_dominator::execute calls redirect_edge_var_map with that edge pointer, at which time the edge is from from 32 (0x7fb79cc6e8) to block 20 (0x7fb7485e38), and locus is 2147483884; and then again, with locus 0. With no intervening calls to redirect_edge_var_map_clear for that edge, loop_version's call to flush_pending_statements then reads redirect_edge_var_map_vector for that edge pointer - which is now an edge from block 126 (0x7fb7485af8) to 117 (0x7fb74856e8). It sees those locations (2147483884 and 0)... Clearing the edge redirect map at the end of pass_dominator fixes the ICE (as would clearing it at the end of each stage, or clearing it at the beginning of loop_unswitch, I guess). I'll post a patch after more testing, but obviously I'm keen to hear if there are obvious problems with the approach? 
And coming up with a testcase, well, heh - this broke because of identical pointers to structures allocated at different times, with intervening free... ideas welcome of course!

--Alan
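The failure mode described above - a table keyed by edge *pointers*, where an edge is freed and an unrelated edge is later allocated at the same address - is hard to trigger deterministically from Fortran source, but easy to model in isolation. The following is NOT GCC code; it is a minimal sketch of the hazard class, with all names invented, using placement new on a fixed slot so the address reuse is guaranteed rather than left to the allocator:

```cpp
#include <cassert>
#include <map>
#include <new>

// Stand-in for GCC's edge structure; the pending-data table is keyed
// by the edge's *address*, like the edge var-map discussed above.
struct Edge { int src, dst; };

static std::map<const Edge *, long> pending_locus;

// Destroy one edge, construct an unrelated one in the same storage, and
// report what a consumer (cf. flush_pending_stmts) reads for the new edge.
static long stale_lookup ()
{
  static alignas (Edge) unsigned char slot[sizeof (Edge)];

  Edge *e1 = new (slot) Edge{32, 20};    // edge 32 -> 20, as in the report
  pending_locus[e1] = 2147483884L;       // pending data, never cleared
  e1->~Edge ();                          // edge dies; map entry survives

  Edge *e2 = new (slot) Edge{126, 117};  // unrelated edge, same address
  std::map<const Edge *, long>::const_iterator it = pending_locus.find (e2);
  return it == pending_locus.end () ? -1 : it->second;  // stale locus resurfaces
}
```

The lookup for the new edge finds the old edge's entry purely because the key is an address that got recycled - which is exactly why clearing the map at the end of the producing pass (or before any pass that may reallocate edges) closes the hole.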
Re: [PATCH 1/2][ARM] PR/65956 AAPCS update for alignment attribute
On 26 November 2015 at 14:00, Alan Lawrence <alan.lawre...@arm.com> wrote:
> On 6 November 2015 at 16:59, Jakub Jelinek <ja...@redhat.com> wrote:
>> In any case, to manually reproduce, compile
>>   gnatmake -g -gnatws macrosub.adb
>> with GCC 5.1.1 (before the ARM changes) and then try to run that process
>> against GCC 5.2.1 (after the ARM changes) libgnat-5.so, which is what
>> make check does (it uses host_gnatmake to compile the support stuff, so
>> ideally the processes built by host gcc/gnatmake should not be run with
>>   LD_LIBRARY_PATH=$ADA_INCLUDE_PATH:$BASE:$LD_LIBRARY_PATH
>> in the environment, and others should).
>> In macrosub in particular, the problem is in:
>>   WHILE NOT END_OF_FILE (INFILE1) LOOP
>>      GET_LINE (INFILE1, A_LINE, A_LENGTH);
>> in FILL_TABLE, where A_LINE'First is 0 and A_LINE'Last is 400 (if I
>> remember right), but if you step into GET_LINE compiled by GCC 5.2.1,
>> Item'First and Item'Last don't match that.
>
> Ok, I see the mismatch now.

The type affected in Jakub's case here is an Ada String, which looks like this (tree dump heavily garbled in the archive, so elided):

  [record type dump: size 64, align 64, DImode; fields P_ARRAY and P_BOUNDS]

i.e. a 64-bit DImode struct, with alignment set to 64, containing P_ARRAY, a 32-bit pointer with alignment 32, and P_BOUNDS, a 32-bit pointer with alignment 32, pointing to a record (of size 64, alignment 32, containing two 32-bit ints LB0 and UB0).
AFAICT, in the fill_table/get_line case, the first parameter to get_line is a file, a simple pointer; then we have a string. So fill_table (compiled with 5.1, doubleword aligned) should pass the string P_ARRAY in r2 and P_BOUNDS in r3.

   0x1f334 <getsubs__fill_table+184>	movt	r3, #2
   0x1f338 <getsubs__fill_table+188>	str	r3, [r11, #-504]	; 0x
   0x1f33c <getsubs__fill_table+192>	sub	r3, r11, #508	; 0x1fc
   0x1f340 <getsubs__fill_table+196>	ldrd	r2, [r3]
   0x1f344 <getsubs__fill_table+200>	mov	r0, r1
   0x1f348 <getsubs__fill_table+204>	bl	0x1aee4
Re: [PATCH] Fix PR68067
On 23/11/15 09:43, Richard Biener wrote:
> On Fri, 20 Nov 2015, Alan Lawrence wrote:
>> ...the asserts you suggested in
>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68117#c27)...
>> So I have to ask, how sure are you that those assertions are(/should
>> be!) "correct"? :)
>
> Ideally they should be correct but they happen to be not (and I think
> the intent was that this should be harmless).  Basically I tried to
> assert that nobody creates stale edge redirect data that is not later
> consumed or cleared.  Happens to be too optimistic :/

Maybe so, but it looks like the edge_var_redirect_map is still suspect here. On the ~~28th call to loop_version, from tree_unswitch_loop, the call to lv_flush_pending_stmts executes (tree-cfg.c, flush_pending_stmts):

  def = redirect_edge_var_map_def (vm);
  add_phi_arg (phi, def, e, redirect_edge_var_map_location (vm));

and BLOCK_LOCATION (redirect_edge_var_map_location (vm)) is

  <... 0x7fb7704a80 side-effects addressable asm_written used protected static visited tree_0 tree_2 tree_5>

so yeah, next question, how'd that get there... A.
Re: [PATCH] Improve verification of loop->latch in verify_loop_structure
This caused an ICE compiling value.c from gdb on aarch64-none-linux-gnu; the testcase, after preprocessing on aarch64, ICEs on both aarch64 and x86_64, but is about 1MB - I'm working on reducing that down to something small enough to post...

$ ./gcc/xgcc -B ./gcc -O2 -g value.c
../../binutils-gdb/gdb/value.c: In function ‘show_convenience’:
../../binutils-gdb/gdb/value.c:2615:1: error: loop 3’s latch is missing
../../binutils-gdb/gdb/value.c:2615:1: internal compiler error: in verify_loop_structure, at cfgloop.c:1669
0x71e653 verify_loop_structure()
	/work/alalaw01/src2/gcc/gcc/cfgloop.c:1669
0x97c6ae checking_verify_loop_structure
	/work/alalaw01/src2/gcc/gcc/cfgloop.h:325
0x97c6ae loop_optimizer_init(unsigned int)
	/work/alalaw01/src2/gcc/gcc/loop-init.c:106
0x97c78a rtl_loop_init
	/work/alalaw01/src2/gcc/gcc/loop-init.c:398
0x97c78a execute
	/work/alalaw01/src2/gcc/gcc/loop-init.c:425

--Alan
Re: [PATCH] Improve verification of loop->latch in verify_loop_structure
Here's a reduced testcase (reduced to the point of generating lots of warnings; I'm compiling with -O2 -w, on x86_64):

struct __jmp_buf_tag { };
typedef struct __jmp_buf_tag sigjmp_buf[1];
extern struct cmd_list_element *showlist;
struct internalvar { struct internalvar *next; };
static struct internalvar *internalvars;

struct internalvar *
create_internalvar (const char *name)
{
  struct internalvar *var
    = ((struct internalvar *) xmalloc (sizeof (struct internalvar)));
  internalvars = var;
}

void
show_convenience (char *ignore, int from_tty)
{
  struct gdbarch *gdbarch = get_current_arch ();
  int varseen = 0;
  for (struct internalvar *var = internalvars; var; var = var->next)
    {
      if (!varseen)
        varseen = 1;
      sigjmp_buf *buf = exceptions_state_mc_init ();
      __sigsetjmp ();
      while (exceptions_state_mc_action_iter ())
        while (exceptions_state_mc_action_iter_1 ())
          ;
    }
  if (!varseen)
    printf_unfiltered ();
}

On 26 November 2015 at 11:33, Alan Lawrence <alan.lawre...@arm.com> wrote:
> This caused an ICE compiling value.c from gdb on
> aarch64-none-linux-gnu; the testcase, after preprocessing on aarch64,
> ICEs on both aarch64 and x86_64, but is about 1MB - I'm working on
> reducing that down to something small enough to post...
>
> $ ./gcc/xgcc -B ./gcc -O2 -g value.c
> ../../binutils-gdb/gdb/value.c: In function ‘show_convenience’:
> ../../binutils-gdb/gdb/value.c:2615:1: error: loop 3’s latch is missing
> ../../binutils-gdb/gdb/value.c:2615:1: internal compiler error: in
> verify_loop_structure, at cfgloop.c:1669
> 0x71e653 verify_loop_structure()
>	/work/alalaw01/src2/gcc/gcc/cfgloop.c:1669
> 0x97c6ae checking_verify_loop_structure
>	/work/alalaw01/src2/gcc/gcc/cfgloop.h:325
> 0x97c6ae loop_optimizer_init(unsigned int)
>	/work/alalaw01/src2/gcc/gcc/loop-init.c:106
> 0x97c78a rtl_loop_init
>	/work/alalaw01/src2/gcc/gcc/loop-init.c:398
> 0x97c78a execute
>	/work/alalaw01/src2/gcc/gcc/loop-init.c:425
>
> --Alan
Re: [PATCH 1/2][ARM] PR/65956 AAPCS update for alignment attribute
On 6 November 2015 at 16:59, Jakub Jelinek wrote:
> In any case, to manually reproduce, compile
>   gnatmake -g -gnatws macrosub.adb
> with GCC 5.1.1 (before the ARM changes) and then try to run that process
> against GCC 5.2.1 (after the ARM changes) libgnat-5.so, which is what
> make check does (it uses host_gnatmake to compile the support stuff, so
> ideally the processes built by host gcc/gnatmake should not be run with
>   LD_LIBRARY_PATH=$ADA_INCLUDE_PATH:$BASE:$LD_LIBRARY_PATH
> in the environment, and others should).
> In macrosub in particular, the problem is in:
>   WHILE NOT END_OF_FILE (INFILE1) LOOP
>      GET_LINE (INFILE1, A_LINE, A_LENGTH);
> in FILL_TABLE, where A_LINE'First is 0 and A_LINE'Last is 400 (if I
> remember right), but if you step into GET_LINE compiled by GCC 5.2.1,
> Item'First and Item'Last don't match that.

Ok, I see the mismatch now. However, to get there, I had to use my 5.1 gnatmake -g -gnatws macrosub.ads --rts=/path/to/5.2/arm-none-linux-gnueabihf/libada, as if I ran 5.1 gnatmake without that flag, I did not manage to get the wrong value passed/received with LD_LIBRARY_PATH set to any of build-5.2/gcc/ada/rts, build-5.2/arm-none-linux-gnueabihf/libada, or build-5.2/arm-none-linux-gnueabihf/libada/adalib (any further suggestions?). [Also I note 'LD_DEBUG=all ./macrosub' does not show libgnat being loaded that way.]

With 5.1 gnatmake -g -gnatws macrosub.ads --rts=/path/to/5.2/arm-none-linux-gnueabihf/libada:

$ gdb ./macrosub
GNU gdb (Ubuntu 7.7-0ubuntu3) 7.7
[snip]
Reading symbols from ./macrosub...done.
(gdb) break get_line
Breakpoint 1 at 0x1aeec: get_line. (4 locations)
(gdb) run
Starting program: /home/alalaw01/build-5.1.0/gcc/testsuite/ada/acats/support/macrosub
BEGINNING MACRO SUBSTITUTIONS.

Breakpoint 1, ada.text_io.get_line (item=...) at a-tigeli.adb:41
41	procedure Get_Line
(gdb) print item'first
$1 = -443273216
(gdb) print item'last
$2 = -514850813
(gdb) n
146	FIO.Check_Read_Status (AP (File));
(gdb) n
152	if Item'First > Item'Last then
(gdb) print item'first
$3 = 1
(gdb) print item'last
$4 = 0
(gdb) up
#1  0x0001f34c in getsubs.fill_table () at getsubs.adb:122
122	GET_LINE (INFILE1, A_LINE, A_LENGTH);
(gdb) print a_line'first
$5 = 1
(gdb) print a_line'last
$6 = 400

So yes, we have an ABI change; which is not entirely unexpected. So, questions:

(1) Why does LD_LIBRARY_PATH affect your system, not mine (i.e. if this is because my gnatmake is building with static linking, then why)? This is maybe the least interesting question so I'm leaving it for now...

(2) If/when LD_LIBRARY_PATH does have an effect - as you say, things compiled with host gnatmake should be run against host libraries, not against target libraries. Otherwise, potentially *any* gcc ABI change can break the build process, right? So I think this is of interest regardless of the ARM AAPCS change, but I will be slightly presumptuous and hope that the Adacore folk will pick this up... [CC Eric]

(3) Has the ARM AAPCS change had an effect that we didn't mean it to? I don't see any evidence so far that this is _necessarily_ the case, but I will look into this, bearing Florian's advice in mind (thanks!)...

--Alan
Re: [PATCH] libstdc++: Fix libstdc++/67440: pretty-printing of a const set fails
On 16/11/15 21:04, Doug Evans wrote:
> Hi.  Apologies for the delay.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67440
> Tested with current trunk.
>
> 2015-11-16  Doug Evans
>
> 	PR libstdc++/67440
> 	* python/libstdcxx/v6/printers.py (find_type): Handle "const" in
> 	type name.
> 	* testsuite/libstdc++-prettyprinters/debug.cc: Add test for const set.
> 	* testsuite/libstdc++-prettyprinters/simple.cc: Ditto.
> 	* testsuite/libstdc++-prettyprinters/simple11.cc: Ditto.

On gcc-5-branch, the debug.cc and simple.cc tests don't seem to compile, on either x86_64-none-linux-gnu or aarch64-none-linux-gnu. I get errors like:

/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc: In function 'int main()':
/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:77:43: error: in C++98 'const_intset' must be initialized by constructor, not by '{...}'
   const std::set<int> const_intset = {2, 3};
                                           ^
In file included from /work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/map:60:0,
                 from /work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:31:
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_tree.h: In instantiation of 'void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_insert_unique(_II, _II) [with _InputIterator = int; _Key = int; _Val = int; _KeyOfValue = std::_Identity<int>; _Compare = std::less<int>; _Alloc = std::allocator<int>]':
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_set.h:171:4:   required from 'std::set<_Key, _Compare, _Alloc>::set(_InputIterator, _InputIterator) [with _InputIterator = int; _Key = int; _Compare = std::less<int>; _Alloc = std::allocator<int>]'
/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:77:43:   required from here
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_tree.h:2224:29: error: invalid type argument of unary '*' (have 'int')
   _M_insert_unique_(end(), *__first, __an);
                            ^
compiler exited with status 1
UNRESOLVED: libstdc++-prettyprinters/simple.cc compilation failed to produce executable
Spawning: gdb -nw -nx -quiet -batch -ex "python print(gdb.type_printers)"
spawn gdb -nw -nx -quiet -batch -ex python print(gdb.type_printers)
[]
spawn gdb -nx -nw -quiet -batch -x simple.gdb
simple.gdb:2: Error in sourced command file:
./simple.exe: No such file or directory.
skipping: simple.gdb:2: Error in sourced command file:
skipping: ./simple.exe: No such file or directory.
UNSUPPORTED: libstdc++-prettyprinters/simple.cc

I also see signs of the same on powerpc64le (https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg02699.html); the test looks to be passing on trunk on all three platforms.

Thanks, Alan
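The first error pins the problem down: `const std::set<int> const_intset = {2, 3};` uses initializer-list construction, which only exists from C++11, and gcc-5-branch builds these tests under the C++98 default. For illustration only (this is not the actual testsuite fix, which is up to the patch author), a C++98-compatible way to initialize a const set is the iterator-range constructor:

```cpp
#include <cassert>
#include <set>

// Hypothetical C++98-compatible replacement for
//   const std::set<int> const_intset = {2, 3};
// (initializer lists need C++11): construct from an array range instead.
static const int init_vals[] = {2, 3};
static const std::set<int> const_intset (init_vals,
                                         init_vals
                                         + sizeof (init_vals) / sizeof (init_vals[0]));

// Small helper so the contents can be checked.
static bool in_const_set (int v)
{
  return const_intset.count (v) == 1;
}
```

Guarding the braced form with `#if __cplusplus >= 201103L` (as other prettyprinter tests do for C++11-only constructs) would be an alternative.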
Re: [patch] GSoC: Implement std::experimental::shared_ptr
On 14/11/15 13:19, David Edelsohn wrote:
> The copy_ctor_neg testcase fails on AIX.

Also on ARM and AArch64 platforms (arm-none-linux-gnueabihf, arm-none-eabi, aarch64-none-linux-gnu, aarch64-none-elf), with similar error messages.

Thanks, Alan
Re: [PATCH] GCC system.h and Graphite header order
I note doc/install.texi says that gcc uses "ISL Library version 0.15, 0.14, 0.13, or 0.12.2". This patch breaks the build with 0.12.2 (a subset of the errors below), but seems fine with 0.14. I haven't tested 0.13. Do we want to update install.texi?

Cheers, Alan

In file included from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/list.h:13:0,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/aff_type.h:4,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/local_space.h:4,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/constraint.h:13,
                 from /work/alalaw01/src/gcc/gcc/graphite-optimize-isl.c:41:
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:108:8: error: attempt to use poisoned "malloc"
  malloc(size)))
        ^
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:112:8: error: attempt to use poisoned "realloc"
  realloc(ptr,size)))
         ^
/usr/include/c++/4.8/bits/locale_facets.h:2566:44: error: macro "isdigit" passed 2 arguments, but takes just 1
     isdigit(_CharT __c, const locale& __loc)
                                            ^
/usr/include/c++/4.8/bits/locale_facets.h:2572:44: error: macro "ispunct" passed 2 arguments, but takes just 1
     ispunct(_CharT __c, const locale& __loc)
                                            ^
/usr/include/c++/4.8/bits/locale_facets.h:2578:45: error: macro "isxdigit" passed 2 arguments, but takes just 1
     isxdigit(_CharT __c, const locale& __loc)
                                             ^
/usr/include/c++/4.8/bits/locale_facets.h:2584:44: error: macro "isalnum" passed 2 arguments, but takes just 1
     isalnum(_CharT __c, const locale& __loc)
                                            ^
/usr/include/c++/4.8/bits/locale_facets.h:2590:44: error: macro "isgraph" passed 2 arguments, but takes just 1
     isgraph(_CharT __c, const locale& __loc)
                                            ^
/usr/include/c++/4.8/bits/locale_facets.h:2596:44: error: macro "toupper" passed 2 arguments, but takes just 1
     toupper(_CharT __c, const locale& __loc)
                                            ^
/usr/include/c++/4.8/bits/locale_facets.h:2602:44: error: macro "tolower" passed 2 arguments, but takes just 1
     tolower(_CharT __c, const locale& __loc)
                                            ^
In file included from /usr/include/c++/4.8/bits/basic_ios.h:37:0,
                 from /usr/include/c++/4.8/ios:44,
                 from /usr/include/c++/4.8/ostream:38,
                 from /usr/include/c++/4.8/iostream:39,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/int.h:17,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:16,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/list.h:13,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/aff_type.h:4,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/local_space.h:4,
                 from /work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/constraint.h:13,
                 from /work/alalaw01/src/gcc/gcc/graphite-scop-detection.c:52:
/usr/include/c++/4.8/bits/locale_facets.h:2530:5: error: ‘std::isspace’ declared as an ‘inline’ variable
     isspace(_CharT __c, const locale& __loc)
     ^
/usr/include/c++/4.8/bits/locale_facets.h:2530:5: error: template declaration of ‘bool std::isspace’
/usr/include/c++/4.8/bits/locale_facets.h:2531:7: error: expected primary-expression before ‘return’
     { return use_facet<ctype<_CharT> >(__loc).is(ctype_base::space, __c); }
       ^
/usr/include/c++/4.8/bits/locale_facets.h:2531:7: error: expected ‘}’ before ‘return’
/usr/include/c++/4.8/bits/locale_facets.h:2536:5: error: ‘isprint’ declared as an ‘inline’ variable
     isprint(_CharT __c, const locale& __loc)
     ^
/usr/include/c++/4.8/bits/locale_facets.h:2536:5: error: template declaration of ‘bool isprint’
/usr/include/c++/4.8/bits/locale_facets.h:2537:7: error: expected primary-expression before ‘return’
     { return use_facet<ctype<_CharT> >(__loc).is(ctype_base::print, __c); }
       ^
/usr/include/c++/4.8/bits/locale_facets.h:2537:7: error: expected ‘}’ before ‘return’
/usr/include/c++/4.8/bits/locale_facets.h:2537:75: error: expected declaration before ‘}’ token
     { return use_facet<ctype<_CharT> >(__loc).is(ctype_base::print, __c); }
Re: [PATCH] Fix PR68067
On 6 November 2015 at 10:39, Richard Biener wrote:
>> ../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: location references block not in block tree
>> l1_279 = PHI <1(28), l1_299(33)>
>
> ^^^
>
> this is the error to look at!  It means that the GC heap will be
> corrupted quite easily.

This looked very similar to PR68117 - the invalid phi arg, and block not in block-tree, even if not the invalid tree code - and as the posters there were having success with valgrind, whereas I wasn't, I watched and waited.

First observation is that it triggers the asserts you suggested in comment 27 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68117#c27). Indeed, it fails those asserts even after the patch in comment 25 (committed as r230594) to tree-ssa.c (delete_tree_ssa), the patch in comment #35 to function.c (set_cfun), and the patch in comment #30 (committed as r230424) to cfgexpand.c (pass_expand::execute). The patch in comment #29 (which replaces the asserts in comment #27 with empties), however, fixes the problem - although I can't rule out that that's just by changing the memory allocation pattern.

Moreover, if I take those patches and rebase onto a recent trunk (onto which the delete_tree_ssa and pass_expand::execute patches have already been committed), i.e. just adding the assertions from comment #27 and the call in function.c (set_cfun) - the assertions are still failing on my testcase, whereas the original (assertionless) failure was very erratic, and had since disappeared/been hidden on trunk. Indeed those same assertions break in a few other places (even in a --disable-bootstrap build after gcc/xgcc is built), so I feel I have a good chance of producing a reasonable assertion-breaking testcase.

So I have to ask, how sure are you that those assertions are(/should be!) "correct"? :)

--Alan
Re: [PATCH] Avoid useless work in loop vectorization
On 13/11/15 08:41, Richard Biener wrote:
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>
> Richard.
>
> 2015-11-13  Richard Biener
>
> 	* tree-vect-loop.c (vect_analyze_loop_2): Add fatal parameter.
> 	Signal fatal failure if early checks fail.
> 	(vect_analyze_loop): If vect_analyze_loop_2 fails fatally
> 	do not bother testing further vector sizes.

It seems that on AArch64 this causes:

FAIL: gcc.dg/vect/vect-outer-1-big-array.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a-big-array.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b-big-array.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b-big-array.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-2b.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-2b.c scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-3b.c -flto -ffat-lto-objects  scan-tree-dump-times vect "grouped access in outer loop" 4
FAIL: gcc.dg/vect/vect-outer-3b.c scan-tree-dump-times vect "grouped access in outer loop" 4

Still there on r230556, I haven't dug any further yet.

Thanks, Alan
Re: C++ PATCH to integrate c++-delayed-folding branch
On 14/11/15 00:07, Jason Merrill wrote:
> And here's the final patch integrating the delayed folding branch.
>
> The general idea is to mostly avoid folding until the end of the
> function, at which point we fold everything as part of genericization.
> Since many warnings rely on looking at folded trees, we fold the
> arguments that we pass to the various warning functions.  To avoid
> issues with combinatorial explosion, we cache the results of folding in
> a hash_map.  In the future we probably want to move many of these
> warnings into language-independent code so we can avoid folding for them
> in the front end; we also shouldn't need to fold during genericization,
> but apparently not doing that leads to optimization regressions.
>
> This is mostly Kai's work; I just cleaned it up a bit to get it ready
> for the merge.  Marek also helped some.  The largest change is hooking
> into the GTY machinery to handle throwing away the folding cache rather
> than trying to do it manually in various places.
>
> Tested x86_64-pc-linux-gnu, applying to trunk.

Also PR68385 (arm-none-eabi).
Re: [PATCH] PR/67682, break SLP groups up if only some elements match
On 16/11/15 14:42, Christophe Lyon wrote:
> Hi Alan,
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.

Thanks - yes I see it now.  -fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}

but while a[0...3] end up containing 5 7 9 11 as expected, a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70. That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10), being multiplied by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair that should extract elements {0,2,4,6} of b.

Assembler, with my workings as to what's in each register:

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:b
	movt	r2, #:upper16:b
	vldr	d22, .L11
	vldr	d23, .L11+8
	;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
	movw	r3, #:lower16:a
	movt	r3, #:upper16:a
	vld1.64	{d16-d17}, [r2:64]
	;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
	vmov	q9, q8  @ v4si
	;; q9 = (b[2] b[3] b[0] b[1])
	vldr	d20, [r2, #16]
	vldr	d21, [r2, #24]
	;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
	vuzp.32	q10, q9
	;; So q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
	;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
	vldr	d20, .L11+16
	vldr	d21, .L11+24
	;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
	vmul.i32	q9, q9, q11
	;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
	;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
	vadd.i32	q8, q8, q10
	;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
	;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
	vst1.64	{d16-d17}, [r3:64]
	;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3] = b[3]+4  all ok
	vstr	d18, [r3, #16]
	;; a[4] = b[6]*3, a[5] = b[4]*4
	vstr	d19, [r3, #24]
	;; a[6] = b[2]*5, a[7] = b[0]*7
	bx	lr
.L12:
	.align	3
.L11:
	.word	3
	.word	4
	.word	5
	.word	7
	.word	1
	.word	2
	.word	3
	.word	4

Which is to say - the bit order in the q-registers is neither big- nor little-endian, but the elements get stored back to memory in a consistent order with how they were loaded, so we're OK as long as there are no permutes. Unfortunately for UZP this lane-ordering mixup is not idempotent and messes everything up...

Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing armeb's whole register-numbering/lane-flipping scheme, and might be quite a large task. OTOH it might also fix the significant number of failing vectorizer tests. A simpler solution might be to disable... some part of vector support on armeb, but I'm not sure which part would be best, yet.

Thoughts (CC maintainers)?

--Alan
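For anyone following along, the arithmetic above can be checked with a scalar model. The b[] values are inferred from this message (a[0..3] = b[0..3] + {1,2,3,4} yielded 5 7 9 11, and the expected/observed products pin down b[4] and b[6]; b[5] and b[7] are unused by the even-lane extract, so set arbitrarily). This is a model of the lane selection only, not of the NEON instructions themselves:

```cpp
#include <cassert>
#include <vector>

// b[] as inferred from the dump in this thread; b[5], b[7] arbitrary.
static const int b_vals[8] = {4, 5, 6, 7, 8, 0, 10, 0};
static const int mult[4]   = {3, 4, 5, 7};

// What a UZP1-style "even elements of b" should feed the multiply.
static std::vector<int> even_lanes_correct ()
{
  std::vector<int> r;
  for (int i = 0; i < 8; i += 2)
    r.push_back (b_vals[i]);          // {b0, b2, b4, b6} = {4, 6, 8, 10}
  return r;
}

// armeb's register numbering effectively reverses lane order across the
// q-register, so the same permute picks the lanes back to front.
static std::vector<int> even_lanes_armeb_bug ()
{
  std::vector<int> r = even_lanes_correct ();
  return std::vector<int> (r.rbegin (), r.rend ());  // {10, 8, 6, 4}
}

static std::vector<int> products (const std::vector<int> &lanes)
{
  std::vector<int> r (4);
  for (int i = 0; i < 4; ++i)
    r[i] = lanes[i] * mult[i];
  return r;
}
```

The correct lane order gives the expected a[4..7] = {12, 24, 40, 70}; the reversed order reproduces the observed {30, 32, 30, 28} exactly, supporting the lane-numbering diagnosis.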
Re: [PATCH][i386]Migrate reduction optabs to reduc__scal
On 03/11/15 14:27, Alan Lawrence wrote:

This migrates the various reduction optabs in sse.md to use the reduce-to-scalar form. I took the straightforward approach (equivalent to the migration code in expr.c/optabs.c) of generating a vector temporary, using the existing code to reduce to that, and extracting lane 0, in each pattern.

Bootstrapped + check-gcc + check-g++.

Ok for trunk?

gcc/ChangeLog:

	* config/i386/sse.md (reduc_splus_v8df): Rename to...
	(reduc_plus_scal_v8df): ...here; reduce to temp and extract scalar.
	(reduc_splus_v4df): Rename to...
	(reduc_plus_scal_v4df): ...here; reduce to temp and extract scalar.
	(reduc_splus_v2df): Rename to...
	(reduc_plus_scal_v2df): ...here; reduce to temp and extract scalar.
	(reduc_splus_v16sf): Rename to...
	(reduc_plus_scal_v16sf): ...here; reduce to temp and extract scalar.
	(reduc_splus_v8sf): Rename to...
	(reduc_plus_scal_v8sf): ...here; reduce to temp and extract scalar.
	(reduc_splus_v4sf): Rename to...
	(reduc_plus_scal_v4sf): ...here; reduce to temp and extract scalar.
	(reduc__, all 3 variants): Rename each to...
	(reduc__scal_): ...here; reduce to temp and extract scalar.
	(reduc_umin_v8hf): Rename to...
	(reduc_umin_scal_v8hf): ...here; reduce to temp and extract scalar.

---
 gcc/config/i386/sse.md | 82 +++---
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 43dcc6a..041e514 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2424,73 +2424,85 @@
    (set_attr "prefix_rep" "1,*")
    (set_attr "mode" "V4SF")])

-(define_expand "reduc_splus_v8df"
-  [(match_operand:V8DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V8DF 1 "register_operand")]
   "TARGET_AVX512F"
 {
-  ix86_expand_reduc (gen_addv8df3, operands[0], operands[1]);
+  rtx tmp = gen_reg_rtx (V8DFmode);
+  ix86_expand_reduc (gen_addv8df3, tmp, operands[1]);
+  emit_insn (gen_vec_extractv8df (operands[0], tmp, const0_rtx));
   DONE;
 })

-(define_expand "reduc_splus_v4df"
-  [(match_operand:V4DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V4DF 1 "register_operand")]
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V4DFmode);
   rtx tmp2 = gen_reg_rtx (V4DFmode);
+  rtx vec_res = gen_reg_rtx (V4DFmode);
   emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
   emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
+  emit_insn (gen_vec_extractv4df (operands[0], vec_res, const0_rtx));
   DONE;
 })

-(define_expand "reduc_splus_v2df"
-  [(match_operand:V2DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v2df"
+  [(match_operand:DF 0 "register_operand")
    (match_operand:V2DF 1 "register_operand")]
   "TARGET_SSE3"
 {
-  emit_insn (gen_sse3_haddv2df3 (operands[0], operands[1], operands[1]));
+  rtx tmp = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
+  emit_insn (gen_vec_extractv2df (operands[0], tmp, const0_rtx));
   DONE;
 })

-(define_expand "reduc_splus_v16sf"
-  [(match_operand:V16SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v16sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V16SF 1 "register_operand")]
   "TARGET_AVX512F"
 {
-  ix86_expand_reduc (gen_addv16sf3, operands[0], operands[1]);
+  rtx tmp = gen_reg_rtx (V16SFmode);
+  ix86_expand_reduc (gen_addv16sf3, tmp, operands[1]);
+  emit_insn (gen_vec_extractv16sf (operands[0], tmp, const0_rtx));
   DONE;
 })

-(define_expand "reduc_splus_v8sf"
-  [(match_operand:V8SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8sf"
+  [(match_operand:SF 0 "register_operand")
    (match_operand:V8SF 1 "register_operand")]
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V8SFmode);
   rtx tmp2 = gen_reg_rtx (V8SFmode);
+  rtx vec_res = gen_reg_rtx (V8SFmode);
   emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
   emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
   emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
-  emit_insn (gen_addv8sf3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
+  emit_insn (gen_vec_extractv8sf (operands[0], vec_res, const0_rtx));
   DONE;
 })

-(define_expand "
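Every hunk in the patch follows the same shape: the old reduc_splus pattern left the sum in a lane of a full vector result, and the new reduc_plus_scal pattern performs the same reduction into a vector temporary and then extracts lane 0 itself. A scalar model of that contract (illustrative only, not target code; N must be a power of two):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Old-style contract: pairwise halving steps (the shape ix86_expand_reduc
// generates) leave the full sum in lane 0 of a *vector* result.
template <std::size_t N>
std::array<double, N> reduc_plus_tmp (std::array<double, N> v)
{
  for (std::size_t half = N / 2; half >= 1; half /= 2)
    for (std::size_t i = 0; i < half; ++i)
      v[i] += v[i + half];
  return v;                      // the vector temporary: lane 0 = sum
}

// New-style contract: do the same reduction, then extract lane 0
// (the gen_vec_extract calls added by the patch).
template <std::size_t N>
double reduc_plus_scal (const std::array<double, N> &v)
{
  return reduc_plus_tmp (v)[0];
}
```

The point of the optab migration is that the extract becomes part of the pattern's contract, so the middle-end no longer has to guess which lane of the vector result holds the answer.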
[Obvious][AArch64] Fix gcc.target/aarch64/vclz.c
The scan-assembler vclz2s 34 test in here has been failing since r230091:

	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
	cost favor vectorized version.

which transforms a load of piecewise array assignments (in tree) into 2-element-vector writes to MEMs, and the first RUN_TEST for the V2SI cases then constant-propagates, because the INHIB_OPTIMIZATION (which clobbers the data arrays) comes after the load from those arrays. Hence, put the INHIB_OPTIMIZATION in the right place. Moreover, the memory clobber does not affect the registers now holding the values loaded anyway, so we can drop the second one altogether.

Tested on aarch64-none-linux-gnu, committed as r230421.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vclz.c: Correctly place INHIB_OPTIMIZATION.
---
 gcc/testsuite/gcc.target/aarch64/vclz.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/vclz.c b/gcc/testsuite/gcc.target/aarch64/vclz.c
index 455ba63..60494a8 100644
--- a/gcc/testsuite/gcc.target/aarch64/vclz.c
+++ b/gcc/testsuite/gcc.target/aarch64/vclz.c
@@ -67,18 +67,13 @@ extern void abort (void);
   CONCAT1 (vclz, POSTFIX (reg_len, data_len, is_signed))

 #define RUN_TEST(test_set, answ_set, reg_len, data_len, is_signed, n)	\
+  INHIB_OPTIMIZATION;							\
   a = LOAD_INST (reg_len, data_len, is_signed) (test_set);		\
   b = LOAD_INST (reg_len, data_len, is_signed) (answ_set);		\
-  INHIB_OPTIMIZATION;							\
   a = CLZ_INST (reg_len, data_len, is_signed) (a);			\
   for (i = 0; i < n; i++)						\
-    {									\
-      INHIB_OPTIMIZATION;						\
-      if (a [i] != b [i])						\
-	{								\
-	  return 1;							\
-	}								\
-    }
+    if (a [i] != b [i])							\
+      return 1;

 int
 test_vclz_s8 ()
--
1.9.1
[PATCH][install.texi] Add note against GNAT 4.8 on ARM targets.
This follows from the discussion here: https://gcc.gnu.org/ml/gcc/2015-10/msg00082.html .

OK for trunk?

--Alan

gcc/ChangeLog:

	* doc/install.texi: Add note against GNAT 4.8 on ARM targets.
---
 gcc/doc/install.texi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 1fd773e..1ce93d4 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -3481,6 +3481,8 @@ require GNU binutils 2.13 or newer.  Such subtargets include:
 @code{arm-*-netbsdelf}, @code{arm-*-*linux-*}
 and @code{arm-*-rtemseabi}.

+Building the Ada frontend commonly fails (an infinite loop executing
+@code{xsinfo}) if the host compiler is GNAT 4.8.  Host compilers built
+from the GNAT 4.6, 4.9 or 5 release branches are known to succeed.
+
 @html
 @end html
--
1.9.1
Re: [PATCH] Improve BB vectorization dependence analysis
On 09/11/15 12:55, Richard Biener wrote:

Currently BB vectorization computes all dependences inside a BB region and fails all vectorization if it cannot handle some of them.  This is obviously not needed - BB vectorization can restrict the dependence tests to those that are needed to apply the load/store motion effectively performed by the vectorization (sinking all participating loads/stores to the place of the last one).

With restructuring it that way it's also easy to not give up completely but only for the SLP instance we cannot vectorize (this gives a slight bump in my SPEC CPU 2006 testing to 756 vectorized basic block regions).

But first and foremost this patch is to reduce the dependence analysis cost and somewhat mitigate the compile-time effects of the first patch.

For fixing PR56118 only a cost model issue remains.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2015-11-09  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56118
	* tree-vectorizer.h (vect_find_last_scalar_stmt_in_slp): Declare.
	* tree-vect-slp.c (vect_find_last_scalar_stmt_in_slp): Export.
	* tree-vect-data-refs.c (vect_slp_analyze_node_dependences): New
	function.
	(vect_slp_analyze_data_ref_dependences): Instead of computing
	all dependences of the region DRs just analyze the code motions
	SLP vectorization will perform.  Remove SLP instances that
	cannot have their store/load motions applied.
	(vect_analyze_data_refs): Allow DRs without a vectype in BB
	vectorization.

	* gcc.dg/vect/no-tree-sra-bb-slp-pr50730.c: Adjust.
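The kind of dependence that still has to be checked can be illustrated in plain C (a sketch, not GIMPLE; the function names are made up): vectorizing a store group effectively sinks its stores to the position of the last one, which is only safe if no intervening statement reads one of the earlier stores.

```c
int b[2];
int x;

/* Original statement order: an intervening load reads the first store
   of the group, so the store may not move past it.  */
void
original_order (void)
{
  b[0] = 1;
  x = b[0];   /* depends on the store just above */
  b[1] = 2;
}

/* What sinking both stores to the last store's position would do; the
   dependence above makes this transformation invalid here, and that is
   exactly what the restricted dependence test must detect.  */
void
after_invalid_sinking (void)
{
  x = b[0];   /* now reads the stale value */
  b[0] = 1;
  b[1] = 2;
}
```
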
Since this, I've been seeing an ICE on gfortran.dg/vect/vect-9.f90 on both aarch64-none-linux-gnu and arm-none-linux-gnueabihf:

spawn /home/alalaw01/build/gcc/testsuite/gfortran4/../../gfortran -B/home/alalaw01/build/gcc/testsuite/gfortran4/../../ -B/home/alalaw01/build/aarch64-unknown-linux-gnu/./libgfortran/ /home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90 -fno-diagnostics-show-caret -fdiagnostics-color=never -O -O2 -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -Ofast -S -o vect-9.s
/home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90:5:0:
Error: definition in block 13 follows the use
for SSA_NAME: _339 in statement:
vectp.156_387 = &*cc_36(D)[_339];
/home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90:5:0: internal compiler error: verify_ssa failed
0xcfc61b verify_ssa(bool, bool)
	../../gcc-fsf/gcc/tree-ssa.c:1039
0xa2fc0b execute_function_todo
	../../gcc-fsf/gcc/passes.c:1952
0xa30393 do_per_function
	../../gcc-fsf/gcc/passes.c:1632
0xa3058f execute_todo
	../../gcc-fsf/gcc/passes.c:2000
Please submit a full bug report...

FAIL: gfortran.dg/vect/vect-9.f90   -O  (internal compiler error)
FAIL: gfortran.dg/vect/vect-9.f90   -O  (test for excess errors)

Still there (on aarch64) at r230329.

--Alan
Re: [PATCH] PR/67682, break SLP groups up if only some elements match
On 10/11/15 12:51, Richard Biener wrote:
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1
>> and 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be splitted again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
> the maximum splitting up-front and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1 as my understanding was that we only did that once, whereas we would retry the vect_analyze_slp_instance path each time we decreased the vectorization_factor... however, I did try putting code at the beginning of vect_analyze_slp_instance to split up any groups > vf.  Unfortunately this loses us some previously-successful SLPs, as some bigger groups cannot be SLPed if we split them as they require 'unrolling'... so not addressing that here.

However your suggestion of splitting twice when we know the boundary is in the middle of a vector is a nice compromise; it nets us a good number more successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
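The split strategy the thread converges on can be sketched as follows (illustrative C, not the GCC implementation; the helper names are made up):

```c
#include <stdbool.h>
#include <stddef.h>

/* Find the first scalar statement that failed to match; the group is
   split at this point.  */
size_t
first_mismatch (const bool *matches, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (!matches[i])
      break;
  return i;
}

/* Per the discussion above, the second half is only worth re-analyzing
   when the split lands on a vector boundary; otherwise it begins with a
   sub-vector already known not to match, and analysis would fail again.  */
bool
reanalyze_group2_p (size_t split, size_t vf)
{
  return split % vf == 0;
}
```

With matches = 1,1,1,1,1,1,0,0,... and vf = 4, the split point is 6, which is not a multiple of 4, so group2 starts with the surely non-matching pair and is not re-analyzed.
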
Hence, here's the patch I've committed, as r230330, after regstrap on x86_64 and AArch64.  (I dropped the previous bb-slp-subgroups-2 and renamed the others up as we don't do that one anymore.)

Cheers, Alan

gcc/ChangeLog:

	PR tree-optimization/67682
	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	PR tree-optimization/67682
	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +
 gcc/tree-vect-slp.c                            | 85 +-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
   /* Check results.
*/
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+  a[0] = b[0];
+  a[1] = b[1];
+  a[2] = b[2];
+  a[3] = b[3];
+  a[4] = 0;
+  a[5] = 0;
+  a[6] = 0;
+  a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i]
Re: [PATCH] Fix PR56118
On 10/11/15 09:34, Richard Biener wrote:

The following fixes PR56118 by adjusting the cost model handling of basic-block vectorization to favor the vectorized version in case estimated cost is the same as the estimated cost of the scalar version.  This makes sense because we over-estimate the vectorized cost in several places.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-10  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/56118
	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
	cost favor vectorized version.

	* gcc.target/i386/pr56118.c: New testcase.

On AArch64 and ARM targets, this causes PASS->FAIL of

gcc.dg/vect/bb-slp-32.c scan-tree-dump slp2 "vectorization is not profitable"
gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2 "vectorization is not profitable"

That sounds like a good thing ;), so I imagine the xfail directive may just need updating.  The test also looks to be failing on powerpc64 (according to https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg01327.html).

Regards, Alan
Re: [PATCH 6/6] Make SRA replace constant-pool loads
On 06/11/15 16:29, Richard Biener wrote:
>>> 2) You should be able to use fold_ctor_reference directly (in place of
>>> all your code in case offset and size are readily available - don't
>>> remember exactly how complete scalarization "walks" elements).
>>> Alternatively use fold_const_aggregate_ref.

Well, yes I was able to remove subst_constant_pool_initial by using fold_ctor_reference, with some fairly strict checks about the access expressions found while scanning the input.  I still needed to either deep-scan the constant value itself, or allow completely_scalarize to fail, due to Ada c460010 where a variable has DECL_INITIAL:

{VIEW_CONVERT_EXPR({1, 2, 3})}

(that is, a CONSTRUCTOR whose only element is a V.C.E. of another CONSTRUCTOR).  However:

> I tried to suggest creating piecewise loads from the constant pool as opposed
> to the aggregate one.  But I suppose the difficulty is to determine 'pieces',
> not their values (that followup passes and folding can handle quite
> efficiently).

I've now got this working, and it simplifies the patch massively :).  We now replace e.g. a constant-pool load

a = *.LC0,

with a series of loads, e.g.

SR_a1 = SRlc0_1
SR_a2 = SRlc0_2

etc. (well, those wouldn't be quite the names, but you get the idea).  And then at the start of the function we initialise

SRlc0_1 = *.LC0[0]
SRlc0_2 = *.LC0[1]

etc. - I needed to use SRlc0_1 etc. variables because some of the constant-pool loads coming from Ada are rather more complicated than 'a = *.LC0', and substituting the access expression (e.g. '*.LC0[0].foo'), rather than the replacement_decl, into those led to just too many verify_gimple failures.  However, the initialization seems to work well enough, if I put a check into sra_modify_expr to avoid changing 'SRlc0_1 = *.LC0[0]' into 'SRlc0_1 = SRlc0_1' (!).  Slightly hacky perhaps but harmless, as there cannot be a statement writing to a scalar replacement prior to sra_modify_expr.
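In plain C terms, the net effect of the scheme described above looks roughly like this (an illustrative sketch, not GIMPLE; LC0 and the SRlc0_* names mirror the examples in the text and are hypothetical):

```c
/* Stand-in for the constant-pool entry *.LC0.  */
static const int LC0[3] = { 10, 20, 30 };

int
use_after_scalarization (void)
{
  /* Initialization emitted once at the start of the function: one scalar
     load per piece of the constant-pool entry...  */
  int SRlc0_1 = LC0[0];
  int SRlc0_2 = LC0[1];
  int SRlc0_3 = LC0[2];

  /* ...and every former aggregate use now reads the scalar replacements,
     which later passes (e.g. ccp) can constant-fold individually.  */
  return SRlc0_1 + SRlc0_2 + SRlc0_3;
}
```
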
Also I had to prevent writing scalar replacements back to the constant pool (gnat opt31.adb).

Finally - I've left the disqualified_constants code in, but am now only using it for an assert... so I guess it'd be reasonable to take that out.  In an ideal world we could do those checks only in checking builds, but allocating bitmaps only for checking builds feels like it would make checking vs non-checking diverge too much.

This is now passing bootstrap+check-{gcc,g++,fortran} on AArch64, and the same plus Ada on x86_64 and ARM, except for some additional guality failures in pr54970-1.c on AArch64 (tests which are already failing on ARM) - I haven't had any success in figuring those out yet, suggestions welcome.

However, without the DOM patches, this does not fix the original ssa-dom-cse-2.c testcase, even though the load is scalarized in esra and the individual element loads resolved in ccp2.  I don't think I'll be able to respin those between now and stage 1 ending on Saturday, hence I've had to drop the testsuite change, and can only describe the remaining problem in PR/63679.

I'm now benchmarking on AArch64 to see what difference this patch makes on its own.  Thoughts?

gcc/ChangeLog:

	PR target/63679
	* tree-sra.c (disqualified_constants): New.
	(sra_initialize): Allocate disqualified_constants.
	(sra_deinitialize): Free disqualified_constants.
	(disqualify_candidate): Update disqualified_constants when appropriate.
	(create_access): Scan for constant-pool entries as we go along.
	(scalarizable_type_p): Add check against type_contains_placeholder_p.
	(maybe_add_sra_candidate): Allow constant-pool entries.
	(initialize_constant_pool_replacements): New.
	(sra_modify_expr): Avoid mangling assignments created by previous.
	(sra_modify_function_body): Call initialize_constant_pool_replacements.
---
 gcc/tree-sra.c | 93 --
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 1d4a632..aa60f06 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -325,6 +325,10 @@ candidate (unsigned uid)
    those which cannot be (because they are and need be used as a whole).  */
 static bitmap should_scalarize_away_bitmap, cannot_scalarize_away_bitmap;
 
+/* Bitmap of candidates in the constant pool, which cannot be scalarized
+   because this would produce non-constant expressions (e.g. Ada).  */
+static bitmap disqualified_constants;
+
 /* Obstack for creation of fancy names.  */
 static struct obstack name_obstack;
 
@@ -647,6 +651,7 @@ sra_initialize (void)
     (vec_safe_length (cfun->local_decls) / 2);
   should_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
   cannot_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
+  disqualified_constants = BITMAP_ALLOC (NULL);
   gcc_obstack_init (&name_obstack);
   base_access_vec = new hash_map<tree, auto_vec<access_p> >;
   memset (&sra_stats, 0, sizeof (sra_stats));
@
Re: [PR64164] drop copyrename, integrate into expand
On 05/11/15 05:08, Alexandre Oliva wrote:

[PR67753] fix copy of PARALLEL entry_parm to CONCAT target_reg

for gcc/ChangeLog

	PR rtl-optimization/67753
	PR rtl-optimization/64164
	* function.c (assign_parm_setup_block): Avoid allocating a stack
	slot if we don't have an ABI-reserved one.  Emit the copy to
	target_reg in the conversion seq if the copy from entry_parm is
	in it too.  Don't use the conversion seq to copy a PARALLEL to
	a REG or a CONCAT.

Since this change, we have on aarch64_be:

FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O1
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O2
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O3 -g
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -Os
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -Og -g

The difference in the assembler looks as follows (this is at -Og):

 func_return_val_10:
-	sub	sp, sp, #16
-	lsr	x2, x1, 48
-	lsr	x1, x1, 32
+	ubfx	x2, x1, 16, 16
 	fmov	x3, d0
 	// Start of user assembly
 	// 23 "func-ret-4.c" 1
 	mov	x0, x30
 	// 0 "" 2
 	// End of user assembly
 	adrp	x3, saved_return_address
 	str	x0, [x3, #:lo12:saved_return_address]
 	adrp	x0, myfunc
 	add	x0, x0, :lo12:myfunc
 	// Start of user assembly
 	// 23 "func-ret-4.c" 1
 	mov	x30, x0
 	// 0 "" 2
 	// End of user assembly
 	bfi	w0, w2, 16, 16
 	bfi	w0, w1, 0, 16
 	lsl	x0, x0, 32
-	add	sp, sp, 16

(ubfx is a bitfield extract, the first immediate is the lsbit, the second the width.  lsr = logical shift right.)
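A C model of the two instruction semantics stated above may make the diff easier to follow (a sketch; the helper names are mine, and this models only the zero-extending 64-bit forms):

```c
#include <stdint.h>

/* ubfx xd, xn, #lsb, #width: extract <width> bits of xn starting at
   bit <lsb>, zero-extended into xd.  */
uint64_t
ubfx (uint64_t x, unsigned lsb, unsigned width)
{
  return (x >> lsb) & ((UINT64_C (1) << width) - 1);
}

/* lsr xd, xn, #amount: logical (zero-filling) shift right.  */
uint64_t
lsr (uint64_t x, unsigned amount)
{
  return x >> amount;
}
```

So the old sequence isolated halfwords with shifts, while the new one extracts bits [31:16] directly and relies on the later bfi (bitfield insert) instructions to reassemble the value.
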
And in the RTL dump, this (before the patch):

(insn 4 3 5 2 (set (mem/c:DI (plus:DI (reg/f:DI 68 virtual-stack-vars)
                (const_int -8 [0xfff8])) [0 t+0 S8 A64])
        (reg:DI 1 x1)) func-ret-4.c:23 -1
     (nil))
(insn 5 4 6 2 (set (reg:HI 78 [ t ])
        (mem/c:HI (plus:DI (reg/f:DI 68 virtual-stack-vars)
                (const_int -8 [0xfff8])) [0 t+0 S2 A64])) func-ret-4.c:23 -1
     (nil))
(insn 6 5 7 2 (set (reg:HI 79 [ t+2 ])
        (mem/c:HI (plus:DI (reg/f:DI 68 virtual-stack-vars)
                (const_int -6 [0xfffa])) [0 t+2 S2 A16])) func-ret-4.c:23 -1
     (nil))

becomes (after the patch):

(insn 4 3 5 2 (set (subreg:SI (reg:CHI 80) 0)
        (reg:SI 1 x1 [ t ])) func-ret-4.c:23 -1
     (nil))
(insn 5 4 6 2 (set (reg:SI 81)
        (subreg:SI (reg:CHI 80) 0)) func-ret-4.c:23 -1
     (nil))
(insn 6 5 7 2 (set (subreg:DI (reg:HI 82) 0)
        (zero_extract:DI (subreg:DI (reg:SI 81) 0)
            (const_int 16 [0x10])
            (const_int 16 [0x10]))) func-ret-4.c:23 -1
     (nil))
(insn 7 6 8 2 (set (reg:HI 78 [ t ])
        (reg:HI 82)) func-ret-4.c:23 -1
     (nil))
(insn 8 7 9 2 (set (reg:SI 83)
        (subreg:SI (reg:CHI 80) 0)) func-ret-4.c:23 -1
     (nil))
(insn 9 8 10 2 (set (reg:HI 79 [ t+2 ])
        (subreg:HI (reg:SI 83) 2)) func-ret-4.c:23 -1
     (nil))

--Alan
Re: [Patch AArch64] Switch constant pools to separate rodata sections.
On 10/11/15 16:39, Alan Lawrence wrote: Since r229878, I've been seeing FAIL: gcc.dg/attr-weakref-1.c (test for excess errors) UNRESOLVED: gcc.dg/attr-weakref-1.c compilation failed to produce executable (both previously passing) on aarch64-none-elf, aarch64_be-none-elf, and aarch64-none-linux-gnu. Ah, these are fixed by Ramana's partial rollback (r230085). --Alan
Re: [Patch AArch64] Switch constant pools to separate rodata sections.
On 04/11/15 14:26, Ramana Radhakrishnan wrote:

True and I've just been reading more of the backend - We could now start using blocks for constant pools as well.  So let's do that.  How does something like this look?  Tested on aarch64-none-elf - no regressions.

2015-11-04  Ramana Radhakrishnan

	* config/aarch64/aarch64.c
	(aarch64_can_use_per_function_literal_pools_p): New.
	(aarch64_use_blocks_for_constant_p): Adjust declaration
	and use aarch64_can_use_function_literal_pools_p.
	(aarch64_select_rtx_section): Update.

Since r229878, I've been seeing

FAIL: gcc.dg/attr-weakref-1.c (test for excess errors)
UNRESOLVED: gcc.dg/attr-weakref-1.c compilation failed to produce executable

(both previously passing) on aarch64-none-elf, aarch64_be-none-elf, and aarch64-none-linux-gnu.  Here's a log from aarch64_be-none-elf (the others look similar):

/work/alalaw01/build-aarch64_be-none-elf/obj/gcc2/gcc/xgcc -B/work/alalaw01/build-aarch64_be-none-elf/obj/gcc2/gcc/ /work/alalaw01/src/gcc/gcc/testsuite/gcc.dg/attr-weakref-1.c -fno-diagnostics-show-caret -fdiagnostics-color=never -O2 /work/alalaw01/src/gcc/gcc/testsuite/gcc.dg/attr-weakref-1a.c -specs=aem-validation.specs -lm -o ./attr-weakref-1.exe
/tmp/ccEfngi6.o:(.rodata.cst8+0x30): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x38): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x60): undefined reference to `wf12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x68): undefined reference to `wf12'
collect2: error: ld returned 1 exit status
compiler exited with status 1
output is:
/tmp/ccEfngi6.o:(.rodata.cst8+0x30): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x38): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x60): undefined reference to `wf12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x68): undefined reference to `wf12'
collect2: error: ld returned 1 exit status
FAIL: gcc.dg/attr-weakref-1.c (test for excess errors)