Re: [PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006

2016-03-11 Thread Alan Lawrence
On 10/03/16 16:18, Dominique d'Humières wrote:

> The test gfortran.dg/unconstrained_commons.f fails in the 32 bit mode. It
> needs some regexp

Indeed, confirmed on ARM, sorry for not spotting this earlier.

I believe the variable, if there is one, should always be called 'j', as it is
in the source. So how about this, tested on ARM, AArch64, x86_64?

gcc/testsuite/ChangeLog:

* gfortran.dg/unconstrained_commons.f: Widen regexp to match j_
---
 gcc/testsuite/gfortran.dg/unconstrained_commons.f | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f 
b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
index f9fc471..bee67ab 100644
--- a/gcc/testsuite/gfortran.dg/unconstrained_commons.f
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -17,4 +17,4 @@
 ! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
 ! We should retain both a read and write of mycommon.x.
 ! { dg-final { scan-tree-dump-times "  _\[0-9\]+ = 
mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
-! { dg-final { scan-tree-dump-times "  mycommon\\.x\\\[_\[0-9\]+\\\] = 
_\[0-9\]+;" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times "  mycommon\\.x\\\[j?_\[0-9\]+\\\] = 
_\[0-9\]+;" 1 "dom2" } }
-- 
1.9.1



Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-03-11 Thread Alan Lawrence

On 04/03/16 17:24, Alan Lawrence wrote:

On 26/02/16 14:52, James Greenhalgh wrote:


gcc/ChangeLog:

* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
Rewrite, looking one level down for records and arrays.
---
  gcc/config/aarch64/aarch64.c | 31 ---
  1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t
pcum_v, machine_mode mode,
  static unsigned int
  aarch64_function_arg_alignment (machine_mode mode, const_tree type)
  {
-  unsigned int alignment;
+  if (!type)
+return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+return 0;

-  if (type)
-{
-  if (!integer_zerop (TYPE_SIZE (type)))
-{
-  if (TYPE_MODE (type) == mode)
-alignment = TYPE_ALIGN (type);
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
-}
-  else
-alignment = 0;
-}
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+alignment = std::max (alignment, DECL_ALIGN (field));

return alignment;
  }




Ping.


[snip]



I'm not proposing to backport these AArch64 changes, hence:

Ping^2.

(For gcc 7 ?)

Also tests https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01073.html .


Ping^3.

Cheers, Alan



[PATCH] Fix PR70013

2016-03-11 Thread Alan Lawrence
In this PR, a packed structure containing bitfields loses part of its
constant-pool initialization in SRA.

A fuller explanation is on the PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013#c11. In short, we need to
treat constant-pool entries like function parameters, since both come
'pre-initialized' before the function starts.
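
For illustration (a hand-written analogue, not compiler output; the name __C0
is made up), the initializer of such a local aggregate is emitted as a
read-only constant-pool object plus a block copy, and it is that copy which
SRA must treat as the 'pre-initialization':

#pragma pack (1)
struct S0 { unsigned f0 : 17; };

/* Roughly what the compiler does with "struct S0 d[] = { { 1 }, { 2 } };":
   a read-only constant-pool object and a copy from it.  */
static const struct S0 __C0[2] = { { 1 }, { 2 } };

int
main (void)
{
  struct S0 d[2];
  __builtin_memcpy (d, __C0, sizeof d);   /* the 'pre-initialization' */
  return d[1].f0 - 2;                     /* 0 if the copy is intact */
}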

Bootstrapped + regtest (gcc, g++) on AArch64; same with addition of Ada, on ARM
and x86.

OK for trunk?

gcc/ChangeLog:

PR tree-optimization/70013
* tree-sra.c (analyze_access_subtree): Also set grp_unscalarized_data
for constant-pool entries.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/sra-20.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/sra-20.c | 20 
 gcc/tree-sra.c |  3 ++-
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-20.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
new file mode 100644
index 000..5002c24
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -Wall" } */
+/* PR/70013, SRA of constant-pool loads removes initialization of part of d.  
*/
+#pragma pack (1)
+struct S0 {
+  unsigned f0 : 17;
+};
+
+int c;
+
+int
+main (int argc, char **argv)
+{
+  struct S0 d[] = { { 1 }, { 2 } };
+  struct S0 e = d[1];
+
+  c = d[0].f0;
+  __builtin_printf ("%x\n", e.f0);
+  return 0;
+}
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 72157ed..24eac6a 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2427,7 +2427,8 @@ analyze_access_subtree (struct access *root, struct 
access *parent,
 
   if (!hole || root->grp_total_scalarization)
 root->grp_covered = 1;
-  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
+  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL
+  || constant_decl_p (root->base))
 root->grp_unscalarized_data = 1; /* not covered and written to */
   return sth_created;
 }
-- 
1.9.1



Re: [PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006

2016-03-09 Thread Alan Lawrence

On 07/03/16 11:02, Alan Lawrence wrote:

On 04/03/16 13:27, Richard Biener wrote:

I think to make it work with LTO you need to mark it 'Optimization'.
Also it's about
arrays so maybe

'Assume common declarations may be overridden with ones with a larger
trailing array'

also if we document it here we should eventually document it in invoke.texi.

Not sure if "unknown commons" is a good term, maybe "unconstrained
commons" instead?


All done; I doubt there is really a good word, unconstrained seems as good as
any. I've reused much the same wording in invoke.texi, unless you think there
is more to add.

On 04/03/16 13:33, Jakub Jelinek wrote:

Also, isn't the *.opt description line supposed to end with a full stop?


Ah, yes, thanks.

Is this version OK for trunk?

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
   Jakub Jelinek  <ja...@redhat.com>

 * common.opt (funconstrained-commons, flag_unconstrained_commons): New.
 * tree.c (array_at_struct_end_p): Do not limit to size of decl for
 DECL_COMMONS if flag_unconstrained_commons is set.
 * tree-dfa.c (get_ref_base_and_extent): Likewise.


And add to that
* doc/invoke.texi (Optimize Options): Add -funconstrained-commons.
(funconstrained-commons): Document.

Thanks,
Alan



gcc/testsuite/ChangeLog:

 * gfortran.dg/unconstrained_commons.f: New.
---
  gcc/common.opt|  5 +
  gcc/doc/invoke.texi   |  8 +++-
  gcc/testsuite/gfortran.dg/unconstrained_commons.f | 20 
  gcc/tree-dfa.c| 15 ++-
  gcc/tree.c|  6 --
  5 files changed, 50 insertions(+), 4 deletions(-)
  create mode 100644 gcc/testsuite/gfortran.dg/unconstrained_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..bbf79ef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2451,6 +2451,11 @@ fsplit-paths
  Common Report Var(flag_split_paths) Init(0) Optimization
  Split paths leading to loop backedges.

+funconstrained-commons
+Common Var(flag_unconstrained_commons) Optimization
+Assume common declarations may be overridden with ones with a larger
+trailing array.
+
  funit-at-a-time
  Common Report Var(flag_unit_at_a_time) Init(1)
  Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0a2a6f4..68933a1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -407,7 +407,7 @@ Objective-C and Objective-C++ Dialects}.
  -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
  -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp -funconstrained-commons @gol
  -funit-at-a-time -funroll-all-loops -funroll-loops @gol
  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
  -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6659,6 +6659,12 @@ the loop optimizer itself cannot prove that these 
assumptions are valid.
  If you use @option{-Wunsafe-loop-optimizations}, the compiler warns you
  if it finds this kind of loop.

+@item -funconstrained-commons
+@opindex funconstrained-commons
+This option tells the compiler that variables declared in common blocks
+(e.g. Fortran) may later be overridden with longer trailing arrays. This
+prevents certain optimizations that depend on knowing the array bounds.
+
  @item -fcrossjumping
  @opindex fcrossjumping
  Perform cross-jumping transformation.
diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f 
b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
new file mode 100644
index 000..f9fc471
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funconstrained-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funconstrained-commons is passed in.
+  SUBROUTINE FOO
+  IMPLICIT DOUBLE PRECISION (X)
+  INTEGER J
+  COMMON /MYCOMMON / X(1)
+  DO 10 J=1,1024
+ X(J+1)=X(J+7)
+  10  CONTINUE
+  RETURN
+  END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times "  _\[0-9\]+ = mycommon\\.x\\\[_\[0-9\]+\\\];" 1 
"dom2" } }
+! { dg-final { scan-tree-dump-times "  mycommon\\.x\\\[_\[0-9\]+\\\] = _\[0-9\]+;" 1 
"dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..f133abc 100644
--- a/gcc/tree-dfa

[PATCH] Add -funconstrained-commons to work around PR/69368 (and others) in SPEC2006 (was: Re: [PATCH] Add -funknown-commons ...)

2016-03-07 Thread Alan Lawrence
On 04/03/16 13:27, Richard Biener wrote:
> I think to make it work with LTO you need to mark it 'Optimization'.
> Also it's about
> arrays so maybe
>
> 'Assume common declarations may be overridden with ones with a larger
> trailing array'
>
> also if we document it here we should eventually document it in invoke.texi.
>
> Not sure if "unknown commons" is a good term, maybe "unconstrained
> commons" instead?

All done; I doubt there is really a good word, unconstrained seems as good as
any. I've reused much the same wording in invoke.texi, unless you think there
is more to add.

On 04/03/16 13:33, Jakub Jelinek wrote:
> Also, isn't the *.opt description line supposed to end with a full stop?

Ah, yes, thanks.

Is this version OK for trunk?

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
  Jakub Jelinek  <ja...@redhat.com>

* common.opt (funconstrained-commons, flag_unconstrained_commons): New.
* tree.c (array_at_struct_end_p): Do not limit to size of decl for
DECL_COMMONS if flag_unconstrained_commons is set.
* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

* gfortran.dg/unconstrained_commons.f: New.
---
 gcc/common.opt|  5 +
 gcc/doc/invoke.texi   |  8 +++-
 gcc/testsuite/gfortran.dg/unconstrained_commons.f | 20 
 gcc/tree-dfa.c| 15 ++-
 gcc/tree.c|  6 --
 5 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unconstrained_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..bbf79ef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2451,6 +2451,11 @@ fsplit-paths
 Common Report Var(flag_split_paths) Init(0) Optimization
 Split paths leading to loop backedges.
 
+funconstrained-commons
+Common Var(flag_unconstrained_commons) Optimization
+Assume common declarations may be overridden with ones with a larger
+trailing array.
+
 funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 0a2a6f4..68933a1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -407,7 +407,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp @gol
+-ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6659,6 +6659,12 @@ the loop optimizer itself cannot prove that these 
assumptions are valid.
 If you use @option{-Wunsafe-loop-optimizations}, the compiler warns you
 if it finds this kind of loop.
 
+@item -funconstrained-commons
+@opindex funconstrained-commons
+This option tells the compiler that variables declared in common blocks
+(e.g. Fortran) may later be overridden with longer trailing arrays. This
+prevents certain optimizations that depend on knowing the array bounds.
+
 @item -fcrossjumping
 @opindex fcrossjumping
 Perform cross-jumping transformation.
diff --git a/gcc/testsuite/gfortran.dg/unconstrained_commons.f 
b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
new file mode 100644
index 000..f9fc471
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unconstrained_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funconstrained-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funconstrained-commons is passed in.
+  SUBROUTINE FOO
+  IMPLICIT DOUBLE PRECISION (X)
+  INTEGER J
+  COMMON /MYCOMMON / X(1)
+  DO 10 J=1,1024
+ X(J+1)=X(J+7)
+  10  CONTINUE
+  RETURN
+  END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times "  _\[0-9\]+ = 
mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times "  mycommon\\.x\\\[_\[0-9\]+\\\] = 
_\[0-9\]+;" 1 "dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..f133abc 100644
--- a/gcc/tree-dfa.c
+++ b/gcc/tree-dfa.c
@@ -612,9 +612,22 @@ get_ref_base_and_extent (tree exp, HOST_WIDE_INT *poffset,
 
   if (DECL_P (exp))
 {
+  if (flag_unconst

Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-03-04 Thread Alan Lawrence

On 26/02/16 14:52, James Greenhalgh wrote:


gcc/ChangeLog:

* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
Rewrite, looking one level down for records and arrays.
---
  gcc/config/aarch64/aarch64.c | 31 ---
  1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
  static unsigned int
  aarch64_function_arg_alignment (machine_mode mode, const_tree type)
  {
-  unsigned int alignment;
+  if (!type)
+return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+return 0;

-  if (type)
-{
-  if (!integer_zerop (TYPE_SIZE (type)))
-   {
- if (TYPE_MODE (type) == mode)
-   alignment = TYPE_ALIGN (type);
- else
-   alignment = GET_MODE_ALIGNMENT (mode);
-   }
-  else
-   alignment = 0;
-}
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+alignment = std::max (alignment, DECL_ALIGN (field));

return alignment;
  }




Ping.

If this is not appropriate for GCC6, then is it OK for Stage 1 of GCC 7?


I think this needs to be a GCC 7 fix at this point.

Additionally, I'd like to fully understand PR69841 before we take this
patch.

In particular, under what circumstances can the C++ front end set DECL_ALIGN
of a type to be wider than we expect. Can we ever end up with 128-bit
alignment of a template parameter when we were expecting 32- or 64-bit
alignment, and if we can what are the implications on this patch?


OK, so IIUC, we *should* be able to rely on DECL_ALIGN for the AAPCS64, as 
PR/69841 occurred on gcc-5-branch because a C++ frontend change had not been 
backported.
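
For concreteness, the kind of over-aligned argument type the AAPCS64
alignment-attribute update (and the DECL_ALIGN question above) is concerned
with can be written in plain C; this is an illustration only, not taken from
the patch or from PR69841:

#include <stdio.h>

/* The aligned attribute raises the type's alignment above the natural
   alignment of its only field.  */
struct overaligned
{
  int x;
} __attribute__ ((aligned (16)));

int
main (void)
{
  printf ("size %zu, alignment %zu\n",
          sizeof (struct overaligned), _Alignof (struct overaligned));
  return 0;
}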


I'm not proposing to backport these AArch64 changes, hence:

Ping^2.

(For gcc 7 ?)

Also tests https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01073.html .

Thanks,
Alan



Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006

2016-03-03 Thread Alan Lawrence

On 25/02/16 18:00, Alan Lawrence wrote:

On 22/02/16 12:03, Jakub Jelinek wrote:

(f) A global command-line option, which we check alongside DECL_COMMON and
further tests (basically, we want only DECL_COMMON decls that either have
ARRAY_TYPE, or some other aggregate type with flexible array member or some
other trailing array in the struct), and use the resulting predicate in
places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
predicate returns true, we assume the DECL_SIZE is
"variable"/"unlimited"/whatever.
The two spots known to me that would need to use this new predicate would
be:
tree.c (array_at_struct_end_p):

[...]

tree-dfa.c (get_ref_base_and_extent):

[...]

Close enough to what I meant by (a)/(b), but never mind that, and I hadn't
looked at where the change for aggressive loop opts needed to be. However...
Well you are essentially writing the patch for me at this point (so, thanks!),
  but here's a patch that puts that under a global -funknown-commons flag.
Testcase as before.

Bootstrapped (with flag overridden to true) on x86_64 and aarch64, check-gcc,
check-fortran, and tested 416.gamess on AArch64, where this patch enables
running *without* the -fno-aggressive-loop-opts previously required.

In the gcc testsuite, these fail with the option turned on:
gcc.dg/pr53265.c  (test for warnings, line 137)
gcc.dg/pr53265.c  (test for warnings, line 138)
gcc.dg/pr64277.c scan-tree-dump cunroll ... (*2)
gcc.dg/tree-ssa/cunroll-{1,2,3,4,9,10,11}.c scan-tree-dump 
cunroll/cunrolli/ivcanon (various)
gcc.dg/tree-ssa/loop-38.c scan-tree-dump cunrolli "Loop 1 iterates at most 11 
times"
...which all appear harmless.

Thomas Koenig suggests (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368#c80)
that this be combined with some flag fiddling and warnings in the Fortran
front-end; this patch doesn't do that, as I'm not very familiar with the
frontends, but that can follow in a separate patch. (Thomas?)

OK for trunk?

Cheers, Alan

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
   Jakub Jelinek  <ja...@redhat.com>

* common.opt (funknown-commons, flag_unknown_commons): New.
* tree.c (array_at_struct_end_p): Do not limit to size of decl for
DECL_COMMONS if flag_unknown_commons is set.
* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

* gfortran.dg/unknown_commons.f: New.


Ping.

(Besides my original patch 
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01356.html which we decided was 
too fragile, I believe this is the only patch proposed that actually fixes the 
miscompare in 416.gamess.)


Thanks, Alan


Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006

2016-02-25 Thread Alan Lawrence
On 22/02/16 12:03, Jakub Jelinek wrote:
> (f) A global command-line option, which we check alongside DECL_COMMON and
> further tests (basically, we want only DECL_COMMON decls that either have
> ARRAY_TYPE, or some other aggregate type with flexible array member or some
> other trailing array in the struct), and use the resulting predicate in
> places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
> predicate returns true, we assume the DECL_SIZE is
> "variable"/"unlimited"/whatever.
> The two spots known to me that would need to use this new predicate would
> be:
> tree.c (array_at_struct_end_p):
[...]
> tree-dfa.c (get_ref_base_and_extent):
[...]

Close enough to what I meant by (a)/(b), but never mind that, and I hadn't
looked at where the change for aggressive loop opts needed to be. However...
Well you are essentially writing the patch for me at this point (so, thanks!),
 but here's a patch that puts that under a global -funknown-commons flag.
Testcase as before.

Bootstrapped (with flag overridden to true) on x86_64 and aarch64, check-gcc,
check-fortran, and tested 416.gamess on AArch64, where this patch enables
running *without* the -fno-aggressive-loop-opts previously required.

In the gcc testsuite, these fail with the option turned on:
gcc.dg/pr53265.c  (test for warnings, line 137)
gcc.dg/pr53265.c  (test for warnings, line 138)
gcc.dg/pr64277.c scan-tree-dump cunroll ... (*2)
gcc.dg/tree-ssa/cunroll-{1,2,3,4,9,10,11}.c scan-tree-dump 
cunroll/cunrolli/ivcanon (various)
gcc.dg/tree-ssa/loop-38.c scan-tree-dump cunrolli "Loop 1 iterates at most 11 
times"
...which all appear harmless.

Thomas Koenig suggests (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368#c80)
that this be combined with some flag fiddling and warnings in the Fortran
front-end; this patch doesn't do that, as I'm not very familiar with the
frontends, but that can follow in a separate patch. (Thomas?)

OK for trunk?

Cheers, Alan

gcc/ChangeLog:

DATE  Alan Lawrence  <alan.lawre...@arm.com>
  Jakub Jelinek  <ja...@redhat.com>

* common.opt (funknown-commons, flag_unknown_commons): New.
* tree.c (array_at_struct_end_p): Do not limit to size of decl for
DECL_COMMONS if flag_unknown_commons is set.
* tree-dfa.c (get_ref_base_and_extent): Likewise.

gcc/testsuite/ChangeLog:

* gfortran.dg/unknown_commons.f: New.
---
 gcc/common.opt  |  4 
 gcc/testsuite/gfortran.dg/unknown_commons.f | 20 
 gcc/tree-dfa.c  | 15 ++-
 gcc/tree.c  |  6 --
 4 files changed, 42 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unknown_commons.f

diff --git a/gcc/common.opt b/gcc/common.opt
index 520fa9c..b5b1282 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2455,6 +2455,10 @@ funit-at-a-time
 Common Report Var(flag_unit_at_a_time) Init(1)
 Compile whole compilation unit at a time.
 
+funknown-commons
+Common Var(flag_unknown_commons)
+Assume common declarations may be overridden with unknown larger sizes
+
 funroll-loops
 Common Report Var(flag_unroll_loops) Optimization
 Perform loop unrolling when iteration count is known.
diff --git a/gcc/testsuite/gfortran.dg/unknown_commons.f 
b/gcc/testsuite/gfortran.dg/unknown_commons.f
new file mode 100644
index 000..97e3ce3
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/unknown_commons.f
@@ -0,0 +1,20 @@
+! { dg-do compile }
+! { dg-options "-O3 -funknown-commons -fdump-tree-dom2-details" }
+
+! Test for PR69368: a single-element array in a common block, which will be
+! overridden with a larger size at link time (contrary to language spec).
+! Dominator opts considers accesses to differently-computed elements of X as
+! equivalent, unless -funknown-commons is passed in.
+  SUBROUTINE FOO
+  IMPLICIT DOUBLE PRECISION (X)
+  INTEGER J
+  COMMON /MYCOMMON / X(1)
+  DO 10 J=1,1024
+ X(J+1)=X(J+7)
+  10  CONTINUE
+  RETURN
+  END
+! { dg-final { scan-tree-dump-not "FIND" "dom2" } }
+! We should retain both a read and write of mycommon.x.
+! { dg-final { scan-tree-dump-times "  _\[0-9\]+ = 
mycommon\\.x\\\[_\[0-9\]+\\\];" 1 "dom2" } }
+! { dg-final { scan-tree-dump-times "  mycommon\\.x\\\[_\[0-9\]+\\\] = 
_\[0-9\]+;" 1 "dom2" } }
diff --git a/gcc/tree-dfa.c b/gcc/tree-dfa.c
index 0e98056..017e98e 100644
--- a/gcc/tree-dfa.c
+++ b/gcc/tree-dfa.c
@@ -612,9 +612,22 @@ get_ref_base_and_extent (tree exp, HOST_WIDE_INT *poffset,
 
   if (DECL_P (exp))
 {
+  if (flag_unknown_commons
+ && TREE_CODE (exp) == VAR_DECL && DECL_COMMON (exp))
+   {
+ tree sz_tree = TYPE_SIZE (TREE_TYPE (exp));
+ /* If size is unknown, or we have read to the end, assume there

Re: [PATCH, PR middle-end/68134] Reject scalar modes in default get_mask_mode hook

2016-02-23 Thread Alan Lawrence

On 20/02/16 09:29, Ilya Enkovich wrote:

2016-02-19 20:36 GMT+03:00 Alan Lawrence <alan.lawre...@foss.arm.com>:

Mostly this is fairly straightforward, relatively little midend code is
required, and the backend cleans up quite a bit. However, I get stuck on the
case of singleton vectors (64x1). No surprises there, then...

The PR/68134 fix makes the 'mask mode' for comparing 64x1 vectors into
BLKmode, so that we get past this in expand_vector_operations:

/* A scalar operation pretending to be a vector one.  */
   if (VECTOR_BOOLEAN_TYPE_P (type)
   && !VECTOR_MODE_P (TYPE_MODE (type))
   && TYPE_MODE (type) != BLKmode)
 return;

and we do the operation piecewise. (Which is what we want; there is only one
piece!)

However, with my vec_cmp + vcond_mask changes, dropping vconddidi, this
means we look for a vcond_maskdiblk and vec_cmpdiblk. Which doesn't really
feel right - it feels like the 64x1 mask should be a DImode, just like other
64x1 vectors.


The problem here is to distinguish vector mask of one DI element and
DI scalar mask.  We don't want to lower scalar mask manipulations
because they are simple integer operations, not vector ones. Probably
vector of a single DI should have V1DI mode and not pretend to be a
scalar?  This would make things easier.


Thanks for the quick reply, Ilya.

What's the difference between, as you say, a "simple integer operation" and a 
"vector" operation of just one element?


This is why we do *not* have V1DImode in the AArch64 (or ARM) backends, but 
instead treat 64x1 vectors as DImode - the operations are the same; so keeping 
them as the same mode enables CSE and lots of other optimizations, plus we
don't have to have two near-identical copies (DI + V1DI) for many patterns, etc...
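
For illustration only, a minimal standalone example (GCC vector extensions,
not code from either patch) of why a 64x1 vector is operationally just a DI
scalar:

#include <stdint.h>
#include <stdio.h>

/* A one-element 64-bit vector: every operation on it is the same operation
   as on a plain 64-bit scalar, which is why keeping it in DImode lets CSE
   and friends work across "vector" and scalar code.  */
typedef int64_t v1di __attribute__ ((vector_size (8)));

int64_t add_scalar (int64_t a, int64_t b) { return a + b; }
v1di    add_vector (v1di a, v1di b)       { return a + b; }

int
main (void)
{
  v1di a = { 40 }, b = { 2 };
  v1di c = add_vector (a, b);
  printf ("%lld %lld\n", (long long) c[0], (long long) add_scalar (40, 2));
  return 0;
}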


If the operations were on a "DI scalar mask", when would the first part of that 
test, VECTOR_BOOLEAN_TYPE_P, hold?


Thanks, Alan



Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006

2016-02-23 Thread Alan Lawrence

On 22/02/16 12:03, Jakub Jelinek wrote:


(f) A global command-line option, which we check alongside DECL_COMMON and
further tests (basically, we want only DECL_COMMON decls that either have
ARRAY_TYPE, or some other aggregate type with flexible array member or some
other trailing array in the struct), and use the resulting predicate in
places where we optimize based on DECL_SIZE{,_UNIT} of the decl - if that
predicate returns true, we assume the DECL_SIZE is
"variable"/"unlimited"/whatever.
The two spots known to me that would need to use this new predicate would
be:
tree.c (array_at_struct_end_p):
tree-dfa.c (get_ref_base_and_extent):


Well, with just those two, this fixes 416.gamess, and passes all tests in 
gfortran, with only a few scan-dump/warning tests failing in gcc...


--Alan


Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-02-22 Thread Alan Lawrence

On 22/01/16 17:16, Alan Lawrence wrote:


On 21/01/16 17:23, Alan Lawrence wrote:

On 18/01/16 17:10, Eric Botcazou wrote:


Could you post the list of files that differ?  How do they differ exactly?


Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that, to
try to identify exactly what the differences were... and it succeeded even with
my pre-AAPCS64-update host compiler. So, this is probably a false alarm; I'm
bootstrapping again, after a rebase, to make sure...

--Alan


Ok, rebased onto a more recent build, and bootstrapping with Ada posed no
problems. Sorry for the noise.

However, I had to drop the assert that TYPE_FIELDS was non-null because of some
C++ testcases.

Is this version OK for trunk?

--Alan

gcc/ChangeLog:

* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
Rewrite, looking one level down for records and arrays.
---
  gcc/config/aarch64/aarch64.c | 31 ---
  1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
  static unsigned int
  aarch64_function_arg_alignment (machine_mode mode, const_tree type)
  {
-  unsigned int alignment;
+  if (!type)
+return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+return 0;

-  if (type)
-{
-  if (!integer_zerop (TYPE_SIZE (type)))
-   {
- if (TYPE_MODE (type) == mode)
-   alignment = TYPE_ALIGN (type);
- else
-   alignment = GET_MODE_ALIGNMENT (mode);
-   }
-  else
-   alignment = 0;
-}
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+alignment = std::max (alignment, DECL_ALIGN (field));

return alignment;
  }




Ping.

If this is not appropriate for GCC6, then is it OK for Stage 1 of GCC 7?

Thanks, Alan


Re: [PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006

2016-02-22 Thread Alan Lawrence

On 19/02/16 17:52, Jakub Jelinek wrote:

On Fri, Feb 19, 2016 at 05:42:34PM +, Alan Lawrence wrote:

This relates to FORTRAN code where different modules give different sizes to the
same array in a COMMON block (contrary to the fortran language specification).
SPEC have refused to patch the source code
(https://www.spec.org/cpu2006/Docs/faq.html#Run.05).

Hence, this patch provides a Fortran-specific option -funknown-commons that
marks such arrays as having unknown size - that is, NULL_TREE for both
TYPE_SIZE and max value of TYPE_DOMAIN. DECL_SIZE is preserved for e.g. output
in varasm.c.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess
without the -fno-aggressive-loop-optimizations previously required (numerous
other PRs relating to 416.gamess).

I had to fix up a couple of places to deal with null TYPE_SIZE but in most cases


I think it is wrong to touch TYPE_SIZE/TYPE_SIZE_UNIT, IMHO it is much better 
just
to ignore DECL_SIZE/DECL_SIZE_UNIT in the selected few places
(get_ref_base_and_extent, the tree-ssa-loop-niters.c analysis) if the switch
is on, for selected decls (aggregates with flexible array members and other
similar trailing arrays, arrays themselves; all only if DECL_COMMON).


So do you see...

(a) A global command-line option, which we check alongside DECL_COMMON, in (all 
or some of the) places which currently deal with flexible array members? (I 
guess the flag would have no effect for compiling C code...or

 (b) do we require it to 'enable' the existing flexible array member support?)

(c) A new flag on each DECL, which we check in the places dealing with flexible 
array members, which we set on those decls in the fortran frontend if a 
fortran-frontend-only command-line option is passed?


(d) A new flag on each DECL, which replaces the check for flexible array 
members, plus a global command-line option, which controls both setting the flag 
(on DECL_COMMONs) in the fortran frontend, and also setting it on flexible array 
members in the C frontend? This might be 'cleanest' but is also probably the 
most change and I doubt I'm going to have time to do this for GCC 6...


(e) Some other scheme that I've not understood yet?

Thanks,
Alan


[PATCH] Add -funknown-commons to work around PR/69368 (and others) in SPEC2006

2016-02-19 Thread Alan Lawrence
This relates to FORTRAN code where different modules give different sizes to the
same array in a COMMON block (contrary to the fortran language specification).
SPEC have refused to patch the source code
(https://www.spec.org/cpu2006/Docs/faq.html#Run.05).
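
For illustration only (a C analogue, not from SPEC or this patch), the same
link-time size mismatch can be reproduced with C common symbols, where
-fcommon lets the linker merge differently-sized tentative definitions of the
same array:

/* small.c: declares the common array with only one element.  */
double x[1];

double
read_elem (int i)
{
  return x[i];   /* indexes past the declared bound of x[1], just as the
                    Fortran code indexes past X(1) */
}

/* big.c: the same common symbol with its real, larger size.  */
double x[1000];
extern double read_elem (int);

int
main (void)
{
  for (int i = 0; i < 1000; i++)
    x[i] = i;
  return read_elem (500) == 500.0 ? 0 : 1;   /* works only because the
                                                linker kept the larger size */
}

/* Build: gcc -fcommon small.c big.c
   As in the Fortran case this is outside the language rules, which is why
   the optimizers have to be told not to trust the declared size.  */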

Hence, this patch provides a Fortran-specific option -funknown-commons that
marks such arrays as having unknown size - that is, NULL_TREE for both
TYPE_SIZE and max value of TYPE_DOMAIN. DECL_SIZE is preserved for e.g. output
in varasm.c.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess
without the -fno-aggressive-loop-optimizations previously required (numerous
other PRs relating to 416.gamess).

I had to fix up a couple of places to deal with null TYPE_SIZE but in most cases
a test for such was already present. (omp-low.c had too many so I gave up: in
practice I think it's OK to just not use the new flag at the same time as
-fopenmp).

Bootstrapped + check-gcc + check-g++ on AArch64 and x86.
Also check-fortran with a variant enabling the option in all cases (to allow
compilation of C): here, only gfortran.dg/gomp/appendix-a/a.24.1.f90 fails,
this uses -fopenmp mentioned above.

The testcase fails both the FIND check and one of the assignment checks if
-funknown-commons is removed.

OK for trunk? I'd be happy to move this under -std=legacy if the Fortran folks
prefer. (-funcommon-commons has also been humorously suggested!)

gcc/ChangeLog:

* expr.c (store_field): Handle null TYPE_SIZE.
* tree-vect-data-refs.c (vect_analyze_data_refs): Bail out if TYPE_SIZE
is null.
* tree-dfa.c (get_ref_base_and_extent): Keep maxsize==-1 when reading
the whole of a DECL.

gcc/fortran/ChangeLog:

* lang.opt: Add -funknown-commons.
* trans-common.c (build_field): If flag_unknown_commons is on, set
upper bound and size of arrays in common block to NULL_TREE.
(create_common): Explicitly set DECL_SIZE and DECL_SIZE_UNIT.

gcc/testsuite/ChangeLog:
* gfortran.dg/unknown_commons.f: New.
---
 gcc/expr.c  |  1 +
 gcc/fortran/lang.opt|  4 
 gcc/fortran/trans-common.c  | 35 +
 gcc/testsuite/gfortran.dg/unknown_commons.f | 20 +
 gcc/tree-dfa.c  |  6 -
 gcc/tree-vect-data-refs.c   |  8 +++
 6 files changed, 73 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/unknown_commons.f

diff --git a/gcc/expr.c b/gcc/expr.c
index f4bac36..609bf59 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -6638,6 +6638,7 @@ store_field (rtx target, HOST_WIDE_INT bitsize, 
HOST_WIDE_INT bitpos,
 RHS isn't the same size as the bitfield, we must use bitfield
 operations.  */
   || (bitsize >= 0
+ && TYPE_SIZE (TREE_TYPE (exp))
  && TREE_CODE (TYPE_SIZE (TREE_TYPE (exp))) == INTEGER_CST
  && compare_tree_int (TYPE_SIZE (TREE_TYPE (exp)), bitsize) != 0
  /* Except for initialization of full bytes from a CONSTRUCTOR, which
diff --git a/gcc/fortran/lang.opt b/gcc/fortran/lang.opt
index 45428d8..b1b4641 100644
--- a/gcc/fortran/lang.opt
+++ b/gcc/fortran/lang.opt
@@ -686,6 +686,10 @@ funderscoring
 Fortran Var(flag_underscoring) Init(1)
 Append underscores to externally visible names.
 
+funknown-commons
+Fortran Var(flag_unknown_commons)
+Treat arrays in common blocks as having unknown size.
+
 fwhole-file
 Fortran Ignore
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/fortran/trans-common.c b/gcc/fortran/trans-common.c
index 21d1928..bf99c56 100644
--- a/gcc/fortran/trans-common.c
+++ b/gcc/fortran/trans-common.c
@@ -284,8 +284,41 @@ build_field (segment_info *h, tree union_type, 
record_layout_info rli)
   unsigned HOST_WIDE_INT desired_align, known_align;
 
   name = get_identifier (h->sym->name);
+
+  gcc_assert (TYPE_P (h->field));
+  tree size = TYPE_SIZE (h->field);
+  tree size_unit = TYPE_SIZE_UNIT (h->field);
+  if (GFC_ARRAY_TYPE_P (h->field)
+  && !h->sym->value
+  && flag_unknown_commons)
+{
+  gcc_assert (!GFC_DESCRIPTOR_TYPE_P (h->field));
+  gcc_assert (TREE_CODE (h->field) == ARRAY_TYPE);
+
+  /* Could be merged at link time with a larger array whose size we don't
+know.  */
+  static tree range_type = NULL_TREE;
+  if (!range_type)
+   range_type = build_range_type (gfc_array_index_type,
+  gfc_index_zero_node, NULL_TREE);
+  tree type = copy_node (h->field);
+  TYPE_DOMAIN (type) = range_type;
+  TYPE_SIZE (type) = NULL_TREE;
+  TYPE_SIZE_UNIT (type) = NULL_TREE;
+  SET_TYPE_MODE (type, BLKmode);
+  gcc_assert (TYPE_CANONICAL (h->field) == h->field);
+  gcc_assert (TYPE_MAIN_VARIANT (h->field) == h->field);
+  TYPE_CANONICAL (type) = type;
+  TYPE_MAIN_VARIANT (type) = type;
+  layout_type (type);
+  h->field = 

Re: [PATCH, PR middle-end/68134] Reject scalar modes in default get_mask_mode hook

2016-02-19 Thread Alan Lawrence

On 17/11/15 11:49, Ilya Enkovich wrote:

Hi,

Default hook for get_mask_mode is supposed to return integer vector modes.  
This means it should reject scalar modes returned by mode_for_vector.
Bootstrapped and regtested on x86_64-unknown-linux-gnu, regtested on 
aarch64-unknown-linux-gnu.  OK for trunk?

Thanks,
Ilya
--
gcc/

2015-11-17  Ilya Enkovich  

PR middle-end/68134
* targhooks.c (default_get_mask_mode): Filter out
scalar modes returned by mode_for_vector.


I've been playing around with a patch that expands arbitrary VEC_COND_EXPRs 
using the vec_cmp and vcond_mask optabs - allowing platforms to drop the ugly 
vcond pattern (for which a cross-product of modes is usually 
required) where vcond = vec_cmp + vcond_mask. (E.g. ARM and AArch64.)
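
For reference, a minimal standalone example (GCC vector extensions;
illustrative only, not from the patch) of source that becomes a
VEC_COND_EXPR, i.e. a vector comparison producing a mask followed by a
mask-based select:

#include <stdint.h>
#include <stdio.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));

/* a > b yields a lanewise mask of 0 / -1; the ternary selects lanewise.  */
v4si
vmax (v4si a, v4si b)
{
  return a > b ? a : b;
}

int
main (void)
{
  v4si a = { 1, 5, 3, 7 }, b = { 4, 2, 6, 0 };
  v4si c = vmax (a, b);
  printf ("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
  return 0;
}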


Mostly this is fairly straightforward, relatively little midend code is 
required, and the backend cleans up quite a bit. However, I get stuck on the 
case of singleton vectors (64x1). No surprises there, then...


The PR/68134 fix makes the 'mask mode' for comparing 64x1 vectors into
BLKmode, so that we get past this in expand_vector_operations:


/* A scalar operation pretending to be a vector one.  */
  if (VECTOR_BOOLEAN_TYPE_P (type)
  && !VECTOR_MODE_P (TYPE_MODE (type))
  && TYPE_MODE (type) != BLKmode)
return;

and we do the operation piecewise. (Which is what we want; there is only one 
piece!)

However, with my vec_cmp + vcond_mask changes, dropping vconddidi, this means we 
look for a vcond_maskdiblk and vec_cmpdiblk. Which doesn't really feel right - 
it feels like the 64x1 mask should be a DImode, just like other 64x1 vectors.


So, I'm left thinking - is there a better way to look for "scalar operations 
pretending to be vector ones", such that we can get a DImode vec through 
the above? What exactly do we want that check to do? In my AArch64 testing, I 
was able to take it out altogether and get all tests passing...? (I don't have 
AVX512 hardware)


Thanks, Alan



Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2016-02-04 Thread Alan Lawrence

On 04/02/16 09:53, Dominik Vogt wrote:

On Wed, Feb 03, 2016 at 11:41:02AM +, Alan Lawrence wrote:

On 26/01/16 12:23, Dominik Vogt wrote:

On Mon, Dec 21, 2015 at 01:13:28PM +, Alan Lawrence wrote:

...the test passes with --param sra-max-scalarization-size-Ospeed.

Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390.


How did you test this on s390?  For me, the test still fails
unless I add -march=z13 (s390x).


Sorry for the slow response, was away last week. On x86 host, I built a compiler

configure --enable-languages=c,c++,lto --target=s390-none-linux-gnu

   
Looks like the test fails only on s390x (64-Bit) without
-march=z13 but works on s390 (31-Bit).


Ah, yes, I see. Loop is not unrolled for dom2, then vectorizer kicks in but 
vectorizes only the initialization/loading, and dom can't see the redundancy in


MEM[(int[8] *)] = { 0, 1 };
...
_23 = a[0];

as they aren't reading equivalent chunks of memory.

Same problem as on Alpha, and powerpc64 (without -mcpu=power7/power8).

Powerpc chose to XFAIL rather than add -mcpu=power7, but I'm OK with any 
testsuite workaround here, such as yours.


Cheers, Alan


Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2016-02-03 Thread Alan Lawrence

On 26/01/16 12:23, Dominik Vogt wrote:

On Mon, Dec 21, 2015 at 01:13:28PM +, Alan Lawrence wrote:

...the test passes with --param sra-max-scalarization-size-Ospeed.

Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390.


How did you test this on s390?  For me, the test still fails
unless I add -march=z13 (s390x).


Sorry for the slow response, was away last week. On x86 host, I built a compiler

configure --enable-languages=c,c++,lto --target=s390-none-linux-gnu
make all-gcc
make check-gcc RUNTESTFLAGS=tree-ssa.exp=ssa-dom-cse-2.c

and that shows the tests passing. gcc -v shows little further:
Reading specs from ./gcc/specs
COLLECT_GCC=./gcc/xgcc
COLLECT_LTO_WRAPPER=./gcc/lto-wrapper
Target: s390-none-linux-gnu
Configured with: /work/alalaw01/src2/gcc/configure --enable-languages=c,c++,lto 
--target=s390-none-linux-gnu

Thread model: posix
gcc version 6.0.0 20151206 (experimental) (GCC)

I speculate that perhaps the -march=z13 is default for s390 *linux* ???

If you can send me alternative configury, or dumps from your failing compiler 
(dom{2,3}-details, optimized), I can take a look.


Thanks, Alan


[PATCH][Testsuite] Fix PR66877

2016-01-22 Thread Alan Lawrence
This is a scan-tree-dump failure in vect-over-widen-3-big-array.c that occurs
only on ARM - the only platform to have vect_widen_shift.

Tested on arm-none-eabi (armv8-crypto-neon-fp, plus a non-neon variant), also
aarch64 (token platform without vect_widen_shift).

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-over-widen-3-big-array.c: Only look for 1
vect_recog_over_widening_pattern in dump if we have vect_widen_shift.
---
 gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c 
b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
index 1ca3128..69773a5 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
@@ -58,6 +58,7 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: 
detected" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: 
detected" 2 "vect" { target { ! vect_widen_shift } } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: 
detected" 1 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 
-- 
1.9.1



Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)

2016-01-22 Thread Alan Lawrence

On 20/01/16 21:10, Christophe Lyon wrote:

On 19 January 2016 at 15:51, Alan Lawrence <alan.lawre...@foss.arm.com> wrote:

On 19/01/16 11:15, Christophe Lyon wrote:


For neon_vdupn, I chose to implement neon_vdup_nv4hf and
neon_vdup_nv8hf instead of updating the VX iterator because I thought
it was not desirable to impact neon_vrev32.



Well, the same instruction will suffice for vrev32'ing vectors of HF just
as
well as vectors of HI, so I think I'd argue that's harmless enough. To
gain the
benefit, we'd need to update arm_evpc_neon_vrev with a few new cases,
though.


Since this is more intrusive, I'd rather leave that part for later. OK?



Sure.


+#ifdef __ARM_BIG_ENDIAN
+  /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the
+ right value for vectors with 8 lanes.  */
+#define __arm_lane(__vec, __idx) (__idx ^ 3)
+#else
+#define __arm_lane(__vec, __idx) __idx
+#endif
+



Looks right, but sounds... my concern here is that I'm hoping at some
point we
will move the *other* vget/set_lane intrinsics to use GCC vector
extensions
too. At which time (unlike __aarch64_lane which can be used everywhere)
this
will be the wrong formula. Can we name (and/or comment) it to avoid
misleading
anyone? The key characteristic seems to be that it is for vectors of
16-bit
elements only.


I'm not to follow, here. Looking at the patterns for
neon_vget_lane_*internal in neon.md,
I can see 2 flavours: one for VD, one for VQ2. The latter uses "halfelts".

Do you prefer that I create 2 macros (say __arm_lane and __arm_laneq),
that would be similar to the aarch64 ones (by computing the number of
lanes of the input vector), but the "q" one would use half the total
number of lanes instead?



That works for me! Sthg like:

#define __arm_lane(__vec, __idx) NUM_LANES(__vec) - __idx
#define __arm_laneq(__vec, __idx) (__idx & (NUM_LANES(__vec)/2)) +
(NUM_LANES(__vec)/2 - __idx)
//or similarly
#define __arm_laneq(__vec, __idx) (__idx ^ (NUM_LANES(__vec)/2 - 1))

Alternatively I'd been thinking

#define __arm_lane_32xN(__idx) __idx ^ 1
#define __arm_lane_16xN(__idx) __idx ^ 3
#define __arm_lane_8xN(__idx) __idx ^ 7

Bear in mind PR64893 that we had on AArch64 :-(



Here is a new version, based on the comments above.
I've also removed the addition of arm_fp_ok effective target since I
added that in my other testsuite patch.

OK now?


Yes. I realize my worry about PR64893 doesn't apply here since we pass constant 
lane numbers / vector bounds into __builtin_arm_lane_check. Thanks!


--Alan



Thanks,

Christophe


Cheers, Alan




Re: [PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-01-22 Thread Alan Lawrence

On 21/01/16 17:23, Alan Lawrence wrote:
> On 18/01/16 17:10, Eric Botcazou wrote:
>>
>> Could you post the list of files that differ?  How do they differ exactly?
>
> Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that, 
> to
> try to identify exactly what the differences were... and it succeeded even
> with
> my pre-AAPCS64-update host compiler. So, this is probably a false alarm; I'm
> bootstrapping again, after a rebase, to make sure...
>
> --Alan

Ok, rebased onto a more recent build, and bootstrapping with Ada posed no
problems. Sorry for the noise.

However, I had to drop the assert that TYPE_FIELDS was non-null because of some
C++ testcases.

Is this version OK for trunk?

--Alan

gcc/ChangeLog:

* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
Rewrite, looking one level down for records and arrays.
---
 gcc/config/aarch64/aarch64.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9142ac0..b084f83 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,23 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type)
 {
-  unsigned int alignment;
+  if (!type)
+return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+return 0;
 
-  if (type)
-{
-  if (!integer_zerop (TYPE_SIZE (type)))
-   {
- if (TYPE_MODE (type) == mode)
-   alignment = TYPE_ALIGN (type);
- else
-   alignment = GET_MODE_ALIGNMENT (mode);
-   }
-  else
-   alignment = 0;
-}
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+alignment = std::max (alignment, DECL_ALIGN (field));
 
   return alignment;
 }
-- 
1.9.1



[PATCH][Testsuite] Fix scan-tree-dump failures with vect_multiple_sizes

2016-01-22 Thread Alan Lawrence
Since r230292, these tests in gcc.dg/vect have been failing on ARM, AArch64, 
and x86_64 with -march=haswell (among others - when prefer_avx128 is true):

vect-outer-1-big-array.c scan-tree-dump-times vect "grouped access in outer 
loop" 2
vect-outer-1.c scan-tree-dump-times vect "grouped access in outer loop" 2
vect-outer-1a-big-array.c scan-tree-dump-times vect "grouped access in outer 
loop" 2
vect-outer-1a.c scan-tree-dump-times vect "grouped access in outer loop" 2
vect-outer-1b-big-array.c scan-tree-dump-times vect "grouped access in outer 
loop" 2
vect-outer-1b.c scan-tree-dump-times vect "grouped access in outer loop" 2
vect-outer-2b.c scan-tree-dump-times vect "grouped access in outer loop" 2
vect-outer-3b.c scan-tree-dump-times vect "grouped access in outer loop" 4
vect-reduc-dot-s8b.c scan-tree-dump-times vect "vect_recog_widen_mult_pattern: 
detected" 2
(plus all the -flto -ffat-lto-objects equivalents).

This is because that commit changed vect_analyze_loop{,_2} to bail out early
and avoid retrying with a different vector size on finding a fatal error;
all the above tests are conditioned on { target vect_multiple_sizes }.

Hence, drop the vect_multiple_sizes version of the scan-tree-dump, as we now
expect those messages to show up once everywhere.

The optional extra would be to add a message that vect_analyze_loop was failing
with *fatal* error, and scan for that, but that doesn't really seem warranted.

Tested vect.exp on aarch64-none-elf, arm-none-eabi, and x86_64 linux with 
-march=haswell.

OK for trunk?

Cheers, Alan

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-outer-1-big-array.c: Drop vect_multiple_sizes;
use same scan-tree-dump-times on all platforms.
* gcc.dg/vect/vect-outer-1.c: Likewise.
* gcc.dg/vect/vect-outer-1a-big-array.c: Likewise.
* gcc.dg/vect/vect-outer-1a.c: Likewise.
* gcc.dg/vect/vect-outer-1b-big-array.c: Likewise.
* gcc.dg/vect/vect-outer-1b.c: Likewise.
* gcc.dg/vect/vect-outer-2b.c: Likewise.
* gcc.dg/vect/vect-outer-3b.c: Likewise.
* gcc.dg/vect/vect-reduc-dot-s8b.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c  | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-1.c| 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-1a.c   | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-1b-big-array.c | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-1b.c   | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-2b.c   | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-outer-3b.c   | 3 +--
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c  | 3 +--
 9 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c 
b/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c
index 6c61b09..63918ad 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1-big-array.c
@@ -22,5 +22,4 @@ foo (){
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail 
*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { 
target { ! vect_multiple_sizes } } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { 
target vect_multiple_sizes } } } */
+/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } 
} */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-outer-1.c
index 5fdaa83..b1bcbc2 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-outer-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1.c
@@ -22,5 +22,4 @@ foo (){
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail 
*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { 
target { ! vect_multiple_sizes } } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { 
target vect_multiple_sizes } } } */
+/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } 
} */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c 
b/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c
index 68b25f9..98dfcfb 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-outer-1a-big-array.c
@@ -20,5 +20,4 @@ foo (){
 }
 
 /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail 
*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" { 
target { ! vect_multiple_sizes } } } } */
-/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 2 "vect" { 
target vect_multiple_sizes } } } */
+/* { dg-final { scan-tree-dump-times "grouped access in outer loop" 1 "vect" } 
} */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-outer-1a.c 
b/gcc/testsuite/gcc.dg/vect/vect-outer-1a.c
index 85649af..0200fb4 100644
--- 

Re: [PATCH 0/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-01-21 Thread Alan Lawrence

On 18/01/16 17:10, Eric Botcazou wrote:

Similarly to ARM, I note that Ada is affected. Indeed, with a gcc 4.9 host
compiler, I saw a bootstrap miscompare iff including Ada; however, I was
able to bootstrap Ada successfully, if I first built a GCC including this
patch with --disable-bootstrap, and then used that as host compiler. The
best explanation I can see for this is mismatched host vs built libraries
and compiler being used together, something like Jakub's suggestion
http://gcc.gnu.org/ml/gcc-patches/2015-11/msg00338.html. I don't feel I have
the expertise for this, and am CCing the Ada maintainers in the hope they
can help.


That's a bit weird though because this should have also occurred for ARM when
the ABI was broken the same way if the Ada bootstrap is not entirely correct.
Now, as far I know, this didn't occur for ARM during bootstrap but only during
testing with make -k check.  Or else could this be a parallel compilation bug?

Could you post the list of files that differ?  How do they differ exactly?


Hmmm. Well, I definitely had this failing to bootstrap once. I repeated that, to 
try to identify exactly what the differences were... and it succeeded even with
my pre-AAPCS64-update host compiler. So, this is probably a false alarm; I'm 
bootstrapping again, after a rebase, to make sure...


--Alan



Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c

2016-01-19 Thread Alan Lawrence

On 19/01/16 09:46, Christophe Lyon wrote:

On 19 January 2016 at 04:05, H.J. Lu <hjl.to...@gmail.com> wrote:

On Thu, Dec 24, 2015 at 3:55 AM, Alan Lawrence <alan.lawre...@arm.com> wrote:

This version changes the test cases to fix failures on some platforms, by
rewriting the initializers so that they aren't pushed out to the constant pool.

gcc/ChangeLog:

 * tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
 using get_ref_base_and_extent.
 (equal_mem_array_ref_p): New.
 (hashable_expr_equal_p): Add call to previous.



This caused:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69352



Hi Alan,

This patch also caused regressions on arm-none-linux-gnueabihf
with GCC configured as:
--with-thumb --with-cpu=cortex-a57 --with-fpu=crypto-neon-fp-armv8

These tests now fail:
gcc.dg/torture/pr61742.c   -O2  (test for excess errors)
gcc.dg/torture/pr61742.c   -O2 -flto -fno-use-linker-plugin
-flto-partition=none  (test for excess errors)
gcc.dg/torture/pr61742.c   -O3 -fomit-frame-pointer -funroll-loops
-fpeel-loops -ftracer -finline-functions  (test for excess errors)
gcc.dg/torture/pr61742.c   -O3 -g  (test for excess errors)



Hmm, I still see these passing, both natively on arm-none-linux-gnueabihf and 
with a cross-build. hf implies --with-float=hard, right? Do you see what the 
error messages are?


Thanks, Alan



Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)

2016-01-19 Thread Alan Lawrence

On 19/01/16 11:15, Christophe Lyon wrote:


For neon_vdupn, I chose to implement neon_vdup_nv4hf and
neon_vdup_nv8hf instead of updating the VX iterator because I thought
it was not desirable to impact neon_vrev32.


Well, the same instruction will suffice for vrev32'ing vectors of HF just as
well as vectors of HI, so I think I'd argue that's harmless enough. To gain the
benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though.


Since this is more intrusive, I'd rather leave that part for later. OK?


Sure.


+#ifdef __ARM_BIG_ENDIAN
+  /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the
+ right value for vectors with 8 lanes.  */
+#define __arm_lane(__vec, __idx) (__idx ^ 3)
+#else
+#define __arm_lane(__vec, __idx) __idx
+#endif
+


Looks right, but sounds... my concern here is that I'm hoping at some point we
will move the *other* vget/set_lane intrinsics to use GCC vector extensions
too. At which time (unlike __aarch64_lane which can be used everywhere) this
will be the wrong formula. Can we name (and/or comment) it to avoid misleading
anyone? The key characteristic seems to be that it is for vectors of 16-bit
elements only.


I'm not to follow, here. Looking at the patterns for
neon_vget_lane_*internal in neon.md,
I can see 2 flavours: one for VD, one for VQ2. The latter uses "halfelts".

Do you prefer that I create 2 macros (say __arm_lane and __arm_laneq),
that would be similar to the aarch64 ones (by computing the number of
lanes of the input vector), but the "q" one would use half the total
number of lanes instead?


That works for me! Sthg like:

#define __arm_lane(__vec, __idx) NUM_LANES(__vec) - __idx
#define __arm_laneq(__vec, __idx) (__idx & (NUM_LANES(__vec)/2)) + 
(NUM_LANES(__vec)/2 - __idx)

//or similarly
#define __arm_laneq(__vec, __idx) (__idx ^ (NUM_LANES(__vec)/2 - 1))

Alternatively I'd been thinking

#define __arm_lane_32xN(__idx) __idx ^ 1
#define __arm_lane_16xN(__idx) __idx ^ 3
#define __arm_lane_8xN(__idx) __idx ^ 7

Bear in mind PR64893 that we had on AArch64 :-(

Cheers, Alan


[PATCH][ARM] Remove neon_reinterpret, use casts

2016-01-18 Thread Alan Lawrence
This cleans up the neon_reinterpret code on ARM in a similar way to AArch64.
Rather than a builtin backing onto an expander that emits a mov insn, we can
just use a cast, because GCC defines casts of vector types as keeping the same
bit pattern.
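
For illustration, the shape each intrinsic takes after this change is roughly
the following (a sketch, not lines copied from the patch):

__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
vreinterpret_s32_f32 (float32x2_t __a)
{
  return (int32x2_t) __a;   /* plain vector cast, same bits reinterpreted */
}

i.e. just the cast, which GCC defines to preserve the bit pattern.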

On armeb, this fixes previously-failing test:
gcc.target/arm/crypto-vldrq_p128.c scan-assembler vld1.64\t{d[0-9]+-d[0-9]+}.*

Bootstrap + check-gcc on arm-none-linux-gnueabihf;
cross-tested armeb-none-eabi.

OK for trunk?

gcc/ChangeLog:

* config/arm/arm-protos.h (neon_reinterpret): Remove.
* config/arm/arm.c (neon_reinterpret): Remove.
* config/arm/arm_neon_builtins.def (vreinterpretv8qi, vreinterpretv4hi,
vreinterpretv2si, vreinterpretv2sf, vreinterpretdi, vreinterpretv16qi,
vreinterpretv8hi, vreinterpretv4si, vreinterpretv4sf, vreinterpretv2di,
vreinterpretti): Remove.
* config/arm/neon.md (neon_vreinterpretv8qi,
neon_vreinterpretv4hi, neon_vreinterpretv2si,
neon_vreinterpretv2sf, neon_vreinterpretdi,
neon_vreinterpretti, neon_vreinterpretv16qi,
neon_vreinterpretv8hi, neon_vreinterpretv4si,
neon_vreinterpretv4sf, neon_vreinterpretv2di): Remove.
* config/arm/arm_neon.h (vreinterpret_p8_p16, vreinterpret_p8_f32,
vreinterpret_p8_p64, vreinterpret_p8_s64, vreinterpret_p8_u64,
vreinterpret_p8_s8, vreinterpret_p8_s16, vreinterpret_p8_s32,
vreinterpret_p8_u8, vreinterpret_p8_u16, vreinterpret_p8_u32,
vreinterpret_p16_p8, vreinterpret_p16_f32, vreinterpret_p16_p64,
vreinterpret_p16_s64, vreinterpret_p16_u64, vreinterpret_p16_s8,
vreinterpret_p16_s16, vreinterpret_p16_s32, vreinterpret_p16_u8,
vreinterpret_p16_u16, vreinterpret_p16_u32, vreinterpret_f32_p8,
vreinterpret_f32_p16, vreinterpret_f32_p64, vreinterpret_f32_s64,
vreinterpret_f32_u64, vreinterpret_f32_s8, vreinterpret_f32_s16,
vreinterpret_f32_s32, vreinterpret_f32_u8, vreinterpret_f32_u16,
vreinterpret_f32_u32, vreinterpret_p64_p8, vreinterpret_p64_p16,
vreinterpret_p64_f32, vreinterpret_p64_s64, vreinterpret_p64_u64,
vreinterpret_p64_s8, vreinterpret_p64_s16, vreinterpret_p64_s32,
vreinterpret_p64_u8, vreinterpret_p64_u16, vreinterpret_p64_u32,
vreinterpret_s64_p8, vreinterpret_s64_p16, vreinterpret_s64_f32,
vreinterpret_s64_p64, vreinterpret_s64_u64, vreinterpret_s64_s8,
vreinterpret_s64_s16, vreinterpret_s64_s32, vreinterpret_s64_u8,
vreinterpret_s64_u16, vreinterpret_s64_u32, vreinterpret_u64_p8,
vreinterpret_u64_p16, vreinterpret_u64_f32, vreinterpret_u64_p64,
vreinterpret_u64_s64, vreinterpret_u64_s8, vreinterpret_u64_s16,
vreinterpret_u64_s32, vreinterpret_u64_u8, vreinterpret_u64_u16,
vreinterpret_u64_u32, vreinterpret_s8_p8, vreinterpret_s8_p16,
vreinterpret_s8_f32, vreinterpret_s8_p64, vreinterpret_s8_s64,
vreinterpret_s8_u64, vreinterpret_s8_s16, vreinterpret_s8_s32,
vreinterpret_s8_u8, vreinterpret_s8_u16, vreinterpret_s8_u32,
vreinterpret_s16_p8, vreinterpret_s16_p16, vreinterpret_s16_f32,
vreinterpret_s16_p64, vreinterpret_s16_s64, vreinterpret_s16_u64,
vreinterpret_s16_s8, vreinterpret_s16_s32, vreinterpret_s16_u8,
vreinterpret_s16_u16, vreinterpret_s16_u32, vreinterpret_s32_p8,
vreinterpret_s32_p16, vreinterpret_s32_f32, vreinterpret_s32_p64,
vreinterpret_s32_s64, vreinterpret_s32_u64, vreinterpret_s32_s8,
vreinterpret_s32_s16, vreinterpret_s32_u8, vreinterpret_s32_u16,
vreinterpret_s32_u32, vreinterpret_u8_p8, vreinterpret_u8_p16,
vreinterpret_u8_f32, vreinterpret_u8_p64, vreinterpret_u8_s64,
vreinterpret_u8_u64, vreinterpret_u8_s8, vreinterpret_u8_s16,
vreinterpret_u8_s32, vreinterpret_u8_u16, vreinterpret_u8_u32,
vreinterpret_u16_p8, vreinterpret_u16_p16, vreinterpret_u16_f32,
vreinterpret_u16_p64, vreinterpret_u16_s64, vreinterpret_u16_u64,
vreinterpret_u16_s8, vreinterpret_u16_s16, vreinterpret_u16_s32,
vreinterpret_u16_u8, vreinterpret_u16_u32, vreinterpret_u32_p8,
vreinterpret_u32_p16, vreinterpret_u32_f32, vreinterpret_u32_p64,
vreinterpret_u32_s64, vreinterpret_u32_u64, vreinterpret_u32_s8,
vreinterpret_u32_s16, vreinterpret_u32_s32, vreinterpret_u32_u8,
vreinterpret_u32_u16, vreinterpretq_p8_p16, vreinterpretq_p8_f32,
vreinterpretq_p8_p64, vreinterpretq_p8_p128, vreinterpretq_p8_s64,
vreinterpretq_p8_u64, vreinterpretq_p8_s8, vreinterpretq_p8_s16,
vreinterpretq_p8_s32, vreinterpretq_p8_u8, vreinterpretq_p8_u16,
vreinterpretq_p8_u32, vreinterpretq_p16_p8, vreinterpretq_p16_f32,
vreinterpretq_p16_p64, vreinterpretq_p16_p128, vreinterpretq_p16_s64,
vreinterpretq_p16_u64, vreinterpretq_p16_s8, vreinterpretq_p16_s16,
vreinterpretq_p16_s32, vreinterpretq_p16_u8, vreinterpretq_p16_u16,
  

[PATCH][ARM] Add movv4hf/v8hf expanders & later insns; disable VnHF immediates.

2016-01-18 Thread Alan Lawrence
This fixes ICEs on armeb for float16x[48]_t vectors, e.g. in
check_effective_target_arm_neon_fp_16_ok.

At present, without the expander, moving v4hf/v8hf values around is done
via subregs. On armeb, this ICEs because REG_CANNOT_CHANGE_MODE_P. (On arm-*,
moving via two subregs is less efficient than one native insn!)

However, adding the expanders reveals a latent bug in the V4HF variant of
*neon_mov<mode>: vector constants are not handled properly in the
neon_valid_immediate code. Hence, for now I've used a separate expander that
disallows immediates, and disabled VnHF vectors as immediates in
neon_valid_immediate_for_move; I'll file a PR for this.

Also to fix the advsimd-intrinsics/vcombine test I had to add HF vector modes to
the VX iterator and V_reg attribute, for vdup_n, as loading a vector of
identical HF elements is now done by loading the scalar + vdup rather than
forcing the vector out to the constant pool.

On armeb, one of the ICEs this fixes, is in the test program for
check_effective_target_arm_neon_fp_16_ok. This means the advsimd-intrinsics
vcvt_f16 test now runs (and passes), and also that the other tests now run
with neon-fp16, rather than only neon as previously (on armeb).
This reveals that the fp16 cases of vld1_lane and vset_lane are (and were)
failing. Since those tests would previously have failed *if fp16 had been
passed in*, I think this is still a step forward; one can still run the tests
with an explicit non-fp16 multilib if the old behaviour is desired.

Note the previous patch removes other uses of VQXMOV (not strictly a dependency,
generating V4HF/V8HF reinterpret patterns is harmless, they just aren't used).

Bootstrapped + check-gcc on arm-none-linux-gnueabihf;
cross-tested armeb-none-eabi.

gcc/ChangeLog:

* config/arm/arm.c (neon_valid_immediate): Disallow vectors of HFmode.
* config/arm/iterators.md (V_HF): New.
(VQXMOV): Add V8HF.
(VX): Add V4HF, V8HF.
(V_reg): Add cases for V4HF, V8HF.
* config/arm/vec-common.md (mov V_HF): New.
---
 gcc/config/arm/arm.c |  2 ++
 gcc/config/arm/iterators.md  |  8 ++--
 gcc/config/arm/vec-common.md | 20 
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 3276b03..4fdba38 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -12371,6 +12371,8 @@ neon_valid_immediate (rtx op, machine_mode mode, int 
inverse,
   /* Vectors of float constants.  */
   if (GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
 {
+  if (GET_MODE_INNER (mode) == HFmode)
+   return -1;
   rtx el0 = CONST_VECTOR_ELT (op, 0);
   const REAL_VALUE_TYPE *r0;
 
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 974cf51..c5db868 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -59,6 +59,9 @@
 ;; Integer and float modes supported by Neon and IWMMXT.
 (define_mode_iterator VALL [V2DI V2SI V4HI V8QI V2SF V4SI V8HI V16QI V4SF])
 
+;; Vectors of half-precision floats.
+(define_mode_iterator V_HF [V4HF V8HF])
+
 ;; Integer and float modes supported by Neon and IWMMXT, except V2DI.
 (define_mode_iterator VALLW [V2SI V4HI V8QI V2SF V4SI V8HI V16QI V4SF])
 
@@ -99,7 +102,7 @@
 (define_mode_iterator VQI [V16QI V8HI V4SI])
 
 ;; Quad-width vector modes, with TImode added, for moves.
-(define_mode_iterator VQXMOV [V16QI V8HI V4SI V4SF V2DI TI])
+(define_mode_iterator VQXMOV [V16QI V8HI V8HF V4SI V4SF V2DI TI])
 
 ;; Opaque structure types wider than TImode.
 (define_mode_iterator VSTRUCT [EI OI CI XI])
@@ -160,7 +163,7 @@
 (define_mode_iterator VMDQI [V4HI V2SI V8HI V4SI])
 
 ;; Modes with 8-bit and 16-bit elements.
-(define_mode_iterator VX [V8QI V4HI V16QI V8HI])
+(define_mode_iterator VX [V8QI V4HI V4HF V16QI V8HI V8HF])
 
 ;; Modes with 8-bit elements.
 (define_mode_iterator VE [V8QI V16QI])
@@ -428,6 +431,7 @@
 ;; Register width from element mode
 (define_mode_attr V_reg [(V8QI "P") (V16QI "q")
  (V4HI "P") (V8HI  "q")
+(V4HF "P") (V8HF  "q")
  (V2SI "P") (V4SI  "q")
  (V2SF "P") (V4SF  "q")
  (DI   "P") (V2DI  "q")
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index ce98f71..c27578a 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -38,6 +38,26 @@
 }
 })
 
+;; This exists separately from the above pattern to exclude an immediate RHS.
+
+(define_expand "mov"
+  [(set (match_operand:V_HF 0 "nonimmediate_operand" "")
+   (match_operand:V_HF 1 "nonimmediate_operand" ""))]
+  "TARGET_NEON
+   || (TARGET_REALLY_IWMMXT && VALID_IWMMXT_REG_MODE (<MODE>mode))"
+{
+  if (can_create_pseudo_p ())
+{
+  if (!REG_P (operands[0]))
+   operands[1] = force_reg (<MODE>mode, operands[1]);
+  else if (TARGET_NEON && CONSTANT_P (operands[1]))
+   {
+ operands[1] = 

Re: [PATCH] ARM PR68620 (ICE with FP16 on armeb)

2016-01-18 Thread Alan Lawrence
Thanks for working on this, Christophe, and sorry I missed the PR. You got
further in fixing more things than I did though :). A couple of comments:

> For the vec_set<mode>_internal and neon_vld1_dup<mode> patterns, I
> switched to an existing iterator which already had the needed
> V4HF/V8HF (so I switched to VD_LANE and VQ2).

It's a separate issue, and I hadn't done this either, but looking again - I
don't see any reason why we shouldn't apply VD->VD_LANE to the vec_extract
standard name pattern too. (At present looks like we have vec_extractv8hf but no
vec_extractv4hf ?)

> For neon_vdupn, I chose to implement neon_vdup_nv4hf and
> neon_vdup_nv8hf instead of updating the VX iterator because I thought
> it was not desirable to impact neon_vrev32.

Well, the same instruction will suffice for vrev32'ing vectors of HF just as
well as vectors of HI, so I think I'd argue that's harmless enough. To gain the
benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though.

> @@ -5252,12 +5252,22 @@ vget_lane_s32 (int32x2_t __a, const int __b)
> were marked always-inline so there were no call sites, the declaration
> would nonetheless raise an error.  Hence, we must use a macro instead.  */
>
> +  /* For big-endian, GCC's vector indices are the opposite way around
> + to the architectural lane indices used by Neon intrinsics.  */

Not quite the opposite way around, as you take into account yourself! 'Reversed
within each 64 bits', perhaps?

> +#ifdef __ARM_BIG_ENDIAN
> +  /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the
> + right value for vectors with 8 lanes.  */
> +#define __arm_lane(__vec, __idx) (__idx ^ 3)
> +#else
> +#define __arm_lane(__vec, __idx) __idx
> +#endif
> +

Looks right, but sounds... my concern here is that I'm hoping at some point we
will move the *other* vget/set_lane intrinsics to use GCC vector extensions
too. At which time (unlike __aarch64_lane which can be used everywhere) this
will be the wrong formula. Can we name (and/or comment) it to avoid misleading
anyone? The key characteristic seems to be that it is for vectors of 16-bit
elements only.
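
For a concrete picture (an illustration, not text from the patch): with
16-bit elements on big-endian, __idx ^ 3 maps GCC vector-extension indices
to architectural lanes as

  /* 4 lanes (one D register):  0<->3, 1<->2                */
  /* 8 lanes (one Q register):  0<->3, 1<->2, 4<->7, 5<->6  */

i.e. lanes are reversed within each 64-bit half, which is why the same
constant serves both widths but only for 16-bit elements.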

> @@ -5334,7 +5344,7 @@ vgetq_lane_s32 (int32x4_t __a, const int __b)
>  ({   \
>float16x8_t __vec = (__v); \
>__builtin_arm_lane_check (8, __idx);   \
> -  float16_t __res = __vec[__idx];\
> +  float16_t __res = __vec[__arm_lane(__vec, __idx)]; \

In passing - the function name in the @@ header is of course misleading, this 
is #define vgetq_lane_f16 (and the later hunks)

Thanks, Alan


Re: [PATCH 1/4] Make SRA scalarize constant-pool loads

2016-01-15 Thread Alan Lawrence

On 24/12/15 11:53, Alan Lawrence wrote:

Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on
aarch64 and powerpc64. Prior to SRA handling constant pool decls,
-fdump-tree-esra-details (at -O1 -g) had shown:
   :
   a = *.LC0;
   # DEBUG a$0 => MEM[(int[3] *)&*.LC0]
   a$4_3 = MEM[(int[3] *)&*.LC0 + 4B];
   # DEBUG a$4 => a$4_3
   a$8_4 = MEM[(int[3] *)&*.LC0 + 8B];

The previous patch changed this to:
   :
   SR.5_3 = *.LC0[0];
   SR.6_4 = *.LC0[1];
   SR.7_19 = *.LC0[2];
   SR.8_21 = *.LC1[0];
   SR.9_22 = *.LC1[1];
   SR.10_23 = *.LC1[2];
   # DEBUG a$0 => NULL   // Note here
   a$4_24 = SR.6_4;
   # DEBUG a$4 => a$4_24
   a$8_25 = SR.7_19;

Turns out the DEBUG a$0 => NULL was produced in load_assign_lhs_subreplacements:

  if (lacc && lacc->grp_to_be_debug_replaced)
{
  gdebug *ds;
  tree drhs;
  struct access *racc = find_access_in_subtree (sad->top_racc,
offset,
lacc->size);

  if (racc && racc->grp_to_be_replaced)
{
  if (racc->grp_write)
drhs = get_access_replacement (racc);
  else
drhs = NULL;  // <=== HERE
}
...
  ds = gimple_build_debug_bind (get_access_replacement (lacc),
drhs, gsi_stmt (sad->old_gsi));

Prior to the patch, we'd skipped around load_assign_lhs_subreplacements, because
access_has_children_p(racc) (for racc = *.LC0) didn't hold in sra_modify_assign.

I also added a constant_decl_p function, combining the two checks, plus some
testcase fixes.
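
(For reference, the helper is essentially a one-liner; a sketch only, with
the posted patch being authoritative:

static bool
constant_decl_p (tree decl)
{
  /* Read-only VAR_DECLs that back constant-pool entries.  */
  return TREE_CODE (decl) == VAR_DECL && DECL_IN_CONSTANT_POOL (decl);
}

combining the two checks mentioned above.)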

Bootstrapped + check-gcc,g++ on x86_64, ARM, AArch64,
   also on powerpc64{,le}-none-linux-gnu *in combination with the other patches
   in the series* (I haven't tested the individual patches on PPC),
   plus Ada on ARM and x86_64.

gcc/ChangeLog:

PR target/63679
* tree-sra.c (disqualified_constants, constant_decl_p): New.
(sra_initialize): Allocate disqualified_constants.
(sra_deinitialize): Free disqualified_constants.
(disqualify_candidate): Update disqualified_constants when appropriate.
(create_access): Scan for constant-pool entries as we go along.
(scalarizable_type_p): Add check against type_contains_placeholder_p.
(maybe_add_sra_candidate): Allow constant-pool entries.
(load_assign_lhs_subreplacements): Bind debug for constant pool vars.
(initialize_constant_pool_replacements): New.
(sra_modify_assign): Avoid mangling assignments created by previous,
and don't generate writes into constant pool.
(sra_modify_function_body): Call initialize_constant_pool_replacements.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/sra-17.c: New.
* gcc.dg/tree-ssa/sra-18.c: New.


Ping.

(The next bit is false, unless you force SRA to happen more widely, but all the 
above stands)



This also fixes a bunch of other guality tests on AArch64 that were failing
prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), 
listed below.

>

Tests fixed on aarch64-none-linux-gnu:
gcc.dg/guality/pr54970.c   -O1  line 15 *p == 3
gcc.dg/guality/pr54970.c   -O1  line 15 *q == 2
gcc.dg/guality/pr54970.c   -O1  line 20 *p == 13
gcc.dg/guality/pr54970.c   -O1  line 20 *q == 2
gcc.dg/guality/pr54970.c   -O1  line 25 *p == 13
gcc.dg/guality/pr54970.c   -O1  line 25 *q == 12
gcc.dg/guality/pr54970.c   -O1  line 31 *p == 6
gcc.dg/guality/pr54970.c   -O1  line 31 *q == 5
gcc.dg/guality/pr54970.c   -O1  line 36 *p == 26
gcc.dg/guality/pr54970.c   -O1  line 36 *q == 5
gcc.dg/guality/pr54970.c   -O1  line 45 *p == 26
gcc.dg/guality/pr54970.c   -O1  line 45 *q == 25
gcc.dg/guality/pr54970.c   -O1  line 45 p[-1] == 25
gcc.dg/guality/pr54970.c   -O1  line 45 p[-2] == 4
gcc.dg/guality/pr54970.c   -O1  line 45 q[-1] == 4
gcc.dg/guality/pr54970.c   -O1  line 45 q[1] == 26
gcc.dg/guality/pr56154-1.c   -O1  line pr56154-1.c:20 x.a == 6
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s1.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s1.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s2.f == 0.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s2.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s1.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s1.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s2.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s2.g == 6.0
gcc.dg/guality/sra-1.c   -O1  line 21 a.j == 14
gcc.dg/guality/sra-1.c   -O1  line 32 a[1] == 14
gcc.dg/guality/sra-1.c   -O1  line 43 a.i == 4
gcc.dg/guality/sra-1.c   -O1  line 43 a.j == 14
(and other optimization levels)
On ppc64 bigendian, with the rest of the series (but I expect it's this p

Re: [PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.

2016-01-15 Thread Alan Lawrence

On 15/01/16 10:07, Richard Biener wrote:

On Fri, Jan 15, 2016 at 10:47 AM, Alan Lawrence <alan.lawre...@arm.com> wrote:

On Thu, Jan 14, 2016 at 12:30 PM, Richard Biener <richard.guent...@gmail.com> 
wrote:



The vuse test is not necessary


+  && (gimple_assign_rhs_code (def) == SSA_NAME
+  || is_gimple_min_invariant (gimple_assign_rhs1 (def


and the is_gimple_min_invariant (rhs1) test is not sufficient if you
consider - (-INT_MIN) with -ftrapv for example.


Thanks, I didn't realize gimple_min_invariant would allow such cases.


Well, the invariant would be -INT_MIN but gimple_assign_rhs_code (def) would
be NEGATE_EXPR.  Basically you forgot about unary operators.


Hmm, shouldn't those have get_gimple_rhs_class(gimple_assign_rhs_code(stmt)) == 
GIMPLE_UNARY_RHS, rather than GIMPLE_SINGLE_RHS as checked for by 
gimple_assign_single_p?


If SINGLE_RHS includes unary operators, the new version of the patch is as 
flawed as the previous, in that it drops the unary operator altogether.
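
(To make the distinction concrete, an illustration of mine rather than part
of the thread: for a statement like res_1 = -x_2; the rhs code is
NEGATE_EXPR, whose class is GIMPLE_UNARY_RHS, so gimple_assign_single_p
returns false; only plain copies, constants and memory loads such as
res_1 = x_2; or res_1 = 5; have class GIMPLE_SINGLE_RHS.)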


--Alan


Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2016-01-15 Thread Alan Lawrence
It seems the conclusion on PowerPC is to XFAIL the test on powerpc64 (there will 
be XPASSes with -mcpu=power7 or -mcpu=power8). Which is what the original patch 
does (https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01979.html). So,


Ping.

Thanks, Alan
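
(For reference, the directive under discussion takes roughly this shape in
the test file, with "..." standing for the pattern and dump name the test
already uses:

/* { dg-final { scan-tree-dump "..." "..." { xfail { powerpc64*-*-* } } } } */

the alternative selector suggested below being
{ powerpc*-*-* && { ! vect_hw_misalign } }.)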

On 21/12/15 15:33, Bill Schmidt wrote:

On Mon, 2015-12-21 at 15:22 +, Alan Lawrence wrote:

On 21/12/15 14:59, Bill Schmidt wrote:


On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a
reduction); however, without that, similar code is generated to Alpha (the
vectorizer decides the reduction is not worthwhile without SIMD support), and
the test fails; hence, I've XFAILed for powerpc, but I think I could condition
the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred?


Fun.

Does it work with -mcpu=power7?


Yes, it works with -mcpu=power7, as well as -mcpu=power8, but not e.g. 
-mcpu=power6.


Bill: What GCC DejaGNU incantation would you like to see?


This sounds like more fallout from unaligned accesses being faster on
POWER8 than previous hardware.  What about conditioning the XFAIL on

{ powerpc*-*-* && { ! vect_hw_misalign } }

-- does this work properly?


Not on a stage1 compiler - check_p8vector_hw_available itself requires being
able to run executables - I'll check on gcc112. However, both look like they're
really about the host (ability to execute an asm instruction), not the target
(ability for gcc to output such an instruction)


Hm, that looks like a pervasive problem for powerpc.  There are a number
of things that are supposed to be testing effective target but rely on
check_p8vector_hw_available, which as you note requires executing an
instruction and is really about the host.  We need to clean that up; I
should probably open a bug.  Kind of amazed this has gotten past us for
a couple of years.

For now, just XFAILing for powerpc seems the best alternative for this
test.

Thanks,
Bill



--Alan








[PATCH 0/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-01-15 Thread Alan Lawrence
These parallel the updates to ARM
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg00214.html following from Richard
Earnshaw's proposal for updates to the AAPCS and AAPCS64,
 https://gcc.gnu.org/ml/gcc/2015-07/msg00040.html .

On AArch64 we do not have the problem of broken profiledbootstrap (as originally
on ARM), hence I do not propose to backport these to earlier GCC branches.
Rather, the GCC 5 -> 6 transition, seems a better time for any ABI change.
(However, as previously, there should be no changes for types that are naturally
aligned, only those using types with explicit alignment specifiers, which was
not previously sanctioned by the AAPCS.)
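
(A concrete example of an affected type, as used in the accompanying
gcc.target/aarch64/aapcs64/ tests, is one carrying an explicit alignment
attribute:

typedef __attribute__ ((__aligned__ (8))) int alignedint;

naturally aligned types keep their existing treatment.)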

Similarly to ARM, I note that Ada is affected. Indeed, with a gcc 4.9 host
compiler, I saw a bootstrap miscompare iff including Ada; however, I was able to
bootstrap Ada successfully, if I first built a GCC including this patch with
--disable-bootstrap, and then used that as host compiler. The best explanation
I can see for this is mismatched host vs built libraries and compiler being used
together, something like Jakub's suggestion
http://gcc.gnu.org/ml/gcc-patches/2015-11/msg00338.html. I don't feel I have the
expertise for this, and am CCing the Ada maintainers in the hope they can help.

Bootstrapped + check-gcc + check-g++ on aarch64-none-linux-gnu and
aarch64_be-none-elf.
Also bootstrapped + check-ada on aarch64-none-linux-gnu with --disable-bootstrap
host compiler, as above.

OK for trunk?

--Alan




[PATCH 2/2][AArch64] Tests of AAPCS64 updates for alignment attribute

2016-01-15 Thread Alan Lawrence
Here I've added both tests using the abitest.h framework(which verifies values
are passed in the correct registers as specified by the AAPCS64), and separate
tests which verify that called functions read arguments from the same locations
as they are passed. Hence, each test_align-N.c corresponds to rec_align-N.c.

I've tried to stay consistent with the existing naming of e.g. test_10.c,
test_align-1.c, va_arg-10.c, but would be happy to change to another scheme
if preferred! (e.g. func-ret-1.c ?)

Cheers, Alan

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/aapcs64/aapcs64.exp: Also execute rec_*.c
* gcc.target/aarch64/aapcs64/rec_align-5.c: New.
* gcc.target/aarch64/aapcs64/rec_align-6.c: New.
* gcc.target/aarch64/aapcs64/rec_align-7.c: New.
* gcc.target/aarch64/aapcs64/rec_align-8.c: New.
* gcc.target/aarch64/aapcs64/rec_align-9.c: New.
* gcc.target/aarch64/aapcs64/test_align-5.c: New.
* gcc.target/aarch64/aapcs64/test_align-6.c: New.
* gcc.target/aarch64/aapcs64/test_align-7.c: New.
* gcc.target/aarch64/aapcs64/test_align-8.c: New.
* gcc.target/aarch64/aapcs64/test_align-9.c: New.
* gcc.target/aarch64/aapcs64/rec_align_vaarg-1.c: New.
* gcc.target/aarch64/aapcs64/rec_align_vaarg-2.c: New.
---
 .../gcc.target/aarch64/aapcs64/aapcs64.exp | 10 +
 .../gcc.target/aarch64/aapcs64/rec_align-5.c   | 44 
 .../gcc.target/aarch64/aapcs64/rec_align-6.c   | 45 +
 .../gcc.target/aarch64/aapcs64/rec_align-7.c   | 47 ++
 .../gcc.target/aarch64/aapcs64/rec_align-8.c   | 37 +
 .../gcc.target/aarch64/aapcs64/rec_align-9.c   | 41 +++
 .../gcc.target/aarch64/aapcs64/rec_align_vaarg-1.c | 38 +
 .../gcc.target/aarch64/aapcs64/rec_align_vaarg-2.c | 28 +
 .../gcc.target/aarch64/aapcs64/test_align-5.c  | 35 
 .../gcc.target/aarch64/aapcs64/test_align-6.c  | 36 +
 .../gcc.target/aarch64/aapcs64/test_align-7.c  | 38 +
 .../gcc.target/aarch64/aapcs64/test_align-8.c  | 33 +++
 .../gcc.target/aarch64/aapcs64/test_align-9.c  | 47 ++
 13 files changed, 479 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-9.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align_vaarg-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align_vaarg-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/aapcs64/test_align-9.c

diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp 
b/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp
index ac2f089..eb297c5 100644
--- a/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp
+++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/aapcs64.exp
@@ -38,6 +38,16 @@ foreach src [lsort [glob -nocomplain 
$srcdir/$subdir/test_*.c]] {
 }
 }
 
+# Test parameter receiving.
+set additional_flags_for_rec $additional_flags
+append additional_flags_for_rec " -fno-inline"
+foreach src [lsort [glob -nocomplain $srcdir/$subdir/rec_*.c]] {
+if {[runtest_file_p $runtests $src]} {
+   c-torture-execute [list $src] \
+   $additional_flags_for_rec
+}
+}
+
 # Test unnamed argument retrieval via the va_arg macro.
 foreach src [lsort [glob -nocomplain $srcdir/$subdir/va_arg-*.c]] {
 if {[runtest_file_p $runtests $src]} {
diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c 
b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c
new file mode 100644
index 000..1b42c92
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/rec_align-5.c
@@ -0,0 +1,44 @@
+/* Test AAPCS layout (alignment) for callee.  */
+
+/* { dg-do run { target aarch64*-*-* } } */
+
+extern void abort (void);
+
+typedef __attribute__ ((__aligned__ (8))) int alignedint;
+
+alignedint a = 11;
+alignedint b = 13;
+alignedint c = 17;
+alignedint d = 19;
+alignedint e = 23;
+alignedint f = 29;
+alignedint g = 31;
+alignedint h = 37;
+alignedint i = 41;
+alignedint j = 43;
+
+void
+test_passing_many_alignedint (alignedint x0, alignedint x1, alignedint x2,
+ alignedint x3, alignedint x4, alignedint x5,
+ alignedint x6, 

[PATCH 1/2][AArch64] Implement AAPCS64 updates for alignment attribute

2016-01-15 Thread Alan Lawrence
gcc/ChangeLog:

* gcc/config/aarch64/aarch64.c (aarch64_function_arg_alignment):
Rewrite, looking one level down for records and arrays.
---
 gcc/config/aarch64/aarch64.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ae4cfb3..8eb8c3e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1925,22 +1925,24 @@ aarch64_vfp_is_call_candidate (cumulative_args_t 
pcum_v, machine_mode mode,
 static unsigned int
 aarch64_function_arg_alignment (machine_mode mode, const_tree type)
 {
-  unsigned int alignment;
+  if (!type)
+return GET_MODE_ALIGNMENT (mode);
+  if (integer_zerop (TYPE_SIZE (type)))
+return 0;
 
-  if (type)
-{
-  if (!integer_zerop (TYPE_SIZE (type)))
-   {
- if (TYPE_MODE (type) == mode)
-   alignment = TYPE_ALIGN (type);
- else
-   alignment = GET_MODE_ALIGNMENT (mode);
-   }
-  else
-   alignment = 0;
-}
-  else
-alignment = GET_MODE_ALIGNMENT (mode);
+  gcc_assert (TYPE_MODE (type) == mode);
+
+  if (!AGGREGATE_TYPE_P (type))
+return TYPE_ALIGN (TYPE_MAIN_VARIANT (type));
+
+  if (TREE_CODE (type) == ARRAY_TYPE)
+return TYPE_ALIGN (TREE_TYPE (type));
+
+  unsigned int alignment = 0;
+  gcc_assert (TYPE_FIELDS (type));
+
+  for (tree field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
+alignment = std::max (alignment, DECL_ALIGN (field));
 
   return alignment;
 }
-- 
1.9.1



Re: [PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.

2016-01-15 Thread Alan Lawrence
On Thu, Jan 14, 2016 at 12:30 PM, Richard Biener  
wrote:

>
> The vuse test is not necessary
>
>> +  && (gimple_assign_rhs_code (def) == SSA_NAME
>> +  || is_gimple_min_invariant (gimple_assign_rhs1 (def
>
> and the is_gimple_min_invariant (rhs1) test is not sufficient if you
> consider - (-INT_MIN) with -ftrapv for example.

Thanks, I didn't realize gimple_min_invariant would allow such cases.

>> +  /* Keep symbolic form, but look through obvious copies for constants. 
>>  */
>
> You're also looing for SSA names copied so the comment is sligntly wrong.

So it occurs to me that we do only need to look for constants: picking one SSA
name instead of another (when both are outside the loop) doesn't really help
SCEV; and it seems safer/less change to avoid fiddling around with which names
are used. Moreover, given we are in LC _SSA_, I think we can look through
degenerate phi's even of SSA_NAMES, without violating loop-closed-ness,
so long as we only use cases where we eventually reach a constant.

Therefore, how about this version?

(Bootstrapped + check-gcc + check-g++ on x86_64, AArch64, ARM; again, fixing
vectorization of pr65947-2.c).

If this is OK - that leaves only the first patch to SRA,
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02117.html (the "extra" fixes
I claimed in the guality testsuite were a testing mistake, but this
version still fixes the regressions in guality pr54970.c from the original
"ok with the comment added and a testcase (or two)"
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01948.html, as well as the
comment + 2*testcase). I think I'd still like to propose these as a stage 3/4
fix (with --param sra-max-scalarization-size-Ospeed=32) of PR/63679, a
regression against gcc 4.9.

Thanks, Alan

gcc/ChangeLog:

* tree-scalar-evolution.c (follow_copies_to_constant): New.
(analyze_initial_condition, analyze_scalar_evolution_1): Call previous.
---
 gcc/tree-scalar-evolution.c | 50 ++---
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 9b33693..470c3ea 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1522,6 +1522,34 @@ analyze_evolution_in_loop (gphi *loop_phi_node,
   return evolution_function;
 }
 
+/* Looks to see if VAR is a copy of a constant (via straightforward assignments
+   or degenerate phi's).  If so, returns the constant; else, returns VAR.  */
+
+static tree
+follow_copies_to_constant (tree var)
+{
+  tree res = var;
+  while (TREE_CODE (res) == SSA_NAME)
+{
+  gimple *def = SSA_NAME_DEF_STMT (res);
+  if (gphi *phi = dyn_cast <gphi *> (def))
+   {
+ if (tree rhs = degenerate_phi_result (phi))
+   res = rhs;
+ else
+   break;
+   }
+  else if (gimple_assign_single_p (def))
+   /* Will exit loop if not an SSA_NAME.  */
+   res = gimple_assign_rhs1 (def);
+  else
+   break;
+}
+  if (CONSTANT_CLASS_P (res))
+return res;
+  return var;
+}
+
 /* Given a loop-phi-node, return the initial conditions of the
variable on entry of the loop.  When the CCP has propagated
constants into the loop-phi-node, the initial condition is
@@ -1574,21 +1602,9 @@ analyze_initial_condition (gphi *loop_phi_node)
   if (init_cond == chrec_not_analyzed_yet)
 init_cond = chrec_dont_know;
 
-  /* During early loop unrolling we do not have fully constant propagated IL.
- Handle degenerate PHIs here to not miss important unrollings.  */
-  if (TREE_CODE (init_cond) == SSA_NAME)
-{
-  gimple *def = SSA_NAME_DEF_STMT (init_cond);
-  if (gphi *phi = dyn_cast <gphi *> (def))
-   {
- tree res = degenerate_phi_result (phi);
- if (res != NULL_TREE
- /* Only allow invariants here, otherwise we may break
-loop-closed SSA form.  */
- && is_gimple_min_invariant (res))
-   init_cond = res;
-   }
-}
+  /* We may not have fully constant propagated IL.  Handle degenerate PHIs here
+ to not miss important early loop unrollings.  */
+  init_cond = follow_copies_to_constant (init_cond);
 
   if (dump_file && (dump_flags & TDF_SCEV))
 {
@@ -1968,8 +1984,8 @@ analyze_scalar_evolution_1 (struct loop *loop, tree var, 
tree res)
   if (bb == NULL
   || !flow_bb_inside_loop_p (loop, bb))
 {
-  /* Keep the symbolic form.  */
-  res = var;
+  /* Keep symbolic form, but look through obvious copies for constants.  */
+  res = follow_copies_to_constant (var);
   goto set_and_end;
 }
 
-- 
1.9.1



Re: [PATCH] Tidy: remove reduc_xxx_optab migration code

2016-01-14 Thread Alan Lawrence

On 14/01/16 12:22, Richard Biener wrote:

On Thu, Jan 14, 2016 at 11:26 AM, Alan Lawrence <alan.lawre...@arm.com> wrote:

If/when mips-ps-3d.md is moved from reduc_* to reduc_*_scal optabs (patch here:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00311.html ), there will be no
uses of the old reduc_* optabs remaining. This patch removes those optabs and
the migration path.

Bootstrapped + check-gcc + check-g++ on x86_64 and AArch64.

I'm not sure whether such a code tidy is acceptable for stage 3, please advise?


Yes, this is still very much welcome.

Thanks,
Richard.



I took that as an approval, and committed r232373.

Thanks, Alan



[PATCH 3/4 v2] Enhance SCEV to follow copies of SSA_NAMEs.

2016-01-14 Thread Alan Lawrence
(Previous message: https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02159.html)
On Sat, Dec 26, 2015 at 18:58 PM, Richard Biener  
wrote:

>> I'm not sure whether adding a pass_copy_prop is the right thing here, but 
>> since
>> loop-header-copying can create such opportunities, it seems good to take 
>> them!
>
> Aww.  I'd rather have general infrastructure (like SCEV) deal with
> those non-propagated copies.
>
> There are likely other missed propagation / folding opportunities
> caused from partial peeling.
>
> Richard.

I take your second point, but I am concerned that this leads to duplication of
copy-propagation code throughout the compiler?

However, this patch does that. Making analyze_initial_condition (alone) follow
copies, is enough to solve my case of missed vectorization of pr65947-2.c;
but I also used the routine in analyze_scalar_evolution_1, as it found copies
to follow in both bootstrap and a significant number of testcases:

c-c++-common/gomp/pr58472.c
gcc.c-torture/execute/2422-1.c
gcc.c-torture/execute/20041213-2.c
gcc.c-torture/execute/20100430-1.c
gcc.c-torture/execute/pr49712.c
gcc.c-torture/execute/pr58640.c
gcc.dg/Warray-bounds-13.c
gcc.dg/graphite/block-{1,5,6}.c
gcc.dg/graphite/interchange-12.c
gcc.dg/graphite/interchange-mvt.c
gcc.dg/graphite/pr19910.c
gcc.dg/graphite/uns-block-1.c
gcc.dg/loop-unswitch-{2,4}.c
gcc.dg/pr59670.c
gcc.dg/torture/matrix-{1,2,3,4,5,6}.c
gcc.dg/torture/pr24750-1.c
gcc.dg/torture/transpose-{1,2,3,4,5,6}.c

...some of which were resolved to constants.

(Of course, this approach is not as thorough as adding a pass_copy_prop. E.g. it
does *not* enable vectorization of the inlined copy in pr65947-9.c.)

Bootstrapped + check-gcc + check-g++ on ARM, AArch64, x86_64.

gcc/ChangeLog:

* tree-scalar-evolution.c (follow_copies): New.
(analyze_initial_condition, analyze_scalar_evolution_1): Call previous.
---
 gcc/tree-scalar-evolution.c | 53 ++---
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 9b33693..357eb8f 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -1522,6 +1522,37 @@ analyze_evolution_in_loop (gphi *loop_phi_node,
   return evolution_function;
 }
 
+/* While VAR is an SSA_NAME whose definition is a straight copy of another 
name,
+   or (perhaps via a degenerate phi) a constant, follows that definition.
+   Returns the constant, or the earliest SSA_NAME whose definition is neither
+   constant nor copy.  */
+
+static tree
+follow_copies (tree var)
+{
+  while (TREE_CODE (var) == SSA_NAME)
+{
+  gimple *def = SSA_NAME_DEF_STMT (var);
+  if (gphi *phi = dyn_cast <gphi *> (def))
+   {
+ tree rhs = degenerate_phi_result (phi);
+ /* Don't follow degenerate phi's of SSA_NAMES, that can break
+loop-closed SSA form.  */
+ if (rhs && is_gimple_min_invariant (rhs))
+   var = rhs;
+ break;
+   }
+  else if (gimple_assign_single_p (def)
+  && !gimple_vuse (def)
+  && (gimple_assign_rhs_code (def) == SSA_NAME
+  || is_gimple_min_invariant (gimple_assign_rhs1 (def
+   var = gimple_assign_rhs1 (def);
+  else
+   break;
+}
+  return var;
+}
+
 /* Given a loop-phi-node, return the initial conditions of the
variable on entry of the loop.  When the CCP has propagated
constants into the loop-phi-node, the initial condition is
@@ -1574,21 +1605,9 @@ analyze_initial_condition (gphi *loop_phi_node)
   if (init_cond == chrec_not_analyzed_yet)
 init_cond = chrec_dont_know;
 
-  /* During early loop unrolling we do not have fully constant propagated IL.
- Handle degenerate PHIs here to not miss important unrollings.  */
-  if (TREE_CODE (init_cond) == SSA_NAME)
-{
-  gimple *def = SSA_NAME_DEF_STMT (init_cond);
-  if (gphi *phi = dyn_cast <gphi *> (def))
-   {
- tree res = degenerate_phi_result (phi);
- if (res != NULL_TREE
- /* Only allow invariants here, otherwise we may break
-loop-closed SSA form.  */
- && is_gimple_min_invariant (res))
-   init_cond = res;
-   }
-}
+  /* We may not have fully constant propagated IL.  Handle degenerate PHIs here
+ to not miss important early loop unrollings.  */
+  init_cond = follow_copies (init_cond);
 
   if (dump_file && (dump_flags & TDF_SCEV))
 {
@@ -1968,8 +1987,8 @@ analyze_scalar_evolution_1 (struct loop *loop, tree var, 
tree res)
   if (bb == NULL
   || !flow_bb_inside_loop_p (loop, bb))
 {
-  /* Keep the symbolic form.  */
-  res = var;
+  /* Keep symbolic form, but look through obvious copies for constants.  */
+  res = follow_copies (var);
   goto set_and_end;
 }
 
-- 
1.9.1



Re: [PATCH][MIPS] Migrate reduction optabs in mips-ps-3d.md

2016-01-14 Thread Alan Lawrence

On 07/01/16 12:47, Alan Lawrence wrote:

Here's an updated version, also covering the min/max patterns I missed before.
I've now managed to do some testing with a stage 1 compiler, by compiling all
tests in gcc.dg/vect at -O2 -ftree-vectorize -mips3d -march=mips64r2 -mabi=n32
$x -ffast-math -ffinite-math-only.

There were no changes in which files compiled (871 did, some did not due to
libraries etc. missing in my stage 1 compiler), and no changes in any assembly
output. The patterns were triggered on:

fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, fast-math-vect-reduc-9.c,
no-fast-math-vect16.c, pr66142.c, vect-outer-fir-big-array.c, vect-outer-fir.c,
vect-outer-fir-lb-big-array.c, vect-outer-fir-lb.c, vect-reduc-10.c,
vect-reduc-6.c

I realize this is not completely representative of a 'proper' test run, in
which different files are compiled with different options, but it provides
reasonable confidence that I'm not changing any code generation.

OK for trunk? (stage 3?)


Ping.


gcc/ChangeLog:

* config/mips/mips-ps-3d.md (reduc_splus_v2sf): Remove.
(reduc_plus_scal_v2sf): New.
(reduc_smax_v2sf): Rename to...
(reduc_smax_scal_v2sf): ...here, make result SFmode, add vec_extract.
(reduc_smin_v2sf): Rename to...
(reduc_smin_scal_v2sf): ...here, make result SFmode, add vec_extract.
---
  gcc/config/mips/mips-ps-3d.md | 34 ++
  1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/gcc/config/mips/mips-ps-3d.md b/gcc/config/mips/mips-ps-3d.md
index 8bc7608..a93a2b9 100644
--- a/gcc/config/mips/mips-ps-3d.md
+++ b/gcc/config/mips/mips-ps-3d.md
@@ -371,13 +371,17 @@
[(set_attr "type" "fadd")
 (set_attr "mode" "SF")])

-(define_insn "reduc_splus_v2sf"
-  [(set (match_operand:V2SF 0 "register_operand" "=f")
-   (unspec:V2SF [(match_operand:V2SF 1 "register_operand" "f")
- (match_dup 1)]
-UNSPEC_ADDR_PS))]
+(define_expand "reduc_plus_scal_v2sf"
+  [(match_operand:SF 0 "register_operand" "=f")
+   (match_operand:V2SF 1 "register_operand" "f")]
"TARGET_HARD_FLOAT && TARGET_MIPS3D"
-  "")
+  {
+rtx temp = gen_reg_rtx (V2SFmode);
+emit_insn (gen_mips_addr_ps (temp, operands[1], operands[1]));
+rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
+DONE;
+  })

  ; cvt.pw.ps - Floating Point Convert Paired Single to Paired Word
  (define_insn "mips_cvt_pw_ps"
@@ -745,20 +749,26 @@
DONE;
  })

-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
 (match_operand:V2SF 1 "register_operand")]
"TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
  {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_sminv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_sminv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
DONE;
  })

-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
 (match_operand:V2SF 1 "register_operand")]
"TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
  {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_smaxv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_smaxv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
DONE;
  })





[PATCH] Tidy: remove reduc_xxx_optab migration code

2016-01-14 Thread Alan Lawrence
If/when mips-ps-3d.md is moved from reduc_* to reduc_*_scal optabs (patch here:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00311.html ), there will be no
uses of the old reduc_* optabs remaining. This patch removes those optabs and
the migration path.

Bootstrapped + check-gcc + check-g++ on x86_64 and AArch64.

I'm not sure whether such a code tidy is acceptable for stage 3, please advise?

Thanks, Alan

gcc/ChangeLog:

* doc/md.texi (reduc_smin_@var{m}, reduc_smax_@var{m},
reduc_umin_@var{m}, reduc_umax_@var{m}, reduc_splus_@var{m},
reduc_uplus_@var{m}): Remove.
* expr.c (expand_expr_real_2): Remove expansion path for
reduc_[us](min|max|plus) optabs.
* optabs-tree.c (scalar_reduc_to_vector): Remove.
* optabs-tree.h (scalar_reduc_to_vector): Remove.
* optabs.def (reduc_smax_optab, reduc_smin_optab, reduc_splus_optab,
reduc_umax_optab, reduc_umin_optab, reduc_uplus_optab): Remove.
* tree-vect-loop.c (vectorizable_reduction): Remove test for
reduc_[us](min|max|plus) optabs.
---
 gcc/doc/md.texi  | 27 ---
 gcc/expr.c   | 39 ++-
 gcc/optabs-tree.c| 20 
 gcc/optabs-tree.h|  1 -
 gcc/optabs.def   |  7 ---
 gcc/tree-vect-loop.c | 12 
 6 files changed, 14 insertions(+), 92 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index faf3910..68321dc 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5007,33 +5007,6 @@ raised and a quiet @code{NaN} is returned.
 All operands have mode @var{m}, which is a scalar or vector
 floating-point mode.  These patterns are not allowed to @code{FAIL}.
 
-@cindex @code{reduc_smin_@var{m}} instruction pattern
-@cindex @code{reduc_smax_@var{m}} instruction pattern
-@item @samp{reduc_smin_@var{m}}, @samp{reduc_smax_@var{m}}
-Find the signed minimum/maximum of the elements of a vector. The vector is
-operand 1, and the result is stored in the least significant bits of
-operand 0 (also a vector). The output and input vector should have the same
-modes. These are legacy optabs, and platforms should prefer to implement
-@samp{reduc_smin_scal_@var{m}} and @samp{reduc_smax_scal_@var{m}}.
-
-@cindex @code{reduc_umin_@var{m}} instruction pattern
-@cindex @code{reduc_umax_@var{m}} instruction pattern
-@item @samp{reduc_umin_@var{m}}, @samp{reduc_umax_@var{m}}
-Find the unsigned minimum/maximum of the elements of a vector. The vector is
-operand 1, and the result is stored in the least significant bits of
-operand 0 (also a vector). The output and input vector should have the same
-modes. These are legacy optabs, and platforms should prefer to implement
-@samp{reduc_umin_scal_@var{m}} and @samp{reduc_umax_scal_@var{m}}.
-
-@cindex @code{reduc_splus_@var{m}} instruction pattern
-@cindex @code{reduc_uplus_@var{m}} instruction pattern
-@item @samp{reduc_splus_@var{m}}, @samp{reduc_uplus_@var{m}}
-Compute the sum of the signed/unsigned elements of a vector. The vector is
-operand 1, and the result is stored in the least significant bits of operand 0
-(also a vector). The output and input vector should have the same modes.
-These are legacy optabs, and platforms should prefer to implement
-@samp{reduc_plus_scal_@var{m}}.
-
 @cindex @code{reduc_smin_scal_@var{m}} instruction pattern
 @cindex @code{reduc_smax_scal_@var{m}} instruction pattern
 @item @samp{reduc_smin_scal_@var{m}}, @samp{reduc_smax_scal_@var{m}}
diff --git a/gcc/expr.c b/gcc/expr.c
index 8973893..e1ed44d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9164,35 +9164,16 @@ expand_expr_real_2 (sepops ops, rtx target, 
machine_mode tmode,
 this_optab = optab_for_tree_code (code, type, optab_default);
 machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
 
-   if (optab_handler (this_optab, vec_mode) != CODE_FOR_nothing)
- {
-   struct expand_operand ops[2];
-   enum insn_code icode = optab_handler (this_optab, vec_mode);
-
-   create_output_operand (&ops[0], target, mode);
-   create_input_operand (&ops[1], op0, vec_mode);
-   if (maybe_expand_insn (icode, 2, ops))
- {
-   target = ops[0].value;
-   if (GET_MODE (target) != mode)
- return gen_lowpart (tmode, target);
-   return target;
- }
- }
-   /* Fall back to optab with vector result, and then extract scalar.  */
-   this_optab = scalar_reduc_to_vector (this_optab, type);
-temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
-gcc_assert (temp);
-/* The tree code produces a scalar result, but (somewhat by convention)
-   the optab produces a vector with the result in element 0 if
-   little-endian, or element N-1 if big-endian.  So pull the scalar
-   result out of that element.  */
-int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 

Re: [AARCH64][ACLE] Implement __ARM_FP_FENV_ROUNDING in aarch64 backend.

2016-01-11 Thread Alan Lawrence

On 11/01/16 10:28, Marcus Shawcroft wrote:

On 11 January 2016 at 08:12, Bilyan Borisov  wrote:


2015-XX-XX  Bilyan Borisov  

 * config/aarch64/aarch64-c.c (aarch64_update_cpp_builtins): New
macro
 definition.

gcc/testsuite/

2015-XX-XX  Bilyan Borisov  

 * gcc.target/aarch64/fesetround-checking-baremetal.c: New.
 * gcc.target/aarch64/fesetround-checking-linux.c: Likewise.


for new test cases in gcc.target/aarch64 use the name convention here
https://gcc.gnu.org/wiki/TestCaseWriting, in this case:

fesetround_checking_baremetal_1.c
fesetround_checking_linux_1.c

If we are distinguishing between glibc/musl/bionic/uclibc then the use
of the name "linux" in the last test case would seem to be
inappropriate (all of glibc/musl/bionic and uclibc can be used on top
of a linux kernel). Suggest s/linux/glibc/ in the test case name.

Cheers
/Marcus


However, the test doesn't really look at whether we're using glibc vs 
musl/bionic/uclibc, only at whether we are targeting -linux-gnu or -none-elf.


Indeed, IIUC, the test would fail if one (somehow) built a gcc against 
musl/bionic/uclibc, as I'm not aware of us having a mechanism to detect that from 
the testsuite...


Just my two cents!

Alan


[PATCH] Cleanup vect testsuite includes

2016-01-11 Thread Alan Lawrence
This was an attempt to make more of the vect testsuite compilable with a stage-1
compiler, i.e. without standard header files like stdlib.h, to ease looking for
differences in assembly output. (It is still necessary to comment out most of
tree-vect.h to do this, but at least such temporary/local changes can be
restricted to one file.)

Inclusion of stdlib.h and signal.h are quite inconsistent, with some files
explicitly declaring common functions like abort, and others #including the
header, sometimes totally unnecessarily.
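
The typical replacement is simply the declaration itself, e.g. (a sketch of
the pattern rather than a particular hunk):

extern void abort (void);   /* instead of #include <stdlib.h> */

so the tests no longer need any libc headers in order to compile.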

I left files using malloc, calloc and free as is, though I guess the same treatment
could be applied there.

Tested (natively) on x86_64-none-linux-gnu and aarch64-none-linux-gnu.

Is this OK for trunk?

gcc/testsuite/ChangeLog:

* gcc.dg/vect/fast-math-bb-slp-call-3.c: Declare functions as 'extern'
rather than #including math.h & stdlib.h.
* gcc.dg/vect/pr47001.c: Declare abort as 'extern', remove stdlib.h.
* gcc.dg/vect/pr49771.c: Likewise.
* gcc.dg/vect/vect-10-big-array.c: Likewise.
* gcc.dg/vect/vect-neg-store-1.c: Likewise.
* gcc.dg/vect/vect-neg-store-2.c: Likewise.
* gcc.dg/vect/slp-37.c: Change NULL to 0, remove stdlib.h.
* gcc.dg/vect/pr40254.c: Remove unnecessary include of stdlib.h.
* gcc.dg/vect/pr44507.c: Likewise.
* gcc.dg/vect/pr45902.c: Likewise.
* gcc.dg/vect/slp-widen-mult-half.c: Likewise.
* gcc.dg/vect/vect-117.c: Likewise.
* gcc.dg/vect/vect-99.c: Likewise.
* gcc.dg/vect/vect-aggressive-1.c: Likewise.
* gcc.dg/vect/vect-cond-1.c: Likewise.
* gcc.dg/vect/vect-cond-2.c: Likewise.
* gcc.dg/vect/vect-cond-3.c: Likewise.
* gcc.dg/vect/vect-cond-4.c: Likewise.
* gcc.dg/vect/vect-mask-load-1.c: Likewise.
* gcc.dg/vect/vect-mask-loadstore-1.c: Likewise.
* gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-1.c: Likewise.
* gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-2.c: Likewise.
* gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-3.c: Likewise.
* gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-4.c: Likewise.
* gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise.
* gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise.
* gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise.
* gcc.dg/vect/vect-widen-mult-half.c: Likewise.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c: Remove unnecessary
include of signal.h.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c: Likewise.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c: Likewise.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c: Likewise.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c: Likewise.
* gcc.dg/vect/no-trapping-math-vect-ifcvt-16.c: Likewise.
* gcc.dg/vect/vect-ifcvt-16.c: Likewise.
* gcc.dg/vect/vect-ifcvt-17.c: Likewise.
* gcc.dg/vect/vect-ifcvt-2.c: Likewise.
* gcc.dg/vect/vect-ifcvt-3.c: Likewise.
* gcc.dg/vect/vect-ifcvt-4.c: Likewise.
* gcc.dg/vect/vect-ifcvt-5.c: Likewise.
* gcc.dg/vect/vect-ifcvt-6.c: Likewise.
* gcc.dg/vect/vect-ifcvt-7.c: Likewise.
* gcc.dg/vect/vect-ifcvt-9.c: Likewise.
* gcc.dg/vect/vect-outer-5.c: Likewise.
* gcc.dg/vect/vect-outer-6.c: Likewise.
* gcc.dg/vect/vect-strided-u8-i8-gap4-unknown.c: Remove unnecessary
include of stdio.h.
---
 gcc/testsuite/gcc.dg/vect/fast-math-bb-slp-call-3.c | 8 ++--
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c  | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c  | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c  | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c  | 1 -
 gcc/testsuite/gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c  | 1 -
 gcc/testsuite/gcc.dg/vect/no-vfa-vect-dv-2.c| 1 -
 gcc/testsuite/gcc.dg/vect/pr40254.c | 1 -
 gcc/testsuite/gcc.dg/vect/pr44507.c | 1 -
 gcc/testsuite/gcc.dg/vect/pr45902.c | 1 -
 gcc/testsuite/gcc.dg/vect/pr47001.c | 2 +-
 gcc/testsuite/gcc.dg/vect/pr49771.c | 3 ++-
 gcc/testsuite/gcc.dg/vect/slp-37.c  | 5 ++---
 gcc/testsuite/gcc.dg/vect/slp-widen-mult-half.c | 1 -
 gcc/testsuite/gcc.dg/vect/vect-10-big-array.c   | 3 ++-
 gcc/testsuite/gcc.dg/vect/vect-117.c| 1 -
 gcc/testsuite/gcc.dg/vect/vect-99.c | 1 -
 gcc/testsuite/gcc.dg/vect/vect-aggressive-1.c   | 1 -
 gcc/testsuite/gcc.dg/vect/vect-cond-1.c | 1 -
 gcc/testsuite/gcc.dg/vect/vect-cond-2.c   

[PATCH] Fix PR68707

2016-01-08 Thread Alan Lawrence
Here's an alternative patch, using the hunk from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707#c16, which 'improves' (heh,
we think) on the previous by allowing SLP to proceed where the loads are
strided, for example, slp-perm-11.c. Also I fix the testsuite failures (looking
for SLP where we've now decided to do load/store-lanes instead). I used simple
scan-tree-dump (rather than -times) because load_lanes typically appears many
times in the log for each instance output, and it felt fragile to look for a
specific number.

Note, this doesn't fix PR67323 - that's an example where unpermuted SLP is
still (a bit) worse than load-lanes, as it needs unrolling * 2. The
previously-posted patch doesn't fix this either, however
(https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01620.html).

Bootstrapped + check-gcc + check-g++ on x86_64 (no changes), arm and aarch64
(various new scan-tree-dump tests all passing, and PASS->FAIL for O3-pr36098.c).

gcc/ChangeLog:

2016-01-XX  Alan Lawrence  <alan.lawre...@arm.com>
Richard Biener  <rguent...@suse.de>

* tree-vect-slp.c (vect_analyze_slp_instance): Cancel permuted SLP
instances that can be handled via vect_load_lanes.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp (check_effective_target_vect_load_lanes): New.
* gcc.dg/vect/slp-perm-1.c: Look for vect_load_lanes instead of SLP
on platforms supporting it.
* gcc.dg/vect/slp-perm-2.c: Likewise.
* gcc.dg/vect/slp-perm-3.c: Likewise.
* gcc.dg/vect/slp-perm-5.c: Likewise.
* gcc.dg/vect/slp-perm-7.c: Likewise.
* gcc.dg/vect/slp-perm-8.c: Likewise.
* gcc.dg/vect/slp-perm-6.c: Look for vect_load_lanes in addition to SLP
on platforms supporting it.
---
 gcc/testsuite/gcc.dg/vect/slp-perm-1.c |  6 +-
 gcc/testsuite/gcc.dg/vect/slp-perm-2.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-3.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-5.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-6.c |  7 +--
 gcc/testsuite/gcc.dg/vect/slp-perm-7.c |  8 +---
 gcc/testsuite/gcc.dg/vect/slp-perm-8.c |  7 +--
 gcc/testsuite/lib/target-supports.exp  | 19 +++
 gcc/tree-vect-slp.c| 30 ++
 9 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-1.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
index 2b6c134..6d0d66f 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-1.c
@@ -58,5 +58,9 @@ int main (int argc, const char* argv[])
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target 
vect_perm } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { vect_perm && {! vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { 
target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "note: Built SLP cancelled: can use 
load/store-lanes" { target { vect_perm && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } 
} } */
+/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } 
} } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-2.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
index da50e16..34b413d 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-2.c
@@ -54,5 +54,8 @@ int main (int argc, const char* argv[])
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target 
vect_perm } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm } } } */
-
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target { vect_perm && {! vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { 
target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump "note: Built SLP cancelled: can use 
load/store-lanes" { target { vect_perm && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "LOAD_LANES" "vect" { target vect_load_lanes } 
} } */
+/* { dg-final { scan-tree-dump "STORE_LANES" "vect" { target vect_load_lanes } 
} } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-3.c 
b/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
index 1d33f9b..ec13e0f 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-3.c
@@ -67,6 +67,9 @@ int main (int argc, const char

Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

2016-01-08 Thread Alan Lawrence

On Tues, Oct 27, 2015 at 2:39 PM, Richard Biener  
wrote:

On Mon, Oct 26, 2015 at 6:59 AM, sameera  wrote:


Richard, we have defined the input language for convenience in the prototype
implementation. However, we will be using GIMPLE as our IR. As per the grammar
of our tree, a p-tree denotes the permute order associated with the statement,
whereas a c-tree is the GIMPLE instruction which performs the compute
operation. I tried looking at the structures used in SLP; however, they cannot
be used as they are, as the main difference between the current SLP
implementation in GCC and our representation is that the permute order in SLP
is part of the tree node in current GCC, whereas in our representation the
permute order is represented as an independent tree node. Hence, I have
created a new tree structure for our pass, which will create p-tree nodes for
the permute order, and c-tree nodes which point to the appropriate gimple
statement.


Yes, that's the whole purpose - get the vectorizer (and SLP) a better
data structure
which is explicit about permutes.



I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op
compensation
code to make the operations match).  Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by
1.



The pass structure we are having for our optimization is as follows:
- New pass target-aware-loop-vect with following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of target-aware reordering-instruction selection
algorithm)
  - Loop transformation (Cost aware gimple code generation)
We might need some optimizations to be performed on loop body before we
start our optimization for reordering instruction selection, so that input
program can be transformed to fit into our restricted loop structure. These
transformations can be performed by restricted loop construction subpass.
Eg.: multi-dimentional array, nested loops, reduction chains etc. can be
supported by transforming input loop. The case that you have mentioned, can
be transformed at this level.



As of implementing the whole thing in GCC I recently spent some more time
thinking and came to the conclusion that the path to the fastest
improvements
in GCC code-gen would be to build the new AST after the current analysis
phase finished, do the optimization and drive code-generation off the new
AST.
Thus keep things like data-ref and dependence analysis as is, as well as
group analysis and SLP tree construction.  Build the AST from the
vectorizer
"IL" at the point vect_transform_loop / vect_slp_transform_bb is called.



Current pass structure that we have designed, does exactly that. The only
difference is that, we are planning to use the transform phase as a
co-process of our transform pass, than reconstruct an AST from our
representation to pass to vect_transform_loop.



More and more of the rest of the vectorizer code can then be "slowly"
moved
to work on the AST and AST construction be moved earlier.


So I think an incremental approach is much more likely to get us some benefit, 
and I'm wondering how much of the benefit here stems from which bit, and how 
many of these changes can be broken off.


For example, making permutes into explicit operations on the SLP tree, and 
adding a pass that tries to push them up/down the tree into each other, might be 
done on the existing codebase, and would be a significant win in itself 
(bringing some/much(?) of improvements in interleaving code of your proposal).


I'm not really sure how much of the benefit stems from what, though - e.g. does 
reducing vector length down to fit the target machine only after the other 
steps bring any benefits besides raising abstraction? (Which certainly can be a 
benefit!). AFAICT this could be done independently of other changes?


Another limitation of the present SLP structure is that it does not understand 
sharing (until later CSE, hence costing wrongly); it would be good to fix this 
(making the SLP tree into a DAG, and potentially reusing subtrees by adding 
permutes). This is a step we'd like to take anyway, but it might tie in with your 
"Bottom-up commoning of p-node subtrees", and it would help with (BB as well as 
loop) cost calculations.


Are there any really fundamental differences that we cannot overcome in steps? 
(Or: are there any steps we shouldn't take in isolation, because we'd have to 
'throw them away' later?)


Alan


[PATCH][RS6000] Migrate reduction optabs in paired.md

2016-01-07 Thread Alan Lawrence
There are only a couple of uses of the old reduction optabs remaining
(the optabs producing a vector with only one element set). This migrates the
uses in gcc/config/rs6000/paired.md.

In the absence of a vec_extract pattern, I generate two subreg moves, the same
as usually produced by the midend when using the old pattern.

I don't have hardware to properly test this, but using a stage 1 compiler, I
have compiled all the tests in gcc.dg/vect, at -O2 -ftree-vectorize -mpaired -S
-mno-altivec -ffast-math -ffinite-math-only, on both
powerpc-none-linux-gnupaired and ppcel-none-linux-gnupaired.

The patterns were triggered on
fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, no-fast-math-vect16.c,
vect-reduc-6.c.

no-fast-math-vect16.c exhibited some regalloc differences (regs 0 and 12 are
swapped):

addi 9,1,136
+   lfs 12,136(29)
psq_stx 0,0,9,0,0
-   lfs 0,136(29)
-   lfs 12,140(1)
-   fcmpu 7,12,0
+   lfs 0,140(1)
+   fcmpu 7,0,12

no other assembly was changed.

Is this OK for trunk? (stage 3?)

Cheers, Alan

gcc/ChangeLog:

* config/rs6000/paired.md (reduc_smax_v2sf): Rename to...
(reduc_smax_scal_v2sf): ...here, make result SFmode, extract element.
(reduc_smin_v2sf): Rename to...
(reduc_smin_scal_v2sf): ...here, make result SFmode, extract element.
(reduc_splus_v2sf): Rename to...
(reduc_plus_scal_v2sf): ...here, make result SFmode, extract element.
---
 gcc/config/rs6000/paired.md | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/paired.md b/gcc/config/rs6000/paired.md
index 5d094fb..c3f4d66 100644
--- a/gcc/config/rs6000/paired.md
+++ b/gcc/config/rs6000/paired.md
@@ -421,45 +421,62 @@
   DONE;
 })
 
-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "gpc_reg_operand" "=f")
(match_operand:V2SF 1 "gpc_reg_operand" "f")]
   "TARGET_PAIRED_FLOAT"
 {
   rtx tmp_swap = gen_reg_rtx (V2SFmode);
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
 
   emit_insn (gen_paired_merge10 (tmp_swap, operands[1], operands[1]));
   emit_insn (gen_subv2sf3 (tmp, operands[1], tmp_swap));
-  emit_insn (gen_selv2sf4 (operands[0], tmp, operands[1], tmp_swap, CONST0_RTX 
(SFmode)));
+  emit_insn (gen_selv2sf4 (vec_res, tmp, operands[1], tmp_swap,
+  CONST0_RTX (SFmode)));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+   BYTES_BIG_ENDIAN ? 4 : 0));
 
   DONE;
 })
 
-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "gpc_reg_operand" "=f")
(match_operand:V2SF 1 "gpc_reg_operand" "f")]
   "TARGET_PAIRED_FLOAT"
 {
   rtx tmp_swap = gen_reg_rtx (V2SFmode);
   rtx tmp = gen_reg_rtx (V2SFmode);
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
 
   emit_insn (gen_paired_merge10 (tmp_swap, operands[1], operands[1]));
   emit_insn (gen_subv2sf3 (tmp, operands[1], tmp_swap));
-  emit_insn (gen_selv2sf4 (operands[0], tmp, tmp_swap, operands[1], CONST0_RTX 
(SFmode)));
+  emit_insn (gen_selv2sf4 (vec_res, tmp, tmp_swap, operands[1],
+  CONST0_RTX (SFmode)));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+   BYTES_BIG_ENDIAN ? 4 : 0));
 
   DONE;
 })
 
-(define_expand "reduc_splus_v2sf"
-  [(set (match_operand:V2SF 0 "gpc_reg_operand" "=f")
+(define_expand "reduc_plus_scal_v2sf"
+  [(set (match_operand:SF 0 "gpc_reg_operand" "=f")
 (match_operand:V2SF 1 "gpc_reg_operand" "f"))]
   "TARGET_PAIRED_FLOAT"
-  "
 {
-  emit_insn (gen_paired_sum1 (operands[0], operands[1], operands[1], 
operands[1]));
+  rtx vec_res = gen_reg_rtx (V2SFmode);
+  rtx di_res = gen_reg_rtx (DImode);
+
+  emit_insn (gen_paired_sum1 (vec_res, operands[1], operands[1], operands[1]));
+  emit_move_insn (di_res, simplify_gen_subreg (DImode, vec_res, V2SFmode, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (SFmode, di_res, DImode,
+   BYTES_BIG_ENDIAN ? 4 : 0));
   DONE;
-}")
+})
 
 (define_expand "movmisalignv2sf"
   [(set (match_operand:V2SF 0 "nonimmediate_operand" "")
-- 
1.9.1



[PATCH][MIPS] Migrate reduction optabs in mips-ps-3d.md

2016-01-07 Thread Alan Lawrence
Here's an updated version, also covering the min/max patterns I missed before.
I've now managed to do some testing with a stage 1 compiler, by compiling all
tests in gcc.dg/vect at -O2 -ftree-vectorize -mips3d -march=mips64r2 -mabi=n32
$x -ffast-math -ffinite-math-only.

There were no changes in which files compiled (871 did, some did not due to
libraries etc. missing in my stage 1 compiler), and no changes in any assembly
output. The patterns were triggered on:

fast-math-vect-reduc-5.c, fast-math-vect-reduc-8.c, fast-math-vect-reduc-9.c,
no-fast-math-vect16.c, pr66142.c, vect-outer-fir-big-array.c, vect-outer-fir.c,
vect-outer-fir-lb-big-array.c, vect-outer-fir-lb.c, vect-reduc-10.c,
vect-reduc-6.c

I realize this is not completely representative of a 'proper' test run, in
which different files are compiled with different options, but it provides
reasonable confidence that I'm not changing any code generation.

OK for trunk? (stage 3?)

Cheers, Alan

gcc/ChangeLog:

* config/mips/mips-ps-3d.md (reduc_splus_v2sf): Remove.
(reduc_plus_scal_v2sf): New.
(reduc_smax_v2sf): Rename to...
(reduc_smax_scal_v2sf): ...here, make result SFmode, add vec_extract.
(reduc_smin_v2sf): Rename to...
(reduc_smin_scal_v2sf): ...here, make result SFmode, add vec_extract.
---
 gcc/config/mips/mips-ps-3d.md | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/gcc/config/mips/mips-ps-3d.md b/gcc/config/mips/mips-ps-3d.md
index 8bc7608..a93a2b9 100644
--- a/gcc/config/mips/mips-ps-3d.md
+++ b/gcc/config/mips/mips-ps-3d.md
@@ -371,13 +371,17 @@
   [(set_attr "type" "fadd")
(set_attr "mode" "SF")])
 
-(define_insn "reduc_splus_v2sf"
-  [(set (match_operand:V2SF 0 "register_operand" "=f")
-   (unspec:V2SF [(match_operand:V2SF 1 "register_operand" "f")
- (match_dup 1)]
-UNSPEC_ADDR_PS))]
+(define_expand "reduc_plus_scal_v2sf"
+  [(match_operand:SF 0 "register_operand" "=f")
+   (match_operand:V2SF 1 "register_operand" "f")]
   "TARGET_HARD_FLOAT && TARGET_MIPS3D"
-  "")
+  {
+rtx temp = gen_reg_rtx (V2SFmode);
+emit_insn (gen_mips_addr_ps (temp, operands[1], operands[1]));
+rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
+DONE;
+  })
 
 ; cvt.pw.ps - Floating Point Convert Paired Single to Paired Word
 (define_insn "mips_cvt_pw_ps"
@@ -745,20 +749,26 @@
   DONE;
 })
 
-(define_expand "reduc_smin_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smin_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
(match_operand:V2SF 1 "register_operand")]
   "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
 {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_sminv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_sminv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
   DONE;
 })
 
-(define_expand "reduc_smax_v2sf"
-  [(match_operand:V2SF 0 "register_operand")
+(define_expand "reduc_smax_scal_v2sf"
+  [(match_operand:SF 0 "register_operand")
(match_operand:V2SF 1 "register_operand")]
   "TARGET_HARD_FLOAT && TARGET_PAIRED_SINGLE_FLOAT"
 {
-  mips_expand_vec_reduc (operands[0], operands[1], gen_smaxv2sf3);
+  rtx temp = gen_reg_rtx (V2SFmode);
+  mips_expand_vec_reduc (temp, operands[1], gen_smaxv2sf3);
+  rtx lane = BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx;
+  emit_insn (gen_vec_extractv2sf (operands[0], temp, lane));
   DONE;
 })
-- 
1.9.1



Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c

2016-01-05 Thread Alan Lawrence

On 05/01/16 07:29, Richard Biener wrote:

On January 4, 2016 8:08:17 PM GMT+01:00, Jeff Law <l...@redhat.com> wrote:

On 12/21/2015 06:13 AM, Alan Lawrence wrote:

This is a respin of patches
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03266.html and
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03267.html, which were
"too quickly" approved before concerns with efficiency were pointed out.

I tried to change the hashing just in tree-ssa-dom.c using C++ subclassing, but
couldn't cleanly separate this out from tree-ssa-scopedtables and
tree-ssa-threadedge.c due to use of avail_exprs_stack. So I figured it was
probably appropriate to use the equivalences in jump threading too. Also,
using get_ref_base_and_extent unifies handling of MEM_REFs and ARRAY_REFs


Without looking at the patch, ARRAY_REFs can have non-constant indices which 
get_ref_base_and_extent handles conservatively.  You should make sure to not 
regress here.


Thanks for the warning - my understanding is that in such a case, 
get_ref_base_and_extent returns max_size = (size of the whole array) and size = (size 
of one element), and I only handle cases where size == max_size. Arrays of unknown 
length have a max_size of -1, so the two will never be equal.
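
To make that concrete, here is a hedged example of my own (not taken from the 
thread): with a variable index the two sizes differ, so such an access simply 
falls back to the ordinary hashing/equality path.

    /* Illustrative only.  For a[i] with non-constant i, get_ref_base_and_extent
       reports size = 32 bits (one int) but max_size = 256 bits (the whole
       int[8]); size != max_size, so the MEM_REF/ARRAY_REF normalization is
       skipped and the expression is hashed the usual way.  */
    int a[8];
    int g (int i) { return a[i]; }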


Thanks, Alan


RS6000/PowerPC -mpaired

2016-01-05 Thread Alan Lawrence
Can anyone give me any pointers on how to use the -mpaired option ("PowerPC 
750CL paired-single instructions") ? I'm trying to get it to use the 
reduc_s{min,max,plus}_v2sf patterns in gcc/config/rs6000/paired.md, so far I 
haven't managed to find a configuration that supports loading a vector of floats 
but that doesn't also have superior altivec/v4sf support...


Thanks, Alan


Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2016-01-04 Thread Alan Lawrence

On 24/12/15 19:59, Mike Stump wrote:

On Dec 22, 2015, at 8:00 AM, Alan Lawrence <alan.lawre...@foss.arm.com> wrote:

On 21/12/15 15:33, Bill Schmidt wrote:


Not on a stage1 compiler - check_p8vector_hw_available itself requires being
able to run executables - I'll check on gcc112. However, both look like they're
really about the host (ability to execute an asm instruction), not the target
(/ability for gcc to output such an instruction)


Hm, that looks like a pervasive problem for powerpc.  There are a number
of things that are supposed to be testing effective target but rely on
check_p8vector_hw_available, which as you note requires executing an
instruction and is really about the host.  We need to clean that up; I
should probably open a bug.  Kind of amazed this has gotten past us for
a couple of years.


Well, I was about to apologize for making a bogus remark. A really "proper" 
setup would be to tell dejagnu to run your execution tests in some kind of 
emulator/simulator (on your host, perhaps one kind of powerpc) that only/additionally 
runs instructions for the other, _target_, kind of powerpc...and whatever setup you'd 
need for all that probably does not live in the GCC repository!


I’m not following.  dejagnu can already run tests on the target to makes 
decisions on which tests to run and what to expect from them, if it wants.  
Some ports already do this.  Further, this is pretty typical and standard and 
easy to do

You confuse the issue by mentioning host, but this I think is wrong.  These 
decisions have nothing to do with the host.  The are properties of the target 
execution environment.

I’d be happy to help if you’d like.  I’d just need the details of what you’d 
like help with.


You're right, which is why I described my first (wrong) remark as bogus. That 
is, check_p8vector_hw_available is executing an assembly instruction, and on a 
well-configured test setup, that would potentially invoke an emulator etc. - 
whereas I am just doing 'native' testing on gcc110/gcc112 on the compile farm.


So (as Mike says) there is no bug here, but one just needs to be aware that 
passing -mcpu=power7 (say) is not sufficient to make check_p8vector_hw_available 
return false when executing on a power8 host; you would also need to set up some 
kind of power7 emulator/simulator.


Hope that's clear!

Thanks,
Alan


Re: [PATCH 1/4] Make SRA scalarize constant-pool loads

2016-01-04 Thread Alan Lawrence

On 24/12/15 11:53, Alan Lawrence wrote:

Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on
aarch64 and powerpc64.

[snip]

This also fixes a bunch of other guality tests on AArch64 that were failing
prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), 
listed below.


Ach, sorry, not quite. That version avoids any regressions (e.g. in pr54970.c), 
but does not fix all those other tests, unless you also have this hunk 
(https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01483.html):


diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index a3ff2df..2a741b8 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2651,7 +2651,8 @@ analyze_all_variable_accesses (void)
&& scalarizable_type_p (TREE_TYPE (var)))
  {
if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
-   <= max_scalarization_size)
+ <= max_scalarization_size
+   || DECL_IN_CONSTANT_POOL (var))
  {
create_total_scalarization_access (var);
completely_scalarize (var, TREE_TYPE (var), 0, var);


...which I was using to increase test coverage of the SRA changes. 
(Alternatively, you can "fix" the tests by running the testsuite with a forced 
--param sra-max-scalarization-size. But this is only saying that the dwarf info 
now generated by scalarizing constant-pools is better than whatever dwarf was 
being generated by whatever other part of the compiler before.)


--Alan


Re: [PATCH][Testsuite]Cleanup logs from gdb tests by adding newlines

2016-01-04 Thread Alan Lawrence

Ping.

--Alan

On 10/12/15 10:31, Alan Lawrence wrote:

Runs of the guality testsuite can sometimes end up with gcc.log containing
malformed lines like:

A debugging session is active.PASS: gcc.dg/guality/pr36728-1.c   -O2  line 18 
arg4 == 4
A debugging session is active.PASS: gcc.dg/guality/restrict.c   -O2  line 30 
type:ip == int *
Inferior 1 [process 27054] will be killed.PASS: 
gcc.dg/guality/restrict.c   -O2  line 30 type:cicrp == const int * const 
restrict
Inferior 1 [process 27160] will be killed.PASS: 
gcc.dg/guality/restrict.c   -O2  line 30 type:cvirp == int * const volatile 
restrict

This patch just makes sure the PASS/FAIL comes at the beginning of a line.  (At
the slight cost of adding some extra newlines not in the actual test output.)

I moved the remote_close target calls earlier, to avoid any possible race
condition of extra output being generated after the newline - this may not be
strictly necessary.
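
A rough Tcl sketch of the shape of the change (illustrative only - the proc and
variable names around it are assumed here, the real hunk is in the patch below):

    # Close the target first, then force a newline into the log so that the
    # following PASS:/FAIL: line always starts in column 0 of gcc.log.
    remote_close target
    send_log "\n"
    pass "$testname"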

Tested on aarch64-none-linux-gnu and x86_64-none-linux-gnu.

I think this is reasonable for stage 3 - OK for trunk?

gcc/testsuite/ChangeLog:
* lib/gcc-gdb-test.exp (gdb-test): Call remote_close earlier, and send a
newline to the log, before calling pass/fail/unsupported.
* lib/gcc-simulate-thread.exp (simulate-thread): Likewise.
---
  gcc/testsuite/lib/gcc-gdb-test.exp| 15 ++-
  gcc/testsuite/lib/gcc-simulate-thread.exp | 10 +++---
  2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/lib/gcc-gdb-test.exp 
b/gcc/testsuite/lib/gcc-gdb-test.exp
index d3ba6e4..f60cabf 100644
--- a/gcc/testsuite/lib/gcc-gdb-test.exp
+++ b/gcc/testsuite/lib/gcc-gdb-test.exp
@@ -84,8 +84,9 @@ proc gdb-test { args } {
  remote_expect target [timeout_value] {
# Too old GDB
-re "Unhandled dwarf expression|Error in sourced command file|



Re: [PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c

2015-12-24 Thread Alan Lawrence
This version changes the test cases to fix failures on some platforms, by
rewriting the initializers so that they aren't pushed out to the constant pool.

gcc/ChangeLog:

* tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
using get_ref_base_and_extent.
(equal_mem_array_ref_p): New.
(hashable_expr_equal_p): Add call to previous.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-dom-cse-5.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-6.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-7.c: New.

---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c | 18 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c | 20 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c | 21 +
 gcc/tree-ssa-scopedtables.c   | 67 +--
 4 files changed, 123 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
new file mode 100644
index 000..cd38d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
@@ -0,0 +1,18 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-fre -fdump-tree-dom2" } */
+
+#define N 8
+
+int
+main (int argc, char **argv)
+{
+  int a[N];
+  for (int i = 0; i < N; i++)
+a[i] = 2*i + 1;
+  int *p = &a[0];
+  __builtin_printf ("%d\n", a[argc]);
+  return *(++p);
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
new file mode 100644
index 000..002fd81
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
@@ -0,0 +1,20 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-fre -fdump-tree-dom2" } */
+
+int
+main (int argc, char **argv)
+{
+  union {
+int a[4];
+int b[2];
+  } u;
+  u.a[0] = 1;
+  u.a[1] = 42;
+  u.a[2] = 3;
+  u.a[3] = 4;
+  __builtin_printf ("%d\n", u.a[argc]);
+  return u.b[1];
+}
+
+/* { dg-final { scan-tree-dump-times "return 42;" 1 "dom2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
new file mode 100644
index 000..151e5d4e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
@@ -0,0 +1,21 @@
+/* Test normalization of MEM_REF expressions in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+
+typedef struct {
+  int a[8];
+} foo;
+
+foo f;
+
+int
+test ()
+{
+  foo g;
+  g.a[0] = 1; g.a[1] = 2; g.a[2] = 3; g.a[3] = 4;
+  g.a[4] = 5; g.a[5] = 6; g.a[6] = 7; g.a[7] = 8;
+  f=g;
+  return f.a[2];
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "optimized" } } */
diff --git a/gcc/tree-ssa-scopedtables.c b/gcc/tree-ssa-scopedtables.c
index 6034f79..8df7125 100644
--- a/gcc/tree-ssa-scopedtables.c
+++ b/gcc/tree-ssa-scopedtables.c
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "fold-const.h"
 #include "tree-eh.h"
 #include "internal-fn.h"
+#include "tree-dfa.h"
 
 static bool hashable_expr_equal_p (const struct hashable_expr *,
   const struct hashable_expr *);
@@ -209,11 +210,70 @@ avail_expr_hash (class expr_hash_elt *p)
   const struct hashable_expr *expr = p->expr ();
   inchash::hash hstate;
 
+  if (expr->kind == EXPR_SINGLE)
+{
+  /* T could potentially be a switch index or a goto dest.  */
+  tree t = expr->ops.single.rhs;
+  if (TREE_CODE (t) == MEM_REF || TREE_CODE (t) == ARRAY_REF)
+   {
+ /* Make equivalent statements of both these kinds hash together.
+Dealing with both MEM_REF and ARRAY_REF allows us not to care
+about equivalence with other statements not considered here.  */
+ bool reverse;
+ HOST_WIDE_INT offset, size, max_size;
+ tree base = get_ref_base_and_extent (t, &offset, &size, &max_size,
+  &reverse);
+ /* Strictly, we could try to normalize variable-sized accesses too,
+   but here we just deal with the common case.  */
+ if (size == max_size)
+   {
+ enum tree_code code = MEM_REF;
+ hstate.add_object (code);
+ inchash::add_expr (base, hstate);
+ hstate.add_object (offset);
+ hstate.add_object (size);
+ return hstate.end ();
+   }
+   }
+}
+
   inchash::add_hashable_expr (expr, hstate);
 
   return hstate.end ();
 }
 
+/* Compares trees T0 and T1 to see if they are MEM_REF or ARRAY_REFs equivalent
+   to each other.  (That is, they return the 

Re: [PATCH 1/4] Make SRA scalarize constant-pool loads

2015-12-24 Thread Alan Lawrence
Here's a new version that fixes the gcc.dg/guality/pr54970.c failures seen on
aarch64 and powerpc64. Prior to SRA handling constant pool decls,
-fdump-tree-esra-details (at -O1 -g) had shown:
  :
  a = *.LC0;
  # DEBUG a$0 => MEM[(int[3] *)&*.LC0]
  a$4_3 = MEM[(int[3] *)&*.LC0 + 4B];
  # DEBUG a$4 => a$4_3
  a$8_4 = MEM[(int[3] *)&*.LC0 + 8B];

The previous patch changed this to:
  :
  SR.5_3 = *.LC0[0];
  SR.6_4 = *.LC0[1];
  SR.7_19 = *.LC0[2];
  SR.8_21 = *.LC1[0];
  SR.9_22 = *.LC1[1];
  SR.10_23 = *.LC1[2];
  # DEBUG a$0 => NULL   // Note here
  a$4_24 = SR.6_4;
  # DEBUG a$4 => a$4_24
  a$8_25 = SR.7_19;

Turns out the DEBUG a$0 => NULL was produced in load_assign_lhs_subreplacements:

  if (lacc && lacc->grp_to_be_debug_replaced)
{
  gdebug *ds;
  tree drhs;
  struct access *racc = find_access_in_subtree (sad->top_racc,
offset,
lacc->size);

  if (racc && racc->grp_to_be_replaced)
{
  if (racc->grp_write)
drhs = get_access_replacement (racc);
  else
drhs = NULL;  // <=== HERE
}
...
  ds = gimple_build_debug_bind (get_access_replacement (lacc),
drhs, gsi_stmt (sad->old_gsi));

Prior to the patch, we'd skipped around load_assign_lhs_subreplacements, because
access_has_children_p(racc) (for racc = *.LC0) didn't hold in sra_modify_assign.

I also added a constant_decl_p function, combining the two checks, plus some
testcase fixes.
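
For reference, a hedged sketch from memory of what that helper amounts to (the
patch itself is authoritative where they differ):

    /* True for artificial VAR_DECLs that GCC has placed in the constant
       pool, i.e. the "*.LC0"-style variables seen in the dumps above.  */
    static bool
    constant_decl_p (tree decl)
    {
      return TREE_CODE (decl) == VAR_DECL && DECL_IN_CONSTANT_POOL (decl);
    }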

This also fixes a bunch of other guality tests on AArch64 that were failing
prior to the patch series, and another bunch on PowerPC64 (bigendian -m32), 
listed below.

Bootstrapped + check-gcc,g++ on x86_64, ARM, AArch64,
  also on powerpc64{,le}-none-linux-gnu *in combination with the other patches
  in the series* (I haven't tested the individual patches on PPC),
  plus Ada on ARM and x86_64.

gcc/ChangeLog:

PR target/63679
* tree-sra.c (disqualified_constants, constant_decl_p): New.
(sra_initialize): Allocate disqualified_constants.
(sra_deinitialize): Free disqualified_constants.
(disqualify_candidate): Update disqualified_constants when appropriate.
(create_access): Scan for constant-pool entries as we go along.
(scalarizable_type_p): Add check against type_contains_placeholder_p.
(maybe_add_sra_candidate): Allow constant-pool entries.
(load_assign_lhs_subreplacements): Bind debug for constant pool vars.
(initialize_constant_pool_replacements): New.
(sra_modify_assign): Avoid mangling assignments created by previous,
and don't generate writes into constant pool.
(sra_modify_function_body): Call initialize_constant_pool_replacements.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/sra-17.c: New.
* gcc.dg/tree-ssa/sra-18.c: New.

Tests fixed on aarch64-none-linux-gnu:
gcc.dg/guality/pr54970.c   -O1  line 15 *p == 3
gcc.dg/guality/pr54970.c   -O1  line 15 *q == 2
gcc.dg/guality/pr54970.c   -O1  line 20 *p == 13
gcc.dg/guality/pr54970.c   -O1  line 20 *q == 2
gcc.dg/guality/pr54970.c   -O1  line 25 *p == 13
gcc.dg/guality/pr54970.c   -O1  line 25 *q == 12
gcc.dg/guality/pr54970.c   -O1  line 31 *p == 6
gcc.dg/guality/pr54970.c   -O1  line 31 *q == 5
gcc.dg/guality/pr54970.c   -O1  line 36 *p == 26
gcc.dg/guality/pr54970.c   -O1  line 36 *q == 5
gcc.dg/guality/pr54970.c   -O1  line 45 *p == 26
gcc.dg/guality/pr54970.c   -O1  line 45 *q == 25
gcc.dg/guality/pr54970.c   -O1  line 45 p[-1] == 25
gcc.dg/guality/pr54970.c   -O1  line 45 p[-2] == 4
gcc.dg/guality/pr54970.c   -O1  line 45 q[-1] == 4
gcc.dg/guality/pr54970.c   -O1  line 45 q[1] == 26
gcc.dg/guality/pr56154-1.c   -O1  line pr56154-1.c:20 x.a == 6
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s1.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s1.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s2.f == 0.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:17 s2.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s1.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s1.g == 6.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s2.f == 5.0
gcc.dg/guality/pr59776.c   -O1  line pr59776.c:20 s2.g == 6.0
gcc.dg/guality/sra-1.c   -O1  line 21 a.j == 14
gcc.dg/guality/sra-1.c   -O1  line 32 a[1] == 14
gcc.dg/guality/sra-1.c   -O1  line 43 a.i == 4
gcc.dg/guality/sra-1.c   -O1  line 43 a.j == 14
(and other optimization levels)
On ppc64 bigendian, with the rest of the series (but I expect it's this patch):
unix/-m32: gcc.dg/guality/pr54970.c   -O1  line 15 a[0] == 1
unix/-m32: gcc.dg/guality/pr54970.c   -O1  line 20 a[0] == 1
unix/-m32: gcc.dg/guality/pr54970.c   -O1  line 25 a[0] == 1
unix/-m32: gcc.dg/guality/pr54970.c   -O1  line 

Re: [AArch32][NEON] Implementing vmaxnmQ_ST and vminnmQ_ST intrinsincs.

2015-12-22 Thread Alan Lawrence

On 21/12/15 11:58, Bilyan Borisov wrote:

This patch implements the vmaxnmQ_ST and vminnmQ_ST intrinsincs. It also
implements the __ARM_FEATURE_NUMERIC_MAXMIN macro, which is defined when
__ARM_ARCH >= 8, and which enables the intrinsincs.

Tested on arm-none-eabi, armeb-none-eabi, arm-none-linux-gnueabihf.

---

gcc/

2015-XX-XX  Bilyan Borisov  

 * config/arm/arm-c.c (arm_cpu_builtins): New macro definition.
 * config/arm/arm_neon.h (vmaxnm_f32): New intrinsinc.
 (vmaxnmq_f32): Likewise.
 (vminnm_f32): Likewise.
 (vminnmq_f32): Likewise.
 * config/arm/arm_neon_builtins.def (vmaxnm): New builtin.
 (vminnm): Likewise.
 * config/arm/iterators.md (VMAXMINNM): New iterator.
 (maxmin): Updated iterator.
 * config/arm/neon.md (neon_v, VCVTF): New pattern.
 * config/arm/unspecs.md (UNSPEC_VMAXNM): New unspec.
 (UNSPEC_VMINNM): Likewise.

gcc/testsuite/

2015-XX-XX  Bilyan Borisov  

 * gcc.target/arm/simd/vmaxnm_f32_1.c: New.
 * gcc.target/arm/simd/vmaxnmq_f32_1.c: Likewise.
 * gcc.target/arm/simd/vminnm_f32_1.c: Likewise.
 * gcc.target/arm/simd/vminnmq_f32_1.c: Likewise.



I note strong similarities between this patch and David Sherwood's 
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01560.html


Both add the same UNSPEC_s, and equivalent VMAXMIN(F?)NM. David's adds  
and  attributes, whereas Bilyan reuses some elements of the existing 
. AFAICT, the patterns they add are in other ways equivalent (same type, 
condition, modes, alternatives), albeit in different files and constructed using 
those different iterators, and David's has the standard names (which IIUC we 
want, so the autovectorizer finds them) whereas Bilyan adds the intrinsics 
(which we also want)...


--Alan


Re: [PING][Patch 3/3][Arm] Add support for IEEE-conformant versions of scalar fmin* and fmax*

2015-12-22 Thread Alan Lawrence

On 16/12/15 09:31, Kyrill Tkachov wrote:

Hi David,

On 16/12/15 08:53, David Sherwood wrote:

Hi,

Here is the last patch of the fmin/fmax change, which adds the optabs
to the arm backend.

Tested:

arm-none-eabi: no regressions

Good to go?
David Sherwood.

ChangeLog:

2015-12-08  David Sherwood  

 gcc/
 * config/arm/iterators.md: New iterators.
 * config/arm/unspecs.md: New unspecs.
 * config/arm/neon.md: New pattern.
 * config/arm/vfp.md: Likewise.


Please list the new entities you add in parentheses.
For example:
 * config/arm/iterators.md (VMAXMINFNM): New int iterator.
 (fmaxmin): New int attribute.
 (fmaxmin): Likewise.

Same for the other files. That way one can grep through the ChangeLogs to
find when any particular pattern/iterator/whatever was modified.

+;; Auto-vectorized forms for the IEEE-754 fmax()/fmin() functions
+(define_insn "3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+(unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
+   (match_operand:VCVTF 2 "s_register_operand" "w")]
+   VMAXMINFNM))]
+  "TARGET_NEON && TARGET_FPU_ARMV8"
+  ".\t%0, %1, %2"
+  [(set_attr "type" "neon_fp_minmax_s")]
+)

I would just say "Vector forms" rather than "Auto-vectorized".
In principle we can get vector types through other means besides
auto-vectorisation.

+;; Scalar forms for the IEEE-754 fmax()/fmin() functions
+(define_insn "3"
+  [(set (match_operand:SDF 0 "s_register_operand" "=")
+(unspec:SDF [(match_operand:SDF 1 "s_register_operand" "")
+ (match_operand:SDF 2 "s_register_operand" "")]
+ VMAXMINFNM))]
+  "TARGET_HARD_FLOAT && TARGET_VFP5 "
+  ".\\t%0, %1, %2"
+  [(set_attr "type" "f_minmax")
+   (set_attr "conds" "unconditional")]
+)
+

I notice your new test doesn't test the SF variant of this.
Could you please add something to test it?
Looks good to me otherwise.

Thanks,
Kyrill




I note strong similarities between this patch and Bilyan Borisov's 
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01971.html


Both add the same UNSPEC_s, and equivalent VMAXMIN(F?)NM. David's adds  
and  attributes, whereas Bilyan reuses some elements of the existing 
. AFAICT, the patterns they add are in other ways equivalent (same type, 
condition, modes, alternatives), albeit in different files and constructed using 
those different iterators, and David's has the standard names (which IIUC we 
want, so the autovectorizer finds them) whereas Bilyan adds the intrinsics 
(which we also want)...


--Alan


Re: [PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified

2015-12-22 Thread Alan Lawrence

On 21/12/15 13:13, Alan Lawrence wrote:

This is a respin of previous patch series:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html
Minus three of the smaller patches having already been committed; with the
updated version of the main patch to SRA; and the patches to DOM reworked
to avoid constructing gimple stmt's.

IMHO this feels quite invasive for stage 3, however, I note the PR/63679
(ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still
FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts
as to whether we should take this patch set for gcc 6.




I've now tested the series on powerpc64-none-linux-gnu (gcc110) and 
powerpc64le-none-linux-gnu (gcc112).


On powerpc64le, this causes the same guality failures as on AArch64:
gcc.dg/guality/pr54970.c   -O1  line 15 a[0] == 1
gcc.dg/guality/pr54970.c   -O1  line 20 a[0] == 1
gcc.dg/guality/pr54970.c   -O1  line 25 a[0] == 1
and the same with other optimization flags. (I note that there are quite a lot 
of guality failures on both powerpc and also aarch64, which are generally not 
included in the posts on the gcc-testresults mailing list).


The same pr54970 tests still seem to pass on powerpc64 big-endian even with the 
patches.


On both powerpc64 and powerpc64le, the ssa-dom-cse-7.c test also fails, because 
the constant gets pushed to the constant pool :(, which was not the point of 
that test (I'd tried to construct it to test normalization of MEM_REFs in DOM 
prior to the SRA changes). So, adding --param 
sra-max-scalarization-size-Ospeed=32 fixes this once the SRA patch is in (!), or 
we could xfail on powerpc64*-*-*.


--Alan


Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2015-12-22 Thread Alan Lawrence

On 21/12/15 15:33, Bill Schmidt wrote:


Not on a stage1 compiler - check_p8vector_hw_available itself requires being
able to run executables - I'll check on gcc112. However, both look like they're
really about the host (ability to execute an asm instruction), not the target
(/ability for gcc to output such an instruction)


Hm, that looks like a pervasive problem for powerpc.  There are a number
of things that are supposed to be testing effective target but rely on
check_p8vector_hw_available, which as you note requires executing an
instruction and is really about the host.  We need to clean that up; I
should probably open a bug.  Kind of amazed this has gotten past us for
a couple of years.


Well, I was about to apologize for making a bogus remark. A really "proper" 
setup would be to tell dejagnu to run your execution tests in some kind of 
emulator/simulator (on your host, perhaps one kind of powerpc) that 
only/additionally runs instructions for the other, _target_, kind of 
powerpc...and whatever setup you'd need for all that probably does not live in 
the GCC repository!



For now, just XFAILing for powerpc seems the best alternative for this
test.


Ok, thanks.


--Alan



Re: [PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified

2015-12-22 Thread Alan Lawrence

On 22/12/15 16:05, Bill Schmidt wrote:

On Tue, 2015-12-22 at 15:54 +, Alan Lawrence wrote:

On 21/12/15 13:13, Alan Lawrence wrote:

This is a respin of previous patch series:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html
Minus three of the smaller patches having already been committed; with the
updated version of the main patch to SRA; and the patches to DOM reworked
to avoid constructing gimple stmt's.

IMHO this feels quite invasive for stage 3, however, I note the PR/63679
(ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still
FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts
as to whether we should take this patch set for gcc 6.




I've now tested the series on powerpc64-none-linux-gnu (gcc110) and
powerpc64le-none-linux-gnu (gcc112).

On powerpc64le, this causes the same guality failures as on AArch64:
gcc.dg/guality/pr54970.c   -O1  line 15 a[0] == 1
gcc.dg/guality/pr54970.c   -O1  line 20 a[0] == 1
gcc.dg/guality/pr54970.c   -O1  line 25 a[0] == 1
and the same with other optimization flags. (I note that there are quite a lot
of guality failures on both powerpc and also aarch64, which are generally not
included in the posts on the gcc-testresults mailing list).


That's interesting.  I never see these for powerpc64le on my internal
build system.  I wonder what the difference is?


I've ssh'd into gcc112.fsffrance.org, a power8 powerpc64le system;
and configure --with-cpu=power8 --disable-libsanitizer --with-long-double-128 
--enable-languages=c,c++,lto.


Hmmm...ISTR one variable that can make a big difference is the version (or 
absence!) of gdb...gcc112 has gdb "Fedora 7.8.2-38.fc21", copyright 2014, and 
GDB 7.8 looks to have been released at the end of 2014.



The same pr54970 tests still seem to pass on powerpc64 big-endian even with the
patches.


Ach, not quite. In fact those three are failing on powerpc64 bigendian even 
*without* the patches (at all optimization levels besides -O0 where they pass). 
This is on gcc110, configure --enable-languages=c,c++,lto --disable-libsanitizer 
--with-cpu=power7 --with-long-double-128. GDB is Fedora 7.7.1-21.fc20, a bit 
older, and I tested both unix/-m32 and unix/-m64 (following Bill Seurer's posts 
on gcc-testresults).


I'll email a list of the failures I'm seeing offlist (summary: 186 on gcc112,
148 on gcc110 with -m32, 395 on gcc112 with -m64); however, I suspect gdb 
version is the difference we are looking for.


Which *may* mean that with a more up-to-date GDB, it's possible those failures 
may not be introduced on ppc64le. (Or similarly AArch64.) Hmmm


--Alan


[PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2015-12-21 Thread Alan Lawrence
...the test passes with --param sra-max-scalarization-size-Ospeed.

Verified on aarch64 and with stage1 compiler for hppa, powerpc, sparc, s390.

On alpha, tree-optimized is:

  MEM[(int[8] *)&a] = { 0, 1 };
  MEM[(int[8] *)&a + 8B] = { 2, 3 };
  MEM[(int[8] *)&a + 16B] = { 4, 5 };
  MEM[(int[8] *)&a + 24B] = { 6, 7 };
  _23 = a[0];
  _29 = a[1];
  sum_30 = _23 + _29;
  _36 = a[2];
  sum_37 = sum_30 + _36;

Which is beyond the scope of these changes to DOM to optimize.

On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a
reduction); however, without that, similar code is generated to Alpha (the
vectorizer decides the reduction is not worthwhile without SIMD support), and
the test fails; hence, I've XFAILed for powerpc, but I think I could condition
the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred?

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-dom-cse-2.c: Remove XFAIL for powerpc(32), hppa,
aarch64, sparc, s390. Add --param sra-max-scalarization-size-Ospeed.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
index 9eccdc9..748448e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized --param 
sra-max-scalarization-size-Ospeed=32" } */
 
 int
 foo ()
@@ -17,7 +17,8 @@ foo ()
 /* After late unrolling the above loop completely DOM should be
able to optimize this to return 28.  */
 
-/* See PR63679 and PR64159, if the target forces the initializer to memory then
-   DOM is not able to perform this optimization.  */
+/* On alpha, the vectorizer generates writes of two vector elements at once,
+   but the loop reads only one element at a time, and DOM cannot resolve these.
+   The same happens on powerpc depending on the SIMD support available.  */
 
-/* { dg-final { scan-tree-dump "return 28;" "optimized" { xfail aarch64*-*-* 
alpha*-*-* hppa*-*-* powerpc*-*-* sparc*-*-* s390*-*-* } } } */
+/* { dg-final { scan-tree-dump "return 28;" "optimized" { xfail alpha*-*-* 
powerpc64*-*-* } } } */
-- 
1.9.1



[PATCH 2/4] Equate MEM_REFs and ARRAY_REFs in tree-ssa-scopedtables.c

2015-12-21 Thread Alan Lawrence
This is a respin of patches
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03266.html and
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03267.html, which were
"too quickly" approved before concerns with efficiency were pointed out.

I tried to change the hashing just in tree-ssa-dom.c using C++ subclassing, but
couldn't cleanly separate this out from tree-ssa-scopedtables and
tree-ssa-threadedge.c due to use of avail_exprs_stack. So I figured it was
probably appropriate to use the equivalences in jump threading too. Also,
using get_ref_base_and_extent unifies handling of MEM_REFs and ARRAY_REFs
(hence only one patch rather than two).
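
As an illustration of the equivalence being aimed at (my own example, not taken
from the patch or its dumps): after normalization through get_ref_base_and_extent,
an ARRAY_REF and a MEM_REF that describe the same bytes should hash and compare
equal, e.g.

    /* Illustrative only.  Both accesses below are a 4-byte access at offset 4
       from 'a', one spelled as an ARRAY_REF (a[1]) and one, once the pointer
       arithmetic has been gimplified and propagated, as a MEM_REF
       (MEM[(int *)&a + 4B]); with the change they should land in the same
       hash bucket so DOM can CSE the load against the store.  */
    int
    f (void)
    {
      int a[2];
      a[1] = 42;
      int *p = &a[0];
      return *(p + 1);
    }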

I've added a couple of testcases that show the improvement in DOM, but in all
cases I had to disable FRE, even PRE, to get any improvement, apart from on
ssa-dom-cse-2.c itself (where on the affected platforms FRE still does not do
the optimization). This makes me wonder if this is the right approach or whether
changing the references output by SRA (as per
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01490.html , judged as a hack to
SRA to work around limitations in DOM - or is it?) would be better.

Also, this causes gcc to miss vectorization of pr65947-2.c, because the
additional equivalents spotted in dom1 lead to an extra jump-thread (partially
peeling the loop's first iter, as this has become entirely predictable);
loop-header-copying (ch_vect) then kicks in, restoring the loop structure but
leading to:

  i_49 = 1;
  ivtmp_46 = 253;
  if (ivtmp_46 != 0) goto ; else goto ;

  : <--- OUTSIDE LOOP

  : <--- LOOP HEADER
  # i_53 = PHI 

and scalar evolution is unable to analyze i_49 (defined outside the loop) even
though it's equal to a constant (analyze_scalar_evolution_1, test of
flow_bb_inside_loop_p).

This is fixed in the next patch...

gcc/ChangeLog:

* tree-ssa-scopedtables.c (avail_expr_hash): Hash MEM_REF and ARRAY_REF
using get_ref_base_and_extent.
(equal_mem_array_ref_p): New.
(hashable_expr_equal_p): Add call to previous.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-dom-cse-5.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-6.c: New.
* gcc.dg/tree-ssa/ssa-dom-cse-7.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c | 18 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c | 16 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c | 19 
 gcc/tree-ssa-scopedtables.c   | 67 +--
 4 files changed, 117 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
new file mode 100644
index 000..cd38d3e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-5.c
@@ -0,0 +1,18 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-fre -fdump-tree-dom2" } */
+
+#define N 8
+
+int
+main (int argc, char **argv)
+{
+  int a[N];
+  for (int i = 0; i < N; i++)
+a[i] = 2*i + 1;
+  int *p = &a[0];
+  __builtin_printf ("%d\n", a[argc]);
+  return *(++p);
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
new file mode 100644
index 000..b0c5138
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-6.c
@@ -0,0 +1,16 @@
+/* Test normalization of ARRAY_REF expressions to MEM_REFs in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fno-tree-fre -fdump-tree-dom1" } */
+
+int
+main (int argc, char **argv)
+{
+  union {
+int a[8];
+int b[2];
+  } u = { .a = { 1, 42, 3, 4, 5, 6, 7, 8 } };
+  __builtin_printf ("%d\n", u.a[argc]);
+  return u.b[1];
+}
+
+/* { dg-final { scan-assembler-times "return 42;" 1 "dom1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
new file mode 100644
index 000..1fc4c5e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-cse-7.c
@@ -0,0 +1,19 @@
+/* Test normalization of MEM_REF expressions in dom.  */
+/* { dg-do compile } */
+/* { dg-options "-O3 -fno-tree-fre -fno-tree-pre -fdump-tree-optimized" } */
+
+typedef struct {
+  int a[8];
+} foo;
+
+foo f;
+
+int
+test ()
+{
+  foo g = { { 1, 2, 3, 4, 5, 6, 7, 8 } };
+  f=g;
+  return f.a[2];
+}
+
+/* { dg-final { scan-tree-dump-times "return 3;" 1 "optimized" } } */
diff --git a/gcc/tree-ssa-scopedtables.c b/gcc/tree-ssa-scopedtables.c
index 6034f79..8df7125 100644
--- a/gcc/tree-ssa-scopedtables.c
+++ b/gcc/tree-ssa-scopedtables.c
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "fold-const.h"
 #include "tree-eh.h"
 #include "internal-fn.h"
+#include "tree-dfa.h"

[PATCH 3/4] Add a pass_copy_prop following pass_ch_vect

2015-12-21 Thread Alan Lawrence
This fixes the missed vectorization of gcc.dg/vect/pr65947-2.c following the
previous patch, by cleaning up:

  i_49 = 1;
  ...
goto ;

  :

  : <--- LOOP HEADER
  # i_53 = PHI 

into:

:
  # i_53 = PHI 

Allowing scalar evolution and vectorization to proceed.

A similar sequence of partial peeling, header-copying, and constant propagation
also leads to pr65947-9.c successfully vectorizing (it previously didn't as
the iteration count was too high) when inlined and so using known data.

I'm not sure whether adding a pass_copy_prop is the right thing here, but since
loop-header-copying can create such opportunities, it seems good to take them!

gcc/ChangeLog:

* passes.def: Add a pass_copy_prop following pass_ch_vect.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/pr65947-9.c: Add __attribute__ ((noinline)).

--Alan
---
 gcc/passes.def| 1 +
 gcc/testsuite/gcc.dg/vect/pr65947-9.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index 624d121..1043548 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -269,6 +269,7 @@ along with GCC; see the file COPYING3.  If not see
  NEXT_PASS (pass_expand_omp_ssa);
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_ch_vect);
+ NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_if_conversion);
  /* pass_vectorize must immediately follow pass_if_conversion.
 Please do not add any other passes in between.  */
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-9.c 
b/gcc/testsuite/gcc.dg/vect/pr65947-9.c
index d0da13f..c7aac8d 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-9.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-9.c
@@ -7,7 +7,7 @@ extern void abort (void) __attribute__ ((noreturn));
 /* Condition reduction with maximum possible loop size.  Will fail to
vectorize because the vectorisation requires a slot for default values.  */
 
-char
+__attribute__ ((noinline)) char
 condition_reduction (char *a, char min_v)
 {
   char last = -72;
-- 
1.9.1



[PATCH 1/4] Make SRA scalarize constant-pool loads

2015-12-21 Thread Alan Lawrence
This is the same as the version conditionally approved near the end of stage 1
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01948.html, plus the
requested comment and a couple of testcases for platforms where the constants
are pushed into the constant pool. (I didn't commit this at the time because
without the rest of the series, benchmarking results didn't really show any
improvement, and the PR was not fixed.)

I've left the disqualified_constants check in, although I could be persuaded to
remove it if people felt strongly.

Also on AArch64, this still causes FAILs in guality pr54970.c:
FAIL: gcc.dg/guality/pr54970.c   -Os  line 15 a[0] == 1
FAIL: gcc.dg/guality/pr54970.c   -Os  line 20 a[0] == 1
FAIL: gcc.dg/guality/pr54970.c   -Os  line 25 a[0] == 1
and I think I must ask the AArch64 maintainers to take these for the time being.

Bootstrapped + check-gcc,g++,ada on x86_64, ARM, AArch64.

gcc/ChangeLog:

PR target/63679
* tree-sra.c (disqualified_constants): New.
(sra_initialize): Allocate disqualified_constants.
(sra_deinitialize): Free disqualified_constants.
(disqualify_candidate): Update disqualified_constants when appropriate.
(create_access): Scan for constant-pool entries as we go along.
(scalarizable_type_p): Add check against type_contains_placeholder_p.
(maybe_add_sra_candidate): Allow constant-pool entries.
(initialize_constant_pool_replacements): New.
(sra_modify_expr): Avoid mangling assignments created by previous.
(sra_modify_function_body): Call initialize_constant_pool_replacements.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/sra-17.c: New.
* gcc.dg/tree-ssa/sra-18.c: New.
---
 gcc/testsuite/gcc.dg/tree-ssa/sra-17.c | 19 +++
 gcc/testsuite/gcc.dg/tree-ssa/sra-18.c | 28 ++
 gcc/tree-sra.c | 99 --
 3 files changed, 143 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-17.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/sra-18.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c
new file mode 100644
index 000..cff868a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-17.c
@@ -0,0 +1,19 @@
+/* { dg-do run { target { aarch64*-*-* alpha*-*-* arm*-*-* hppa*-*-* 
powerpc*-*-* s390*-*-* } } } */
+/* { dg-options "-O2 -fdump-tree-esra --param 
sra-max-scalarization-size-Ospeed=16" } */
+
+extern void abort (void);
+
+int
+main (int argc, char **argv)
+{
+  int a[4] = { 7, 19, 11, 255 };
+  int tot = 0;
+  for (int i = 0; i < 4; i++)
+tot = (tot*256) + a[i];
+  if (tot == 0x07130bff)
+return 0;
+  abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "Removing load: a = \\\*.LC0;" 1 "esra" } 
} */
+/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = \\\*.LC0\\\[" 4 "esra" } 
} */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c 
b/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c
new file mode 100644
index 000..6c0d52b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/sra-18.c
@@ -0,0 +1,28 @@
+/* { dg-do run { target { aarch64*-*-* alpha*-*-* arm*-*-* hppa*-*-* 
powerpc*-*-* s390*-*-* } } } */
+/* { dg-options "-O2 -fdump-tree-esra --param 
sra-max-scalarization-size-Ospeed=16" } */
+
+extern void abort (void);
+struct foo { int x; };
+
+struct bar { struct foo f[2]; };
+
+struct baz { struct bar b[2]; };
+
+int
+main (int argc, char **argv)
+{
+  struct baz a = { { { { { 4 }, { 5 } } }, { { { 6 }, { 7 } } }  } };
+  int tot = 0;
+  for (int i = 0; i < 2; i++)
+for (int j = 0; j < 2; j++)
+  tot = (tot*256) + a.b[i].f[j].x;
+  if (tot == 0x04050607)
+return 0;
+  abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "Removing load: a = \\\*.LC0;" 1 "esra" } 
} */
+/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = 
\\\*.LC0\\.b\\\[0\\\]\\.f\\\[0\\\]\\.x" 1 "esra" } } */
+/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = 
\\\*.LC0\\.b\\\[0\\\]\\.f\\\[1\\\]\\.x" 1 "esra" } } */
+/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = 
\\\*.LC0\\.b\\\[1\\\]\\.f\\\[0\\\]\\.x" 1 "esra" } } */
+/* { dg-final { scan-tree-dump-times "SR.\[0-9_\]+ = 
\\\*.LC0\\.b\\\[1\\\]\\.f\\\[1\\\]\\.x" 1 "esra" } } */
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index c4fea5b..4ea8550 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -328,6 +328,10 @@ candidate (unsigned uid)
those which cannot be (because they are and need be used as a whole).  */
 static bitmap should_scalarize_away_bitmap, cannot_scalarize_away_bitmap;
 
+/* Bitmap of candidates in the constant pool, which cannot be scalarized
+   because this would produce non-constant expressions (e.g. Ada).  */
+static bitmap disqualified_constants;
+
 /* Obstack for creation of fancy names.  */
 static struct obstack name_obstack;
 
@@ -652,6 +656,7 @@ sra_initialize (void)
 (vec_safe_length (cfun->local_decls) / 2);
   should_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
   

[PATCH 0/4 v3] Fix PR/63679 when --param sra-max-scalarization-size specified

2015-12-21 Thread Alan Lawrence
This is a respin of previous patch series:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03271.html
Minus three of the smaller patches having already been committed; with the
updated version of the main patch to SRA; and the patches to DOM reworked
to avoid constructing gimple stmt's.

IMHO this feels quite invasive for stage 3, however, I note the PR/63679
(ssa-dom-cse-2.c being XFAILed on a bunch of platforms, and currently still
FAILing on ARM) is a regression from gcc 4.9.1. So I'd ask maintainer's thoughts
as to whether we should take this patch set for gcc 6.



Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2015-12-21 Thread Alan Lawrence

On 21/12/15 14:59, Bill Schmidt wrote:


On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a
reduction); however, without that, similar code is generated to Alpha (the
vectorizer decides the reduction is not worthwhile without SIMD support), and
the test fails; hence, I've XFAILed for powerpc, but I think I could condition
the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred?


Fun.

Does it work with -mcpu=power7?


Yes, it works with -mcpu=power7, as well as -mcpu=power8, but not e.g. 
-mcpu=power6.


Bill: What GCC DejaGNU incantation would you like to see?


This sounds like more fallout from unaligned accesses being faster on
POWER8 than previous hardware.  What about conditioning the XFAIL on

{ powerpc*-*-* && { ! vect_hw_misalign } }

-- does this work properly?


Not on a stage1 compiler - check_p8vector_hw_available itself requires being 
able to run executables - I'll check on gcc112. However, both look like they're 
really about the host (ability to execute an asm instruction), not the target 
(/ability for gcc to output such an instruction)


--Alan


Re: [PATCH 4/4] Un-XFAIL ssa-dom-cse-2.c for most platforms

2015-12-21 Thread Alan Lawrence

On 21/12/15 14:59, Bill Schmidt wrote:


On powerpc64, the test passes with -mcpu=power8 (the loop is vectorized as a
reduction); however, without that, similar code is generated to Alpha (the
vectorizer decides the reduction is not worthwhile without SIMD support), and
the test fails; hence, I've XFAILed for powerpc, but I think I could condition
the XFAIL on powerpc64 && !check_p8vector_hw_available, if preferred?


Fun.

Does it work with -mcpu=power7?

Bill: What GCC DejaGNU incantation would you like to see?


This sounds like more fallout from unaligned accesses being faster on
POWER8 than previous hardware.  What about conditioning the XFAIL on

{ powerpc*-*-* && { ! vect_hw_misalign } }

-- does this work properly?  Right now that's just an alternative way of
saying what you suggested


It works properly in that it is equivalent to what I suggested. However, the 
test passes on power7 as well, so my suggestion of check_p8vector_hw_available 
is itself wrong...


--Alan


Re: [PATCH] Fix PR68707, 67323

2015-12-17 Thread Alan Lawrence

On 16/12/15 15:01, Richard Biener wrote:


The following patch adds a heuristic to prefer store/load-lanes
over SLP when vectorizing.  Compared to the variant attached to
the PR I made the STMT_VINFO_STRIDED_P behavior explicit (matching
what you've tested).


Not sure I follow this. Compared to the variant attached to the PR - we will now 
attempt to use load-lanes, if (say) all of the loads are strided, even if we 
know we don't support load-lanes (for any of them). That sounds the wrong way 
around and I think rather different to what you proposed earlier? (At the least, 
the debug message "can use load/store lanes" is potentially misleading, that's 
not necessarily the case!)


There are arguments that we want to do less SLP, generally, on ARM/AArch64 but I 
think Wilco's permute cost patch 
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01469.html is a better way of 
achieving that?


Just my gut feeling at this point - I haven't evaluated this version of the 
patch on any benchmarks etc...


Thanks, Alan



Re: [PATCH] Fix PR68707, 67323

2015-12-17 Thread Alan Lawrence

On 17/12/15 10:46, Richard Biener wrote:

On Thu, 17 Dec 2015, Alan Lawrence wrote:


On 16/12/15 15:01, Richard Biener wrote:


The following patch adds a heuristic to prefer store/load-lanes
over SLP when vectorizing.  Compared to the variant attached to
the PR I made the STMT_VINFO_STRIDED_P behavior explicit (matching
what you've tested).


Not sure I follow this. Compared to the variant attached to the PR - we will
now attempt to use load-lanes, if (say) all of the loads are strided, even if
we know we don't support load-lanes (for any of them). That sounds the wrong
way around and I think rather different to what you proposed earlier? (At the
least, the debug message "can use load/store lanes" is potentially misleading,
that's not necessarily the case!)


Ah, indeed.  Note that the whole thing is still guarded by the check
that we can use store-lanes for the store.

I can also do it the other way around (as previously proposed) which
would change outcome for slp-perm-11.c.  That proposal would not reject
the SLP if there were any strided grouped loads involved.


Indeed; the STMT_VINFO_STRIDED_P || !vect_load_lanes_supported approach (as on 
PR68707) vectorizes slp-perm-11.c with SLP, which works much better than the 
!STMT_VINFO_STRIDED_P && !vect_load_lanes_supported, which tries to use st2 (and 
only sort-of works - you get an st2 output, but no ld2, and lots of faff).


I think I move for the patch from PR68707, therefore. (Ramana - any thoughts?)


Btw, another option is to push the decision past full SLP analysis
and thus make the decision globally for all SLP instances - currently
SLP instances are cancelled one a one-by-one basis meaning we might
do SLP plus load/store-lanes in the same loop.


I don't see anything inherently wrong with doing both in the same loop. On 
simple loops, I suspect we'll do better committing to one strategy or the other 
(tho really it's only the VF required I think?), but then, on such simple loops, 
there are probably not very many SLP instances!



Maybe we have to go all the way to implementing a better vectorization
cost hook just for the permutations - the SLP path in theory knows
exactly which ones it will generate.


Yes, I think this sounds like a good plan for GCC 7. It doesn't require 
constructing an entire stmt (if you are concerned about the cost of that), and 
on most targets, probably integrates fairly easily with the 
expand_vec_perm_const hooks.


--Alan


[PATCH][Testsuite]Cleanup logs from gdb tests by adding newlines

2015-12-10 Thread Alan Lawrence
Runs of the guality testsuite can sometimes end up with gcc.log containing
malformed lines like:

A debugging session is active.PASS: gcc.dg/guality/pr36728-1.c   -O2  line 18 
arg4 == 4
A debugging session is active.PASS: gcc.dg/guality/restrict.c   -O2  line 30 
type:ip == int *
Inferior 1 [process 27054] will be killed.PASS: 
gcc.dg/guality/restrict.c   -O2  line 30 type:cicrp == const int * const 
restrict
Inferior 1 [process 27160] will be killed.PASS: 
gcc.dg/guality/restrict.c   -O2  line 30 type:cvirp == int * const volatile 
restrict

This patch just makes sure the PASS/FAIL comes at the beginning of a line.  (At
the slight cost of adding some extra newlines not in the actual test output.)

I moved the remote_close target calls earlier, to avoid any possible race
condition of extra output being generated after the newline - this may not be
strictly necessary.

Tested on aarch64-none-linux-gnu and x86_64-none-linux-gnu.

I think this is reasonable for stage 3 - OK for trunk?

gcc/testsuite/ChangeLog:
* lib/gcc-gdb-test.exp (gdb-test): call remote_close earlier, and send
newline to log, before calling pass/fail/unsupported.
* lib/gcc-simulate-thread.exp (simulate-thread): Likewise.
---
 gcc/testsuite/lib/gcc-gdb-test.exp| 15 ++-
 gcc/testsuite/lib/gcc-simulate-thread.exp | 10 +++---
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/lib/gcc-gdb-test.exp 
b/gcc/testsuite/lib/gcc-gdb-test.exp
index d3ba6e4..f60cabf 100644
--- a/gcc/testsuite/lib/gcc-gdb-test.exp
+++ b/gcc/testsuite/lib/gcc-gdb-test.exp
@@ -84,8 +84,9 @@ proc gdb-test { args } {
 remote_expect target [timeout_value] {
# Too old GDB
-re "Unhandled dwarf expression|Error in sourced command file|

Re: [PATCH] Enable libstdc++ numeric conversions on Cygwin

2015-12-08 Thread Alan Lawrence

On 13/11/15 14:52, Jonathan Wakely wrote:

That patch was wrong, the new macros in include/bits/c++config used
"CSTDIO" instead of "STDIO" so it caused several tests to go from
PASS to UNSUPPORTED, oops!

This is the correct version, tested again more carefully, on
powerpc64le-linux and powerpc-aix and x86_64-dragonfly.

Committed to trunk.


Just as a note, on baremetal ARM and AArch64 (arm-none-eabi / aarch64-none-elf), 
this causes a bunch more tests to be executed that were previously UNSUPPORTED, 
including


FAIL: 21_strings/basic_string/numeric_conversions/char/stod.cc execution test
FAIL: 21_strings/basic_string/numeric_conversions/char/stof.cc execution test
FAIL: 21_strings/basic_string/numeric_conversions/char/stold.cc execution test
FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stod.cc execution test
FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stof.cc execution test
FAIL: 21_strings/basic_string/numeric_conversions/wchar_t/stold.cc execution 
test
FAIL: 27_io/basic_ostream/inserters_arithmetic/char/hexfloat.cc execution test

these were previously gated off by dg-require-string-conversions, via the
"#if !defined(_GLIBCXX_USE_C99) || defined(_GLIBCXX_HAVE_BROKEN_VSWPRINTF)" guard, as
_GLIBCXX_USE_C99 was not defined. However, now we have "#if
!(_GLIBCXX_USE_C99_STDIO && _GLIBCXX_USE_C99_STDLIB && _GLIBCXX_USE_C99_WCHAR)"
and all of those are true.
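
Spelling the gating logic out (a sketch only: the macros are as quoted above, but
the exact placement of the guards in the headers may differ):

/* Old guard: with _GLIBCXX_USE_C99 undefined on these baremetal targets,
   the numeric conversion functions were not declared, so the tests were
   UNSUPPORTED.  */
#if !defined(_GLIBCXX_USE_C99) || defined(_GLIBCXX_HAVE_BROKEN_VSWPRINTF)
/* std::stod, std::stof, std::stold, ... not declared */
#endif

/* New guard: only disabled if one of the finer-grained macros is false.
   On arm-none-eabi / aarch64-none-elf all three are true, so the functions
   are declared and the tests now run (and fail at execution time).  */
#if !(_GLIBCXX_USE_C99_STDIO && _GLIBCXX_USE_C99_STDLIB && _GLIBCXX_USE_C99_WCHAR)
/* std::stod, std::stof, std::stold, ... not declared */
#endif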


(Note the tests would have failed before if we had executed them; there is no
change there AFAICS.)


Cheers,
Alan



[PATCH][contrib] Update download_prerequisites to ISL 0.15.

2015-12-07 Thread Alan Lawrence
Since r229889, we now have tests that pass only with ISL 0.15. Although ISL 0.15
is not a requirement, it seems we should make it easy to build the compiler that
way.

Other opinions?

Thanks,
Alan

contrib/ChangeLog:

* download_prerequisites: Update ISL version to 0.15.
---
 contrib/download_prerequisites | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/download_prerequisites b/contrib/download_prerequisites
index 6940330..a685a1d 100755
--- a/contrib/download_prerequisites
+++ b/contrib/download_prerequisites
@@ -48,7 +48,7 @@ ln -sf $MPC mpc || exit 1
 
 # Necessary to build GCC with the Graphite loop optimizations.
 if [ "$GRAPHITE_LOOP_OPT" = "yes" ] ; then
-  ISL=isl-0.14
+  ISL=isl-0.15
 
   wget ftp://gcc.gnu.org/pub/gcc/infrastructure/$ISL.tar.bz2 || exit 1
   tar xjf $ISL.tar.bz2  || exit 1
-- 
1.9.1



Re: [PATCH][contrib] Update download_prerequisites to ISL 0.15.

2015-12-07 Thread Alan Lawrence

On 07/12/15 11:54, Markus Trippelsdorf wrote:

On 2015.12.07 at 11:52 +, Alan Lawrence wrote:

Since r229889, we now have tests that pass only with ISL 0.15. Although ISL 0.15
is not a requirement, it seems we should make it easy to build the compiler that
way.


This is already fixed by r231329.



Ah, yes, sorry, that'd been sitting in my outbox since Friday but I hadn't sent 
it. Thanks and sorry for the noise.


--Alan



Re: [PATCH][1/2] Fix PR68553

2015-12-04 Thread Alan Lawrence

On 27/11/15 08:30, Richard Biener wrote:


This is part 1 of a fix for PR68533 which shows that some targets
cannot can_vec_perm_p on an identity permutation.  I chose to fix
this in the vectorizer by detecting the identity itself but with
the current structure of vect_transform_slp_perm_load this is
somewhat awkward.  Thus the following no-op patch simplifies it
greatly (from the times it was restricted to do interleaving-kind
of permutes).  It turned out to not be 100% no-op as we now can
handle non-adjacent source operands so I split it out from the
actual fix.

The two adjusted testcases no longer fail to vectorize because
of "need three vectors" but unadjusted would fail because there
are simply not enough scalar iterations in the loop.  I adjusted
that and now we vectorize it just fine (running into PR68559
which I filed).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-27  Richard Biener  

PR tree-optimization/68553
* tree-vect-slp.c (vect_get_mask_element): Remove.
(vect_transform_slp_perm_load): Implement in a simpler way.

* gcc.dg/vect/pr45752.c: Adjust.
* gcc.dg/vect/slp-perm-4.c: Likewise.


On aarch64 and ARM targets, this causes

PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 0


That is, we now vectorize using SLP, when previously we did not.

On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES, without 
unrolling, but now we unroll * 4, and vectorize using 3 loads and permutes:


../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.15_94 = VEC_PERM_EXPR ;
../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.16_95 = VEC_PERM_EXPR ;
../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt: 
vect__31.17_96 = VEC_PERM_EXPR 


which *is* a valid vectorization strategy...


--Alan



Re: [PATCH] enable loop fusion on isl-15

2015-12-04 Thread Alan Lawrence

On 05/11/15 21:43, Sebastian Pop wrote:

* graphite-optimize-isl.c (optimize_isl): Call
isl_options_set_schedule_maximize_band_depth.

* gcc.dg/graphite/fuse-1.c: New.
* gcc.dg/graphite/fuse-2.c: New.
* gcc.dg/graphite/interchange-13.c: Remove bogus check.


I note that the test

scan-tree-dump-times forwprop4 "gimple_simplified to[^\\n]*\\^ 12" 1

FAILs under isl-0.14, with which GCC can still be built and generally claims to 
work.


Is it worth trying to detect this in the testsuite, so we can XFAIL it? By which 
I mean, is there a reasonable testsuite mechanism by which we could do that?


Cheers, Alan





Re: [PATCH][1/2] Fix PR68553

2015-12-04 Thread Alan Lawrence

On 04/12/15 17:46, Ramana Radhakrishnan wrote:



On 04/12/15 16:04, Richard Biener wrote:

On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawre...@arm.com> 
wrote:

On 27/11/15 08:30, Richard Biener wrote:


This is part 1 of a fix for PR68533 which shows that some targets
cannot can_vec_perm_p on an identity permutation.  I chose to fix
this in the vectorizer by detecting the identity itself but with
the current structure of vect_transform_slp_perm_load this is
somewhat awkward.  Thus the following no-op patch simplifies it
greatly (from the times it was restricted to do interleaving-kind
of permutes).  It turned out to not be 100% no-op as we now can
handle non-adjacent source operands so I split it out from the
actual fix.

The two adjusted testcases no longer fail to vectorize because
of "need three vectors" but unadjusted would fail because there
are simply not enough scalar iterations in the loop.  I adjusted
that and now we vectorize it just fine (running into PR68559
which I filed).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-27  Richard Biener  <rguent...@suse.de>

PR tree-optimization/68553
* tree-vect-slp.c (vect_get_mask_element): Remove.
(vect_transform_slp_perm_load): Implement in a simpler way.

* gcc.dg/vect/pr45752.c: Adjust.
* gcc.dg/vect/slp-perm-4.c: Likewise.


On aarch64 and ARM targets, this causes

PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
"vectorizing
stmts using SLP" 0

That is, we now vectorize using SLP, when previously we did not.

On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES,
without
unrolling,

but now we unroll * 4, and vectorize using 3 loads and

permutes:


Happens on x86_64 as well with at least Sse4.1.  Unfortunately we'll have to 
start introducing much more fine-grained target-supports for vect_perm to 
reliably guard all targets.


I don't know enough about SSE4.1 to know whether it's a problem there or not.
This is an actual regression on AArch64 and ARM and not just a testism: you now
get:

.L5:
 ldr q0, [x5, 16]
 add x4, x4, 48
 ldr q1, [x5, 32]
 add w6, w6, 1
 ldr q4, [x5, 48]
 cmp w3, w6
 ldr q2, [x5], 64
 orr v3.16b, v0.16b, v0.16b
 orr v5.16b, v4.16b, v4.16b
 orr v4.16b, v1.16b, v1.16b
 tbl v0.16b, {v0.16b - v1.16b}, v6.16b
 tbl v2.16b, {v2.16b - v3.16b}, v7.16b
 tbl v4.16b, {v4.16b - v5.16b}, v16.16b
 str q0, [x4, -32]
 str q2, [x4, -48]
 str q4, [x4, -16]
 bhi .L5

instead of

.L5:
 ld4 {v4.4s - v7.4s}, [x7], 64
 add w4, w4, 1
 cmp w3, w4
 orr v1.16b, v4.16b, v4.16b
 orr v2.16b, v5.16b, v5.16b
 orr v3.16b, v6.16b, v6.16b
 st3 {v1.4s - v3.4s}, [x6], 48
 bhi .L5

LD4 and ST3 do all the permutes without needing actual permute instructions - a
strategy that favours generic permutes, avoiding the load_lanes case, is likely
to be more expensive on most implementations. I think this is worth a PR at least.

regards
Ramana



Yes, quite right. PR 68707.

--Alan



Re: [PATCH] Empty redirect_edge_var_map after each pass and function

2015-12-03 Thread Alan Lawrence

On 02/12/15 14:13, Jeff Law wrote:

On 12/02/2015 01:33 AM, Richard Biener wrote:

Right.  So the question I have is how/why did DOM leave anything in the map.
And if DOM is fixed to not leave stuff lying around, can we then assert that
nothing is ever left in those maps between passes?  There's certainly no good
reason I'm aware of why DOM would leave things in this state.


It happens not only with DOM but with all passes doing edge redirection.
This is because the map is populated by GIMPLE cfg hooks just in case
it might be used.  But there is no such thing as a "start CFG manip"
and "end CFG manip" to cleanup such dead state.

Sigh.



IMHO the redirect-edge-var-map stuff is just the very most possible
unclean implementation possible. :(  (see how remove_edge "clears"
stale info from the map to avoid even more "interesting" stale
data)

Ideally we could assert the map is empty whenever we leave a pass,
but as said it triggers all over the place.  Even cfg-cleanup causes
such stale data.

I agree that the patch is only a half-way "solution", but a full
solution would require sth more explicit, like we do with
initialize_original_copy_tables/free_original_copy_tables.  Thus
require passes to explicitely request the edge data to be preserved
with a initialize_edge_var_map/free_edge_var_map call pair.

Not appropriate at this stage IMHO (well, unless it turns out to be
a very localized patch).

So maybe as a follow-up to aid folks in the future, how about a debugging
verify_whatever function that we can call manually if debugging a problem in
this space.  With a comment indicating why we can't call it unconditionally 
(yet).


jeff


I did a (FWIW, --disable-bootstrap) build with the map-emptying code in passes.c
(not function.c), printing out passes after which the map was non-empty (before
emptying it, to make sure passes weren't just carrying through stale data from
earlier). My (non-exhaustive!) list of passes after which the
edge_var_redirect_map can be non-empty stands at...


aprefetch ccp cddce ch ch_vect copyprop crited crited cselim cunroll cunrolli 
dce dom ehcleanup einline esra fab fnsplit forwprop fre graphite ifcvt 
isolate-paths ldist lim local-pure-const mergephi oaccdevlow ompexpssa optimized 
parloops pcom phicprop phiopt phiprop pre profile profile_estimate sccp sink 
slsr split-paths sra switchconv tailc tailr tracer unswitch veclower2 vect vrm 
vrp whole-program


FWIW, the route by which dom added the edge to the redirect map was:
#0  redirect_edge_var_map_add (e=e@entry=0x7fb7a5f508, result=0x7fb725a000,
def=0x7fb78eaea0, locus=2147483884) at ../../gcc/gcc/tree-ssa.c:54
#1  0x00cccf58 in ssa_redirect_edge (e=e@entry=0x7fb7a5f508,
dest=dest@entry=0x7fb79cc680) at ../../gcc/gcc/tree-ssa.c:158
#2  0x00b00738 in gimple_redirect_edge_and_branch (e=0x7fb7a5f508,
dest=0x7fb79cc680) at ../../gcc/gcc/tree-cfg.c:5662
#3  0x006ec678 in redirect_edge_and_branch (e=e@entry=0x7fb7a5f508,
dest=) at ../../gcc/gcc/cfghooks.c:356
#4  0x00cb4530 in ssa_fix_duplicate_block_edges (rd=0x1a29f10,
local_info=local_info@entry=0x7fed40)
at ../../gcc/gcc/tree-ssa-threadupdate.c:1184
#5  0x00cb5520 in ssa_fixup_template_block (slot=,
local_info=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1369
#6  traverse_noresize<ssa_local_info_t*, ssa_fixup_template_block> (
argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:911
#7  traverse<ssa_local_info_t*, ssa_fixup_template_block> (
argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:933
#8  thread_block_1 (bb=bb@entry=0x7fb7485bc8,
noloop_only=noloop_only@entry=true, joiners=joiners@entry=true)
at ../../gcc/gcc/tree-ssa-threadupdate.c:1592
#9  0x00cb5a40 in thread_block (bb=0x7fb7485bc8,
noloop_only=noloop_only@entry=true)
at ../../gcc/gcc/tree-ssa-threadupdate.c:1629
---Type  to continue, or q  to quit---
#10 0x00cb6bf8 in thread_through_all_blocks (
may_peel_loop_headers=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:2736
#11 0x00becf6c in (anonymous namespace)::pass_dominator::execute (
this=, fun=0x7fb77d1b28)
at ../../gcc/gcc/tree-ssa-dom.c:622
#12 0x009feef4 in execute_one_pass (pass=pass@entry=0x16d1a80)
at ../../gcc/gcc/passes.c:2311

The edge is then deleted much later:
#3  0x00f858e4 in free_edge (fn=, e=)
at ../../gcc/gcc/cfg.c:91
#4  remove_edge_raw (e=) at ../../gcc/gcc/cfg.c:350
#5  0x006ec814 in remove_edge (e=)
at ../../gcc/gcc/cfghooks.c:418
#6  0x006ecaec in delete_basic_block (bb=bb@entry=0x7fb74b3618)
at ../../gcc/gcc/cfghooks.c:597
#7  0x00f8d1d4 in try_optimize_cfg (mode=32)
at ../../gcc/gcc/cfgcleanup.c:2701
#8  cleanup_cfg (mode=mode@entry=32) at ../../gcc/gcc/cfgcleanup.c:3028
#9  0x0070180c in cfg_layout_initialize (flags=flags@entry=0)
at ../../gcc/gcc/cfgrtl.c:4264
#10 0x00f7cdc8 in 

Re: [PATCH] Empty redirect_edge_var_map after each pass and function

2015-12-03 Thread Alan Lawrence

On 03/12/15 12:58, Richard Biener wrote:

On Thu, 3 Dec 2015, Alan Lawrence wrote:


On 02/12/15 14:13, Jeff Law wrote:

On 12/02/2015 01:33 AM, Richard Biener wrote:

Right.  So the question I have is how/why did DOM leave anything in the
map.
And if DOM is fixed to not leave stuff lying around, can we then assert
that
nothing is ever left in those maps between passes?  There's certainly no
good
reason I'm aware of why DOM would leave things in this state.


It happens not only with DOM but with all passes doing edge redirection.
This is because the map is populated by GIMPLE cfg hooks just in case
it might be used.  But there is no such thing as a "start CFG manip"
and "end CFG manip" to cleanup such dead state.

Sigh.



IMHO the redirect-edge-var-map stuff is just the very most possible
unclean implementation possible. :(  (see how remove_edge "clears"
stale info from the map to avoid even more "interesting" stale
data)

Ideally we could assert the map is empty whenever we leave a pass,
but as said it triggers all over the place.  Even cfg-cleanup causes
such stale data.

I agree that the patch is only a half-way "solution", but a full
solution would require sth more explicit, like we do with
initialize_original_copy_tables/free_original_copy_tables.  Thus
require passes to explicitely request the edge data to be preserved
with a initialize_edge_var_map/free_edge_var_map call pair.

Not appropriate at this stage IMHO (well, unless it turns out to be
a very localized patch).

So maybe as a follow-up to aid folks in the future, how about a debugging
verify_whatever function that we can call manually if debugging a problem in
this space.  With a comment indicating why we can't call it unconditionally
(yet).


jeff


I did a (fwiw disable bootstrap) build with the map-emptying code in passes.c
(not functions.c), printing out passes after which the map was non-empty
(before emptying it, to make sure passes weren't just carrying through stale
data from earlier). My (non-exhaustive!) list of passes after which the
edge_var_redirect_map can be non-empty stands at...

aprefetch ccp cddce ch ch_vect copyprop crited crited cselim cunroll cunrolli
dce dom ehcleanup einline esra fab fnsplit forwprop fre graphite ifcvt
isolate-paths ldist lim local-pure-const mergephi oaccdevlow ompexpssa
optimized parloops pcom phicprop phiopt phiprop pre profile profile_estimate
sccp sink slsr split-paths sra switchconv tailc tailr tracer unswitch
veclower2 vect vrm vrp whole-program


Yeah, exactly my findings...  note that most of the above are likely
due to cfgcleanup even though it already does sth like

   e = redirect_edge_and_branch (e, dest);
   redirect_edge_var_map_clear (e);

so eventually placing a redirect_edge_var_map_empty () at the end
of the cleanup_tree_cfg function should prune down the above list
considerably (well, then assert the map is empty on entry to that
function of course)


FWIW, the route by which dom added the edge to the redirect map was:
#0  redirect_edge_var_map_add (e=e@entry=0x7fb7a5f508, result=0x7fb725a000,
 def=0x7fb78eaea0, locus=2147483884) at ../../gcc/gcc/tree-ssa.c:54
#1  0x00cccf58 in ssa_redirect_edge (e=e@entry=0x7fb7a5f508,
 dest=dest@entry=0x7fb79cc680) at ../../gcc/gcc/tree-ssa.c:158
#2  0x00b00738 in gimple_redirect_edge_and_branch (e=0x7fb7a5f508,
 dest=0x7fb79cc680) at ../../gcc/gcc/tree-cfg.c:5662
#3  0x006ec678 in redirect_edge_and_branch (e=e@entry=0x7fb7a5f508,
 dest=) at ../../gcc/gcc/cfghooks.c:356
#4  0x00cb4530 in ssa_fix_duplicate_block_edges (rd=0x1a29f10,
 local_info=local_info@entry=0x7fed40)
 at ../../gcc/gcc/tree-ssa-threadupdate.c:1184
#5  0x00cb5520 in ssa_fixup_template_block (slot=,
 local_info=0x7fed40) at ../../gcc/gcc/tree-ssa-threadupdate.c:1369
#6  traverse_noresize<ssa_local_info_t*, ssa_fixup_template_block> (
 argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:911
#7  traverse<ssa_local_info_t*, ssa_fixup_template_block> (
 argument=0x7fed40, this=0x1a21a00) at ../../gcc/gcc/hash-table.h:933
#8  thread_block_1 (bb=bb@entry=0x7fb7485bc8,
 noloop_only=noloop_only@entry=true, joiners=joiners@entry=true)
 at ../../gcc/gcc/tree-ssa-threadupdate.c:1592
#9  0x00cb5a40 in thread_block (bb=0x7fb7485bc8,
 noloop_only=noloop_only@entry=true)
 at ../../gcc/gcc/tree-ssa-threadupdate.c:1629
---Type  to continue, or q  to quit---
#10 0x00cb6bf8 in thread_through_all_blocks (
 may_peel_loop_headers=true) at ../../gcc/gcc/tree-ssa-threadupdate.c:2736
#11 0x00becf6c in (anonymous namespace)::pass_dominator::execute (
 this=, fun=0x7fb77d1b28)
 at ../../gcc/gcc/tree-ssa-dom.c:622
#12 0x009feef4 in execute_one_pass (pass=pass@entry=0x16d1a80)
 at ../../gcc/gcc/passes.c:2311

The edge is then deleted much later:
#3  0

Re: [PATCH][install.texi] Add note against GNAT 4.8 on ARM targets.

2015-12-03 Thread Alan Lawrence

On 16/11/15 15:08, Alan Lawrence wrote:

This follows from the discussion here: 
https://gcc.gnu.org/ml/gcc/2015-10/msg00082.html .

OK for trunk?

--Alan

gcc/ChangeLog:

doc/install.texi: Add note against GNAT 4.8 on ARM targets.
---
  gcc/doc/install.texi | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 1fd773e..1ce93d4 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -3481,6 +3481,8 @@ require GNU binutils 2.13 or newer.  Such subtargets 
include:
  @code{arm-*-netbsdelf}, @code{arm-*-*linux-*}
  and @code{arm-*-rtemseabi}.

+Building the Ada frontend commonly fails (an infinite loop executing 
@code{xsinfo}) if the host compiler is GNAT 4.8.  Host compilers built from the 
GNAT 4.6, 4.9 or 5 release branches are known to succeed.
+
  @html
  
  @end html



Ping.



Re: [PING^2][PATCH] Improve C++ loop's backward-jump location

2015-12-02 Thread Alan Lawrence

On 24/11/15 14:55, Andreas Arnez wrote:

Ping?

   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01192.html

I guess we want C and C++ behave the same here?


gcc/cp/ChangeLog:

* cp-gimplify.c (genericize_cp_loop): Change LOOP_EXPR's location
to start of loop body instead of start of loop.

gcc/testsuite/ChangeLog:

* g++.dg/guality/pr67192.C: New test.






Since this, we've been seeing these tests fail natively on AArch64 and ARM:

FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++11  gcov: 3 failures in line counts, 0 in 
branch percentages, 0 in return percentages, 0 in intermediate format

FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++11  line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++11  line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++11  line 73: is 162:should be 81
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++14  gcov: 3 failures in line counts, 0 in 
branch percentages, 0 in return percentages, 0 in intermediate format

FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++14  line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++14  line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++14  line 73: is 162:should be 81
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++98  gcov: 3 failures in line counts, 0 in 
branch percentages, 0 in return percentages, 0 in intermediate format

FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++98  line 115: is 27:should be 14
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++98  line 58: is 18:should be 9
FAIL: g++.dg/gcov/gcov-1.C  -std=gnu++98  line 73: is 162:should be 81

I've not had a chance to look any further yet.

Thanks, Alan



Re: [PATCH] Fix PR68559

2015-12-02 Thread Alan Lawrence

On 27/11/15 14:13, Richard Biener wrote:


The following fixes the excessive peeling for gaps we do when doing
SLP now that I removed most of the restrictions on having gaps in
the first place.

This should make low-trip vectorized loops more efficient (sth
also the combine-epilogue-with-vectorized-body-by-masking patches
claim to do).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2015-11-27  Richard Biener  

PR tree-optimization/68559
* tree-vect-data-refs.c (vect_analyze_group_access_1): Move
peeling for gap checks ...
* tree-vect-stmts.c (vectorizable_load): ... here and relax
for SLP.
* tree-vect-loop.c (vect_analyze_loop_2): Re-set
LOOP_VINFO_PEELING_FOR_GAPS before re-trying without SLP.

* gcc.dg/vect/slp-perm-4.c: Adjust again.
* gcc.dg/vect/pr45752.c: Likewise.


Since this, we have

FAIL: gcc.dg/vect/pr45752.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"gaps requires scalar epilogue loop" 0
FAIL: gcc.dg/vect/pr45752.c scan-tree-dump-times vect "gaps requires scalar 
epilogue loop" 0


on aarch64 platforms (aarch64-none-linux-gnu, aarch64-none-elf, 
aarch64_be-none-elf).



Thanks, Alan



[PATCH] Empty redirect_edge_var_map after each pass and function

2015-12-01 Thread Alan Lawrence
This follows on from discussion at
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03392.html
To recap: Starting in r229479 and continuing until at least 229711, compiling
polynom.c from spec2000 on aarch64-none-linux-gnu, with options
-O3 -mcpu=cortex-a53 -ffast-math (on both cross, native bootstrapped, and native
--disable-bootstrap compilers), produced a verify_gimple ICE after unswitch:

../spec2000/benchspec/CINT2000/254.gap/src/polynom.c: In function 
'NormalizeCoeffsListx':
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: 
incompatible types in PHI argument 0
 TypHandle NormalizeCoeffsListx ( hdC )
   ^
long int

int

../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: location 
references block not in block tree
l1_279 = PHI <1(28), l1_299(33)>
../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: invalid PHI 
argument

../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: internal compiler 
error: tree check: expected class 'type', have 'declaration' (namespace_decl) 
in useless_type_conversion_p, at gimple-expr.c:84
0xd110ef tree_class_check_failed(tree_node const*, tree_code_class, char 
const*, int, char const*)
../../gcc-fsf/gcc/tree.c:9643
0x82561b tree_class_check
../../gcc-fsf/gcc/tree.h:3042
0x82561b useless_type_conversion_p(tree_node*, tree_node*)
../../gcc-fsf/gcc/gimple-expr.c:84
0xaca043 verify_gimple_phi
../../gcc-fsf/gcc/tree-cfg.c:4673
0xaca043 verify_gimple_in_cfg(function*, bool)
../../gcc-fsf/gcc/tree-cfg.c:4967
0x9c2e0b execute_function_todo
../../gcc-fsf/gcc/passes.c:1967
0x9c360b do_per_function
../../gcc-fsf/gcc/passes.c:1659
0x9c3807 execute_todo
../../gcc-fsf/gcc/passes.c:2022

I was not able to reduce the testcase below about 30k characters; with e.g.
#define T_VOID 0
a use of T_VOID producing the ICE, but manually changing that use to a literal 0
preventing the ICE; as did running the preprocessor as a separate step, or a
wide variety of options (e.g. -fdump-tree-alias).

In the end I traced this to loop_unswitch reading stale values from the edge
redirect map, which is keyed on 'edge' (a pointer to struct edge_def); the map
entries had been left there by pass_dominator (on a different function), and by
"chance" the edge *pointers* were the same as those of some current edge_defs (even
though they pointed to structures created by different allocations, the first
of which had since been freed). Hence the fragility of the testcase and
environment.
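
(For illustration only - this is not GCC code, and the names are invented for the
sketch - the failure mode is the classic hazard of a map keyed on raw pointer
values that is never cleared when the pointed-to object is freed:)

/* Minimal sketch of a pointer-keyed "map" matching an unrelated, later
   allocation that happens to reuse the same address.  */
#include <stdio.h>
#include <stdlib.h>

struct edge_like { int src, dest; };

static struct edge_like *stale_key;   /* key recorded by an earlier "pass"...  */
static int stale_value = 42;          /* ...with some associated data          */

int main (void)
{
  struct edge_like *e1 = malloc (sizeof *e1);
  stale_key = e1;                     /* entry added for e1                    */
  free (e1);                          /* e1 dies, but the entry is not cleared */

  struct edge_like *e2 = malloc (sizeof *e2);  /* may reuse e1's address       */
  if (e2 == stale_key)                /* "by chance" the pointers are equal    */
    printf ("stale data %d found for an unrelated edge\n", stale_value);
  else
    printf ("the allocator did not reuse the address this time\n");

  free (e2);
  return 0;
}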

While the ICE is prevented merely by adding a call to
redirect_edge_var_map_destroy at the end of pass_dominator::execute, given the
fragility of the bug, difficulty of reducing the testcase, and the low overhead
of emptying an already-empty map, I believe the right fix is to empty the map
as often as we correctly can, hence this patch - based substantially on
Richard's comments in PR/68117.

Bootstrapped + check-gcc + check-g++ on x86_64 linux, based on r231105; I've
also built SPEC2000 on aarch64-none-linux-gnu by applying this patch (roughly)
onto the previously-failing r229711, which also passes aarch64 bootstrap, and
a more recent bootstrap on aarch64 is ongoing. Assuming/if no regressions 
there...

Is this ok for trunk?

This could also be a candidate for the 5.3 release; backporting depends only on
the (fairly trivial) r230357.

gcc/ChangeLog:

  Alan Lawrence  <alan.lawre...@arm.com>
Richard Biener  <richard.guent...@gmail.com>

* cfgexpand.c (pass_expand::execute): Replace call to
redirect_edge_var_map_destroy with redirect_edge_var_map_empty.
* tree-ssa.c (delete_tree_ssa): Likewise.
* function.c (set_cfun): Call redirect_edge_var_map_empty.
* passes.c (execute_one_ipa_transform_pass, execute_one_pass): Likewise.
* tree-ssa.h (redirect_edge_var_map_destroy): Remove.
(redirect_edge_var_map_empty): New.
* tree-ssa.c (redirect_edge_var_map_destroy): Remove.
(redirect_edge_var_map_empty): New.

---
 gcc/cfgexpand.c | 2 +-
 gcc/function.c  | 2 ++
 gcc/passes.c| 2 ++
 gcc/tree-ssa.c  | 8 
 gcc/tree-ssa.h  | 2 +-
 5 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 1990e10..ede1b82 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -6291,7 +6291,7 @@ pass_expand::execute (function *fun)
   expand_phi_nodes ();
 
   /* Release any stale SSA redirection data.  */
-  redirect_edge_var_map_destroy ();
+  redirect_edge_var_map_empty ();
 
   /* Register rtl specific functions for cfg.  */
   rtl_register_cfg_hooks ();
diff --git a/gcc/function.c b/gcc/function.c
index 515d7c0..e452865 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -75,6 +75,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-chkp.h"
 #include "rtl-chkp.h"
 #include "tree-dfa.h"
+#include "tree-ssa.h"
 
 /* So we 

Re: [PATCH] Fix PR68067

2015-11-27 Thread Alan Lawrence

On 27/11/15 15:07, Alan Lawrence wrote:

On 23/11/15 09:43, Richard Biener wrote:

On Fri, 20 Nov 2015, Alan Lawrence wrote:


...the asserts
you suggested in (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68117#c27)...


So I have to ask, how sure are you that those assertions are(/should
be!) "correct"? :)


Ideally they should be correct but they happen to be not (and I think
the intent was that this should be harmless).  Basically I tried
to assert that nobody creates stale edge redirect data that is not
later consumed or cleared.  Happens to be too optimistic :/


Maybe so, but it looks like the edge_var_redirect_map is still suspect here. On
the ~~28th call to loop_version, from tree_unswitch_loop, the call to
lv_flush_pending_stmts executes (tree-cfg.c flush_pending_stmts):

def = redirect_edge_var_map_def (vm);
add_phi_arg (phi, def, e, redirect_edge_var_map_location(vm));

and BLOCK_LOCATION (redirect_edge_var_map_location(vm)) is

< 0x7fb7704a80 side-effects addressable asm_written used
protected static visited tree_0 tree_2 tree_5>

so yeah, next question, how'd that get there...

A.


Well, pass_dominator::execute calls redirect_edge_var_map_add with that edge
pointer, at which time the edge is from block 32 (0x7fb79cc6e8) to block 20
(0x7fb7485e38), and locus is 2147483884; and then again, with locus 0.


With no intervening calls to redirect_edge_var_map_clear for that edge,
loop_version's call to flush_pending_stmts then reads
redirect_edge_var_map_vector for that edge pointer - which is now an edge from
block 126 (0x7fb7485af8) to block 117 (0x7fb74856e8). It sees those locations
(2147483884 and 0)...


Clearing the edge redirect map at the end of pass_dominator fixes the ICE (as 
would clearing it at the end of each stage, or clearing it at the beginning of 
loop_unswitch, I guess).


I'll post a patch after more testing, but of course I'm keen to hear if there
are obvious problems with the approach.


And coming up with a testcase, well, heh - this broke because of identical 
pointers to structures allocated at different times, with intervening 
free...ideas welcome of course!


--Alan



Re: [PATCH 1/2][ARM] PR/65956 AAPCS update for alignment attribute

2015-11-27 Thread Alan Lawrence
On 26 November 2015 at 14:00, Alan Lawrence <alan.lawre...@arm.com> wrote:
> On 6 November 2015 at 16:59, Jakub Jelinek <ja...@redhat.com> wrote:
>>
>> In any case, to manually reproduce, compile
>> gnatmake -g -gnatws macrosub.adb
>> with GCC 5.1.1 (before the ARM changes) and then try to run that process 
>> against
>> GCC 5.2.1 (after the ARM changes) libgnat-5.so, which is what make check
>> does (it uses host_gnatmake to compile the support stuff, so ideally the
>> processes built by host gcc/gnatmake should not be run with the
>> LD_LIBRARY_PATH=$ADA_INCLUDE_PATH:$BASE:$LD_LIBRARY_PATH
>> in the environment, and others should).
>> In macrosub in particular, the problem is in:
>>   WHILE NOT END_OF_FILE (INFILE1) LOOP
>>GET_LINE (INFILE1, A_LINE, A_LENGTH);
>> in FILL_TABLE, where A_LINE'First is 0 and A_LINE'Last is 400 (if I remember
>> right), but if you step into GET_LINE compiled by GCC 5.2.1, Item'First
>> and Item'Last don't match that.
>
> Ok, I see the mismatch now.

The type affected in Jakub's case here is an Ada String, which looks like this:

 constant 64>
unit size  constant 8>
align 64 symtab -151604912 alias set -1 canonical type 0xf7569720
fields 
asm_written unsigned SI
size 
unit size 
align 32 symtab -151604672 alias set -1 canonical type 0xf756a2a0>
unsigned nonaddressable SI file  line 0 col 0 size
 unit size 
align 32 offset_align 64
offset 
bit offset  context

chain 
visited unsigned nonaddressable SI file  line 0
col 0 size  unit size 
align 32 offset_align 64 offset 
bit offset  context >> context 
unconstrained array 
BLK
align 8 symtab 0 alias set -1 canonical type 0xf7569c00
context 
pointer_to_this 
reference_to_this  chain
>
chain >

i.e. a 64-bit DImode struct, with alignment set to 64, containing
P_ARRAY a 32-bit pointer with alignment 32, and P_BOUNDS a 32-bit pointer
with alignment 32, pointing to a record (of size 64, alignment 32, containing
two 32-bit ints LB0 and UB0).
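
In rough C terms (a sketch of the layout just described, with field names taken
from the dump above; this is only an illustration, not the exact gnat-generated
type):

/* Sketch of the Ada String fat pointer described above.  */
struct string_bounds
{
  int LB0;                        /* lower bound, 32 bits                  */
  int UB0;                        /* upper bound, 32 bits                  */
};                                /* 64 bits total, 32-bit aligned         */

struct string_fat_pointer
{
  char *P_ARRAY;                  /* 32-bit pointer to the character data  */
  struct string_bounds *P_BOUNDS; /* 32-bit pointer to the bounds record   */
} __attribute__ ((aligned (8)));  /* whole descriptor: 64 bits, align 64   */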

AFAICT, in the fill_table/get_line case, the first parameter to
get_line is a file, a simple pointer; then we have a string. So

*fill_table (compiled with 5.1, doubleword aligned) should pass the
string P_ARRAY in r2 and P_BOUNDS in r3.

0x1f334 <getsubs__fill_table+184>   movt   r3, #2
0x1f338 <getsubs__fill_table+188>   str    r3, [r11, #-504]   ; 0x
0x1f33c <getsubs__fill_table+192>   sub    r3, r11, #508      ; 0x1fc
0x1f340 <getsubs__fill_table+196>   ldrd   r2, [r3]
0x1f344 <getsubs__fill_table+200>   mov    r0, r1
0x1f348 <getsubs__fill_table+204>   bl 0x1aee4 

Re: [PATCH] Fix PR68067

2015-11-27 Thread Alan Lawrence

On 23/11/15 09:43, Richard Biener wrote:

On Fri, 20 Nov 2015, Alan Lawrence wrote:


...the asserts
you suggested in (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68117#c27)...


So I have to ask, how sure are you that those assertions are(/should
be!) "correct"? :)


Ideally they should be correct but they happen to be not (and I think
the intent was that this should be harmless).  Basically I tried
to assert that nobody creates stale edge redirect data that is not
later consumed or cleared.  Happens to be too optimistic :/


Maybe so, but it looks like the edge_var_redirect_map is still suspect here. On 
the ~~28th call to loop_version, from tree_unswitch_loop, the call to 
lv_flush_pending_stmts executes (tree-cfg.c flush_pending_stmts):


   def = redirect_edge_var_map_def (vm);
   add_phi_arg (phi, def, e, redirect_edge_var_map_location(vm));

and BLOCK_LOCATION (redirect_edge_var_map_location(vm)) is

< 0x7fb7704a80 side-effects addressable asm_written used 
protected static visited tree_0 tree_2 tree_5>


so yeah, next question, how'd that get there...

A.



Re: [PATCH] Improve verification of loop->latch in verify_loop_structure

2015-11-26 Thread Alan Lawrence
This caused an ICE compiling value.c from gdb on
aarch64-none-linux-gnu; the testcase, after preprocessing on aarch64,
ICEs on both aarch64 and x86_64, but is about 1MB - I'm working on
reducing that down to something small enough to post...

$ ./gcc/xgcc -B ./gcc -O2 -g value.c
../../binutils-gdb/gdb/value.c: In function ‘show_convenience’:
../../binutils-gdb/gdb/value.c:2615:1: error: loop 3’s latch is missing
../../binutils-gdb/gdb/value.c:2615:1: internal compiler error: in
verify_loop_structure, at cfgloop.c:1669
0x71e653 verify_loop_structure()
/work/alalaw01/src2/gcc/gcc/cfgloop.c:1669
0x97c6ae checking_verify_loop_structure
/work/alalaw01/src2/gcc/gcc/cfgloop.h:325
0x97c6ae loop_optimizer_init(unsigned int)
/work/alalaw01/src2/gcc/gcc/loop-init.c:106
0x97c78a rtl_loop_init
/work/alalaw01/src2/gcc/gcc/loop-init.c:398
0x97c78a execute
/work/alalaw01/src2/gcc/gcc/loop-init.c:425

--Alan


Re: [PATCH] Improve verification of loop->latch in verify_loop_structure

2015-11-26 Thread Alan Lawrence
Here's a reduced testcase (reduced to the point of generating lots of
warnings, I'm compiling with -O2 -w, on x86_64):

struct __jmp_buf_tag
  {
  };
typedef struct __jmp_buf_tag sigjmp_buf[1];
extern struct cmd_list_element *showlist;
struct internalvar
{
  struct internalvar *next;
};
static struct internalvar *internalvars;
struct internalvar *
create_internalvar (const char *name)
{
  struct internalvar *var = ((struct internalvar *) xmalloc (sizeof
(struct internalvar)));
  internalvars = var;
}

void
show_convenience (char *ignore, int from_tty)
{
  struct gdbarch *gdbarch = get_current_arch ();
  int varseen = 0;
  for (struct internalvar *var = internalvars; var; var = var->next)
{
  if (!varseen)
varseen = 1;
  sigjmp_buf *buf = exceptions_state_mc_init ();
  __sigsetjmp ();
  while (exceptions_state_mc_action_iter ())
while (exceptions_state_mc_action_iter_1 ())
  ;
}
  if (!varseen)
  printf_unfiltered ();
}

On 26 November 2015 at 11:33, Alan Lawrence <alan.lawre...@arm.com> wrote:
> This caused an ICE compiling value.c from gdb on
> aarch64-none-linux-gnu; the testcase, after preprocessing on aarch64,
> ICEs on both aarch64 and x86_64, but is about 1MB - I'm working on
> reducing that down to something small enough to post...
>
> $ ./gcc/xgcc -B ./gcc -O2 -g value.c
> ../../binutils-gdb/gdb/value.c: In function ‘show_convenience’:
> ../../binutils-gdb/gdb/value.c:2615:1: error: loop 3’s latch is missing
> ../../binutils-gdb/gdb/value.c:2615:1: internal compiler error: in
> verify_loop_structure, at cfgloop.c:1669
> 0x71e653 verify_loop_structure()
> /work/alalaw01/src2/gcc/gcc/cfgloop.c:1669
> 0x97c6ae checking_verify_loop_structure
> /work/alalaw01/src2/gcc/gcc/cfgloop.h:325
> 0x97c6ae loop_optimizer_init(unsigned int)
> /work/alalaw01/src2/gcc/gcc/loop-init.c:106
> 0x97c78a rtl_loop_init
> /work/alalaw01/src2/gcc/gcc/loop-init.c:398
> 0x97c78a execute
> /work/alalaw01/src2/gcc/gcc/loop-init.c:425
>
> --Alan


Re: [PATCH 1/2][ARM] PR/65956 AAPCS update for alignment attribute

2015-11-26 Thread Alan Lawrence
On 6 November 2015 at 16:59, Jakub Jelinek  wrote:
>
> In any case, to manually reproduce, compile
> gnatmake -g -gnatws macrosub.adb
> with GCC 5.1.1 (before the ARM changes) and then try to run that process 
> against
> GCC 5.2.1 (after the ARM changes) libgnat-5.so, which is what make check
> does (it uses host_gnatmake to compile the support stuff, so ideally the
> processes built by host gcc/gnatmake should not be run with the
> LD_LIBRARY_PATH=$ADA_INCLUDE_PATH:$BASE:$LD_LIBRARY_PATH
> in the environment, and others should).
> In macrosub in particular, the problem is in:
>   WHILE NOT END_OF_FILE (INFILE1) LOOP
>GET_LINE (INFILE1, A_LINE, A_LENGTH);
> in FILL_TABLE, where A_LINE'First is 0 and A_LINE'Last is 400 (if I remember
> right), but if you step into GET_LINE compiled by GCC 5.2.1, Item'First
> and Item'Last don't match that.

Ok, I see the mismatch now.

However, to get there, I had to use my 5.1 gnatmake -g -gnatws
macrosub.ads --rts=/path/to/5.2/arm-none-linux-gnueabihf/libada, since if
I ran 5.1 gnatmake without that flag, I did not manage to get the
wrong value passed/received with LD_LIBRARY_PATH set to any of
build-5.2/gcc/ada/rts, build-5.2/arm-none-linux-gnueabihf/libada,
build-5.2/arm-none-linux-gnueabihf/libada/adalib (any further
suggestions?). [Also I note 'LD_DEBUG=all ./macrosub' does not show
libgnat being loaded that way.]

With 5.1 gnatmake -g -gnatws macrosub.ads
--rts=/path/to/5.2/arm-none-linux-gnueabihf/libada :

$ gdb ./macrosub
GNU gdb (Ubuntu 7.7-0ubuntu3) 7.7
[snip]
Reading symbols from ./macrosub...done.
(gdb) break get_line
Breakpoint 1 at 0x1aeec: get_line. (4 locations)
(gdb) run
Starting program:
/home/alalaw01/build-5.1.0/gcc/testsuite/ada/acats/support/macrosub
BEGINNING MACRO SUBSTITUTIONS.

Breakpoint 1, ada.text_io.get_line (item=...) at a-tigeli.adb:41
41  procedure Get_Line
(gdb) print item'first
$1 = -443273216
(gdb) print item'last
$2 = -514850813
(gdb) n
146FIO.Check_Read_Status (AP (File));
(gdb) n
152if Item'First > Item'Last then
(gdb) print item'first
$3 = 1
(gdb) print item'last
$4 = 0
(gdb) up
#1  0x0001f34c in getsubs.fill_table () at getsubs.adb:122
122GET_LINE (INFILE1, A_LINE, A_LENGTH);
(gdb) print a_line'first
$5 = 1
(gdb) print a_line'last
$6 = 400

So yes, we have an ABI change; which is not entirely unexpected. So,
questions

(1) Why does LD_LIBRARY_PATH affect your system, not mine (i.e. if
this is because my gnatmake is building with static linking, then
why). This is maybe the least interesting question so I'm leaving it
for now...
(2) If/when LD_LIBRARY_PATH does have an effect - as you say, things
compiled with host gnatmake, should be run against host libraries, not
against target libraries. Otherwise, potentially *any* gcc ABI change
can break the build process, right? So I think this is of interest
regardless of the ARM AAPCS change, but I will be slightly
presumptuous and hope that the AdaCore folk will pick this up...[CC
Eric]
(3) Has the ARM AAPCS had an effect that we didn't mean it to? I don't
see any evidence so far that this is _necessarily_ the case, but I
will look into this, bearing Florian's advice in mind (thanks!)...

--Alan


Re: [PATCH] libstdc++: Fix libstdc++/67440: pretty-printing of a const set fails

2015-11-25 Thread Alan Lawrence

On 16/11/15 21:04, Doug Evans wrote:

Hi.

Apologies for the delay.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67440

Tested with current trunk.

2015-11-16  Doug Evans  

 PR libstdc++/67440
 * python/libstdcxx/v6/printers.py (find_type): Handle "const" in
 type name.
 * testsuite/libstdc++-prettyprinters/debug.cc: Add test for
 const set.
 * testsuite/libstdc++-prettyprinters/simple.cc: Ditto.
 * testsuite/libstdc++-prettyprinters/simple11.cc: Ditto.


On gcc-5-branch, the debug.cc and simple.cc tests don't seem to compile, on 
either x86_64-none-linux-gnu or aarch64-none-linux-gnu. I get errors like:


/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc: In 
function 'int main()':
/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:77:43: 
error: in C++98 'const_intset' must be initialized by constructor, not by '{...}'

   const std::set const_intset = {2, 3};
   ^
In file included from 
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/map:60:0,
 from 
/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:31:
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_tree.h: 
In instantiation of 'void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, 
_Alloc>::_M_insert_unique(_II, _II) [with _InputIterator = int; _Key = int; _Val 
= int; _KeyOfValue = std::_Identity; _Compare = std::less; _Alloc = 
std::allocator]':
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_set.h:171:4: 
  required from 'std::set<_Key, _Compare, _Alloc>::set(_InputIterator, 
_InputIterator) [with _InputIterator = int; _Key = int; _Compare = 
std::less; _Alloc = std::allocator]'
/work/alalaw01/src/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:77:43: 
  required from here
/work/alalaw01/build_dje/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/stl_tree.h:2224:29: 
error: invalid type argument of unary '*' (have 'int')

_M_insert_unique_(end(), *__first, __an);
 ^
compiler exited with status 1
UNRESOLVED: libstdc++-prettyprinters/simple.cc compilation failed to produce 
executable

Spawning: gdb -nw -nx -quiet -batch -ex "python print(gdb.type_printers)"
spawn gdb -nw -nx -quiet -batch -ex python print(gdb.type_printers)
[]
spawn gdb -nx -nw -quiet -batch -x simple.gdb
simple.gdb:2: Error in sourced command file:
./simple.exe: No such file or directory.
skipping: simple.gdb:2: Error in sourced command file:
skipping: ./simple.exe: No such file or directory.
UNSUPPORTED: libstdc++-prettyprinters/simple.cc

I also see signs of the same on powerpc64le
(https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg02699.html); the tests look
to be passing on trunk on all three platforms.


Thanks, Alan



Re: [patch] GSoC: Implement std::experimental::shared_ptr

2015-11-25 Thread Alan Lawrence

On 14/11/15 13:19, David Edelsohn wrote:

The copy_ctor_neg testcase fails on AIX.


Also on ARM and AArch64 platforms (arm-none-linux-gnueabihf, arm-none-eabi, 
aarch64-none-linux-gnu, aarch64-none-elf), with similar error messages.


Thanks,
Alan






Re: [PATCH] GCC system.h and Graphite header order

2015-11-24 Thread Alan Lawrence
I note doc/install.texi says that gcc uses "ISL Library version 0.15,
0.14, 0.13, or 0.12.2". This patch breaks the build with 0.12.2 (a
subset of errors below), but seems fine with 0.14. I haven't tested
0.13. Do we want to update install.texi ?

Cheers, Alan


In file included from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/list.h:13:0,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/aff_type.h:4,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/local_space.h:4,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/constraint.h:13,
 from /work/alalaw01/src/gcc/gcc/graphite-optimize-isl.c:41:
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:108:8:
error: attempt to use poisoned "malloc"
malloc(size)))
^
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:112:8:
error: attempt to use poisoned "realloc"
realloc(ptr,size)))
^
/usr/include/c++/4.8/bits/locale_facets.h:2566:44: error: macro
"isdigit" passed 2 arguments, but takes just 1
 isdigit(_CharT __c, const locale& __loc)
^
/usr/include/c++/4.8/bits/locale_facets.h:2572:44: error: macro
"ispunct" passed 2 arguments, but takes just 1
 ispunct(_CharT __c, const locale& __loc)
^
/usr/include/c++/4.8/bits/locale_facets.h:2578:45: error: macro
"isxdigit" passed 2 arguments, but takes just 1
 isxdigit(_CharT __c, const locale& __loc)
 ^
/usr/include/c++/4.8/bits/locale_facets.h:2584:44: error: macro
"isalnum" passed 2 arguments, but takes just 1
 isalnum(_CharT __c, const locale& __loc)
^
/usr/include/c++/4.8/bits/locale_facets.h:2590:44: error: macro
"isgraph" passed 2 arguments, but takes just 1
 isgraph(_CharT __c, const locale& __loc)
^
/usr/include/c++/4.8/bits/locale_facets.h:2596:44: error: macro
"toupper" passed 2 arguments, but takes just 1
 toupper(_CharT __c, const locale& __loc)
^
/usr/include/c++/4.8/bits/locale_facets.h:2602:44: error: macro
"tolower" passed 2 arguments, but takes just 1
 tolower(_CharT __c, const locale& __loc)

In file included from /usr/include/c++/4.8/bits/basic_ios.h:37:0,
 from /usr/include/c++/4.8/ios:44,
 from /usr/include/c++/4.8/ostream:38,
 from /usr/include/c++/4.8/iostream:39,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/int.h:17,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/ctx.h:16,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/list.h:13,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/aff_type.h:4,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/local_space.h:4,
 from
/work/alalaw01/build-aarch64-none-elf/host-tools/include/isl/constraint.h:13,
 from /work/alalaw01/src/gcc/gcc/graphite-scop-detection.c:52:
/usr/include/c++/4.8/bits/locale_facets.h:2530:5: error:
‘std::isspace’ declared as an ‘inline’ variable
 isspace(_CharT __c, const locale& __loc)
 ^
/usr/include/c++/4.8/bits/locale_facets.h:2530:5: error: template
declaration of ‘bool std::isspace’
/usr/include/c++/4.8/bits/locale_facets.h:2531:7: error: expected
primary-expression before ‘return’
 { return use_facet >(__loc).is(ctype_base::space, __c); }
   ^
/usr/include/c++/4.8/bits/locale_facets.h:2531:7: error: expected ‘}’
before ‘return’
/usr/include/c++/4.8/bits/locale_facets.h:2536:5: error: ‘isprint’
declared as an ‘inline’ variable
 isprint(_CharT __c, const locale& __loc)
 ^
/usr/include/c++/4.8/bits/locale_facets.h:2536:5: error: template
declaration of ‘bool isprint’
/usr/include/c++/4.8/bits/locale_facets.h:2537:7: error: expected
primary-expression before ‘return’
 { return use_facet >(__loc).is(ctype_base::print, __c); }
   ^
/usr/include/c++/4.8/bits/locale_facets.h:2537:7: error: expected ‘}’
before ‘return’
/usr/include/c++/4.8/bits/locale_facets.h:2537:75: error: expected
declaration before ‘}’ token
 { return use_facet >(__loc).is(ctype_base::print, __c); }


Re: [PATCH] Fix PR68067

2015-11-20 Thread Alan Lawrence
On 6 November 2015 at 10:39, Richard Biener  wrote:
>> ../spec2000/benchspec/CINT2000/254.gap/src/polynom.c:358:11: error: location
>> references block not in block tree
>> l1_279 = PHI <1(28), l1_299(33)>
>
> ^^^
>
> this is the error to look at!  It means that the GC heap will be corrupted
> quite easily.
>

This looked very similar to PR68117 - the invalid phi arg, and block
not in  block-tree, even if not the invalid tree code - and as the
posters there were having success with valgrind, whereas I wasn't, I
watched and waited. First observation is that it triggers the asserts
you suggested in comment 27
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68117#c27). Indeed, it
fails those asserts, even after the patch in comment 25 (committed as
r230594) to tree-ssa.c (delete_tree_ssa), and the patch in comment#35
to function.c (set_cfun), and the patch in comment#30 (committed as
r230424) to cfgexpand.c (pass_expand::execute).

The patch in comment#29 (which replaces the asserts in comment#27 with
empties), however, fixes the problem - although I can't rule out, that
that's just by changing the memory allocation pattern.

Moreover, if I take those patches and rebase onto a recent trunk (onto
which the delete_tree_ssa and pass_expand::execute patches have
already been committed), i.e. just adding the assertions from
comment#27 and the call in function.c (set_cfun) - the assertions are
still failing on my testcase, whereas the original (assertionless)
failure was very erratic, and had since disappeared/been hidden on
trunk. Indeed those same assertions break in a few other places (even
in a --disable-bootstrap build after gcc/xgcc is built), so I feel I
have a good chance of producing a reasonable assertion-breaking
testcase.

So I have to ask, how sure are you that those assertions are(/should
be!) "correct"? :)

--Alan


Re: [PATCH] Avoid useless work in loop vectorization

2015-11-18 Thread Alan Lawrence

On 13/11/15 08:41, Richard Biener wrote:


Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2015-11-13  Richard Biener  

* tree-vect-loop.c (vect_analyze_loop_2): Add fatal parameter.
Signal fatal failure if early checks fail.
(vect_analyze_loop): If vect_analyze_loop_2 fails fatally
do not bother testing further vector sizes.


It seems that on AArch64 this causes:

FAIL: gcc.dg/vect/vect-outer-1-big-array.c -flto -ffat-lto-objects 
scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1-big-array.c scan-tree-dump-times vect "grouped 
access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1.c scan-tree-dump-times vect "grouped access in 
outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a-big-array.c -flto -ffat-lto-objects 
scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a-big-array.c scan-tree-dump-times vect "grouped 
access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1a.c scan-tree-dump-times vect "grouped access in 
outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b-big-array.c -flto -ffat-lto-objects 
scan-tree-dump-times vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b-big-array.c scan-tree-dump-times vect "grouped 
access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-1b.c scan-tree-dump-times vect "grouped access in 
outer loop" 2
FAIL: gcc.dg/vect/vect-outer-2b.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "grouped access in outer loop" 2
FAIL: gcc.dg/vect/vect-outer-2b.c scan-tree-dump-times vect "grouped access in 
outer loop" 2
FAIL: gcc.dg/vect/vect-outer-3b.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "grouped access in outer loop" 4
FAIL: gcc.dg/vect/vect-outer-3b.c scan-tree-dump-times vect "grouped access in 
outer loop" 4


Still there on r230556, I haven't dug any further yet.

Thanks, Alan



Re: C++ PATCH to integrate c++-delayed-folding branch

2015-11-17 Thread Alan Lawrence

On 14/11/15 00:07, Jason Merrill wrote:

And here's the final patch integrating the delayed folding branch.  The general
idea is to mostly avoid folding until the end of the function, at which point we
fold everything as part of genericization.  Since many warnings rely on looking
at folded trees, we fold the arguments that we pass to the various warning
functions.  To avoid issues with combinatorial explosion, we cache the results
of folding in a hash_map.

In the future we probably want to move many of these warnings into
language-independent code so we can avoid folding for them in the front end; we
also shouldn't need to fold during genericization, but apparently not doing that
leads to optimization regressions.

This is mostly Kai's work; I just cleaned it up a bit to get it ready for the
merge.  Marek also helped some.  The largest change is hooking into the GTY
machinery to handle throwing away the folding cache rather than trying to do it
manually in various places.

Tested x86_64-pc-linux-gnu, applying to trunk.


Also PR68385 (arm-none-eabi).



Re: [PATCH] PR/67682, break SLP groups up if only some elements match

2015-11-17 Thread Alan Lawrence

On 16/11/15 14:42, Christophe Lyon wrote:


Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look at more details yet, but I guess you can
reproduce it quickly enough.



Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  :
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}

but while a[0...3] end up containing 5 7 9 11 as expected,
a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70.
That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10), being 
multiplied by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 
pair that should extract elements {0,2,4,6} of b. Assembler, with my workings as 
to what's in each register:


test:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw r2, #:lower16:b
movt r2, #:upper16:b
vldr d22, .L11
vldr d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
movw r3, #:lower16:a
movt r3, #:upper16:a
vld1.64 {d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
vmov q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
vldr d20, [r2, #16]
vldr d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
vuzp.32 q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
vldr d20, .L11+16
vldr d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
vmul.i32 q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
vadd.i32 q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
vst1.64 {d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
vstr d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
vstr d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
bx  lr
.L12:
.align  3
.L11:
.word   3
.word   4
.word   5
.word   7
.word   1
.word   2
.word   3
.word   4

Which is to say - the bit order in the q-registers is neither big- nor 
little-endian, but the elements get stored back to memory in an order consistent 
with how they were loaded, so we're OK as long as there are no permutes. 
Unfortunately for UZP this lane-ordering mixup is not idempotent and messes 
everything up...
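
(For reference, the scalar computation here boils down to roughly the following
- my paraphrase from the dumps above, not the testcase file verbatim:)

int a[8], b[8];

void
test (void)
{
  /* First group: a straight V4SI load + add + store, which comes out right.  */
  a[0] = b[0] + 1;
  a[1] = b[1] + 2;
  a[2] = b[2] + 3;
  a[3] = b[3] + 4;
  /* Second group: needs elements {0,2,4,6} of b, hence the VEC_PERM_EXPR /
     UZP above - and that is where the armeb lane numbering goes wrong.  */
  a[4] = b[0] * 3;
  a[5] = b[2] * 4;
  a[6] = b[4] * 5;
  a[7] = b[6] * 7;
}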


Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing 
armeb's whole register-numbering/lane-flipping scheme, and might be quite a 
large task. OTOH it might also fix the significant number of failing vectorizer 
tests. A simpler solution might be to disable...some part of vector 
support on armeb, but I'm not sure which part would be best, yet.

Thoughts (CC maintainers)?

--Alan



Re: [PATCH][i386]Migrate reduction optabs to reduc__scal

2015-11-16 Thread Alan Lawrence

On 03/11/15 14:27, Alan Lawrence wrote:

This migrates the various reduction optabs in sse.md to use the reduce-to-scalar
form. I took the straightforward approach (equivalent to the migration code in
expr.c/optabs.c) of generating a vector temporary, using the existing code to
reduce to that, and extracting lane 0, in each pattern.
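
(For context - a generic illustration of the kind of reduction these patterns
serve, not one of the testcases; with -O3 -ffast-math the loop below is
vectorized and the epilogue goes through reduc_plus_scal_*, which with the
renamed optab produces the scalar result directly instead of leaving it in
lane 0 of a vector temporary:)

double
sum (const double *x, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += x[i];
  return s;
}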

Bootstrapped + check-gcc + check-g++.

Ok for trunk?

gcc/ChangeLog:

* config/i386/sse.md (reduc_splus_v8df): Rename to...
(reduc_plus_scal_v8df): ...here; reduce to temp and extract scalar.

(reduc_splus_v4df): Rename to...
(reduc_plus_scal_v4df): ...here; reduce to temp and extract scalar.

(reduc_splus_v2df): Rename to...
(reduc_plus_scal_v2df): ...here; reduce to temp and extract scalar.

(reduc_splus_v16sf): Rename to...
(reduc_plus_scal_v16sf): ...here; reduce to temp and extract scalar.

(reduc_splus_v8sf): Rename to...
(reduc_plus_scal_v8sf): ...here; reduce to temp and extract scalar.

(reduc_splus_v4sf): Rename to...
(reduc_plus_scal_v4sf): ...here; reduce to temp and extract scalar.

(reduc__, all 3 variants): Rename each to...
(reduc__scal_): ...here; reduce to temp and extract scalar.

(reduc_umin_v8hf): Rename to...
(reduc_umin_scal_v8hf): ...here; reduce to temp and extract scalar.
---
  gcc/config/i386/sse.md | 82 +++---
  1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 43dcc6a..041e514 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2424,73 +2424,85 @@
 (set_attr "prefix_rep" "1,*")
 (set_attr "mode" "V4SF")])

-(define_expand "reduc_splus_v8df"
-  [(match_operand:V8DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8df"
+  [(match_operand:DF 0 "register_operand")
 (match_operand:V8DF 1 "register_operand")]
"TARGET_AVX512F"
  {
-  ix86_expand_reduc (gen_addv8df3, operands[0], operands[1]);
+  rtx tmp = gen_reg_rtx (V8DFmode);
+  ix86_expand_reduc (gen_addv8df3, tmp, operands[1]);
+  emit_insn (gen_vec_extractv8df (operands[0], tmp, const0_rtx));
DONE;
  })

-(define_expand "reduc_splus_v4df"
-  [(match_operand:V4DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v4df"
+  [(match_operand:DF 0 "register_operand")
 (match_operand:V4DF 1 "register_operand")]
"TARGET_AVX"
  {
rtx tmp = gen_reg_rtx (V4DFmode);
rtx tmp2 = gen_reg_rtx (V4DFmode);
+  rtx vec_res = gen_reg_rtx (V4DFmode);
emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
+  emit_insn (gen_vec_extractv4df (operands[0], vec_res, const0_rtx));
DONE;
  })

-(define_expand "reduc_splus_v2df"
-  [(match_operand:V2DF 0 "register_operand")
+(define_expand "reduc_plus_scal_v2df"
+  [(match_operand:DF 0 "register_operand")
 (match_operand:V2DF 1 "register_operand")]
"TARGET_SSE3"
  {
-  emit_insn (gen_sse3_haddv2df3 (operands[0], operands[1], operands[1]));
+  rtx tmp = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
+  emit_insn (gen_vec_extractv2df (operands[0], tmp, const0_rtx));
DONE;
  })

-(define_expand "reduc_splus_v16sf"
-  [(match_operand:V16SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v16sf"
+  [(match_operand:SF 0 "register_operand")
 (match_operand:V16SF 1 "register_operand")]
"TARGET_AVX512F"
  {
-  ix86_expand_reduc (gen_addv16sf3, operands[0], operands[1]);
+  rtx tmp = gen_reg_rtx (V16SFmode);
+  ix86_expand_reduc (gen_addv16sf3, tmp, operands[1]);
+  emit_insn (gen_vec_extractv16sf (operands[0], tmp, const0_rtx));
DONE;
  })

-(define_expand "reduc_splus_v8sf"
-  [(match_operand:V8SF 0 "register_operand")
+(define_expand "reduc_plus_scal_v8sf"
+  [(match_operand:SF 0 "register_operand")
 (match_operand:V8SF 1 "register_operand")]
"TARGET_AVX"
  {
rtx tmp = gen_reg_rtx (V8SFmode);
rtx tmp2 = gen_reg_rtx (V8SFmode);
+  rtx vec_res = gen_reg_rtx (V8SFmode);
emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
-  emit_insn (gen_addv8sf3 (operands[0], tmp, tmp2));
+  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
+  emit_insn (gen_vec_extractv8sf (operands[0], vec_res, const0_rtx));
DONE;
  })

-(define_expand "

[Obvious][AArch64] Fix gcc.target/aarch64/vclz.c

2015-11-16 Thread Alan Lawrence
The scan-assembler vclz2s 34 test in here has been failing since r230091:

* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
cost favor vectorized version.

which transforms a number of piecewise array assignments (at the tree level) into
2-element-vector writes to MEMs; the first RUN_TEST for the V2SI cases then gets
constant-propagated, because the INHIB_OPTIMIZATION (which clobbers the
data arrays) comes after the load from those arrays.
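
The effect is the same as in this stand-alone sketch (illustration only, with
made-up names, not the testcase itself):

/* With the clobber placed after the load, the compiler can still fold the
   value it has already loaded from the known initializer.  */
int
f (void)
{
  int data[2] = { 1, 2 };
  int x = data[0];                              /* x is known to be 1 here... */
  asm volatile ("" : : "r" (data) : "memory");  /* ...so this comes too late  */
  return __builtin_clz (x);                     /* folded; no clz instruction */
}

Moving the asm before the load forces the compiler to assume data may have
changed, so the load and the clz must actually be emitted.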

Hence, put the INHIB_OPTIMIZATION in the right place.

Moreover, the memory clobber does not affect the registers now holding the
values loaded anyway, so we can drop the second one altogether.

Tested on aarch64-none-linux-gnu, committed as r230421.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vclz.c: Correctly place INHIB_OPTIMIZATION.
---
 gcc/testsuite/gcc.target/aarch64/vclz.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/vclz.c 
b/gcc/testsuite/gcc.target/aarch64/vclz.c
index 455ba63..60494a8 100644
--- a/gcc/testsuite/gcc.target/aarch64/vclz.c
+++ b/gcc/testsuite/gcc.target/aarch64/vclz.c
@@ -67,18 +67,13 @@ extern void abort (void);
   CONCAT1 (vclz, POSTFIX (reg_len, data_len, is_signed))
 
 #define RUN_TEST(test_set, answ_set, reg_len, data_len, is_signed, n)  \
+  INHIB_OPTIMIZATION;  \
   a = LOAD_INST (reg_len, data_len, is_signed) (test_set); \
   b = LOAD_INST (reg_len, data_len, is_signed) (answ_set); \
-  INHIB_OPTIMIZATION;  \
   a = CLZ_INST (reg_len, data_len, is_signed) (a); \
   for (i = 0; i < n; i++)  \
-{  \
-  INHIB_OPTIMIZATION;  \
-  if (a [i] != b [i])  \
-{  \
-  return 1;\
-}  \
-}
+if (a [i] != b [i])  \
+  return 1;
 
 int
 test_vclz_s8 ()
-- 
1.9.1



[PATCH][install.texi] Add note against GNAT 4.8 on ARM targets.

2015-11-16 Thread Alan Lawrence
This follows from the discussion here: 
https://gcc.gnu.org/ml/gcc/2015-10/msg00082.html .

OK for trunk?

--Alan

gcc/ChangeLog:

* doc/install.texi: Add note against GNAT 4.8 on ARM targets.
---
 gcc/doc/install.texi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 1fd773e..1ce93d4 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -3481,6 +3481,8 @@ require GNU binutils 2.13 or newer.  Such subtargets 
include:
 @code{arm-*-netbsdelf}, @code{arm-*-*linux-*}
 and @code{arm-*-rtemseabi}.
 
+Building the Ada frontend commonly fails (an infinite loop executing 
@code{xsinfo}) if the host compiler is GNAT 4.8.  Host compilers built from the 
GNAT 4.6, 4.9 or 5 release branches are known to succeed.
+
 @html
 
 @end html
-- 
1.9.1



Re: [PATCH] Improve BB vectorization dependence analysis

2015-11-16 Thread Alan Lawrence

On 09/11/15 12:55, Richard Biener wrote:


Currently BB vectorization computes all dependences inside a BB
region and fails all vectorization if it cannot handle some of them.

This is obviously not needed - BB vectorization can restrict the
dependence tests to those that are needed to apply the load/store
motion effectively performed by the vectorization (sinking all
participating loads/stores to the place of the last one).

With restructuring it that way it's also easy to not give up completely
but only for the SLP instance we cannot vectorize (this gives
a slight bump in my SPEC CPU 2006 testing to 756 vectorized basic
block regions).

But first and foremost this patch is to reduce the dependence analysis
cost and somewhat mitigate the compile-time effects of the first patch.

For fixing PR56118 only a cost model issue remains.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2015-11-09  Richard Biener  

PR tree-optimization/56118
* tree-vectorizer.h (vect_find_last_scalar_stmt_in_slp): Declare.
* tree-vect-slp.c (vect_find_last_scalar_stmt_in_slp): Export.
* tree-vect-data-refs.c (vect_slp_analyze_node_dependences): New
function.
(vect_slp_analyze_data_ref_dependences): Instead of computing
all dependences of the region DRs just analyze the code motions
SLP vectorization will perform.  Remove SLP instances that
cannot have their store/load motions applied.
(vect_analyze_data_refs): Allow DRs without a vectype
in BB vectorization.

* gcc.dg/vect/no-tree-sra-bb-slp-pr50730.c: Adjust.


Since this, I've been seeing an ICE on gfortran.dg/vect/vect-9.f90 on both 
aarch64-none-linux-gnu and arm-none-linux-gnueabihf:


spawn /home/alalaw01/build/gcc/testsuite/gfortran4/../../gfortran 
-B/home/alalaw01/build/gcc/testsuite/gfortran4/../../ 
-B/home/alalaw01/build/aarch64-unknown-linux-gnu/./libgfortran/ 
/home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90 
-fno-diagnostics-show-caret -fdiagnostics-color=never -O -O2 -ftree-vectorize 
-fvect-cost-model=unlimited -fdump-tree-vect-details -Ofast -S -o vect-9.s
/home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90:5:0: Error: 
definition in block 13 follows the use for SSA_NAME: _339 in statement:

vectp.156_387 = &*cc_36(D)[_339];
/home/alalaw01/gcc/gcc/testsuite/gfortran.dg/vect/vect-9.f90:5:0: internal 
compiler error: verify_ssa failed

0xcfc61b verify_ssa(bool, bool)
../../gcc-fsf/gcc/tree-ssa.c:1039
0xa2fc0b execute_function_todo
../../gcc-fsf/gcc/passes.c:1952
0xa30393 do_per_function
../../gcc-fsf/gcc/passes.c:1632
0xa3058f execute_todo
../../gcc-fsf/gcc/passes.c:2000
Please submit a full bug report...
FAIL: gfortran.dg/vect/vect-9.f90   -O  (internal compiler error)
FAIL: gfortran.dg/vect/vect-9.f90   -O  (test for excess errors)

Still there (on aarch64) at r230329.

--Alan



Re: [PATCH] PR/67682, break SLP groups up if only some elements match

2015-11-13 Thread Alan Lawrence
On 10/11/15 12:51, Richard Biener wrote:
>>
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 
>> and
>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be splitted again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
> the maximum splitting up-frond and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1, as my understanding
was that we only do that once, whereas we retry the vect_analyze_slp_instance
path each time we decrease the vectorization_factor. However, I did try putting
code at the beginning of vect_analyze_slp_instance to split up any groups > vf.
Unfortunately this loses us some previously-successful SLPs, as some bigger
groups cannot be SLPed if we split them - as a whole they require 'unrolling' -
so I'm not addressing that here.

However your suggestion of splitting twice when we know the boundary is in the
middle of a vector is a nice compromise; it nets us a good number more
successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
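
To spell out the boundary arithmetic, here is a stand-alone sketch (hypothetical
helper names, not the committed GCC code; vf is assumed to be a power of two):

#include <stdio.h>
#include <stdbool.h>

static void
split_points (const bool *matches, int group_size, int vf)
{
  int i = 0;
  while (i < group_size && matches[i])
    i++;
  /* Only worth splitting if at least one whole vector matched and
     something is left over.  */
  if (i < vf || i == group_size)
    return;
  int group1_size = i & ~(vf - 1);   /* round down to a vector boundary */
  printf ("first subgroup: [0, %d)\n", group1_size);
  if (group1_size < i)
    {
      /* Mismatch mid-vector: the next vf elements surely won't form a
         vector on their own, so drop them rather than re-analysing.  */
      printf ("dropped:        [%d, %d)\n", group1_size, group1_size + vf);
      group1_size += vf;
    }
  if (group1_size < group_size)
    printf ("re-analyse:     [%d, %d)\n", group1_size, group_size);
}

int
main (void)
{
  /* Richard's example: vf = 4, matches = 1,1,1,1,1,1,0,...  */
  bool m[12] = { 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 };
  split_points (m, 12, 4);   /* [0,4) kept, [4,8) dropped, [8,12) re-analysed */
  return 0;
}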

Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others
up as we don't do that one anymore.)

Cheers, Alan

gcc/ChangeLog:

PR tree-optimization/67682
* tree-vect-slp.c (vect_split_slp_store_group): New.
(vect_analyze_slp_instance): During basic block SLP, recurse on
subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

PR tree-optimization/67682
* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
* gcc.dg/vect/bb-slp-subgroups-1.c: New.
* gcc.dg/vect/bb-slp-subgroups-2.c: New.
* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c   | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +
 gcc/tree-vect-slp.c| 85 +-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-  || out[1] != (in[1] + 142) * y
+  || out[1] != (in[1] * 142) * y
   || out[2] != (in[2] + 2) * x
   || out[3] != (in[3] * 31) * y)
 abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+a[0] = b[0];
+a[1] = b[1];
+a[2] = b[2];
+a[3] = b[3];
+a[4] = 0;
+a[5] = 0;
+a[6] = 0;
+a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+a[i] 

Re: [PATCH] Fix PR56118

2015-11-13 Thread Alan Lawrence

On 10/11/15 09:34, Richard Biener wrote:


The following fixes PR56118 by adjusting the cost model handling of
basic-block vectorization to favor the vectorized version in case
estimated cost is the same as the estimated cost of the scalar
version.  This makes sense because we over-estimate the vectorized
cost in several places.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-10  Richard Biener  

PR tree-optimization/56118
* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
cost favor vectorized version.

* gcc.target/i386/pr56118.c: New testcase.


On AArch64 and ARM targets, this causes PASS->FAIL of

gcc.dg/vect/bb-slp-32.c scan-tree-dump slp2 "vectorization is not profitable"
gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2 
"vectorization is not profitable"


that sounds like a good thing ;), so I imagine the xfail directive may just 
need updating. The test also looks to be failing on powerpc64 (according to 
https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg01327.html).
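
i.e. something along these lines, untested and with the exact target list still
to be confirmed:

/* { dg-final { scan-tree-dump "vectorization is not profitable" "slp2" { xfail { aarch64*-*-* arm*-*-* powerpc64*-*-* } } } } */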


Regards, Alan



Re: [PATCH 6/6] Make SRA replace constant-pool loads

2015-11-12 Thread Alan Lawrence
On 06/11/15 16:29, Richard Biener wrote:
>>> 2) You should be able to use fold_ctor_reference directly (in place
>> of
>>> all your code
>>> in case offset and size are readily available - don't remember
>> exactly how
>>> complete scalarization "walks" elements).  Alternatively use
>>> fold_const_aggregate_ref.

Well, yes I was able to remove subst_constant_pool_initial by using
fold_ctor_reference, with some fairly strict checks about the access expressions
found while scanning the input. I still needed to either deep-scan the constant
value itself, or allow completely_scalarize to fail, due to Ada c460010 where
a variable has DECL_INITIAL: {VIEW_CONVERT_EXPR({1, 2, 3})}
(that is, a CONSTRUCTOR whose only element is a V.C.E. of another CONSTRUCTOR).

However:

> I tried to suggest creating piecewise loads from the constant pool as opposed 
> to the aggregate one.  But I suppose the difficulty is to determine 'pieces', 
> not their values (that followup passes and folding can handle quite 
> efficiently).

I've now got this working, and it simplifies the patch massively :).

We now replace e.g. a constant-pool load a = *.LC0, with a series of loads e.g.
SR_a1 = SRlc0_1
SR_a2 = SRlc0_2
etc. etc. (well those wouldn't be quite the names, but you get the idea.)

And then at the start of the function we initialise
SRlc0_1 = *.LC0[0]
SRlc0_2 = *.LC0[1]
etc. etc.
- I needed to use SRlc0_1 etc. variables because some of the constant-pool loads
coming from Ada are rather more complicated than 'a = *.LC0' and substituting
the access expression (e.g. '*.LC0[0].foo'), rather than the replacement_decl,
into those leads to just too many verify_gimple failures.
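
For a concrete picture of the kind of input this targets, a hand-written example
(shaped like the ssa-dom-cse-2.c case, not copied from it):

/* After full unrolling, 'a' is only read at constant indices, but its
   initializer lives in the constant pool, so the function starts with an
   aggregate copy 'a = *.LC0'.  With this patch esra emits SRlc0_N = *.LC0[N]
   loads at the top of the function and rewrites the element reads to use the
   scalar replacements; the per-element constant-pool loads are then resolved
   by ccp.  */
int
foo (void)
{
  int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  int i, sum = 0;
  for (i = 0; i < 8; i++)
    sum += a[i];
  return sum;
}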

However, the initialization seems to work well enough, if I put a check into
sra_modify_expr to avoid changing 'SRlc0_1 = *.LC0[0]' into 'SRlc0_1 = SRlc0_1'
(!). Slightly hacky perhaps but harmless as there cannot be a statement writing
to a scalar replacement prior to sra_modify_expr.

Also I had to prevent writing scalar replacements back to the constant pool
(gnat opt31.adb).

Finally - I've left the disqualified_constants code in, but am now only using it
for an assert...so I guess it'd be reasonable to take that out. In an ideal
world we could do those checks only in checking builds but allocating bitmaps
only for checking builds feels like it would make checking vs non-checking
diverge too much.

This is now passing bootstrap+check-{gcc,g++,fortran} on AArch64, and the same
plus Ada on x86_64 and ARM, except for some additional guality failures in
pr54970-1.c on AArch64 (tests which are already failing on ARM) - I haven't had
any success in figuring those out yet, suggestions welcome.

However, without the DOM patches, this does not fix the original ssa-dom-cse-2.c
testcase, even though the load is scalarized in esra and the individual element
loads are resolved in ccp2. I don't think I'll be able to respin those between now
and stage 1 ending on Saturday, hence I've had to drop the testsuite change and
will merely describe the remaining problem in PR/63679. I'm
now benchmarking on AArch64 to see what difference this patch makes on its own.

Thoughts?

gcc/ChangeLog:

PR target/63679
* tree-sra.c (disqualified_constants): New.
(sra_initialize): Allocate disqualified_constants.
(sra_deinitialize): Free disqualified_constants.
(disqualify_candidate): Update disqualified_constants when appropriate.
(create_access): Scan for constant-pool entries as we go along.
(scalarizable_type_p): Add check against type_contains_placeholder_p.
(maybe_add_sra_candidate): Allow constant-pool entries.
(initialize_constant_pool_replacements): New.
(sra_modify_expr): Avoid mangling assignments created by previous.
(sra_modify_function_body): Call initialize_constant_pool_replacements.
---
 gcc/tree-sra.c | 93 --
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 1d4a632..aa60f06 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -325,6 +325,10 @@ candidate (unsigned uid)
those which cannot be (because they are and need be used as a whole).  */
 static bitmap should_scalarize_away_bitmap, cannot_scalarize_away_bitmap;
 
+/* Bitmap of candidates in the constant pool, which cannot be scalarized
+   because this would produce non-constant expressions (e.g. Ada).  */
+static bitmap disqualified_constants;
+
 /* Obstack for creation of fancy names.  */
 static struct obstack name_obstack;
 
@@ -647,6 +651,7 @@ sra_initialize (void)
 (vec_safe_length (cfun->local_decls) / 2);
   should_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
   cannot_scalarize_away_bitmap = BITMAP_ALLOC (NULL);
+  disqualified_constants = BITMAP_ALLOC (NULL);
   gcc_obstack_init (&name_obstack);
   base_access_vec = new hash_map<tree, auto_vec<access_p> >;
   memset (&sra_stats, 0, sizeof (sra_stats));
@@ 

Re: [PR64164] drop copyrename, integrate into expand

2015-11-10 Thread Alan Lawrence

On 05/11/15 05:08, Alexandre Oliva wrote:

[PR67753] fix copy of PARALLEL entry_parm to CONCAT target_reg
for  gcc/ChangeLog

PR rtl-optimization/67753
PR rtl-optimization/64164
* function.c (assign_parm_setup_block): Avoid allocating a
stack slot if we don't have an ABI-reserved one.  Emit the
copy to target_reg in the conversion seq if the copy from
entry_parm is in it too.  Don't use the conversion seq to copy
a PARALLEL to a REG or a CONCAT.


Since this change, we have on aarch64_be:

FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O1
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O2
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -O3 -g
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -Os
FAIL: gcc.target/aarch64/aapcs64/func-ret-4.c execution,  -Og -g

The difference in the assembler looks as follows (this is at -Og):

 func_return_val_10:
-   sub sp, sp, #16
-   lsr x2, x1, 48
-   lsr x1, x1, 32
+   ubfx x2, x1, 16, 16
fmov x3, d0
// Start of user assembly
 // 23 "func-ret-4.c" 1
mov x0, x30
 // 0 "" 2
// End of user assembly
adrp x3, saved_return_address
str x0, [x3, #:lo12:saved_return_address]
adrp x0, myfunc
add x0, x0, :lo12:myfunc
// Start of user assembly
 // 23 "func-ret-4.c" 1
mov x30, x0
 // 0 "" 2
// End of user assembly
bfi w0, w2, 16, 16
bfi w0, w1, 0, 16
lsl x0, x0, 32
-   add sp, sp, 16

(ubfx is a bitfield extract, the first immediate is the lsbit, the second the 
width. lsr = logical shift right.) And in the RTL dump, this (before the patch):


(insn 4 3 5 2 (set (mem/c:DI (plus:DI (reg/f:DI 68 virtual-stack-vars)
(const_int -8 [0xfff8])) [0 t+0 S8 A64])
(reg:DI 1 x1)) func-ret-4.c:23 -1
 (nil))
(insn 5 4 6 2 (set (reg:HI 78 [ t ])
(mem/c:HI (plus:DI (reg/f:DI 68 virtual-stack-vars)
(const_int -8 [0xfff8])) [0 t+0 S2 A64])) 
func-ret-4.c:23 -1

 (nil))
(insn 6 5 7 2 (set (reg:HI 79 [ t+2 ])
(mem/c:HI (plus:DI (reg/f:DI 68 virtual-stack-vars)
(const_int -6 [0xfffa])) [0 t+2 S2 A16])) 
func-ret-4.c:23 -1

 (nil))

becomes (after the patch):

(insn 4 3 5 2 (set (subreg:SI (reg:CHI 80) 0)
(reg:SI 1 x1 [ t ])) func-ret-4.c:23 -1
 (nil))
(insn 5 4 6 2 (set (reg:SI 81)
(subreg:SI (reg:CHI 80) 0)) func-ret-4.c:23 -1
 (nil))
(insn 6 5 7 2 (set (subreg:DI (reg:HI 82) 0)
(zero_extract:DI (subreg:DI (reg:SI 81) 0)
(const_int 16 [0x10])
(const_int 16 [0x10]))) func-ret-4.c:23 -1
 (nil))
(insn 7 6 8 2 (set (reg:HI 78 [ t ])
(reg:HI 82)) func-ret-4.c:23 -1
 (nil))
(insn 8 7 9 2 (set (reg:SI 83)
(subreg:SI (reg:CHI 80) 0)) func-ret-4.c:23 -1
 (nil))
(insn 9 8 10 2 (set (reg:HI 79 [ t+2 ])
(subreg:HI (reg:SI 83) 2)) func-ret-4.c:23 -1
 (nil))

--Alan



Re: [Patch AArch64] Switch constant pools to separate rodata sections.

2015-11-10 Thread Alan Lawrence

On 10/11/15 16:39, Alan Lawrence wrote:

Since r229878, I've been seeing

FAIL: gcc.dg/attr-weakref-1.c (test for excess errors)
UNRESOLVED: gcc.dg/attr-weakref-1.c compilation failed to produce executable

(both previously passing) on aarch64-none-elf, aarch64_be-none-elf, and
aarch64-none-linux-gnu.


Ah, these are fixed by Ramana's partial rollback (r230085).

--Alan



Re: [Patch AArch64] Switch constant pools to separate rodata sections.

2015-11-10 Thread Alan Lawrence

On 04/11/15 14:26, Ramana Radhakrishnan wrote:


True and I've just been reading more of the backend - We could now start using 
blocks for constant pools as well. So let's do that.

How does something like this look ?

Tested on aarch64-none-elf - no regressions.

2015-11-04  Ramana Radhakrishnan  

 * config/aarch64/aarch64.c
 (aarch64_can_use_per_function_literal_pools_p): New.
 (aarch64_use_blocks_for_constant_p): Adjust declaration
 and use aarch64_can_use_function_literal_pools_p.
 (aarch64_select_rtx_section): Update.



Since r229878, I've been seeing

FAIL: gcc.dg/attr-weakref-1.c (test for excess errors)
UNRESOLVED: gcc.dg/attr-weakref-1.c compilation failed to produce executable

(both previously passing) on aarch64-none-elf, aarch64_be-none-elf, and 
aarch64-none-linux-gnu. Here's a log from aarch64_be-none-elf (the others look 
similar):


/work/alalaw01/build-aarch64_be-none-elf/obj/gcc2/gcc/xgcc 
-B/work/alalaw01/build-aarch64_be-none-elf/obj/gcc2/gcc/ 
/work/alalaw01/src/gcc/gcc/testsuite/gcc.dg/attr-weakref-1.c 
-fno-diagnostics-show-caret -fdiagnostics-color=never -O2 
/work/alalaw01/src/gcc/gcc/testsuite/gcc.dg/attr-weakref-1a.c 
-specs=aem-validation.specs -lm -o ./attr-weakref-1.exe

/tmp/ccEfngi6.o:(.rodata.cst8+0x30): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x38): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x60): undefined reference to `wf12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x68): undefined reference to `wf12'
collect2: error: ld returned 1 exit status
compiler exited with status 1
output is:
/tmp/ccEfngi6.o:(.rodata.cst8+0x30): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x38): undefined reference to `wv12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x60): undefined reference to `wf12'
/tmp/ccEfngi6.o:(.rodata.cst8+0x68): undefined reference to `wf12'
collect2: error: ld returned 1 exit status

FAIL: gcc.dg/attr-weakref-1.c (test for excess errors)


