Re: [PATCH] Fix PR84512

2018-03-21 Thread Rainer Orth
Hi Eric,

>> So it failed before Toms original patch.  Please add sparc-solaris
>> to the list of XFAILed targets.
>
> SPARC/Linux is affected too so sparc*-*-* instead.

actually, it's sparc*-*-* && lp64 only.  Done like this after testing on
sparc-sun-solaris2.11 and i386-pc-solaris2.11.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2018-03-21  Rainer Orth  

* gcc.dg/tree-ssa/pr84512.c: xfail on 64-bit SPARC.

# HG changeset patch
# Parent  50996d41bbbc78ab2cf0002ba6479559089a2337
xfail gcc.dg/tree-ssa/pr84512.c on 64-bit sparc

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -13,4 +13,4 @@ int foo()
 }
 
 /* Target nvptx xfail due to PR84958.  */
-/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* } } } */
+/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { nvptx*-*-* || { sparc*-*-* && lp64 } } } } } */


Re: [PATCH] Fix PR84512

2018-03-21 Thread Eric Botcazou
> So it failed before Toms original patch.  Please add sparc-solaris
> to the list of XFAILed targets.

SPARC/Linux is affected too so sparc*-*-* instead.

-- 
Eric Botcazou


Re: [PATCH] Fix PR84512

2018-03-21 Thread Richard Biener
On Tue, 20 Mar 2018, Rainer Orth wrote:

> Hi Tom,
> 
> > On 03/19/2018 10:11 AM, Richard Biener wrote:
> >> On Fri, 16 Mar 2018, Tom de Vries wrote:
> >>
> >>> On 03/16/2018 12:55 PM, Richard Biener wrote:
>  On Fri, 16 Mar 2018, Tom de Vries wrote:
> 
> > On 02/27/2018 01:42 PM, Richard Biener wrote:
> >> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> >> ===
> >> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c(nonexistent)
> >> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c(working copy)
> >> @@ -0,0 +1,15 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O3 -fdump-tree-optimized" } */
> >> +
> >> +int foo()
> >> +{
> >> +  int a[10];
> >> +  for(int i = 0; i < 10; ++i)
> >> +a[i] = i*i;
> >> +  int res = 0;
> >> +  for(int i = 0; i < 10; ++i)
> >> +res += a[i];
> >> +  return res;
> >> +}
> >> +
> >> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
> >
> > This fails for nvptx, because it doesn't have the required vector
> > operations.
> > To fix the fail, I've added requiring effective target vect_int_mult.
> 
>  On targets that do not vectorize you should see the scalar loops unrolled
>  instead.  Or do you have only one loop vectorized?
> >>>
> >>> Sort of. Loop vectorization has no effect, and the scalar loops are 
> >>> completely
> >>> unrolled. But then slp vectorization vectorizes the stores.
> >>>
> >>> So at optimized we have:
> >>> ...
> >>>MEM[(int *)] = { 0, 1 };
> >>>MEM[(int *) + 8B] = { 4, 9 };
> >>>MEM[(int *) + 16B] = { 16, 25 };
> >>>MEM[(int *) + 24B] = { 36, 49 };
> >>>MEM[(int *) + 32B] = { 64, 81 };
> >>>_6 = a[0];
> >>>_28 = a[1];
> >>>res_29 = _6 + _28;
> >>>_35 = a[2];
> >>>res_36 = res_29 + _35;
> >>>_42 = a[3];
> >>>res_43 = res_36 + _42;
> >>>_49 = a[4];
> >>>res_50 = res_43 + _49;
> >>>_56 = a[5];
> >>>res_57 = res_50 + _56;
> >>>_63 = a[6];
> >>>res_64 = res_57 + _63;
> >>>_70 = a[7];
> >>>res_71 = res_64 + _70;
> >>>_77 = a[8];
> >>>res_78 = res_71 + _77;
> >>>_2 = a[9];
> >>>res_11 = _2 + res_78;
> >>>a ={v} {CLOBBER};
> >>>return res_11;
> >>> ...
> >>>
> >>> The stores and loads are eliminated by dse1 in the rtl phase, and in the 
> >>> end
> >>> we have:
> >>> ...
> >>> .visible .func (.param.u32 %value_out) foo
> >>> {
> >>>  .reg.u32 %value;
> >>>  .local .align 16 .b8 %frame_ar[48];
> >>>  .reg.u64 %frame;
> >>>  cvta.local.u64 %frame, %frame_ar;
> >>>  mov.u32 %value, 285;
> >>>  st.param.u32[%value_out], %value;
> >>>  ret;
> >>> }
> >>> ...
> >>>
>  That's precisely
>  what the PR was about...  which means it isn't fixed for nvptx :/
> >>>
> >>> Indeed the assembly is not optimal, and would be optimal if we'd have 
> >>> optimal
> >>> code at optimized.
> >>>
> >>> FWIW, using this patch we generate optimal code at optimized:
> >>> ...
> >>> diff --git a/gcc/passes.def b/gcc/passes.def
> >>> index 3ebcfc30349..6b64f600c4a 100644
> >>> --- a/gcc/passes.def
> >>> +++ b/gcc/passes.def
> >>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
> >>> NEXT_PASS (pass_tracer);
> >>> NEXT_PASS (pass_thread_jumps);
> >>> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> >>> +  NEXT_PASS (pass_fre);
> >>> NEXT_PASS (pass_strlen);
> >>> NEXT_PASS (pass_thread_jumps);
> >>> NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
> >>> ...
> >>>
> >>> and we get:
> >>> ...
> >>> .visible .func (.param.u32 %value_out) foo
> >>> {
> >>>  .reg.u32 %value;
> >>>  mov.u32 %value, 285;
> >>>  st.param.u32[%value_out], %value;
> >>>  ret;
> >>> }
> >>> ...
> >>>
> >>> I could file a missing optimization PR for nvptx, but I'm not sure where 
> >>> this
> >>> should be fixed.
> >>
> >> Ah, yeah... the usual issue then.
> >>
> >> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?
> >>
> >
> > Done.
> >
> > Committed at attached.
> 
> this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:
> 
> FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> 
> where it was UNSUPPORTED before.

So it failed before Toms original patch.  Please add sparc-solaris
to the list of XFAILed targets.

> The dump has
> 
> ;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, 
> symbol_order=0)
> 
> foo ()
> {
>   int res;
>   int a[10];
>   int _2;
>   int _6;
>   int _28;
>   int _35;
>   int _42;
>   int _49;
>   int _56;
>   int _63;
>   int _70;
>   int _77;
> 
>[local count: 97603132]:
>   MEM[(int *)] = { 0, 1 };
>   MEM[(int *) + 8B] = { 4, 9 };
>   MEM[(int *) + 16B] = { 16, 25 };
>   MEM[(int *) + 24B] = { 36, 49 };
>   

Re: [PATCH] Fix PR84512

2018-03-20 Thread Rainer Orth
Hi Tom,

> On 03/19/2018 10:11 AM, Richard Biener wrote:
>> On Fri, 16 Mar 2018, Tom de Vries wrote:
>>
>>> On 03/16/2018 12:55 PM, Richard Biener wrote:
 On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 02/27/2018 01:42 PM, Richard Biener wrote:
>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
>> ===
>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c  (nonexistent)
>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c  (working copy)
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O3 -fdump-tree-optimized" } */
>> +
>> +int foo()
>> +{
>> +  int a[10];
>> +  for(int i = 0; i < 10; ++i)
>> +a[i] = i*i;
>> +  int res = 0;
>> +  for(int i = 0; i < 10; ++i)
>> +res += a[i];
>> +  return res;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
>
> This fails for nvptx, because it doesn't have the required vector
> operations.
> To fix the fail, I've added requiring effective target vect_int_mult.

 On targets that do not vectorize you should see the scalar loops unrolled
 instead.  Or do you have only one loop vectorized?
>>>
>>> Sort of. Loop vectorization has no effect, and the scalar loops are 
>>> completely
>>> unrolled. But then slp vectorization vectorizes the stores.
>>>
>>> So at optimized we have:
>>> ...
>>>MEM[(int *)] = { 0, 1 };
>>>MEM[(int *) + 8B] = { 4, 9 };
>>>MEM[(int *) + 16B] = { 16, 25 };
>>>MEM[(int *) + 24B] = { 36, 49 };
>>>MEM[(int *) + 32B] = { 64, 81 };
>>>_6 = a[0];
>>>_28 = a[1];
>>>res_29 = _6 + _28;
>>>_35 = a[2];
>>>res_36 = res_29 + _35;
>>>_42 = a[3];
>>>res_43 = res_36 + _42;
>>>_49 = a[4];
>>>res_50 = res_43 + _49;
>>>_56 = a[5];
>>>res_57 = res_50 + _56;
>>>_63 = a[6];
>>>res_64 = res_57 + _63;
>>>_70 = a[7];
>>>res_71 = res_64 + _70;
>>>_77 = a[8];
>>>res_78 = res_71 + _77;
>>>_2 = a[9];
>>>res_11 = _2 + res_78;
>>>a ={v} {CLOBBER};
>>>return res_11;
>>> ...
>>>
>>> The stores and loads are eliminated by dse1 in the rtl phase, and in the end
>>> we have:
>>> ...
>>> .visible .func (.param.u32 %value_out) foo
>>> {
>>>  .reg.u32 %value;
>>>  .local .align 16 .b8 %frame_ar[48];
>>>  .reg.u64 %frame;
>>>  cvta.local.u64 %frame, %frame_ar;
>>>  mov.u32 %value, 285;
>>>  st.param.u32[%value_out], %value;
>>>  ret;
>>> }
>>> ...
>>>
 That's precisely
 what the PR was about...  which means it isn't fixed for nvptx :/
>>>
>>> Indeed the assembly is not optimal, and would be optimal if we'd have 
>>> optimal
>>> code at optimized.
>>>
>>> FWIW, using this patch we generate optimal code at optimized:
>>> ...
>>> diff --git a/gcc/passes.def b/gcc/passes.def
>>> index 3ebcfc30349..6b64f600c4a 100644
>>> --- a/gcc/passes.def
>>> +++ b/gcc/passes.def
>>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
>>> NEXT_PASS (pass_tracer);
>>> NEXT_PASS (pass_thread_jumps);
>>> NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>>> +  NEXT_PASS (pass_fre);
>>> NEXT_PASS (pass_strlen);
>>> NEXT_PASS (pass_thread_jumps);
>>> NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
>>> ...
>>>
>>> and we get:
>>> ...
>>> .visible .func (.param.u32 %value_out) foo
>>> {
>>>  .reg.u32 %value;
>>>  mov.u32 %value, 285;
>>>  st.param.u32[%value_out], %value;
>>>  ret;
>>> }
>>> ...
>>>
>>> I could file a missing optimization PR for nvptx, but I'm not sure where 
>>> this
>>> should be fixed.
>>
>> Ah, yeah... the usual issue then.
>>
>> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?
>>
>
> Done.
>
> Committed at attached.

this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:

FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"

where it was UNSUPPORTED before.

The dump has

;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, symbol_order=0)

foo ()
{
  int res;
  int a[10];
  int _2;
  int _6;
  int _28;
  int _35;
  int _42;
  int _49;
  int _56;
  int _63;
  int _70;
  int _77;

   [local count: 97603132]:
  MEM[(int *)] = { 0, 1 };
  MEM[(int *) + 8B] = { 4, 9 };
  MEM[(int *) + 16B] = { 16, 25 };
  MEM[(int *) + 24B] = { 36, 49 };
  MEM[(int *) + 32B] = { 64, 81 };
  _6 = a[0];
  _28 = a[1];
  res_29 = _6 + _28;
  _35 = a[2];
  res_36 = res_29 + _35;
  _42 = a[3];
  res_43 = res_36 + _42;
  _49 = a[4];
  res_50 = res_43 + _49;
  _56 = a[5];
  res_57 = res_50 + _56;
  _63 = a[6];
  res_64 = res_57 + _63;
  _70 = a[7];
  res_71 = res_64 + _70;
  _77 = a[8];
  res_78 = res_71 + _77;
  _2 = a[9];
  res_11 = _2 + res_78;
  a ={v} {CLOBBER};
  return res_11;

}

Rainer

-- 

Re: [PATCH] Fix PR84512

2018-03-20 Thread Tom de Vries

On 03/19/2018 10:11 AM, Richard Biener wrote:

On Fri, 16 Mar 2018, Tom de Vries wrote:


On 03/16/2018 12:55 PM, Richard Biener wrote:

On Fri, 16 Mar 2018, Tom de Vries wrote:


On 02/27/2018 01:42 PM, Richard Biener wrote:

Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (nonexistent)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (working copy)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int foo()
+{
+  int a[10];
+  for(int i = 0; i < 10; ++i)
+a[i] = i*i;
+  int res = 0;
+  for(int i = 0; i < 10; ++i)
+res += a[i];
+  return res;
+}
+
+/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */


This fails for nvptx, because it doesn't have the required vector
operations.
To fix the fail, I've added requiring effective target vect_int_mult.


On targets that do not vectorize you should see the scalar loops unrolled
instead.  Or do you have only one loop vectorized?


Sort of. Loop vectorization has no effect, and the scalar loops are completely
unrolled. But then slp vectorization vectorizes the stores.

So at optimized we have:
...
   MEM[(int *)] = { 0, 1 };
   MEM[(int *) + 8B] = { 4, 9 };
   MEM[(int *) + 16B] = { 16, 25 };
   MEM[(int *) + 24B] = { 36, 49 };
   MEM[(int *) + 32B] = { 64, 81 };
   _6 = a[0];
   _28 = a[1];
   res_29 = _6 + _28;
   _35 = a[2];
   res_36 = res_29 + _35;
   _42 = a[3];
   res_43 = res_36 + _42;
   _49 = a[4];
   res_50 = res_43 + _49;
   _56 = a[5];
   res_57 = res_50 + _56;
   _63 = a[6];
   res_64 = res_57 + _63;
   _70 = a[7];
   res_71 = res_64 + _70;
   _77 = a[8];
   res_78 = res_71 + _77;
   _2 = a[9];
   res_11 = _2 + res_78;
   a ={v} {CLOBBER};
   return res_11;
...

The stores and loads are eliminated by dse1 in the rtl phase, and in the end
we have:
...
.visible .func (.param.u32 %value_out) foo
{
 .reg.u32 %value;
 .local .align 16 .b8 %frame_ar[48];
 .reg.u64 %frame;
 cvta.local.u64 %frame, %frame_ar;
 mov.u32 %value, 285;
 st.param.u32[%value_out], %value;
 ret;
}
...


That's precisely
what the PR was about...  which means it isn't fixed for nvptx :/


Indeed the assembly is not optimal, and would be optimal if we'd have optimal
code at optimized.

FWIW, using this patch we generate optimal code at optimized:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 3ebcfc30349..6b64f600c4a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
NEXT_PASS (pass_tracer);
NEXT_PASS (pass_thread_jumps);
NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+  NEXT_PASS (pass_fre);
NEXT_PASS (pass_strlen);
NEXT_PASS (pass_thread_jumps);
NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
...

and we get:
...
.visible .func (.param.u32 %value_out) foo
{
 .reg.u32 %value;
 mov.u32 %value, 285;
 st.param.u32[%value_out], %value;
 ret;
}
...

I could file a missing optimization PR for nvptx, but I'm not sure where this
should be fixed.


Ah, yeah... the usual issue then.

Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?



Done.

Committed at attached.

Thanks,
- Tom
[testsuite] Add nvptx xfail to pr84512.c

2018-03-19  Tom de Vries  

	* gcc.dg/tree-ssa/pr84512.c: Don't require effective target
	vect_int_mult.  Add nvptx xfail for PR84958.

---
 gcc/testsuite/ChangeLog | 5 +
 gcc/testsuite/gcc.dg/tree-ssa/pr84512.c | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
index 41b6c06..9560160 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -1,6 +1,5 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -fdump-tree-optimized" } */
-/* { dg-require-effective-target vect_int_mult } */
 
 int foo()
 {
@@ -13,4 +12,5 @@ int foo()
   return res;
 }
 
-/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
+/* Target nvptx xfail due to PR84958.  */
+/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail nvptx*-*-* } } } */


Re: [PATCH] Fix PR84512

2018-03-19 Thread Richard Biener
On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 03/16/2018 12:55 PM, Richard Biener wrote:
> > On Fri, 16 Mar 2018, Tom de Vries wrote:
> > 
> > > On 02/27/2018 01:42 PM, Richard Biener wrote:
> > > > Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> > > > ===
> > > > --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (nonexistent)
> > > > +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (working copy)
> > > > @@ -0,0 +1,15 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O3 -fdump-tree-optimized" } */
> > > > +
> > > > +int foo()
> > > > +{
> > > > +  int a[10];
> > > > +  for(int i = 0; i < 10; ++i)
> > > > +a[i] = i*i;
> > > > +  int res = 0;
> > > > +  for(int i = 0; i < 10; ++i)
> > > > +res += a[i];
> > > > +  return res;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
> > > 
> > > This fails for nvptx, because it doesn't have the required vector
> > > operations.
> > > To fix the fail, I've added requiring effective target vect_int_mult.
> > 
> > On targets that do not vectorize you should see the scalar loops unrolled
> > instead.  Or do you have only one loop vectorized?
> 
> Sort of. Loop vectorization has no effect, and the scalar loops are completely
> unrolled. But then slp vectorization vectorizes the stores.
> 
> So at optimized we have:
> ...
>   MEM[(int *)] = { 0, 1 };
>   MEM[(int *) + 8B] = { 4, 9 };
>   MEM[(int *) + 16B] = { 16, 25 };
>   MEM[(int *) + 24B] = { 36, 49 };
>   MEM[(int *) + 32B] = { 64, 81 };
>   _6 = a[0];
>   _28 = a[1];
>   res_29 = _6 + _28;
>   _35 = a[2];
>   res_36 = res_29 + _35;
>   _42 = a[3];
>   res_43 = res_36 + _42;
>   _49 = a[4];
>   res_50 = res_43 + _49;
>   _56 = a[5];
>   res_57 = res_50 + _56;
>   _63 = a[6];
>   res_64 = res_57 + _63;
>   _70 = a[7];
>   res_71 = res_64 + _70;
>   _77 = a[8];
>   res_78 = res_71 + _77;
>   _2 = a[9];
>   res_11 = _2 + res_78;
>   a ={v} {CLOBBER};
>   return res_11;
> ...
> 
> The stores and loads are eliminated by dse1 in the rtl phase, and in the end
> we have:
> ...
> .visible .func (.param.u32 %value_out) foo
> {
> .reg.u32 %value;
> .local .align 16 .b8 %frame_ar[48];
> .reg.u64 %frame;
> cvta.local.u64 %frame, %frame_ar;
> mov.u32 %value, 285;
> st.param.u32[%value_out], %value;
> ret;
> }
> ...
> 
> > That's precisely
> > what the PR was about...  which means it isn't fixed for nvptx :/
> 
> Indeed the assembly is not optimal, and would be optimal if we'd have optimal
> code at optimized.
> 
> FWIW, using this patch we generate optimal code at optimized:
> ...
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 3ebcfc30349..6b64f600c4a 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
>NEXT_PASS (pass_tracer);
>NEXT_PASS (pass_thread_jumps);
>NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
> +  NEXT_PASS (pass_fre);
>NEXT_PASS (pass_strlen);
>NEXT_PASS (pass_thread_jumps);
>NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
> ...
> 
> and we get:
> ...
> .visible .func (.param.u32 %value_out) foo
> {
> .reg.u32 %value;
> mov.u32 %value, 285;
> st.param.u32[%value_out], %value;
> ret;
> }
> ...
> 
> I could file a missing optimization PR for nvptx, but I'm not sure where this
> should be fixed.

Ah, yeah... the usual issue then.

Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?

Thanks,
Richard.


Re: [PATCH] Fix PR84512

2018-03-16 Thread Tom de Vries

On 03/16/2018 12:55 PM, Richard Biener wrote:

On Fri, 16 Mar 2018, Tom de Vries wrote:


On 02/27/2018 01:42 PM, Richard Biener wrote:

Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (nonexistent)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (working copy)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int foo()
+{
+  int a[10];
+  for(int i = 0; i < 10; ++i)
+a[i] = i*i;
+  int res = 0;
+  for(int i = 0; i < 10; ++i)
+res += a[i];
+  return res;
+}
+
+/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */


This fails for nvptx, because it doesn't have the required vector operations.
To fix the fail, I've added requiring effective target vect_int_mult.


On targets that do not vectorize you should see the scalar loops unrolled
instead.  Or do you have only one loop vectorized?


Sort of. Loop vectorization has no effect, and the scalar loops are 
completely unrolled. But then slp vectorization vectorizes the stores.


So at optimized we have:
...
  MEM[(int *)] = { 0, 1 };
  MEM[(int *) + 8B] = { 4, 9 };
  MEM[(int *) + 16B] = { 16, 25 };
  MEM[(int *) + 24B] = { 36, 49 };
  MEM[(int *) + 32B] = { 64, 81 };
  _6 = a[0];
  _28 = a[1];
  res_29 = _6 + _28;
  _35 = a[2];
  res_36 = res_29 + _35;
  _42 = a[3];
  res_43 = res_36 + _42;
  _49 = a[4];
  res_50 = res_43 + _49;
  _56 = a[5];
  res_57 = res_50 + _56;
  _63 = a[6];
  res_64 = res_57 + _63;
  _70 = a[7];
  res_71 = res_64 + _70;
  _77 = a[8];
  res_78 = res_71 + _77;
  _2 = a[9];
  res_11 = _2 + res_78;
  a ={v} {CLOBBER};
  return res_11;
...

The stores and loads are eliminated by dse1 in the rtl phase, and in the 
end we have:

...
.visible .func (.param.u32 %value_out) foo
{
.reg.u32 %value;
.local .align 16 .b8 %frame_ar[48];
.reg.u64 %frame;
cvta.local.u64 %frame, %frame_ar;
mov.u32 %value, 285;
st.param.u32[%value_out], %value;
ret;
}
...


That's precisely
what the PR was about...  which means it isn't fixed for nvptx :/


Indeed the assembly is not optimal, and would be optimal if we'd have 
optimal code at optimized.


FWIW, using this patch we generate optimal code at optimized:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 3ebcfc30349..6b64f600c4a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_tracer);
   NEXT_PASS (pass_thread_jumps);
   NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+  NEXT_PASS (pass_fre);
   NEXT_PASS (pass_strlen);
   NEXT_PASS (pass_thread_jumps);
   NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
...

and we get:
...
.visible .func (.param.u32 %value_out) foo
{
.reg.u32 %value;
mov.u32 %value, 285;
st.param.u32[%value_out], %value;
ret;
}
...

I could file a missing optimization PR for nvptx, but I'm not sure where 
this should be fixed.


Thanks,
- Tom


Re: [PATCH] Fix PR84512

2018-03-16 Thread Richard Biener
On Fri, 16 Mar 2018, Tom de Vries wrote:

> On 02/27/2018 01:42 PM, Richard Biener wrote:
> > Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> > ===
> > --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (nonexistent)
> > +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (working copy)
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -fdump-tree-optimized" } */
> > +
> > +int foo()
> > +{
> > +  int a[10];
> > +  for(int i = 0; i < 10; ++i)
> > +a[i] = i*i;
> > +  int res = 0;
> > +  for(int i = 0; i < 10; ++i)
> > +res += a[i];
> > +  return res;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
> 
> This fails for nvptx, because it doesn't have the required vector operations.
> To fix the fail, I've added requiring effective target vect_int_mult.

On targets that do not vectorize you should see the scalar loops unrolled
instead.  Or do you have only one loop vectorized?  That's precisely
what the PR was about...  which means it isn't fixed for nvptx :/

Richard.

> Thanks,
> - Tom
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


Re: [PATCH] Fix PR84512

2018-03-16 Thread Tom de Vries

On 02/27/2018 01:42 PM, Richard Biener wrote:

Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (nonexistent)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c (working copy)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int foo()
+{
+  int a[10];
+  for(int i = 0; i < 10; ++i)
+a[i] = i*i;
+  int res = 0;
+  for(int i = 0; i < 10; ++i)
+res += a[i];
+  return res;
+}
+
+/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */


This fails for nvptx, because it doesn't have the required vector 
operations. To fix the fail, I've added requiring effective target 
vect_int_mult.


Thanks,
- Tom
[testsuite] Require vect_int_mult in pr84512.c

2018-03-16  Tom de Vries  

	* gcc.dg/tree-ssa/pr84512.c: Require effective target vect_int_mult.

---
 gcc/testsuite/gcc.dg/tree-ssa/pr84512.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
index 288fa5d..41b6c06 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O3 -fdump-tree-optimized" } */
+/* { dg-require-effective-target vect_int_mult } */
 
 int foo()
 {