On 03/16/2018 12:55 PM, Richard Biener wrote:
On Fri, 16 Mar 2018, Tom de Vries wrote:

On 02/27/2018 01:42 PM, Richard Biener wrote:
Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c     (nonexistent)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c     (working copy)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int foo()
+{
+  int a[10];
+  for(int i = 0; i < 10; ++i)
+    a[i] = i*i;
+  int res = 0;
+  for(int i = 0; i < 10; ++i)
+    res += a[i];
+  return res;
+}
+
+/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */

This fails for nvptx because it doesn't have the required vector operations.
To fix the failure, I've added a requirement for effective target vect_int_mult.
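
A minimal sketch of how that could look in the test header; whether the actual
patch uses a dg-require-effective-target directive or a target selector on the
final scan-tree-dump is an assumption on my part:
...
/* { dg-do compile } */
/* { dg-options "-O3 -fdump-tree-optimized" } */
/* Skip on targets without vector int multiplication support (assumed form).  */
/* { dg-require-effective-target vect_int_mult } */
...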

On targets that do not vectorize you should see the scalar loops unrolled
instead.  Or do you have only one loop vectorized?

Sort of. Loop vectorization has no effect, and the scalar loops are completely unrolled. But then SLP vectorization vectorizes the stores.

So at optimized we have:
...
  MEM[(int *)&a] = { 0, 1 };
  MEM[(int *)&a + 8B] = { 4, 9 };
  MEM[(int *)&a + 16B] = { 16, 25 };
  MEM[(int *)&a + 24B] = { 36, 49 };
  MEM[(int *)&a + 32B] = { 64, 81 };
  _6 = a[0];
  _28 = a[1];
  res_29 = _6 + _28;
  _35 = a[2];
  res_36 = res_29 + _35;
  _42 = a[3];
  res_43 = res_36 + _42;
  _49 = a[4];
  res_50 = res_43 + _49;
  _56 = a[5];
  res_57 = res_50 + _56;
  _63 = a[6];
  res_64 = res_57 + _63;
  _70 = a[7];
  res_71 = res_64 + _70;
  _77 = a[8];
  res_78 = res_71 + _77;
  _2 = a[9];
  res_11 = _2 + res_78;
  a ={v} {CLOBBER};
  return res_11;
...

The stores and loads are eliminated by dse1 in the RTL phase, and in the end we have:
...
.visible .func (.param.u32 %value_out) foo
{
        .reg.u32 %value;
        .local .align 16 .b8 %frame_ar[48];
        .reg.u64 %frame;
        cvta.local.u64 %frame, %frame_ar;
        mov.u32 %value, 285;
        st.param.u32    [%value_out], %value;
        ret;
}
...

That's precisely
what the PR was about...  which means it isn't fixed for nvptx :/

Indeed, the assembly is not optimal; it would be if the code at optimized were optimal.

FWIW, using this patch we generate optimal code at optimized:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 3ebcfc30349..6b64f600c4a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_tracer);
       NEXT_PASS (pass_thread_jumps);
       NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_strlen);
       NEXT_PASS (pass_thread_jumps);
       NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
...

and we get:
...
.visible .func (.param.u32 %value_out) foo
{
        .reg.u32 %value;
        mov.u32 %value, 285;
        st.param.u32    [%value_out], %value;
        ret;
}
...

I could file a missed-optimization PR for nvptx, but I'm not sure where this should be fixed.

Thanks,
- Tom
