This patch adds support for in-order floating-point addition reductions,
which are valid even in strict IEEE mode because the elements are added
in their original order.

Previously vect_is_simple_reduction would reject any reduction that
forbids reassociation.  The idea is instead to accept such reductions
tentatively as "FOLD_LEFT_REDUCTIONs" and only fail later if there is
no target support for them.  Although this patch handles only the
particular case of plus and minus on floating-point types, there's no
reason in principle why targets couldn't handle other cases.
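
As an illustration (not part of the patch), the kind of loop this
enables is a plain in-order sum such as:

    /* Illustrative sketch only.  Without -fassociative-math this loop
       must keep its original evaluation order; with this patch the
       vectorizer can implement it as a fold-left reduction on targets
       that provide the new fold_left_plus optab (e.g. SVE's FADDA),
       adding each vector element to the scalar accumulator in turn.  */
    double
    strict_sum (const double *a, int n)
    {
      double res = 0.0;
      for (int i = 0; i < n; ++i)
        res += a[i];
      return res;
    }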

The vect_force_simple_reduction change makes it simpler for parloops
to read the reduction type.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandif...@linaro.org>
            Alan Hayward  <alan.hayw...@arm.com>
            David Sherwood  <david.sherw...@arm.com>

gcc/
        * tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
        * doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
        * optabs.def (fold_left_plus_optab): New optab.
        * doc/md.texi (fold_left_plus_@var{m}): Document.
        * doc/sourcebuild.texi (vect_fold_left_plus): Document.
        * cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
        * expr.c (expand_expr_real_2): Likewise.
        * fold-const.c (const_binop): Likewise.
        * optabs-tree.c (optab_for_tree_code): Likewise.
        * tree-cfg.c (verify_gimple_assign_binary): Likewise.
        * tree-inline.c (estimate_operator_cost): Likewise.
        * tree-pretty-print.c (dump_generic_node): Likewise.
        (op_code_prio): Likewise.
        (op_symbol_code): Likewise.
        * tree-vect-stmts.c (vectorizable_operation): Likewise.
        * tree-parloops.c (valid_reduction_p): New function.
        (gather_scalar_reductions): Use it.
        * tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
        (vect_finish_replace_stmt): Declare.
        * tree-vect-loop.c (fold_left_reduction_code): New function.
        (needs_fold_left_reduction_p): New function, split out from...
        (vect_is_simple_reduction): ...here.  Accept reductions that
        forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
        (vect_force_simple_reduction): Also store the reduction type in
        the assignment's STMT_VINFO_REDUC_TYPE.
        (vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
        (merge_with_identity): New function.
        (vectorize_fold_left_reduction): Likewise.
        (vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
        scalar phi in place for it.  Require target support and reject
        cases that would reassociate the operation.  Defer the transform
        phase to vectorize_fold_left_reduction.
        * config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
        * config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
        (*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.

gcc/testsuite/
        * lib/target-supports.exp (check_effective_target_vect_fold_left_plus):
        New proc.
        * gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
        vect_fold_left_plus.
        * gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
        vect_fold_left_plus.
        * gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
        recognized as a reduction and then rejected for lack of target
        support.
        * gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
        vect_fold_left_plus.
        * gcc.target/aarch64/sve_reduc_strict_1.c: New test.
        * gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
        * gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
        * gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
        * gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
        * gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
        * gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized if
        vect_fold_left_plus.

Index: gcc/tree.def
===================================================================
--- gcc/tree.def        2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree.def        2017-11-17 16:52:07.631930981 +0000
@@ -1302,6 +1302,8 @@ DEFTREECODE (REDUC_AND_EXPR, "reduc_and_
 DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1)
 
+DEFTREECODE (FOLD_LEFT_PLUS_EXPR, "fold_left_plus_expr", tcc_binary, 2)
+
 /* Widening dot-product.
    The first two arguments are of type t1.
    The third argument and the result are of type t2, such that t2 is at least
Index: gcc/doc/generic.texi
===================================================================
--- gcc/doc/generic.texi        2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/generic.texi        2017-11-17 16:52:07.620954871 +0000
@@ -1746,6 +1746,7 @@ a value from @code{enum annot_expr_kind}
 @tindex REDUC_AND_EXPR
 @tindex REDUC_IOR_EXPR
 @tindex REDUC_XOR_EXPR
+@tindex FOLD_LEFT_PLUS_EXPR
 
 @table @code
 @item VEC_DUPLICATE_EXPR
@@ -1861,6 +1862,12 @@ the maximum element in @var{x}.  The ass
 is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could
 sum floating-point @var{x} in forward order, in reverse order,
 using a tree, or in some other way.
+
+@item FOLD_LEFT_PLUS_EXPR
+This node takes two arguments: a scalar of type @var{t} and a vector
+of @var{t}s.  It successively adds each element of the vector to the
+scalar and returns the result.  The operation is strictly in-order:
+there is no reassociation.
 @end table
 
 
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def      2017-11-17 16:52:07.246852461 +0000
+++ gcc/optabs.def      2017-11-17 16:52:07.625528250 +0000
@@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
 OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
 OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
 OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
+OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
 
 OPTAB_D (extract_last_optab, "extract_last_$a")
 OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi     2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/md.texi     2017-11-17 16:52:07.621869547 +0000
@@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
 one element of @var{m}.  Operand 2 has the usual mask mode for vectors
 of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{fold_left_plus_@var{m}} instruction pattern
+@item @code{fold_left_plus_@var{m}}
+Take scalar operand 1 and successively add each element from vector
+operand 2.  Store the result in scalar operand 0.  The vector has
+mode @var{m} and the scalars have the mode appropriate for one
+element of @var{m}.  The operation is strictly in-order: there is
+no reassociation.
+
 @cindex @code{sdot_prod@var{m}} instruction pattern
 @item @samp{sdot_prod@var{m}}
 @cindex @code{udot_prod@var{m}} instruction pattern
Index: gcc/doc/sourcebuild.texi
===================================================================
--- gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.246852461 +0000
+++ gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.621869547 +0000
@@ -1580,6 +1580,9 @@ Target supports AND, IOR and XOR reducti
 
 @item vect_fold_extract_last
 Target supports the @code{fold_extract_last} optab.
+
+@item vect_fold_left_plus
+Target supports the @code{fold_left_plus} optab.
 @end table
 
 @subsubsection Thread Local Storage attributes
Index: gcc/cfgexpand.c
===================================================================
--- gcc/cfgexpand.c     2017-11-17 16:52:07.246852461 +0000
+++ gcc/cfgexpand.c     2017-11-17 16:52:07.620040195 +0000
@@ -5072,6 +5072,7 @@ expand_debug_expr (tree exp)
     case REDUC_AND_EXPR:
     case REDUC_IOR_EXPR:
     case REDUC_XOR_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case VEC_COND_EXPR:
     case VEC_PACK_FIX_TRUNC_EXPR:
     case VEC_PACK_SAT_EXPR:
Index: gcc/expr.c
===================================================================
--- gcc/expr.c  2017-11-17 16:52:07.246852461 +0000
+++ gcc/expr.c  2017-11-17 16:52:07.622784222 +0000
@@ -9438,6 +9438,28 @@ #define REDUCE_BIT_FIELD(expr)   (reduce_b
         return target;
       }
 
+    case FOLD_LEFT_PLUS_EXPR:
+      {
+       op0 = expand_normal (treeop0);
+       op1 = expand_normal (treeop1);
+       this_optab = optab_for_tree_code (code, type, optab_default);
+       machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop1));
+       insn_code icode = optab_handler (this_optab, vec_mode);
+
+       if (icode != CODE_FOR_nothing)
+         {
+           struct expand_operand ops[3];
+           create_output_operand (&ops[0], target, mode);
+           create_input_operand (&ops[1], op0, mode);
+           create_input_operand (&ops[2], op1, vec_mode);
+           if (maybe_expand_insn (icode, 3, ops))
+             return ops[0].value;
+         }
+
+       /* Nothing to fall back to.  */
+       gcc_unreachable ();
+      }
+
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c    2017-11-17 16:52:07.246852461 +0000
+++ gcc/fold-const.c    2017-11-17 16:52:07.623698898 +0000
@@ -1603,6 +1603,32 @@ const_binop (enum tree_code code, tree a
        return NULL_TREE;
       return build_vector_from_val (TREE_TYPE (arg1), sub);
     }
+
+  if (CONSTANT_CLASS_P (arg1)
+      && TREE_CODE (arg2) == VECTOR_CST)
+    {
+      tree_code subcode;
+
+      switch (code)
+       {
+       case FOLD_LEFT_PLUS_EXPR:
+         subcode = PLUS_EXPR;
+         break;
+       default:
+         return NULL_TREE;
+       }
+
+      int nelts = VECTOR_CST_NELTS (arg2);
+      tree accum = arg1;
+      for (int i = 0; i < nelts; i++)
+       {
+         accum = const_binop (subcode, accum, VECTOR_CST_ELT (arg2, i));
+         if (accum == NULL_TREE || !CONSTANT_CLASS_P (accum))
+           return NULL_TREE;
+       }
+
+      return accum;
+    }
   return NULL_TREE;
 }
 
Index: gcc/optabs-tree.c
===================================================================
--- gcc/optabs-tree.c   2017-11-17 16:52:07.246852461 +0000
+++ gcc/optabs-tree.c   2017-11-17 16:52:07.623698898 +0000
@@ -166,6 +166,9 @@ optab_for_tree_code (enum tree_code code
     case REDUC_XOR_EXPR:
       return reduc_xor_scal_optab;
 
+    case FOLD_LEFT_PLUS_EXPR:
+      return fold_left_plus_optab;
+
     case VEC_WIDEN_MULT_HI_EXPR:
       return TYPE_UNSIGNED (type) ?
        vec_widen_umult_hi_optab : vec_widen_smult_hi_optab;
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c      2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-cfg.c      2017-11-17 16:52:07.628272277 +0000
@@ -4116,6 +4116,19 @@ verify_gimple_assign_binary (gassign *st
       /* Continue with generic binary expression handling.  */
       break;
 
+    case FOLD_LEFT_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs2_type)
+         || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs2_type))
+         || !useless_type_conversion_p (lhs_type, rhs1_type))
+       {
+         error ("reduction should convert from vector to element type");
+         debug_generic_expr (lhs_type);
+         debug_generic_expr (rhs1_type);
+         debug_generic_expr (rhs2_type);
+         return true;
+       }
+      return false;
+
     case VEC_SERIES_EXPR:
       if (!useless_type_conversion_p (rhs1_type, rhs2_type))
        {
Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c   2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-inline.c   2017-11-17 16:52:07.628272277 +0000
@@ -3881,6 +3881,7 @@ estimate_operator_cost (enum tree_code c
     case REDUC_AND_EXPR:
     case REDUC_IOR_EXPR:
     case REDUC_XOR_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case WIDEN_SUM_EXPR:
     case WIDEN_MULT_EXPR:
     case DOT_PROD_EXPR:
Index: gcc/tree-pretty-print.c
===================================================================
--- gcc/tree-pretty-print.c     2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-pretty-print.c     2017-11-17 16:52:07.629186953 +0000
@@ -3232,6 +3232,7 @@ dump_generic_node (pretty_printer *pp, t
       break;
 
     case VEC_SERIES_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case VEC_WIDEN_MULT_HI_EXPR:
     case VEC_WIDEN_MULT_LO_EXPR:
     case VEC_WIDEN_MULT_EVEN_EXPR:
@@ -3628,6 +3629,7 @@ op_code_prio (enum tree_code code)
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+    case FOLD_LEFT_PLUS_EXPR:
     case VEC_UNPACK_HI_EXPR:
     case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
@@ -3749,6 +3751,9 @@ op_symbol_code (enum tree_code code)
     case REDUC_PLUS_EXPR:
       return "r+";
 
+    case FOLD_LEFT_PLUS_EXPR:
+      return "fl+";
+
     case WIDEN_SUM_EXPR:
       return "w+";
 
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vect-stmts.c       2017-11-17 16:52:07.631016305 +0000
@@ -5415,6 +5415,10 @@ vectorizable_operation (gimple *stmt, gi
 
   code = gimple_assign_rhs_code (stmt);
 
+  /* Ignore operations that mix scalar and vector input operands.  */
+  if (code == FOLD_LEFT_PLUS_EXPR)
+    return false;
+
   /* For pointer addition, we should use the normal plus for
      the vector addition.  */
   if (code == POINTER_PLUS_EXPR)
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c 2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-parloops.c 2017-11-17 16:52:07.629186953 +0000
@@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
   return 1;
 }
 
+/* Return true if the type of reduction performed by STMT is suitable
+   for this pass.  */
+
+static bool
+valid_reduction_p (gimple *stmt)
+{
+  /* Parallelization would reassociate the operation, which isn't
+     allowed for in-order reductions.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
+  return reduc_type != FOLD_LEFT_REDUCTION;
+}
+
 /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
 
 static void
@@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
       gimple *reduc_stmt
        = vect_force_simple_reduction (simple_loop_info, phi,
                                       &double_reduc, true);
-      if (!reduc_stmt)
+      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
        continue;
 
       if (double_reduc)
@@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
                = vect_force_simple_reduction (simple_loop_info, inner_phi,
                                               &double_reduc, true);
              gcc_assert (!double_reduc);
-             if (inner_reduc_stmt == NULL)
+             if (inner_reduc_stmt == NULL
+                 || !valid_reduction_p (inner_reduc_stmt))
                continue;
 
              build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h       2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vectorizer.h       2017-11-17 16:52:07.631016305 +0000
@@ -74,7 +74,15 @@ enum vect_reduction_type {
 
        for (int i = 0; i < VF; ++i)
          res = cond[i] ? val[i] : res;  */
-  EXTRACT_LAST_REDUCTION
+  EXTRACT_LAST_REDUCTION,
+
+  /* Use a folding reduction within the loop to implement:
+
+       for (int i = 0; i < VF; ++i)
+         res = res OP val[i];
+
+     (with no reassociation).  */
+  FOLD_LEFT_REDUCTION
 };
 
 #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
@@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
                                  enum vect_cost_for_stmt, stmt_vec_info,
                                  int, enum vect_cost_model_location);
+extern void vect_finish_replace_stmt (gimple *, gimple *);
 extern void vect_finish_stmt_generation (gimple *, gimple *,
                                          gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        2017-11-17 16:52:07.246852461 +0000
+++ gcc/tree-vect-loop.c        2017-11-17 16:52:07.630101629 +0000
@@ -2573,6 +2573,29 @@ vect_analyze_loop (struct loop *loop, lo
     }
 }
 
+/* Return true if the target supports in-order reductions for operation
+   CODE and type TYPE.  If the target supports it, store the reduction
+   operation in *REDUC_CODE.  */
+
+static bool
+fold_left_reduction_code (tree_code code, tree type, tree_code *reduc_code)
+{
+  switch (code)
+    {
+    case PLUS_EXPR:
+      code = FOLD_LEFT_PLUS_EXPR;
+      break;
+
+    default:
+      return false;
+    }
+
+  if (!target_supports_op_p (type, code, optab_vector))
+    return false;
+
+  *reduc_code = code;
+  return true;
+}
 
 /* Function reduction_code_for_scalar_code
 
@@ -2880,6 +2903,42 @@ vect_is_slp_reduction (loop_vec_info loo
   return true;
 }
 
+/* Returns true if we need an in-order reduction for operation CODE
+   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
+   overflow must wrap.  */
+
+static bool
+needs_fold_left_reduction_p (tree type, tree_code code,
+                            bool need_wrapping_integral_overflow)
+{
+  /* CHECKME: check for !flag_finite_math_only too?  */
+  if (SCALAR_FLOAT_TYPE_P (type))
+    switch (code)
+      {
+      case MIN_EXPR:
+      case MAX_EXPR:
+       return false;
+
+      default:
+       return !flag_associative_math;
+      }
+
+  if (INTEGRAL_TYPE_P (type))
+    {
+      if (!operation_no_trapping_overflow (type, code))
+       return true;
+      if (need_wrapping_integral_overflow
+         && !TYPE_OVERFLOW_WRAPS (type)
+         && operation_can_overflow (code))
+       return true;
+      return false;
+    }
+
+  if (SAT_FIXED_POINT_TYPE_P (type))
+    return true;
+
+  return false;
+}
 
 /* Function vect_is_simple_reduction
 
@@ -3198,58 +3257,18 @@ vect_is_simple_reduction (loop_vec_info
       return NULL;
     }
 
-  /* Check that it's ok to change the order of the computation.
+  /* Check whether it's ok to change the order of the computation.
      Generally, when vectorizing a reduction we change the order of the
      computation.  This may change the behavior of the program in some
      cases, so we need to check that this is ok.  One exception is when
      vectorizing an outer-loop: the inner-loop is executed sequentially,
      and therefore vectorizing reductions in the inner-loop during
      outer-loop vectorization is safe.  */
-
-  if (*v_reduc_type != COND_REDUCTION
-      && check_reduction)
-    {
-      /* CHECKME: check for !flag_finite_math_only too?  */
-      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
-       {
-         /* Changing the order of operations changes the semantics.  */
-         if (dump_enabled_p ())
-           report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-                       "reduction: unsafe fp math optimization: ");
-         return NULL;
-       }
-      else if (INTEGRAL_TYPE_P (type))
-       {
-         if (!operation_no_trapping_overflow (type, code))
-           {
-             /* Changing the order of operations changes the semantics.  */
-             if (dump_enabled_p ())
-               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-                               "reduction: unsafe int math optimization"
-                               " (overflow traps): ");
-             return NULL;
-           }
-         if (need_wrapping_integral_overflow
-             && !TYPE_OVERFLOW_WRAPS (type)
-             && operation_can_overflow (code))
-           {
-             /* Changing the order of operations changes the semantics.  */
-             if (dump_enabled_p ())
-               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-                               "reduction: unsafe int math optimization"
-                               " (overflow doesn't wrap): ");
-             return NULL;
-           }
-       }
-      else if (SAT_FIXED_POINT_TYPE_P (type))
-       {
-         /* Changing the order of operations changes the semantics.  */
-         if (dump_enabled_p ())
-         report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
-                         "reduction: unsafe fixed-point math optimization: ");
-         return NULL;
-       }
-    }
+  if (check_reduction
+      && *v_reduc_type == TREE_CODE_REDUCTION
+      && needs_fold_left_reduction_p (type, code,
+                                     need_wrapping_integral_overflow))
+    *v_reduc_type = FOLD_LEFT_REDUCTION;
 
   /* Reduction is safe. We're dealing with one of the following:
      1) integer arithmetic and no trapv
@@ -3513,6 +3532,7 @@ vect_force_simple_reduction (loop_vec_in
       STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
       reduc_def_info = vinfo_for_stmt (def);
+      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
       STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
     }
   return def;
@@ -4065,7 +4085,8 @@ vect_model_reduction_cost (stmt_vec_info
 
   code = gimple_assign_rhs_code (orig_stmt);
 
-  if (reduction_type == EXTRACT_LAST_REDUCTION)
+  if (reduction_type == EXTRACT_LAST_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION)
     {
       /* No extra instructions needed in the prologue.  */
       prologue_cost = 0;
@@ -4138,7 +4159,8 @@ vect_model_reduction_cost (stmt_vec_info
                                          scalar_stmt, stmt_info, 0,
                                          vect_epilogue);
        }
-      else if (reduction_type == EXTRACT_LAST_REDUCTION)
+      else if (reduction_type == EXTRACT_LAST_REDUCTION
+              || reduction_type == FOLD_LEFT_REDUCTION)
        /* No extra instructions need in the epilogue.  */
        ;
       else
@@ -5884,6 +5906,155 @@ vect_create_epilog_for_reduction (vec<tr
     }
 }
 
+/* Return a vector of type VECTYPE that is equal to the vector select
+   operation "MASK ? VEC : IDENTITY".  Insert the select statements
+   before GSI.  */
+
+static tree
+merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
+                    tree vec, tree identity)
+{
+  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
+  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
+                                         mask, vec, identity);
+  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+  return cond;
+}
+
+/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
+   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
+   statement.  CODE is the operation performed by STMT and OPS are
+   its scalar operands.  REDUC_INDEX is the index of the operand in
+   OPS that is set by REDUC_DEF_STMT.  REDUC_CODE is the code that
+   implements in-order reduction and VECTYPE_IN is the type of its
+   vector input.  MASKS specifies the masks that should be used to
+   control the operation in a fully-masked loop.  */
+
+static bool
+vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
+                              gimple **vec_stmt, slp_tree slp_node,
+                              gimple *reduc_def_stmt,
+                              tree_code code, tree_code reduc_code,
+                              tree ops[3], tree vectype_in,
+                              int reduc_index, vec_loop_masks *masks)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
+  gimple *new_stmt = NULL;
+
+  int ncopies;
+  if (slp_node)
+    ncopies = 1;
+  else
+    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+
+  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
+  gcc_assert (ncopies == 1);
+  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
+  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
+  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+             == FOLD_LEFT_REDUCTION);
+
+  if (slp_node)
+    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
+                        TYPE_VECTOR_SUBPARTS (vectype_in)));
+
+  tree op0 = ops[1 - reduc_index];
+
+  int group_size = 1;
+  gimple *scalar_dest_def;
+  auto_vec<tree> vec_oprnds0;
+  if (slp_node)
+    {
+      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
+      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
+    }
+  else
+    {
+      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
+      vec_oprnds0.create (1);
+      vec_oprnds0.quick_push (loop_vec_def0);
+      scalar_dest_def = stmt;
+    }
+
+  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
+  tree scalar_type = TREE_TYPE (scalar_dest);
+  tree reduc_var = gimple_phi_result (reduc_def_stmt);
+
+  int vec_num = vec_oprnds0.length ();
+  gcc_assert (vec_num == 1 || slp_node);
+  tree vec_elem_type = TREE_TYPE (vectype_out);
+  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
+
+  tree vector_identity = NULL_TREE;
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    vector_identity = build_zero_cst (vectype_out);
+
+  int i;
+  tree def0;
+  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
+    {
+      tree mask = NULL_TREE;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+       mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
+
+      /* Handle MINUS by adding the negative.  */
+      if (code == MINUS_EXPR)
+       {
+         tree negated = make_ssa_name (vectype_out);
+         new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
+         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
+         def0 = negated;
+       }
+
+      if (mask)
+       def0 = merge_with_identity (gsi, mask, vectype_out, def0,
+                                   vector_identity);
+
+      /* On the first iteration the input is simply the scalar phi
+        result, and for subsequent iterations it is the output of
+        the preceding operation.  */
+      tree expr = build2 (reduc_code, scalar_type, reduc_var, def0);
+
+      /* For chained SLP reductions the output of the previous reduction
+        operation serves as the input of the next. For the final statement
+        the output cannot be a temporary - we reuse the original
+        scalar destination of the last statement.  */
+      if (i == vec_num - 1)
+       reduc_var = scalar_dest;
+      else
+       reduc_var = vect_create_destination_var (scalar_dest, NULL);
+      new_stmt = gimple_build_assign (reduc_var, expr);
+
+      if (i == vec_num - 1)
+       {
+         SSA_NAME_DEF_STMT (reduc_var) = new_stmt;
+         /* For chained SLP stmt is the first statement in the group and
+            gsi points to the last statement in the group.  For non SLP stmt
+            points to the same location as gsi. In either case tmp_gsi and gsi
+            should both point to the same insertion point.  */
+         gcc_assert (scalar_dest_def == gsi_stmt (*gsi));
+         vect_finish_replace_stmt (scalar_dest_def, new_stmt);
+       }
+      else
+       {
+         reduc_var = make_ssa_name (reduc_var, new_stmt);
+         gimple_assign_set_lhs (new_stmt, reduc_var);
+         vect_finish_stmt_generation (stmt, new_stmt, gsi);
+       }
+
+      if (slp_node)
+       SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
+    }
+
+  if (!slp_node)
+    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+
+  return true;
+}
 
 /* Function is_nonwrapping_integer_induction.
 
@@ -6063,6 +6234,12 @@ vectorizable_reduction (gimple *stmt, gi
          return true;
        }
 
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+       /* Leave the scalar phi in place.  Note that checking
+          STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
+          for reductions involving a single statement.  */
+       return true;
+
       gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
       if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
        reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
@@ -6289,6 +6466,14 @@ vectorizable_reduction (gimple *stmt, gi
      directy used in stmt.  */
   if (reduc_index == -1)
     {
+      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "in-order reduction chain without SLP.\n");
+         return false;
+       }
+
       if (orig_stmt)
        reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
       else
@@ -6508,7 +6693,9 @@ vectorizable_reduction (gimple *stmt, gi
 
   vect_reduction_type reduction_type
     = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
-  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
+  if (orig_stmt
+      && (reduction_type == TREE_CODE_REDUCTION
+         || reduction_type == FOLD_LEFT_REDUCTION))
     {
       /* This is a reduction pattern: get the vectype from the type of the
          reduction variable, and get the tree-code from orig_stmt.  */
@@ -6555,13 +6742,22 @@ vectorizable_reduction (gimple *stmt, gi
   epilog_reduc_code = ERROR_MARK;
 
   if (reduction_type == TREE_CODE_REDUCTION
+      || reduction_type == FOLD_LEFT_REDUCTION
       || reduction_type == INTEGER_INDUC_COND_REDUCTION
       || reduction_type == CONST_COND_REDUCTION)
     {
-      if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code))
+      bool have_reduc_support;
+      if (reduction_type == FOLD_LEFT_REDUCTION)
+       have_reduc_support = fold_left_reduction_code (orig_code, vectype_out,
+                                                      &epilog_reduc_code);
+      else
+       have_reduc_support
+         = reduction_code_for_scalar_code (orig_code, &epilog_reduc_code);
+
+      if (have_reduc_support)
        {
          reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out,
-                                         optab_default);
+                                            optab_default);
          if (!reduc_optab)
            {
              if (dump_enabled_p ())
@@ -6687,6 +6883,41 @@ vectorizable_reduction (gimple *stmt, gi
        }
     }
 
+  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
+    {
+      /* We can't support in-order reductions of code such as this:
+
+          for (int i = 0; i < n1; ++i)
+            for (int j = 0; j < n2; ++j)
+              l += a[j];
+
+        since GCC effectively transforms the loop when vectorizing:
+
+          for (int i = 0; i < n1 / VF; ++i)
+            for (int j = 0; j < n2; ++j)
+              for (int k = 0; k < VF; ++k)
+                l += a[j];
+
+        which is a reassociation of the original operation.  */
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "in-order double reduction not supported.\n");
+
+      return false;
+    }
+
+  if (reduction_type == FOLD_LEFT_REDUCTION
+      && slp_node
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We cannot use in-order reductions in this case because there is
+         an implicit reassociation of the operations involved.  */
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "in-order unchained SLP reductions not supported.\n");
+      return false;
+    }
+
   /* In case of widenning multiplication by a constant, we update the type
      of the constant to be the type of the other operand.  We check that the
      constant fits the type in the pattern recognition pass.  */
@@ -6807,9 +7038,10 @@ vectorizable_reduction (gimple *stmt, gi
        vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);
       if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
        {
-         if (cond_fn == IFN_LAST
-             || !direct_internal_fn_supported_p (cond_fn, vectype_in,
-                                                 OPTIMIZE_FOR_SPEED))
+         if (reduction_type != FOLD_LEFT_REDUCTION
+             && (cond_fn == IFN_LAST
+                 || !direct_internal_fn_supported_p (cond_fn, vectype_in,
+                                                     OPTIMIZE_FOR_SPEED)))
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -6844,6 +7076,11 @@ vectorizable_reduction (gimple *stmt, gi
 
   bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
 
+  if (reduction_type == FOLD_LEFT_REDUCTION)
+    return vectorize_fold_left_reduction
+      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
+       epilog_reduc_code, ops, vectype_in, reduc_index, masks);
+
   if (reduction_type == EXTRACT_LAST_REDUCTION)
     {
       gcc_assert (!slp_node);
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.246852461 +0000
+++ gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.620954871 +0000
@@ -164,6 +164,7 @@ (define_c_enum "unspec" [
     UNSPEC_STN
     UNSPEC_INSR
     UNSPEC_CLASTB
+    UNSPEC_FADDA
 ])
 
 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.246852461 +0000
+++ gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.620040195 +0000
@@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
   "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
 )
 
+;; Unpredicated in-order FP reductions.
+(define_expand "fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+       (unspec:<VEL> [(match_dup 3)
+                      (match_operand:<VEL> 1 "register_operand")
+                      (match_operand:SVE_F 2 "register_operand")]
+                     UNSPEC_FADDA))]
+  "TARGET_SVE"
+  {
+    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
+  }
+)
+
+;; In-order FP reductions predicated with PTRUE.
+(define_insn "*fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+       (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
+                      (match_operand:<VEL> 2 "register_operand" "0")
+                      (match_operand:SVE_F 3 "register_operand" "w")]
+                     UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
+)
+
+;; Predicated form of the above in-order reduction.
+(define_insn "*pred_fold_left_plus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand" "=w")
+       (unspec:<VEL>
+         [(match_operand:<VEL> 1 "register_operand" "0")
+          (unspec:SVE_F
+            [(match_operand:<VPRED> 2 "register_operand" "Upl")
+             (match_operand:SVE_F 3 "register_operand" "w")
+             (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
+            UNSPEC_SEL)]
+         UNSPEC_FADDA))]
+  "TARGET_SVE"
+  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
+)
+
 ;; Unpredicated floating-point addition.
 (define_expand "add<mode>3"
   [(set (match_operand:SVE_F 0 "register_operand")
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.627357602 +0000
@@ -7180,6 +7180,12 @@ proc check_effective_target_vect_fold_ex
     return [check_effective_target_aarch64_sve]
 }
 
+# Return 1 if the target supports the fold_left_plus optab.
+
+proc check_effective_target_vect_fold_left_plus { } {
+    return [check_effective_target_aarch64_sve]
+}
+
 # Return 1 if the target supports section-anchors
 
 proc check_effective_target_section_anchors { } {
Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 16:52:07.625528250 +0000
@@ -34,4 +34,4 @@ int main (void)
 }
 
 /* Requires fast-math.  */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_fold_left_plus } } } } */
Index: gcc/testsuite/gcc.dg/vect/pr79920.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.625528250 +0000
@@ -41,4 +41,5 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_double && { ! vect_fold_left_plus } } && { vect_perm && vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_double && vect_fold_left_plus } && { vect_perm && vect_hw_misalign } } } } } */
Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 16:52:07.625528250 +0000
@@ -46,5 +46,9 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */
+/* 2 for the first loop.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { target { ! vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "Detected reduction\\." "vect" { target vect_multiple_sizes } } } */
+/* { dg-final { scan-tree-dump-times "not vectorized" 1 "vect" { target { ! vect_multiple_sizes } } } } */
+/* { dg-final { scan-tree-dump "not vectorized" "vect" { target vect_multiple_sizes } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.625528250 +0000
@@ -50,4 +50,5 @@ int main (void)
 
 /* need -ffast-math to vectorizer these loops.  */
 /* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_fold_left_plus } xfail arm_neon_ok } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_fold_left_plus } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
===================================================================
--- /dev/null   2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c       2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)                   \
+  TYPE __attribute__ ((noinline, noclone))     \
+  reduc_plus_##TYPE (TYPE *a, TYPE *b)         \
+  {                                            \
+    TYPE r = 0, q = 3;                         \
+    for (int i = 0; i < NUM_ELEMS(TYPE); i++)  \
+      {                                                \
+       r += a[i];                              \
+       q -= b[i];                              \
+      }                                                \
+    return r * q;                              \
+  }
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
===================================================================
--- /dev/null   2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c   2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,29 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_1.c"
+
+#define TEST_REDUC_PLUS(TYPE)                  \
+  {                                            \
+    TYPE a[NUM_ELEMS (TYPE)];                  \
+    TYPE b[NUM_ELEMS (TYPE)];                  \
+    TYPE r = 0, q = 3;                         \
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
+      {                                                \
+       a[i] = (i * 0.1) * (i & 1 ? 1 : -1);    \
+       b[i] = (i * 0.3) * (i & 1 ? 1 : -1);    \
+       r += a[i];                              \
+       q -= b[i];                              \
+       asm volatile ("" ::: "memory");         \
+      }                                                \
+    TYPE res = reduc_plus_##TYPE (a, b);       \
+    if (res != r * q)                          \
+      __builtin_abort ();                      \
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
===================================================================
--- /dev/null   2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c       2017-11-17 16:52:07.625528250 +0000
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
+
+#define DEF_REDUC_PLUS(TYPE)                                   \
+void __attribute__ ((noinline, noclone))                       \
+reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS(TYPE)],                \
+                  TYPE *restrict r, int n)                     \
+{                                                              \
+  for (int i = 0; i < n; i++)                                  \
+    {                                                          \
+      r[i] = 0;                                                        \
+      for (int j = 0; j < NUM_ELEMS(TYPE); j++)                        \
+        r[i] += a[i][j];                                       \
+    }                                                          \
+}
+
+#define TEST_ALL(T) \
+  T (_Float16) \
+  T (float) \
+  T (double)
+
+TEST_ALL (DEF_REDUC_PLUS)
+
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
===================================================================
--- /dev/null   2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c   2017-11-17 16:52:07.626442926 +0000
@@ -0,0 +1,31 @@
+/* { dg-do run { target { aarch64_sve_hw } } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
+
+#include "sve_reduc_strict_2.c"
+
+#define NROWS 5
+
+#define TEST_REDUC_PLUS(TYPE)                                  \
+  {                                                            \
+    TYPE a[NROWS][NUM_ELEMS (TYPE)];                           \
+    TYPE r[NROWS];                                             \
+    TYPE expected[NROWS] = {};                                 \
+    for (int i = 0; i < NROWS; ++i)                            \
+      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)               \
+       {                                                       \
+         a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);     \
+         expected[i] += a[i][j];                               \
+         asm volatile ("" ::: "memory");                       \
+       }                                                       \
+    reduc_plus_##TYPE (a, r, NROWS);                           \
+    for (int i = 0; i < NROWS; ++i)                            \
+      if (r[i] != expected[i])                                 \
+       __builtin_abort ();                                     \
+  }
+
+int __attribute__ ((optimize (1)))
+main ()
+{
+  TEST_ALL (TEST_REDUC_PLUS);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
===================================================================
--- /dev/null   2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c       2017-11-17 16:52:07.626442926 +0000
@@ -0,0 +1,131 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve 
-msve-vector-bits=256 -fdump-tree-vect-details" } */
+
+double mat[100][4];
+double mat2[100][8];
+double mat3[100][12];
+double mat4[100][3];
+
+double
+slp_reduc_plus (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat[i][0];
+      tmp = tmp + mat[i][1];
+      tmp = tmp + mat[i][2];
+      tmp = tmp + mat[i][3];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus2 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat2[i][0];
+      tmp = tmp + mat2[i][1];
+      tmp = tmp + mat2[i][2];
+      tmp = tmp + mat2[i][3];
+      tmp = tmp + mat2[i][4];
+      tmp = tmp + mat2[i][5];
+      tmp = tmp + mat2[i][6];
+      tmp = tmp + mat2[i][7];
+    }
+  return tmp;
+}
+
+double
+slp_reduc_plus3 (int n)
+{
+  double tmp = 0.0;
+  for (int i = 0; i < n; i++)
+    {
+      tmp = tmp + mat3[i][0];
+      tmp = tmp + mat3[i][1];
+      tmp = tmp + mat3[i][2];
+      tmp = tmp + mat3[i][3];
+      tmp = tmp + mat3[i][4];
+      tmp = tmp + mat3[i][5];
+      tmp = tmp + mat3[i][6];
+      tmp = tmp + mat3[i][7];
+      tmp = tmp + mat3[i][8];
+      tmp = tmp + mat3[i][9];
+      tmp = tmp + mat3[i][10];
+      tmp = tmp + mat3[i][11];
+    }
+  return tmp;
+}
+
+void
+slp_non_chained_reduc (int n, double * restrict out)
+{
+  for (int i = 0; i < 3; i++)
+    out[i] = 0;
+
+  for (int i = 0; i < n; i++)
+    {
+      out[0] = out[0] + mat4[i][0];
+      out[1] = out[1] + mat4[i][1];
+      out[2] = out[2] + mat4[i][2];
+    }
+}
+
+/* Strict FP reductions shouldn't be used for the outer loops, only the
+   inner loops.  */
+
+float
+double_reduc1 (float (*restrict i)[16])
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      l += i[b][a];
+  return l;
+}
+
+float
+double_reduc2 (float *restrict i)
+{
+  float l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 16; b++)
+      {
+        l += i[b * 4];
+        l += i[b * 4 + 1];
+        l += i[b * 4 + 2];
+        l += i[b * 4 + 3];
+      }
+  return l;
+}
+
+float
+double_reduc3 (float *restrict i, float *restrict j)
+{
+  float k = 0, l = 0;
+
+  for (int a = 0; a < 8; a++)
+    for (int b = 0; b < 8; b++)
+      {
+        k += i[b];
+        l += j[b];
+      }
+  return l * k;
+}
+
+/* We can't yet handle double_reduc1.  */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
+/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
+   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
+   for 64-bit AdvSIMD.  */
+/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } 
*/
+/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
+   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
+   before failing.  */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 16:52:07.626442926 +0000
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve 
-msve-vector-bits=scalable" } */
+/* The cost model thinks that the double loop isn't a win for SVE-128.  */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve 
-msve-vector-bits=scalable -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
@@ -24,7 +25,10 @@ #define TEST_ALL(T)                          \
   T (int32_t)                                  \
   T (uint32_t)                                 \
   T (int64_t)                                  \
-  T (uint64_t)
+  T (uint64_t)                                 \
+  T (_Float16)                                 \
+  T (float)                                    \
+  T (double)
 
 TEST_ALL (VEC_PERM)
 
@@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
 /* ??? We don't treat the uint loops as SLP.  */
 /* The loop should be fully-masked.  */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
 /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
 
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
+/* { dg-final { scan-assembler-not {\tfadd\n} } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
===================================================================
--- gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.246852461 +0000
+++ gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.626442926 +0000
@@ -704,5 +704,6 @@ CALL track('KERNEL  ')
 RETURN
 END SUBROUTINE kernel
 
-! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
 ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
+! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt && { ! vect_fold_left_plus } } } } }
+! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target { vect_intdouble_cvt && vect_fold_left_plus } } } }
