Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> In gimple the operation
>
> short _8;
> double _9;
> _9 = (double) _8;
>
> denotes two operations.  First we have to widen from short to long and then
> convert this integer to a double.

Think it's worth saying "two operations on AArch64".  Some targets
can do int->double directly.
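
For reference, the vectorised form of the conversion above ends up as
something like this on AArch64 (an illustrative sketch, not necessarily
the exact instructions GCC emits for the testcase):

  sxtl   v1.4s, v0.4h   // widen: short -> int
  sxtl   v2.2d, v1.2s   // widen: int -> long
  scvtf  v2.2d, v2.2d   // convert: long -> double

with the final scvtf being the conversion that currently goes uncosted.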

Saying that would explain...

> Currently however we only count the widen/truncate operations:
>
> (double) _5 6 times vec_promote_demote costs 12 in body
> (double) _5 12 times vec_promote_demote costs 24 in body
>
> but not the actual conversion operation, which needs an additional 12
> instructions in the attached testcase.  Without this the attached testcase
> ends up incorrectly thinking that it's beneficial to vectorize the loop at
> a very high VF = 8 (4x unrolled).
>
> Because we can't change the mid-end to account for this, the costing code in the

...why we can't do this.

> backend now keeps track of whether the previous operation was a
> promotion/demotion and adjusts the expected number of instructions to:
>
> 1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs
>    differ, double it, since we need to convert and promote.
> 2. If the previous operation was a demotion/promotion then reduce the
>    cost of the current operation by the extra amount we added for the
>    last one.
>
> With the patch we get:
>
> (double) _5 6 times vec_promote_demote costs 24 in body
> (double) _5 12 times vec_promote_demote costs 36 in body
>
> which correctly accounts for 30 operations.
>
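
FWIW, tracing the new code below through these dumps: the first statement
is costed as 6 * 2 - 0 = 12 insns and the second as 12 * 2 - 6 = 18,
i.e. 12 + 18 = 30 operations (24 + 36 cost units at 2 apiece), where
previously only 6 + 12 = 18 operations were counted.  That accounts for
the 12 missing instructions mentioned earlier.
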
> This fixes the regression reported on Neoverse N2 and when using the new
> generic Armv9-a cost model.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       PR target/110625
>       * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
>       Adjust throughput and latency calculations for vector conversions.
>       (class aarch64_vector_costs): Add m_num_last_promote_demote.
>
> gcc/testsuite/ChangeLog:
>
>       PR target/110625
>       * gcc.target/aarch64/pr110625_4.c: New test.
>       * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
>       --param aarch64-sve-compare-costs=0.
>       * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index f9850320f61c5ddccf47e6583d304e5f405a484f..5622221413e52717974b96f79cc83008f237c536 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16077,6 +16077,15 @@ private:
>       leaving a vectorization of { elts }.  */
>    bool m_stores_to_vector_load_decl = false;
>  
> +  /* Non-zero if the last operation we costed is a vector promotion or demotion.
> +     In this case the value is the number of insn in the last operation.

s/insn/insns/

OK with those changes.  Thanks for tracking this down and working
out what was missing.

Richard

> +
> +     On AArch64 vector promotion and demotions require us to first widen or
> +     narrow the input and only after that emit conversion instructions.  For
> +     costing this means we need to emit the cost of the final conversions as
> +     well.  */
> +  unsigned int m_num_last_promote_demote = 0;
> +
>    /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
>       - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
>         SIMD code.
> @@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>      stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
>                                             vectype, stmt_cost);
>  
> +  /*  Vector promotion and demotion requires us to widen the operation first
> +      and only after that perform the conversion.  Unfortunately the mid-end
> +      expects this to be doable as a single operation and doesn't pass on
> +      enough context here for us to tell which operation is happening.  To
> +      account for this we count every promote-demote operation twice and if
> +      the previously costed operation was also a promote-demote we reduce
> +      the cost of the currently being costed operation to simulate the final
> +      conversion cost.  Note that for SVE we can do better here if the converted
> +      value comes from a load since the widening load would consume the widening
> +      operations.  However since we're in stage 3 we can't change the helper
> +      vect_is_extending_load and duplicating the code seems not useful.  */
> +  gassign *assign = NULL;
> +  if (kind == vec_promote_demote
> +      && (assign = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
> +      && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
> +    {
> +      auto new_count = count * 2 - m_num_last_promote_demote;
> +      m_num_last_promote_demote = count;
> +      count = new_count;
> +    }
> +  else
> +    m_num_last_promote_demote = 0;
> +
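
(So, as I read the new logic: each FLOAT_EXPR promote/demote is costed at
twice its count up front, minus the extra that the previous promote/demote
already added, and any other kind of statement resets
m_num_last_promote_demote to zero.)
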
>    if (stmt_info && aarch64_use_new_vector_costs_p ())
>      {
>        /* Account for any extra "embedded" costs that apply additively
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_4.c b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..34dac19d81a85d63706d54f4cb0c738ce592d5d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mcpu=neoverse-n2 -fdump-tree-vect-details" } */
> +
> +typedef struct {
> +  short blue, green, red, opacity;
> +} Pixel;
> +
> +double foo (long n, double *k, Pixel *k_pixels) {
> +  double result_2, result_1, result_0;
> +  for (; n; n++, k--) {
> +    result_0 += *k * k_pixels[n].red;
> +    result_1 += *k * k_pixels[n].green;
> +    result_2 += *k * k_pixels[n].blue;
> +  }
> +  return result_0 + result_1 + result_2;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> index 0f96dc2ff007340541c2ba7d51e1ccfa0f3f2d39..4c5e88657408f61156035012212ed542fac45efb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize -fno-inline" } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline --param 
> aarch64-sve-compare-costs=0" } */
>  
>  #include <stdint.h>
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> index 70465f91eba4f80140b2059481eb8f06bbc9ace7..3ff2bd127756b2ff08095513b09325db4779ba02 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize" } */
> +/* { dg-options "-O2 -ftree-vectorize --param aarch64-sve-compare-costs=0" } 
> */
>  
>  #include <stdint.h>
