On Mon, Jun 19, 2023 at 8:35 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The following works around the fact that the x86 backend does not
> make the vectorizer compare the costs of the different possible
> vector sizes the backend advertises through the vector_modes hook.
> When enabling masked epilogues or main loops this means we will
> select the preferred vector mode, which is usually the largest, even
> for loops that iterate far fewer times than the vector has lanes.
> When not using masking the vectorizer would reject any mode
> resulting in a VF bigger than the number of iterations, but with
> masking the excess lanes are simply masked out.
>
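> For illustration (this mirrors the new testcases below): with
> -mprefer-vector-width=512 and --param vect-partial-vector-usage=1,
> a loop such as
>
>   void foo (int * __restrict a, int *b)
>   {
>     /* Only four iterations: an unmasked V4SI body covers all lanes,
>        but the preferred V16SI mode wins and is masked down to 4.  */
>     for (int i = 0; i < 4; ++i)
>       a[i] = b[i] + 42;
>   }
>
> gets a masked zmm body (without this patch) where a plain xmm one
> would do.
>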
> So this overloads the finish_cost function, matches the problematic
> case, and forces a high cost to make us try a smaller vector size.
>
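> To make the check concrete: exact_log2 (VF) > ceil_log2 (niters)
> rejects a mode whenever its VF exceeds the iteration count rounded
> up to the next power of two.  For the 7-iteration testcase below
> ceil_log2 (7) = 3, so V16SI (VF 16) is rejected while V8SI (VF 8)
> survives, which yields the masked ymm loop the test expects.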
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  This should
> avoid regressing 525.x264_r with partial vector epilogues and
> instead improve it by 25% with -march=znver4 (I need to re-check
> that; it was true with an earlier attempt).
>
> This falls short of enabling cost comparison in the x86 backend,
> which I also considered doing for --param vect-partial-vector-usage=1,
> but that would cause much larger churn and compile-time impact
> (though it should be bearable, as seen with aarch64).
>
> I've filed PR110310 for an oddity I noticed around vectorizing
> epilogues; I failed to adjust things for the case in that PR.
>
> I'm using INT_MAX to fend off the vectorizer; I wondered whether we
> should be able to signal that with a bool return value from
> finish_cost?  Though INT_MAX seems to work fine.
>
> Does this look reasonable?
Reasonable to me, even for VECT_COMPARE_COSTS.
>
> Thanks,
> Richard.
>
>         * config/i386/i386.cc (ix86_vector_costs::finish_cost):
>         Overload.  For masked main loops make sure the vectorization
>         factor isn't more than double the number of iterations.
>
>
>         * gcc.target/i386/vect-partial-vectors-1.c: New testcase.
>         * gcc.target/i386/vect-partial-vectors-2.c: Likewise.
> ---
>  gcc/config/i386/i386.cc                       | 26 +++++++++++++++++++
>  .../gcc.target/i386/vect-partial-vectors-1.c  | 13 ++++++++++
>  .../gcc.target/i386/vect-partial-vectors-2.c  | 12 +++++++++
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b20cb86b822..32851a514a9 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23666,6 +23666,7 @@ class ix86_vector_costs : public vector_costs
>                               stmt_vec_info stmt_info, slp_tree node,
>                               tree vectype, int misalign,
>                               vect_cost_model_location where) override;
> +  void finish_cost (const vector_costs *) override;
>  };
>
>  /* Implement targetm.vectorize.create_costs.  */
> @@ -23918,6 +23919,31 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    return retval;
>  }
>
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> +  if (loop_vinfo && !m_costing_for_scalar)
> +    {
> +      /* We are currently not asking the vectorizer to compare costs
> +        between different vector mode sizes.  When using predication
> +        that will end up always choosing the preferred mode size even
> +        if there's a smaller mode covering all lanes.  Test for this
> +        situation and artificially reject the larger mode attempt.
> +        ???  We currently lack masked ops for sub-SSE sized modes,
> +        so we could restrict this rejection to AVX and AVX512 modes
> +        but err on the safe side for now.  */
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +         && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +         && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +         && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> +             > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> +       m_costs[vect_body] = INT_MAX;
> +    }
> +
> +  vector_costs::finish_cost (scalar_costs);
> +}
> +
>  /* Validate target specific memory model bits in VAL. */
>
>  static unsigned HOST_WIDE_INT
> diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
> new file mode 100644
> index 00000000000..3834720e8e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
> +
> +void foo (int * __restrict a, int *b)
> +{
> +  for (int i = 0; i < 4; ++i)
> +    a[i] = b[i] + 42;
> +}
> +
> +/* We do not want to optimize this using masked AVX or AVX512
> +   but unmasked SSE.  */
> +/* { dg-final { scan-assembler-not "\[yz\]mm" } } */
> +/* { dg-final { scan-assembler "xmm" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
> new file mode 100644
> index 00000000000..4ab2cbc4203
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
> +
> +void foo (int * __restrict a, int *b)
> +{
> +  for (int i = 0; i < 7; ++i)
> +    a[i] = b[i] + 42;
> +}
> +
> +/* We want to optimize this using masked AVX, not AVX512 or SSE.  */
> +/* { dg-final { scan-assembler-not "zmm" } } */
> +/* { dg-final { scan-assembler "ymm\[^\r\n\]*\{%k" } } */
> --
> 2.35.3



-- 
BR,
Hongtao
