On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
>
> Hi Richard,
>
> On 13/11/18 08:24, Richard Biener wrote:
> > On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> > <kyrylo.tkac...@foss.arm.com> wrote:
> >>
> >> On 12/11/18 14:10, Richard Biener wrote:
> >>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> >>> <kyrylo.tkac...@foss.arm.com> wrote:
> >>>> On 09/11/18 12:18, Richard Biener wrote:
> >>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> >>>>> <kyrylo.tkac...@foss.arm.com> wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> In this testcase the codegen for VLA SVE is worse than it could be due 
> >>>>>> to unrolling:
> >>>>>>
> >>>>>> fully_peel_me:
> >>>>>>            mov     x1, 5
> >>>>>>            ptrue   p1.d, all
> >>>>>>            whilelo p0.d, xzr, x1
> >>>>>>            ld1d    z0.d, p0/z, [x0]
> >>>>>>            fadd    z0.d, z0.d, z0.d
> >>>>>>            st1d    z0.d, p0, [x0]
> >>>>>>            cntd    x2
> >>>>>>            addvl   x3, x0, #1
> >>>>>>            whilelo p0.d, x2, x1
> >>>>>>            beq     .L1
> >>>>>>            ld1d    z0.d, p0/z, [x0, #1, mul vl]
> >>>>>>            fadd    z0.d, z0.d, z0.d
> >>>>>>            st1d    z0.d, p0, [x3]
> >>>>>>            cntw    x2
> >>>>>>            incb    x0, all, mul #2
> >>>>>>            whilelo p0.d, x2, x1
> >>>>>>            beq     .L1
> >>>>>>            ld1d    z0.d, p0/z, [x0]
> >>>>>>            fadd    z0.d, z0.d, z0.d
> >>>>>>            st1d    z0.d, p0, [x0]
> >>>>>> .L1:
> >>>>>>            ret
> >>>>>>
> >>>>>> In this case, due to the vector-length-agnostic nature of SVE the 
> >>>>>> compiler doesn't know the loop iteration count.
> >>>>>> For such loops we don't want to unroll if we don't end up eliminating 
> >>>>>> branches as this just bloats code size
> >>>>>> and hurts icache performance.
> >>>>>>
> >>>>>> This patch introduces a new unroll-known-loop-iterations-only param 
> >>>>>> that disables cunroll when the loop iteration
> >>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often 
> >>>>>> for SVE VLA code, but it does help some
> >>>>>> Advanced SIMD cases as well where loops with an unknown iteration 
> >>>>>> count are not unrolled when it doesn't eliminate
> >>>>>> the branches.
> >>>>>>
> >>>>>> So for the above testcase we generate now:
> >>>>>> fully_peel_me:
> >>>>>>            mov     x2, 5
> >>>>>>            mov     x3, x2
> >>>>>>            mov     x1, 0
> >>>>>>            whilelo p0.d, xzr, x2
> >>>>>>            ptrue   p1.d, all
> >>>>>> .L2:
> >>>>>>            ld1d    z0.d, p0/z, [x0, x1, lsl 3]
> >>>>>>            fadd    z0.d, z0.d, z0.d
> >>>>>>            st1d    z0.d, p0, [x0, x1, lsl 3]
> >>>>>>            incd    x1
> >>>>>>            whilelo p0.d, x1, x3
> >>>>>>            bne     .L2
> >>>>>>            ret
> >>>>>>
> >>>>>> Not perfect still, but it's preferable to the original code.
> >>>>>> The new param is enabled by default on aarch64 but disabled for other 
> >>>>>> targets, leaving their behaviour unchanged
> >>>>>> (until other target people experiment with it and set it, if 
> >>>>>> appropriate).
> >>>>>>
> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> >>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences 
> >>>>>> in performance.
> >>>>>>
> >>>>>> Ok for trunk?
> >>>>> Hum.  Why introduce a new --param and not simply key on
> >>>>> flag_peel_loops instead?  That is
> >>>>> enabled by default at -O3 and with FDO but you of course can control
> >>>>> that in your targets
> >>>>> post-option-processing hook.
> >>>> You mean like this?
> >>>> It's certainly a simpler patch, but I was just a bit hesitant of making 
> >>>> this change for all targets :)
> >>>> But I suppose it's a reasonable change.
> >>> No, that change is backward.  What I said is that peeling is already
> >>> conditional on
> >>> flag_peel_loops and that is enabled by -O3.  So you want to disable
> >>> flag_peel_loops for
> >>> SVE instead in the target.
> >> Sorry, I got confused by the similarly named functions.
> >> I'm talking about try_unroll_loop_completely when run as part of 
> >> canonicalize_induction_variables i.e. the "ivcanon" pass
> >> (sorry about blaming cunroll here). This doesn't get called through the 
> >> try_unroll_loops_completely path.
> > Well, peeling gets disabled.  From your patch I see you want to
> > disable "unrolling" when
> > the number of loop iteration is not constant.  That is called peeling
> > where we need to
> > emit the loop exit test N times.
> >
> > Did you check your testcases with -fno-peel-loops?
>
> -fno-peel-loops doesn't help in the testcases. The code that does this 
> peeling (try_unroll_loop_completely)
> can be called through two paths, only one of which is gated on 
> flag_peel_loops.

I don't see the obvious here so I have to either sit down with a
non-SVE specific testcase
showing this, or I am misunderstanding the actual transform that you
want to avoid.
allow_peel is false when called from canonicalize_induction_variables.
There's the slight
chance that UL_NO_GROWTH lets through cases - is your case one of
that?  That is,
does the following help?

Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
+++ gcc/tree-ssa-loop-ivcanon.c (working copy)
@@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
     exit = NULL;

   /* See if we can improve our estimate by using recorded loop bounds.  */
-  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
+  if ((allow_peel || maxiter == 0)
       && maxiter >= 0
       && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
     {

IIRC I allowed that case when adding allow_peel simply because it avoided some
testsuite regressions.  This means you eventually want to work on the
size estimate
of SVE style loops?

Richard.

> Thanks,
> Kyrill
>
>
> >> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or 
> >> -fno-unroll-loops.
> >> Maybe disabling peeling inside try_unroll_loop_completely itself when 
> >> !flag_peel_loops is viable?
> >>
> >> Thanks,
> >> Kyrill
> >>
> >>>>> It might also make sense to have more fine-grained control for this
> >>>>> and allow a target
> >>>>> to say whether it wants to peel a specific loop or not when the
> >>>>> middle-end thinks that
> >>>>> would be profitable.
> >>>> Can be worth looking at as a follow-up. Do you envisage the target 
> >>>> analysing
> >>>> the gimple statements of the loop to figure out its cost?
> >>> Kind-of.  Sth like
> >>>
> >>>     bool targetm.peel_loop (struct loop *);
> >>>
> >>> I have no idea whether you can easily detect a SVE vectorized loop though.
> >>> Maybe there's always a special IV or so (the mask?)
> >>>
> >>> Richard.
> >>>
> >>>> Thanks,
> >>>> Kyrill
> >>>>
> >>>>
> >>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> >>>>
> >>>>           * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not 
> >>>> unroll
> >>>>           loop when number of iterations is not known and 
> >>>> flag_peel_loops is in
> >>>>           effect.
> >>>>
> >>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> >>>>
> >>>>           * gcc.target/aarch64/sve/unroll-1.c: New test.
> >>>>
>

Reply via email to