Thanks as always for the review.

Ayal Zaks <ayal.z...@gmail.com> writes:
>> Richard Sandiford <richard.sandif...@linaro.org> wrote on 30/08/2011
>> 03:29:26 PM:
>>
>> > From: Richard Sandiford <richard.sandif...@linaro.org>
>> > To: gcc-patches@gcc.gnu.org
>> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
>> > Date: 30/08/2011 03:29 PM
>> > Subject: [4/4] Make SMS schedule register moves
>> >
>> > This is the move-scheduling patch itself.  It should be fairly
>> > self-explanatory.  Let me know if it isn't, and I'll try to improve
>> > the commentary.
>>
>> >
>> > One potentially controversial change is to the way we handle moves
>> > in the prologue and epilogue.  The current code uses a conservative
>
>
>  - conservative>>accurate? (Your approach seems to be more "conservative")

They're both conservative (and yeah, mine is more conservative :-)).
What I was trying to say is that the current code already generates
more moves than necessary.  If ddg node N is "R = foo", the current code
generates enough moves to handle definitions of R from both the prologue
copies of N _and_ the incoming edge (i.e. the value defined outside the
loop, such as an accumulator being initialised to zero).  But the code
doesn't check whether this initial definition actually exists.  If it
doesn't -- because we're dealing with an intra-loop dependency --
we generate one more move than necessary.

If we wanted to be accurate, we'd need to check whether there's
an incoming value or not.

The current code is also conservative in that the epilogue can end up
with sequences like:

    R1 = foo
    R2 = R1 (from a move)
    ... no set of R1, because we've done with that stage...
    R3 = ...R2...

where the last instruction could use R1 directly instead.
But none of this matters because later passes (including IRA)
fix it up nicely.

So this was all just a big excuse on my part along the lines of
"we already rely on this stuff being cleaned up, and it is,
so let's keep things simple".

>> > check to decide which stages need which moves.  This check copes
>> > with values that are live before the loop, and emits some moves
>> > that aren't actually needed.  The code also emits chains of moves
>> > that can be trivially optimised through propagation.  We rely on
>> > later patches to clean this up for us (and they do).
>> >
>> > So, rather than keep a rather complicated check here, I've simply
>> > emitted the moves for all stages.  In principle, that will generate
>> > a little more dead code than now in cases where chains of two moves
>> > are needed, but that seems to be very rare (when moves are scheduled).
>
> indeed, it's better to simplify code generation and rely on subsequent
> cleanups. Not sure about generating more (dead?) code in principle;
> could you elaborate, for the record?

Sorry, I wasn't very clear, but what I meant was: the patch generates
every move regardless of the stage, so it will in principle generate more
moves to unused registers than we do now.  (As above, we already generate
some such moves.)  Thoese insns will be dead, and will be later removed
as dead.  But in practice, the number of extra instructions seems to be
very low, and is usually zero.

>> >    * modulo-sched.c (extend_node_sched_params): New function.
>> >    (print_node_sched_params): Print the stage.
>> >    (set_columns_for_row, schedule_reg_move): New functions.
>> >    (set_columns_for_ps): Move up file and use set_columns_for_now.
>
>
> typo: now>>row (unless you intend this to be temporary ;-)

Doh! :-)

>> >    (schedule_reg_moves): Call extend_node_sched_params and
>> >    schedule_reg_move.  Extend size of uses bitmap.  Return false
>> >    if a move could not be scheduled.
>> >    (apply_reg_moves): Don't emit moves here.
>> >    (permute_partial_schedule): Handle register moves.
>> >    (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
>
>
> Always emit all moves

Oops, fixed.

>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>> +   The source of the move is provided by I_MUST_FOLLOW, which has
>> +   already been scheduled.
>
> The source y of an x=y move is defined by the previous iteration of
> the next move y=z (or by the original y=... move->def). We schedule
> moves one after the other bottom-up, starting from the original
> move->def moving backwards in cycles. This way, the instruction
> I_MUST_FOLLOW defining y is always already scheduled when we come to
> schedule the dependent I_REG_MOVE x=y move.

Right.

>> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
>> +     (the instruction that defines move->old_reg).
>
> So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
> next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
> overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
> to true dependence; i.e. account for latency also). Why do moves,
> except for the one closest to move->def (which is directly dependent
> upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
> about move->def at all? (Or have their cyclic lifetimes start and end
> there?)

Because the uses of new_reg belong to the same move->def based cycle.

E.g. say we have the following partial schedule, with [A]...[D] denoting
instructions:

  row 0:  empty
  row 1:  empty
  row 2:  [A] use R_2
  row 3:  [B] R_0 = ...
  row 4:  [C] use R_1
  row 5:  [D] use R_2

and with the following as-yet-unplaced moves:

  [M1] R_1 = R_0
  [M2] R_2 = R_1

The loop with scheduled moves must behave in the same way as it
would at the moment.  That is, it must behave "as if" the loop were:

LOOP 1:
  [A]  use R_2
  [M2] R_2 = R_1
  [M1] R_1 = R_0
  [B]  R_0 = ...
  [C]  use R_1
  [D]  use R_2

So (I think this is the uncontroversial bit): [M1] must be scheduled
"cyclically before" [B] and "cyclically after" [C], with the cycle
based at [B]:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            empty
    row 2:            [A]
    row 3 before [B]: empty

[M1] could therefore go in row 1.  This part is OK.

[M2] must then come "cyclically before" [M1] and "cyclically after" [A]
and [D].  If we based that cycle on [M1], we would have:

    row 1 after [M1]:  empty
    row 2:             [A]
    row 3:             [B]
    row 4:             [C]
    row 5:             [D]
    row 0:             empty
    row 1 before [M1]: empty

So it would seem that [M2] could go in row 0.  But that's incorrect,
it would lead to:

LOOP 2:
  [M2] R_2 = R_1
  [M1] R_1 = R_0
  [A]  use R_2
  [B]  R_0 = ...
  [C]  use R_1
  [D]  use R_2

and [A] now gets the wrong value of R_2.  In the first iteration of
loop 1, [A] uses the prologue-supplied value of R_2.  In loop 2 it
uses the prologue-supplied value of R_1.  I suppose it's effectively
moved up a stage.

If we keep the cycle at [B] when scheduling [M2], we have:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            [M1]
    row 2:            [A]
    row 3 before [B]: empty

We now have an empty scheduling window (successor [M1] comes
before predecessor [A]) and we reject the current ii.

> In general, there's a distinction between the "cycle" an instruction
> is scheduled at (for the first iteration), which is an absolute
> arbitrary integer, and the "row" it is placed in the PS, which is
> between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
> clearer to reason about cycles, especially if your window includes a
> row twice.
>
> In addition to the (true+anti) dependences involving I_REG_MOVE with
> I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.
>
> We could alternatively set these dependencies of I_REG_MOVE on
> I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
> presumably simpler to handle it directly here.
>
> Right?

Well, I think get_sched_window would need a fair bit of surgery to
handle both cases.  And IMO it'd be a bit of a false abstraction.
In this case, we always want things to be scheduled close to the
successor (I_MUST_FOLLOW), even if there are several predecessors
(the uses).  But if there are more predecessors than successors,
get_sched_window would normally prefer to schedule closer to the
predecessors.  So I think we'd still be treating them as two separate
cases to some extent.

As you say, doing it here is much simpler :-)

Richard

Reply via email to