Re: [4/4] Make SMS schedule register moves

2011-10-11 Thread Richard Sandiford
Ayal Zaks ayal.z...@gmail.com writes:
 The issue of assigning stages to reg-moves is mostly relevant for
 prolog and epilog generation, which requires and receives special
 attention -- handled very nicely by ps_num_consecutive_stages! Note
 that currently a simple boolean indicator for (the exceptional case
 of) double stages would suffice, instead of generalizing to arbitrary
 nums of consecutive stages (see any potential use for them?).

 Not in the immediate term.  But I think having a boolean indicator
 would be inconsistent.  If the distance field is an int (even though
 we only expect distance-0 and distance-1 register dependencies)
 then I think the number of stages should be too.

 I did wonder originally about using a boolean, but IMO, it makes
 the code less readable rather than more.  Instead of a simple
 range check like:

     if (first_stage_for_insn = last_stage_in_range
          last_stage_for_insn = first_stage_in_range)

 we end up with the equivalent of:

     if (first_stage_for_insn = last_stage_in_range
          (double_stage_move_p (...)
             ? first_stage_for_insn + 1 = first_stage_in_range
             : first_stage_for_insn = first_stage_in_range))

 with no corresponding simplification elsewhere.


 Sure. But setting the range can be done by consulting an simple
 indicator, rather than generalizing to arbitrary stage numbers; e.g.:

 +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
 +{
 +  if (id = ps-g-num_nodes  ps_reg_move (ps, id)-double_stages)
 +return 2;
 +  else
 +return 1;
 +}

 or

 -  last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
 +  if (...double_stages) last_u = first_u + 1;
 +  else last_u = first_u;

Understood.  I still prefer the posted version though.

 E.g. adding something like this at the end:

   ??? The algorithm restricts the scheduling window to II cycles.
   In rare cases, it may be better to allow windows of II+1 cycles.
   The window would then start and end on the same row, but with
   different must precede and must follow requirements.

 Let me know what you think and I'll add it as a follow-on patch.


 great, thanks.

OK, added with the patch below.

 +
 +   The move is part of a chain that satisfies register dependencies
 +   between a producing ddg node and various consuming ddg nodes.
 +   If some of these dependencies cross a loop iteration (that is,
 +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
 +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
 +   is null otherwise.
 +

 Maybe clarify that they are upwards-exposed or live-in uses.

 OK, changed to:

   The move is part of a chain that satisfies register dependencies
   between a producing ddg node and various consuming ddg nodes.
   If some of these dependencies have a distance of 1 (meaning that
   the use is upward-exposoed) then DISTANCE1_USES is nonnull and

 exposed (typo)

Oops, also fixed below (and applied).

Richard


gcc/
* modulo-sched.c: Fix comment typo.  Mention the possibility
of using scheduling windows of II+1 cycles.

Index: gcc/modulo-sched.c
===
--- gcc/modulo-sched.c  2011-10-10 12:42:41.0 +0100
+++ gcc/modulo-sched.c  2011-10-11 09:07:08.069166743 +0100
@@ -545,7 +545,7 @@ set_columns_for_ps (partial_schedule_ptr
The move is part of a chain that satisfies register dependencies
between a producing ddg node and various consuming ddg nodes.
If some of these dependencies have a distance of 1 (meaning that
-   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
+   the use is upward-exposed) then DISTANCE1_USES is nonnull and
contains the set of uses with distance-1 dependencies.
DISTANCE1_USES is null otherwise.
 
@@ -1810,7 +1810,11 @@ sms_schedule (void)
41. endif
42. compute epilogue  prologue
43. finish - succeeded to schedule
-*/
+
+   ??? The algorithm restricts the scheduling window to II cycles.
+   In rare cases, it may be better to allow windows of II+1 cycles.
+   The window would then start and end on the same row, but with
+   different must precede and must follow requirements.  */
 
 /* A limit on the number of cycles that resource conflicts can span.  ??? 
Should
be provided by DFA, and be dependent on the type of insn scheduled.  
Currently


Re: [4/4] Make SMS schedule register moves

2011-10-10 Thread Richard Sandiford
Ayal Zaks ayal.z...@gmail.com writes:
 I agree it's natural to schedule moves for intra-iteration dependencies
 in the normal get_sched_window way.  But suppose we have a dependency:

   A --(T,N,1)-- B

 that requires two moves M1 and M2.  If we think in terms of cycles
 (in the SCHED_TIME sense), then this effectively becomes:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

 because it is now M1 that is fed by both the loop and the incoming edge.
 But if there is a second dependency:

   A --(T,M,0)-- C

 that also requires two moves, we end up with:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B
                                        --(T,M3,-1)-- B

 and dependence distances of -1 feel a bit weird. :-)  Of course,
 what we really have are two parallel dependencies:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

   A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B

 where M1' and M2' occupy the same position as M1 and M2 in the schedule,
 but are one stage further along.  But we only schedule them once,
 so if we take the cycle/SCHED_TIME route, we have to introduce
 dependencies of distance -1.


 Interesting; had to digest this distance 1 business, a result of
 thinking in cycles instead of rows (or conversely), and mixing
 dependences with scheduling; here's my understanding, based on your
 explanations:

 Suppose a Use is truely dependent on a Def, where both have been
 scheduled at some absolute cycles; think of them as timing the first
 iteration of the loop.
 Assume first that Use appears originally after Def in the original
 instruction sequence of the loop (dependence distance 0). In this
 case, Use requires register moves if its distance D from Def according
 to the schedule is more than ii cycles long -- by the time Use is
 executed, the value it needs is no longer available in the def'd
 register due to intervening occurrences of Def. So in this case, the
 first reg-move (among D/ii) should appear after Def, recording its
 value before the next occurrence of Def overwrites it, and feeding
 subsequent moves as needed before each is overwritten. Thus the
 scheduling window of this first reg-move is within (Def, Def+ii).

 Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
 it remains scheduled before the Def, no reg-move is needed. If, otoh,
 Def is scheduled (D0 cycles) before Use, breaking the anti-dependence
 between them, (D/ii + 1) reg-moves are needed in order to feed Use
 with the live-in value before Def. So in this case, the first reg-move
 should appear before Def (again feeding subsequent moves as needed
 before each is overwritten). Thus the scheduling window of this first
 reg-move is within (Def-ii, Def).

 In any case, each use requires a specific number of reg-moves, which
 begin either before or after the def; it is always safe (though
 potentially redundant) to start reg-moving before the def -- uses that
 do not need the live-in value will ignore it by feeding from a later
 reg-move.

Right.  The distance on the Def-Use dependency is effectively transferred
to the dependency between the Def and first move.

And we can potentially have both kinds of Use at the same time.
We then end up scheduling two moves, one in each window, but require
them to occupy the same row and column.  It feels more convenient to
schedule the earlier of the two moves.

 Q: if we generate reg-moves starting from before the def, would
 redundant reg-moves be eliminated if no use needs access to live-in
 value? If so, would that simplify their generation? (If not, it may be
 interesting to understand how to improve DCE to catch it.)

To be clear, the new version of the patch (unlike the old one)
doesn't emit reg-moves before the def if all uses are distance 0.
We only do it where some or all uses are distance 1.  The first move
before the def is always needed in that case.

So redudant moves are confined to the case where (a) we have more
than one move, (b) we have both distance 0 and distance 1 uses, and
(c) one of the two distance sets requires more moves than the other.
If the distance 0 dependencies require more moves than the distance
1 dependencies, we will have a redudant move in the prologue.
If the distance 1 dependencies require more moves than the distance
0 dependencies, we will have a redudant move in the epilogue.
But I believe that this is already handled by dce.

(For avoidance of doubt, the new code is more accurate than
current trunk in this respect.)

 The issue of assigning stages to reg-moves is mostly relevant for
 prolog and epilog generation, which requires and receives special
 attention -- handled very nicely by ps_num_consecutive_stages! Note
 that currently a simple boolean indicator for (the exceptional case
 of) double stages would suffice, instead of generalizing to arbitrary
 nums of consecutive stages (see any potential use for them?).

Not in the immediate term.  But I think having a boolean indicator
would be inconsistent.  If 

Re: [4/4] Make SMS schedule register moves

2011-10-10 Thread Ayal Zaks
On Mon, Oct 10, 2011 at 1:57 PM, Richard Sandiford
richard.sandif...@linaro.org wrote:
 Ayal Zaks ayal.z...@gmail.com writes:
 I agree it's natural to schedule moves for intra-iteration dependencies
 in the normal get_sched_window way.  But suppose we have a dependency:

   A --(T,N,1)-- B

 that requires two moves M1 and M2.  If we think in terms of cycles
 (in the SCHED_TIME sense), then this effectively becomes:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

 because it is now M1 that is fed by both the loop and the incoming edge.
 But if there is a second dependency:

   A --(T,M,0)-- C

 that also requires two moves, we end up with:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B
                                        --(T,M3,-1)-- B

 and dependence distances of -1 feel a bit weird. :-)  Of course,
 what we really have are two parallel dependencies:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

   A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B

 where M1' and M2' occupy the same position as M1 and M2 in the schedule,
 but are one stage further along.  But we only schedule them once,
 so if we take the cycle/SCHED_TIME route, we have to introduce
 dependencies of distance -1.


 Interesting; had to digest this distance 1 business, a result of
 thinking in cycles instead of rows (or conversely), and mixing
 dependences with scheduling; here's my understanding, based on your
 explanations:

 Suppose a Use is truely dependent on a Def, where both have been
 scheduled at some absolute cycles; think of them as timing the first
 iteration of the loop.
 Assume first that Use appears originally after Def in the original
 instruction sequence of the loop (dependence distance 0). In this
 case, Use requires register moves if its distance D from Def according
 to the schedule is more than ii cycles long -- by the time Use is
 executed, the value it needs is no longer available in the def'd
 register due to intervening occurrences of Def. So in this case, the
 first reg-move (among D/ii) should appear after Def, recording its
 value before the next occurrence of Def overwrites it, and feeding
 subsequent moves as needed before each is overwritten. Thus the
 scheduling window of this first reg-move is within (Def, Def+ii).

 Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
 it remains scheduled before the Def, no reg-move is needed. If, otoh,
 Def is scheduled (D0 cycles) before Use, breaking the anti-dependence
 between them, (D/ii + 1) reg-moves are needed in order to feed Use
 with the live-in value before Def. So in this case, the first reg-move
 should appear before Def (again feeding subsequent moves as needed
 before each is overwritten). Thus the scheduling window of this first
 reg-move is within (Def-ii, Def).

 In any case, each use requires a specific number of reg-moves, which
 begin either before or after the def; it is always safe (though
 potentially redundant) to start reg-moving before the def -- uses that
 do not need the live-in value will ignore it by feeding from a later
 reg-move.

 Right.  The distance on the Def-Use dependency is effectively transferred
 to the dependency between the Def and first move.

 And we can potentially have both kinds of Use at the same time.
 We then end up scheduling two moves, one in each window, but require
 them to occupy the same row and column.  It feels more convenient to
 schedule the earlier of the two moves.

 Q: if we generate reg-moves starting from before the def, would
 redundant reg-moves be eliminated if no use needs access to live-in
 value? If so, would that simplify their generation? (If not, it may be
 interesting to understand how to improve DCE to catch it.)

 To be clear, the new version of the patch (unlike the old one)
 doesn't emit reg-moves before the def if all uses are distance 0.
 We only do it where some or all uses are distance 1.  The first move
 before the def is always needed in that case.


Understood. The question was whether it would simplify things if we
always emit reg-moves before the def. And rely on DCE to hopefully
eliminate redundancies.

 So redudant moves are confined to the case where (a) we have more
 than one move, (b) we have both distance 0 and distance 1 uses, and
 (c) one of the two distance sets requires more moves than the other.
 If the distance 0 dependencies require more moves than the distance
 1 dependencies, we will have a redudant move in the prologue.
 If the distance 1 dependencies require more moves than the distance
 0 dependencies, we will have a redudant move in the epilogue.
 But I believe that this is already handled by dce.

 (For avoidance of doubt, the new code is more accurate than
 current trunk in this respect.)

 The issue of assigning stages to reg-moves is mostly relevant for
 prolog and epilog generation, which requires and receives special
 attention -- handled very nicely by ps_num_consecutive_stages! Note
 that currently a simple 

Re: [4/4] Make SMS schedule register moves

2011-10-09 Thread Ayal Zaks
On Wed, Sep 28, 2011 at 4:49 PM, Richard Sandiford
richard.sandif...@linaro.org wrote:
 Ayal Zaks ayal.z...@gmail.com writes:
  +  /* The cyclic lifetime of move-new_reg starts and ends at move-def
  +     (the instruction that defines move-old_reg).
 
  So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
  next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
  overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
  to true dependence; i.e. account for latency also). Why do moves,
  except for the one closest to move-def (which is directly dependent
  upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry
  about move-def at all? (Or have their cyclic lifetimes start and end
  there?)

 Because the uses of new_reg belong to the same move-def based cycle.


 the cycle (overloaded term; rather iteration in this context) to
 which the uses belong, is inferred from the cycle (absolute schedule
 time) in which they are scheduled, regardless of move-def.

 Just to prove your point about cycle being an overloaded term: I wasn't
 actually meaning it in the sense of (loop) iteration.  I meant a circular
 window based on move-def.


Point proven ;-)

 So (I think this is the uncontroversial bit): [M1] must be scheduled
 cyclically before [B] and cyclically after [C], with the cycle
 based at [B]:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            empty
    row 2:            [A]
    row 3 before [B]: empty

 [M1] could therefore go in row 1.  This part is OK.


 Here's how I see it:
 [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
 before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
 of [B] which is scheduled at cycle 3, so must be scheduled after cycle
 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
 ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
 if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
 must_precede [B]. This is identical to the logic behind the
 sched_window of any instruction, based on its dependencies (as you've
 updated not too long ago..), if we do not allow reg_moves (and
 arguably, one should not allow reg_moves when scheduling
 reg_moves...).

 To address the potential erroneous scenario of Loop 2, suppose [A] is
 scheduled as in the beginning in cycle 20, and that [M1] is scheduled
 in cycle 7 (\in[4,9]). Then
 [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
 must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
 the result of [M1], so must be scheduled after cycle 7+1 and before
 cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.

 I agree it's natural to schedule moves for intra-iteration dependencies
 in the normal get_sched_window way.  But suppose we have a dependency:

   A --(T,N,1)-- B

 that requires two moves M1 and M2.  If we think in terms of cycles
 (in the SCHED_TIME sense), then this effectively becomes:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

 because it is now M1 that is fed by both the loop and the incoming edge.
 But if there is a second dependency:

   A --(T,M,0)-- C

 that also requires two moves, we end up with:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B
                                        --(T,M3,-1)-- B

 and dependence distances of -1 feel a bit weird. :-)  Of course,
 what we really have are two parallel dependencies:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

   A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B

 where M1' and M2' occupy the same position as M1 and M2 in the schedule,
 but are one stage further along.  But we only schedule them once,
 so if we take the cycle/SCHED_TIME route, we have to introduce
 dependencies of distance -1.


Interesting; had to digest this distance 1 business, a result of
thinking in cycles instead of rows (or conversely), and mixing
dependences with scheduling; here's my understanding, based on your
explanations:

Suppose a Use is truely dependent on a Def, where both have been
scheduled at some absolute cycles; think of them as timing the first
iteration of the loop.
Assume first that Use appears originally after Def in the original
instruction sequence of the loop (dependence distance 0). In this
case, Use requires register moves if its distance D from Def according
to the schedule is more than ii cycles long -- by the time Use is
executed, the value it needs is no longer available in the def'd
register due to intervening occurrences of Def. So in this case, the
first reg-move (among D/ii) should appear after Def, recording its
value before the next occurrence of Def overwrites it, and feeding
subsequent moves as needed before each is overwritten. Thus the
scheduling window of this first reg-move is within (Def, Def+ii).

Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
it remains 

Re: [4/4] Make SMS schedule register moves

2011-09-28 Thread Richard Sandiford
Ayal Zaks ayal.z...@gmail.com writes:
  +  /* The cyclic lifetime of move-new_reg starts and ends at move-def
  +     (the instruction that defines move-old_reg).
 
  So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
  next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
  overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
  to true dependence; i.e. account for latency also). Why do moves,
  except for the one closest to move-def (which is directly dependent
  upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry
  about move-def at all? (Or have their cyclic lifetimes start and end
  there?)

 Because the uses of new_reg belong to the same move-def based cycle.


 the cycle (overloaded term; rather iteration in this context) to
 which the uses belong, is inferred from the cycle (absolute schedule
 time) in which they are scheduled, regardless of move-def.

Just to prove your point about cycle being an overloaded term: I wasn't
actually meaning it in the sense of (loop) iteration.  I meant a circular
window based on move-def.

 So (I think this is the uncontroversial bit): [M1] must be scheduled
 cyclically before [B] and cyclically after [C], with the cycle
 based at [B]:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            empty
    row 2:            [A]
    row 3 before [B]: empty

 [M1] could therefore go in row 1.  This part is OK.


 Here's how I see it:
 [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
 before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
 of [B] which is scheduled at cycle 3, so must be scheduled after cycle
 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
 ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
 if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
 must_precede [B]. This is identical to the logic behind the
 sched_window of any instruction, based on its dependencies (as you've
 updated not too long ago..), if we do not allow reg_moves (and
 arguably, one should not allow reg_moves when scheduling
 reg_moves...).

 To address the potential erroneous scenario of Loop 2, suppose [A] is
 scheduled as in the beginning in cycle 20, and that [M1] is scheduled
 in cycle 7 (\in[4,9]). Then
 [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
 must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
 the result of [M1], so must be scheduled after cycle 7+1 and before
 cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.

I agree it's natural to schedule moves for intra-iteration dependencies
in the normal get_sched_window way.  But suppose we have a dependency:

   A --(T,N,1)-- B

that requires two moves M1 and M2.  If we think in terms of cycles
(in the SCHED_TIME sense), then this effectively becomes:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

because it is now M1 that is fed by both the loop and the incoming edge.
But if there is a second dependency:

   A --(T,M,0)-- C

that also requires two moves, we end up with:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B
--(T,M3,-1)-- B

and dependence distances of -1 feel a bit weird. :-)  Of course,
what we really have are two parallel dependencies:

   A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B

   A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B

where M1' and M2' occupy the same position as M1 and M2 in the schedule,
but are one stage further along.  But we only schedule them once,
so if we take the cycle/SCHED_TIME route, we have to introduce
dependencies of distance -1.

Another reason why it seemed a little confusing to think in terms of
cycles (in the SCHED_TIME sense) was that, by this stage, we were
already thinking in terms of rows and columns to some extent:

/* If dest precedes src in the schedule of the kernel, then dest
   will read before src writes and we can save one reg_copy.  */
if (SCHED_ROW (e-dest-cuid) == SCHED_ROW (e-src-cuid)
 SCHED_COLUMN (e-dest-cuid)  SCHED_COLUMN (e-src-cuid))
  nreg_moves4e--;

However...

 Also note that if moves are assigned absolute cycles, it becomes clear
 to which stage each belongs (just like any other instruction),
 regulating their generation in prolog and epilog.

...I have to concede that it makes the stage calculation easier,
and that that tips the balance.  (Although again, a move can belong
to two stages simultanouesly.)

Anyway, here's an updated patch that uses cycle times.  I've also
dropped the code that tried to allow windows to start and end on
the same row (and therefore schedule either side of that row).
I thought it might help with some VLIWy DFAs, but it was always
going to be of minor benefit, and we don't try that elsewhere,
so let's avoid the complication.


Re: [4/4] Make SMS schedule register moves

2011-09-22 Thread Ayal Zaks
 Richard Sandiford richard.sandif...@linaro.org wrote on 30/08/2011
 03:29:26 PM:

  From: Richard Sandiford richard.sandif...@linaro.org
  To: gcc-patches@gcc.gnu.org
  Cc: Ayal Zaks/Haifa/IBM@IBMIL
  Date: 30/08/2011 03:29 PM
  Subject: [4/4] Make SMS schedule register moves
 
  This is the move-scheduling patch itself.  It should be fairly
  self-explanatory.  Let me know if it isn't, and I'll try to improve
  the commentary.

 
  One potentially controversial change is to the way we handle moves
  in the prologue and epilogue.  The current code uses a conservative


 - conservativeaccurate? (Your approach seems to be more conservative)


  check to decide which stages need which moves.  This check copes
  with values that are live before the loop, and emits some moves
  that aren't actually needed.  The code also emits chains of moves
  that can be trivially optimised through propagation.  We rely on
  later patches to clean this up for us (and they do).
 
  So, rather than keep a rather complicated check here, I've simply
  emitted the moves for all stages.  In principle, that will generate
  a little more dead code than now in cases where chains of two moves
  are needed, but that seems to be very rare (when moves are scheduled).


indeed, it's better to simplify code generation and rely on subsequent
cleanups. Not sure about generating more (dead?) code in principle;
could you elaborate, for the record?


 
  Richard
 
 
  gcc/
     * modulo-sched.c (extend_node_sched_params): New function.
     (print_node_sched_params): Print the stage.
     (set_columns_for_row, schedule_reg_move): New functions.
     (set_columns_for_ps): Move up file and use set_columns_for_now.


typo: nowrow (unless you intend this to be temporary ;-)


     (schedule_reg_moves): Call extend_node_sched_params and
     schedule_reg_move.  Extend size of uses bitmap.  Return false
     if a move could not be scheduled.
     (apply_reg_moves): Don't emit moves here.
     (permute_partial_schedule): Handle register moves.
     (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.


Always emit all moves


     (generate_prolog_epilog): Update calls accordingly.
 
  Index: gcc/modulo-sched.c



As I'm going over the patch, let me make sure I understand what's going on:

..
 +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
 +   The source of the move is provided by I_MUST_FOLLOW, which has
 +   already been scheduled.

The source y of an x=y move is defined by the previous iteration of
the next move y=z (or by the original y=... move-def). We schedule
moves one after the other bottom-up, starting from the original
move-def moving backwards in cycles. This way, the instruction
I_MUST_FOLLOW defining y is always already scheduled when we come to
schedule the dependent I_REG_MOVE x=y move.

..
 +  /* The cyclic lifetime of move-new_reg starts and ends at move-def
 +     (the instruction that defines move-old_reg).

So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
to true dependence; i.e. account for latency also). Why do moves,
except for the one closest to move-def (which is directly dependent
upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry
about move-def at all? (Or have their cyclic lifetimes start and end
there?)

In general, there's a distinction between the cycle an instruction
is scheduled at (for the first iteration), which is an absolute
arbitrary integer, and the row it is placed in the PS, which is
between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
clearer to reason about cycles, especially if your window includes a
row twice.

In addition to the (true+anti) dependences involving I_REG_MOVE with
I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.

We could alternatively set these dependencies of I_REG_MOVE on
I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
presumably simpler to handle it directly here.

Right?
Ayal.


Re: [4/4] Make SMS schedule register moves

2011-09-22 Thread Richard Sandiford
Thanks as always for the review.

Ayal Zaks ayal.z...@gmail.com writes:
 Richard Sandiford richard.sandif...@linaro.org wrote on 30/08/2011
 03:29:26 PM:

  From: Richard Sandiford richard.sandif...@linaro.org
  To: gcc-patches@gcc.gnu.org
  Cc: Ayal Zaks/Haifa/IBM@IBMIL
  Date: 30/08/2011 03:29 PM
  Subject: [4/4] Make SMS schedule register moves
 
  This is the move-scheduling patch itself.  It should be fairly
  self-explanatory.  Let me know if it isn't, and I'll try to improve
  the commentary.

 
  One potentially controversial change is to the way we handle moves
  in the prologue and epilogue.  The current code uses a conservative


  - conservativeaccurate? (Your approach seems to be more conservative)

They're both conservative (and yeah, mine is more conservative :-)).
What I was trying to say is that the current code already generates
more moves than necessary.  If ddg node N is R = foo, the current code
generates enough moves to handle definitions of R from both the prologue
copies of N _and_ the incoming edge (i.e. the value defined outside the
loop, such as an accumulator being initialised to zero).  But the code
doesn't check whether this initial definition actually exists.  If it
doesn't -- because we're dealing with an intra-loop dependency --
we generate one more move than necessary.

If we wanted to be accurate, we'd need to check whether there's
an incoming value or not.

The current code is also conservative in that the epilogue can end up
with sequences like:

R1 = foo
R2 = R1 (from a move)
... no set of R1, because we've done with that stage...
R3 = ...R2...

where the last instruction could use R1 directly instead.
But none of this matters because later passes (including IRA)
fix it up nicely.

So this was all just a big excuse on my part along the lines of
we already rely on this stuff being cleaned up, and it is,
so let's keep things simple.

  check to decide which stages need which moves.  This check copes
  with values that are live before the loop, and emits some moves
  that aren't actually needed.  The code also emits chains of moves
  that can be trivially optimised through propagation.  We rely on
  later patches to clean this up for us (and they do).
 
  So, rather than keep a rather complicated check here, I've simply
  emitted the moves for all stages.  In principle, that will generate
  a little more dead code than now in cases where chains of two moves
  are needed, but that seems to be very rare (when moves are scheduled).

 indeed, it's better to simplify code generation and rely on subsequent
 cleanups. Not sure about generating more (dead?) code in principle;
 could you elaborate, for the record?

Sorry, I wasn't very clear, but what I meant was: the patch generates
every move regardless of the stage, so it will in principle generate more
moves to unused registers than we do now.  (As above, we already generate
some such moves.)  Thoese insns will be dead, and will be later removed
as dead.  But in practice, the number of extra instructions seems to be
very low, and is usually zero.

     * modulo-sched.c (extend_node_sched_params): New function.
     (print_node_sched_params): Print the stage.
     (set_columns_for_row, schedule_reg_move): New functions.
     (set_columns_for_ps): Move up file and use set_columns_for_now.


 typo: nowrow (unless you intend this to be temporary ;-)

Doh! :-)

     (schedule_reg_moves): Call extend_node_sched_params and
     schedule_reg_move.  Extend size of uses bitmap.  Return false
     if a move could not be scheduled.
     (apply_reg_moves): Don't emit moves here.
     (permute_partial_schedule): Handle register moves.
     (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.


 Always emit all moves

Oops, fixed.

 +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
 +   The source of the move is provided by I_MUST_FOLLOW, which has
 +   already been scheduled.

 The source y of an x=y move is defined by the previous iteration of
 the next move y=z (or by the original y=... move-def). We schedule
 moves one after the other bottom-up, starting from the original
 move-def moving backwards in cycles. This way, the instruction
 I_MUST_FOLLOW defining y is always already scheduled when we come to
 schedule the dependent I_REG_MOVE x=y move.

Right.

 +  /* The cyclic lifetime of move-new_reg starts and ends at move-def
 +     (the instruction that defines move-old_reg).

 So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
 next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
 overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
 to true dependence; i.e. account for latency also). Why do moves,
 except for the one closest to move-def (which is directly dependent
 upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry
 about move-def at all? (Or have their cyclic lifetimes start and end
 there?)

Because the uses of new_reg 

Re: [4/4] Make SMS schedule register moves

2011-08-30 Thread Richard Guenther
On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
richard.sandif...@linaro.org wrote:
 This is the move-scheduling patch itself.  It should be fairly
 self-explanatory.  Let me know if it isn't, and I'll try to improve
 the commentary.

Can you add some testcases?

Thanks,
Richard.

 One potentially controversial change is to the way we handle moves
 in the prologue and epilogue.  The current code uses a conservative
 check to decide which stages need which moves.  This check copes
 with values that are live before the loop, and emits some moves
 that aren't actually needed.  The code also emits chains of moves
 that can be trivially optimised through propagation.  We rely on
 later patches to clean this up for us (and they do).

 So, rather than keep a rather complicated check here, I've simply
 emitted the moves for all stages.  In principle, that will generate
 a little more dead code than now in cases where chains of two moves
 are needed, but that seems to be very rare (when moves are scheduled).

 Richard


 gcc/
        * modulo-sched.c (extend_node_sched_params): New function.
        (print_node_sched_params): Print the stage.
        (set_columns_for_row, schedule_reg_move): New functions.
        (set_columns_for_ps): Move up file and use set_columns_for_now.
        (schedule_reg_moves): Call extend_node_sched_params and
        schedule_reg_move.  Extend size of uses bitmap.  Return false
        if a move could not be scheduled.
        (apply_reg_moves): Don't emit moves here.
        (permute_partial_schedule): Handle register moves.
        (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
        (generate_prolog_epilog): Update calls accordingly.

 Index: gcc/modulo-sched.c
 ===
 --- gcc/modulo-sched.c  2011-08-30 13:06:36.528669762 +0100
 +++ gcc/modulo-sched.c  2011-08-30 13:22:08.029124537 +0100
 @@ -220,8 +220,6 @@ static partial_schedule_ptr sms_schedule
  static void permute_partial_schedule (partial_schedule_ptr, rtx);
  static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
 -static void duplicate_insns_of_cycles (partial_schedule_ptr,
 -                                      int, int, int, rtx);
  static int calculate_stage_count (partial_schedule_ptr, int);
  static void calculate_must_precede_follow (ddg_node_ptr, int, int,
                                           int, int, sbitmap, sbitmap, 
 sbitmap);
 @@ -458,6 +456,15 @@ set_node_sched_params (ddg_ptr g)
                         node_sched_param_vec, g-num_nodes);
  }

 +/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
 +static void
 +extend_node_sched_params (partial_schedule_ptr ps)
 +{
 +  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
 +                        ps-g-num_nodes + VEC_length (ps_reg_move_info,
 +                                                       ps-reg_moves));
 +}
 +
  static void
  print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
  {
 @@ -474,6 +481,7 @@ print_node_sched_params (FILE *file, int
               INSN_UID (ps_rtl_insn (ps, i)));
       fprintf (file,  asap = %d:\n, NODE_ASAP (ps-g-nodes[i]));
       fprintf (file,  time = %d:\n, nsp-time);
 +      fprintf (file,  stage = %d:\n, nsp-stage);
       fprintf (file,  nreg_moves = %d:\n, nsp-nreg_moves);
       for (j = 0; j  nsp-nreg_moves; j++)
        {
 @@ -485,6 +493,168 @@ print_node_sched_params (FILE *file, int
     }
  }

 +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
 +static void
 +set_columns_for_row (partial_schedule_ptr ps, int row)
 +{
 +  ps_insn_ptr cur_insn;
 +  int column;
 +
 +  column = 0;
 +  for (cur_insn = ps-rows[row]; cur_insn; cur_insn = cur_insn-next_in_row)
 +    SCHED_COLUMN (cur_insn-id) = column++;
 +}
 +
 +/* Set SCHED_COLUMN for each instruction in PS.  */
 +static void
 +set_columns_for_ps (partial_schedule_ptr ps)
 +{
 +  int row;
 +
 +  for (row = 0; row  ps-ii; row++)
 +    set_columns_for_row (ps, row);
 +}
 +
 +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
 +   The source of the move is provided by I_MUST_FOLLOW, which has
 +   already been scheduled.  MUST_FOLLOW is a scratch bitmap that
 +   is big enough to hold I_MUST_FOLLOW.
 +
 +   Return true on success.  */
 +static bool
 +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
 +                  sbitmap must_follow, int i_must_follow)
 +{
 +  unsigned int u;
 +  int i, start_row, end_row, this_start, this_end, this_row, latency;
 +  int origin_row, origin_column;
 +  sbitmap_iterator sbi;
 +  ps_reg_move_info *move;
 +
 +  move = ps_reg_move (ps, i_reg_move);
 +
 +  /* The cyclic lifetime of move-new_reg starts and ends at move-def
 +     (the instruction that defines move-old_reg).  We need to decide
 +     where in that cyclic lifetime the move should go.  The position
 

Re: [4/4] Make SMS schedule register moves

2011-08-30 Thread Richard Sandiford
Richard Guenther richard.guent...@gmail.com writes:
 On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
 richard.sandif...@linaro.org wrote:
 This is the move-scheduling patch itself.  It should be fairly
 self-explanatory.  Let me know if it isn't, and I'll try to improve
 the commentary.

 Can you add some testcases?

I don't think it's an easy thing to test for.  E.g. with the resampling
loop in the covering message, there might well be a reasonable ii=27
schedule that doesn't need as many moves.

Richard


Re: [4/4] Make SMS schedule register moves

2011-08-30 Thread Richard Guenther
On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford
richard.sandif...@linaro.org wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
 richard.sandif...@linaro.org wrote:
 This is the move-scheduling patch itself.  It should be fairly
 self-explanatory.  Let me know if it isn't, and I'll try to improve
 the commentary.

 Can you add some testcases?

 I don't think it's an easy thing to test for.  E.g. with the resampling
 loop in the covering message, there might well be a reasonable ii=27
 schedule that doesn't need as many moves.

So, does it only improve code generation or does it expose more
loops as SMS candidates?  If the former, ok ...

Richard.

 Richard



Re: [4/4] Make SMS schedule register moves

2011-08-30 Thread Richard Sandiford
Richard Guenther richard.guent...@gmail.com writes:
 On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford
 richard.sandif...@linaro.org wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
 richard.sandif...@linaro.org wrote:
 This is the move-scheduling patch itself.  It should be fairly
 self-explanatory.  Let me know if it isn't, and I'll try to improve
 the commentary.

 Can you add some testcases?

 I don't think it's an easy thing to test for.  E.g. with the resampling
 loop in the covering message, there might well be a reasonable ii=27
 schedule that doesn't need as many moves.

 So, does it only improve code generation or does it expose more
 loops as SMS candidates?  If the former, ok ...

Yeah, it just improves code generation.  The code I'm adding only kicks
in once we've found a successful SMS schedule.  At the moment, once
we've found that kind of schedule, we always emit it, regardless of how
many moves it needs.

The new code instead allows the schedule to be rejected.  So from that
point of view, it reduces the number of SMS candidates (but hopefully
only in cases where we would otherwise generate worse code).

Richard