Thanks as always for the review.

Ayal Zaks <ayal.z...@gmail.com> writes:
>> Richard Sandiford <richard.sandif...@linaro.org> wrote on 30/08/2011
>> 03:29:26 PM:
>>
>> > From: Richard Sandiford <richard.sandif...@linaro.org>
>> > To: gcc-patches@gcc.gnu.org
>> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
>> > Date: 30/08/2011 03:29 PM
>> > Subject: [4/4] Make SMS schedule register moves
>> >
>> > This is the move-scheduling patch itself. It should be fairly
>> > self-explanatory. Let me know if it isn't, and I'll try to improve
>> > the commentary.
>>
>> >
>> > One potentially controversial change is to the way we handle moves
>> > in the prologue and epilogue. The current code uses a conservative
>
>
> - conservative>>accurate? (Your approach seems to be more "conservative")

They're both conservative (and yeah, mine is more conservative :-)).
What I was trying to say is that the current code already generates
more moves than necessary. If ddg node N is "R = foo", the current
code generates enough moves to handle definitions of R from both the
prologue copies of N _and_ the incoming edge (i.e. the value defined
outside the loop, such as an accumulator being initialised to zero).
But the code doesn't check whether this initial definition actually
exists. If it doesn't -- because we're dealing with an intra-loop
dependency -- we generate one more move than necessary. If we wanted
to be accurate, we'd need to check whether there's an incoming value
or not.

The current code is also conservative in that the epilogue can end up
with sequences like:

    R1 = foo
    R2 = R1  (from a move)
    ...no set of R1, because we're done with that stage...
    R3 = ...R2...

where the last instruction could use R1 directly instead. But none of
this matters because later passes (including IRA) fix it up nicely.
So this was all just a big excuse on my part along the lines of "we
already rely on this stuff being cleaned up, and it is, so let's keep
things simple".

>> > check to decide which stages need which moves. This check copes
>> > with values that are live before the loop, and emits some moves
>> > that aren't actually needed. The code also emits chains of moves
>> > that can be trivially optimised through propagation. We rely on
>> > later patches to clean this up for us (and they do).
>> >
>> > So, rather than keep a rather complicated check here, I've simply
>> > emitted the moves for all stages. In principle, that will generate
>> > a little more dead code than now in cases where chains of two moves
>> > are needed, but that seems to be very rare (when moves are scheduled).
>
> indeed, it's better to simplify code generation and rely on subsequent
> cleanups. Not sure about generating more (dead?) code in principle;
> could you elaborate, for the record?

Sorry, I wasn't very clear, but what I meant was: the patch generates
every move regardless of the stage, so it will in principle generate
more moves to unused registers than we do now. (As above, we already
generate some such moves.) Those insns will be dead and will later be
removed as such. But in practice, the number of extra instructions
seems to be very low, and is usually zero.

>> > * modulo-sched.c (extend_node_sched_params): New function.
>> > (print_node_sched_params): Print the stage.
>> > (set_columns_for_row, schedule_reg_move): New functions.
>> > (set_columns_for_ps): Move up file and use set_columns_for_now.
>
>
> typo: now>>row (unless you intend this to be temporary ;-)

Doh! :-)

>> > (schedule_reg_moves): Call extend_node_sched_params and
>> > schedule_reg_move. Extend size of uses bitmap. Return false
>> > if a move could not be scheduled.
>> > (apply_reg_moves): Don't emit moves here.
>> > (permute_partial_schedule): Handle register moves.
>> > (duplicate_insns_of_cycles): Remove for_prolog. Always emit moves.
>
>
> Always emit all moves

Oops, fixed.

>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>> +   The source of the move is provided by I_MUST_FOLLOW, which has
>> +   already been scheduled.
>
> The source y of an x=y move is defined by the previous iteration of
> the next move y=z (or by the original y=... move->def). We schedule
> moves one after the other bottom-up, starting from the original
> move->def moving backwards in cycles.
> This way, the instruction I_MUST_FOLLOW defining y is always already
> scheduled when we come to schedule the dependent I_REG_MOVE x=y move.

Right.

>> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
>> +     (the instruction that defines move->old_reg).
>
> So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
> next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
> overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
> to true dependence; i.e. account for latency also). Why do moves,
> except for the one closest to move->def (which is directly dependent
> upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
> about move->def at all? (Or have their cyclic lifetimes start and end
> there?)

Because the uses of new_reg belong to the same move->def-based cycle.
E.g. say we have the following partial schedule, with [A]...[D]
denoting instructions:

    row 0: empty
    row 1: empty
    row 2: [A] use R_2
    row 3: [B] R_0 = ...
    row 4: [C] use R_1
    row 5: [D] use R_2

and with the following as-yet-unplaced moves:

    [M1] R_1 = R_0
    [M2] R_2 = R_1

The loop with scheduled moves must behave in the same way as it would
at the moment. That is, it must behave "as if" the loop were:

    LOOP 1:
      [A] use R_2
      [M2] R_2 = R_1
      [M1] R_1 = R_0
      [B] R_0 = ...
      [C] use R_1
      [D] use R_2

So (I think this is the uncontroversial bit): [M1] must be scheduled
"cyclically before" [B] and "cyclically after" [C], with the cycle
based at [B]:

    row 3 after [B]: empty
    row 4: [C]
    row 5: [D]
    row 0: empty
    row 1: empty
    row 2: [A]
    row 3 before [B]: empty

[M1] could therefore go in row 1. This part is OK.

[M2] must then come "cyclically before" [M1] and "cyclically after"
[A] and [D]. If we based that cycle on [M1], we would have:

    row 1 after [M1]: empty
    row 2: [A]
    row 3: [B]
    row 4: [C]
    row 5: [D]
    row 0: empty
    row 1 before [M1]: empty

So it would seem that [M2] could go in row 0. But that's incorrect:
it would lead to:

    LOOP 2:
      [M2] R_2 = R_1
      [M1] R_1 = R_0
      [A] use R_2
      [B] R_0 = ...
      [C] use R_1
      [D] use R_2

and [A] now gets the wrong value of R_2. In the first iteration of
loop 1, [A] uses the prologue-supplied value of R_2. In loop 2 it
uses the prologue-supplied value of R_1. I suppose it's effectively
moved up a stage.

If we keep the cycle at [B] when scheduling [M2], we have:

    row 3 after [B]: empty
    row 4: [C]
    row 5: [D]
    row 0: empty
    row 1: [M1]
    row 2: [A]
    row 3 before [B]: empty

We now have an empty scheduling window (successor [M1] comes before
predecessor [A]) and we reject the current ii.

> In general, there's a distinction between the "cycle" an instruction
> is scheduled at (for the first iteration), which is an absolute
> arbitrary integer, and the "row" it is placed in the PS, which is
> between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
> clearer to reason about cycles, especially if your window includes a
> row twice.
>
> In addition to the (true+anti) dependences involving I_REG_MOVE with
> I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.
>
> We could alternatively set these dependencies of I_REG_MOVE on
> I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
> presumably simpler to handle it directly here.
>
> Right?

Well, I think get_sched_window would need a fair bit of surgery to
handle both cases. And IMO it'd be a bit of a false abstraction.
In this case, we always want things to be scheduled close to the
successor (I_MUST_FOLLOW), even if there are several predecessors
(the uses).
But if there are more predecessors than successors, get_sched_window
would normally prefer to schedule closer to the predecessors. So I
think we'd still be treating them as two separate cases to some
extent.

As you say, doing it here is much simpler :-)

Richard
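
P.S. In case the cyclic-window argument above is easier to follow as
code, here's a rough sketch of the check. It's purely illustrative:
the names are made up rather than taken from modulo-sched.c, it works
on rows rather than cycles (which amounts to the same thing here,
since everything stays within one ii window), and it ignores
latencies, ordering within a row and resource conflicts. Positions
are measured as "cycles after move->def", so the window runs from
just after one copy of move->def to just before the next.

#include <stdio.h>

/* Cyclic distance from DEF_ROW forward to ROW, in the range 1..II,
   so that DEF_ROW itself maps to II (the far end of the window).  */
static int
forward_offset (int ii, int def_row, int row)
{
  int off = (row - def_row) % ii;
  if (off <= 0)
    off += ii;
  return off;
}

/* Number of cycles that lie strictly after every use of move->new_reg
   and strictly before MUST_FOLLOW_ROW, in the window based at DEF_ROW
   (the row of move->def).  Zero means the window is empty and the
   current ii has to be rejected.  */
static int
move_window_size (int ii, int def_row, int must_follow_row,
                  const int *use_rows, int n_uses)
{
  int after_uses = 0;
  int before_succ = forward_offset (ii, def_row, must_follow_row);
  int i, size;

  for (i = 0; i < n_uses; i++)
    {
      int off = forward_offset (ii, def_row, use_rows[i]);
      if (off > after_uses)
        after_uses = off;
    }

  size = before_succ - after_uses - 1;
  return size > 0 ? size : 0;
}

int
main (void)
{
  /* The partial schedule above: ii = 6, [B] (move->def) in row 3.  */

  /* [M1]: single use [C] in row 4, with [B] itself as the
     instruction that must follow.  */
  int m1_uses[] = { 4 };
  printf ("[M1]: %d candidate cycles\n",
          move_window_size (6, 3, 3, m1_uses, 1));

  /* [M2]: uses [A] and [D] in rows 2 and 5, with [M1] (placed in
     row 1) as the instruction that must follow.  */
  int m2_uses[] = { 2, 5 };
  printf ("[M2]: %d candidate cycles\n",
          move_window_size (6, 3, 1, m2_uses, 2));

  return 0;
}

With the numbers from the example (ii = 6, [B] in row 3, [M1] placed
in row 1), it reports four candidate cycles for [M1] but none for
[M2], which is the point at which we give up on the current ii.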