Re: [4/4] Make SMS schedule register moves
Ayal Zaks ayal.z...@gmail.com writes: The issue of assigning stages to reg-moves is mostly relevant for prolog and epilog generation, which requires and receives special attention -- handled very nicely by ps_num_consecutive_stages! Note that currently a simple boolean indicator for (the exceptional case of) double stages would suffice, instead of generalizing to arbitrary nums of consecutive stages (see any potential use for them?). Not in the immediate term. But I think having a boolean indicator would be inconsistent. If the distance field is an int (even though we only expect distance-0 and distance-1 register dependencies) then I think the number of stages should be too. I did wonder originally about using a boolean, but IMO, it makes the code less readable rather than more. Instead of a simple range check like: if (first_stage_for_insn = last_stage_in_range last_stage_for_insn = first_stage_in_range) we end up with the equivalent of: if (first_stage_for_insn = last_stage_in_range (double_stage_move_p (...) ? first_stage_for_insn + 1 = first_stage_in_range : first_stage_for_insn = first_stage_in_range)) with no corresponding simplification elsewhere. Sure. But setting the range can be done by consulting an simple indicator, rather than generalizing to arbitrary stage numbers; e.g.: +ps_num_consecutive_stages (partial_schedule_ptr ps, int id) +{ + if (id = ps-g-num_nodes ps_reg_move (ps, id)-double_stages) +return 2; + else +return 1; +} or - last_u = first_u + ps_num_consecutive_stages (ps, u) - 1; + if (...double_stages) last_u = first_u + 1; + else last_u = first_u; Understood. I still prefer the posted version though. E.g. adding something like this at the end: ??? The algorithm restricts the scheduling window to II cycles. In rare cases, it may be better to allow windows of II+1 cycles. The window would then start and end on the same row, but with different must precede and must follow requirements. Let me know what you think and I'll add it as a follow-on patch. great, thanks. OK, added with the patch below. + + The move is part of a chain that satisfies register dependencies + between a producing ddg node and various consuming ddg nodes. + If some of these dependencies cross a loop iteration (that is, + have a distance of 1) then DISTANCE1_USES is nonnull and contains + the set of uses with distance-1 dependencies. DISTANCE1_USES + is null otherwise. + Maybe clarify that they are upwards-exposed or live-in uses. OK, changed to: The move is part of a chain that satisfies register dependencies between a producing ddg node and various consuming ddg nodes. If some of these dependencies have a distance of 1 (meaning that the use is upward-exposoed) then DISTANCE1_USES is nonnull and exposed (typo) Oops, also fixed below (and applied). Richard gcc/ * modulo-sched.c: Fix comment typo. Mention the possibility of using scheduling windows of II+1 cycles. Index: gcc/modulo-sched.c === --- gcc/modulo-sched.c 2011-10-10 12:42:41.0 +0100 +++ gcc/modulo-sched.c 2011-10-11 09:07:08.069166743 +0100 @@ -545,7 +545,7 @@ set_columns_for_ps (partial_schedule_ptr The move is part of a chain that satisfies register dependencies between a producing ddg node and various consuming ddg nodes. If some of these dependencies have a distance of 1 (meaning that - the use is upward-exposoed) then DISTANCE1_USES is nonnull and + the use is upward-exposed) then DISTANCE1_USES is nonnull and contains the set of uses with distance-1 dependencies. DISTANCE1_USES is null otherwise. @@ -1810,7 +1810,11 @@ sms_schedule (void) 41. endif 42. compute epilogue prologue 43. finish - succeeded to schedule -*/ + + ??? The algorithm restricts the scheduling window to II cycles. + In rare cases, it may be better to allow windows of II+1 cycles. + The window would then start and end on the same row, but with + different must precede and must follow requirements. */ /* A limit on the number of cycles that resource conflicts can span. ??? Should be provided by DFA, and be dependent on the type of insn scheduled. Currently
Re: [4/4] Make SMS schedule register moves
Ayal Zaks ayal.z...@gmail.com writes: I agree it's natural to schedule moves for intra-iteration dependencies in the normal get_sched_window way. But suppose we have a dependency: A --(T,N,1)-- B that requires two moves M1 and M2. If we think in terms of cycles (in the SCHED_TIME sense), then this effectively becomes: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B because it is now M1 that is fed by both the loop and the incoming edge. But if there is a second dependency: A --(T,M,0)-- C that also requires two moves, we end up with: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B --(T,M3,-1)-- B and dependence distances of -1 feel a bit weird. :-) Of course, what we really have are two parallel dependencies: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B where M1' and M2' occupy the same position as M1 and M2 in the schedule, but are one stage further along. But we only schedule them once, so if we take the cycle/SCHED_TIME route, we have to introduce dependencies of distance -1. Interesting; had to digest this distance 1 business, a result of thinking in cycles instead of rows (or conversely), and mixing dependences with scheduling; here's my understanding, based on your explanations: Suppose a Use is truely dependent on a Def, where both have been scheduled at some absolute cycles; think of them as timing the first iteration of the loop. Assume first that Use appears originally after Def in the original instruction sequence of the loop (dependence distance 0). In this case, Use requires register moves if its distance D from Def according to the schedule is more than ii cycles long -- by the time Use is executed, the value it needs is no longer available in the def'd register due to intervening occurrences of Def. So in this case, the first reg-move (among D/ii) should appear after Def, recording its value before the next occurrence of Def overwrites it, and feeding subsequent moves as needed before each is overwritten. Thus the scheduling window of this first reg-move is within (Def, Def+ii). Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if it remains scheduled before the Def, no reg-move is needed. If, otoh, Def is scheduled (D0 cycles) before Use, breaking the anti-dependence between them, (D/ii + 1) reg-moves are needed in order to feed Use with the live-in value before Def. So in this case, the first reg-move should appear before Def (again feeding subsequent moves as needed before each is overwritten). Thus the scheduling window of this first reg-move is within (Def-ii, Def). In any case, each use requires a specific number of reg-moves, which begin either before or after the def; it is always safe (though potentially redundant) to start reg-moving before the def -- uses that do not need the live-in value will ignore it by feeding from a later reg-move. Right. The distance on the Def-Use dependency is effectively transferred to the dependency between the Def and first move. And we can potentially have both kinds of Use at the same time. We then end up scheduling two moves, one in each window, but require them to occupy the same row and column. It feels more convenient to schedule the earlier of the two moves. Q: if we generate reg-moves starting from before the def, would redundant reg-moves be eliminated if no use needs access to live-in value? If so, would that simplify their generation? (If not, it may be interesting to understand how to improve DCE to catch it.) To be clear, the new version of the patch (unlike the old one) doesn't emit reg-moves before the def if all uses are distance 0. We only do it where some or all uses are distance 1. The first move before the def is always needed in that case. So redudant moves are confined to the case where (a) we have more than one move, (b) we have both distance 0 and distance 1 uses, and (c) one of the two distance sets requires more moves than the other. If the distance 0 dependencies require more moves than the distance 1 dependencies, we will have a redudant move in the prologue. If the distance 1 dependencies require more moves than the distance 0 dependencies, we will have a redudant move in the epilogue. But I believe that this is already handled by dce. (For avoidance of doubt, the new code is more accurate than current trunk in this respect.) The issue of assigning stages to reg-moves is mostly relevant for prolog and epilog generation, which requires and receives special attention -- handled very nicely by ps_num_consecutive_stages! Note that currently a simple boolean indicator for (the exceptional case of) double stages would suffice, instead of generalizing to arbitrary nums of consecutive stages (see any potential use for them?). Not in the immediate term. But I think having a boolean indicator would be inconsistent. If
Re: [4/4] Make SMS schedule register moves
On Mon, Oct 10, 2011 at 1:57 PM, Richard Sandiford richard.sandif...@linaro.org wrote: Ayal Zaks ayal.z...@gmail.com writes: I agree it's natural to schedule moves for intra-iteration dependencies in the normal get_sched_window way. But suppose we have a dependency: A --(T,N,1)-- B that requires two moves M1 and M2. If we think in terms of cycles (in the SCHED_TIME sense), then this effectively becomes: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B because it is now M1 that is fed by both the loop and the incoming edge. But if there is a second dependency: A --(T,M,0)-- C that also requires two moves, we end up with: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B --(T,M3,-1)-- B and dependence distances of -1 feel a bit weird. :-) Of course, what we really have are two parallel dependencies: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B where M1' and M2' occupy the same position as M1 and M2 in the schedule, but are one stage further along. But we only schedule them once, so if we take the cycle/SCHED_TIME route, we have to introduce dependencies of distance -1. Interesting; had to digest this distance 1 business, a result of thinking in cycles instead of rows (or conversely), and mixing dependences with scheduling; here's my understanding, based on your explanations: Suppose a Use is truely dependent on a Def, where both have been scheduled at some absolute cycles; think of them as timing the first iteration of the loop. Assume first that Use appears originally after Def in the original instruction sequence of the loop (dependence distance 0). In this case, Use requires register moves if its distance D from Def according to the schedule is more than ii cycles long -- by the time Use is executed, the value it needs is no longer available in the def'd register due to intervening occurrences of Def. So in this case, the first reg-move (among D/ii) should appear after Def, recording its value before the next occurrence of Def overwrites it, and feeding subsequent moves as needed before each is overwritten. Thus the scheduling window of this first reg-move is within (Def, Def+ii). Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if it remains scheduled before the Def, no reg-move is needed. If, otoh, Def is scheduled (D0 cycles) before Use, breaking the anti-dependence between them, (D/ii + 1) reg-moves are needed in order to feed Use with the live-in value before Def. So in this case, the first reg-move should appear before Def (again feeding subsequent moves as needed before each is overwritten). Thus the scheduling window of this first reg-move is within (Def-ii, Def). In any case, each use requires a specific number of reg-moves, which begin either before or after the def; it is always safe (though potentially redundant) to start reg-moving before the def -- uses that do not need the live-in value will ignore it by feeding from a later reg-move. Right. The distance on the Def-Use dependency is effectively transferred to the dependency between the Def and first move. And we can potentially have both kinds of Use at the same time. We then end up scheduling two moves, one in each window, but require them to occupy the same row and column. It feels more convenient to schedule the earlier of the two moves. Q: if we generate reg-moves starting from before the def, would redundant reg-moves be eliminated if no use needs access to live-in value? If so, would that simplify their generation? (If not, it may be interesting to understand how to improve DCE to catch it.) To be clear, the new version of the patch (unlike the old one) doesn't emit reg-moves before the def if all uses are distance 0. We only do it where some or all uses are distance 1. The first move before the def is always needed in that case. Understood. The question was whether it would simplify things if we always emit reg-moves before the def. And rely on DCE to hopefully eliminate redundancies. So redudant moves are confined to the case where (a) we have more than one move, (b) we have both distance 0 and distance 1 uses, and (c) one of the two distance sets requires more moves than the other. If the distance 0 dependencies require more moves than the distance 1 dependencies, we will have a redudant move in the prologue. If the distance 1 dependencies require more moves than the distance 0 dependencies, we will have a redudant move in the epilogue. But I believe that this is already handled by dce. (For avoidance of doubt, the new code is more accurate than current trunk in this respect.) The issue of assigning stages to reg-moves is mostly relevant for prolog and epilog generation, which requires and receives special attention -- handled very nicely by ps_num_consecutive_stages! Note that currently a simple
Re: [4/4] Make SMS schedule register moves
On Wed, Sep 28, 2011 at 4:49 PM, Richard Sandiford richard.sandif...@linaro.org wrote: Ayal Zaks ayal.z...@gmail.com writes: + /* The cyclic lifetime of move-new_reg starts and ends at move-def + (the instruction that defines move-old_reg). So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the next I_MUST_FOLLOW move/original-def (due to anti-dependence: it overwrites reg), but after the previous instance of I_MUST_FOLLOW (due to true dependence; i.e. account for latency also). Why do moves, except for the one closest to move-def (which is directly dependent upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry about move-def at all? (Or have their cyclic lifetimes start and end there?) Because the uses of new_reg belong to the same move-def based cycle. the cycle (overloaded term; rather iteration in this context) to which the uses belong, is inferred from the cycle (absolute schedule time) in which they are scheduled, regardless of move-def. Just to prove your point about cycle being an overloaded term: I wasn't actually meaning it in the sense of (loop) iteration. I meant a circular window based on move-def. Point proven ;-) So (I think this is the uncontroversial bit): [M1] must be scheduled cyclically before [B] and cyclically after [C], with the cycle based at [B]: row 3 after [B]: empty row 4: [C] row 5: [D] row 0: empty row 1: empty row 2: [A] row 3 before [B]: empty [M1] could therefore go in row 1. This part is OK. Here's how I see it: [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result of [B] which is scheduled at cycle 3, so must be scheduled after cycle 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9]; if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it must_precede [B]. This is identical to the logic behind the sched_window of any instruction, based on its dependencies (as you've updated not too long ago..), if we do not allow reg_moves (and arguably, one should not allow reg_moves when scheduling reg_moves...). To address the potential erroneous scenario of Loop 2, suppose [A] is scheduled as in the beginning in cycle 20, and that [M1] is scheduled in cycle 7 (\in[4,9]). Then [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses the result of [M1], so must be scheduled after cycle 7+1 and before cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset. I agree it's natural to schedule moves for intra-iteration dependencies in the normal get_sched_window way. But suppose we have a dependency: A --(T,N,1)-- B that requires two moves M1 and M2. If we think in terms of cycles (in the SCHED_TIME sense), then this effectively becomes: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B because it is now M1 that is fed by both the loop and the incoming edge. But if there is a second dependency: A --(T,M,0)-- C that also requires two moves, we end up with: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B --(T,M3,-1)-- B and dependence distances of -1 feel a bit weird. :-) Of course, what we really have are two parallel dependencies: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B where M1' and M2' occupy the same position as M1 and M2 in the schedule, but are one stage further along. But we only schedule them once, so if we take the cycle/SCHED_TIME route, we have to introduce dependencies of distance -1. Interesting; had to digest this distance 1 business, a result of thinking in cycles instead of rows (or conversely), and mixing dependences with scheduling; here's my understanding, based on your explanations: Suppose a Use is truely dependent on a Def, where both have been scheduled at some absolute cycles; think of them as timing the first iteration of the loop. Assume first that Use appears originally after Def in the original instruction sequence of the loop (dependence distance 0). In this case, Use requires register moves if its distance D from Def according to the schedule is more than ii cycles long -- by the time Use is executed, the value it needs is no longer available in the def'd register due to intervening occurrences of Def. So in this case, the first reg-move (among D/ii) should appear after Def, recording its value before the next occurrence of Def overwrites it, and feeding subsequent moves as needed before each is overwritten. Thus the scheduling window of this first reg-move is within (Def, Def+ii). Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if it remains
Re: [4/4] Make SMS schedule register moves
Ayal Zaks ayal.z...@gmail.com writes: + /* The cyclic lifetime of move-new_reg starts and ends at move-def + (the instruction that defines move-old_reg). So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the next I_MUST_FOLLOW move/original-def (due to anti-dependence: it overwrites reg), but after the previous instance of I_MUST_FOLLOW (due to true dependence; i.e. account for latency also). Why do moves, except for the one closest to move-def (which is directly dependent upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry about move-def at all? (Or have their cyclic lifetimes start and end there?) Because the uses of new_reg belong to the same move-def based cycle. the cycle (overloaded term; rather iteration in this context) to which the uses belong, is inferred from the cycle (absolute schedule time) in which they are scheduled, regardless of move-def. Just to prove your point about cycle being an overloaded term: I wasn't actually meaning it in the sense of (loop) iteration. I meant a circular window based on move-def. So (I think this is the uncontroversial bit): [M1] must be scheduled cyclically before [B] and cyclically after [C], with the cycle based at [B]: row 3 after [B]: empty row 4: [C] row 5: [D] row 0: empty row 1: empty row 2: [A] row 3 before [B]: empty [M1] could therefore go in row 1. This part is OK. Here's how I see it: [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result of [B] which is scheduled at cycle 3, so must be scheduled after cycle 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9]; if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it must_precede [B]. This is identical to the logic behind the sched_window of any instruction, based on its dependencies (as you've updated not too long ago..), if we do not allow reg_moves (and arguably, one should not allow reg_moves when scheduling reg_moves...). To address the potential erroneous scenario of Loop 2, suppose [A] is scheduled as in the beginning in cycle 20, and that [M1] is scheduled in cycle 7 (\in[4,9]). Then [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses the result of [M1], so must be scheduled after cycle 7+1 and before cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset. I agree it's natural to schedule moves for intra-iteration dependencies in the normal get_sched_window way. But suppose we have a dependency: A --(T,N,1)-- B that requires two moves M1 and M2. If we think in terms of cycles (in the SCHED_TIME sense), then this effectively becomes: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B because it is now M1 that is fed by both the loop and the incoming edge. But if there is a second dependency: A --(T,M,0)-- C that also requires two moves, we end up with: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B --(T,M3,-1)-- B and dependence distances of -1 feel a bit weird. :-) Of course, what we really have are two parallel dependencies: A --(T,N1,1)-- M1 --(T,N2,0)-- M2 --(T,N3,0)-- B A --(T,M1,0)-- M1' --(T,M2,0)-- M2' --(T,N3,0)-- B where M1' and M2' occupy the same position as M1 and M2 in the schedule, but are one stage further along. But we only schedule them once, so if we take the cycle/SCHED_TIME route, we have to introduce dependencies of distance -1. Another reason why it seemed a little confusing to think in terms of cycles (in the SCHED_TIME sense) was that, by this stage, we were already thinking in terms of rows and columns to some extent: /* If dest precedes src in the schedule of the kernel, then dest will read before src writes and we can save one reg_copy. */ if (SCHED_ROW (e-dest-cuid) == SCHED_ROW (e-src-cuid) SCHED_COLUMN (e-dest-cuid) SCHED_COLUMN (e-src-cuid)) nreg_moves4e--; However... Also note that if moves are assigned absolute cycles, it becomes clear to which stage each belongs (just like any other instruction), regulating their generation in prolog and epilog. ...I have to concede that it makes the stage calculation easier, and that that tips the balance. (Although again, a move can belong to two stages simultanouesly.) Anyway, here's an updated patch that uses cycle times. I've also dropped the code that tried to allow windows to start and end on the same row (and therefore schedule either side of that row). I thought it might help with some VLIWy DFAs, but it was always going to be of minor benefit, and we don't try that elsewhere, so let's avoid the complication.
Re: [4/4] Make SMS schedule register moves
Richard Sandiford richard.sandif...@linaro.org wrote on 30/08/2011 03:29:26 PM: From: Richard Sandiford richard.sandif...@linaro.org To: gcc-patches@gcc.gnu.org Cc: Ayal Zaks/Haifa/IBM@IBMIL Date: 30/08/2011 03:29 PM Subject: [4/4] Make SMS schedule register moves This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. One potentially controversial change is to the way we handle moves in the prologue and epilogue. The current code uses a conservative - conservativeaccurate? (Your approach seems to be more conservative) check to decide which stages need which moves. This check copes with values that are live before the loop, and emits some moves that aren't actually needed. The code also emits chains of moves that can be trivially optimised through propagation. We rely on later patches to clean this up for us (and they do). So, rather than keep a rather complicated check here, I've simply emitted the moves for all stages. In principle, that will generate a little more dead code than now in cases where chains of two moves are needed, but that seems to be very rare (when moves are scheduled). indeed, it's better to simplify code generation and rely on subsequent cleanups. Not sure about generating more (dead?) code in principle; could you elaborate, for the record? Richard gcc/ * modulo-sched.c (extend_node_sched_params): New function. (print_node_sched_params): Print the stage. (set_columns_for_row, schedule_reg_move): New functions. (set_columns_for_ps): Move up file and use set_columns_for_now. typo: nowrow (unless you intend this to be temporary ;-) (schedule_reg_moves): Call extend_node_sched_params and schedule_reg_move. Extend size of uses bitmap. Return false if a move could not be scheduled. (apply_reg_moves): Don't emit moves here. (permute_partial_schedule): Handle register moves. (duplicate_insns_of_cycles): Remove for_prolog. Always emit moves. Always emit all moves (generate_prolog_epilog): Update calls accordingly. Index: gcc/modulo-sched.c As I'm going over the patch, let me make sure I understand what's going on: .. +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS. + The source of the move is provided by I_MUST_FOLLOW, which has + already been scheduled. The source y of an x=y move is defined by the previous iteration of the next move y=z (or by the original y=... move-def). We schedule moves one after the other bottom-up, starting from the original move-def moving backwards in cycles. This way, the instruction I_MUST_FOLLOW defining y is always already scheduled when we come to schedule the dependent I_REG_MOVE x=y move. .. + /* The cyclic lifetime of move-new_reg starts and ends at move-def + (the instruction that defines move-old_reg). So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the next I_MUST_FOLLOW move/original-def (due to anti-dependence: it overwrites reg), but after the previous instance of I_MUST_FOLLOW (due to true dependence; i.e. account for latency also). Why do moves, except for the one closest to move-def (which is directly dependent upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry about move-def at all? (Or have their cyclic lifetimes start and end there?) In general, there's a distinction between the cycle an instruction is scheduled at (for the first iteration), which is an absolute arbitrary integer, and the row it is placed in the PS, which is between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's clearer to reason about cycles, especially if your window includes a row twice. In addition to the (true+anti) dependences involving I_REG_MOVE with I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds. We could alternatively set these dependencies of I_REG_MOVE on I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's presumably simpler to handle it directly here. Right? Ayal.
Re: [4/4] Make SMS schedule register moves
Thanks as always for the review. Ayal Zaks ayal.z...@gmail.com writes: Richard Sandiford richard.sandif...@linaro.org wrote on 30/08/2011 03:29:26 PM: From: Richard Sandiford richard.sandif...@linaro.org To: gcc-patches@gcc.gnu.org Cc: Ayal Zaks/Haifa/IBM@IBMIL Date: 30/08/2011 03:29 PM Subject: [4/4] Make SMS schedule register moves This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. One potentially controversial change is to the way we handle moves in the prologue and epilogue. The current code uses a conservative - conservativeaccurate? (Your approach seems to be more conservative) They're both conservative (and yeah, mine is more conservative :-)). What I was trying to say is that the current code already generates more moves than necessary. If ddg node N is R = foo, the current code generates enough moves to handle definitions of R from both the prologue copies of N _and_ the incoming edge (i.e. the value defined outside the loop, such as an accumulator being initialised to zero). But the code doesn't check whether this initial definition actually exists. If it doesn't -- because we're dealing with an intra-loop dependency -- we generate one more move than necessary. If we wanted to be accurate, we'd need to check whether there's an incoming value or not. The current code is also conservative in that the epilogue can end up with sequences like: R1 = foo R2 = R1 (from a move) ... no set of R1, because we've done with that stage... R3 = ...R2... where the last instruction could use R1 directly instead. But none of this matters because later passes (including IRA) fix it up nicely. So this was all just a big excuse on my part along the lines of we already rely on this stuff being cleaned up, and it is, so let's keep things simple. check to decide which stages need which moves. This check copes with values that are live before the loop, and emits some moves that aren't actually needed. The code also emits chains of moves that can be trivially optimised through propagation. We rely on later patches to clean this up for us (and they do). So, rather than keep a rather complicated check here, I've simply emitted the moves for all stages. In principle, that will generate a little more dead code than now in cases where chains of two moves are needed, but that seems to be very rare (when moves are scheduled). indeed, it's better to simplify code generation and rely on subsequent cleanups. Not sure about generating more (dead?) code in principle; could you elaborate, for the record? Sorry, I wasn't very clear, but what I meant was: the patch generates every move regardless of the stage, so it will in principle generate more moves to unused registers than we do now. (As above, we already generate some such moves.) Thoese insns will be dead, and will be later removed as dead. But in practice, the number of extra instructions seems to be very low, and is usually zero. * modulo-sched.c (extend_node_sched_params): New function. (print_node_sched_params): Print the stage. (set_columns_for_row, schedule_reg_move): New functions. (set_columns_for_ps): Move up file and use set_columns_for_now. typo: nowrow (unless you intend this to be temporary ;-) Doh! :-) (schedule_reg_moves): Call extend_node_sched_params and schedule_reg_move. Extend size of uses bitmap. Return false if a move could not be scheduled. (apply_reg_moves): Don't emit moves here. (permute_partial_schedule): Handle register moves. (duplicate_insns_of_cycles): Remove for_prolog. Always emit moves. Always emit all moves Oops, fixed. +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS. + The source of the move is provided by I_MUST_FOLLOW, which has + already been scheduled. The source y of an x=y move is defined by the previous iteration of the next move y=z (or by the original y=... move-def). We schedule moves one after the other bottom-up, starting from the original move-def moving backwards in cycles. This way, the instruction I_MUST_FOLLOW defining y is always already scheduled when we come to schedule the dependent I_REG_MOVE x=y move. Right. + /* The cyclic lifetime of move-new_reg starts and ends at move-def + (the instruction that defines move-old_reg). So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the next I_MUST_FOLLOW move/original-def (due to anti-dependence: it overwrites reg), but after the previous instance of I_MUST_FOLLOW (due to true dependence; i.e. account for latency also). Why do moves, except for the one closest to move-def (which is directly dependent upon it, i.e. for whom move-def == I_MUST_FOLLOW), have to worry about move-def at all? (Or have their cyclic lifetimes start and end there?) Because the uses of new_reg
Re: [4/4] Make SMS schedule register moves
On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford richard.sandif...@linaro.org wrote: This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. Can you add some testcases? Thanks, Richard. One potentially controversial change is to the way we handle moves in the prologue and epilogue. The current code uses a conservative check to decide which stages need which moves. This check copes with values that are live before the loop, and emits some moves that aren't actually needed. The code also emits chains of moves that can be trivially optimised through propagation. We rely on later patches to clean this up for us (and they do). So, rather than keep a rather complicated check here, I've simply emitted the moves for all stages. In principle, that will generate a little more dead code than now in cases where chains of two moves are needed, but that seems to be very rare (when moves are scheduled). Richard gcc/ * modulo-sched.c (extend_node_sched_params): New function. (print_node_sched_params): Print the stage. (set_columns_for_row, schedule_reg_move): New functions. (set_columns_for_ps): Move up file and use set_columns_for_now. (schedule_reg_moves): Call extend_node_sched_params and schedule_reg_move. Extend size of uses bitmap. Return false if a move could not be scheduled. (apply_reg_moves): Don't emit moves here. (permute_partial_schedule): Handle register moves. (duplicate_insns_of_cycles): Remove for_prolog. Always emit moves. (generate_prolog_epilog): Update calls accordingly. Index: gcc/modulo-sched.c === --- gcc/modulo-sched.c 2011-08-30 13:06:36.528669762 +0100 +++ gcc/modulo-sched.c 2011-08-30 13:22:08.029124537 +0100 @@ -220,8 +220,6 @@ static partial_schedule_ptr sms_schedule static void permute_partial_schedule (partial_schedule_ptr, rtx); static void generate_prolog_epilog (partial_schedule_ptr, struct loop *, rtx, rtx); -static void duplicate_insns_of_cycles (partial_schedule_ptr, - int, int, int, rtx); static int calculate_stage_count (partial_schedule_ptr, int); static void calculate_must_precede_follow (ddg_node_ptr, int, int, int, int, sbitmap, sbitmap, sbitmap); @@ -458,6 +456,15 @@ set_node_sched_params (ddg_ptr g) node_sched_param_vec, g-num_nodes); } +/* Make sure that node_sched_param_vec has an entry for every move in PS. */ +static void +extend_node_sched_params (partial_schedule_ptr ps) +{ + VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec, + ps-g-num_nodes + VEC_length (ps_reg_move_info, + ps-reg_moves)); +} + static void print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps) { @@ -474,6 +481,7 @@ print_node_sched_params (FILE *file, int INSN_UID (ps_rtl_insn (ps, i))); fprintf (file, asap = %d:\n, NODE_ASAP (ps-g-nodes[i])); fprintf (file, time = %d:\n, nsp-time); + fprintf (file, stage = %d:\n, nsp-stage); fprintf (file, nreg_moves = %d:\n, nsp-nreg_moves); for (j = 0; j nsp-nreg_moves; j++) { @@ -485,6 +493,168 @@ print_node_sched_params (FILE *file, int } } +/* Set SCHED_COLUMN for each instruction in row ROW of PS. */ +static void +set_columns_for_row (partial_schedule_ptr ps, int row) +{ + ps_insn_ptr cur_insn; + int column; + + column = 0; + for (cur_insn = ps-rows[row]; cur_insn; cur_insn = cur_insn-next_in_row) + SCHED_COLUMN (cur_insn-id) = column++; +} + +/* Set SCHED_COLUMN for each instruction in PS. */ +static void +set_columns_for_ps (partial_schedule_ptr ps) +{ + int row; + + for (row = 0; row ps-ii; row++) + set_columns_for_row (ps, row); +} + +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS. + The source of the move is provided by I_MUST_FOLLOW, which has + already been scheduled. MUST_FOLLOW is a scratch bitmap that + is big enough to hold I_MUST_FOLLOW. + + Return true on success. */ +static bool +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move, + sbitmap must_follow, int i_must_follow) +{ + unsigned int u; + int i, start_row, end_row, this_start, this_end, this_row, latency; + int origin_row, origin_column; + sbitmap_iterator sbi; + ps_reg_move_info *move; + + move = ps_reg_move (ps, i_reg_move); + + /* The cyclic lifetime of move-new_reg starts and ends at move-def + (the instruction that defines move-old_reg). We need to decide + where in that cyclic lifetime the move should go. The position
Re: [4/4] Make SMS schedule register moves
Richard Guenther richard.guent...@gmail.com writes: On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford richard.sandif...@linaro.org wrote: This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. Can you add some testcases? I don't think it's an easy thing to test for. E.g. with the resampling loop in the covering message, there might well be a reasonable ii=27 schedule that doesn't need as many moves. Richard
Re: [4/4] Make SMS schedule register moves
On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford richard.sandif...@linaro.org wrote: Richard Guenther richard.guent...@gmail.com writes: On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford richard.sandif...@linaro.org wrote: This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. Can you add some testcases? I don't think it's an easy thing to test for. E.g. with the resampling loop in the covering message, there might well be a reasonable ii=27 schedule that doesn't need as many moves. So, does it only improve code generation or does it expose more loops as SMS candidates? If the former, ok ... Richard. Richard
Re: [4/4] Make SMS schedule register moves
Richard Guenther richard.guent...@gmail.com writes: On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford richard.sandif...@linaro.org wrote: Richard Guenther richard.guent...@gmail.com writes: On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford richard.sandif...@linaro.org wrote: This is the move-scheduling patch itself. It should be fairly self-explanatory. Let me know if it isn't, and I'll try to improve the commentary. Can you add some testcases? I don't think it's an easy thing to test for. E.g. with the resampling loop in the covering message, there might well be a reasonable ii=27 schedule that doesn't need as many moves. So, does it only improve code generation or does it expose more loops as SMS candidates? If the former, ok ... Yeah, it just improves code generation. The code I'm adding only kicks in once we've found a successful SMS schedule. At the moment, once we've found that kind of schedule, we always emit it, regardless of how many moves it needs. The new code instead allows the schedule to be rejected. So from that point of view, it reduces the number of SMS candidates (but hopefully only in cases where we would otherwise generate worse code). Richard